Methods in Bioengineering: Systems Analysis of Biological Networks
The Artech House Methods in Bioengineering Series
Series Editors-in-Chief: Martin L. Yarmush, M.D., Ph.D., and Robert S. Langer, Sc.D.
Methods in Bioengineering: Biomicrofabrication and Biomicrofluidics, Jeffrey D. Zahn and Luke P. Lee, editors
Methods in Bioengineering: Microdevices in Biology and Medicine, Yaakov Nahmias and Sangeeta N. Bhatia, editors
Methods in Bioengineering: Nanoscale Bioengineering and Nanomedicine, Kaushal Rege and Igor Medintz, editors
Methods in Bioengineering: Stem Cell Bioengineering, Biju Parekkadan and Martin L. Yarmush, editors
Methods in Bioengineering: Systems Analysis of Biological Networks, Arul Jayaraman and Juergen Hahn, editors
Methods in Bioengineering: Systems Analysis of Biological Networks

Arul Jayaraman, Department of Chemical Engineering, Texas A&M University
Juergen Hahn, Department of Chemical Engineering, Texas A&M University
Editors
artechhouse.com
Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the U.S. Library of Congress.
British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library.
ISBN-13: 978-1-59693-406-1
Cover design by Yekaterina Ratner
© 2009 Artech House. All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.
Contents

CHAPTER 1  Quantitative Immunofluorescence for Measuring Spatial Compartmentation of Covalently Modified Signaling Proteins
1.1 Introduction
1.2 Experimental Design
1.3 Materials
1.3.1 Cell culture
1.3.2 Buffers/reagents
1.3.3 Immunofluorescence reagents
1.4 Methods
1.4.1 Cell culture and stimulation for phospho-ERK measurements
1.4.2 Antibody labeling of phosphorylated ERK (ppERK)
1.4.3 Fluorescence microscopy imaging of ppERK and automated image analysis
1.5 Data Acquisition, Anticipated Results, and Interpretation
1.6 Statistical Guidelines
1.7 Discussion and Commentary
1.8 Application Notes
1.9 Summary Points
Acknowledgments
References

CHAPTER 2  Development of Green Fluorescent Protein-Based Reporter Cell Lines for Dynamic Profiling of Transcription Factor and Kinase Activation
2.1 Introduction
2.2 Materials
2.2.1 Cell and bacterial culture
2.2.2 Buffers and reagents
2.2.3 Cloning
2.2.4 Microscopy
2.3 Methods
2.3.1 3T3-L1 cell culture
2.3.2 Transcription factor reporter development
2.3.3 Kinase reporter development
2.4 Application Notes
2.4.1 Electroporation of TF reporter plasmids into 3T3-L1 preadipocytes
2.4.2 Monitoring activation of ERK in HepG2 cells
2.5 Data Acquisition, Anticipated Results, and Interpretation
2.6 Discussion and Commentary
2.7 Summary Points
Acknowledgments
References

CHAPTER 3  Comparison of Algorithms for Analyzing Fluorescent Microscopy Images and Computation of Transcription Factor Profiles
3.1 Introduction
3.2 Preliminaries
3.2.1 Principles of GFP reporter systems
3.2.2 Wavelets
3.2.3 K-means clustering
3.2.4 Principal component analysis
3.2.5 Mathematical description of digital images and image analysis
3.3 Methods
3.3.1 Image analysis based on wavelets and a bidirectional search
3.3.2 Image analysis based on K-means clustering and PCA
3.3.3 Determining fluorescence intensity of an image
3.3.4 Comparison of the two image analysis procedures
3.4 Data Acquisition, Anticipated Results, and Interpretation
3.4.1 Developing a model describing the relationship between the transcription factor concentration and the observed fluorescence intensity
3.4.2 Solution of an inverse problem for determining transcription factor concentrations
3.5 Application Notes
3.6 Summary and Conclusions
Acknowledgments
References

CHAPTER 4  Data-Driven, Mechanistic Modeling of Biochemical Reaction Networks
4.1 Introduction
4.2 Principles of Data-Driven Modeling
4.2.1 Types of experimental data
4.2.2 Data processing and normalization
4.2.3 Suitability of models used in conjunction with quantitative data
4.2.4 Issues related to parameter specification and estimation
4.3 Examples of Data-Driven Modeling
4.3.1 Example 1: Systematic analysis of crosstalk in the PDGF receptor signaling network
4.3.2 Example 2: Computational analysis of signal specificity in yeast
Acknowledgments
References

CHAPTER 5  Construction of Phenotype-Specific Gene Network by Synergy Analysis
5.1 Introduction
5.2 Experimental Design
5.3 Materials
5.3.1 Cell culture and reagents
5.3.2 Fatty acid salt treatment
5.4 Methods
5.4.1 Cytotoxicity measurement
5.4.2 Gene expression profiling
5.4.3 Metabolites measurements
5.4.4 Gene selection based on trends of metabolites
5.4.5 Calculation of the synergy scores of gene pairs
5.4.6 Permutation test to evaluate the significance of the synergy
5.4.7 Characterization of the network topology
5.5 Data Acquisition, Anticipated Results, and Interpretation
5.6 Discussion and Commentary
5.7 Application Notes
5.7.1 Topological characteristics of the synergy network
5.7.2 Hub genes in the network
5.8 Summary Points
Acknowledgments
References

CHAPTER 6  Genome-Scale Analysis of Metabolic Networks
6.1 Introduction
6.2 Materials and Methods
6.2.1 Flux analysis theory
6.2.2 Model development
6.2.3 Objective function
6.2.4 Optimization
6.3 Data Acquisition, Anticipated Results, and Interpretation
6.3.1 Feasible solution determined
6.3.2 No feasible solution determined
6.4 Discussion and Commentary
6.5 Summary Points
Acknowledgments
References

CHAPTER 7  Modeling the Dynamics of Cellular Networks
7.1 Introduction
7.2 Materials
7.2.1 Cell culture
7.2.2 Database
7.3 Methods
7.3.1 Network reconstruction
7.3.2 Network reduction
7.3.3 Kinetic modeling
7.3.4 Parameter estimation
7.4 Data Acquisition, Anticipated Results, and Interpretation
7.4.1 Model network
7.4.2 Dynamic simulation parameters
7.5 Discussion and Commentary
7.5.1 Modularity
7.5.2 Generalized kinetic expressions
7.5.3 Population heterogeneity
7.6 Application Notes
7.7 Summary Points
Acknowledgments
References

CHAPTER 8  Steady-State Sensitivity Analysis of Biochemical Reaction Networks: A Brief Review and New Methods
8.1 Introduction
8.2 Considered System Class and Parametric Sensitivity
8.2.1 Example system: reversible covalent modification
8.2.2 Parametric steady-state sensitivity
8.3 Linear Sensitivity Analysis
8.4 Sensitivity Analysis Via Empirical Gramians
8.4.1 Gramians and linear sensitivity analysis
8.4.2 Empirical Gramians for nonlinear systems
8.4.3 A new sensitivity measure based on Gramians
8.4.4 Example: covalent modification system
8.5 Sensitivity Analysis Via Infeasibility Certificates
8.5.1 Feasibility problem and semidefinite relaxation
8.5.2 Infeasibility certificates from the dual problem
8.5.3 Algorithm to bound feasible steady states
8.5.4 Example: covalent modification system
8.6 Discussion and Outlook
References

CHAPTER 9  Determining Metabolite Production Capabilities of Saccharomyces cerevisiae Using Dynamic Flux Balance Analysis
9.1 Introduction
9.2 Methods
9.2.1 Stoichiometric models of cellular metabolism
9.2.2 Classical flux balance analysis
9.2.3 Dynamic flux balance analysis
9.3 Results and Interpretation
9.3.1 Stoichiometric models of S. cerevisiae metabolism
9.3.2 Dynamic simulation of fed-batch cultures
9.3.3 Dynamic optimization of fed-batch cultures
9.3.4 Identification of ethanol overproduction mutants
9.3.5 Exploration of novel metabolic capabilities
9.4 Discussion and Commentary
9.5 Summary Points
Acknowledgments
References
Related Resources and Supplementary Electronic Information

CHAPTER 10  Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling
10.1 Introduction
10.1.1 Model structure
10.1.2 Parameter estimation
10.1.3 Identifiability metrics and conditions
10.1.4 Overview of the experimental design procedure
10.2 Methods
10.2.1 Initial perturbation and measurement design
10.2.2 Identifiability analysis
10.2.3 Impact analysis
10.2.4 Design modification and reduction
10.2.5 Design implementation
10.3 Data Acquisition, Anticipated Results, and Interpretation
10.3.1 Step 1: Initial perturbation and measurement design
10.3.2 Step 2: Identifiability analysis
10.3.3 Step 3: Impact analysis
10.3.4 Step 4: Design reduction
10.3.5 Step 5: Identifiability analysis
10.4 Application Notes
10.4.1 Step 1: Initial perturbation and measurement design
10.4.2 Step 2: Identifiability analysis
10.4.3 Steps 3 to 5: Impact analysis, design reduction, and identifiability analysis
10.5 Discussion and Commentary
10.6 Summary Points
Acknowledgments
References

CHAPTER 11  Parameter Identification with Adaptive Sparse Grid-Based Optimization for Models of Cellular Processes
11.1 Introduction
11.1.1 Adaptive sparse grid interpolation
11.2 Experimental Design
11.3 Materials
11.4 Methods
11.5 Data Acquisition, Anticipated Results, and Interpretation
11.5.1 Sorted grid points
11.5.2 Unique points
11.5.3 Unstable points
11.5.4 Interpretation and conclusions
11.6 Troubleshooting
11.6.1 Troubleshooting special cases: small and large problems
11.7 Discussion and Commentary
11.8 Application Notes
11.8.1 Comparison of adaptive sparse grid and GA-based optimization
11.8.2 Adaptive sparse grid-based optimization
11.8.3 Genetic algorithm
11.9 Summary Points
Acknowledgments
References
Related Sources and Supplementary Information

CHAPTER 12  Reverse Engineering of Biological Networks
12.1 Introduction: Biological Networks and Reverse Engineering
12.1.1 Biological networks
12.1.2 Network representation
12.1.3 Motivation and design principles
12.1.4 Reverse engineering
12.2 Material: Time Series and Omics Data
12.2.1 Metabolomics
12.2.2 Proteomics and protein interaction networks
12.2.3 Transcriptomics
12.3 Approaches for Inference of Biological Networks
12.3.1 Genome-scale metabolic modeling
12.3.2 Boolean networks
12.3.3 Network topology from correlation or hierarchical clustering
12.3.4 Bayesian networks
12.3.5 Ordinary differential equations
12.4 Network Biology—Exploring the Inferred Networks
12.4.1 Graph theory
12.4.2 Motifs and modules
12.4.3 Stoichiometric analysis
12.4.4 Simulation of dynamics, sensitivity analysis, control analysis
12.5 Discussion and Comparison of Approaches
12.6 Summary Points
Acknowledgments
References

CHAPTER 13  Transcriptome Analysis of Regulatory Networks
13.1 Introduction
13.2 Methods
13.2.1 Materials
13.2.2 Cell harvesting
13.2.3 RNA purification
13.2.4 Transcriptional profiling using DNA microarrays
13.3 Data Acquisition, Anticipated Results, and Interpretation
13.3.1 Acquisition of DNA microarray data
13.3.2 Normalization
13.3.3 Network Component Analysis (NCA)
13.4 Discussion and Commentary
13.5 Application Notes
13.6 Summary Points
References

CHAPTER 14  A Workflow from Time Series Gene Expression to Transcriptional Regulatory Networks
14.1 Introduction
14.2 Materials
14.3 Methods
14.3.1 Identification of differentially expressed genes
14.3.2 Robust clustering of differential gene expression time series data using computational negative control approach
14.3.3 Transcriptional regulatory network analysis using PAINT
14.4 Data Acquisition, Anticipated Results, and Interpretation
14.4.1 Selection of number of clusters
14.4.2 PAINT result interpretation for gene coexpression clusters
14.5 Discussion and Commentary
14.5.1 Estimation of nondifferentially expressed genes (pi.not value)
14.5.2 Threshold for local false discovery rate analysis
14.5.3 Format of gene identifiers
14.5.4 Cluster size issues
14.5.5 TRANSFAC version issues
14.5.6 Annotation redundancy in the gene list and multiple promoters
14.5.7 Reference Feasnet selection/generation
14.5.8 Multiple testing correction in PAINT
14.6 Application Notes
14.7 Summary Points
Acknowledgments
References

About the Editors
List of Contributors
Index
CHAPTER 1
Quantitative Immunofluorescence for Measuring Spatial Compartmentation of Covalently Modified Signaling Proteins

Jin-Hong Kim and Anand R. Asthagiri
Division of Engineering and Applied Science and Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA; e-mail: [email protected]
Abstract
Intracellular signaling pathways control cell behaviors and multicellular morphodynamics. A quantitative understanding of these pathways will provide design principles for tuning these signals in order to engineer cell behaviors and tissue morphology. The transmission of information in signaling pathways involves both site-specific covalent modifications and spatial localization of signaling proteins. Here, we describe an algorithm for quantifying the spatial localization of covalently modified signaling proteins from images acquired by immunofluorescence (IF) staining. As a case study, we apply the method to quantify the amount of dually phosphorylated extracellular-regulated kinase (ERK) in the nucleus. The algorithm presented here provides a general schematic that can be modified and applied more broadly to quantify the spatial compartmentation of other covalently modified signaling proteins.
Key terms: ERK; image analysis; segmentation; site-specific modification; spatial localization; watershed algorithm
1.1 Introduction
Signal transduction networks control all aspects of cell behavior, such as metabolism, proliferation, migration, and differentiation [1]. Thus, engineering cell behaviors will hinge on understanding and tuning information flow in these signaling pathways. Intracellular signals transmit information in at least two major ways. First, signaling proteins undergo covalent modifications that alter their intrinsic enzymatic activity and/or their interactions with binding partners. In addition to the connectivity of the signal transduction network, signaling proteins are localized spatially. Where a signal is located can influence its accessibility to upstream and downstream factors, and therefore can play a significant role in controlling information flux [2].

Green fluorescent protein (GFP) has provided a powerful way to track the localization of signaling proteins [3]. Variants of GFP spanning a wide range of spectral properties have opened the door to monitoring colocalization of signaling proteins. A key challenge, however, is that quantifying signal propagation must involve not only tracking protein localization, but also the covalent state of that signal.

Sensor platforms that track both spatial localization and covalent state/activity are emerging. Several involve fluorescence resonance energy transfer (FRET), a phenomenon wherein the close proximity of two complementary fluorophores allows one (the donor) to excite the other (the acceptor) [4]. The quenching of the donor and the excitation of the acceptor serve as a FRET signal. One general strategy has been to introduce a chimeric version of the signaling protein: both the acceptor and the donor are placed in the protein, whose folding into an active conformation changes the FRET signal. Examples include the Raichu sensors for the cdc42/Rac/Rho family of GTPases [5]. In another design, the fluorophores have been placed in chimeric pseudosubstrates for tyrosine kinases [6] and caspases [7]. When these signaling enzymes act on the substrate, the refolding or cleavage of the substrate changes the FRET signal. A third approach is to place one fluorophore on the signaling enzyme and the other fluorophore on a binding partner; when these are recruited to each other, a FRET signal ensues. Examples of this third approach include the Raichu-CRIB sensors for the Rho family of GTPases [8]. A major drawback of these tools, however, is that they are highly tailor-made and do not report on the remarkable diversity of covalent modifications that a single signaling protein undergoes. For example, the PDGF receptor is phosphorylated at multiple tyrosine residues, and each phosphorylation site enables its interaction with distinct downstream targets [9]. Such multisite covalent modifications are prevalent across signaling proteins.

New mathematical modeling frameworks are being developed to handle the huge number of states in which a single signaling protein may be found [10]. Proteomic approaches are being developed to quantify site-specific covalent modifications in cell extracts on a large scale [11]. While this approach allows large-scale, quantitative analysis of covalent modifications to signaling proteins, it does not gauge subcellular spatial information. Thus, complementary methods are needed to quantify spatial information on signaling proteins that have undergone site-specific covalent modifications. Classical immunofluorescence (IF) staining provides an excellent starting point.
In IF staining, antibodies are used to detect an antigen (e.g., signaling protein) in fixed cells [12]. These antibodies may be tagged with fluorophores, including quantum dots that have unique advantages over GFP. Furthermore, antibodies for site-specific covalent modifications are widely available commercially. A limiting factor, however, is that images acquired by
IF are primarily analyzed qualitatively. Here, we describe image analysis algorithms that may be used to quantify IF images in an automated manner. As a case study, we apply the algorithms to quantify the level of nuclear extracellular-regulated kinase (ERK) signaling.
1.2 Experimental Design
In this work, we developed and tested image analysis algorithms to quantify the spatial localization of phosphorylated signaling proteins. We focused on phosphorylated ERK, a signal that localizes to the nucleus and is required for cell proliferation [13]. We performed a dose-dependence assay to gauge how the localized signal responds to different amounts of stimuli. Such dose-response studies provide a well-defined approach to test whether our measurement methodology could discern quantitative changes in signaling. It is useful to conduct such experiments in systems that have been confirmed to trigger the signal of interest using other experimental assays. Therefore, we chose a stimulus, epidermal growth factor (EGF), that is well known to trigger ERK signaling [14, 15]. We used MCF-10A cells that respond to EGF by triggering ERK phosphorylation as confirmed by Western blotting [16].
1.3 Materials

1.3.1 Cell culture
1. 6-well plate (Corning).
2. Micro cover glass, 18 mm circle (VWR).
3. Dulbecco's modified Eagle's medium/Ham's F-12 containing HEPES and L-glutamine (Gibco).
4. Epidermal growth factor (Peprotech).
5. Hydrocortisone (Sigma-Aldrich).
6. Insulin (Sigma-Aldrich).
7. Cholera toxin (Sigma-Aldrich).
8. Bovine serum albumin (Sigma-Aldrich).
9. Trypsin-EDTA 0.05% (Gibco).
10. Penicillin/streptomycin (Gibco).
1.3.2 Buffers/reagents
1. Phosphate buffered saline (Gibco).
2. Paraformaldehyde (Sigma-Aldrich).
3. Tween-20 (Sigma-Aldrich).
4. Methanol (EMD).
5. Glycine (Sigma-Aldrich).
6. Triton X-100 (Sigma-Aldrich).
7. Goat serum (Gibco).
8. NP-40 (Sigma-Aldrich).
9. NaCl (Sigma-Aldrich).
10. Na2HPO4 (Sigma-Aldrich).
11. NaH2PO4 (Sigma-Aldrich).
12. NaN3 (Sigma-Aldrich).
13. PD98059 (Calbiochem).
14. Na3VO4 (Sigma-Aldrich).
15. NaF (Sigma-Aldrich).
16. β-glycerophosphate (Sigma-Aldrich).
1.3.3 Immunofluorescence reagents
1. Primary antibodies:
   i. Phospho-p44/42 MAPK (Thr202/Tyr204), polyclonal: #9101 (1:200) and monoclonal: #4377 (1:50) (Cell Signaling Technology, Inc.).
2. Secondary antibody:
   i. Alexa Fluor 488 (1:200) (Molecular Probes).
3. 4',6-diamidino-2-phenylindole (DAPI) (Sigma-Aldrich).
4. ProLong Gold antifade (Molecular Probes).
1.4 Methods

1.4.1 Cell culture and stimulation for phospho-ERK measurements
1. Culture MCF-10A cells in Dulbecco's modified Eagle's medium/Ham's F-12 containing HEPES and L-glutamine supplemented with 5% (v/v) horse serum, 20 ng/ml EGF, 0.5 μg/ml hydrocortisone, 0.1 μg/ml cholera toxin, 10 μg/ml insulin, and 1% penicillin/streptomycin.
2. Plate cells on sterilized glass cover glasses placed in the 6-well tissue culture plates at 1 × 10^5 cells per well and grow cells in growth medium for 24 hours to allow adhesion.
3. For G0 synchronization, wash cells twice with PBS and culture them for 24 hours in serum-free medium: DMEM/F-12 supplemented with 1% penicillin/streptomycin and 0.1% bovine serum albumin.
4. For EGF stimulation, reconstitute recombinant human EGF in sterile H2O at 100 μg/ml and dilute it in serum-free medium to the designated concentrations (a simple dilution calculation is sketched after this list).
5. Make sure the EGF-containing medium is warmed to 37°C. Then stimulate cells for 15 minutes by adding 2 ml of EGF-containing medium to each well. Cells incubated in the absence of EGF or treated with a pharmacological inhibitor of MEK (PD98059) can be used as a negative control, while cells treated with 10 ng/ml EGF can serve as a positive control.
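The dilution in step 4 is a standard C1·V1 = C2·V2 calculation. The sketch below is only an illustration of that arithmetic (the volumes and target concentration are example values, not part of the protocol); for very low doses such as 0.01 ng/ml, a serial dilution is more practical than pipetting sub-microliter volumes.

    % Illustrative C1*V1 = C2*V2 calculation for preparing EGF-containing medium.
    stock_ng_per_ml  = 100 * 1000;   % 100 ug/ml reconstituted EGF stock, in ng/ml
    target_ng_per_ml = 10;           % example target concentration (positive control)
    final_volume_ml  = 2;            % volume added per well in step 5

    stock_volume_ml = target_ng_per_ml * final_volume_ml / stock_ng_per_ml;
    fprintf('Add %.4f ml of stock to make %.1f ml of %g ng/ml medium\n', ...
            stock_volume_ml, final_volume_ml, target_ng_per_ml);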
1.4.2 Antibody labeling of phosphorylated ERK (ppERK)
1. After 15 minutes of EGF stimulation, place the 6-well plates on ice and wash cells twice with ice-cold PBS.
2. Fix cells in freshly prepared 2% paraformaldehyde (pH 7.4) for 20 minutes at room temperature in the presence of phosphatase inhibitors at the following concentrations: 1 mM sodium orthovanadate, 10 mM sodium fluoride, and 10 mM β-glycerophosphate. Rinse with a 0.1 mM solution of glycine in PBS three times.
3. Permeabilize cells in PBS containing 0.5% NP-40 and the phosphatase inhibitors for 10 minutes at 4°C with gentle rocking. Rinse with PBS three times.
4. Dehydrate cells in ice-cold pure methanol for 20 minutes at –20°C. Rinse with PBS three times.
5. Block with IF buffer (130 mM NaCl, 7 mM Na2HPO4, 3.5 mM NaH2PO4, 7.7 mM NaN3, 0.1% bovine serum albumin, 0.2% Triton X-100, 0.05% Tween-20, and 10% goat serum) for 1 hour at room temperature.
6. Incubate with anti-phospho-p44/42 MAPK antibody in IF buffer overnight at 4°C. Rinse three times with IF buffer at room temperature on the rocker for 20 minutes each. The washing step is essential to minimize background staining.
7. Sequentially incubate with Alexa dye-labeled secondary antibodies in IF buffer for 45 minutes at room temperature. Rinse three times with IF buffer at room temperature on the rocker for 20 minutes each. Make sure to protect samples from light.
8. Counterstain nuclei with 0.5 ng/ml DAPI for 15 minutes at room temperature and rinse with PBS twice with gentle rocking for 5 minutes each.
9. Mount with ProLong Gold antifade. Dry overnight in a place that protects samples from light.
1.4.3 Fluorescence microscopy imaging of ppERK and automated image analysis
1. Acquire fluorescence images using filters for DAPI and FITC. Start with a sample that is expected to give the highest FITC signal (e.g., the positive control, 10 ng/ml EGF). Using this positive control, empirically choose an exposure time so that the highest pixel intensity in a given field is close to the saturation level (generally 255). Be sure that the chosen exposure time does not saturate the FITC signal in other fields of the positive control sample. These steps identify an exposure time that maximizes the dynamic range of ppERK signals that may be quantified. The exposure time determined in this way should then be fixed and used to capture images from all other samples.
2. Segment DAPI (nuclei) images using a combination of edge detection and watershed algorithms. The algorithm to process a single image is written in MATLAB (MathWorks) as described below (steps 2, i–v). This algorithm can be iterated to process multiple images in a single execution (see the consolidated sketch after step 5).
   i. Import a DAPI image using the imread function.
      DAPI = imread('DAPI image.tif')
   ii. The edge function detects the edges of objects using gradients in pixel intensity and returns a binary image in which the object edges are traced. Different methods are available in the edge function; the 'sobel' and 'canny' methods were successfully used in this study.
      [edgeDAPI, thresh] = edge(DAPI, 'sobel')
      Optionally, the imdilate and imerode functions can be used together to enhance the results of edge detection.
   iii. Fill in the inside of the traced nuclei using the imfill function.
      edgefillDAPI = imfill(edgeDAPI, 'holes')
   iv. The edge detection method often cannot distinguish cells that are spaced too closely. The watershed algorithm can be used along with the distance transform to separate merged nuclei. Use of the bwdist and watershed functions will generate an image with lines that separate touching cells. Optionally, the imhmin function can be used to prevent over-segmentation, which is a known problem of the watershed algorithm in some cases. Finally, convert the obtained image into a binary image to match the class type.
      distDAPI = -bwdist(~edgefillDAPI)
      distDAPI2 = imhmin(distDAPI, 1)
      ridgeDAPI = watershed(distDAPI2)
      ridgeDAPI2 = im2bw(ridgeDAPI)
   v. Merge the two images generated by the edge detection and watershed algorithms to create a single nuclear compartment image.
      segmentedDAPI = edgefillDAPI & ridgeDAPI2
3. Additionally, apply size thresholds to the images to exclude noncellular objects. The distribution of nucleus size can be approximated as a normal distribution; thus, use three standard deviations above and below the mean area of the nuclei as the upper and lower cutoff values.
4. Using the FITC images, calculate the average fluorescence level of the noncell areas on a per-pixel basis to account for the background level of each image.
5. Using the segmented image (nuclear mask) and the FITC image together, calculate the area of each individual nucleus and sum the FITC values over this area. Finally, the phospho-protein intensity for each cell can be calculated by multiplying the average background level by the area of the nucleus and subtracting this value from the total FITC signal in the nucleus:

   ppERK = Σ_nucleus FITC − Background × AR_nucleus

where the sum runs over all pixels of the nucleus and AR_nucleus is the area of that nucleus.
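For reference, the commands in steps 2 through 5 can be collected into one script and looped over a folder of images. The sketch below is a minimal consolidation under stated assumptions: the DAPI and FITC images are single-channel TIFFs stored in a local folder with matching file names (the folder name and the "_DAPI"/"_FITC" naming pattern are placeholders, not part of the protocol), the Image Processing Toolbox is available, and the background is estimated from all pixels outside the nuclear mask, which only approximates the "noncell areas" of step 4.

    % Minimal consolidation of steps 2-5 (assumptions noted above).
    dapiFiles = dir(fullfile('images', '*_DAPI.tif'));       % placeholder folder/file pattern
    for k = 1:numel(dapiFiles)
        % --- Step 2: nuclear segmentation ---
        DAPI = imread(fullfile('images', dapiFiles(k).name));
        edgeDAPI      = edge(DAPI, 'sobel');                  % (ii) edge detection
        edgefillDAPI  = imfill(edgeDAPI, 'holes');            % (iii) fill traced nuclei
        distDAPI      = -bwdist(~edgefillDAPI);               % (iv) distance transform
        ridgeDAPI     = watershed(imhmin(distDAPI, 1));       %      watershed ridge lines
        segmentedDAPI = edgefillDAPI & im2bw(ridgeDAPI);      % (v) merged nuclear mask

        % --- Step 3: size threshold (mean +/- 3 SD of nuclear area) ---
        L     = bwlabel(segmentedDAPI);
        stats = regionprops(L, 'Area');
        areas = [stats.Area];
        keep  = find(areas > mean(areas) - 3*std(areas) & ...
                     areas < mean(areas) + 3*std(areas));

        % --- Steps 4-5: background estimate and per-nucleus ppERK ---
        fitcName = strrep(dapiFiles(k).name, '_DAPI', '_FITC');   % matching FITC image
        FITC = double(imread(fullfile('images', fitcName)));
        background = mean(FITC(L == 0));                      % per-pixel background (approximation)
        ppERK = zeros(numel(keep), 1);
        for i = 1:numel(keep)
            pix = (L == keep(i));                             % pixels of this nucleus
            ppERK(i) = sum(FITC(pix)) - background * sum(pix(:));   % background-corrected total
        end
        % ppERK now holds background-corrected nuclear intensities for image k
    end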
1.5 Data Acquisition, Anticipated Results, and Interpretation
We quantified the level of ppERK in the nucleus of MCF-10A cells that were stimulated with 0.01 or 10 ng/ml EGF or left untreated for 15 minutes. At a qualitative level, the dose-dependent phosphorylation of ERK was evident (Figure 1.1). Furthermore, the localization of ppERK to the nucleus was most evident at the highest EGF concentration. The dose-dependent activation of ERK was confirmed using our quantitative image processing algorithms (Figure 1.2). At the highest EGF concentration, the average amount of nuclear ppERK was approximately fivefold above the response when EGF was absent, whereas a relatively moderate amount of EGF (0.01 ng/ml) induced only a threefold increase in nuclear ppERK.
Figure 1.1 Serum-starved MCF-10A cells were stimulated with 0, 0.01, and 10 ng/ml EGF. Following 15 minutes of stimulation, cells were immunostained against ppERK (FITC) and nuclei were counterstained with DAPI. The scale bar represents 50 μm.
Figure 1.2 Average nuclear ppERK intensities in samples treated with 0, 0.01, and 10 ng/ml EGF. The error bars indicate S.E. (n = 3) with duplicates performed in each experiment. The asterisk denotes p < 0.01 (Student's t-test).
Since these measurements were conducted at the single-cell level, one can analyze the variation in cell responses across the population. We generated a histogram representing the distribution of nuclear ppERK levels across the population for the three different EGF concentrations (Figure 1.3). In the absence of EGF, most cells fall into a narrow range of low nuclear ppERK intensity. As the EGF concentration was increased, this distribution shifted gradually to the right. These results indicate that the level of nuclear ppERK is a graded response to EGF stimulation at the single-cell level.
1.6 Statistical Guidelines
A total of three independent trials (n = 3) were conducted to gather statistically meaningful data. In each trial, duplicates were prepared for each condition to minimize errors associated with sample preparation. For each sample, five images were collected at multiple fields; altogether, at least 150 cells were analyzed for each condition. In each trial, the average amount of nuclear ppERK for each condition was expressed relative to the level in the 10 ng/ml EGF sample; thus, a statistical test was not performed between the 10 ng/ml EGF sample and the other samples. A one-tailed Student's t-test was performed between the 0 and 0.01 ng/ml EGF samples and indicated that these values were different, with a p-value less than 0.01. Error bars represent the standard error with n = 3.
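If the trial-averaged, normalized values are stored as vectors, the one-tailed comparison described above can be reproduced with the Statistics Toolbox. The sketch below uses empty placeholders that must be filled with the measured trial averages; depending on the MATLAB release, the tail may need to be passed positionally rather than as a name-value pair.

    % Trial-averaged nuclear ppERK, each value expressed relative to the 10 ng/ml EGF sample.
    egf_0    = zeros(1, 3);   % placeholder: fill with the three trial averages for 0 ng/ml EGF
    egf_0p01 = zeros(1, 3);   % placeholder: fill with the three trial averages for 0.01 ng/ml EGF

    % One-tailed two-sample t-test: is the 0.01 ng/ml response larger than the unstimulated one?
    [h, p] = ttest2(egf_0p01, egf_0, 'Tail', 'right');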
Figure 1.3 Histogram representation of the distribution of the nuclear ppERK levels in cell populations treated with 0, 0.01, and 10 ng/ml EGF.
1.7 Discussion and Commentary
Intracellular signaling pathways control cell behaviors and multicellular morphodynamics. A quantitative understanding of these pathways will provide design principles for tuning these signals in order to engineer cell behaviors and tissue morphology. The transmission of information in these pathways involves both site-specific covalent modifications to signaling proteins and spatial localization of these signals. Here, we describe algorithms for quantifying signal localization from immunofluorescence staining for phosphorylated ERK. Our data reveal that in epithelial cells, ERK exhibits a graded response to EGF not only at the population level, but also at the level of individual nuclei. These results are consistent with other studies that have reported graded ERK responses to various stimuli in other mammalian cell systems [17, 18]. The algorithms presented here should facilitate quantitative, high-throughput analysis of images acquired by IF staining.
1.8 Application Notes
The method described in this report would be particularly useful in quantifying the spatiotemporal signaling response at a single-cell level. The algorithm should allow automated and high-throughput quantification of subcellular compartmentation of signaling events in response to multiple combinations and doses of environmental stimuli. It should also prove useful for quantitative studies of cell-to-cell variation in signaling. Such measurements would provide valuable quantitative data for systems-level analysis of signal transduction networks, the regulatory architecture that governs cellular decision-making.
1.9 Summary Points
• Before beginning image acquisition, choose an exposure time that maximizes the dynamic range of signals that may be quantified. The chosen exposure time should be fixed and used to capture images from all the samples.
Troubleshooting Table

Problem: Background is too high.
Explanation: Nonspecific binding of the primary or secondary antibody; basal ERK activity mediated by autocrine factors.
Potential solutions: Make sure to follow the required blocking and washing steps thoroughly. Perform a negative control using only the secondary antibody (skipping the primary antibody incubation) to assess the level of nonspecific binding. Prepare a sample treated with PD98059 to quench ERK activity altogether.

Problem: The number of segmented nuclei is significantly less than the actual number of nuclei.
Explanation: Failure to detect the edges of some of the nuclei.
Potential solutions: Increase the exposure time until DAPI signals at the location of nuclei become saturated; this will ensure contrast between the nuclei and the background. Alternatively, the imadjust or contrast functions can be used in MATLAB to enhance the contrast of a DAPI image before performing nuclear segmentation (see the sketch after this table).

Problem: The number of nuclei is significantly over-counted.
Explanation: Many noncellular objects were considered as nuclei; over-segmentation from the watershed algorithm.
Potential solutions: Rinse and wipe the slides with alcohol to get rid of dried salts and stain. Avoid air bubbles when mounting the sample with antifade. Adjust the upper and lower limits of the nuclear area threshold appropriately to exclude noncellular objects, with qualitative verification. Use the imhmin function to reduce over-segmentation.
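As a minimal illustration of the contrast-enhancement suggestion above, using the variable names from Section 1.4.3 (by default, imadjust saturates the bottom and top 1% of pixel values):

    % Stretch the DAPI intensity range before edge detection.
    DAPIadj = imadjust(DAPI);
    [edgeDAPI, thresh] = edge(DAPIadj, 'sobel');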
• Qualitatively verify that the edge detection and watershed algorithms properly segment individual nuclei.
• Size thresholds are often necessary to exclude noncellular objects.
• Account for the background fluorescence level so that only fluorescence signals from signaling proteins are measured.
• Choose a proper sample size (e.g., the number of cells analyzed in each trial) depending on the degree of cell-to-cell variance of the target proteins.
• Add phosphatase inhibitors at the fixation and permeabilization steps if the target signaling molecules are phosphoproteins.
• Rigorous washing after incubation with antibodies is essential to minimize background staining.
Acknowledgments
The authors thank the members of the Asthagiri Lab for helpful discussions. Funding for this work was provided by The Jacobs Institute for Molecular Engineering for Medicine.
References
[1] Asthagiri, A.R., and D.A. Lauffenburger, "Bioengineering models of cell signaling," Annu. Rev. Biomed. Eng., Vol. 2, 2000, pp. 31–53.
[2] Haugh, J.M., "Localization of receptor-mediated signal transduction pathways: the inside story," Mol. Interv., Vol. 2, No. 5, 2002, pp. 292–307.
[3] Misteli, T., and D.L. Spector, "Applications of the green fluorescent protein in cell biology and biotechnology," Nat. Biotechnol., Vol. 15, No. 10, 1997, pp. 961–964.
[4] Pollok, B.A., and R. Heim, "Using GFP in FRET-based applications," Trends Cell. Biol., Vol. 9, No. 2, 1999, pp. 57–60.
[5] Mochizuki, N., et al., "Spatio-temporal images of growth-factor-induced activation of Ras and Rap1," Nature, Vol. 411, No. 6841, 2001, pp. 1065–1068.
[6] Ting, A.Y., et al., "Genetically encoded fluorescent reporters of protein tyrosine kinase activities in living cells," Proc. Natl. Acad. Sci. USA, Vol. 98, No. 26, 2001, pp. 15003–15008.
[7] Tyas, L., et al., "Rapid caspase-3 activation during apoptosis revealed using fluorescence-resonance energy transfer," EMBO Rep., Vol. 1, No. 3, 2000, pp. 266–270.
[8] Graham, D.L., P.N. Lowe, and P.A. Chalk, "A method to measure the interaction of Rac/Cdc42 with their binding partners using fluorescence resonance energy transfer between mutants of green fluorescent protein," Anal. Biochem., Vol. 296, No. 2, 2001, pp. 208–217.
[9] Claesson-Welsh, L., "Platelet-derived growth factor receptor signals," J. Biol. Chem., Vol. 269, No. 51, 1994, pp. 32023–32026.
[10] Hlavacek, W.S., et al., "Rules for modeling signal-transduction systems," Sci. STKE, Vol. 2006, No. 344, 2006, p. RE6.
[11] Wolf-Yadlin, A., et al., "Multiple reaction monitoring for robust quantitative proteomic analysis of cellular signaling networks," Proc. Natl. Acad. Sci. USA, Vol. 104, No. 14, 2007, pp. 5860–5865.
[12] Giepmans, B.N., et al., "The fluorescent toolbox for assessing protein location and function," Science, Vol. 312, No. 5771, 2006, pp. 217–224.
[13] Wetzker, R., and F.D. Bohmer, "Transactivation joins multiple tracks to the ERK/MAPK cascade," Nat. Rev. Mol. Cell. Biol., Vol. 4, No. 8, 2003, pp. 651–657.
[14] Gutkind, J.S., "Regulation of mitogen-activated protein kinase signaling networks by G protein-coupled receptors," Sci. STKE, Vol. 2000, No. 40, 2000, p. RE1.
[15] Yarden, Y., and M.X. Sliwkowski, "Untangling the ErbB signalling network," Nat. Rev. Mol. Cell. Biol., Vol. 2, No. 2, 2001, pp. 127–137.
[16] Graham, N.A., and A.R. Asthagiri, "Epidermal growth factor-mediated T-cell factor/lymphoid enhancer factor transcriptional activity is essential but not sufficient for cell cycle progression in nontransformed mammary epithelial cells," J. Biol. Chem., Vol. 279, No. 22, 2004, pp. 23517–23524.
[17] Mackeigan, J.P., et al., "Graded mitogen-activated protein kinase activity precedes switch-like c-Fos induction in mammalian cells," Mol. Cell. Biol., Vol. 25, No. 11, 2005, pp. 4676–4682.
[18] Whitehurst, A., M.H. Cobb, and M.A. White, "Stimulus-coupled spatial restriction of extracellular signal-regulated kinase 1/2 activity contributes to the specificity of signal-response pathways," Mol. Cell. Biol., Vol. 24, No. 23, 2004, pp. 10145–10150.
CHAPTER 2
Development of Green Fluorescent Protein-Based Reporter Cell Lines for Dynamic Profiling of Transcription Factor and Kinase Activation

Colby Moya and Arul Jayaraman
Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX; 222 Jack E. Brown Engineering, 3122 TAMU, College Station, TX 77843-3122; phone: (979) 845-3306; fax: (979) 845-6446; e-mail: [email protected]
Abstract
One of the main goals of systems biology is the development of quantitative models for describing and predicting cellular responses on the basis of regulatory molecules such as transcription factors and signaling kinases. The regulation of gene expression by transcription factors and kinases, through different expression and activation dynamics, is integral in governing the expression of specific genes and cellular phenotypes. In this chapter, we present methods to engineer systems suitable for monitoring the dynamic activity of transcription factors and kinases. These methods were used to develop reporter cell lines for the transcription factor PPARγ, as well as a reporter construct for the kinase ERK1/2.
Key terms: dynamic expression profiling; GFP; transcription factor; kinase; FRET; adipocytes
2.1 Introduction
An important requirement for the development of signal transduction models is the ability to quantitatively describe the activation dynamics of regulatory molecules. However, the activation of transcription factors has conventionally been monitored using protein binding techniques such as the electrophoretic mobility shift assay or chromatin immunoprecipitation [1], while kinase activity is typically investigated using enzymatic assays. While these techniques are capable of providing snapshots of activation at a small set of single time points, they can yield only qualitative data (e.g., the mobility shift assay) and require the use of multiple cell populations for each time point at which activity is to be measured. As a result of the limited sampling points and frequencies, the true dynamics of regulatory molecules are not easily captured. Hence, there is a need for methods to investigate time-dependent activation of regulatory molecules in a quantitative manner.

Green fluorescent protein (GFP) reporter systems have recently been developed for the continuous and noninvasive monitoring of transcription factor and kinase activation dynamics. Transcription factor reporter systems involve expressing GFP under the control of a minimal promoter such that GFP expression and fluorescence are observed only when a transcription factor is activated (i.e., when the transcription factor binds to its specific DNA binding sequence and induces expression from the minimal promoter). Since wild-type GFP has a half-life of ~72 hours, short half-life variants of GFP have been used so that the activation and decay of different transcription factors (i.e., their dynamics) can be profiled. Prior work from our lab has used GFP-based profiling to continuously monitor activation of a panel of transcription factors underlying the inflammatory response in hepatocytes for 24 hours [2–5], where the dynamics of GFP fluorescence is a quantitative indicator of the dynamics of the transcription factor being profiled.

Fluorescence resonance energy transfer (FRET) has been used to monitor the dynamics of kinase signaling and activity [6, 7]. Recent advances have utilized FRET as a real-time indicator of the activity of numerous kinases and proteases. An example of this is the use of FRET to monitor the activity of protein kinase C (PKC) in living cells by correlating FRET changes to PKC substrate binding and phosphorylation [6]. FRET occurs from the transfer of energy from a donor fluorophore to an acceptor fluorophore upon excitation. When the donor and acceptor proteins are in close proximity (< 10 nm), energy is transferred and the spectral properties change in such a way that excitation of the donor results in emission in the spectral range of the acceptor [8].

Most studies using GFP or other reporter genes involve transiently introducing the reporter plasmid into cells and monitoring changes in activation. However, the stable insertion of reporter plasmids to generate reporter cell lines is more advantageous, as it results in a relatively homogeneous population (in terms of the number of reporter plasmid copies in each cell and the fraction of the cell population that contains the reporter plasmid), thereby increasing the efficiency of profiling. In this chapter, we describe methods for developing GFP-based transcription factor and kinase reporter plasmids.
PPARγ is used as the model transcription factor while the kinase activation methods are based on ERK1/2; however, these methods are applicable for any transcription factor whose DNA binding sequence is known and any kinase whose substrate and binding partner have been identified. In addition, we also describe
methods for generating reporter cell lines with the transcription factor reporter plasmids using 3T3-L1 adipocytes as the model cell line.
2.2 Materials

2.2.1 Cell and bacterial culture
1. Complete growth medium for 3T3-L1 preadipocytes: Dulbecco's Modified Eagle Medium (DMEM, Hyclone, Logan, Utah) supplemented with 10% adult bovine serum (BS, Hyclone, Logan, Utah), 200 units/ml and 200 μg/ml of penicillin/streptomycin, respectively (Hyclone, Logan, Utah), and glucose (4.5 g/L).
2. Complete growth medium for HepG2 cells: Modified Eagle Medium (MEM, Hyclone, Logan, Utah) supplemented with 10% fetal bovine serum (FBS, Hyclone, Logan, Utah), 200 units/ml and 200 μg/ml of penicillin/streptomycin, respectively (Hyclone, Logan, Utah), and glucose (1 g/L).
3. Cryogenic freezing medium: DMEM supplemented with 20% fetal bovine serum (FBS, Hyclone, Logan, Utah), 10% dimethyl sulfoxide (DMSO, Fisher, Pittsburgh, Pennsylvania), and 200 units/ml and 200 μg/ml of penicillin/streptomycin, respectively (Hyclone, Logan, Utah).
4. LB medium (10 g Bacto-tryptone, 5 g yeast extract, 10 g NaCl, pH 7.5, in 1 L).
5. Kanamycin (Fisher, Pittsburgh, Pennsylvania).
6. LB agar plates supplemented with 30 μg/mL kanamycin.
7. 10-cm cell culture dish (Corning, Lowell, Massachusetts).
8. 24-well and 6-well cell culture plates (Corning, Lowell, Massachusetts).
9. Cloning cylinder, 6 × 8 mm (Corning, Lowell, Massachusetts).
10. E. coli XL1-Blue electrocompetent cells.
11. Lab-Tek chambered coverglass with two wells (Fisher Scientific, Rochester, New York).
12. Petrolatum (Fisher Scientific, Rochester, New York).
2.2.2 Buffers and reagents
1. 10X annealing buffer [100 mM Tris-HCl (pH 7.5), 1 M NaCl, 10 mM EDTA].
2. TE buffer [10 mM Tris-HCl (pH 7.5), 1 mM EDTA].
3. Restriction enzymes (BglII, HindIII, EcoRI, BamHI, NotI, XhoI) (NEB, Ipswich, Massachusetts).
4. Wizard SV Gel and PCR Clean-Up System (Promega, Madison, Wisconsin).
5. T4 DNA ligase (NEB, Ipswich, Massachusetts).
6. Antarctic phosphatase (NEB, Ipswich, Massachusetts).
7. Trypsin (0.05%) supplemented with ethylenediaminetetraacetic acid (EDTA, 0.02 g/L) in Hanks Balanced Salt Solution (HBSS) without calcium or magnesium (Hyclone, Logan, Utah).
8. 1X phosphate buffered saline (PBS) (pH 7.3).
9. GenJet transfection reagent for HepG2 cells (Signagen, Gaithersburg, Maryland).
10. GoTaq PCR mix (Promega, Madison, Wisconsin).
2.2.3 Cloning
1. Plasmid pCEP4CyPet-MAMM (Addgene, Cambridge, Massachusetts).
2. Plasmid pCEP4YPet-MAMM (Addgene, Cambridge, Massachusetts).
3. Plasmid pEYFP-N1 (Clontech, Mountain View, California).
4. GenePulser XCell electroporation system (Bio-Rad, Hercules, California).
5. Electroporation cuvettes, 2 mm and 4 mm gap (Bio-Rad, Hercules, California).
6. Marligen plasmid maxiprep kit (Marligen Biosciences, Ijamsville, Maryland).
7. Eppendorf miniprep kit (Eppendorf, Westbury, New York).
2.2.4 Microscopy
1. Zeiss Axiovert 200M inverted fluorescence microscope (Carl Zeiss Microimaging, Inc., Thornwood, New York) (or a similar fluorescence microscope).
2. U0126 (MEK1/2 inhibitor) (Cell Signaling Technologies, Danvers, Massachusetts).
3. PMA (phorbol 12-myristate 13-acetate) (Fisher, Pittsburgh, Pennsylvania).
4. Human recombinant interleukin-6 (R&D Systems, Minneapolis, Minnesota).
2.3 Methods

2.3.1 3T3-L1 cell culture
1. 3T3-L1 preadipocytes are grown in DMEM supplemented with 10% bovine serum (BS) and 2% penicillin/streptomycin. Once confluence is reached, cells are passaged at a 1:10 dilution for routine propagation. Cells are passaged at least twice prior to the experiment to ensure recovery from cryopreservation.
2. 3T3-L1 preadipocytes can be terminally differentiated into mature adipocytes by culturing in media containing a cocktail of hormones. Confluent preadipocytes are cultured for 48 hours in DMEM supplemented with 10% fetal bovine serum (FBS), 1 μM dexamethasone, 0.5 mM 3-isobutyl-1-methylxanthine (IBMX), 2 nM 3,3',5-triiodo-L-thyronine (T3), 1 μg/mL insulin, and 2% penicillin/streptomycin. Cells are then cultured for another 48 hours in DMEM supplemented with 10% FBS, 2 nM T3, 1 μg/mL insulin, and 2% penicillin/streptomycin. 3T3-L1 adipocytes are maintained in DMEM supplemented with 10% FBS and 2% penicillin/streptomycin.
3. For long-term storage of 3T3-L1 preadipocytes, add 5 mL of freezing medium to a cell pellet of ~5 × 10^6 cells and freeze 1 mL aliquots in liquid nitrogen.
2.3.2 Transcription factor reporter development

2.3.2.1 Identification of response elements
The first step in the development of a transcription factor (TF) reporter is the identification of the TF response element or binding site (i.e., the DNA sequence to which the TF binds to regulate gene expression) (Figure 2.1). Publicly available curated databases such as TRANSFAC can be used to identify response elements for different TFs. We used the TRANSFAC database as well as literature that provided binding sequences for the TFs specific to our work.

Figure 2.1 Illustration of the GFP reporter system. (a) No fluorescence is observed in the absence of TF binding to the DNA response element (RE). (b) Binding of TF results in activation of the promoter (Prom) and transcription of the gfp gene.

The following is a basic approach using TRANSFAC to identify a TF response element.
1. Access http://www.gene-regulation.com/pub/databases.html and click on the "Search TRANSFAC public" tab.
2. After successfully logging in, click on the "matrix" tab.
3. On the matrix page, enter the full name of the TF of interest or its acronym.
4. Change the "Table field to search in" box to "Factor name" and submit.
5. Click the appropriate returned result.
6. A nucleotide base matrix is generated listing the bases most likely to make up the TF response element.
2.3.2.2 Identification of TF binding sites
We have engineered reporter constructs for several TFs. The following section will use PPARγ as an example.
1. Analyze the generated TRANSFAC matrix for PPAR (the final column gives the consensus base at each position):

   PO    A    C    G    T
   01   14   16   13   17   N
   02   12   21   14   16   N
   03   32    6    9   17   W
   04   19    3   45    3   G
   05   31    0   41    0   R
   06    0    0   72    0   G
   07    0    0   71    1   G
   08    0    0    0   72   T
   09    0   70    2    0   C
   10   72    0    0    0   A
   11   72    0    0    0   A
   12   71    0    1    0   A
   13    0    0   72    0   G
   14    0    0   70    2   G
   15    0    0    0   72   T
   16    0   69    1    2   C
   17   70    0    2    0   A
   18   16   20    2   20   N
   19   17   21   15    3   N
   20    8   15   13   13   N
   21   14    2   16   12   N
   XX
From the table above, we can see that the response element is conserved from position 5 through position 17. The corresponding bases (from 5' to 3') are RGGTCAAAGGTCA. The R represents a purine (adenine or guanine), and the relatively equal counts suggest that either one can be used in the design of the reporter construct. In this example, we chose guanine due to its slightly higher prevalence.
2. Synthesize the TF binding element as complementary oligonucleotides. The TF binding element consists of three tandem repeats of the response element separated by a single base, with appropriate restriction enzyme sites for cloning. The sequence designed for PPAR is given here:
   5'-AGATCTAAGCTTGGGTCAAAGGTCATGGGTCAAAGGTCAAGGGTCAAAGGTCAGAATTC-3'
The 5'-terminal AGATCT and 3'-terminal GAATTC are restriction sites for BglII and EcoRI, respectively, while the adjacent AAGCTT is the restriction site for HindIII. The intervening bases make up the three PPARγ binding-site repeats, each separated by a single base (see Section 2.6, #1).
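The two manipulations above (reading the conserved core off the count matrix and generating the second strand of the designed element) can also be done programmatically. The sketch below is illustrative only: the 90% conservation cutoff is an arbitrary choice rather than a TRANSFAC rule, and seqrcomplement requires the MATLAB Bioinformatics Toolbox.

    % Count matrix from the PPAR TRANSFAC entry above (rows: positions 1-21; columns: A C G T).
    counts = [14 16 13 17; 12 21 14 16; 32  6  9 17; 19  3 45  3; 31  0 41  0;
               0  0 72  0;  0  0 71  1;  0  0  0 72;  0 70  2  0; 72  0  0  0;
              72  0  0  0; 71  0  1  0;  0  0 72  0;  0  0 70  2;  0  0  0 72;
               0 69  1  2; 70  0  2  0; 16 20  2 20; 17 21 15  3;  8 15 13 13;
              14  2 16 12];
    bases = 'ACGT';
    [topCount, idx] = max(counts, [], 2);             % most frequent base at each position
    frac = topCount ./ sum(counts, 2);                % fraction of sites carrying that base
    consensus = reshape(bases(idx), 1, []);           % consensus base per position
    core = consensus(reshape(frac > 0.9, 1, []));     % illustrative 90% cutoff -> GGTCAAAGGTCA

    % Sense strand of the designed PPAR binding element and its reverse complement,
    % which is the second oligonucleotide to order (seqrcomplement: Bioinformatics Toolbox).
    sense = 'AGATCTAAGCTTGGGTCAAAGGTCATGGGTCAAAGGTCAAGGGTCAAAGGTCAGAATTC';
    antisense = seqrcomplement(sense);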
2.3.2.3 Cloning TF binding elements into reporter plasmids
The TF binding element is cloned upstream of a minimal CMV promoter which controls expression of the EGFP reporter gene. In the absence of TF binding, the minimal promoter is not active and minimal EGFP is detected. When a TF binds to its response element, it activates the promoter, leading to transcription of the EGFP gene. In this example, the PPAR binding sequence is cloned into pCMVmin-d2egfp-N1 [4].
1. Reconstitute the complementary oligonucleotides making up the TF binding element in TE buffer to a concentration of 100 μM. Mix the two oligonucleotides to a final concentration of 40 ng/μL each in a 50 μL reaction. Anneal the DNA strands by incubation at 95ºC for 5 minutes, followed by gradual cooling to room temperature at a rate of 0.5ºC/min. Annealing and cooling can be performed using a standard PCR thermocycler.
2. Digest 25 μL of the annealed DNA with 25 units of BglII (or any other appropriate restriction enzyme) in a 50 μL reaction at 37ºC for 16 hours. The vector into which the TF binding element is to be cloned (pCMVmin-d2egfp-N1) is digested in parallel.
3. Precipitate the digested DNA using 5 μL sodium acetate (3M, pH 5.2) and 150 μL absolute ethanol. Vortex the reactions and incubate at –20ºC for 1 hour. Pellet the DNA by centrifugation at 16,000 × g for 30 minutes. Decant the supernatant (taking care not to disturb the DNA pellet) and resuspend in 30 μL ddH2O.
4. To the single-digested DNA and vector, add 20 units of EcoRI (NEB, Ipswich, Massachusetts), or any other appropriate enzyme, along with the appropriate buffer. Make up the volume to 50 μL and incubate at 37ºC for 16 hours.
5. Separate the double-digested DNA on a 1% low-melting-point agarose gel and purify the DNA fragment using the Wizard SV Gel and PCR Clean-Up System (Promega, Madison, Wisconsin) as per the manufacturer's suggestions. Electrophoresis is carried out at 100 volts for 60 minutes.
6. Treat 1 μg of the digested vector with Antarctic phosphatase (NEB, Ipswich, Massachusetts) for 1 hour at 37ºC to remove the 5' phosphate group from the vector. Heat-inactivate the phosphatase by incubating at 65ºC for 5 minutes.
7. Set up three ligation reactions using the double-digested TF binding element oligonucleotides and the phosphatase-treated vector. As a starting point, use molar ratios of 5:1 (insert:vector), 0:1 (control), and 1:5 (see the sketch after this list for converting molar ratios to DNA amounts). Allow the insert and vector to ligate at 16ºC for 1 hour using T4 DNA ligase, followed by heat inactivation at 65ºC for 10 minutes.
8. While the ligations are being heat-inactivated, prepare three sterile, 2-mm-gap electroporation cuvettes by placing them on ice along with electrocompetent cells. Mix 2 μL (10 ng) of the ligation reaction with 40 μL of electrocompetent cells and electroporate (2,500V and 25 μF) with the GenePulser XCell electroporation system (Bio-Rad, Hercules, California) or any other comparable electroporation unit.
9. Immediately add 1 mL of LB medium to the cells and allow them to recover at 37ºC for 1 hour with agitation.
10. Collect the electroporated cells by centrifuging at 12,000 × g for 30 seconds. Decant the supernatant and resuspend the cells in the residual medium (~50 μL). Plate the cells on LB agar plates containing kanamycin (30 μg/ml) and incubate at 37ºC overnight.
11. Collect kanamycin-resistant colonies (~10) from the ligation plates and inoculate overnight in 5 mL of LB medium supplemented with 30 μg/mL of kanamycin.
12. Extract plasmid DNA from the overnight cultures using the Eppendorf miniprep kit (Eppendorf, Westbury, New York) as per the manufacturer's protocol.
13. Perform multiple restriction digests to verify the fidelity of the obtained clone. In our scheme, since a HindIII restriction site is present in the vector, we engineered a second HindIII site into the TF binding element. Therefore, plasmids having the TF binding element correctly inserted will have two HindIII sites, whereas incorrect clones will have only a single site.
14. Propagate the putative correct clone(s) and extract the plasmid using a plasmid maxiprep kit (Marligen Biosciences, Ijamsville, Maryland).
15. Sequence the plasmid to verify the fidelity of the inserted binding element.
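The molar ratios in step 7 can be converted into DNA amounts from the relative lengths of the fragments. The sketch below is generic arithmetic; the vector and insert lengths and the vector amount are illustrative assumptions, not values specified in the protocol.

    % Mass of insert needed for a chosen insert:vector molar ratio:
    %   ng_insert = ng_vector * (insert_length / vector_length) * ratio
    vector_ng = 50;      % assumed amount of phosphatase-treated vector in the ligation (ng)
    vector_bp = 4700;    % assumed vector length (bp) -- illustrative only
    insert_bp = 60;      % assumed annealed binding-element length (bp) -- illustrative only
    ratio     = 5;       % desired insert:vector molar ratio (e.g., 5:1)

    insert_ng = vector_ng * (insert_bp / vector_bp) * ratio;
    fprintf('Use %.2f ng of insert per %d ng of vector for a %d:1 ratio\n', ...
            insert_ng, vector_ng, ratio);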
2.3.3 Kinase reporter development
The functionality of the FRET-based kinase reporter (Figure 2.2) is primarily based on a linker region that contains a substrate domain (pink), a phosphoamino acid binding domain (red), and a flexible amino acid domain (green) which links the other two domains. Kinases will bind to and phosphorylate specific amino acid residues within the substrate domain. Phosphorylated residues are then recognized and bound by the phosphoamino acid binding domain, which results in a conformational change within the construct. Due to this conformational change, the two fluorescent proteins (CFP and YFP) come into close proximity, resulting in FRET. Once the substrate and binding domains for a kinase are identified, they can be synthetically developed using oligonucleotides. The acceptor and donor fluorescent proteins (in our example, YPet and CyPet, respectively) can be generated from a plasmid template using PCR, while the flexible linker region can be synthesized as complementary oligonucleotides, annealed together, and amplified. Figure 2.3 summarizes the cloning steps involved in the development of a FRET construct.

Figure 2.2 Schematic illustrating the expected spectral overlap during FRET. When two fluorescent proteins (cyan fluorescent protein and yellow fluorescent protein, CFP and YFP, respectively) are sufficiently distant from one another, they retain their individual spectral properties (i.e., CFP: excitation 433 nm, emission 475 nm). If the two proteins come into close proximity, the spectral properties change such that excitation at a low wavelength (433 nm) results in emission at a high wavelength (507 nm). Phosphorylation by the kinase brings the fluorophores together; dephosphorylation by phosphatases reverses this.
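As standard background (not taken from this chapter), the strong distance dependence of FRET follows from the Förster relation, where E is the transfer efficiency, r the donor-acceptor distance, and R_0 the Förster radius of the fluorophore pair (typically a few nanometers):

    E = \frac{R_0^6}{R_0^6 + r^6} = \frac{1}{1 + (r/R_0)^6}

Because E falls off with the sixth power of distance, appreciable energy transfer is only observed when the donor and acceptor are within roughly 10 nm of each other, which is why the conformational change described above produces a measurable FRET signal.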
2.3.3.1 Selection of FRET elements
1. Identify and select a suitable substrate domain. For example, to monitor the activation of extracellular signal-regulated kinase (ERK), we selected a domain from Elk1 as the substrate domain because it is a downstream target of ERK [9, 10]. The Elk1 domain contains the phosphoamino acid motifs serine-proline (SP) and threonine-proline (TP), which can be phosphorylated by bound ERK. The Elk1 domain also contains a FQFP motif which is recognized by ERK as a docking domain [11].
2. Identify and select a suitable phosphoamino acid binding domain, which is the next key consideration of the design. Some phosphoamino acid binding domains reported in the literature include 14-3-3 [12], the forkhead-associated (FHA) domain [13], and several WW domains [14, 15]. We chose the WW domain as our phosphoamino acid binding domain due to its native binding affinity for phosphoserine and phosphothreonine.
3. Engineer a linker region which will join the substrate domain and the phosphoamino acid binding domain. The primary purpose of this domain is to allow flexibility; therefore, when selecting the amino acid residues for this region, it is recommended that amino acids that may provide stiffness (e.g., proline) be left out. The linker region in our construct is glycine and serine rich (GSHSGSGKP). Another consideration for the linker region is length, since there exists an optimal linker length that will allow for maximal FRET. As seen above, our linker region is nine amino acids long, which should be a good starting point for most designs.
4. Select the appropriate fluorophores for the construct. We developed our FRET construct using CyPet and YPet, which are variants of CFP and YFP, respectively [16]. The cDNAs corresponding to the two fluorescent proteins were linked by a 306-base sequence that consists of, in order, the DNA corresponding to a WW domain, a flexible linker, and Elk1 phosphorylation sites.

Figure 2.3 Cloning scheme involved in the development of the pCyPet-WWElk1-YPet-N1 FRET reporter plasmid.
2.3.3.2 Fluorescent protein PCR
1. Develop forward and reverse primers for PCR amplification of the genes that encode the donor and acceptor fluorescent proteins, CyPet and YPet, from the plasmids pCEP4CyPet-MAMM and pCEP4YPet-MAMM [16]. Each primer should be developed as an oligonucleotide containing 18 to 20 bases that are complementary to the template, six to eight bases of the recognition sequence for the appropriate restriction enzymes, and an additional 10 to 12 base sequence to facilitate restriction enzyme binding and digestion. Note that the reverse primer of the upstream fluorescent protein (CyPet) must be designed in such a way that the stop codon is eliminated; otherwise, the entire FRET construct will not be translated (see Section 2.6, #2).
2. Amplify the CyPet and YPet genes using PCR in 25 μL reactions using primers at a final concentration of 0.1 μM and 100 ng of template. Perform PCR for 40 cycles with an annealing temperature of 59°C (the annealing temperature should be roughly 10 degrees lower than the calculated melting temperature of the primers). Any commercially available PCR kit can be used.
3. Remove unincorporated dNTPs and buffers by cleaning the PCR product using a PCR clean-up kit. Elute the PCR product with 35 μL of ddH2O.
2.3.3.3 Fluorescent protein cloning
1. Digest 10 μg of the YPet PCR product with 30 units of BamHI and 25 units of NotI (or any other appropriate restriction enzymes) in a 50 μL reaction at 37°C for 16 hours.
2. Digest the CyPet PCR product with 40 units of XhoI and 40 units of EcoRI under the same conditions as the digest in step 1.
3. Digest 10 μg of the vector pEYFP-N1 with 30 units of BamHI and 25 units of NotI as above.
4. Separate the double-digested plasmid and PCR products on a 1% low melting point agarose gel and purify the DNA fragments using the Wizard SV Gel and PCR Clean-Up System (Promega, Madison, Wisconsin) as per the manufacturer's suggestions.
5. Follow steps 6 through 12 of Section 2.3.2.3 for cloning the digested YPet PCR product into the digested and phosphatase-treated pEYFP-N1 vector to generate plasmid pYPet-N1.
6. Perform multiple restriction digests to verify the fidelity of the obtained clones. For example, digestion with enzymes that have a restriction site in the ypet gene should be used along with enzymes that have a restriction site in the vector backbone. Therefore, plasmids which have the ypet gene correctly inserted will display two appropriately sized bands on an agarose gel.
7. Further propagate the putative correct clones and sequence the plasmid to verify the fidelity of the inserted gene.
8. Digest the newly engineered pYPet-N1 with 40 units each of EcoRI and XhoI in a 50 μL reaction overnight at 37°C.
9. Gel purify 10 μg of the EcoRI/XhoI digested pYPet-N1 as per step 4 above.
10. Follow steps 6 through 12 of Section 2.3.2.3 to clone the EcoRI/XhoI digested CyPet PCR product (from step 2) into the gel purified and newly phosphatase-treated pYPet-N1. The end result of this step should be pCyPet-YPet-N1, a plasmid containing both fluorophores separated by a small arbitrary nucleotide sequence (step 2 of Figure 2.3).
11. Digest the newly engineered pCyPet-YPet-N1 with 45 units each of EcoRI and BamHI in a 50 μL reaction overnight at 37°C.
12. Gel-purify 10 μg of the EcoRI/BamHI digested pCyPet-YPet-N1 as per step 4 above.
13. Phosphatase-treat the purified pCyPet-YPet-N1 as per step 6 of Section 2.3.2.3.
Figure 2.4 DNA sequence of the oligonucleotide fragments used for constructing the FRET plasmid linker region. Both sense and antisense strands are shown.
2.3.3.4 Linker oligonucleotide development and annealing
The linker region of the FRET construct contains the phosphoamino acid binding domain, the substrate region, and the 9 amino acid flexible motif that links them. After identifying the appropriate domains for the construct, their corresponding nucleic acid sequences can be combined to yield a functional FRET protein upon translation (see Section 2.6, #3).
1. Identify the DNA corresponding to the amino acid sequences of the three parts of the linker. In our ERK construct, the phosphoamino acid and substrate domains are 105 and 159 bases long, respectively, while the flexible motif is 27 bases in length.
2. Using the sense strand of the DNA sequence as the basis, divide the entire sequence into multiple fragments that are approximately equal in length. The last section of the sense strand may not be exactly the same length as the others.
3. Develop the complementary antisense strand of the first fragment so that it covers approximately 50% of the 5' end of the sense strand. The second antisense fragment should span the remaining 50% of the previous fragment and 50% of the sense strand of the second fragment. Continue to develop fragments until the entire sequence is covered (Figure 2.4); a programmatic sketch of this tiling scheme follows this list.
4. Synthesize the different fragments as oligonucleotides using any commercial DNA synthesis source.
5. Reconstitute each oligonucleotide to 100 μM with sterile TE buffer.
6. Anneal and amplify the oligonucleotides with GoTaq DNA polymerase or any other suitable polymerase. Add 0.5 μL of each oligonucleotide to the reaction and amplify for 40 cycles with an annealing temperature of 56ºC (see Section 2.6, #4).
7. Perform PCR once more using 1/20th of the reaction from step 6 as the template, the 5' synthetic sense oligonucleotide as the forward primer (1 μM), and the 5' synthetic antisense oligonucleotide as the reverse primer (1 μM). Perform PCR for 22 cycles with an annealing temperature of 62ºC. A high yield of the linker can be obtained by performing several (4 to 5) PCR reactions and combining them just prior to precipitation.
8. Precipitate the 50 μL linker PCR reaction with 5 μL sodium acetate (3M, pH 5.2) and 150 μL absolute ethanol. Vortex the product and incubate at –20ºC for 1 hour. Vortex once more and spin down at 16,000g for 30 minutes. Decant the supernatant and resuspend the DNA pellet in 30 μL ddH2O.
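The tiling described in steps 2 and 3, where each antisense oligonucleotide is offset by roughly half a fragment so that every junction between sense fragments is bridged, can be laid out programmatically. The MATLAB sketch below illustrates the scheme with an arbitrary placeholder sequence and an assumed fragment length; it is not the actual linker sequence.

% Lay out overlapping sense and antisense oligonucleotides (steps 2 and 3).
% The sequence and fragment length are placeholders for illustration only.
senseSeq = repmat('ATGCGTACGTTAGCCTAGGA', 1, 6);   % hypothetical 120-nt sense strand
fragLen  = 40;                                     % assumed fragment length (nt)
offset   = round(fragLen/2);                       % antisense offset (~50% of a fragment)
L        = numel(senseSeq);

% Reverse complement via simple character substitution (no toolbox required)
revcomp = @(s) fliplr(strrep(strrep(strrep(strrep(lower(s),'a','T'),'t','A'),'g','C'),'c','G'));

% Sense fragments tile the sequence end to end (step 2)
senseFrags = {};
for s = 1:fragLen:L
    senseFrags{end+1} = senseSeq(s:min(s+fragLen-1, L)); %#ok<SAGROW>
end

% Antisense fragments: the first covers ~50% of the 5' end of the sense strand;
% each subsequent fragment bridges the junction between neighboring sense fragments (step 3)
antiFrags = {};
startPos  = 1;
stopPos   = offset;
while startPos <= L
    antiFrags{end+1} = revcomp(senseSeq(startPos:min(stopPos, L))); %#ok<SAGROW>
    startPos = stopPos + 1;
    stopPos  = stopPos + fragLen;
end

disp(senseFrags.'); disp(antiFrags.');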
2.3.3.5 Linker region cloning
1. Digest 10 μg of the linker region with 40 units of EcoRI and 40 units of BamHI (or other appropriate restriction enzymes whose recognition sequences were incorporated into the synthetic oligonucleotides).
2. Separate the digested product on a 2% low melting point agarose gel and purify the fragment using the Wizard SV Gel and PCR Clean-Up System (Promega, Madison, Wisconsin) as per the manufacturer's suggestions.
3. Use the EcoRI/BamHI digested and phosphatase-treated pCyPet-YPet-N1 from step 13 of Section 2.3.3.3 as the vector and the gel-purified linker as the insert for cloning. Follow steps 6 through 13 of Section 2.3.3.3 to clone the linker into pCyPet-YPet-N1. The end result of this step should be the final FRET construct, pCyPet-WWElk1-YPet-N1, which contains both fluorescent proteins separated by all three functional domains in the linker region (step 3 of Figure 2.3).
4. Sequence the plasmid to verify the fidelity of the FRET construct.
2.3.3.6 FRET control plasmid development
The functioning of the FRET construct must be validated using appropriate controls and microscopy measurements. Since FRET occurs, in part, through spectral overlap of one fluorophore (donor, CyPet) with another (acceptor, YPet), the extent of the FRET signal is strongly influenced by the distance between the two fluorophores. Therefore, baseline signal values must be established to facilitate quantitative assessment of the FRET signal. This can be accomplished by using plasmids that express either the donor (pCyPet-S) or acceptor (pYPet-S) fluorescent protein alone, as these will provide the maximal intensity of each fluorophore. Similarly, a chimera between the donor and acceptor (pCyPet-YPet-Chimera, a CyPet-YPet fusion protein similar to CyPet-WWElk1-YPet but without the linker region) should be used, as it provides the maximum FRET that can be achieved with the donor and acceptor fluorescent proteins (i.e., the closest distance between the two proteins).
1. Create a new forward primer for PCR of YPet from plasmid pCEP4YPet-MAMM [16]. This primer should contain a Kozak initiation sequence (GCCACC) downstream of a restriction enzyme (BamHI) recognition sequence to aid in protein translation. The reverse primer from step 1 of Section 2.3.3.2 should be used as the reverse primer for development of pYPet-S.
2. Create a new reverse primer for PCR of CyPet from plasmid pCEP4CyPet-MAMM [16]. This primer must complement the 3' end of the cypet gene including the stop codon (the previous reverse primer contained a mutation to eliminate the stop codon). Additionally, it must contain a restriction enzyme (NotI) recognition sequence for cloning. Create a new forward primer analogous to the forward primer created in step 1 of Section 2.3.3.2, except that a different restriction enzyme (BamHI) is used instead of XhoI.
3. Set up two 25-μL PCR reactions, one for the gene of each fluorescent protein. Using any commercially available polymerase kit, add primers at a final concentration of 0.1 μM each, along with 100 ng of template. Perform the reaction for 40 cycles with an annealing temperature of 59°C.
4. Purify the reactions with a PCR clean-up kit to remove the buffers and dNTPs. Elute the PCR products with 35 μL of ddH2O.
5. Use BamHI/NotI digested and phosphatase-treated pEYFP-N1 as the cloning vector for both CyPet-S and YPet-S.
6. Follow steps 6 through 13 of Section 2.3.3.3 for engineering of the pCyPet-S and pYPet-S constructs.
7. Sequence the plasmids to verify the fidelity of the control constructs.
2.4 Application Notes
2.4.1 Electroporation of TF reporter plasmids into 3T3-L1 preadipocytes
We demonstrate the applicability of the method described above by generating a reporter cell line for the transcription factor PPARγ in 3T3-L1 preadipocytes. PPARγ is well established as a master regulator of adipocyte differentiation and function [8]. As preadipocytes differentiate into adipocytes in culture, the activity of PPARγ is expected to continuously change.
2.4.1.1 Electroporation of TF reporter plasmids into 3T3-L1 preadipocytes
1. Linearize the TF reporter construct containing the PPAR binding sites (15 μg) with 60 units of ApaLI (NEB, Ipswich, Massachusetts) (or any other enzyme that cuts only in an unnecessary portion of the plasmid, such as the ampicillin resistance gene) in a 60 μL total volume. Place the reaction in a 37ºC water bath for 16 hours.
2. Precipitate the ApaLI digested plasmid with 6 μL sodium acetate (3M, pH 5.2) and 180 μL absolute ethanol. Vortex the reactions and incubate at –20ºC for 1 hour. Vortex once more and spin down the reactions at 16,000g for 30 minutes. Decant the supernatant and resuspend the pellets in 30 μL sterile PBS. It is important that the plasmid be sterile because it will be mixed with preadipocytes for electroporation.
3. Grow 3T3-L1 cells to confluence in a T-25 flask.
4. Wash the cells three times with 1X PBS. Add 1 mL of trypsin-EDTA to the flask and incubate in a 37ºC incubator for 5 minutes.
5. Remove the cells from the flask by adding 4 mL of complete growth medium. Pipette the cells into a centrifuge tube and centrifuge at 800 rpm for 5 minutes at 4ºC. While the cells are being centrifuged, place a sterile 4-mm gap cuvette on ice.
6. Aspirate the supernatant from the centrifuge tube (see Section 2.6, #5). Reconstitute the cell pellet in 400 μL of complete growth medium (see Section 2.6, #6) and pipette into the cold sterile cuvette. To the 400 μL cell suspension, add the 30 μL sterile DNA solution from step 2. Pipette gently to mix.
7. Using the GenePulser XCell electroporation system (Bio-Rad, Hercules, California), electroporate the cell/DNA suspension at 240V and 950 μF with a time constant of ~48 ms (see Section 2.6, #7).
8. Immediately after electroporation, gently add 600 μL of complete growth medium to the cuvette and mix once. Set the cuvette aside at room temperature for 5 minutes to allow the cells to recover.
9. Remove cells from the cuvette and place in a 100-mm cell culture dish with 14 mL of complete growth medium. Incubate dish at 37°C.
2.4.1.2 Clonal selection
1. After 48 hours of incubation, change the growth medium and supplement with 800 μg/mL of G418 (see Section 2.6, #8).
2. Change and supplement the medium with G418 every 48 hours. At 7 to 10 days post G418 addition, individual colonies should become visible. Allow the colonies to continue to grow until there are enough cells in a single colony to isolate. Stop the culture if the colonies begin to touch each other.
3. Place the cells under the microscope and mark the colonies which look well developed (i.e., proper cell morphology), are separated from other colonies, and are sufficiently large to be isolated. Roughly 10 to 15 colonies should be marked for propagation.
4. Wash the dish 2X with sterile PBS and dip the bottom edge of a 6 × 8-mm cloning cylinder in sterile petrolatum. Using sterile forceps, place the cloning cylinder directly over the marked colony. Repeat this process for all marked colonies.
5. Add 30 μL of trypsin-EDTA to the center of the cloning cylinder. Make sure there are no air bubbles between the colony and the trypsin. Place the cells in a 37ºC incubator for 5 minutes.
6. Add 50 μL of media to all cylinders to stop the reaction. Gently pipette the media/trypsin mixture in each cylinder multiple times to ensure the cells are detached and are not clumped together. Place each colony in a single well of a 24-well plate. Make sure there is no cross-contamination between colonies because the cells are now individual populations. (See Section 2.6, #9.)
7. Culture the cells in the 24-well plate until they become confluent. Confluency may be reached at different times for each clone, so culture each clone appropriately. Typically, 3T3-L1 cells will become confluent 2 to 4 days after passing from the dish.
8. Wash the cells in the 24-well plate with sterile PBS and add 100 μL of trypsin-EDTA. Place the cells in a 37°C incubator for 5 minutes.
9. Add 400 μL of complete growth medium and pipette several times to ensure the cells have detached from the well. Transfer all 500 μL of cell suspension (including trypsin) to a single well of a 6-well plate. Repeat until all the cells have been transferred to the 6-well plate, and culture the cells for another 2 to 4 days (i.e., until confluence).
10. Wash the confluent cells in the 6-well plate with sterile PBS and trypsinize with 400 μL of trypsin-EDTA. Incubate the cells in a 37ºC incubator for 5 minutes.
11. Add 1.6 mL of complete growth medium to each well and pipette to break up cell clumps and completely remove the cells from the well.
12. Pipette the cell suspension into a T-25 flask and add medium to a total volume of 5 mL. Grow the cells to confluency.
13. Once confluency is reached, cells should be frozen down in 1 mL aliquots (~1 × 106 cells) as per the storage section above.
2.4.1.3 Clonal screening
In order to obtain a clone with the TF reporter plasmid stably integrated into the genome without altering cell function, it is necessary to screen multiple colonies (see Section 2.6, #10). Typically, it is recommended that 10 to 15 colonies be screened for activation of the TF by monitoring the induction of GFP fluorescence upon exposure to a specific ligand relative to the initial time point (see Section 2.6, #11). Reporter clones that demonstrate significant GFP induction are identified for further screening and purification [5].
1. Identify and grow the TF reporter clones to ~70% confluency in 24-well tissue culture plates. Switch to phenol red free media 16 hours prior to starting the experiment (see Section 2.6, #12).
2. Determine the extent of GFP fluorescence by adding a known inducer of the TF being studied. For example, thiazolidinedione (TZD), a well-established PPAR agonist [17], can be added to activate PPARγ.
3. Monitor the temporal change in GFP fluorescence using fluorescence microscopy. Once the maximum GFP signal is observed, trypsinize the cells for flow cytometry-based cell sorting.
4. Using flow cytometry, sort the cell population based on the intensity of GFP expression and isolate the population that exhibits the maximum GFP fluorescence. This population contains cells that can be exposed to a specific ligand to activate the TF of interest (i.e., the responsive population). This sorting step is also called "positive sorting."
5. Culture the sorted cells in 6-well tissue culture plates. When the cells are ~70% confluent, trypsinize them and once again sort using flow cytometry.
6. Using flow cytometry, collect cells that do not exhibit any fluorescence. Since these cells were not stimulated with any ligand, isolating the cells that exhibit the least fluorescence yields the population in which background expression of GFP is minimal ("negative sorting").
7. Culture the twice-sorted cells and again stimulate with a known ligand. Isolate the responsive (GFP expressing) cells by flow cytometry. These represent the population that has the highest signal-to-noise ratio and demonstrates maximum induction of the TF of interest (i.e., can be used for dynamic profiling of TF activation).
2.4.1.4 Monitoring PPARγ activation in 3T3-L1 adipocytes
The above-described procedure was used to isolate a 3T3-L1 preadipocyte reporter cell line for monitoring the activation of PPARγ during adipocyte differentiation and enlargement (Figure 2.5). Preadipocytes were grown to confluence and differentiated into mature adipocytes using the differentiation protocol above, and the fluorescence intensity was monitored every 48 hours. The data in Figure 2.5 show that no fluorescence was detected at the beginning of differentiation, indicating that PPARγ is not active at this time point. The fluorescence intensity increases after day 6 and is significantly higher at days 8 and 10 relative to the initial time point (i.e., at the later stages of adipocyte differentiation).
Figure 2.5 Fluorescence images of cells from a single 3T3-L1 PPARγ reporter clone from induction of adipocyte differentiation through development of the mature adipocyte phenotype. The initial image was taken immediately after addition of the differentiation medium. All other images were taken at 48-hour intervals through 10 days of culture.
2.4.2 Monitoring activation of ERK in HepG2 cells
We demonstrate functionality of the ERK reporter construct through induction of the MEK/ERK pathway. Stimulation of cells containing the reporter plasmid with inducers such as phorbol 12-myristate 13-acetate (PMA), epidermal growth factor (EGF), and interleukin-6 (IL-6) will lead to activation of ERK and phosphorylation of downstream targets such as Elk1 [9]. In this example, we use IL-6 as it is a well-known activator of the MAPK pathway.
1. Prepare cell culture coverslips (Fisher Scientific, Rochester, New York) by coating with 2 mL of PBS supplemented with fibronectin (10 μg/mL) per well. Incubate the coverslips at 37°C for 1 hour. Aspirate the PBS/fibronectin, gently wash once with media, and set aside.
2. Seed ~8 × 105 HepG2 cells per well. Seed two wells per construct to be transfected.
3. Grow the HepG2 cells on the coverslips overnight. Transfection should be performed using cells that are 70% to 80% confluent.
4. Replenish the medium in the coverslip wells with 1 mL of fully supplemented medium 1 hour prior to transfection.
5. For each construct (pCyPet-YPet-Chimera, pCyPet-WWElk1-YPet, pCyPet-S, and pYPet-S), mix 1.5 μg of plasmid into 100 μL of serum and antibiotic free medium, using eight separate tubes in total (two per construct).
6. Gently mix the GenJet HepG2 transfection reagent (or any other commercial transfection reagent) prior to pipetting. To eight additional tubes containing 100 μL serum and antibiotic free medium, add 4.5 μL GenJet reagent and mix gently by flicking the tubes several times.
7. Immediately add the 100 μL of medium containing the GenJet reagent to the 100 μL of medium containing the plasmids and form the transfection complex by incubating the DNA-GenJet mixture at room temperature for 15 minutes.
8. Add the transfection complex dropwise to the cells and gently rock the plate to uniformly disperse the transfection complex. Return the plate to the incubator and incubate for 12 to 18 hours before replacing the media with fresh medium not containing the transfection complex.
9. Continue to grow the cells for 18 to 24 hours before stimulation of ERK with IL-6.
10. Stimulate the cells with IL-6. The stimulation time and concentration may vary based on the cell line being used; in our example, we used 100 ng/mL of IL-6.
11. Place the slide chambers on the stage of a Zeiss Axiovert 200M inverted microscope equipped with two CoolSNAP cameras and a 60X water immersion objective lens.
12. Collect FRET data for pCyPet-S (CFP alone), pYPet-S (YFP alone), pCyPet-WWElk1-YPet (+IL-6), and pCyPet-WWElk1-YPet (−IL-6) using three channels: the CFP (donor) channel, the YFP (acceptor) channel, and the FRET channel. The CFP and YFP channels are configured to use the respective excitation and emission filters (400 nm excitation and 470 nm emission for CFP, 480 nm excitation and 530 nm emission for YFP), while the FRET channel is configured to use CFP excitation and YFP emission.
13. Determine the extent of bleed-over of the donor signal into the FRET channel by measuring the FRET channel signal of the pCyPet-S (CFP only) transfected cells. Similarly, calculate the bleed-over of the acceptor signal by measuring the FRET channel signal of the pYPet-S (YFP only) transfected cells.
14. Calculate the corrected FRET (FRETc) using the following equation (a short calculation sketch follows this list):
FRETc = FRET − (Df/Dd)[CFP] − (Df/Da)[YFP]
where FRET, [CFP], and [YFP] are the signals visualized through the FRET, CFP, and YFP filter sets, respectively. The constants Df/Dd and Df/Da are the bleed-through constants describing donor emission visible in the FRET channel and direct excitation of the acceptor, respectively [18].
Figure 2.6 shows the FRETc signal in control and IL-6-stimulated HepG2 cells after 2 hours of stimulation. Based on measurements from approximately 150 cells, it is evident that IL-6 stimulation induces a statistically significant change in ERK activation in HepG2 cells.
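The correction in step 14 operates on the mean channel intensities measured for each cell. The MATLAB sketch below illustrates the calculation for a single cell; all numerical values and variable names are hypothetical placeholders, and the bleed-through constants must be determined from your own pCyPet-S and pYPet-S control measurements as described in step 13.

% Bleed-through-corrected FRET (FRETc) for one cell, following steps 13 and 14.
% All numerical values below are hypothetical placeholders.
donorBleed    = 0.45;    % Df/Dd: FRET-channel / CFP-channel signal from pCyPet-S (CFP only) cells
acceptorBleed = 0.12;    % Df/Da: FRET-channel / YFP-channel signal from pYPet-S (YFP only) cells

FRETraw = 850;           % FRET channel signal for the cell (CFP excitation / YFP emission)
CFP     = 1200;          % donor channel signal
YFP     = 900;           % acceptor channel signal

FRETc = FRETraw - donorBleed*CFP - acceptorBleed*YFP;
fprintf('Corrected FRET signal: %.1f (arbitrary units)\n', FRETc);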
Figure 2.6 Activation of ERK in HepG2 cells by IL-6. HepG2 cells transfected with the ERK reporter construct were stimulated with 100 ng/mL of IL-6 for 2 hours. Data shown are average of 150 single-cell measurements captured using the FRET channel (400 nm excitation and 530 nm emission) from two independent cultures. * indicates statistical significance at p < 0.001.
2.5 Data Acquisition, Anticipated Results, and Interpretation
For the TF reporter, the hormonal cues added to the adipocyte differentiation media (insulin, IBMX, T3, and dexamethasone) are "inputs" to which the adipocytes respond by initiating activation of the TF being investigated. The "output" of the system is the induction of fluorescence through activation of PPARγ at different stages of adipocyte differentiation. We expect PPARγ to be activated differently during adipocyte differentiation, leading to different levels of fluorescence at different stages of differentiation (Figure 2.5). The fluorescence data will be interpreted based on the difference in the magnitude of activation or down-regulation between different time points, as well as between any time point and the initial fluorescence value.
For the FRET-based ERK reporter, addition of a specific ligand such as IL-6 serves as the input for signaling through the MAPK signal transduction pathway, leading to phosphorylation of ERK1/2 and activation of downstream transcription factors. Activation of ERK will be evaluated relative to the signal observed with the CyPet-YPet chimera (the FRET construct without the flexible linker) and to that observed in the presence of MAPK activity inhibitors (e.g., U0126). In general, for any system, we expect the temporal changes in fluorescence to be correlated with the temporal activation profile of the regulatory molecule being investigated; when profiling activation of multiple TFs, the data need to be interpreted based on the TF activated at each time point and/or the relative magnitudes of activation.
2.6 Discussion and Commentary
1. Transcription factor binding elements can be designed in several configurations. In the PPAR example, three tandem binding sites were used with a single nucleotide spacer between each. Since the dynamics of the tertiary conformation induced by binding of a TF to its response element are not fully understood, it is advisable to use two to four tandem repeats to mitigate any inhibitory effects induced in the promoter region by the binding of the TF. Similarly, the nucleotide spacer, which works to increase the efficiency of binding, can be increased or decreased. More spacing between the tandem repeats may be needed when the design uses fewer binding sites and vice versa.
2. PCR can be done with any commercially available kit that uses a high fidelity proofreading polymerase. We have had success with Promega's GoTaq PCR kit and have seen minimal errors in the amplified sequence.
3. The linker region should be developed with several considerations in mind. First, it needs to be composed of amino acids which will provide flexibility to the construct. This is important because the acceptor and donor fluorescent proteins must come into close contact with each other upon kinase phosphorylation. Second, when developing the synthetic oligonucleotides, one or two bases may need to be added to keep the FRET protein "in-frame." For example, the 3' end of CyPet (the upstream protein of the construct) will have a mutated stop codon. As a result, the restriction enzyme recognition sequence (e.g., GGA TCC for BamHI) will be translated as two additional codons followed by reading of the synthetic linker region. Therefore, it is critical that the sequence of the synthetic linker region be designed such that the codons are always in frame.
4. Annealing of the synthetic oligonucleotides can be done in the early stages of PCR using any standard PCR kit. Supplementation with dNTPs allows the polymerase to fill in and join the overlapping portions of the linker that need to be combined. Annealing and combining of the synthetic oligonucleotides may require the majority of the reaction time, so additional rounds of PCR amplification may need to be performed.
5. It is important to remove all media from the pellet because salt and serum contamination interferes with plasmid electroporation. Also, the volume to be used for electroporation is crucial.
6. Complete growth medium may adversely affect the efficiency of transfection. If this method is used and low transfection efficiency is observed, reconstituting the pellet in DMEM without serum and without antibiotics can increase the efficiency. Additionally, there are commercially available electroporation buffers (e.g., hypoosmolar buffer and iso-osmolar buffer) that are ideal for reconstitution.
7. The optimal voltage and capacitance values for each cell type need to be determined. The time constant for an exponential decay pulse should be around 48 ms; if it is less than 30 ms or greater than 60 ms, the electroporation is likely to have failed (see the calculation sketch following this list). Salt concentration, cell number, plasmid purity, and media composition can all have an effect on electroporation efficiency. The values reported here are specific to the GenePulser XCell unit (Bio-Rad, Hercules, California); however, similar trends will be seen on any other commercially available electroporation unit.
Troubleshooting Table
Problem | Explanation | Potential Solutions
No significant FRET | Inappropriate design | Optimize linker length and composition
Low transfection efficiency | Serum or antibiotic present | Remove serum and antibiotics from the transfection mix
False positive clones | G418 concentration too low | Determine the lowest concentration of G418 that kills all cells of a negative control
8. Geneticin sulfate salt (G418) is used as the selection antibiotic for the electroporated cells. Unfortunately, each G418 lot contains significant differences in potency. Additionally, some cell types are more susceptible to G418 than others. Therefore, a kill curve is recommended when switching from one lot of G418 to another and when the cell type is changed. If a kill curve is not performed, false positive clones may result, or all cells may die, even those which have successfully incorporated the plasmid.
9. Clonal screening must be done prior to experimentation because not all clones will exhibit the same response when stimulated. The site of plasmid incorporation and the number of integrated copies play a major role in the responsiveness of the reporter clone. A high level of background can be seen in clones which incorporated the plasmid in a region of a chromosome which is highly active.
10. For high electroporation efficiencies, cells plated in a 100-mm dish will result in multiple colonies after selection. Therefore, cells should be grown such that individual colonies can be isolated. If the size of the isolated colony is small, cells should be passed to a single well of a 48-well plate and not a 24-well plate.
11. It is advised that no more than three to six clones be screened at any one time, as the possibility of cross-contamination increases significantly. Screening more than one TF construct at a time is also not advised, as cross-contamination between constructs cannot be easily determined. Generally speaking, an endpoint analysis should be sufficient for screening the clones. The analysis should include fluorescent images of the initial and final time points so that image analysis can be done to determine the change in fluorescence from the initial time point to the final time point.
12. The medium used for this portion of the experiment should not contain phenol red, as phenol red interferes with fluorescence imaging.
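For an exponential decay pulse, the time constant discussed in note 7 is set by the product of the capacitance and the resistance of the cuvette contents (τ = RC). The short MATLAB sketch below uses this standard relation to show what the reported 48-ms target implies about the sample resistance; it is provided only as a sanity check and is not part of the original protocol.

% Implied sample resistance for an exponential decay pulse, tau = R*C,
% using the settings from Section 2.4.1.1, step 7.
C_farad = 950e-6;                 % capacitance setting (950 uF)
tau_s   = 48e-3;                  % target time constant (~48 ms)
R_ohm   = tau_s / C_farad;        % implied resistance of the cuvette contents
fprintf('Implied sample resistance: %.1f ohm\n', R_ohm);

% The acceptable 30-60 ms window corresponds approximately to:
R_range = [30e-3 60e-3] ./ C_farad;
fprintf('Acceptable resistance range: %.1f to %.1f ohm\n', R_range);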
2.7 Summary Points
The methods detailed in this chapter describe the development of:
1. GFP-based reporter plasmids for dynamically monitoring transcription factor activation;
2. FRET-based reporter plasmids for dynamically monitoring kinase activity;
3. Stable reporter cell lines for dynamic profiling of TF activation.
Acknowledgments
This work was supported by grants from the National Science Foundation (CBET0651864) and the American Heart Association (AY0755112Y). The authors wish to thank Professor Robert Burghardt and Dr. Roula Barhoumi Mouneimne for help with FRET imaging and analysis.
References
[1] Elnitski, L., V.X. Jin, P.J. Farnham, and S.J. Jones, "Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques," Genome Res., Vol. 16, 2006, pp. 1455–1464.
[2] King, K.R., S. Wang, A. Jayaraman, M.L. Yarmush, and M. Toner, "Microfluidic flow-encoded switching for parallel control of dynamic cellular microenvironments," Lab Chip, Vol. 8, 2008, pp. 107–116.
[3] Lu, P.J., X.Z. Zhou, M. Shen, and K.P. Lu, "Function of WW domains as phosphoserine- or phosphothreonine-binding modules," Science, Vol. 283, 1999, pp. 1325–1328.
[4] Thompson, D.M., K.R. King, K.J. Wieder, M. Toner, M.L. Yarmush, and A. Jayaraman, "Dynamic gene expression profiling using a microfabricated living cell array," Anal. Chem., Vol. 76, 2004, pp. 4098–4103.
[5] Wieder, K.J., K.R. King, D.M. Thompson, C. Zia, M.L. Yarmush, and A. Jayaraman, "Optimization of reporter cells for expression profiling in a microfluidic device," Biomed. Microdevices, Vol. 7, 2005, pp. 213–222.
[6] Violin, J.D., J. Zhang, R.Y. Tsien, and A.C. Newton, "A genetically encoded fluorescent reporter reveals oscillatory phosphorylation by protein kinase C," J. Cell. Biol., Vol. 161, 2003, pp. 899–909.
[7] Zhang, J., Y. Ma, S.S. Taylor, and R.Y. Tsien, "Genetically encoded reporters of protein kinase A activity reveal impact of substrate tethering," Proc. Natl. Acad. Sci. USA, Vol. 98, 2001, pp. 14997–15002.
[8] Ni, Q., D.V. Titov, and J. Zhang, "Analyzing protein kinase dynamics in living cells with FRET reporters," Methods, Vol. 40, 2006, pp. 279–286.
[9] Davis, R.J., "Transcriptional regulation by MAP kinases," Mol. Reprod. Dev., Vol. 42, 1995, pp. 459–467.
[10] King, K.R., S. Wang, D. Irimia, A. Jayaraman, M. Toner, and M.L. Yarmush, "A high-throughput microfluidic real-time gene expression living cell array," Lab Chip, Vol. 7, 2007, pp. 77–85.
[11] Fantz, D.A., D. Jacobs, D. Glossip, and K. Kornfeld, "Docking sites on substrate proteins direct extracellular signal-regulated kinase to phosphorylate specific residues," J. Biol. Chem., Vol. 276, 2001, pp. 27256–27265.
[12] Fu, H., R.R. Subramanian, and S.C. Masters, "14-3-3 proteins: structure, function, and regulation," Annu. Rev. Pharmacol. Toxicol., Vol. 40, 2000, pp. 617–647.
[13] Durocher, D., J. Henckel, A.R. Fersht, and S.P. Jackson, "The FHA domain is a modular phosphopeptide recognition motif," Mol. Cell., Vol. 4, 1999, pp. 387–394.
[14] Nguyen, A.W., and P.S. Daugherty, "Evolutionary optimization of fluorescent proteins for intracellular FRET," Nat. Biotechnol., Vol. 23, 2005, pp. 355–360.
[15] Verdecia, M.A., M.E. Bowman, K.P. Lu, T. Hunter, and J.P. Noel, "Structural basis for phosphoserine-proline recognition by group IV WW domains," Nat. Struct. Biol., Vol. 7, 2000, pp. 639–643.
[16] Shao, D., and M.A. Lazar, "Peroxisome proliferator activated receptor γ, CCAAT/enhancer-binding protein α, and cell cycle status regulate the commitment to adipocyte differentiation," J. Biol. Chem., Vol. 272, 1997, pp. 21473–21478.
[17] Dumasia, R., et al., "Role of PPAR-gamma agonist thiazolidinediones in treatment of pre-diabetic and diabetic individuals: a cardiovascular perspective," Curr. Drug Targets Cardiovasc. Haematol. Disord., Vol. 5, 2005, pp. 377–386.
[18] Sorkin, A., et al., "Interaction of EGF receptor and grb2 in living cells visualized by fluorescence resonance energy transfer," Curr. Biol., Vol. 10, 2000, pp. 1395–1398.
CHAPTER 3
Comparison of Algorithms for Analyzing Fluorescent Microscopy Images and Computation of Transcription Factor Profiles
Zuyi Huang and Juergen Hahn*
*Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, Texas 77843-3122, e-mail: [email protected]
Abstract
Obtaining quantitative data about protein concentrations is an important component of systems biology; however, only a few options for generating such data exist. One of these is to use green fluorescent protein (GFP) reporter systems as an indicator of protein concentration; however, the measurements consist of a series of fluorescent microscopy images that need to be analyzed to derive time-dependent quantitative data. This chapter presents two techniques for extracting such data from fluorescent microscopy images. The first technique uses wavelets to sharpen the image contrast between cells and the background and a bidirectional search to identify the cell regions. The second technique is based on K-means clustering and uses principal component analysis (PCA). A comparison of these two methods is made in which the dynamics of NF-κB in the TNF-α signaling pathway is investigated.
Key terms: Green fluorescent protein (GFP) reporter systems; Fluorescent microscopy images; Image analysis; Inverse problem; Transcription factor profiles; Principal component analysis (PCA); K-means clustering; Wavelet; TNF-α signaling pathway; Mathematical modeling
3.1 Introduction
Signal transduction plays a key role in systems biology, as signal transduction pathways are responsible for relaying cellular information and are involved in the regulation of cellular responses. An understanding of signal transduction mechanisms offers the potential for improved treatment options for diseases; for example, abnormalities of the Jak/STAT signaling pathway have been linked to colon cancer [1] and abnormalities of MAPK signaling have been associated with gastric cancer [2]. One possibility for developing an understanding of the dynamics of signal transduction pathways is the derivation of models describing the pathways. However, deriving an accurate signal transduction pathway model is nontrivial, as the mechanisms tend to involve many components and the system will have a large degree of uncertainty in both its structure and its parameter values (for some examples of signal transduction pathways, see the models in [3–5]). Validation and refinement of any model are crucial steps for modeling signal transduction pathways; however, these steps can only be undertaken if experimental data are available or can easily be derived.
One popular approach for collecting experimental data for signal transduction pathways involves Western blotting (e.g., in [6, 7]). While performing a Western blot is a relatively simple experiment, it does have the drawbacks that: (1) Western blotting is a destructive measurement technique, and (2) the data are semi-quantitative in nature [8, 9]. The first drawback poses a problem for the use of Western blots in experiments where a time series of the concentration profile of a particular protein is to be measured, while the latter results from the limitation of the technique itself (i.e., it is not always possible to determine "how black a Western blot is" and to what protein concentration this level of color corresponds).
One promising approach for taking dynamic measurements is to use a green fluorescent protein (GFP) reporter system [10, 11]. This method is based upon the idea that expression of certain genes will also result in the formation of GFP for a cell line that has been modified accordingly. It is then possible to take fluorescent microscopy images which show the fluorescence of the cells, where the degree of fluorescence can be correlated with the concentration of the transcription factor that is present in the nucleus of the cells. Compared with the data from Western blots, the fluorescence intensity profile provides more easily quantifiable data, which can be used to validate or update a mathematical model. GFP reporter systems have been extensively used in clone isolation [12], identification and detection of promoter activity [13–15], the assessment of gene transfer and expression [16, 17], and the study of hepatitis B virus replication [18], to name just a few applications. However, using fluorescent microscopy images of GFP reporter cells for quantification is a relatively new approach [19]. An automated image analysis procedure to identify the GFP localization regions with standard MATLAB commands has been presented in [20]; however, that procedure only determines regions of fluorescence and does not provide quantitative data about the fluorescence intensity.
Analyzing fluorescent microscopy images to obtain quantitative information is not a trivial task for several reasons: (1) not all cells will express GFP; (2) the fluorescence seen in images can vary over time due to fluctuations occurring during the measurement process as well as other cellular functions; and (3) some of the fluorescence seen in the images may be an artifact of the image. Image analysis algorithms are required in order to address these points. Accordingly, developing algorithms for analyzing fluorescent microscopy images of GFP reporter cells is an important step for obtaining quantitative data of protein concentrations in signal transduction pathways.
In this regard, two image analysis methods are presented in this work. The goal of these algorithms is to determine which areas of an image represent cells where fluorescence can be seen and to quantify the amount of fluorescence in a second step. The first method uses wavelets for sharpening image contrast and then searches through the individual pixels of an image in two directions to determine if a pixel corresponds to a cell, the background, or to an artifact of the image and/or measurement. The second technique is based on K-means clustering and PCA. A comparison of these two techniques is provided in which a series of images of hepatocytes stimulated with three different concentrations of TNF-α has been analyzed. As the fluorescence intensity is only an indicator of the amount of GFP present in a cell, the data is further analyzed in order to determine the concentration of the transcription factor responsible for transcription of the RNA containing the code for GFP. Based on the dynamic data from these two image analysis algorithms, the NF-κB dynamics for different stimulation concentrations of TNF-α is obtained from a proposed NF-κB-GFP model.
3.2 Preliminaries
3.2.1 Principles of GFP reporter systems
In a GFP reporter system, a DNA fragment encoding GFP is inserted into the DNA downstream of the promoter of interest. Upon stimulation, the transcription factor (TF) is activated and translocates to the nucleus. The transcription factor then binds to the promoter region and initiates transcription of the downstream DNA, which includes the code for the green fluorescent protein, as this code has been previously inserted into the DNA. The GFP mRNA is then translated into GFP which, after post-translational modification, produces the green fluorescence seen in fluorescent microscopy images. Figure 3.1 shows a simple illustration of the principles of a GFP reporter system. No fluorescence can be seen until the transcription factor binds to the promoter; however, fluorescence is easily visible after transcription factor activation. Stronger stimulation will lead to a larger concentration of transcription factor molecules in the nucleus, which, in turn, results in increased GFP expression and more fluorescence. The dynamics of the transcription factor can be indirectly measured by quantifying the time series of the fluorescence intensity seen in fluorescent microscopy images.
Figure 3.1 GFP reporter systems. The DNA response element (RE) to which the TF binds is upstream of a minimal promoter that controls GFP expression: (a) before the transcription factor binds to the promoter; and (b) after the transcription factor binds to the promoter.
3.2.2 Wavelets
Equation (3.1) shows the wavelet transformation:

W(a,b) = \int_{-\infty}^{+\infty} f(t) \, \frac{1}{\sqrt{a}} \, \psi^{*}\!\left(\frac{t-b}{a}\right) dt    (3.1)

where a is a real variable representing the scale or dilation, b is a real variable representing the time shift or translation, ψ(t) is a wavelet function [e.g., ψ(t) = e^{-t^2} cos(π t √(2/ln 2)) for a Morlet wavelet], * denotes the complex conjugate operator, and f(t) is the processed signal. Wavelet transforms can be considered as a collection of inner products of f(t) and ψ_{a,0}(t − b) at a and b; that is, W(a,b) = ⟨f(t), ψ_{a,0}(t − b)⟩. The values of W(a,b) for different a provide the frequency domain information, while the values of W(a,b) for different b provide time domain information. The availability of both frequency domain and time domain information makes wavelet transformation particularly attractive for denoising of data [20, 21].
The principle of wavelet denoising is that wavelet transforms can decompose an image into multiple scales by dilation and compression and then remove the noise at multiple scales by thresholding [22]. The general procedure for denoising via wavelets consists of three steps [23]: (1) decompose an image into N levels via a wavelet transform, where N is an integer chosen by experience; (2) calculate a threshold [24] and compare the high frequency components of each level from 1 to N to this threshold; (3) reconstruct the image by using the low frequency components of level N and the modified high frequency components of levels 1 to N.
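For a sampled signal, (3.1) can be evaluated numerically. The MATLAB sketch below is a minimal illustration using the Morlet-type wavelet quoted above; the sampling interval, the synthetic test signal, and the chosen scales are arbitrary values used only for demonstration.

% Direct numerical evaluation of the wavelet transform (3.1) for a sampled signal,
% using the Morlet-type wavelet psi(t) = exp(-t^2)*cos(pi*t*sqrt(2/log(2))).
dt  = 1e-3;                                    % sampling interval (arbitrary)
t   = 0:dt:1;                                  % time axis
f   = sin(2*pi*10*t) + 0.3*randn(size(t));     % synthetic noisy signal

psi = @(x) exp(-x.^2) .* cos(pi*sqrt(2/log(2))*x);   % real-valued wavelet (conjugate = itself)

scales = [0.01 0.02 0.05 0.1];                 % example dilations a
shifts = t;                                    % translations b evaluated at every sample
W = zeros(numel(scales), numel(shifts));
for ia = 1:numel(scales)
    a = scales(ia);
    for ib = 1:numel(shifts)
        b = shifts(ib);
        % W(a,b) = integral of f(t) * (1/sqrt(a)) * psi((t-b)/a) dt
        W(ia, ib) = trapz(t, f .* (1/sqrt(a)) .* psi((t - b)/a));
    end
end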
3.2.3 K-means clustering
K-means clustering is a method for identifying patterns in data and for dividing data into k disjoint clusters [25]. The principle of K-means clustering is to minimize the objective function shown in (3.2) by determining centroids for each of the k clusters:

\min_{\mu} f = \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^{2}    (3.2)
where Si, i = 1, 2, …, k, represents all points belonging to the ith cluster, μi is the centroid of all the points xj ∈ Si, and μ is the collection of all the centroids. μi is calculated by (3.3).
\mu_i = \frac{\sum_{x_j \in S_i} x_j}{N_i}    (3.3)

where Ni is the total number of data points in cluster Si. The procedure to perform K-means clustering consists of the following steps:
1. The initial centroids μi, i = 1, 2, …, k, for the k clusters are assigned or randomly sampled from the data points.
2. Each data point xj is assigned to a cluster m. This decision is made by determining the smallest value ||xj − μm||² among all possible values ||xj − μi||², i = 1, 2, …, k.
3. The function f from (3.2) is evaluated by computing the sum of the distances for all data points as well as for all clusters.
4. Equation (3.3) is used to update the centroid of each cluster by averaging the data points of the corresponding cluster.
5. Steps 2 through 4 are repeated iteratively until the relative change in the objective function f between iterations is less than a certain threshold.
This iterative refinement procedure is known as Lloyd's algorithm [26, 27]. A key consideration for K-means clustering is the selection of the initial centroids for the k clusters. A proper choice of the initial centroids will make the clustering algorithm converge faster to the optimal solution.
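Steps 1 through 5 map directly onto a few lines of code. The MATLAB sketch below is a minimal implementation of Lloyd's algorithm on synthetic two-dimensional data; the data, the number of clusters, and the convergence tolerance are arbitrary choices for illustration (the built-in kmeans function can of course be used instead).

% Minimal implementation of steps 1-5 (Lloyd's algorithm) on synthetic 2-D data.
rng(0);
X  = [randn(100,2); randn(100,2) + 4];     % two synthetic clusters
k  = 2;
mu = X(randperm(size(X,1), k), :);         % step 1: initial centroids sampled from the data

fPrev = inf;
for iter = 1:100
    % Step 2: assign each data point to the nearest centroid
    d = zeros(size(X,1), k);
    for i = 1:k
        d(:,i) = sum((X - mu(i,:)).^2, 2);
    end
    [dmin, idx] = min(d, [], 2);

    % Step 3: evaluate the objective function (3.2)
    f = sum(dmin);

    % Step 4: update each centroid as the mean of its cluster, (3.3)
    for i = 1:k
        mu(i,:) = mean(X(idx == i, :), 1);
    end

    % Step 5: stop when the relative change in f is below a tolerance
    if abs(fPrev - f)/f < 1e-6
        break;
    end
    fPrev = f;
end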
3.2.4 Principal component analysis
Principal component analysis (PCA) [28] is a well-established technique for identifying multivariable patterns in data. A data matrix X can be decomposed as follows using PCA:

X = TP^T + E    (3.4)
where T is the score matrix, P is the loading matrix, and E is the residual between the actual data and the reconstruction by PCA. The columns of P represent principal components of the data matrix, while the columns of T are the projections of the data matrix onto the principal components [29]. The motivation for using PCA for image analysis comes from the work presented in [30, 31], which shows that clusters in a score plot from PCA are associated with features of an image. Furthermore, combining K-means clustering and PCA has been widely studied for clustering [32, 33].
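The decomposition in (3.4) can be obtained from a singular value decomposition of the mean-centered data. The MATLAB sketch below illustrates this for a pixel matrix of the form defined in Section 3.2.5; random numbers are used here as a stand-in for actual image data.

% PCA of a pixel matrix X (rows = pixels, columns = red/green/blue values)
% via singular value decomposition of the mean-centered data.
X  = rand(500, 3) * 255;        % stand-in for the (i*j) x 3 pixel matrix
Xc = X - mean(X, 1);            % mean-center each color channel

[U, Sv, P] = svd(Xc, 'econ');   % columns of P are the loadings (principal components)
T = U * Sv;                     % scores: projections of the data onto the components
E = Xc - T*P';                  % residual; essentially zero when all components are kept

% Rank-1 reconstruction using only the first principal component
approx1 = T(:,1) * P(:,1)';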
3.2.5 Mathematical description of digital images and image analysis
The tri-stimulus theory states that any visual color can be represented by overlaying three color information channels. For television and computer graphics, the standard colors used are red, green, and blue [30]. An RGB image can be represented by a three-dimensional tensor

M = \begin{bmatrix} (r,g,b)_{11} & \cdots & (r,g,b)_{1j} \\ \vdots & \ddots & \vdots \\ (r,g,b)_{i1} & \cdots & (r,g,b)_{ij} \end{bmatrix}    (3.5)

where M is of size i × j × 3. i × j is the resolution of the image, which means that there are i rows of pixels in the image and each row has j columns of pixels. Each pixel of the image has three intensity values (i.e., one each for red, green, and blue). M can be rewritten as a two-dimensional matrix X of size (i × j) × 3, as shown in (3.6), by listing the three intensity values of each pixel in a row, such that each row of X represents the red, green, and blue values of a pixel:

X = \begin{bmatrix} r_1 & g_1 & b_1 \\ r_2 & g_2 & b_2 \\ \vdots & \vdots & \vdots \\ r_{i \times j} & g_{i \times j} & b_{i \times j} \end{bmatrix}    (3.6)

The intensity of each pixel is defined as the sum of its red, green, and blue values:

I = r + g + b    (3.7)

Another option is to use the tensor M and calculate the intensity of each pixel according to (3.7). This results in the RGB image M being transformed into a two-dimensional gray image by replacing each pixel by its intensity:

N = \begin{bmatrix} (r+g+b)_{11} & \cdots & (r+g+b)_{1j} \\ \vdots & \ddots & \vdots \\ (r+g+b)_{i1} & \cdots & (r+g+b)_{ij} \end{bmatrix}    (3.8)
where N is of dimension i × j. Image analysis extracts information from a time-series of images, represented by the M matrices recorded at different points in time. A common image analysis procedure is to: (1) record images at different time points, (2) separately analyze the images, and (3) combine the image analysis results for different time points to determine dynamics of the system. One popular application of such an image analysis procedure is the study of MRI intensity patterns [34].
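The conversions between M, X, and N in (3.5) through (3.8) correspond to a few array operations. The MATLAB sketch below illustrates them; the image file name is a hypothetical placeholder.

% Build the pixel matrix X (3.6) and the intensity image N (3.8) from an RGB
% image M (3.5). The file name is a hypothetical placeholder.
M = double(imread('gfp_image.tif'));    % i x j x 3 RGB tensor
[nRows, nCols, ~] = size(M);

X = reshape(M, nRows*nCols, 3);         % one row per pixel: [r g b]  (3.6)
I = sum(X, 2);                          % per-pixel intensity I = r + g + b  (3.7)
N = reshape(I, nRows, nCols);           % two-dimensional gray image  (3.8)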
3.3 Methods
The goal of the analysis of fluorescent microscopy images for the purpose of this work is to determine which areas of an image represent cells where fluorescence can be seen and then to quantify the average amount of fluorescence over these cells. It is an important aspect of this analysis to distinguish between cells where fluorescence can be seen and the background, which can consist of regions without cells or regions where cells are not producing fluorescent proteins. This section presents two algorithms to determine the fluorescent cell regions in images obtained from fluorescent microscopy. In a next step, calculation of the fluorescence intensity of an image on the basis of the calculated fluorescent cell regions is discussed, and finally a comparison of the results returned by the two image analysis algorithms is presented.
3.3.1 Image analysis based on wavelets and a bidirectional search
The first method is based upon sharpening the contrast in the image using wavelets and performing a bidirectional search to determine bright regions of an image. Once the cell region has been determined, the average fluorescence intensity is computed from the original image. It is important to note here that the transformed image is only used for determining the area representing fluorescent cells and not for determining the average fluorescence, as the wavelet transform will affect the value of the observed fluorescence intensity.
The use of wavelets can increase the contrast of images by denoising. The effect of this is illustrated in Figure 3.2. While it is difficult to identify the fluorescent cell region from the original image shown in Figure 3.2(a), the image processed by wavelets clearly shows the cell region [Figure 3.2(b)]. Figure 3.2(c) compares the signals before and after wavelet denoising along the horizontal red line shown in Figure 3.2(b). It can be seen that the application of wavelets significantly reduces the noise and also sharpens the contrast of the image. A type of wavelet called coiflets is used in this work due to the type of contrast seen in the images to be analyzed.
Once the image contrast has been increased to the point where regions of fluorescent cells are clearly visible, a search algorithm that identifies these regions in the images can be applied. The key idea behind the bidirectional search is that if there are two or more points next to one another in the horizontal or the vertical direction whose intensities are higher than a threshold, then these points are classified as belonging to a fluorescent cell region. The threshold value is calculated on the basis of the largest value and the mean value of the intensities of the entire image:

THR = (Ma − Me)/k + Me,  Ma = max(N),  Me = mean(N)    (3.9)
where THR is the threshold, Ma is the maximum intensity found in the image, Me is the mean intensity over the entire image, and k is a constant which can be adjusted to take image contrast into account.

Figure 3.2 Improving contrast of images via wavelets: (a) original image; (b) processed image; and (c) intensity in one line of the image before and after processing.
Figure 3.3 Illustration of the bidirectional search: (a) horizontal search; and (b) vertical search.

Figure 3.3 illustrates the procedure of the bidirectional search. In this case, a horizontal search is performed first, followed by a search in the vertical direction. The algorithm moves from one pixel of the image to the next and determines if the pixel intensity is above the threshold. If two pixels in a row are found with intensities above the threshold, then these pixels are classified as belonging to a fluorescent cell region. This is illustrated in Figure 3.3(a), where the algorithm moves from one pixel of the image to the next in a horizontal direction. Once the pixel labeled "A" has been found to have a brightness above the threshold, the algorithm looks at the next horizontal pixel and
determines that the pixel labeled "B" also shows a fluorescence intensity above the threshold. Since both of these pixels are located next to one another, the region consisting of pixels "A" and "B" is considered a region representing a fluorescent cell. The other three pixels in the picture which have fluorescence intensities above the threshold are not found by the search in the horizontal direction. The reason for this is that these pixels do not have an adjacent pixel (in the horizontal direction) that is also above the threshold.
This search algorithm tries to distinguish between cells, which have to consist of many pixels at the chosen level of magnification, and other, smaller bright spots which may represent artifacts of the measurement technique. In a following step, the algorithm scans the pixels of the image in the vertical direction. A search in this direction identifies the region consisting of pixels "C" and "D" in Figure 3.3(b). While searching in this direction alone would have missed the cell region consisting of "A" and "B," the bidirectional search ensures that both the regions labeled "A-B" and "C-D" are detected. The pixel labeled "E" is not detected by this algorithm; however, this pixel does not represent a major region, as it only consists of one pixel with a fluorescence intensity above the threshold.
In practice, the images that have been analyzed have cell regions consisting of dozens to hundreds of adjacent pixels. Therefore, the assumption that two or more adjacent pixels are required for identifying part of an image as a cell is very reasonable. However, it is always recommended to combine the automated search routine of this algorithm with a visual inspection of the images derived not only from fluorescent microscopy [Figure 3.4(a)] but also from light microscopy. One illustrative result from the presented algorithm is shown in Figure 3.4(b), where the cell regions from Figure 3.4(a) that have above-average fluorescence intensity have been determined. The individual steps of the algorithm are shown in Figure 3.5.
Summarizing, the algorithm for the image analysis procedure based on wavelets and the bidirectional search can be described as follows:
1. The three-dimensional data matrix M, (3.5), of the fluorescent microscopy image is transformed into the two-dimensional matrix N, (3.8), as wavelet denoising algorithms are only available for one- or two-dimensional matrices.
Figure 3.4 Fluorescence regions determined by the bidirectional search: (a) processed fluorescent microscopy image; and (b) regions with fluorescence intensities above the threshold.
2. The wavelet "coif3" is used in a 4-level 2D wavelet decomposition with the MATLAB command "wavedec2". The command "wbmpen" is then used to calculate the threshold for 2D denoising. Finally, the denoised image Ndenoise is obtained by using the command "wdencmp" (a minimal MATLAB sketch of these commands follows this list).
3. The threshold for fluorescent cell regions is calculated from (3.9).
4. The bidirectional search algorithm in Figure 3.5 is implemented to obtain the fluorescent cell regions S from the denoised image Ndenoise.
5. On the basis of the fluorescent cell regions S and the original intensity matrix N, the fluorescence intensity for the GFP image is calculated. This step will be discussed in Section 3.3.3.
6. Steps 1 through 5 are implemented for each image of a time-series of images. The intensities for the images at different points in time construct the fluorescence intensity profile for the time-series of images.
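A minimal MATLAB sketch of steps 1 through 5 for a single image is given below (Wavelet Toolbox required). The image file name, the noise estimate passed to wbmpen, the penalty parameter, and the value of k in (3.9) are illustrative assumptions rather than values prescribed in this chapter.

% Steps 1-5 of Section 3.3.1 for a single image (MATLAB Wavelet Toolbox).
M = double(imread('gfp_image.tif'));            % hypothetical file name
N = sum(M, 3);                                  % step 1: intensity matrix (3.8)

% Step 2: 4-level 2D wavelet decomposition and denoising with 'coif3'
[C, S] = wavedec2(N, 4, 'coif3');
D1     = detcoef2('d', C, S, 1);                % finest diagonal details, used for a noise estimate
sigma  = median(abs(D1(:))) / 0.6745;           % assumed robust noise estimate
thrDen = wbmpen(C, S, sigma, 2);                % penalized threshold (penalty parameter 2 assumed)
Ndenoise = wdencmp('gbl', C, S, 'coif3', 4, thrDen, 's', 1);

% Step 3: fluorescent-region threshold from (3.9); k is an adjustable constant
k   = 3;                                        % illustrative value
THR = (max(Ndenoise(:)) - mean(Ndenoise(:)))/k + mean(Ndenoise(:));

% Step 4: bidirectional search -- keep pixels above THR that have an
% above-threshold neighbor in the horizontal or vertical direction
bright = Ndenoise > THR;
left   = [false(size(bright,1),1), bright(:,1:end-1)];
right  = [bright(:,2:end), false(size(bright,1),1)];
up     = [false(1,size(bright,2)); bright(1:end-1,:)];
down   = [bright(2:end,:); false(1,size(bright,2))];
cellRegion = bright & (left | right | up | down);

% Step 5: average fluorescence computed from the ORIGINAL intensity matrix N
cellIntensity = mean(N(cellRegion)) - mean(N(~cellRegion));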
3.3.2 Image analysis based on K-means clustering and PCA
Another option for image analysis is to use a procedure based upon K-means clustering and PCA to group pixels of an image with similar brightness. PCA can indicate the variation of a cluster by calculating the distance from a pixel to the first principal component, as illustrated in Figure 3.6. The image analysis procedure based upon K-means clustering and PCA is described in the following. In a first step, PCA is used to divide the pixels of the image into two clusters. The centroids of these two clusters are used as the initial centroid values for K-means clustering, which then assigns each of the pixels of the image to one of the two clusters. PCA is used to determine the cluster with higher variability, which is divided in the next step. PCA is used again to compute the initial centroids of the three clusters for K-means clustering, which assigns the pixels of the image to one of these three clusters. The procedure is repeated until a sufficient number of clusters is obtained. For the images investigated in this work, it was found that six clusters are sufficient to make a distinction between fluorescent cells and image background. The first few clusters with higher fluorescence intensity are considered to represent fluorescent cells, while the remaining ones represent the background. Summarizing, image analysis based upon K-means clustering and PCA is described by the following steps.
Figure 3.5 Bidirectional search algorithm.
Figure 3.6 Principal component analysis applied to images to determine pixels with similar features. (Pixels are plotted according to their red, green, and blue intensities, with the first principal component, PC 1, indicated.)
1. The RGB image can be brought into the form of X, shown in (3.6).
2. The algorithm based on PCA and K-means clustering is implemented to determine the fluorescent cell regions S. The details of this algorithm are shown in Figure 3.7.
3. The fluorescence intensity is calculated based upon the fluorescent cell regions S and the original intensity matrix N. The exact procedure for this calculation is discussed in Section 3.3.3.
4. Steps 1 through 3 are implemented for each image. The fluorescence intensity profile for the time-series of images is computed by combining the fluorescence intensities for the individual images into a vector.

An example of the results from this procedure is shown in Figure 3.8. Six clusters representing different fluorescence intensity levels are calculated and the first four clusters with higher fluorescence intensity are considered as the fluorescent cell regions.
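A minimal MATLAB sketch of this procedure is given below. It follows the iterative splitting described above (split the cluster with the largest first-principal-component variance, then refine all assignments with kmeans, until six clusters are obtained) but is a simplified stand-in for the exact algorithm of Figure 3.7; the file name, variable names, and the choice of the four brightest clusters are illustrative.

```matlab
% Minimal sketch of image analysis based on K-means clustering and PCA.
% At each pass the cluster with the highest first-PC variance is split,
% and all assignments are refined with kmeans.
M = imread('gfp_frame.tif');
X = double(reshape(M, [], 3));            % pixels-by-RGB matrix, as in (3.6)

idx = ones(size(X,1), 1);                 % start with a single cluster
nClusters = 1;
while nClusters < 6
    % find the cluster with the highest variability along its first PC
    pcVar = zeros(nClusters, 1);
    for k = 1:nClusters
        [~, ~, latent] = pca(X(idx == k, :));
        pcVar(k) = latent(1);
    end
    [~, kSplit] = max(pcVar);

    % split that cluster into two halves along its first PC
    Xk = X(idx == kSplit, :);
    [~, score] = pca(Xk);
    half = score(:,1) > median(score(:,1));
    cNew = [mean(Xk(~half,:), 1); mean(Xk(half,:), 1)];

    % refine all assignments with K-means, using the old centroids plus
    % the two new ones as initial values
    centroids = zeros(nClusters + 1, 3);
    for k = 1:nClusters
        centroids(k,:) = mean(X(idx == k, :), 1);
    end
    centroids([kSplit, nClusters+1], :) = cNew;
    idx = kmeans(X, nClusters + 1, 'Start', centroids);
    nClusters = nClusters + 1;
end

% take the brightest clusters as the fluorescent cell regions S
brightness = arrayfun(@(k) mean(mean(X(idx == k, :))), 1:nClusters);
[~, order] = sort(brightness, 'descend');
S = reshape(ismember(idx, order(1:4)), size(M,1), size(M,2));
```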
3.3.3 Determining fluorescence intensity of an image
The bidirectional search or the search algorithm based on PCA and K-means clustering only determines the regions of an image corresponding to fluorescent cells and not the fluorescence intensity. The fluorescence intensity is computed from the original images by the following formula:

$$I = \left( \frac{1}{N_f}\sum_{k=1}^{N_f} I_{f,k} \;-\; \frac{1}{N_b}\sum_{k=1}^{N_b} I_{b,k} \right)_{\text{stimulation}} - \left( \frac{1}{N_f}\sum_{k=1}^{N_f} I_{f,k} \;-\; \frac{1}{N_b}\sum_{k=1}^{N_b} I_{b,k} \right)_{\text{control}} \qquad (3.10)$$
where If,k refers to the fluorescence intensity of the kth pixel in the fluorescent cell region, Ib,k refers to the fluorescence intensity of the kth pixel in the background region, Nf is the total number of the pixels in the fluorescent cell regions, Nb is the total number of the pixels in the background regions, ( )stimulation refers to the intensity for the image with the stimulation, and ( )control refers to the intensity of images of negative control experiments (i.e., experiments where no stimulation is applied). The reason for subtracting the background intensity is to reduce measurement noise due to brightness variation.
Figure 3.7 Image analysis based on K-means clustering and PCA.
Figure 3.8 Fluorescent cell regions and clusters calculated by K-means clustering and PCA: (a) original image; (b) cluster 1; (c) cluster 2; (d) cluster 3; (e) cluster 4; (f) cluster 5; (g) cluster 6; and (h) combination of clusters 1, 2, 3, and 4.
The reason for subtracting the intensity of (negative) control experiments is to reduce other effects that can cause fluorescence. If no significant changes are seen in the control experiments, then it can be concluded that the changes in the fluorescence intensity are due to stimulation, and it is not required to subtract the control term, as its main effect on the measurements will be in the form of a small noise term.
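Assuming the logical masks of the fluorescent cell regions and the original intensity matrices are available for a stimulated image and a control image, (3.10) reduces to a few lines of MATLAB. The variable and helper names below are illustrative, and the background is taken here as the complement of the cell regions.

```matlab
% Minimal sketch of (3.10). Sstim/Scontrol are logical cell-region masks,
% Nstim/Ncontrol the corresponding original intensity matrices; the
% background is taken as the complement of the cell regions (assumption).
meanDiff = @(N, S) mean(N(S)) - mean(N(~S));   % mean cell minus mean background intensity
I = meanDiff(Nstim, Sstim) - meanDiff(Ncontrol, Scontrol);
```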
3.3.4 Comparison of the two image analysis procedures
Since the two image analysis techniques are based upon different concepts, it is warranted to compare the properties of the two algorithms. A summary of this comparison is shown in Table 3.1. The method based on wavelets and the bidirectional search processed the images faster in our investigations than the method based on K-means clustering and PCA. The drawback of the method based on wavelets and the bidirectional search is that it cannot provide information about different intensity levels, unless it is modified to look at different threshold values. However, this step can be complicated, as it is nontrivial to determine a good value for even one threshold when the images have low contrast. In comparison, the method based upon K-means clustering and PCA can provide information about different intensity levels. This information can be helpful, as it can be used to remove image artifacts such as artificially bright spots. From the comparison of these two methods, it can be concluded that the method based on K-means clustering and PCA is generally a better choice, as it can process lower-quality images and provides information about different intensity levels. However, this can come at the cost of an increased computational burden. The bidirectional search technique can be a viable alternative if image quality is good and fast processing times are important. Table 3.2 highlights which of the two methods may be the better choice in a given situation.
Table 3.1 Comparison of Two Image Analysis Techniques

Method: Image analysis based on wavelets and bidirectional search
  Advantages: Computationally inexpensive; for example, a movie with 42 images was processed in ~5 minutes on a desktop computer (Pentium 4 CPU, 2.8 GHz, 1 GB memory).
  Drawbacks: Cannot provide information about different intensity levels; search is performed in only two directions; threshold is sensitive to the quality of the images.

Method: Image analysis based on K-means clustering and PCA
  Advantages: Can provide information about different intensity levels; can be used to remove artificial bright spots in the image; suitable for any shape of cells, and also suitable for poor-quality images.
  Drawbacks: Can be computationally expensive; on average required about an order of magnitude more computation time than the bidirectional search.
Table 3.2 When to Use Which Image Analysis Method

Method: Image analysis based on wavelets and bidirectional search
  Should be used: if images with good contrast and large bright regions in the images are available; if a threshold value can be easily obtained; for quick evaluation.

Method: Image analysis based on K-means clustering and PCA
  Should be used: for a variety of images, even those with low contrast; when information about intensity levels can improve image analysis results.
3.4 Data Acquisition, Anticipated Results, and Interpretation

A fluorescence intensity profile can be computed by the techniques presented in Section 3.3. The fluorescence intensity can be assumed to be directly proportional to the concentration of green fluorescent protein. However, the purpose of using GFP reporter systems is to measure the transcription factor concentration and not the GFP concentration. Therefore, the transcription factor concentration needs to be computed from the fluorescence intensity profile. The first step of computing the transcription factor concentration from the fluorescence intensity profile consists of developing a dynamic model between these two quantities. An inverse problem can then be solved in a second step to actually compute the transcription factor concentration over time.
3.4.1 Developing a model describing the relationship between the transcription factor concentration and the observed fluorescence intensity

The dynamic model used in this work is based upon the model published by Subramanian and Srienc [10]; however, several modifications are made. Specifically:
• The amount of DNA remains constant in our work as the cells do not proliferate. This results in (3.11), where p represents the concentration of the DNA.
• No growth dilution terms need to be included in the model for either the GFP mRNA, m, balance (3.12), the nonfluorescent protein, n, balance (3.13), or the fluorescent protein, f, balance (3.14).
• The transcription rate needs to be modified so that it depends on the amount of activated transcription factor present in the nucleus. This change results in the Monod kinetics shown in (3.12), $S_m \frac{C_{NF\text{-}\kappa B}}{C + C_{NF\text{-}\kappa B}}\, p$, where $C_{NF\text{-}\kappa B}$ is the concentration of NF-κB, replacing the original term, which was solely based upon the amount of mRNA present. While it was sufficient for the original model to neglect the transcription factor concentration, this is not the case for the model developed here, as the transcription factor concentration is a crucial element of signal transduction and is regulated inside the cell.

The resulting model is given by (3.11) through (3.14):

$$\frac{dp}{dt} = 0 \qquad (3.11)$$
$$\frac{dm}{dt} = S_m \frac{C_{NF\text{-}\kappa B}}{C + C_{NF\text{-}\kappa B}}\, p - D_m m \qquad (3.12)$$

$$\frac{dn}{dt} = S_n m - D_n n - S_f n \qquad (3.13)$$

$$\frac{df}{dt} = S_f n - D_n f \qquad (3.14)$$
where Sm is a reaction constant describing the transcription rate with a value of 373 1/hr; Dm is a constant describing the mRNA degradation rate and is equal to 0.45 1/hr; Sn is a reaction constant for the translation rate with a value of 780 1/hr; Dn is a constant associated with the protein degradation rate and is equal to 0.5 1/hr; and Sf is associated with the fluorophore formation rate and has a value of 0.347 1/hr. These values are identical to the ones reported by Subramanian and Srienc [10], with the exception of the values for Dm and Dn, which were slightly modified to account for model adjustments as well as to take experimental observations into account. The rationale behind the procedure used for estimation of C, which has a value of 108 nM, will be discussed in Section 3.5. The initial conditions for this system are p(0) = 5, m(0) = 0, n(0) = 0, and f(0) = 0.

Equations (3.11) through (3.14) describe the relationship between the concentration of the transcription factor and activated GFP, f. The experimental measurements consist of the fluorescence intensity, Î, from the images, which is directly proportional to the concentration of activated green fluorescent protein:

$$f = \Delta \hat{I} \qquad (3.15)$$
where Δ is the ratio between activated GFP and the computed fluorescence intensity. As Î can be obtained from the fluorescent microscopy images that have been processed by one of the procedures described in Section 3.3, the dynamics of NF-κB can be computed by solving an inverse problem involving (3.12) through (3.15).
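A minimal MATLAB sketch of the forward model (3.11) through (3.14) is given below. The parameter values and initial conditions are those quoted above; the NF-κB input profile is an illustrative placeholder that would be replaced by measured or computed concentrations, and the value of Δ anticipates the estimate reported in Section 3.5.

```matlab
% Minimal sketch of the GFP reporter model (3.11)-(3.14). cNFkB is an
% illustrative placeholder for the NF-kB input profile (nM).
Sm = 373; Dm = 0.45; Sn = 780; Dn = 0.5; Sf = 0.347;   % 1/hr
C  = 108;                                              % nM
Delta = 2.5562e4;                                      % estimated in Section 3.5
cNFkB = @(t) 50*(1 - exp(-3*t));                       % placeholder input only

% state vector y = [p; m; n; f]
rhs = @(t, y) [ 0;                                           % (3.11)
                Sm*cNFkB(t)/(C + cNFkB(t))*y(1) - Dm*y(2);   % (3.12)
                Sn*y(2) - Dn*y(3) - Sf*y(3);                 % (3.13)
                Sf*y(3) - Dn*y(4) ];                         % (3.14)

[t, y] = ode45(rhs, [0 15], [5; 0; 0; 0]);   % 15-hour simulation, p(0) = 5
Ihat   = y(:,4)/Delta;                       % predicted intensity via (3.15)
```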
3.4.2 Solution of an inverse problem for determining transcription factor concentrations

Solving an inverse problem is nontrivial, as high-frequency noise and measurement errors will be accentuated when the measurements are differentiated [35]. Several common
methods for solving inverse problems are: (1) approximating the time-derivative of the measured output by numerical differentiation; (2) approximating the time-derivative of the measured output by a filter, which can remove the noise (for an example, see [36]); and (3) integrating the differential equation of the measured output to avoid the differentiation operation and approximating the integral with Runge-Kutta methods. However, these techniques can introduce additional noise due to the numerical calculations, in addition to the experimental noise, or may introduce a time delay in the data due to aggressive low-pass filtering. One solution to solving inverse problems that is less sensitive to measurement noise is to use a regularization procedure. The technique presented in the following is one type of such a procedure, as the inverse problem is solved by determining an analytical solution and estimating parameters of this analytical solution [37]:

1. The system of equations can be viewed as a linear system with a static nonlinearity in the input. Accordingly, (3.12) can be rewritten as

$$\frac{dm}{dt} = S_m p\, u - D_m m \qquad (3.16)$$

$$u = \frac{C_{NF\text{-}\kappa B}}{C + C_{NF\text{-}\kappa B}} \qquad (3.17)$$
with an alternative input u such that the relationship between u and the fluorescence intensity is linear.

2. Even though the shape of CNF-κB (and accordingly of u) is not known, as it is the purpose of this algorithm to determine CNF-κB from data, it is usually possible to make certain assumptions about the shape that the concentration profile might have. For example, data provided in [7] suggest that the concentration profile of CNF-κB will oscillate for a continuous stimulation with TNF-α. Therefore, it is appropriate to choose the Laplace transform of u as a second-order transfer function multiplied by a step input:

$$u(s) = \frac{\omega_n^2}{s^2 + 2\varepsilon\omega_n s + \omega_n^2} \cdot \frac{T_\alpha}{s} \qquad (3.18)$$
where ε, ωn, and Tα are the parameters describing the input u.

3. A Laplace transformation is applied to (3.16), (3.13), and (3.14), resulting in

$$m(s) = \frac{S_m p\, u(s)}{s + D_m} \qquad (3.19)$$

$$n(s) = \frac{S_n m(s)}{s + D_n + S_f} \qquad (3.20)$$

$$f(s) = \frac{S_f n(s)}{s + D_n} \qquad (3.21)$$
n(s) and m(s) can be eliminated from (3.19) through (3.21) such that a transfer function between u(s) and f(s) is derived:

$$f(s) = \frac{S_f}{s + D_n} \cdot \frac{S_n}{s + D_n + S_f} \cdot \frac{S_m p}{s + D_m}\, u(s) \qquad (3.22)$$

Substituting (3.18) into (3.22) results in

$$f(s) = \frac{S_f}{s + D_n} \cdot \frac{S_n}{s + D_n + S_f} \cdot \frac{S_m p}{s + D_m} \cdot \frac{\omega_n^2}{s^2 + 2\varepsilon\omega_n s + \omega_n^2} \cdot \frac{T_\alpha}{s} \qquad (3.23)$$
4. f(t) can be obtained by performing an inverse Laplace transform of (3.23):

$$f(t) = A_1 + A_2 e^{-D_n t} + A_3 e^{-(D_n + S_f)t} + A_4 e^{-D_m t} + A_5 e^{-\varepsilon\omega_n t}\sin\!\left(\omega_n\sqrt{1-\varepsilon^2}\, t + \varphi\right) \qquad (3.24)$$
where A1, A2, A3, A4, A5, and φ are all constants with the following values:

$$A_1 = \frac{S_f S_n S_m p\, T_\alpha}{D_n D_m \Delta (D_n + S_f)}$$

$$A_2 = \frac{-S_n S_m p\, \omega_n^2 T_\alpha}{D_n \Delta (D_m - D_n)\left(D_n^2 - 2\varepsilon\omega_n D_n + \omega_n^2\right)}$$

$$A_3 = \frac{S_n S_m p\, \omega_n^2 T_\alpha}{\Delta (D_n + S_f)(D_m - D_n - S_f)\left((D_n + S_f)^2 - 2\varepsilon\omega_n (D_n + S_f) + \omega_n^2\right)}$$

$$A_4 = \frac{S_f S_n S_m p\, \omega_n^2 T_\alpha}{D_m \Delta (D_n - D_m)(D_m - D_n - S_f)\left(D_m^2 - 2\varepsilon\omega_n D_m + \omega_n^2\right)}$$

$$A_5 = \sqrt{A_7^2 + \left(\frac{A_6 - A_7 \varepsilon\omega_n}{\omega_n\sqrt{1-\varepsilon^2}}\right)^2}$$

$$A_6 = \frac{C_0 (a d_1 + b d_0)}{\Delta\left(b d_1^2 + b d_0^2\right)}, \qquad A_7 = \frac{-C_0 d_1}{\Delta\left(b d_1^2 + b d_0^2\right)}$$

with the auxiliary quantities

$$a = -\varepsilon\omega_n, \qquad b = \omega_n\sqrt{1-\varepsilon^2}$$

$$d_1 = -(a_3 + 4a)\, b^3 + \left(3 a_3 a^2 + 2 a_2 a + 4 a^3 + a_1\right) b$$

$$d_0 = b^4 + a_3 a^3 - \left(3 a_3 a + a_2 + 6 a^2\right) b^2 + a_2 a^2 + a^4 + a_1 a$$

$$a_1 = \left(D_n^2 + D_n S_f\right) D_m, \qquad a_2 = D_n^2 + D_n S_f + 2 D_n D_m + D_m S_f, \qquad a_3 = 2 D_n + D_m + S_f$$

$$C_0 = S_f S_n S_m p\, \omega_n^2 T_\alpha, \qquad \varphi = \arctan\frac{A_7\, \omega_n \sqrt{1-\varepsilon^2}}{A_6 - A_7\, \varepsilon\omega_n}$$
The values of the parameters ε, ωn, and Tα are estimated by fitting f(t) to the experimental data.

5. u(t) is given by an inverse Laplace transformation of (3.18), which can be used to compute the profile of NF-κB from (3.17):
$$C_{NF\text{-}\kappa B} = \frac{C T_\alpha \sqrt{1-\varepsilon^2} - C T_\alpha\, e^{-\varepsilon\omega_n t} \sin\!\left(\omega_n\sqrt{1-\varepsilon^2}\, t + \phi\right)}{(1 - T_\alpha)\sqrt{1-\varepsilon^2} + T_\alpha\, e^{-\varepsilon\omega_n t} \sin\!\left(\omega_n\sqrt{1-\varepsilon^2}\, t + \phi\right)} \qquad (3.25)$$

where $\phi = \arctan\dfrac{\sqrt{1-\varepsilon^2}}{\varepsilon}$.
The parameters ε, ωn, and Tα have the values determined in step 4. While this procedure was derived for a transcription factor profile exhibiting damped oscillations, it is possible to derive other expressions for the fluorescence intensity and the transcription factor concentration profiles using the same procedure that was outlined in this section.
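A minimal MATLAB sketch of steps 4 and 5 is given below. The function fAnalytic, which evaluates (3.24) for a given parameter vector, and the vectors tData and IhatData of measurement times and processed intensities are assumed to be available; Delta and C are the values estimated in Section 3.5.

```matlab
% Minimal sketch of steps 4 and 5. fAnalytic(theta, t) is a placeholder
% that evaluates (3.24) for theta = [eps, wn, Ta]; tData, IhatData, Delta,
% and C are assumed inputs.
theta0   = [0.2, 4.5, 0.3];                              % initial guess
thetaFit = lsqnonlin(@(th) fAnalytic(th, tData) - Delta*IhatData, theta0);

% step 5: evaluate (3.25) with the fitted parameters
ep = thetaFit(1); wn = thetaFit(2); Ta = thetaFit(3);
phi = atan(sqrt(1 - ep^2)/ep);
osc = @(t) exp(-ep*wn*t).*sin(wn*sqrt(1 - ep^2)*t + phi);
t   = linspace(0, 15, 200);                              % hours
cNFkB = (C*Ta*sqrt(1 - ep^2) - C*Ta*osc(t)) ./ ...
        ((1 - Ta)*sqrt(1 - ep^2) + Ta*osc(t));
```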
3.5 Application Notes

To illustrate the implementation of the above two GFP image analysis methods as well as the procedure for computing the transcription factor profiles, time-series of images of hepatocytes constantly stimulated with three different concentrations of TNF-α (i.e., 6 ng/ml, 13 ng/ml, and 19 ng/ml) have been analyzed. Additionally, negative control experiments without TNF-α stimulation were carried out. The experiments were conducted for 15 hours and measurements were taken every 60 minutes. For each concentration of TNF-α, three images were recorded showing different areas of the experiment. Both image analysis techniques were applied to all the images. The mean and one-standard-deviation error bars of each of these determined time-series are shown in Figure 3.9. It can be concluded that both analysis algorithms are able to correctly capture the trends. The results returned by the method based upon PCA and K-means clustering seem to have slightly smaller error bars; however, the difference is not significant. Furthermore, these results have to be put in the right context by comparing them to the amount of information that could have been captured by other, semi-quantitative measurement techniques. It is sufficient to say that both image analysis techniques return comparable results for the investigated images. The results from the method based on PCA and K-means clustering are used to determine the dynamic profile of NF-κB. The analysis that is described here has also been applied to data generated by the technique based upon wavelets and bidirectional search. However, since the results were found to be very similar, only one set of data is used here due to space constraints. The image analysis procedure returned the profile of the intensity Î seen in the fluorescent microscopy images. Before Î is used to derive the profile of NF-κB, the parameters C in the GFP model and Δ from (3.15), which links the concentration of activated GFP and the fluorescence intensity seen in an image, are estimated by the following procedure:
Figure 3.9 Image analysis results for the fluorescent microscopy images of NF-κB for cells stimulated with TNF-α: (a) TNF-α = 6 ng/ml; (b) TNF-α = 13 ng/ml; and (c) TNF-α = 19 ng/ml.
1. The CNF-κB data for cells stimulated by TNF-α = 10 ng/ml in wild-type cells from the paper by Hoffmann et al. [7] is used to identify C, ε, ωn, and Tα in (3.25) with the nonlinear least-squares optimization command in MATLAB, lsqnonlin. C, ε, ωn, and Tα are found to be 108 nM, 0.17, 4.49, and 0.27, respectively. Figure 3.10 shows that the output of (3.25) with the estimated parameters C, ε, ωn, and Tα fits the CNF-κB data from [7] well.
2. The model described by (3.11) through (3.14) is used with the estimated value of C to compute the profile of the GFP. The input of this model is the concentration of NF-κB. The CNF-κB data for cells stimulated by TNF-α = 10 ng/ml in wild-type cells from the paper by Hoffmann et al. [7] is used as an input to calculate the profile of f. As the CNF-κB concentrations are given at discrete points, the values between two time points are estimated by linear interpolation.
3. The fluorescence intensity for TNF-α = 10 ng/ml is computed by the described procedure from the experimental results shown in Figure 3.11 (red line). Δ is estimated by the ratio of the steady-state value of f computed from the model in step 2 to the steady-state fluorescence intensity computed from the experimental data by the image analysis procedure.
4. The estimated value for Δ is 2.5562 × 10^4. A comparison of the experimental data analyzed by the presented analysis algorithm and the fluorescence intensity profile computed from (3.11) through (3.14) for an input of TNF-α = 10 ng/ml is shown in Figure 3.11.
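The sketch below outlines these four steps in MATLAB. The functions eq325 (the right-hand side of (3.25) with C treated as an additional free parameter) and gfpModel (an integrator for (3.11) through (3.14), for example based on ode45 as sketched in Section 3.4.1), as well as the digitized data from [7] and the measured 10-ng/ml intensity profile, are assumed inputs and not part of the original procedure's code.

```matlab
% Minimal sketch of estimating C and Delta (steps 1-4). eq325, gfpModel,
% hoffmannT/hoffmannC (digitized data from [7]), and IhatExp are
% placeholders/assumed inputs.
% Step 1: fit C, eps, wn, Ta to the published NF-kB data
p0   = [100, 0.2, 4.5, 0.3];                            % [C, eps, wn, Ta]
pFit = lsqnonlin(@(p) eq325(p, hoffmannT) - hoffmannC, p0);
C    = pFit(1);

% Step 2: simulate the GFP model driven by the interpolated NF-kB data
cIn = @(t) interp1(hoffmannT, hoffmannC, t, 'linear', hoffmannC(end));
[tSim, f] = gfpModel(C, cIn, [0 15]);

% Steps 3-4: Delta as the ratio of the steady-state model f to the
% steady-state measured intensity for TNF-alpha = 10 ng/ml
Delta = f(end)/IhatExp(end);
```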
Figure 3.10 The output from (3.25) with the estimated parameters and the original CNF-κB from [7].
Figure 3.11 The experimental data and the output f/Δ from the identified GFP model for Hoffmann's NF-κB data.
After C and Δ have been estimated, their values are used to derive the profile of NF-κB from the fluorescence intensity profile Î for TNF-α concentrations other than 10 ng/ml. The procedure for solving the inverse problem is given as follows:

1. The parameters ε, ωn, and Tα are estimated to fit f, given by (3.24), to ΔÎ using a nonlinear least-squares optimization method.
2. The corresponding NF-κB profile is given by (3.25) using the values of the estimated parameters ε, ωn, and Tα.

Table 3.3 shows the parameters ε, ωn, and Tα estimated for TNF-α concentrations of 6, 13, and 19 ng/ml. The corresponding curves of f/Δ, as predicted by (3.24), are shown in Figure 3.12 together with the experimental data obtained from the image analysis techniques. The three corresponding CNF-κB profiles are shown in Figure 3.13. After constant stimulation by TNF-α, CNF-κB increases and reaches its maximum value after approximately 40 minutes. The NF-κB concentrations reach their steady-state values after approximately 6 hours. Comparison of the results for these three TNF-α concentrations shows that stimulation with increased levels of TNF-α leads to higher peak values and a larger steady-state value of the concentration of NF-κB. These results are reasonable, as larger TNF-α concentrations are able to activate more IKKn (neutral form of IKK kinase), which releases more NF-κB from the complex (IκBα|NF-κB) and then induces more NF-κB in the nucleus [38]. It can be concluded from Figure 3.13 that the image analysis techniques and the solution of the inverse problem presented in this work can obtain quantitative data for the transcription factor NF-κB.
Table 3.3 Estimated Parameters for Fitting f/Δ to Î

TNF-α Concentration    ε      ωn     Tα
6 ng/ml                0.20   4.52   0.26
13 ng/ml               0.20   4.52   0.31
19 ng/ml               0.28   4.61   0.35
Figure 3.12 The experimental data Î and the fitted curve f/Δ for TNF-α at (a) 6 ng/ml, (b) 13 ng/ml, and (c) 19 ng/ml.
Figure 3.13 NF-κB profiles computed via solution of the inverse problem based upon one of the presented image analysis techniques for TNF-α at 6 ng/ml, 10 ng/ml, 13 ng/ml, and 19 ng/ml.
3.6 Summary and Conclusions

This work presented techniques for determining dynamic concentration profiles of transcription factors from a series of fluorescent microscopy images. The first image analysis method presented in this work uses wavelets to sharpen the contrast of the images. This sharpening step is followed by a bidirectional search which determines if a pixel corresponds to a fluorescent cell. The second image analysis technique uses K-means clustering and principal component analysis (PCA) to cluster the pixels according to their fluorescence intensity levels. It has been found that the first algorithm is simpler to implement and requires less computation time, while the latter technique tends to give better results for low-contrast images. Additionally, the technique involving PCA and K-means clustering is able to determine several regions with the same intensity level, whereas the bidirectional search technique only detects regions where the fluorescence intensity is above a certain threshold. The results for the fluorescence intensity profiles obtained from these techniques were similar for all the test cases investigated in this work. A second contribution of this chapter is the introduction of a technique that determines the transcription factor concentration from the fluorescence intensity profile. The procedures are illustrated by determining the NF-κB dynamics in hepatocytes for different stimulation concentrations of TNF-α.
Acknowledgments The authors gratefully acknowledge partial financial support from the National Science Foundation (Grant CBET# 0706792) and the ACS Petroleum Research Fund (Grant PRF# 48144-AC9). The authors are grateful for the fluorescence microscopy images provided by Dr. Arul Jayaraman and Mr. Fatih Senocak.
References

[1] Corvinus, F.M., C. Orth, R. Morigg, S.A. Tsareva, S. Wagner, E.B. Pfitzner, D. Baus, R. Kaufmann, L.A. Huberb, K. Zatloukal, H. Beug, P. Ohlschlager, A. Schutz, K.-J. Halbhuber, and K. Friedrich, "Persistent STAT3 activation in colon cancer is associated with enhanced cell proliferation and tumor growth," Neoplasia, Vol. 7, No. 6, 2005, pp. 545–555.
[2] Judd, L.M., B.M. Alderman, M. Howlett, A. Shulkes, C. Dow, J. Moverley, D. Grail, B.J. Jenkins, M. Ernst, and A.S. Giraud, "Gastric cancer development in mice lacking the SHP2 binding site on the IL-6 family co-receptor gp130," Gastroenterology, Vol. 126, No. 1, 2004, pp. 196–207.
[3] Heinrich, P.C., I. Behrmann, S. Haan, and H.M. Hermanns, "Principles of interleukin (IL)-6-type cytokine signaling and its regulation," Biochem., Vol. 374, 2003, pp. 1–20.
[4] Singh, A.K., A. Jayaraman, and J. Hahn, "Modeling regulatory mechanisms in IL-6 signal transduction in hepatocytes," Biotechnol. Bioeng., Vol. 95, No. 5, 2006, pp. 850–862.
[5] Huang, Z., Y. Chu, F. Senocak, A. Jayaraman, and J. Hahn, "Model update of signal transduction pathways in hepatocytes based upon sensitivity analysis," Proceedings Foundations of Systems Biology 2007, September 2007, Stuttgart, Germany.
[6] Birtwistle, M.R., M. Hatakeyama, N. Yumoto, B.A. Ogunnaike, J.B. Hoek, and B.N. Kholodenko, "Ligand-dependent responses of the ErbB signaling network: experimental and modeling analyses," Molecular Systems Biology, Vol. 3, No. 144, 2007, pp. 1–16.
[7] Hoffmann, A., A. Levchenko, M.L. Scott, and D. Baltimore, "The IκB–NF-κB signaling module: temporal control and selective gene activation," Science, Vol. 298, No. 8, 2002, pp. 1241–1245.
[8] Kurien, B.T., and R.H. Scofield, "Western blotting," Methods, Vol. 38, No. 4, 2006, pp. 283–293.
[9] Pan, Q., A.L. Saltzman, Y. Ki Kim, C. Misquitta, O. Shai, L.E. Maquat, B.J. Frey, and B.J. Blencowe, "Quantitative microarray profiling provides evidence against widespread coupling of alternative splicing with nonsense-mediated mRNA decay to control gene expression," Genes Dev., Vol. 20, No. 2, 2006, pp. 153–158.
[10] Subramanian, S., and F. Srienc, "Quantitative analysis of transient gene expression in mammalian cells using the green fluorescent protein," J. Biotechnol., Vol. 49, 1996, pp. 137–151.
[11] King, K.R., S. Wang, D. Irimia, A. Jayaraman, M. Toner, and M.L. Yarmush, "A high-throughput microfluidic real-time gene expression living cell array," Lab Chip, Vol. 7, 2007, pp. 77–85.
[12] Choe, J., H.H. Guo, and G. V. D. Engh, "A dual-fluorescence reporter system for high-throughput clone characterization and selection by cell sorting," Nucleic Acids Research, Vol. 33, No. 5, 2005, e49.
[13] Miksch, G., F. Bettenworth, K. Friehs, E. Flaschel, A. Saalbach, and T.W. Nattkemper, "A rapid reporter system using GFP as a reporter protein for identification and screening of synthetic stationary-phase promoters in Escherichia coli," Appl. Microbiol. Biotechnol., Vol. 70, 2006, pp. 229–236.
[14] Ducrest, A.L., M. Amacker, J. Lingner, and M. Nabholz, "Detection of promoter activity by flow cytometric analysis of GFP reporter expression," Nucleic Acids Research, Vol. 30, No. 14, 2002, e65.
[15] Tee, C.S., M. Marziah, C.S. Tan, and M.P. Abdullah, "Evaluation of different promoters driving the GFP reporter gene and selected target tissues for particle bombardment of Dendrobium Sonia 17," Plant Cell Rep., Vol. 21, 2003, pp. 452–458.
[16] Carroll, J.A., P.E. Stewart, P. Rosa, A.F. Elias, and C.F. Garon, "An enhanced GFP reporter system to monitor gene expression in Borrelia burgdorferi," Microbiology, Vol. 149, 2003, pp. 1819–1828.
[17] Cheng, L., C. Du, D. Murray, X. Tong, Y. Zhang, B.P. Chen, and R.G. Hawley, "A GFP reporter system to assess gene transfer and expression in human hematopoietic progenitor cells," Gene Therapy, Vol. 4, 1997, pp. 1013–1022.
[18] Gouskosa, T., F. Wightmanb, S.R. Lewinb, and J. Torresia, "Highly reproducible transient transfections for the study of hepatitis B virus replication based on an internal GFP reporter system," Journal of Virological Methods, Vol. 121, 2004, pp. 65–72.
[19] Hoffman, R.M., "In vivo imaging with fluorescent proteins: the new cell biology," Acta Histochemica, Vol. 106, 2004, pp. 77–87.
[20] Venkataraman, S., J.L. Morrell-Falvey, M.J. Doktycz, and H. Qi, "Automated image analysis of fluorescence microscopic images to identify protein-protein interactions," Proc. 27th Annu. Conf. IEEE Engineering in Medicine and Biology, Shanghai, 2005.
[21] Jung, C.K., J.B. Lee, X.H. Wang, and Y.H. Song, "Wavelet based noise cancellation technique for fault location on underground power cables," Electric Power Systems Research, Vol. 77, 2007, pp. 1349–1362.
[22] Sharmaa, A., G. Sheoranb, Z.A. Jafferya, and Moinuddina, "Improvement of signal-to-noise ratio in digital holography using wavelet transform," Optics and Lasers in Engineering, Vol. 48, 2008, pp. 42–47.
[23] Donoho, D.L., and I.M. Johnstone, "Ideal de-noising in an orthonormal basis chosen from a library of bases," C.R.A.S. Paris, t. 319, Ser. I, 1994, pp. 1317–1322.
[24] Donoho, D.L., "De-noising by soft-thresholding," IEEE Trans. on Inf. Theory, Vol. 41, No. 3, 1995, pp. 613–627.
[25] Kaufman, L., and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons, 1990.
[26] Lloyd, S.P., "Least squares quantization in PCM," IEEE Trans. on Inf. Theory, Vol. 28, No. 2, 1982, pp. 129–137.
[27] Sabin, J., and R. Gray, "Global convergence and empirical consistency of the generalized Lloyd algorithm," IEEE Trans. on Inf. Theory, Vol. 32, No. 2, 1986, pp. 148–155.
[28] Hotelling, H., "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, Vol. 24, 1933, pp. 417–441.
[29] Jackson, J.E., A User's Guide to Principal Components, New York: John Wiley & Sons, 2003.
[30] Geladi, P., and H. Grahn, Multivariate Image Analysis, New York: John Wiley & Sons, 1996.
[31] Bharati, M.H.M., and J.F. Macgregor, "Multivariate image analysis for real-time process monitoring and control," Industrial and Engineering Chemistry Research, Vol. 37, 1998, pp. 4715–4724.
[32] Guillemin, F., M.F. Devaux, and F. Guillon, "Evaluation of plant histology by automatic clustering based on individual cell morphological features," Image Anal. Stereol., Vol. 23, 2004, pp. 13–22.
[33] Ding, C., and X. He, "K-means clustering via principal component analysis," Proc. of Intl. Conf. Machine Learning (ICML 2004), 2004, pp. 225–232.
[34] Meier, D.S., and C.R.G. Guttmann, "Time-series analysis of MRI intensity patterns in multiple sclerosis," NeuroImage, Vol. 20, No. 2, 2003, pp. 1193–1209.
[35] Benyon, P.R., "The inversion of dynamic systems," Mathematics and Computers in Simulation, XXI, 1979, pp. 335–339.
[36] Puebla, H., and J. Alvarez-Ramirez, "Stability of inverse-system approaches in coherent chaotic communication," IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Application, Vol. 48, No. 12, 1979, pp. 1413–1423.
[37] Huang, Z., F. Senocak, A. Jayaraman, and J. Hahn, "Integrated modeling and experimental approach for determining transcription factor profiles from fluorescent reporter data," BMC Systems Biology, Vol. 2, No. 64, 2008.
[38] Lipniacki, T., P. Paszek, A.R. Brasier, B. Luxon, and M. Kimmel, "Mathematical model of NF-κB regulatory module," Journal of Theoretical Biology, Vol. 228, 2004, pp. 195–215.
CHAPTER 4
Data-Driven, Mechanistic Modeling of Biochemical Reaction Networks

Jason M. Haugh,1* Timothy C. Elston,2 Murat Cirit,1 Chun-Chao Wang,1 Nan Hao,2 and Necmettin Yildirim3

1 Department of Chemical & Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695
2 Department of Pharmacology, University of North Carolina, Chapel Hill, NC 27599
3 Division of Natural Sciences, New College of Florida, Sarasota, FL 34243
* e-mail: [email protected]
Abstract

Mathematical modeling has emerged as a valuable tool for characterizing and predicting the spatiotemporal dynamics of biochemical reaction pathways and networks in living cells; however, the power of such models is currently limited by the availability of quantitative, kinetic data for comparison and validation. In this chapter, we discuss data-driven modeling of intracellular reaction networks, with a focus on signal transduction in eukaryotic cells. Experimental data types and their limitations, approaches for data processing and normalization, types of models and issues related to model simplification, and parameter estimation methods are covered. To illustrate these principles, we offer two recent examples of data-driven modeling, each dealing with signal transduction through mitogen-activated protein kinases (MAPKs): elucidation of crosstalk between phosphoinositide 3-kinase (PI3K)- and Ras-dependent pathways in mammalian cells, and analysis of feedback regulation as a mechanism for signaling specificity in the yeast pheromone response.
Key terms: Signal transduction, Cell biology, Kinase cascade, Crosstalk, Parameter estimation
4.1 Introduction

At a certain level of abstraction, living cells are picoliter-sized reaction vessels in which thousands of biochemical reactions and intermolecular binding processes take place in a dynamic, coordinated, and highly regulated fashion. This is an energy intensive process. Intracellular enzyme activities are modulated by covalent modifications that are rapidly added and removed in a seemingly futile cycle in order to respond to changes in the cell’s external environment. These reactions are responsible for governing cell function, and their dysregulation and modulation by infectious agents constitute the molecular basis for human disease. From the perspective of chemical kinetics, the inner workings of the cell are fascinating, but we are still a long way from a mechanistic understanding of these reactions and quantitative characterization of their rates.

In this chapter, we discuss how mathematical modeling is applied in tandem with biochemical measurements to achieve this goal. Whether before, during, or after the collection of experimental data, quantitative modeling is a valuable approach for critically assessing and organizing hypotheses that integrate the many processes that might be at play [1]. And, to the extent that a model is trained on a sufficient amount of quantitative data and its mechanistic assumptions are sound, it may be used to predict the outcomes of novel experiments and thus generate new, hypothesis-driven research. Some experiments will inevitably contradict the model predictions, but as with conceptual, “arrow diagram” models, one iteratively refines the model based on new data.

The examples presented here are focused on mechanisms of signal transduction in eukaryotic cells, which are responsible for controlling cell cycle progression, cell motility, responses to stress, programmed cell death, and differentiation of cell function [2, 3]. These reaction pathways transmit information about the cell’s external microenvironment, making them fundamentally distinct from metabolic pathways, which deal in currencies of energy and reducing power. We further narrow our focus on modeling of cell signaling that is both data-driven and rooted in biochemical mechanisms. We distinguish data-driven models from purely theoretical models, where experimental data are either not available or not accessible with current technology, and mechanistic models are distinguished from purely phenomenological and purely statistical/correlative models. To supplement the topics presented here, the reader is referred to a number of reviews on the subject of modeling signal transduction processes [4–8].

Rather than presenting detailed recipes of experimental or modeling techniques, this chapter aims to shed light on the inherent relationship between the two in data-driven modeling. In Section 4.2, we discuss the advantages and shortcomings of different experimental methodologies from the standpoint of modeling and the formulation of models of the appropriate type and level of complexity. Emphasis is placed on the pressing need for model simplification and more systematic approaches for model parameter specification. Then, in Section 4.3, we present two examples of how we have applied those modeling principles to understand specific cell signaling systems.
4.2 Principles of Data-Driven Modeling

4.2.1 Types of experimental data
Depending on one’s point of view, cell biology is currently either in a data-rich or data-deprived state. There is a wealth of genomic and proteomic data that have yielded mostly qualitative information about the connectivity of pathways, yet there is relatively little in the way of measurements characterizing their dynamics. Here, we briefly discuss the various quantitative experimental methodologies that define the current state of the art and weigh their advantages, caveats, and limitations. We choose to classify measurement techniques in three categories: population endpoint, single-cell endpoint, and single-cell kinetic (Table 4.1). An endpoint measurement is one in which the experiment is stopped at a certain time and the sample is prepared for analysis, whereas a kinetic measurement is one in which the biochemical readout is monitored in real time. Important considerations include dynamic range (the range of measured values from the lowest limit of detection to the upper limit of assay linearity), throughput (the number of conditions that can be compared in each independent experiment), the ability to multiplex (measure multiple readouts at once), and the ability to assess subcellular localization. In population endpoint measurements, a large number of cells (103 to 108) are subjected to identical experimental conditions, and a lysate of the cell collective is prepared for analysis. Hence, information about individual cells is lost, and information about subcellular localization is at best indirect; depending on the method of lysis, the preparation can be subdivided based on density and/or detergent solubility into fractions representing different subcellular compartments (cytosol, plasma membrane, endosomes and Golgi, nuclei, and so forth). Despite these shortcomings, this approach has several advantages, including potentially high sensitivity and throughput and broad versatility
Table 4.1 Capabilities and Limitations of Common Experimental Methodologies

                                Dynamic Range   Throughput   Multiplexing   Spatial Detail
Population Endpoint:
  Immunoblotting*               +++++           +++          +              –
  Dot blot/ELISA                +++             +++++        +              –
  Sandwich ELISA                ++++            ++++         +              –
  In vitro enzymatic assay      ++++            +++          –              –
  Antibody array                +++             ++           +++            –
  Mass spectrometry             +++             +            +++++          –
Single-Cell Endpoint:
  Flow cytometry                ++              ++           ++             –
  Immunofluorescence            ++              +            +              +++
Single-Cell Kinetic:
  Confocal; localization        ++              +            +              +++
  TIRF; localization            +++             +            +              ++
  Spectral shift                +++             ++           –              +
  FRET                          +               +            –              ++
Various measurement techniques are rated according to typical performance in four categories: dynamic range/sensitivity, throughput (the number of conditions that can be compared in each independent experiment), the ability to multiplex (measure multiple readouts at once), and the ability to assess subcellular localization. *Immunoblotting using enhanced chemiluminescence and a high-sensitivity, cooled charge coupled device camera for imaging; the traditional method using photographic film for imaging gives a sigmoidal response over a much narrower dynamic range (contributing to the false notion that immunoblotting is generally not quantitative).
for measuring a variety of molecular readouts; all of these depend critically on the quality of the reagents used. The most common population endpoint measurements, such as immunoblotting, enzyme-linked immunosorbent assays (ELISAs), and in vitro enzymatic assays, involve protein immobilization and the use of antibodies for specific capture or detection. All things being equal, assays that involve an initial separation step (e.g., gel electrophoresis or the use of a capture antibody) tend to be more specific and therefore have a higher dynamic range. There is also a general trade-off between the ability to multiplex and throughput, as exemplified by antibody arrays [9] and especially current mass spectrometry technology [10, 11]. These methods can also be used in conjunction with coimmunoprecipitation to assess protein-protein interactions; however, because of the work-up time involved, this approach is strongly biased to detect only very stable interactions. From the standpoint of experimental data, the inability to measure intracellular protein-protein interactions quantitatively is arguably the most significant limitation for data-driven modeling. Single-cell endpoint measurements, which provide information about individual cells, include flow cytometry and immunofluorescence microscopy. Both involve incubation with antibodies, detection of fluorescence, and in the case of intracellular proteins, cell fixation and permeabilization. Flow cytometry offers high throughput in terms of assembling population statistics for each sample and moderate throughput in terms of comparing multiple samples. Immunofluorescence offers information about subcellular localization, but the analysis is tedious and therefore low in throughput. Single-cell kinetic measurements generally involve microscopic imaging of live cells, in which case information about subcellular localization is obtained. Although this approach suffers from many of the same throughput issues as immunofluorescence, the ability to observe signaling kinetics in real time and in conjunction with cell behavior makes it unique [12, 13]. The basis for the measurement is the introduction of a biosensor, either genetically encoded or microinjected into the cell; genetically encoded biosensors are fusion proteins comprised of a protein or protein domain of interest, to which a fluorescent protein such as enhanced green fluorescent protein is attached. A limited degree of multiplexing is offered through the use of multiple biosensors labeled with different fluorophores. The dynamic range of the measurement is affected by which particular biosensor and microscopy modality [e.g., wide-field fluorescence, confocal fluorescence, or total internal reflection fluorescence (TIRF)] are used, and the basis for the measurement [e.g., a shift in spectral properties of the fluorophore, as in calcium imaging, translocation to a particular membrane or intracellular compartment, or changes in Förster resonance energy transfer (FRET)]. The most significant limitation of this approach is that there are currently only a small number of biosensors that work well for quantitative studies; another caveat of using biosensors is that they might significantly interfere with or otherwise modulate the signaling processes they were meant to detect.
4.2.2 Data processing and normalization
All data require some form(s) of processing prior to any sort of quantitative analysis. Some of these are obvious and routine; for example, the subtraction of assay/image background and the linear rescaling of images for presentation. Typically, quantitative data are also normalized. The purpose of normalization is to adjust for sources of vari-
ability, so that the reproducibility of experimentally deduced trends may be compared in a statistically meaningful way. The manner in which this is done varies and is context-dependent (and, in some cases, arbitrary), and hence this topic is worthy of some discussion. Variability arises because of both the biological system and the assay itself. Biological variability is significant in any measurement involving cells; this is because, no matter how carefully the parameters of the cell culture are controlled, the culture will vary from experiment to experiment. Assay variability arises from heterogeneity within a sample (e.g., from cell to cell in single-cell measurements) and in the preparation of samples, which affects the comparison of conditions within the same experiment, and also from temporal and lot-to-lot changes in the reagents used, which along with biological variability affect the comparison of independent experiments. Sample heterogeneity at the single-cell or population level is generally normalized by dividing the signal by a second measurement that should not be affected by the perturbations being tested. For example, population endpoint measurements are typically normalized by the total amount of cellular protein in the sample or by the amount of an abundant species that should be invariant from sample to sample (e.g., actin or tubulin). This is especially important when comparing samples derived from the same cell line/strain but which have been differentially modified over some period of time; for example, comparing control cells to cells in which over-expression of a wild-type gene or expression of a mutant gene has been introduced. Day-to-day variability of the assay reagents and other assay conditions can be normalized by the measurement of a common standard sample; however, this approach is of little use in the typical case where biological variability is also prominent. To normalize for biological variability, it is often appropriate to use a negative or positive control sample, acquired in each independent experiment. A pitfall of using a negative control for normalization (e.g., fold-induction) is that it often has the lowest and least reliable signal. For more complex data sets of the sort that is desirable for quantitative modeling, with measurements at multiple time points for a variety of experimental conditions, choosing how to normalize the data by a positive control condition (e.g., maximum stimulation of otherwise unperturbed cells) is subject to some ambiguity. Normalizing by the value at a particular time point is a common practice, but the choice of the time point might be considered arbitrary; normalizing by the maximum (peak) value in each experiment is less arbitrary but nonetheless tends to obscure comparisons between control and noncontrol conditions at time points other than in the vicinity of the peak. For such data sets, we contend that normalizing in a manner that incorporates all of the time-dependent data for the control condition is more appropriate. Examples include normalizing by the mean value of the control time course, its “area under the curve” (e.g., [14]), or by normalization factors that minimize its variance across all experiments (e.g., as assessed by the mean coefficient of variation). The latter approach, which we currently favor, is briefly described here and demonstrated in the two examples presented in Section 4.3. Suppose there are n experiments for which data are collected at m time points. 
During each of the n experiments the same control is run. Let Xij denote the experimental readout for the control in the ith experiment at the jth time point. Often the quantity of interest Yij (e.g., the concentration of chemical species) is related to Xij by an unknown scale factor. That is, Yij = αi Xij. Under ideal conditions, the control would not vary from
experiment to experiment. Therefore, we seek the set of αi’s that minimize a suitable quantity F, for example

$$F = \sum_{j=1}^{m}\sum_{i=1}^{n}\left(Y_{ij}-\bar{Y}_{j}\right)^{2}, \qquad F = \sum_{j=1}^{m}\frac{1}{\bar{Y}_{j}}\left[\sum_{i=1}^{n}\left(Y_{ij}-\bar{Y}_{j}\right)^{2}\right]^{1/2}$$
where $\bar{Y}_{j}$ is the mean value that results for time point j. The minimization is subject to a constraint that eliminates the trivial solution, αi = 0 for all i. Once the αi have been found, they are used to scale the experimental time series, allowing the mathematical model to be fit to all the data simultaneously.
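A minimal MATLAB sketch of this normalization is given below, using the first form of F. The constraint that the αi average to one is an illustrative choice for excluding the trivial solution; the text only requires that some such constraint be imposed.

```matlab
% Minimal sketch of estimating the normalization factors alpha_i. X is an
% n-by-m matrix of control readouts (rows are experiments, columns are
% time points). The "alphas average to one" constraint is an assumption.
F   = @(alpha) sum(sum((alpha(:).*X - mean(alpha(:).*X, 1)).^2));
n   = size(X, 1);
obj = @(a) F([a(:); n - sum(a)]);          % last alpha fixed by the constraint
aOpt  = fminsearch(obj, ones(n-1, 1));
alpha = [aOpt(:); n - sum(aOpt)];
Y = alpha.*X;                              % rescaled control time series Y_ij
```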
4.2.3 Suitability of models used in conjunction with quantitative data
In formulating a suitable mathematical description of a system, it is important to cast the model at an appropriate level of abstraction, which should be weighed carefully along with considerations of computational feasibility. While all models of biochemical processes are expected to include fundamentals of chemical reaction kinetics, they are expected to vary along two axes of increasing complexity: from deterministic to stochastic, and from well-stirred to spatially extended (Figure 4.1). In deterministic models, continuum variables such as species concentrations evolve according to ordinary or partial differential equations (ODEs and PDEs, respectively) and associated initial and boundary constraints, whereas in stochastic models, molecules and molecular complexes are modeled as discrete entities whose states are updated probabilistically [15, 16]. So-called hybrid models incorporate both continuum and discrete variables [17]. On the other axis, well-stirred models assume spatial homogeneity within the domain of interest, and any transport processes in the model (e.g., trafficking between intracellular compartments [18]) are incorporated as reaction terms, whereas spatially
Figure 4.1 Two axes of mechanistic model complexity. Models can be characterized according to whether they are deterministic or stochastic and whether or not they explicitly account for spatial gradients. Roughly speaking, the degree of computational difficulty increases as one moves from the lower left to the upper right quadrant. In each corner, techniques used to implement such models are listed along with, in parentheses, the type of experimental data that might be described. Abbreviations: ODE, ordinary differential equation; PDE, partial differential equation; SDE, stochastic differential equation; BD, Brownian dynamics.
extended models account for spatial gradients and therefore describe the underlying transport processes explicitly, according to physicochemical principles [8]. For data-driven modeling of biochemical systems, the chosen complexity of the model should depend not only on what qualitative information is available in the literature, however reliable, but also in large part on the amount and type of quantitative, experimental data available. For instance, population endpoint measurements tend to be the most versatile and quantitative, yet they do not provide the kind of information that would justify a stochastic or spatially extended description of the model. Therefore, even though more complex models might be formulated, it is most appropriate to cast the model as a set of deterministic ODEs (see Section 4.3, Examples 1 and 2). Data-driven stochastic models generally benefit from single-cell information, which is obtained most quantitatively (albeit without spatial information) from flow cytometry data [19–21], and spatially extended models must be driven almost exclusively by single-cell kinetic (live-cell microscopy) data [22–26].
4.2.4 Issues related to parameter specification and estimation
Another aspect of model complexity that must be carefully considered when making comparisons to data is the amount of molecular detail to include. A comprehensive model, explicitly including all of the “known” biochemistry, comes at the expense of having to identify a large set of parameter values (rate constants and initial concentrations) [27]. Prominent examples of signaling pathway/network models with ~100 or more adjustable parameters have been offered [28–31], and in such cases the parameter values are typically culled from published in vitro measurements using purified components (or assumed to be similar in magnitude to parameters for related interactions where such data are available) or adjusted by hand to reconcile the sparse biochemical data assembled in various cell types and laboratories. Although models using this approach have proven valuable, it must be recognized that there is a great deal of uncertainty associated with such a parameter specification exercise. Formulation of very detailed models also dictates a qualitative assessment, wherein the model is judged by its ability to correctly produce the gross kinetic features seen in a relatively small collection of measurements [1]. The other approach is to simplify the model so as to reduce the number of adjustable parameters, to the point where a more direct, quantitative comparison or fit to the data becomes feasible and adequately constrained. Thus, the degree of model simplification is largely determined by the variety of experimental conditions and biochemical readouts in the data set; this, we contend, is the art of data-driven modeling. Simplification of kinetic models is achieved in a number of ways, including the use of scaled, dimensionless variables and through knowledge or assumptions about fast versus slow rate processes. Another mode of simplification is the lumping of multiple processes into a single step, which is warranted when quantitative data related to that particular step are absent or unattainable, or when its details are poorly characterized. Supposing that a model with an appropriate level of granularity has been tailored for a particular set of measurements, how does one fit the model output to the data? This can be somewhat tricky, because even with appropriate simplification, a pathway/network model is going to have more than a handful of adjustable parameters. Indeed, it is becoming increasingly clear that the values of parameters in models with even modest
complexity are not uniquely identifiable, even with near-perfect kinetic data [32]. With that said, there are efficient methods for identifying a (nonunique) set of parameters that fit the data optimally well. One approach, which has been used to great effect in the modeling of the cell cycle, is the use of global optimization algorithms such as ODRPACK, which implements the Levenberg-Marquardt method with variable step size [33, 34]. Another strategy, which is gaining in popularity, involves Monte Carlo–based or “genetic” algorithms, wherein all of the parameter values are adjusted randomly, according to distributions centered on the current values, and the resulting parameter set is either accepted or rejected with a certain probability or based on specified criteria related to the goodness of fit. The classic example of such an approach is the Metropolis algorithm [35] (Figure 4.2). In this method, parameter sets that improve the goodness of fit are always accepted, whereas sets that yield a poorer fit are accepted with a probability determined by a Boltzmann-like function; the overall error (χ²) is analogous to the energy, which is compared with a user-specified parameter that is analogous to the thermal energy scale or temperature (the lower the “temperature,” the lower the probability of acceptance). A commonly used variation is simulated annealing, in which the “temperature” is steadily reduced with time, making it more efficient for finding a global optimum [36, 37]. Regardless of the method used, it is important to note that the units of the model and those of the measurement are rarely the same, and so a conversion/alignment factor for each data type must usually be assigned or used as a fit parameter. Faced with the inherent problem of identifying unique parameter values, it might not be fruitful to seek one single, “best” solution to the parameter estimation problem; another approach, which we have demonstrated in Examples 1 and 2 below, is the ensemble or collective fitting approach [32, 38]. In this method, one accumulates a large number (potentially > 1,000) of parameter sets (the ensemble) that fit the data almost equally well. Starting with a single, near-optimal parameter set, the Metropolis algorithm is suitable for collecting the ensemble. At least for ODE models, which are solved with very little computational effort, it is no large task to recompute the model output for each of these parameter sets; the output of the “model,” then, may be taken as the ensemble mean, with its standard deviation yielding a measure of the variability in the model fit or prediction. An advantage of this approach is that one can readily infer whether or not a particular parameter is well constrained by the fit by inspection of the distribution of its values across the ensemble. Arguably, this evaluation is more insightful than the typical sensitivity analysis, which only assesses how the model responds to small changes in the parameter values, made one parameter at a time.
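A minimal MATLAB sketch of such a Metropolis-type search and ensemble collection is given below. The functions simulateModel and ssd (sum of squared deviations between the aligned model output and the data) are placeholders for the user's routines, and theta0, nSteps, alpha, and beta correspond to the initial guess, chain length, move size, and acceptance stringency ("temperature") discussed above; this is a sketch of the general idea, not the authors' exact implementation.

```matlab
% Minimal sketch of a Metropolis-type parameter search with ensemble
% collection (see Figure 4.2). simulateModel, ssd, theta0, nSteps, alpha,
% and beta are placeholders/assumed inputs.
theta = theta0;
E = ssd(simulateModel(theta), data);
ensemble = zeros(nSteps, numel(theta));
for step = 1:nSteps
    thetaNew = theta.*exp(alpha*randn(size(theta)));   % random multiplicative move
    Enew = ssd(simulateModel(thetaNew), data);
    % always accept improvements; accept worse fits with Boltzmann-like probability
    if Enew < E || rand < exp(-(Enew - E)/beta)
        theta = thetaNew;
        E = Enew;
    end
    ensemble(step, :) = theta(:)';         % collect the current parameter set
end
% the ensemble mean (and spread) of the recomputed model outputs then
% serves as the model output and its uncertainty
```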
4.3 Examples of Data-Driven Modeling

4.3.1 Example 1: Systematic analysis of crosstalk in the PDGF receptor signaling network

Historically, intracellular signal transduction has been characterized in terms of pathways of sequential activation processes, such as the canonical mitogen-activated protein kinase (MAPK) cascades; a prominent example is the Ras → Raf → MEK → extracellular signal-regulated kinase (Erk) pathway in mammals, which is both a master integrator of upstream inputs and a master controller of transcription factors and
Figure 4.2 Parameter estimation using the Metropolis algorithm. (a) Schematic of the algorithm. The values of all model parameters are adjusted at random, according to distributions centered on the previous values, and the resulting quality of fit determines the probability of accepting each successive parameter set. Alignment of the model output to the data is achieved through the assignment of conversion factors, which may be estimated in a separate subroutine. The performance of the algorithm is tuned by adjusting the values of α, which characterizes how much the parameters change in each step, and β, the stringency of the acceptance criterion. (b) Illustration of the algorithm run in a highly stringent mode, wherein each accepted move almost always results in a better fit (lower SSD), starting from random guesses of the parameter values. (c) After achieving a near-minimum SSD value, the algorithm may be reinitiated with a relaxed stringency, allowing a large number of parameter sets to be collected in an ensemble. The average output of the parameter set ensemble constitutes the output of the model. Quantitative predictions are made through uniform changes (e.g., setting a particular parameter equal to zero) across the ensemble.
other effectors [39]. Although our current understanding of signal transduction networks includes more complex interactions, including those between the classically defined pathways (crosstalk) and those responsible for feedback regulation/reinforcement, such interactions have not yet been adequately characterized. In an effort to quantify the relative contributions of classical and crosstalk interactions in a signaling network, population endpoint measurements and computational modeling were systematically combined to study signaling mediated by platelet-derived growth factor (PDGF) receptors in fibroblasts [40] (Figures 4.3 and 4.4). The PDGF receptor signaling network is important in dermal wound healing and embryonic development [41], stimulating directed cell migration, survival, and proliferation through the aforementioned Ras/Erk pathway and exceptionally robust activation of phosphoinositide 3-kinases (PI3Ks), which produce specific lipid second messengers at the plasma membrane [42–44]. Erk phosphorylation and PI3K-dependent Akt phosphorylation in PDGF-stimulated NIH 3T3 fibroblasts were measured by quantitative immunoblotting for an array of 126
Figure 4.3 Data-driven model to characterize crosstalk in the PDGF receptor signaling network. (a) A portion of the quantitative data set shows that inhibition of Ras (by expression of dominant-negative S17N Ras) or PI3K (using the LY compound) affects the dynamics of PDGF-stimulated Erk phosphorylation. (b) Whereas inhibition of Ras or PI3K only partially blocks Erk phosphorylation, the double-inhibition experiment shows that Ras and PI3K account for all of the major pathways from PDGF receptors to Erk. (c) Conceptual model of the PDGF receptor signaling network based on the entire data set. (d) A coarse-grained kinetic model of the network is aligned directly to the data using a variation of the Metropolis algorithm and the parametric ensemble approach. All panels are adapted from [40] (with permission of the authors).
Figure 4.4 Quantification of Ras- and PI3K-dependent MEK phosphorylation pathways in the PDGF receptor signaling network. (a) For each parameter set in the model ensemble, the quantity Cxij is defined as the maximum catalytic efficiency of pathway i (i = 1, Ras-dependent; i = 2, PI3K-dependent) towards site j on MEK divided by the catalytic efficiency of the corresponding phosphatase reaction. On the dashed line, the two pathways are equally potent by this measure. (b) When MEK kinases and phosphatases are far from saturation, the steady-state fractions of MEK in the unphosphorylated, singly phosphorylated, and doubly phosphorylated states are readily calculated. The MEK Activation Comparator (MAC) is a ratio devised to compare the MEK phosphorylation capacity of PI3K-dependent signaling crosstalk to that of the classical Ras-dependent pathway. All panels are adapted from [40] (with permission of the authors).
experimental conditions, sampling different combinations of PDGF dose, stimulation time, and molecular manipulations; considering biological replicates and parallel determination of total Erk and Akt levels, this set of data comprises 2,772 total measurements. A selected portion of the Erk data shows that blocking the activity of either Ras or PI3K only partially reduces PDGF-stimulated Erk phosphorylation [Figure 4.3(a)], whereas simultaneous inhibition of Ras and PI3K almost completely abolished PDGF-stimulated Erk phosphorylation [Figure 4.3(b)], indicating that Ras and PI3K are responsible for all of the major pathways from PDGF receptors to Erk, and at least one mode of PI3K-dependent crosstalk to Erk is independent of Ras. By comparison, the Akt phosphorylation results showed that the PI3K pathway is not significantly affected by perturbations affecting Ras and Erk; crosstalk is apparently unidirectional, from PI3K to Ras/Erk, in this network [40]. This conceptual model was further refined by additional experiments, which characterized two known negative feedback mechanisms and established that PI3K-dependent crosstalk affects the Erk pathway both downstream and upstream of Ras [Figure 4.3(c)]. Motivated by the dynamics revealed in this unique data set, a kinetic model of the network was formulated and used to quantify the relative magnitudes of the PI3K-dependent and -independent inputs collaborating to activate Erk. A total of 34 unspecified parameter values were estimated using the ensemble approach described in the previous section; taken together, the data force the model to reconcile time- and PDGF dose-dependent features of the network observed under the various experimental conditions tested [Figure 4.3(d)].
Analysis of the parameter sets chosen by the algorithm revealed a consistent ratio of PI3K- and Ras-dependent contributions to the dual phosphorylation of MEK, the kinase directly upstream of Erk [Figure 4.4(a)]. We formulated a single number, the MEK activation comparator (MAC), which compares the capacities of the two pathways to generate dually phosphorylated MEK. Importantly, the MAC quantifies these inputs in a way that uncouples them from negative feedback effects. This analysis revealed that, whereas the PI3K-dependent MEK activation pathway is predicted to be intrinsically much less potent than the Ras-dependent pathway under maximal PDGF stimulation conditions, feedback regulation of Ras renders the PI3K-dependent pathway somewhat more important [Figure 4.4(b)]. A similar analysis was performed relating the PI3K-dependent and PI3K-independent signaling modes upstream of Ras [40]. The computational approach was also used to generate testable predictions with an eye towards future experiments. Whereas inhibition of PI3K affects crosstalk interactions both upstream and downstream of Ras, the model ensemble predicts unique kinetic signatures that might be expected if either mechanism were silenced selectively [40], which could help validate the point of action of a particular PI3K-dependent pathway on Erk.
4.3.2 Example 2: Computational analysis of signal specificity in yeast
Yeast is well recognized as an excellent model organism for systems-level analysis [45]. Its ability to undergo efficient homologous recombination is particularly useful for studying the functional role of proteins in vivo, through gene disruption or gene replacement. Because of this property, the yeast pheromone response system is arguably the best-characterized signaling pathway of any eukaryote. This pathway bears strong similarities to signaling networks in mammals. In particular, the MAPK components share extensive sequence similarity with their human counterparts [46]. Another feature common to the yeast pheromone response pathway and response pathways of higher organisms is the sharing of signaling proteins among multiple systems. This property makes the pheromone response pathway an excellent system for studying signal specificity.

Depending on specific external cues, yeast cells initiate either a mating response or an invasive growth program. Mating is initiated when haploid cell types a and α secrete and respond to type-specific pheromones, which act through G protein-coupled receptors on cells of the opposite mating type [47]. Alternatively, invasive growth occurs in nutrient-poor conditions [48]. Combined genetic and biochemical studies revealed that both mating and invasive growth require a protein kinase cascade composed of Ste20 (MAP4K), Ste11 (MAP3K), and Ste7 (MAP2K) [Figure 4.5(a)]. The pathways diverge at the level of the MAP kinase. Whereas deletion of one MAP kinase gene (KSS1) blocks invasive growth, deletion of a second MAP kinase gene (FUS3) impairs pheromone-induced cell-cycle arrest. Deletion of FUS3 leads to enhanced activity of Kss1 [49]. However, the mechanism by which this cross inhibition occurs was unknown.

We recently combined mathematical modeling with experimental analysis to investigate how Fus3 limits the activity of Kss1 [50]. Six mathematical models were developed to describe different hypothetical mechanisms of cross inhibition. All six models were fit to the time courses for Fus3 and Kss1 activation obtained from wild-type cells as well as from strains containing various genetic alterations. The experiments yielded a data set of
Figure 4.5 Data-driven modeling of signal specificity in yeast. (a) Components of the mating and invasive-growth pathways. Activation steps are indicated with arrows, and inhibition steps are indicated with a T-shaped line. (b) The sum of the squared differences (SSD) between the experimental data and output of the six models versus the number of accepted realizations in the Monte Carlo optimization routine. (c) A simple model that incorporates two mechanisms of cross-inhibition: Fus3 inhibits the rate of Kss1 phosphorylation (red dashed line), and Fus3 increases the rate of Kss1 dephosphorylation (blue dashed line). All panels are adapted from [50] (with permission of the authors).
over 300 measurements. To compare the performance of each of the six potential models, the Monte Carlo approach described above was used. Figure 4.5(b) shows a plot of the sum of the squared differences (SSD) versus the number of accepted realizations in the Monte Carlo optimization process for each of the six models. After 800 accepted realizations, the SSD converged for each model. Model I performed the best (minimum SSD), and the next best models were II, III, and V, which performed roughly equally well. Each of the six models falls into one of two distinct cases [Figure 4.5(c)]: (1) active Fus3 inhibits Kss1 phosphorylation, and (2) active Fus3 increases Kss1 dephosphorylation. Models I and II are mathematically the simplest and demonstrate the key difference between the two hypothetical mechanisms of cross inhibition. To compare the two models, 100 parameter sets were randomly selected from those accepted by the Monte Carlo optimization routine. The model equations were run using these parameter sets to generate a distribution of solutions. Figure 4.6(a) shows comparisons between the models’ output and experimental data for Kss1 activity in WT cells (black circles) and cells in which the MAPK Fus3 has been deleted (red circles). Note that only Model I is able to capture the rapid increase in Kss1 activity seen in the fus3Δ strain. The confidence intervals presented in these plots indicate that this behavior is not a
Figure 4.6 Representative results for the Fus3 cross-inhibition models. (a) Model I, in which Fus3 inhibits the activation of Kss1, is able to capture the rapid increase in Kss1 activity seen in a fus3Δ mutant, whereas Model II, in which Fus3 increases the rate at which Kss1 is deactivated, cannot capture this effect. (b) Model I accurately predicts the results for the Ste7A7 mutant in which feedback phosphorylation has been disrupted. The circles are experimental data points and the lines are model results. Results for the wild-type cells are indicated in black and red indicates results for the mutants. All panels are adapted from [50] (with permission of the authors).
consequence of the specific choice of parameter values, but a general property of the models. Motivated by these results, a simplified model of cross-inhibition was developed that captures the two general mechanisms by which Fus3 might regulate Kss1. Analysis of the simple model revealed that the two mechanisms of cross-inhibition have opposite effects on the rate at which the system relaxes to steady state. If Fus3 inhibits Kss1 phosphorylation, the relaxation rate is reduced; if Fus3 increases deactivation, the relaxation rate is increased. Consequently, a mechanism that increases the dephosphorylation rate of Kss1 is incompatible with the experimental data because it cannot simultaneously account for: (1) the large increase in maximum Kss1 activity seen in the fus3Δ strain, and (2) the slow decline in Kss1 activity observed in wild-type cells.

Because the MAP2K Ste7 is feedback phosphorylated by Fus3 [51–53] and directly catalyzes Kss1 activation, this protein was considered to be the most likely target for Fus3-mediated cross-inhibition of Kss1. The sites at which Fus3 phosphorylates Ste7 have been mapped, and a mutant lacking each of these phosphorylated residues has been described (Ste7A7) [54]. Consistent with the results of the computational
investigations, Ste7A7 exhibits a significant elevation in the extent of Kss1 phosphorylation compared with wild-type Ste7 (Ste7Wt); furthermore, the mathematical model describing this scenario accurately predicts the extent and duration of the increase in Kss1 activation promoted by Ste7A7 [Figure 4.6(b)] [50].
Acknowledgments This work was supported by National Institutes of Health grants R01-GM067739 and R21-GM074711 to J.M.H. and R01-GM079271 and R01-GM073180 to T.C.E.
References
[1] Mogilner, A., R. Wollman, and W.F. Marshall, “Quantitative modeling in cell biology: what is it good for?” Dev. Cell, Vol. 11, 2006, pp. 279–287.
[2] Hunter, T., “Signaling—2000 and beyond,” Cell, Vol. 100, 2000, pp. 113–127.
[3] Pawson, T., “Specificity in signal transduction: from phosphotyrosine-SH2 domain interactions to complex cellular systems,” Cell, Vol. 116, 2004, pp. 191–203.
[4] Tyson, J.J., K.C. Chen, and B. Novak, “Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell,” Curr. Opin. Cell Biol., Vol. 15, 2003, pp. 221–231.
[5] Ma’ayan, A., R.D. Blitzer, and R. Iyengar, “Toward predictive models of mammalian cells,” Annu. Rev. Biophys. Biomol. Struct., Vol. 34, 2005, pp. 319–349.
[6] Kholodenko, B.N., “Cell-signalling dynamics in time and space,” Nat. Rev. Mol. Cell. Biol., Vol. 7, 2006, pp. 165–176.
[7] Janes, K.A., and M.B. Yaffe, “Data-driven modelling of signal-transduction networks,” Nat. Rev. Mol. Cell. Biol., Vol. 7, 2006, pp. 820–828.
[8] Haugh, J.M., “Mathematical modeling of biological signaling networks,” in Wiley Encyclopedia of Chemical Biology, New York: John Wiley & Sons, 2008.
[9] Nielsen, U.B., and B.H. Geierstanger, “Multiplexed sandwich assays in microarray format,” J. Immunol. Meth., Vol. 290, 2004, pp. 107–120.
[10] Domon, B., and R. Aebersold, “Mass spectrometry and protein analysis,” Science, Vol. 312, 2006, pp. 212–217.
[11] Huang, P.H., and F.M. White, “Phosphoproteomics: Unraveling the signaling web,” Mol. Cell, Vol. 31, 2008, pp. 777–781.
[12] Meyer, T., and M.N. Teruel, “Fluorescence imaging of signaling networks,” Trends Cell Biol., Vol. 13, 2003, pp. 101–106.
[13] Giepmans, B.N.G., S.R. Adams, M.H. Ellisman, and R.Y. Tsien, “The fluorescent toolbox for assessing protein location and function,” Science, Vol. 312, 2006, pp. 217–224.
[14] Park, C.S., I.C. Schneider, and J.M. Haugh, “Kinetic analysis of platelet-derived growth factor receptor/phosphoinositide 3-kinase/Akt signaling in fibroblasts,” J. Biol. Chem., Vol. 278, 2003, pp. 37064–37072.
[15] Kepler, T.B., and T.C. Elston, “Stochasticity in transcriptional regulation: Origins, consequences, and mathematical representations,” Biophys. J., Vol. 81, 2001, pp. 3116–3136.
[16] Li, H., Y. Cao, L.R. Petzold, and D.T. Gillespie, “Algorithms and software for stochastic simulation of biochemical reacting systems,” Biotechnol. Prog., Vol. 24, 2008, pp. 56–61.
[17] Dallon, J.C., “Numerical aspects of discrete and continuum hybrid models in cell biology,” Appl. Numerical Math., Vol. 32, 2000, pp. 137–159.
[18] Lauffenburger, D.A., and J.L. Linderman, Receptors: Models for Binding, Trafficking, and Signaling, New York: Oxford University Press, 1993.
[19] Pirone, J.R., and T.C. Elston, “Fluctuations in transcription factor binding can explain the graded and binary responses observed in inducible gene expression,” J. Theor. Biol., Vol. 226, 2004, pp. 111–121.
[20] Altan-Bonnet, G., and R.N. Germain, “Modeling T cell antigen discrimination based on feedback control of digital ERK responses,” PLoS Biol., Vol. 3, 2005, article no. e356.
[21] Perez, O.D., and G.P. Nolan, “Phospho-proteomic immune analysis by flow cytometry: from mechanism to translational medicine at the single-cell level,” Immunol. Rev., Vol. 210, 2006, pp. 208–228.
[22] Hirschberg, K., C.M. Miller, J. Ellenberg, J.F. Presley, E.D. Siggia, R.D. Phair, and J. Lippincott-Schwartz, “Kinetic analysis of secretory protein traffic and characterization of Golgi to plasma membrane transport intermediates in living cells,” J. Cell Biol., Vol. 143, 1998, pp. 1485–1503.
[23] Slepchenko, B.M., J.C. Schaff, J.H. Carson, and L.M. Loew, “Computational cell biology: spatiotemporal simulation of cellular events,” Annu. Rev. Biophys. Biomol. Struct., Vol. 31, 2002, pp. 423–441.
[24] Reynolds, A.R., C. Tischer, P.J. Verveer, O. Rocks, and P.I.H. Bastiaens, “EGFR activation coupled to inhibition of tyrosine phosphatases causes lateral signal propagation,” Nat. Cell Biol., Vol. 5, 2003, pp. 447–453.
[25] Janetopoulos, C., L. Ma, P.N. Devreotes, and P.A. Iglesias, “Chemoattractant-induced phosphatidylinositol 3,4,5-trisphosphate accumulation is spatially amplified and adapts, independent of the actin cytoskeleton,” Proc. Natl. Acad. Sci. USA, Vol. 101, 2004, pp. 8951–8956.
[26] Schneider, I.C., and J.M. Haugh, “Quantitative elucidation of a distinct spatial gradient-sensing mechanism in fibroblasts,” J. Cell Biol., Vol. 171, 2005, pp. 883–892.
[27] Weng, G., U.S. Bhalla, and R. Iyengar, “Complexity in biological signaling systems,” Science, Vol. 284, 1999, pp. 92–96.
[28] Bhalla, U.S., P.T. Ram, and R. Iyengar, “MAP kinase phosphatase as a locus of flexibility in a mitogen-activated protein kinase signaling network,” Science, Vol. 297, 2002, pp. 1018–1023.
[29] Schoeberl, B., C. Eichler-Jonsson, E.D. Gilles, and G. Muller, “Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors,” Nat. Biotechnol., Vol. 20, 2002, pp. 370–375.
[30] Hatakeyama, M., S. Kimura, T. Naka, T. Kawasaki, N. Yumoto, M. Ichikawa, J. Kim, K. Saito, M. Saeki, M. Shirouzu, S. Yokoyama, and A. Konagaya, “A computational model on the modulation of mitogen-activated protein kinase (MAPK) and Akt pathways in heregulin-induced ErbB signalling,” Biochem. J., Vol. 373, 2003, pp. 451–463.
[31] Kiyatkin, A., E. Aksamitiene, N.I. Markevich, N.M. Borisov, J.B. Hoek, and B.N. Kholodenko, “Scaffolding protein Grb2-associated binder 1 sustains epidermal growth factor-induced mitogenic and survival signaling by multiple positive feedback loops,” J. Biol. Chem., Vol. 281, 2006, pp. 19925–19938.
[32] Gutenkunst, R.N., J.J. Waterfall, F.P. Casey, K.S. Brown, C.R. Myers, and J.P. Sethna, “Universally sloppy parameter sensitivities in systems biology models,” PLoS Comp. Biol., Vol. 3, 2007, article no. e189.
[33] Zwolak, J.W., J.J. Tyson, and L.T. Watson, “Parameter estimation for a mathematical model of the cell cycle in frog eggs,” J. Comput. Biol., Vol. 12, 2005, pp. 48–63.
[34] Sible, J.C., and J.J. Tyson, “Mathematical modeling as a tool for investigating cell cycle control networks,” Methods, Vol. 41, 2007, pp. 238–247.
[35] Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” J. Chem. Phys., Vol. 21, 1953, pp. 1087–1092.
[36] Hansmann, U.H.E., and Y. Okamoto, “New Monte Carlo algorithms for protein folding,” Curr. Opin. Struct. Biol., Vol. 9, 1999, pp. 177–183.
[37] Gonzalez, O.R., C. Kuper, K. Jung, P.C. Naval, and E. Mendoza, “Parameter estimation using simulated annealing for S-system models of biochemical networks,” Bioinformatics, Vol. 23, 2007, pp. 480–486.
[38] Brown, K.S., and J.P. Sethna, “Statistical mechanical approaches to models with many poorly known parameters,” Phys. Rev. E, Vol. 68, 2003, article no. 021904.
[39] Kolch, W., “Meaningful relationships: the regulation of the Ras/Raf/MEK/Erk pathway by protein interactions,” Biochem. J., Vol. 351, 2000, pp. 289–305.
[40] Wang, C.-C., M. Cirit, and J.M. Haugh, “PI3K-dependent crosstalk interactions converge with Ras as quantifiable inputs integrated by Erk,” Mol. Syst. Biol., Vol. 5, 2009, article no. 246.
[41] Heldin, C.-H., and B. Westermark, “Mechanism of action and in vivo role of platelet-derived growth factor,” Physiol. Rev., Vol. 79, 1999, pp. 1283–1316.
[42] Vanhaesebroeck, B., S.J. Leevers, K. Ahmadi, J. Timms, R. Katso, P.C. Driscoll, R. Woscholski, P.J. Parker, and M.D. Waterfield, “Synthesis and function of 3-phosphorylated inositol lipids,” Annu. Rev. Biochem., Vol. 70, 2001, pp. 535–602.
[43] Hawkins, P.T., K.E. Anderson, K. Davidson, and L.R. Stephens, “Signalling through Class I PI3Ks in mammalian cells,” Biochem. Soc. Trans., Vol. 34, 2006, pp. 647–662.
[44] Engelman, J.A., J. Luo, and L.C. Cantley, “The evolution of phosphatidylinositol 3-kinases as regulators of growth and metabolism,” Nat. Rev. Genet., Vol. 7, 2006, pp. 606–619.
[45] Hao, N., M. Behar, T.C. Elston, and H.G. Dohlman, “Systems biology analysis of G protein and MAP kinase signaling in yeast,” Oncogene, Vol. 26, 2007, pp. 3254–3266.
[46] Dohlman, H.G., and J.W. Thorner, “Regulation of G protein-initiated signal transduction in yeast: Paradigms and principles,” Annu. Rev. Biochem., Vol. 70, 2001, pp. 703–754.
[47] Wang, Y.Q., and H.G. Dohlman, “Pheromone signaling mechanisms in yeast: A prototypical sex machine,” Science, Vol. 306, 2004, pp. 1508–1509.
[48] Truckses, D.M., L.S. Garrenton, and J. Thorner, “Jekyll and Hyde in the microbial world,” Science, Vol. 306, 2004, pp. 1509–1511.
[49] Sabbagh, W., L.J. Flatauer, A.J. Bardwell, and L. Bardwell, “Specificity of MAP kinase signaling in yeast differentiation involves transient versus sustained MAPK activation,” Mol. Cell, Vol. 8, 2001, pp. 683–691.
[50] Hao, N., N. Yildirim, S.C. Parnell, M.J. Nagiec, R.H. Shanks, B. Errede, H.G. Dohlman, and T.C. Elston, “A computational analysis of feedback regulation as a mechanism for signaling specificity in yeast,” in revision, 2009.
[51] Errede, B., A. Gartner, Z. Zhou, K. Nasmyth, and G. Ammerer, “MAP kinase-related FUS3 from S. cerevisiae is activated by STE7 in vitro,” Nature, Vol. 362, 1993, pp. 261–264.
[52] Errede, B., and Q.Y. Ge, “Feedback regulation of MAP kinase signal pathways,” Philos. Trans. R. Soc. Lond. B Biol. Sci., Vol. 351, 1996, pp. 143–149.
[53] Zhou, Z., A. Gartner, R. Cade, G. Ammerer, and B. Errede, “Pheromone-induced signal transduction in Saccharomyces cerevisiae requires the sequential function of three protein kinases,” Mol. Cell. Biol., Vol. 13, 1993, pp. 2069–2080.
[54] Maleri, S., Q. Ge, E.A. Hackett, Y. Wang, H.G. Dohlman, and B. Errede, “Persistent activation by constitutive Ste7 promotes Kss1-mediated invasive growth but fails to support Fus3-dependent mating in yeast,” Mol. Cell. Biol., Vol. 24, 2004, pp. 9221–9238.
CHAPTER 5
Construction of Phenotype-Specific Gene Network by Synergy Analysis

Xuerui Yang, Xuewei Wang, Ming Wu, Ertugrul Dalkic, and Christina Chan*1,2,3

1 Department of Chemical Engineering and Material Science, Michigan State University, East Lansing, MI 48824
2 Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824
3 Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824
*e-mail: Christina Chan at [email protected]
Abstract

Complex cellular activities are believed to be coordinately regulated by genes that function in a network. Reconstructing these gene networks can provide insights into the molecular mechanisms of cell physiology and thus represents a fundamental challenge in systems biology. However, elucidating phenotype-gene interactions through the reconstruction of context-specific gene networks remains a challenge. In this chapter, we present a methodology that integrates multi-level biological data to infer a cooperative gene network with respect to a specific phenotype. Our method introduces the concept of synergy and builds a network that consists of gene pairs with significant synergistic relations, which implies cooperation. We apply our method to reconstruct a synergistic gene network for saturated free fatty acid (FFA)-induced cytotoxicity, and analyze the properties of the network. Scale-free characteristics and multiple hub genes are found in the network, revealing many important cooperative candidate genes in regulating the FFA-induced cytotoxicity. These candidates are supported by the literature.

Key terms
Synergy; gene network reconstruction; phenotype-specific; free fatty acid; metabolite; cytotoxicity
5.1 Introduction

The development of diseases can be traced to abnormal activities of the cells in specific tissues or parts of the body. Hepatic disorders, such as steatosis and nonalcoholic steatohepatitis (NASH), are associated with saturated free fatty acid (FFA)-induced cytotoxicity of liver cells [1, 2]. Cellular activities are tuned by regulatory machineries involving genes, proteins, and metabolites. For instance, saturated FFA-induced cytotoxicity is coordinately regulated by a set of genes that interact in a complex network [3]. Therefore, reconstructing the gene networks that give rise to the different phenotypes may provide insights into the cellular mechanisms involved, ultimately, in the development of diseases and disorders [4, 5]. This chapter describes a methodology to reconstruct phenotypic and context-specific gene networks based on the assumption that only a subset of the genes is relevant to the target phenotype. The phenotype addressed in this chapter is saturated FFA-induced cytotoxicity.

Methods of gene selection relevant to a phenotype have been based predominantly on fold changes in the genes across different conditions or correlations between the genes and the phenotype, using statistical tests [6] or correlation measures [7], respectively. Statistical tests typically yield too many genes for analysis, while correlation measures only select genes that are statistically correlated with the phenotype, thereby missing potentially relevant genes that are not highly correlated to the phenotype. Incorporating prior information, such as gene set enrichment analysis [8] or trend profiles, into the gene selection methods can help to alleviate these problems; however, the source and quality of the prior knowledge will affect the results.

FFAs modulate intracellular metabolic pathways involved in glucose [9], triglyceride (TG) [10], and amino acid [11] metabolism. Tuned by the gene network [3, 7], some of these alterations are involved in the induction of cytotoxicity by saturated FFAs [9–11]. Therefore, integrating multiple levels of information (i.e., gene expression and metabolite profiles) would better reflect the “multilevel” characteristic of cellular activities, such as saturated FFA-induced cytotoxicity, and thereby aid in the selection of genes that are involved in the observed phenotype, and in the reconstruction of a phenotype-specific gene network.

Various methods, such as correlation [12], mutual information [13, 14], and Bayesian network analysis [15], have been used to construct gene networks. These methods do not directly incorporate the phenotype in identifying the gene interactions. Instead, they typically build a gene network for each of the conditions, and compare the networks to identify the gene interactions that are specific to a condition or phenotype. Consequently, these methods are computationally expensive and sensitive to the quality of data. Since the size (i.e., the number of genes included in the network reconstruction) and the noise level of the samples can affect the networks reconstructed for each of the conditions, it is difficult to determine whether the differences in the networks across the conditions are real changes in the mechanisms or simply an artifact due to the size or noise levels. Alternatively, methods have been developed to select sets of gene pairs relevant to a phenotype based on classification models, such as support vector machine [16, 17], decision tree [18], and probabilistic model [19].
Intuitively, if a phenotype prediction based on a pair of genes performs better than one based on either gene alone,
then the pair of genes is suggested to have cooperative effects on the phenotype. However, these classification methods fail to differentiate the cooperative effects of the gene pairs from the independent contributions of the individual genes [20]. To address this shortcoming, we present a method that distinguishes the difference in the cooperative versus individual effects of the genes.

Biological activities are regulated by multiple factors, many of which function cooperatively (i.e., synergistically). The basic idea is that the whole (i.e., the regulatory system) is greater than the sum of the individual parts (i.e., regulators) of a system. Synergy is defined as the “additional” contribution provided by the “whole” as compared to the sum of the contributions of the individual “parts.” An example of synergy can be seen with transcription factors, such as GATA4 and dHAND, which together cooperatively and dramatically up-regulate cardiac (target) gene expression levels more significantly than the sum of the effects from either of the transcription factors alone [21]. The concept of synergy will be used in this chapter to assess the cooperative effect of two genes on a phenotype. In a multivariate system, the synergistic effect of two factors on a phenotype is the gain in the “mutual information” over the sum of the information provided by each factor on a phenotype. A positive synergy denotes that two factors regulate a phenotype, either cooperatively (e.g., co-activation) or antagonistically (e.g., competitive inhibition). Thus, one can predict the phenotype with a certain confidence from either of the two factors; however, knowing both factors brings additional information, which enhances the confidence of the prediction. Negative synergy denotes redundancy; thus, knowing both factors brings redundant information to the prediction of the phenotype. Zero synergy denotes that at least one of the two factors has no effect on the phenotype, and therefore brings neither additional nor redundant information to the prediction of the phenotype.

Systematic assessment of synergy was first applied in neuroscience, where the goal was to understand the neural code by evaluating the strength of correlations between the neurons upon activation by a stimulus [22, 23]. More recently the concept of synergy has been applied to the field of systems biology [24–26]. Investigators developed an information theoretic measure of synergy from discretized gene expression data and applied this measure to identify cooperative gene interactions associated with neural interconnectivity [24] and prostate cancer development [25]. More recently, the concept of synergy and the information theoretic measure of synergy have been applied directly to continuous gene expression data [20].

In this chapter we introduce an integrative methodology to reconstruct phenotype-specific gene networks based on synergy analysis. First, we select the phenotype-specific genes by integrating the gene expression and metabolite profiles in the context of saturated FFA-induced cytotoxicity. Next, we assess the synergistic effects between the gene pairs. Unlike other computational methods used to identify gene interactions, the fundamental concept of synergy is to identify the cooperative gene interactions responsible for the phenotype, and these cooperative gene interactions may or may not be direct interactions. Finally, with the identified synergistic gene pairs, we build a synergy network.
Topological analyses reveal the structural characteristics of the network while the hub genes provide insights into potential mechanism(s) involved in the induction of the phenotype (i.e., saturated FFA-induced cytotoxicity).
5.2 Experimental Design

Human hepatoblastoma cells (HepG2/C3A) were used for the study. These cells offer the advantages of ease of culture and experimentation, are of human origin, and have been shown to retain many hepatospecific functions; they have therefore been suggested as a good model cell system for hepatocellular functions such as lipid metabolism [27] and fatty acid transport [28]. Cell lines offer an advantage over primary hepatocytes in that they are more amenable to genetic manipulation.

Two different types of free fatty acids were employed (i.e., saturated and monounsaturated fatty acids), corresponding to the major dietary fractions. Palmitic acid was chosen as the representative saturated fatty acid and oleic acid as the monounsaturated fatty acid. They are the major fatty acids of their classes found in serum/plasma. While the total concentration of FFAs in plasma may reach the millimolar range under pathological conditions [29], most studies of obese/type 2 diabetic patients have reported fatty acid concentrations of about 0.7 mM. We therefore used 0.7 mM as the standard concentration of palmitate and oleate.

The proposed experiments and methodology for reconstructing the phenotype-specific gene network with synergy analysis, summarized in the flowchart of Figure 5.1, are as follows.

1. Identify the phenotype of interest: cytotoxicity levels induced by saturated FFA.
2. Obtain metabolite data.
3. Obtain gene-expression data with cDNA microarray.
4. Select the metabolites that are associated with the phenotype (i.e., cytotoxicity).
5. Select the genes by matching the trend of the metabolite and gene profiles.
6. Obtain synergistic gene pairs by calculating their synergy scores.
7. Construct a synergy network with the gene pairs that are significantly synergistic.

Topological analysis of the synergy network revealed the type, structure, and other characteristics of the network. Statistical analysis of the synergistic gene pairs yielded hub genes (i.e., the genes that appeared most frequently in the synergistic gene pairs). The high frequency of the hub genes suggests that the hub genes are potentially important in producing the phenotype.

Figure 5.1 Flowchart of the proposed methodology.
5.3 Materials

5.3.1 Cell culture and reagents

HepG2 cells were cultured in Dulbecco’s Modified Eagle Medium (DMEM) (Invitrogen, Carlsbad, California) with 10% fetal bovine serum (FBS) (Biomeda Corp., Foster City, California) and penicillin-streptomycin (penicillin: 10,000 U/ml; streptomycin: 10,000 μg/ml) (Invitrogen, Carlsbad, California). Freshly trypsinized HepG2 cells were suspended at 5 × 10⁵ cells/ml in standard HepG2 culture medium and seeded at a density of 10⁶ cells per well in standard 6-well tissue culture plates. After seeding, the cells were incubated at 37°C in a 90% air/10% CO2 atmosphere, and 2 ml of fresh medium was supplied to the cultures every other day after removal of the supernatant. The HepG2 cells were cultured in standard medium for 5 to 6 days to reach 90% confluence before treatment with FFAs or other additives. HepG2 cell number was assessed by trypan blue dye exclusion using a hemocytometer.
5.3.2 Fatty acid salt treatment

Sodium salts of palmitate (P9767) and oleate (O7501) were purchased from Sigma-Aldrich. Palmitate or oleate was complexed to 0.7 mM bovine serum albumin (BSA, fatty acid free) dissolved in the media, which mimics the physiological concentration of albumin in human blood (3.5% to 5% [30]). In all the experiments, the vehicle (0.7 mM BSA) was used as the control. Fatty acid free BSA was purchased from MP Biomedicals (Chillicothe, Ohio).
5.4 Methods

5.4.1 Cytotoxicity measurement

HepG2 cells were cultured in different media for 24 hours and the supernatants collected. Cells were washed with PBS and kept in 1% Triton X-100 in PBS for 24 hours at 37°C. Cell lysate was then collected, vortexed for 15 seconds, and centrifuged at 7,000 rpm for 5 minutes. A cytotoxicity detection kit (Roche Applied Science, Indianapolis, Indiana) was used to measure the LDH levels in the supernatants and in the cell lysates. The fraction of LDH released into the medium was normalized to the total LDH (LDH released into the medium + LDH remaining in the cell lysates) [31].
5.4.2 Gene expression profiling

Cells were cultured in 10-cm tissue culture plates until confluence and then exposed to different treatments. RNA was isolated with Trizol reagent. The gene expression profiles were obtained with cDNA microarray. Analyses were done at the Van Andel Institute,
Grand Rapids, Michigan. The procedure of the microarray analysis was described previously [5].
5.4.3 Metabolite measurements

The fluxes of the various metabolites were measured according to [32, 33]. The concentrations of glucose, lactate, FFA, and glycerol were measured using enzymatic kits from Sigma-Aldrich, while beta-hydroxybutyrate and triglycerides were measured using enzymatic kits from Stanbio Laboratories. These metabolites were assayed according to the manufacturers’ instructions. The concentration of acetoacetate in the media was measured by an enzymatic fluorimetric assay [34]. Concentrations of Asp, Glu, Gly, NH3, Arg, Thr, Ala, Pro, Tyr, Val, Met, Orn, Lys, Ile, Leu, and Phe were measured by the AccQTag amino acid analysis method (Waters) coupled with fluorescence detection. The concentrations of Ser, Asn, Gln, and His were measured by a modification of the AccQTag method. Cystine concentration in the media and supernatants was measured using HPLC according to a previously published protocol [35]. All the measured fluxes were normalized to total protein in the cell extract, measured with the bicinchoninic acid (BCA) method (Pierce Chemicals, Rockford, Illinois).
5.4.4 Gene selection based on trends of metabolites

The statistical significance of the changes in the metabolite levels across the conditions (i.e., BSA (control), palmitate, and oleate) was assessed using a two-sample t-test for each metabolite. Eleven metabolites differed significantly across the three conditions, and four representative trends were extracted from these metabolites (Figure 5.2). After removing the ESTs/hypothetical proteins and ORFs of unknown function from the list of ~20,000 genes, 7,394 genes remained. Genes with expression patterns that matched the four representative metabolite trends were selected, and a two-sample t-test was applied to each gene to assess the significance of its fold change across the different conditions. Finally, 610 genes were selected from the full list of 7,394 genes. The p-value cutoff was set at 0.05.
5.4.5 Calculation of the synergy scores of gene pairs

An information theory–based score was calculated to quantify the synergy between the genes [26]. Given two genes, G1 and G2, and a phenotype P, the synergy score between G1 and G2 with respect to the phenotype P is defined as

Syn(G1, G2; P) = I(G1, G2; P) − [I(G1; P) + I(G2; P)]

where I(G1; P) is the mutual information between G1 and P, I(G2; P) is the mutual information between G2 and P, and I(G1, G2; P) is the mutual information between the pair (G1, G2) and P. This equation reflects the definition of synergy: the additional contribution provided by the “whole” as compared to the sum of the contributions of the individual “parts.” Mutual information (I) was calculated using a clustering-based method from continuous data [20].
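The sketch below computes this score with a simple histogram discretization and scikit-learn's mutual_info_score. The chapter's analysis instead uses a clustering-based estimator for continuous data [20] and normalizes the scores to the [−1, 1] range, so the bin count and the omitted normalization here are simplifying assumptions.

import numpy as np
from sklearn.metrics import mutual_info_score

def synergy_score(g1, g2, phenotype, bins=3):
    # Discretize each gene's continuous expression values into `bins` levels.
    d1 = np.digitize(g1, np.histogram_bin_edges(g1, bins)[1:-1])
    d2 = np.digitize(g2, np.histogram_bin_edges(g2, bins)[1:-1])
    joint = d1 * bins + d2                      # encode the gene pair as one discrete variable
    # Syn(G1, G2; P) = I(G1, G2; P) - [I(G1; P) + I(G2; P)], with P given as discrete class labels
    return (mutual_info_score(joint, phenotype)
            - mutual_info_score(d1, phenotype)
            - mutual_info_score(d2, phenotype))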
Figure 5.2 Four representative trends of the metabolites. Eleven metabolites differed significantly across the three conditions, and four representative trends were extracted from these metabolites. Trend I: BSA < Palm and Palm > Ole; Trend II: BSA > Palm and Palm < Ole; Trend III: BSA < Palm < Ole; Trend IV: BSA > Palm > Ole.
The synergy scores range from −1 to 1. A positive synergy score indicates that the two genes jointly provide additional information about the phenotype, a negative synergy score indicates that the two genes provide redundant information about the phenotype, and a zero score indicates that the two genes provide no additional information about the phenotype. The 610 genes that were selected based on the metabolite trends generated 185,745 gene pairs, of which 436 pairs had significant synergy scores.
5.4.6 Permutation test to evaluate the significance of the synergy

A permutation test was performed to assess the statistical significance of the synergy of the gene pairs. The phenotype labels (i.e., toxic and nontoxic) were randomly permuted so that they were uncorrelated with the gene expression profiles, and the synergy scores of the gene pairs were then recalculated based on the permuted phenotype. This process was repeated 100 times to calculate a p-value for the synergy score of each gene pair. Finally, the Benjamini-Hochberg false discovery rate procedure [36] was applied to adjust the p-values for all the gene pairs and thereby control the expected proportion of false discoveries. The p-value cutoff was set at 0.05.
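A bare-bones version of this significance test is sketched below; it reuses the hypothetical synergy_score function from the previous sketch and implements the standard Benjamini-Hochberg step-up rule, so the function names and the 100-permutation default are illustrative rather than the authors' exact code.

import numpy as np

def permutation_pvalue(g1, g2, phenotype, observed, n_perm=100, seed=None):
    # Null distribution: synergy scores recomputed after shuffling the phenotype labels.
    rng = np.random.default_rng(seed)
    null = np.array([synergy_score(g1, g2, rng.permutation(phenotype)) for _ in range(n_perm)])
    return (np.sum(null >= observed) + 1) / (n_perm + 1)     # add-one rule avoids p = 0

def benjamini_hochberg(pvals, alpha=0.05):
    # Returns a boolean mask of gene pairs declared significant after FDR adjustment.
    p = np.asarray(pvals)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, p.size + 1) / p.size
    passed = p[order] <= thresholds
    n_keep = passed.nonzero()[0].max() + 1 if passed.any() else 0
    mask = np.zeros(p.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask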
5.4.7 Characterization of the network topology

A synergy network was built with the gene pairs that have statistically significant synergy scores. The network is composed of nodes that represent the genes and edges that represent the synergy of the gene pairs. Graph theoretical (topological) analysis of the reconstructed gene network was used to assess the generated network and how it compares with other biological networks [37]. We characterized the topology of the synergy network by its degree distribution and shortest path lengths. The degree distribution gives the distribution of the number of edges associated with the nodes. The shortest path length is the lowest number of edges that connect two nodes and is measured using a breadth-first search algorithm [38].
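For completeness, a short sketch of how these topological quantities can be computed with the NetworkX library is given below; the edge-list input and the summary statistics chosen are assumptions for illustration, not a prescription of the software actually used.

import networkx as nx

def topology_summary(synergistic_pairs):
    # Each element of synergistic_pairs is a (gene_a, gene_b) tuple with a significant synergy score.
    G = nx.Graph()
    G.add_edges_from(synergistic_pairs)
    degrees = dict(G.degree())                               # number of edges per gene
    hubs = sorted(degrees, key=degrees.get, reverse=True)[:5]
    # Shortest path lengths between all reachable node pairs (breadth-first search on an unweighted graph).
    lengths = [d for _, targets in nx.all_pairs_shortest_path_length(G)
               for d in targets.values() if d > 0]
    mean_path_length = sum(lengths) / len(lengths)
    return degrees, hubs, mean_path_length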
5.5 Data Acquisition, Anticipated Results, and Interpretation

In the proposed methodology, the phenotype (cytotoxicity), metabolite, and gene expression profiles were collected as “inputs” and integrated as described above. The anticipated results include the representative trends of the metabolites relevant to the phenotype, genes that match the representative trends of the metabolites, and gene pairs with significant synergy scores. Gene pairs that have statistically significant synergy scores indicate possible membership of the gene pairs in a shared pathway or potential cross-talk between different pathways. Graphical representation of these synergistic gene pairs yields a network of gene-gene interactions that are associated with the phenotype. Topology analysis of the synergy network reveals the characteristics of the network, such as degree distribution, modularity, and centrality. The hub genes, which have the highest number of edges, may be central regulators in the induction of the phenotype.
5.6 Discussion and Commentary

Phenotype-specific gene network reconstruction is a useful approach to extract gene interaction information from microarray data and to help provide insight into disease mechanisms. Multiple methods (e.g., meta-analysis [39] and pair-wise relevance [40]) have been developed to reconstruct gene networks associated with diseases. The information theoretic measure of synergy provides a convenient method to identify cooperative gene pairs with respect to a phenotype. Therefore, in this chapter we present an alternative strategy to reconstruct phenotype-specific networks based on the concept of synergy.

A major concern that often arises in network reconstruction using microarray data is the high computational cost. To alleviate this limitation, we preselect a subset of genes by matching the trend of their expression profiles, across the different conditions, to the profiles of the phenotype-relevant metabolites. This step reduces the number of genes to be analyzed. Concomitantly, the trend-based analysis allows the incorporation of prior knowledge of the gene expression patterns. In other words, it permits the inclusion of genes of particular interest that are known to be related to the phenotype, whether or not these gene profiles are statistically correlated with the phenotype.

Gene pairs with statistically significant synergy scores suggest potential combinatorial effects of those genes on the phenotype. However, the scores cannot distinguish between the types of combinatorial effects, such as additive or antagonistic. Nevertheless, this limitation can be addressed by integrating physical interaction data (i.e., protein-protein and protein-DNA interaction) into the analysis.

In addition, the proposed analysis pipeline consists of several steps (see Figure 5.1), including selection of genes relevant to the phenotype, calculation of the synergy scores for each gene pair, evaluation of the statistical significance of the synergy score, and analysis of the network topology to identify biologically relevant genes. The methods for these steps are not limited to those proposed in this chapter; alternative methods for each of the steps in the framework can be used. For example, pattern recognition methods can be used to select the genes, discretization-based entropy estimation can be used to calculate the synergy score, a Bayesian FDR control procedure can be used to evaluate the statistical significance of the synergy score, and so on. A comparison of the different methods for each step could be performed to determine the optimal procedure for each step.
5.7 Application Notes

Based upon the concept of synergy, we reconstructed a synergy network specifically for the phenotype of saturated FFA-induced cytotoxicity. In this application, we first selected the phenotype-relevant genes by integrating the metabolites altered by saturated FFAs [11] with the global gene expression profile and extracting the genes that followed the trends of the metabolites. From the selected genes, the synergy analysis revealed synergistic gene pairs, which were used to build a synergy network. The reconstructed network suggested potential gene targets that may play central roles in the induction of the phenotype.
Troubleshooting Table

Problem: Too many genes are selected.
Possible cause: The criterion for gene selection is too loose.
Solution: Use stricter criteria in the statistical test (i.e., a lower p-value cutoff); incorporate prior knowledge to remove the genes that are irrelevant to the research target; try different gene selection methods.

Problem: Too few genes are selected.
Possible cause: The criterion for gene selection is too strict.
Solution: Relax the statistical test (i.e., use a higher p-value cutoff); add more genes of interest based on prior knowledge; try different gene selection methods.

Problem: The synergy network is too big for interpretation.
Possible cause: False positives (nonsynergistic gene pairs) are included.
Solution: Run more permutation tests to reduce the variance; use a lower p-value cutoff.

Problem: The synergy network is too small.
Possible cause: The criterion of the permutation test is too strict.
Solution: Use a higher p-value cutoff.
5.7.1 Topological characteristics of the synergy network

The synergy network, shown in Figure 5.3, is composed of 292 genes with 436 connecting edges. The synergy network is characterized by relatively short path lengths, ranging from 2 to 10 (Figure 5.4), and the characteristic path length, or average diameter, of the network is 4.872. The network demonstrates the small-world characteristics of real networks [41], suggesting that the propagation of communication between the genes is relatively fast. The degree distribution, P(k), gives the probability that a randomly selected vertex has k links to its neighbors. A power-law distribution implies that P(k) ~ k^−γ, where k is the degree and γ is the degree exponent; in most biological, scale-free networks γ ranges between 2 and 3 [41]. The degree distribution of our synergy network is shown in Figure 5.5, and γ ∼ 2, suggesting that the synergy network, similar to other biological networks, is scale-free. Therefore, most of the genes are sparsely connected, while few of the
Figure 5.3 The synergy network. The synergy network is composed of 292 genes with 436 connecting edges. The size of each node indicates its degree.
Figure 5.4 The distribution of shortest path lengths in the synergy network (frequency versus shortest path length).
Figure 5.5 The degree distribution of the synergy network. For a degree value k, Ndegree is the number of genes with degree k in the network. The log-log fit gives Log(Ndegree) = −1.9 Log(degree) + 2.3005.
genes (hubs) are connected to many genes and play important roles in sustaining the integrity of the network, which suggests their importance in the biological function or the phenotype. In summary, the topology of the synergy network differs from the bell-like Poisson distribution characteristic of a random, statistically homogeneous network and suggests the existence of hub genes.
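The degree exponent reported above can be estimated from a network's degree sequence with a simple log-log regression, as sketched below; applied to a degree distribution like the one in Figure 5.5, this kind of fit yields the slope of roughly −1.9 (γ ≈ 2) shown there. The function name and the least-squares fit are illustrative choices, and maximum-likelihood estimators are often preferred for power laws.

import numpy as np
from collections import Counter

def degree_exponent(degrees):
    # degrees: mapping from gene to degree, e.g., the dict returned by the topology sketch above.
    counts = Counter(d for d in degrees.values() if d > 0)   # N(k): number of genes with degree k
    k = np.array(sorted(counts))
    n_k = np.array([counts[x] for x in k])
    slope, intercept = np.polyfit(np.log10(k), np.log10(n_k), 1)
    return -slope, intercept                                 # gamma is the negative of the log-log slope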
5.7.2 Hub genes in the network

The genes in the synergy network are listed and ranked by their degree (http://www.chems.msu.edu/groups/chan/gene pairs and hub genes.xls). Table 5.1 lists the genes with the highest degree, which are therefore the “hub genes” of the synergy network. These genes include P4HA1, AHDC1, MACF1, INSIG2, and SH3RF2.

P4HA1, proline 4-hydroxylase, alpha polypeptide, is the alpha subunit of the protein proline 4-hydroxylase (P4H). P4H is a key enzyme involved in the biosynthesis of collagens [42, 43]. Collagens, the most abundant proteins in the extracellular matrix (ECM) and the main protein of connective tissues, play important roles in regulating cellular activities and are linked to multiple diseases, including cardiovascular diseases and cancers [44–46]. In the liver, alteration in the synthesis of collagen is related to liver cell apoptosis, hepatic fibrosis, and cirrhosis [47–49]. Physiologically, the synthesis, processing, secretion, and degradation of collagens are tightly modulated by regulatory factors, including P4HA1. As a critical functional subunit of P4H, P4HA1 is involved in the
Table 5.1 The Hub Genes in the Synergy Network

Gene Symbol   Degree   Full Name
P4HA1         22       procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), alpha polypeptide
AHDC1         20       AT hook, DNA binding motif, containing 1
MACF1         19       microtubule-actin crosslinking factor 1
INSIG2        18       insulin induced gene 2
SH3RF2        13       SH3 domain containing ring finger 2
…
post-translational modification of procollagen [42, 50]. As shown in Figure 5.6, P4HA1 is located within the endoplasmic reticulum and catalyzes the post-translational formation of 4-hydroxyproline in the -Xaa-Pro-Gly- sequences in procollagens, which is essential for proper folding of the procollagen polypeptide [50]. Inhibiting P4H generates unstable intracellular collagens which cannot be secreted [51]. On the other hand, over-expressing P4HA1 causes excess synthesis of collagen [52]. The deregulation of P4H or P4HA1 in collagen synthesis has been linked to cytotoxicity and apoptosis in various types of cells [53–56]. P4HA1 regulates the ECM components by controlling the synthesis and secretion of procollagens, which modulates fibrosis, cell proliferation, and apoptosis [53–56].

The role of P4HA1 in palmitate-induced lipotoxicity is unclear. In our data, however, the level of P4HA1 is significantly down-regulated in palmitate as compared to the control (p=0.0014) and oleate (p=0.018) samples. Therefore, it is plausible that palmitate-induced cytotoxicity is mediated, in part, by P4HA1 through altered synthesis or improper folding of collagen peptides, although experimental validations are needed to confirm this hypothesis.

In addition, a number of proteins have been identified or predicted to interact with P4HA1, from publicly available protein-protein interaction databases, such as KEGG [58] and STRING [59, 60]. An interaction network (Figure 5.7) obtained from STRING shows some of the proteins that potentially interact with P4HA1. P4HA1 is centrally positioned
Figure 5.6 P4HA1 catalyzes the formation of 4-hydroxyproline: procollagen-L-proline and 2-ketoglutarate are converted by proline 4-hydroxylase to procollagen trans-4-hydroxy-L-proline and succinate. P4HA1 (prolyl 4-hydroxylase) is a key enzyme in collagen synthesis. (Graph adapted from the MetaCyc database (http://www.metacyc.org/) [57].)
Figure 5.7 The protein-protein interaction network associated with P4HA1. Network was generated from the STRING database (http://string71.embl.de/) [59, 60]. P4HA1 is located in the middle of the network, which is shown in red.
and integrates different gene clusters in this network, supporting our result that P4HA1 is a hub gene (with the highest degree, 22) in the synergy network.

AHDC1, AT hook, DNA binding motif, containing 1, contains 2 AT hook DNA binding domains and can be phosphorylated upon DNA damage, probably by ATM or ATR [61]. Although the in vivo function or phenotype of this gene has not been identified, it is known that AHDC1 encodes seven different isoforms, some of which contain HMG-I and HMG-Y, DNA-binding domains [62]. HMG proteins are involved in nucleosome phasing, 3’ end processing of mRNA transcripts, and transcription of genes close to AT-rich regions, and are thereby related to the pathogenesis of inflammatory and autoimmune diseases [63–65]. Although the function of AHDC1 is currently unknown, it may be involved in DNA damage and inflammatory responses. The significant alteration of this gene by palmitate (p=0.015 for palmitate versus BSA and 0.014 for palmitate versus oleate) and the identification of this gene in the synergy network hint at the possibility that palmitate may affect DNA damage and inflammatory responses through AHDC1.

MACF1, microtubule-actin crosslinking factor 1, also called ACF7 (actin cross-linking factor 7), is a member of the spectraplakin family of cytoskeletal cross-linking proteins that possess actin- and microtubule-binding domains [66, 67]. It may be involved in microtubule dynamics to facilitate actin-microtubule interactions at the cell periphery and in coupling the microtubule network to cellular junctions [62]. Cell-cell contact
and cell-surface interactions through the cytoskeleton and ECM are involved in the control and regulation of cell motility, tissue remodeling, gene expression, differentiation, and proliferation [68]. In the literature, a large number of cytoskeletal and ECM genes were found to be down-regulated by palmitate treatment [69, 70]. Therefore, it is possible that our method has identified two central genes (i.e., P4HA1 and MACF1) in mediating the effect of palmitate on the ECM and cytoskeletal structure.

INSIG2, insulin-induced gene 2, encodes a protein that blocks the processing of sterol regulatory element binding proteins (SREBPs), which regulate human lipogenic and adipocyte metabolism [71]. As shown in Figure 5.8, in the endoplasmic reticulum (ER), SREBP cleavage-activating protein (SCAP) can bind to the regulation domain of SREBP and transfer SREBP into the Golgi apparatus, where SCAP activates the protease S1P to cleave the regulation domain and activate the transcription activation/DNA binding domain of SREBP. Activated SREBP then can be transported to the nucleus to bind to the cis-element SRE to promote the expression of a series of enzymes that are involved in lipid synthesis [72]. INSIG2 can bind to SCAP and inhibit its function, thereby blocking lipid synthesis [73]. Indeed, reduced INSIG2 levels in adipocytes resulted in SREBP activation, which increased the expression of genes involved in adipogenesis [74]. In our system of HepG2 cells, microarray analysis found that the gene expression levels of INSIG2 were down-regulated by both palmitate (p=0.011) and oleate (p=0.14). Therefore, the suppression of INSIG2 expression likely contributes to the increased TG synthesis observed in the FFA cultures [10].

SH3RF2, SH3 domain containing ring finger 2, is a putative protein whose function is unknown, but from sequence analysis SH3RF2 contains 3 Src homology 3 (SH3) domains and a RING-type zinc finger domain (Figure 5.9, from the InterPro database). The RING-type zinc finger domain is found in many E3 ubiquitin-protein ligases [75]. E3 ubiquitin-protein ligases determine the substrate specificity for ubiquitination and are therefore involved in targeting proteins for degradation by the ubiquitin-proteasome system [76]. During this process, RING fingers, by interacting with E2 ubiquitin-conjugating enzymes, promote ubiquitination [77, 78]. Although the exact function of SH3RF2 is not known, the RING-type zinc finger domain suggests SH3RF2 as a putative
Figure 5.8 INSIG2 plays an important role in lipid synthesis [72, 73]. (Figure adapted from [72].) By binding to SCAP, INSIG2 blocks the processing of SREBPs and therefore suppresses lipid synthesis.
Figure 5.9 The protein domains of SH3RF2. Figure modified from the InterPro database. SH3RF2 contains 3 Src homology 3 (SH3) domains and a RING-type zinc finger domain.
E3 ubiquitin-protein ligase, which may be involved in the protein degradation pathway through the ubiquitin-proteasome system. In addition, SH3RF2 contains SH3 domains, which are present in many proteins involved in intracellular signal transduction pathways [79, 80]. SH3 domains recognize and bind to proline-rich motifs (-X-P-P-X-P-) on their partner proteins. SH3 domains are therefore used by signaling proteins to direct protein-protein interactions and thereby specify distinct regulatory pathways mediated by different protein binding domains [81, 82]. Taken together, the SH3 domains of SH3RF2 could potentially serve as targeting domains that determine the substrate specificity of SH3RF2 as a putative E3 ubiquitin-protein ligase, thereby enabling SH3RF2 to direct certain types of proteins to degradation via the ubiquitin-proteasome system. Interestingly, this gene is significantly up-regulated in the palmitate culture (p=0.019 for palmitate versus BSA and 0.025 for palmitate versus oleate). This result is consistent with the literature suggesting that palmitate induces protein degradation. Chronic palmitate treatment increases the levels of unfolded or misfolded proteins, which induces endoplasmic reticulum (ER) stress [83, 84]. This lends support to our finding that P4HA1 was down-regulated in palmitate, suggesting the possibility that procollagen may be misfolded in the palmitate culture as compared to the oleate and control cultures. The accumulation of unfolded or misfolded proteins activates the ubiquitin-proteasome system. Indeed, recent studies found that palmitate strongly enhances ubiquitination by activating E3 ubiquitin ligases, resulting in enhanced protein degradation. Therefore, SH3RF2, a putative ubiquitin-protein ligase, may be involved in palmitate-induced cytotoxicity by triggering ubiquitination. In addition, because SH3RF2 is composed of three SH3 domains and a zinc finger domain, all of which are protein-protein interaction domains, it could recruit a diverse set of proteins for degradation, supporting its central position in our synergy network. Therefore, the synergy network identified a novel protein, SH3RF2, which potentially plays a central role in palmitate-induced cytotoxicity. The domain knowledge of this protein suggests that SH3RF2 may act by recruiting unfolded proteins and triggering ubiquitination. In summary, we built a synergy network specifically for palmitate-induced cytotoxicity. This network is scale-free and has multiple hub genes. The hub genes are related to cellular activities such as cell-cell contact, cytotoxicity, metabolic pathways, and protein degradation, which may play important roles in palmitate-induced cytotoxicity. These hub genes therefore suggest potential mechanisms involved in palmitate-induced cytotoxicity.
5.8 Summary Points
In this chapter, we have achieved the following goals:
1. Integrated gene and metabolite profiles to identify a select group of genes that may be involved in palmitate-induced cytotoxicity.
2. Reconstructed a phenotype-specific synergy network. Topology analysis of the synergy network revealed scale-free characteristics and multiple hub genes, which are features shared by many biological networks (a minimal degree-analysis sketch follows this list). These hub genes suggest potential mechanisms and may be targets for modulating palmitate-induced cytotoxicity.
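As a pointer to how the topology analysis summarized in point 2 can be reproduced, the following is a minimal sketch using the networkx package. The edge list here is a hypothetical stand-in for the synergy network, and ranking genes by degree plus inspecting the degree histogram is only the first step of a scale-free assessment, not the full analysis described in this chapter.

# Minimal topology sketch: degree-based hub identification on a synergy network.
# The edge list below is hypothetical; in practice it would come from the
# synergy analysis described in this chapter.
import networkx as nx

edges = [("P4HA1", "MACF1"), ("P4HA1", "INSIG2"), ("P4HA1", "SH3RF2"),
         ("P4HA1", "AHDC1"), ("SH3RF2", "AHDC1"), ("MACF1", "INSIG2")]

G = nx.Graph()
G.add_edges_from(edges)

# Rank genes by degree; the highest-degree nodes are candidate hub genes.
for gene, k in sorted(G.degree(), key=lambda pair: pair[1], reverse=True):
    print(gene, k)

# A rough scale-free check: the degree distribution on a log-log plot should
# be approximately linear. hist[k] = number of nodes with degree k.
hist = nx.degree_histogram(G)
print(hist)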
Acknowledgments This research was supported in part by a Michigan State University (MSU) Quantitative Biology and Modeling Initiative Fellowship, the MSU Foundation, the National Science Foundation (BES 0425821 and DBI 0701709), and the National Institutes of Health (R01GM079688-01, R21CA126136-01, R21RR024439, and R21GM075838).
References
[1] Scheen, A.J., and F.H. Luyckx, “Obesity and liver disease,” Best Pract. Res. Clin. Endocrinol. Metab., Vol. 16, No. 4, December 2002, pp. 703–716.
[2] Farrell, G.C., and C.Z. Larter, “Nonalcoholic fatty liver disease: from steatosis to cirrhosis,” Hepatology, Vol. 43, No. 2, Suppl. 1, February 2006, pp. S99–S112.
[3] Li, Z., S. Srivastava, S. Mittal, X. Yang, L. Sheng, and C. Chan, “A Three Stage Integrative Pathway Search (TIPS) framework to identify toxicity relevant genes and pathways,” BMC Bioinformatics, Vol. 8, 2007, p. 202.
[4] Said, M.R., T.J. Begley, A.V. Oppenheim, D.A. Lauffenburger, and L.D. Samson, “Global network analysis of phenotypic effects: protein networks and toxicity modulation in Saccharomyces cerevisiae,” Proc. Natl. Acad. Sci. USA, Vol. 101, No. 52, December 28, 2004, pp. 18006–18011.
[5] Srivastava, S., Z. Li, X. Yang, M. Yedwabnick, S. Shaw, and C. Chan, “Identification of genes that regulate multiple cellular processes/responses in the context of lipotoxicity to hepatoma cells,” BMC Genomics, Vol. 8, 2007, p. 364.
[6] Tusher, V.G., R. Tibshirani, and G. Chu, “Significance analysis of microarrays applied to the ionizing radiation response,” Proc. Natl. Acad. Sci. USA, Vol. 98, No. 9, April 24, 2001, pp. 5116–5121.
[7] Li, Z., and C. Chan, “Integrating gene expression and metabolic profiles,” J. Biol. Chem., Vol. 279, No. 26, June 25, 2004, pp. 27124–27137.
[8] Li, Z., S. Srivastava, X. Yang, S. Mittal, P. Norton, J. Resau, B. Haab, and C. Chan, “A hierarchical approach employing metabolic and gene expression profiles to identify the pathways that confer cytotoxicity in HepG2 cells,” BMC Syst. Biol., Vol. 1, 2007, p. 21.
[9] Lam, T.K., A. Carpentier, G.F. Lewis, G. van de Werve, I.G. Fantus, and A. Giacca, “Mechanisms of the free fatty acid-induced increase in hepatic glucose production,” Am. J. Physiol. Endocrinol. Metab., Vol. 284, No. 5, May 2003, pp. E863–E873.
[10] Listenberger, L.L., X. Han, S.E. Lewis, S. Cases, R.V. Farese, Jr., D.S. Ory, and J.E. Schaffer, “Triglyceride accumulation protects against fatty acid-induced lipotoxicity,” Proc. Natl. Acad. Sci. USA, Vol. 100, No. 6, March 18, 2003, pp. 3077–3082.
[11] Li, Z., S. Srivastava, R. Findlan, and C. Chan, “Using dynamic gene module map analysis to identify targets that modulate free fatty acid induced cytotoxicity,” Biotechnol. Prog., Vol. 24, No. 1, January–February 2008, pp. 29–37.
[12] Eisen, M.B., P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci. USA, Vol. 95, No. 25, December 8, 1998, pp. 14863–14868.
[13] Liang, K.C., and X. Wang, “Gene regulatory network reconstruction using conditional mutual information,” EURASIP J. Bioinform. Syst. Biol., 2008, p. 253894.
[14] Basso, K., A.A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, and A. Califano, “Reverse engineering of regulatory networks in human B cells,” Nat. Genet., Vol. 37, No. 4, April 2005, pp. 382–390.
[15] Pe’er, D., A. Regev, G. Elidan, and N. Friedman, “Inferring subnetworks from perturbed expression profiles,” Bioinformatics, Vol. 17, Suppl. 1, 2001, pp. S215–S224.
[16] Furey, T.S., N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, Vol. 16, No. 10, October 2000, pp. 906–914.
[17] Tang, E.K., P.N. Suganthan, and X. Yao, “Gene selection algorithms for microarray data based on least squares support vector machine,” BMC Bioinformatics, Vol. 7, 2006, p. 95.
[18] Diaz-Uriarte, R., and S. Alvarez de Andres, “Gene selection and classification of microarray data using random forest,” BMC Bioinformatics, Vol. 7, 2006, p. 3.
[19] Paul, T.K., and H. Iba, “Gene selection for classification of cancers using probabilistic model building genetic algorithm,” Biosystems, Vol. 82, No. 3, December 2005, pp. 208–225.
[20] Watkinson, J., X. Wang, T. Zheng, and D. Anastassiou, “Identification of gene interactions associated with disease from gene expression data using synergy networks,” BMC Syst. Biol., Vol. 2, 2008, p. 10.
[21] Dai, Y.S., P. Cserjesi, B.E. Markham, and J.D. Molkentin, “The transcription factors GATA4 and dHAND physically interact to synergistically activate cardiac gene expression through a p300-dependent mechanism,” J. Biol. Chem., Vol. 277, No. 27, July 5, 2002, pp. 24390–24398.
[22] Schneidman, E., W. Bialek, and M.J. Berry, 2nd, “Synergy, redundancy, and independence in population codes,” J. Neurosci., Vol. 23, No. 37, December 17, 2003, pp. 11539–11553.
[23] Brenner, N., S.P. Strong, R. Koberle, W. Bialek, and R.R. de Ruyter van Steveninck, “Synergy in a neural code,” Neural Comput., Vol. 12, No. 7, July 2000, pp. 1531–1552.
[24] Varadan, V., D.M. Miller, 3rd, and D. Anastassiou, “Computational inference of the molecular logic for synaptic connectivity in C. elegans,” Bioinformatics, Vol. 22, No. 14, July 15, 2006, pp. e497–e506.
[25] Varadan, V., and D. Anastassiou, “Inference of disease-related molecular logic from systems-based microarray analysis,” PLoS Comput. Biol., Vol. 2, No. 6, June 16, 2006, p. e68.
[26] Anastassiou, D., “Computational analysis of the synergy among multiple interacting genes,” Mol. Syst. Biol., Vol. 3, 2007, p. 83.
[27] Cianflone, K., H. Vu, Z. Zhang, and A.D. Sniderman, “Effects of albumin on lipid synthesis, apo B-100 secretion, and LDL catabolism in HepG2 cells,” Atherosclerosis, Vol. 107, No. 2, June 1994, pp. 125–135.
[28] Guo, W., N. Huang, J. Cai, W. Xie, and J.A. Hamilton, “Fatty acid transport and metabolism in HepG2 cells,” Am. J. Physiol. Gastrointest. Liver Physiol., Vol. 290, No. 3, March 2006, pp. G528–G534.
[29] Artwohl, M., M. Roden, W. Waldhausl, A. Freudenthaler, and S.M. Baumgartner-Parzer, “Free fatty acids trigger apoptosis and inhibit cell cycle progression in human vascular endothelial cells,” FASEB J., Vol. 18, No. 1, January 1, 2004, pp. 146–148.
[30] Peters, T., All About Albumin: Biochemistry, Genetics, and Medical Applications, San Diego, CA: Academic Press, 1996.
[31] Srivastava, S., and C. Chan, “Hydrogen peroxide and hydroxyl radicals mediate palmitate-induced cytotoxicity to hepatoma cells: Relation to mitochondrial permeability transition,” Free Radic. Res., Vol. 41, No. 1, January 2006, pp. 38–49.
[32] Chan, C., F. Berthiaume, K. Lee, and M.L. Yarmush, “Metabolic flux analysis of cultured hepatocytes exposed to plasma,” Biotechnol. Bioeng., Vol. 81, No. 1, January 5, 2003, pp. 33–49.
[33] Chan, C., F. Berthiaume, K. Lee, and M.L. Yarmush, “Metabolic flux analysis of hepatocyte function in hormone- and amino acid-supplemented plasma,” Metab. Eng., Vol. 5, No. 1, January 2003, pp. 1–15.
[36] Benjamini, Y., and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society, Series B (Methodological), Vol. 57, No. 1, 1995, pp. 289–300.
[37] Christensen, C., A. Gupta, C.D. Maranas, and R. Albert, “Large-scale inference and graph-theoretical analysis of gene-regulatory networks in B. subtilis,” Physica A: Statistical Mechanics and Its Applications, Vol. 373, January 1, 2007, pp. 796–810.
[38] Cormen, T.H., C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, Cambridge, MA: MIT Press, 2001.
[39] Rasche, A., H. Al-Hasani, and R. Herwig, “Meta-analysis approach identifies candidate genes and associated molecular networks for type-2 diabetes mellitus,” BMC Genomics, Vol. 9, 2008, p. 310.
[40] Jiang, W., X. Li, S. Rao, L. Wang, L. Du, C. Li, C. Wu, H. Wang, Y. Wang, and B. Yang, “Constructing disease-specific gene networks using pair-wise relevance metric: application to colon cancer identifies interleukin 8, desmin and enolase 1 as the central elements,” BMC Syst. Biol., Vol. 2, 2008, p. 72.
[41] Barabasi, A.L., and Z.N. Oltvai, “Network biology: Understanding the cell’s functional organization,” Nature Reviews Genetics, Vol. 5, No. 2, February 2004, pp. 101–113.
[42] Chen, L., Y.H. Shen, X. Wang, J. Wang, Y. Gan, N. Chen, S.A. LeMaire, J.S. Coselli, and X.L. Wang, “Human prolyl-4-hydroxylase alpha(I) transcription is mediated by upstream stimulatory factors,” J. Biol. Chem., Vol. 281, No. 16, April 21, 2006, pp. 10849–10855.
[43] Annunen, P., H. Autio-Harmainen, and K.I. Kivirikko, “The novel type II prolyl 4-hydroxylase is the main enzyme form in chondrocytes and capillary endothelial cells, whereas the type I enzyme predominates in most cells,” J. Biol. Chem., Vol. 273, No. 11, March 13, 1998, pp. 5989–5992.
[44] Bedossa, P., and V. Paradis, “Liver extracellular matrix in health and disease,” J. Pathol., Vol. 200, No. 4, July 2003, pp. 504–515.
[45] Rodriguez-Feo, J.A., J.P. Sluijter, D.P. de Kleijn, and G. Pasterkamp, “Modulation of collagen turnover in cardiovascular disease,” Curr. Pharm. Des., Vol. 11, No. 19, 2005, pp. 2501–2514.
[46] Stone, P.J., “Potential use of collagen and elastin degradation markers for monitoring liver fibrosis in schistosomiasis,” Acta Trop., Vol. 77, No. 1, October 23, 2000, pp. 97–99.
[47] Bickel, M., K.H. Baringhaus, M. Gerl, V. Gunzler, J. Kanta, L. Schmidts, M. Stapf, G. Tschank, K. Weidmann, and U. Werner, “Selective inhibition of hepatic collagen accumulation in experimental liver fibrosis in rats by a new prolyl 4-hydroxylase inhibitor,” Hepatology, Vol. 28, No. 2, August 1998, pp. 404–411.
[48] Clement, B., C. Chesne, A.P. Satie, and A. Guillouzo, “Effects of the prolyl 4-hydroxylase proinhibitor HOE 077 on human and rat hepatocytes in primary culture,” J. Hepatol., Vol. 13, Suppl. 3, 1991, pp. S41–S47.
[49] Faouzi, S., B. Le Bail, V. Neaud, L. Boussarie, J. Saric, P. Bioulac-Sage, C. Balabaud, and J. Rosenbaum, “Myofibroblasts are responsible for collagen synthesis in the stroma of human hepatocellular carcinoma: an in vivo and in vitro study,” J. Hepatol., Vol. 30, No. 2, February 1999, pp. 275–284.
[50] Myllyharju, J., “Prolyl 4-hydroxylases, the key enzymes of collagen biosynthesis,” Matrix Biol., Vol. 22, No. 1, March 2003, pp. 15–24.
[51] Rocnik, E.F., B.M. Chan, and J.G. Pickering, “Evidence for a role of collagen synthesis in arterial smooth muscle cell migration,” J. Clin. Invest., Vol. 101, No. 9, May 1, 1998, pp. 1889–1898.
[52] John, D.C., R. Watson, A.J. Kind, A.R. Scott, K.E. Kadler, and N.J. Bulleid, “Expression of an engineered form of recombinant procollagen in mouse milk,” Nat. Biotechnol., Vol. 17, No. 4, April 1999, pp. 385–389.
[53] Xia, S.H., J. Wang, and J.X. Kang, “Decreased n-6/n-3 fatty acid ratio reduces the invasive potential of human lung cancer cells by downregulation of cell adhesion/invasion-related genes,” Carcinogenesis, Vol. 26, No. 4, April 2005, pp. 779–784.
[54] Huet, C., P. Monget, C. Pisselet, and D. Monniaux, “Changes in extracellular matrix components and steroidogenic enzymes during growth and atresia of antral ovarian follicles in the sheep,” Biol. Reprod., Vol. 56, No. 4, April 1997, pp. 1025–1034.
[55] Dong, M.S., S.H. Jung, H.J. Kim, J.R. Kim, L.X. Zhao, E.S. Lee, E.J. Lee, J.B. Yi, N. Lee, Y.B. Cho, W.J. Kwak, and Y.I. Park, “Structure-related cytotoxicity and anti-hepatofibric effect of asiatic acid derivatives in rat hepatic stellate cell-line, HSC-T6,” Arch. Pharm. Res., Vol. 27, No. 5, May 2004, pp. 512–517.
[56] Ju, H., J. Hao, S. Zhao, and I.M. Dixon, “Antiproliferative and antifibrotic effects of mimosine on adult cardiac fibroblasts,” Biochim. Biophys. Acta, Vol. 1448, No. 1, November 19, 1998, pp. 51–60.
[57] Caspi, R., H. Foerster, C.A. Fulcher, P. Kaipa, M. Krummenacker, M. Latendresse, S. Paley, S.Y. Rhee, A.G. Shearer, C. Tissier, T.C. Walk, P. Zhang, and P.D. Karp, “The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases,” Nucleic Acids Res., Vol. 36, Database Issue, January 2008, pp. D623–D631.
[58] Aoki, K.F., and M. Kanehisa, “Using the KEGG database resource,” Curr. Protoc. Bioinformatics, October 2005, Unit 1.12.
[59] von Mering, C., L.J. Jensen, B. Snel, S.D. Hooper, M. Krupp, M. Foglierini, N. Jouffre, M.A. Huynen, and P. Bork, “STRING: known and predicted protein-protein associations, integrated and transferred across organisms,” Nucleic Acids Res., Vol. 33, Database Issue, January 1, 2005, pp. D433–D437.
[60] von Mering, C., L.J. Jensen, M. Kuhn, S. Chaffron, T. Doerks, B. Kruger, B. Snel, and P. Bork, “STRING 7 – recent developments in the integration and prediction of protein interactions,” Nucleic Acids Res., Vol. 35, Database Issue, January 2007, pp. D358–D362.
[61] Matsuoka, S., B.A. Ballif, A. Smogorzewska, E.R. McDonald, 3rd, K.E. Hurov, J. Luo, C.E. Bakalarski, Z. Zhao, N. Solimini, Y. Lerenthal, Y. Shiloh, S.P. Gygi, and S.J. Elledge, “ATM and ATR substrate analysis reveals extensive protein networks responsive to DNA damage,” Science, Vol. 316, No. 5828, May 25, 2007, pp. 1160–1166.
[62] Thierry-Mieg, D., and J. Thierry-Mieg, “AceView: A comprehensive cDNA-supported gene and transcripts annotation,” Genome Biol., Vol. 7, Suppl. 1, 2006, pp. S12.1–14.
[63] Voll, R.E., V. Urbonaviciute, M. Herrmann, and J.R. Kalden, “High mobility group box 1 in the pathogenesis of inflammatory and autoimmune diseases,” Isr. Med. Assoc. J., Vol. 10, No. 1, January 2008, pp. 26–28.
[64] Tesniere, A., T. Panaretakis, O. Kepp, L. Apetoh, F. Ghiringhelli, L. Zitvogel, and G. Kroemer, “Molecular characteristics of immunogenic cancer cell death,” Cell Death Differ., Vol. 15, No. 1, January 2008, pp. 3–12.
[65] Jiang, W., and D.S. Pisetsky, “Mechanisms of Disease: the role of high-mobility group protein 1 in the pathogenesis of inflammatory arthritis,” Nat. Clin. Pract. Rheumatol., Vol. 3, No. 1, January 2007, pp. 52–58.
[66] Kodama, A., I. Karakesisoglou, E. Wong, A. Vaezi, and E. Fuchs, “ACF7: An essential integrator of microtubule dynamics,” Cell, Vol. 115, No. 3, October 31, 2003, pp. 343–354.
[67] Gong, T.W., C.G. Besirli, and M.I. Lomax, “MACF1 gene structure: a hybrid of plectin and dystrophin,” Mamm. Genome, Vol. 12, No. 11, November 2001, pp. 852–861.
[68] Zamir, E., and B. Geiger, “Molecular complexity and dynamics of cell-matrix adhesions,” J. Cell Sci., Vol. 114, Pt. 20, October 2001, pp. 3583–3590.
[69] Draghici, S., P. Khatri, A.L. Tarca, K. Amin, A. Done, C. Voichita, C. Georgescu, and R. Romero, “A systems biology approach for pathway level analysis,” Genome Res., Vol. 17, No. 10, October 2007, pp. 1537–1545.
[70] Swagell, C.D., D.C. Henly, and C.P. Morris, “Expression analysis of a human hepatic cell line in response to palmitate,” Biochem. Biophys. Res. Commun., Vol. 328, No. 2, March 11, 2005, pp. 432–441.
[71] Krapivner, S., S. Popov, E. Chernogubova, M.L. Hellenius, R.M. Fisher, A. Hamsten, and F.M. van’t Hooft, “Insulin-induced gene 2 involvement in human adipocyte metabolism and body weight regulation,” J. Clin. Endocrinol. Metab., Vol. 93, No. 5, May 2008, pp. 1995–2001.
[72] Horton, J.D., J.L. Goldstein, and M.S. Brown, “SREBPs: activators of the complete program of cholesterol and fatty acid synthesis in the liver,” J. Clin. Invest., Vol. 109, No. 9, May 2002, pp. 1125–1131.
[73] Yabe, D., M.S. Brown, and J.L. Goldstein, “Insig-2, a second endoplasmic reticulum protein that binds SCAP and blocks export of sterol regulatory element-binding proteins,” Proc. Natl. Acad. Sci. USA, Vol. 99, No. 20, October 1, 2002, pp. 12753–12758.
[74] Rosen, E.D., C.J. Walkey, P. Puigserver, and B.M. Spiegelman, “Transcriptional regulation of adipogenesis,” Genes Dev., Vol. 14, No. 11, June 1, 2000, pp. 1293–1307.
[75] Freemont, P.S., “The RING finger. A novel protein sequence motif related to the zinc finger,” Ann. NY Acad. Sci., Vol. 684, June 11, 1993, pp. 174–192.
[76] Hershko, A., H. Heller, E. Eytan, and Y. Reiss, “The protein substrate binding site of the ubiquitin-protein ligase system,” J. Biol. Chem., Vol. 261, No. 26, September 15, 1986, pp. 11992–11999.
[77] Freemont, P.S., “RING for destruction?” Curr. Biol., Vol. 10, No. 2, January 27, 2000, pp. R84–R87.
[78] Barinaga, M., “A new finger on the protein destruction button,” Science, Vol. 286, No. 5438, October 8, 1999, pp. 223, 225.
[79] Cohen, G.B., R. Ren, and D. Baltimore, “Modular binding domains in signal transduction proteins,” Cell, Vol. 80, No. 2, January 27, 1995, pp. 237–248.
[80] Pawson, T., “Protein modules and signalling networks,” Nature, Vol. 373, No. 6515, February 16, 1995, pp. 573–580.
[81] Ren, R., B.J. Mayer, P. Cicchetti, and D. Baltimore, “Identification of a ten-amino acid proline-rich SH3 binding site,” Science, Vol. 259, No. 5098, February 19, 1993, pp. 1157–1161.
[82] Morton, C.J., and I.D. Campbell, “SH3 domains. Molecular ‘Velcro,’” Curr. Biol., Vol. 4, No. 7, July 1, 1994, pp. 615–617.
[83] Guo, W., S. Wong, W. Xie, T. Lei, and Z. Luo, “Palmitate modulates intracellular signaling, induces endoplasmic reticulum stress, and causes apoptosis in mouse 3T3-L1 and rat primary preadipocytes,” Am. J. Physiol. Endocrinol. Metab., Vol. 293, No. 2, August 2007, pp. E576–E586.
[84] Karaskov, E., C. Scott, L. Zhang, T. Teodoro, M. Ravazzola, and A. Volchuk, “Chronic palmitate but not oleate exposure induces endoplasmic reticulum stress, which may contribute to INS-1 pancreatic beta-cell apoptosis,” Endocrinology, Vol. 147, No. 7, July 2006, pp. 3398–3407.
CHAPTER 6
Genome-Scale Analysis of Metabolic Networks
Ranjan Srivastava, Department of Chemical, Materials and Biomolecular Engineering, University of Connecticut, 191 Auditorium Road, U-3222, Storrs, CT 06269
Abstract
Metabolic modeling, particularly at the genome scale, can be a useful tool in providing insights regarding metabolic processes for various organisms of interest. As with any other tool, however, it is subject to a number of limitations. If these limitations are kept in mind, metabolic modeling can provide a powerful means to understand and manipulate microorganisms for purposes of fundamental research, as well as for accomplishing practical objectives. These practical objectives may include the efficient production of commercially relevant products or drugs, or they may include better approaches to treating microbial pathogens. A strategy is presented here for developing, implementing, and analyzing metabolic models of prokaryotes. It is assumed that experimental data for such studies may be scarce. For this reason, an optimization-based approach for implementing metabolic models of underdetermined systems is reviewed and discussed.
Key terms
Mathematical modeling
Metabolic flux analysis
Flux balance analysis
Genome-scale
6.1 Introduction
With the advent of high-throughput processes for studying biological systems, particularly at the subcellular, cellular, and tissue levels, arranging the collected data into a coherent theoretical framework is a nontrivial task. As technology advances, no abatement of this deluge of data is in sight. On the contrary, it is likely that the amount of data generated will only increase. Fortunately, computational biology is well suited to dealing with such high volumes of information. One particular approach that has been gaining popularity in leveraging the increasing amount of genomic and metabolomic data available is metabolic modeling [1–24]. This methodology utilizes computational biology, the theory of reaction kinetics, and applied mathematics to evaluate, analyze, and ultimately engineer metabolic networks. Metabolic modeling is a powerful tool yielding great benefits for basic research, as well as being useful for applied purposes. From the basic research perspective, metabolic modeling allows one to carry out in silico experiments or computational simulations to address questions regarding the fundamentals of metabolism of various organisms. As a result, it may be used as a method for generating hypotheses or for screening which experiments will yield the most information regarding a specific question. Simulations, however, should not be considered a substitute for experiments. Rather, the model should be viewed as a tool, similar to a microscope or some other piece of equipment in the lab. A model, by definition, is an approximation of nature. Thus, results should always be confirmed experimentally. However, the modeling approach does provide significant benefits. In particular, it may provide useful insights into what an organism is doing and/or why it is behaving in a given way. Also, by determining which experiments are likely to have the highest impact in helping to understand a particular phenomenon, modeling and simulations may allow efficient direction of scarce resources, both labor and financial, to make sure the most promising experiments are carried out. Applications of metabolic modeling may be found along the biotechnological spectrum, ranging from the production of commercially important metabolites [1, 25–27] to analysis of biomedically important problems [28–30]. Through the use of metabolic and biochemical engineering, recombinant organisms have become a “workhorse” for the production of key metabolites that have significant commercial importance. Using metabolic modeling, it is possible to help determine how metabolic networks should be reconfigured or engineered to optimize production. From the biomedical perspective, by identifying how metabolic resources are distributed during a given pathology, it may be possible to identify means to treat the disease. Metabolic modeling has particular promise in helping to deal with microbial pathogens. For example, many virulence genes are regulated by carbon dioxide in pathogenic bacteria [31]; however, the link between this regulatory behavior and the many metabolic reactions in which carbon dioxide is involved has been relatively unexplored. Other potential applications of metabolic modeling include areas such as biofuels production, bioremediation, and engineering of microbial consortia to address a variety of issues facing society. The term “genome-scale” metabolic modeling refers to the development of a quantitative framework describing the entire metabolic network of an organism.
To date, most such models, whether genome-scale or not, have focused on prokaryotic organisms rather than eukaryotic organisms due to issues with intracellular compartmentalization. However, recently that has been changing [32–34]. Regardless of the type of organism used, the general approach taken to develop such a model is to start out with the annotated
genome sequence. From there, one may reconstruct the metabolic network. Specifically, genes encoding enzymes involved in metabolic reactions are identified. If a particular enzyme is present, it is inferred that the associated metabolic reaction is present. In this way, the entire metabolic network may be built up from the genome sequence. However, there are several caveats associated with such an approach. For example, if the DNA sequence for an identified gene has partial homology to an enzyme known to catalyze a metabolic reaction, how much sequence identity must it share before the reaction is considered to be present? To some extent, this particular problem may be mitigated by determining whether or not other metabolic reactions related to the pathway in question are present. If the complementary pathways are present, that may argue for inclusion of the pathway suggested by the partially matching gene sequence [35]. Additionally, there may be genes that encode enzymes with structural homology to a known metabolic enzyme but without any sequence homology. Another possible source of error is the presence of genes whose functions are currently unknown but which nonetheless interact with and impact metabolism. Finally, errors in annotating genes may also result in the incorrect and/or incomplete reconstruction of metabolic networks. Although these issues may appear to be daunting, they actually highlight one of the primary benefits of genome-scale modeling. The mathematical model is effectively a quantitative hypothesis. By carrying out simulations and comparing results to what is actually observed experimentally, mismatches between experiments and theory may be identified. Using this information as a foundation, hypotheses regarding the connectivity and functioning of the metabolic network may be revised. New simulations may be carried out and compared to experimental results, and in this way, the model should ideally approach what is observed experimentally. As this iterative process progresses, the model may be used to elucidate what is occurring in reality and furthermore guide future experiments. Once the metabolic network has been reconstructed, it is possible to convert it into a mathematical model. The conversion is accomplished through the use of the theory of reaction kinetics. Taking advantage of the knowledge of metabolite stoichiometry of the system of reactions, a system of ordinary differential equations (ODEs) describing how each metabolite varies with respect to time may be generated. Through various assumptions described in greater detail in Section 6.2, it is possible to simplify the equations from ODEs to algebraic equations. Generally speaking, at the genome-scale level, there are many more variables than there are equations, resulting in an under-determined system. It is possible to reduce the number of degrees of freedom through appropriate and well-designed experiments. Isotopic carbon labeling of substrate has proven to be a particularly useful approach [6, 10, 34, 36–40]. With sufficient data it is sometimes possible to reduce the number of degrees of freedom to the point where the system is completely determined or even over-determined. For the over-determined system, it is possible to use the extra data to provide a consistency check. If sufficient data is unavailable to render the system determined, it is still possible to carry out the metabolic analysis using an optimization strategy.
The premise of such an approach is that the organism is attempting to optimize some type of objective function. The objective function may be maximization of growth rate, optimization of energy efficiency, or some other biological process deemed appropriate. The optimization calculation may then be used to determine how metabolic resources should be distributed
across pathways in order to best achieve the stated objective function. The objective function strategy is justified by assuming that since organisms have evolved under selection pressures, they are by their nature approaching some kind of optimal level. The issue of course is: What has a given organism been optimized for? Should one even be looking at the organismal level or is selection at the species level more appropriate? The focus of this work will be to aid the researcher in developing and carrying out simulations of prokaryotic organisms where the system is underdetermined.
6.2 Materials and Methods
6.2.1 Flux analysis theory
Metabolic flux analysis (MFA) is the technique by which flux distributions through metabolic pathways are either determined or predicted [6, 41, 42]. Fluxes are calculated through the development of stoichiometric models of the metabolic reaction network. Generally the theory of reaction kinetics requires that variation of intracellular metabolites over time be described via a system of differential equations such as shown in (6.1):

\frac{dX}{dt} = r - \mu X \qquad (6.1)
where X is a vector of the metabolites of interest, r is the vector of rate expressions, and μ is the bacterial growth rate. The term μX represents the dilution of the metabolites as the cells grow. Since intracellular metabolite levels tend to be very low and the dilution term is relatively small compared to the other reactions affecting the metabolite, the dilution term is generally assumed to be negligible [6]. If the experimental system can be manipulated to operate at a steady state or if it can be assumed that the response of metabolite pools to perturbations is very rapid, the variation with respect to time may be approximated as zero. The model is then reduced to a system of algebraic equations as illustrated by (6.2):

0 = r \qquad (6.2)
It is further possible to write the rate expression in terms of the stoichiometric coefficients and their associated fluxes, such that

r = S^{T}\nu \qquad (6.3)
where S^{T} is the matrix of stoichiometric coefficients and ν is the vector of fluxes. Substituting (6.3) into (6.2) results in

0 = S^{T}\nu \qquad (6.4)
Knowing the metabolic network along with measured extracellular fluxes, it is possible to use (6.4) to determine metabolic fluxes. MFA allows one to determine a number of other cellular features. As pointed out by Stephanopoulos et al. [6], features such as nodal rigidity, alternative pathways, values of nonmeasurable fluxes, and maximum theoretical yields may be determined. Nodal rigidity refers to how much or how little the flux distribution through a given branch point will change when operating conditions are changed. Alternative pathway analysis may be required when several different pathways appear feasible, but the actual pathway is unknown. Through MFA, it may be possible to show that some of the pathways are actually not used (zero flux) or impossible (negative flux). Such an analysis was used to identify the correct pathway for citric acid fermentation in C. lipolytica [43]. Actual experimental measurement of some fluxes may simply be impossible. However, enough information from other fluxes may be available to uniquely identify what the unknown flux must be. Due to the knowledge of the stoichiometry of the reaction system, MFA also allows one to determine the maximum specific product yield for a given substrate. Flux balance analysis (FBA) is a variant of MFA. In FBA, the goal is to find all the feasible flux distributions for an organism under prescribed conditions (rich media, minimal media, and so forth) [5, 18, 41, 44]. The resulting solution is a bounded convex cone made up of all the possible flux distributions in flux space. Experiments may then be carried out to determine where within the flux cone the actual flux distributions lie [17, 19]. It is also possible to perform an in silico analysis to determine the specific flux distribution of an organism by postulating an objective function that the organism is attempting to optimize. Once the objective function is specified, linear programming may be used to determine the flux distribution [17, 19, 44, 45]. Identification of the objective function is not trivial, as is discussed in Section 6.2.3.1, and is dependent upon the environment in which the organism finds itself. The benefit of FBA is that it may be carried out at the genome scale with limited data and still provide insight into how the organism can and will behave [3, 17–20, 45–48].
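To make the use of (6.4) concrete, the following is a minimal numerical sketch of MFA for a fully determined case. The two-metabolite, three-reaction network, its stoichiometry, and the measured uptake value are hypothetical and chosen purely for illustration; they are not taken from this chapter.

# Toy metabolic flux analysis (MFA) sketch based on equation (6.4).
# Hypothetical network: v1: substrate uptake -> A, v2: A -> B, v3: B -> product.
import numpy as np

# Rows are intracellular metabolite balances (A, B); columns are fluxes (v1, v2, v3).
N = np.array([[1.0, -1.0,  0.0],    # A balance
              [0.0,  1.0, -1.0]])   # B balance

v1_measured = 10.0   # hypothetical measured extracellular uptake flux

# Move the measured flux to the right-hand side and solve the steady-state
# balances N*v = 0 for the unknown intracellular fluxes v2 and v3:
#   N[:, 1:] * [v2, v3]^T = -N[:, 0] * v1
A = N[:, 1:]
b = -N[:, 0] * v1_measured
v_unknown, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(v_unknown)   # expected [10., 10.] at steady state

For an underdetermined genome-scale system, the same balances appear as constraints of an optimization problem instead, which is the subject of Sections 6.2.3 and 6.2.4.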
6.2.2 Model development
Before a metabolic model may be generated, it is necessary to have a reconstruction of the metabolism for the organism of interest. Metabolic reconstructions for a number of organisms have already been carried out and are readily available on the Web. Two of the best-known repositories are BioCyc (http://www.biocyc.org) and the Kyoto Encyclopedia of Genes and Genomes, more commonly referred to as KEGG (http://www.genome.jp/kegg/). Both Web sites have the metabolic networks of hundreds of different organisms available. Both sites also allow users to download the metabolic networks to their personal computers in a variety of different formats, such as the Systems Biology Markup Language (SBML). With such a copy of the metabolic network available, one can use the information as the source for developing a mathematical model for simulation purposes. If the metabolic reconstruction for the organism of interest is unavailable, it may still be possible to generate the metabolic network. As long as the annotated genome sequence for the organism is known, the network can be reconstructed using the Pathway Tools software package [49, 50], details of which may be found at http://bioinformatics.ai.sri.com/ptools/ptools-overview.html. The Pathway Tools software can take various genome sequence formats, such as the GenBank Database format (generally denoted by a “.gbk” extension at the end of the sequence file), and generate a Pathway Genome Database (PGDB). The PGDB may be analyzed directly to study the metabolic network. It may also be used to generate a file in SBML or another format more amenable
for generating the mathematical model through direct parsing. An example of the E. coli network as generated by the Pathway Tools software is illustrated in Figure 6.1.
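As an illustration of the “direct parsing” step, the sketch below pulls reactions and signed stoichiometric coefficients out of an SBML file using only the Python standard library. The filename is hypothetical, and the SBML level/version namespace is an assumption; a particular BioCyc, KEGG, or Pathway Tools export may use a different namespace string, which would need to be adjusted accordingly.

# Sketch: extract reaction stoichiometry from an SBML export.
# 'ecoli_network.xml' is a hypothetical filename; adjust NS to match the
# SBML level/version of the actual file.
import xml.etree.ElementTree as ET

NS = {"sbml": "http://www.sbml.org/sbml/level2/version4"}

tree = ET.parse("ecoli_network.xml")
model = tree.getroot().find("sbml:model", NS)

stoichiometry = {}   # reaction id -> {species id: signed coefficient}
for rxn in model.findall(".//sbml:reaction", NS):
    coeffs = {}
    for ref in rxn.findall("./sbml:listOfReactants/sbml:speciesReference", NS):
        coeffs[ref.get("species")] = -float(ref.get("stoichiometry", 1.0))
    for ref in rxn.findall("./sbml:listOfProducts/sbml:speciesReference", NS):
        coeffs[ref.get("species")] = float(ref.get("stoichiometry", 1.0))
    stoichiometry[rxn.get("id")] = coeffs

print(len(stoichiometry), "reactions parsed")

The resulting dictionary maps directly onto the stoichiometric matrix used in (6.4): each metabolite defines a row, each reaction a column, and the signed coefficients fill the entries.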
6.2.3 Objective function
Given that the systems to be modeled are generally under-determined, one method by which the distribution of metabolic resources may be estimated is through the utilization of optimization theory. In order to use such an approach, it is necessary to postulate an objective function. However, determination of the objective function is far from a trivial matter. Although the objective function approach does have its uses, it is critical to be aware of the issues associated with this method so that results from such modeling exercises will be evaluated in light of these limitations.
6.2.3.1 Objective function choices
Several different objective functions have been utilized for metabolic modeling purposes. Some key ones include:
• Maximization of biomass production;
• Maximization of ATP production rate;
• Minimization of nutrient uptake rate;
• Minimization of redox potential production rate;
• Minimization of ATP production rate.
Maximization of biomass production has been by far the most popular choice as an objective function [5, 17, 19, 44, 45]. The premise for this particular choice is that an organism that can outgrow its competition is the one that will ultimately dominate its niche. Thus the organism that distributes its metabolic resources in such a way as to accomplish this objective will be in the best position to survive. Maximization of the ATP production rate is justified on the basis that by having excess ATP available, the cell is better able to leverage its existing metabolic resources [51–53]. As a result, such an organism is ultimately able to out-compete other organisms. Recent experimental studies have provided strong support for this approach as being a viable objective function in determining the distribution of metabolic fluxes [52]. Minimization of nutrient uptake rate is based on an efficiency argument [2]. The argument is best illustrated by an analogy comparing the organism of interest to a car. Given two automobiles, the first one requires a certain amount of gasoline to travel a fixed number of miles. The second car only requires half that amount of gasoline to go the same distance. Thus the second car is the better car, because it can travel the same distance using less fuel. Minimization of nutrient uptake is based on the same principle, effectively stating that given two organisms, the one capable of surviving on fewer nutrients is the more optimal one. The remaining two objective functions are based on an optimization of energy efficiency argument. Minimization of either the redox potential production rate [2, 54] or the ATP production rate [2] follows the same rationale as that of the minimization of the nutrient uptake rate. The cell capable of functioning while requiring less energy is considered the “better” cell in this scenario.
100
Figure 6.1 The network shown is an example of the E. coli metabolic network as generated by the Pathway Tools software [49, 50]. Each node represents a metabolite, with the shape specifying the type of metabolite. For example, triangles represent amino acids, while squares represent carbohydrates. The lines connecting the nodes represent the metabolic or transport reactions that the metabolites are involved in. Details on how to get and use the Pathway Tools software are available at http://bioinformatics.ai.sri.com/ptools/ptools-overview.html.
Depending upon the scenario (e.g., the environment the cells are in, the resources available, the type of competition faced), other objective functions may be more suitable for use in the modeling analysis. Indeed, if the environmental conditions change during cellular growth (i.e., going from a nutrient-rich environment to one that is nutrient poor), the objective itself may change. However, the above list should provide a fairly comprehensive set of starting points.
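To show how one of these choices is actually encoded, the following sketch casts FBA with a maximize-biomass objective as a linear program over the steady-state constraint of (6.4). The four-reaction toy network, the flux bounds, and the use of SciPy’s linprog are all assumptions made for illustration; the chapter itself uses the GLPK (Section 6.2.4), and any LP solver would serve equally well. Because linprog minimizes its objective, the biomass coefficient is negated.

# Toy FBA sketch: maximize a hypothetical biomass flux subject to S*v = 0.
# Hypothetical network: v1: uptake -> A, v2: A -> B, v3: B -> biomass, v4: B -> byproduct.
import numpy as np
from scipy.optimize import linprog

S = np.array([[1.0, -1.0,  0.0,  0.0],    # metabolite A balance
              [0.0,  1.0, -1.0, -1.0]])   # metabolite B balance

c = np.array([0.0, 0.0, -1.0, 0.0])   # minimize -v3, i.e., maximize the biomass flux
bounds = [(0.0, 10.0),                 # hypothetical uptake limit
          (0.0, None),
          (0.0, None),
          (0.0, None)]

result = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
print(result.x)      # optimal flux distribution
print(-result.fun)   # maximized biomass flux (10 for this toy problem)

Switching to another objective from the list above amounts to changing the cost vector c (and its sign), while the stoichiometric constraints stay the same.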
6.2.3.2 Objective function determination/evaluation
The choice of an objective function may not be obvious for a given organism. As a result, a researcher might consider several different objective functions. Various methods exist to help evaluate the choice and quality of the various objective functions. These methodologies may be divided into two broad categories. One is the use of an optimization-based approach, while the second involves a probabilistic analysis. Although the two methods are complementary to each other, only a brief description of the optimization approach is provided here; the probabilistic approach is described in more detail. Note that each of these methods requires a means of carrying out the optimization in order to evaluate the quality of the objective function. Details on how this might be accomplished are provided in Section 6.2.4. The method developed by Burgard and Maranas [55] utilizes an inverse optimization approach for inferring or disproving various objective functions. A weighted combination of fluxes is maximized, where the weighting factors are referred to as coefficients of importance or CoI. The CoIs are calculated with reference to experimental flux data and are determined in such a way that they sum to one. As a result, by looking at the value of the CoI, it is possible to determine the importance of the contribution of the particular flux being weighted. Since the objective function is ultimately a combination of fluxes, if the CoI is low, then that particular flux is not truly contributing to the objective function. If the CoI is high, then the flux is indeed appropriate for the objective function. The probabilistic approach was developed by Knorr et al. [54] based on the work of Stewart, Box, and others [56, 57]. The method involves carrying out a Bayesian-based model discrimination analysis to determine the posterior probability of each objective function of interest. To facilitate the approach, the posterior probability of an objective is normalized to the sum of all of the evaluated posterior probabilities in what is referred to as the posterior probability share. The posterior probability shares for each objective function may be compared. The objective function with the highest probability share is the most likely objective function relative to the other objective functions being evaluated. It is critical to note that this approach will always result in a “best” objective function. However, if all of the objective functions are poor choices, then this method will pick the best of the poor choices. It does not change the fact that the final objective function selected may not be a good one if all of the objective functions evaluated are of poor quality. It is therefore very important that all of the objective functions be assessed with a critical eye. The basis for determining the posterior probability shares for the objective begins with the proportionality described in (6.5):

p(M_j \mid Y) \propto p(M_j)\, 2^{-p_j/2}\, \left| v_j \right|^{-\delta/2} \qquad (6.5)
where M_j is the objective function; p(M_j) is the prior probability of M_j; Y is the matrix of weighted experimental data, where the weighting is simply the reciprocal of the standard deviation for the appropriate response value, as described by Stewart et al. [57]; p_j is the number of parameters estimated in M_j; and δ is the number of available degrees of freedom. Given that the parameters for the flux modeling are essentially the stoichiometric coefficients, and they are already known, there are no further parameters to be estimated [24]. As a result, unless some modified variation of the metabolic analysis is carried out, p_j will have a value of 0. The matrix v_j represents the products of the deviations of the data from the values predicted by the model for objective function M_j, evaluated at the maximum likelihood estimate of the parameter vector θ_j. The ikth element may be calculated via (6.6):

v_{ik}(\hat{\theta}_j) = \sum_{u=1}^{n} \left[ Y_{iu} - F_{ji}(\xi_u, \hat{\theta}_j) \right] \left[ Y_{ku} - F_{jk}(\xi_u, \hat{\theta}_j) \right] \qquad (6.6)

where F_{ji} is the weighted model prediction, described in more detail below, for objective function M_j, which is a function of the vector of independent variables and the parameter vector, denoted by ξ_u and θ_j, respectively. As mentioned for Y, the weighting for F_{ji} is the reciprocal of the standard deviation for the appropriate response value [57]. Note that the weighting is also calculated for Y_{iu}. The subscripts i and k denote specific response values, while u represents the experimental run in which the data was collected. To calculate the normalized posterior probability share, the individual posterior probability is simply normalized to the sum of all of the calculated posterior probabilities:

\pi(M_j \mid Y) = \frac{p(M_j \mid Y)}{\sum_k p(M_k \mid Y)} \qquad (6.7)
The objective function with the largest value of π is the most likely one given the experimental data set Y. This last point is a critical one. As more data, or more accurate data, are generated, the results of the analysis may change. Thus it is generally a good idea to revisit the assessment of the objective function when new data are obtained.
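A small numeric sketch may make the bookkeeping in (6.5) through (6.7) concrete. It adopts the chapter’s assumption that p_j = 0 (so the 2^{-p_j/2} factor is 1) and further assumes equal priors over two candidate objectives; the candidate names, the weighted residual matrices v_j, and the degrees of freedom are all hypothetical stand-ins for quantities that would come from (6.6) and the experimental design.

# Sketch of posterior probability shares, equations (6.5)-(6.7), assuming
# p_j = 0 and equal priors. The residual matrices v_j are hypothetical.
import numpy as np

v = {
    "max_biomass": np.array([[2.0, 0.3], [0.3, 1.5]]),
    "max_ATP":     np.array([[4.0, 0.8], [0.8, 3.0]]),
}
delta = 6       # available degrees of freedom (hypothetical)
prior = 0.5     # equal prior probability for each candidate objective

# Unnormalized posterior: p(M_j|Y) proportional to p(M_j) * |v_j|^(-delta/2)
posterior = {name: prior * np.linalg.det(vj) ** (-delta / 2.0)
             for name, vj in v.items()}

total = sum(posterior.values())
shares = {name: p / total for name, p in posterior.items()}   # equation (6.7)
print(shares)   # the objective with the largest share is favored by the data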
6.2.3.3 Caveats
The very notion that the organism has a specific objective function has significant biological ramifications. It supposes that the organism is trying to accomplish one goal above and beyond all others. Furthermore, it suggests that the objective function is only being applied at one scale, that of the organism. It neglects the possibility of application of an objective function at a different scale or across multiple scales. For example, imagine a population of organisms where the objective is the survival of the population as a whole, ultimately resulting in enhanced survivability for the individuals. In this scenario, it may be that the organisms are operating in a suboptimal fashion at the individual level. This suboptimal operation may be due to the production of
metabolites by the organism to help its neighbors survive. However, those resources that the organism is using to help its neighbor survive could have been used for itself. Additionally, the resulting metabolic burden on the organism from aiding its neighbor may slow down its growth rate, which might hinder its ability to compete as effectively for resources. If, despite these setbacks at the individual level, the chances for survival of the population increase, then the population of suboptimal individuals will be selected for evolutionarily. However, under this scenario, selection of any objective function for optimization to model the organism at an individual level is unlikely to provide good estimations of the metabolic distribution. Even when focusing on the organismal scale is appropriate, there are still critical issues the modeler must be aware of. The choice of one objective function over another may be appropriate at a given time. However, this choice is critically dependent upon the context of the system. In other words, the environment in which the organism finds itself will impact what type of objective function is appropriate. If the environment changes, the objective function may very well change. However, under such circumstances, it is quite likely that there will be a shift in the distribution of metabolic fluxes. As a result, the assumption that the organism is operating at a “steady state” will no longer be valid, undermining the development of (6.2). It is still possible to utilize the metabolic modeling approach described here. However, the analysis would have to be broken up into two phases. The first phase would be prior to the environmental shift or perturbation and would utilize the first objective function. The second phase would begin after enough time had passed since the environmental shift such that the organism had adapted to its new state. At this point, calculations would be based on the use of the second objective function. Another issue to be aware of is the possibility of the existence of multiple simultaneous objective functions. At the organismal level, it is possible to deal with this problem through appropriate construction/selection of the objective function. However, when dealing with multiple multiscale objective functions, the problem becomes increasingly difficult to the point of intractability. Without a doubt, the best scenario is one in which sufficient experimental data is available such that the system is determined or overdetermined. Under such circumstances, an optimization approach is unnecessary, and one does not have to speculate as to what the objective function for an organism might be. However, in the scenario where such data is not available, the optimization route can prove useful in providing an approximation of what is occurring within the organism of interest, as long as the above caveats are kept in mind. Furthermore, in addition to clarifying ongoing questions, the metabolic modeling may help identify new questions, generate hypotheses which may be tested experimentally, and provide a previously unknown research thrust. Such results in this age of high throughput biotechnology can prove to be extremely valuable in aiding researchers to wade through the deluge of data being generated.
6.2.4 Optimization
Once the mathematical model is generated and the objective function is chosen, it is possible to carry out the optimization. If the model is generated in accord with the approach described in (6.4) and the objective function is constructed so that it is also linear in nature, the problem may be cast as a linear programming one. A number of fine
commercial software packages are available for solving linear programming problems. However, a high-quality free package is also available from the GNU Project and Free Software Foundation that has been used successfully to optimize genome-scale models [54]. Specifically, the GNU Linear Programming Kit (GLPK) (http://www.gnu.org/software/glpk/) may be freely downloaded and is capable of running on a variety of major computing platforms. Significant documentation is also available, making the GLPK relatively easy to install and run. If the model is developed in some other fashion resulting in nonlinear constraints, or if a nonlinear objective function is chosen, the GLPK will not suffice. It will then be necessary to find another optimization software package capable of dealing with nonlinear problems. Once again, many fine commercial software packages are available for carrying out such analysis.
6.3 Data Acquisition, Anticipated Results, and Interpretation
Once the appropriate objective function is identified and the simulation is run, there are two possible outcomes. The first is that a feasible solution is determined. The second is that no feasible solution can be determined. Both results can be informative regarding the metabolism of the organism in question.
6.3.1 Feasible solution determined
Ideally, upon carrying out the simulations, the value of the resulting objective function will be optimized and a feasible solution will result. In such a situation, the distribution of the metabolic fluxes may be analyzed to determine how metabolic resources are being allocated by the organism. Given these results, the next steps are generally dependent upon the goals of the researcher. If basic research into the fundamentals of metabolism is being studied, then hopefully old questions will have been resolved or at least hinted at; new questions will inevitably arise. If the purpose is to engineer the organism to optimize production of a metabolite or protein, then the resulting simulations should provide some insight into what the next steps should be to accomplish the stated goal. Regardless of what is being studied, in all cases it is imperative that the simulation results be verified experimentally. It is especially critical to do so before another round of simulations is carried out. If there is some discrepancy in the first round of simulations that is not identified through experimental analysis, then the second round of simulations will be built upon a flawed foundation. The resulting inaccuracies will propagate through future simulations, leading to erroneous results. Another point the researcher must be cognizant of is whether the results of the simulation make sense. For example, if there are not very many experimentally determined constraints on the organism, then optimization of the objective function may become trivialized. Specifically, if it is determined that the production of a given metabolite is to be maximized, then, if the pathway exists, the simulation will predict that all of the substrate is converted into that metabolite. In reality, the organism will require distribution of the substrate through other metabolic pathways for growth, energy production, and so on. By including such constraints explicitly, the researcher forces the simulation to account
for the distribution of resources along pathways that might be suboptimal for the production of the given metabolite. However, in reality, without utilizing the other metabolic pathways, the organism simply may not survive. Thus it is up to the researcher to evaluate the simulation results with a critical eye.
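The following short sketch illustrates this point about explicit constraints: requiring a minimum growth (biomass) flux keeps a maximize-the-product simulation from trivially routing all substrate to the product. The toy network and bound values are hypothetical, and SciPy’s linprog again stands in for the GLPK solver discussed in Section 6.2.4.

# Sketch: constraining growth while maximizing a product flux.
# Hypothetical network: v1: uptake -> A, v2: A -> B, v3: B -> biomass, v4: B -> product.
import numpy as np
from scipy.optimize import linprog

S = np.array([[1.0, -1.0,  0.0,  0.0],
              [0.0,  1.0, -1.0, -1.0]])

c = np.array([0.0, 0.0, 0.0, -1.0])   # maximize the product flux v4
bounds = [(0.0, 10.0),                 # hypothetical uptake limit
          (0.0, None),
          (2.0, None),                 # require at least 2 flux units of biomass
          (0.0, None)]

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
print(res.x)   # roughly [10, 10, 2, 8]: the product is capped by the growth demand

Without the lower bound on the biomass flux, the same problem would send all 10 units of substrate to the product, which is exactly the trivialized outcome described above.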
6.3.2 No feasible solution determined
Oftentimes after carrying out the optimization process, it will not be possible to determine a feasible solution. Such a result may be due to an incomplete or incorrect constraint. For example, assume that based on the metabolic reconstruction, the mass balance constraint for metabolite X was determined to be

X:\ \nu_1 + \nu_2 + \nu_3 = 0 \qquad (6.8)
Because there are only source terms and no sink terms, any nonzero value for the fluxes would result in an accumulation of metabolite X, violating the steady state assumption and the resulting constraint. The only possible solution that is consistent with the above constraint is if all of the fluxes are 0. However, if experimental considerations, or the contributions of any of the given fluxes to other constraints, require those fluxes to take nonzero values, then the optimization of the system cannot yield a feasible solution. In such a scenario, it is possible the metabolite is participating in a hitherto unknown reaction where the metabolite may be a reactant. The resulting constraint would then have a form similar to the following:

X:\ \nu_1 + \nu_2 + \nu_3 - \nu_? = 0 \qquad (6.9)
where the flux, v?, may now act as the sink term. Such a simulation result actually turns out to be quite useful, as it highlights metabolites which might be participating in reactions not previously known. As a result, it may be possible to design experiments in which these particular metabolites are traced to determine how they are ultimately distributed throughout the cell. To identify whether a given metabolic constraint is causing problems, it is only necessary to comment out the constraint from the input file to the GLPK software. It is important to realize that removing the constraint is not the equivalent of removing the metabolite. Removal of the constraint simply means the mass balance is not closed, which provides the flexibility needed to allow for potential participation of the metabolite in other reactions.
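The sketch below reproduces this diagnosis numerically: forcing one of the producing fluxes of X to be nonzero makes the balance in (6.8) unsatisfiable, and dropping (“commenting out”) that balance restores feasibility, as (6.9) would with an unknown sink flux. The values are hypothetical, and SciPy’s linprog stands in for the GLPK; with the GLPK the equivalent step is simply commenting the constraint out of the input file, as described above.

# Sketch of the infeasibility diagnosis in Section 6.3.2.
import numpy as np
from scipy.optimize import linprog

A_eq = np.array([[1.0, 1.0, 1.0]])     # X balance with only source terms, eq. (6.8)
b_eq = np.array([0.0])
bounds = [(1.0, 10.0),                  # v1 forced nonzero (e.g., a measured flux)
          (0.0, None), (0.0, None)]
c = np.zeros(3)                         # any objective; we only test feasibility

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.status)   # status 2 indicates an infeasible problem

# Removing the X balance (cf. eq. (6.9), where an unknown sink would close it)
# restores feasibility while keeping the same flux bounds:
res_relaxed = linprog(c, bounds=bounds)
print(res_relaxed.status)   # status 0 indicates a feasible (optimal) solution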
6.4 Discussion and Commentary
Genome-scale metabolic modeling has a great deal to offer the basic research and metabolic engineering communities. It is a powerful tool that can be used to help elucidate metabolic processes that might not otherwise be easily amenable to experimental studies. Furthermore, based on insights provided by such analysis, it may be possible to address unanswered questions, formulate new hypotheses, and better manipulate organisms for biotechnological purposes, or, if the organism in question is a pathogen, better treat illnesses caused by that microbe.
Clearly there are many significant assumptions underlying the metabolic modeling approach. However, other standard wet lab technologies, such as microscopy and microarrays, are also fraught with limitations. As long as one is aware of these limitations, the results generated via these tools can be extremely insightful. The same is true for metabolic modeling. It is simply another tool available to the researcher. By being aware of the limitations of this approach, it is possible to gain some truly valuable insight into the system that is being studied.
6.5 Summary Points

An attempt has been made to provide a strategy for developing and implementing genome-scale metabolic models. It has been assumed that the system will model a prokaryotic organism and that the model will be underdetermined; as a result, an optimization strategy is required. Based on these assumptions, the following steps summarize the method just described:

• Acquire or generate a metabolic reconstruction for the organism of interest:
  • Reconstructions for many organisms are available from BioCyc (http://www.biocyc.org) or KEGG (http://www.genome.jp/kegg/);
  • If a metabolic reconstruction is unavailable but an annotated genome sequence is, the Pathway Tools software can be used to generate the reconstruction (http://bioinformatics.ai.sri.com/ptools/ptools-overview.html).
• Convert the metabolic network to a metabolic/mathematical model and select an objective function:
  • Objective function selection is critical and requires careful thought and consideration; a list of some of the most commonly used objective functions is provided in Section 6.2.3.1;
  • Evaluate candidate objective functions.
• Carry out linear programming (for example, with the GNU Linear Programming Kit, available from http://www.gnu.org/software/glpk/) and evaluate the solution:
  • If a feasible solution is generated, make sure the solution results make sense;
  • If no feasible solution can be generated, identify which constraints are causing the problem and mark those metabolites for future experimental study.
It should be emphasized that this approach is an iterative process, and may require several rounds of updating based on what was learned in previous trials, followed by repeated analysis.
Acknowledgments Support for this work was provided in part by the NIH National Library of Medicine through grant 1R03LM009753-01.
CHAPTER 7
Modeling the Dynamics of Cellular Networks

Ryan Nolan1,2 and Kyongbum Lee1*

1 Department of Chemical and Biological Engineering, Tufts University, Medford, MA 02180
2 Wyeth BioPharma, Andover, MA 01810
* 4 Colby Street, Room 142, Medford, MA 02155-6013; phone: 617-627-4323; fax: 617-627-3991; e-mail: [email protected]
Abstract Optimization of tissue or cell function is often difficult due to a limited understanding of the biochemical activity of the system as a whole. A mathematical model simulating the dynamics of cellular biochemical processes would significantly reduce the experimental burden for such optimization. Insights into system dynamics would also enable fundamental advances in understanding whole-cell regulatory mechanisms. This chapter presents a modeling strategy to simulate the changes in cell density and metabolite concentrations during an unsteady cell culture process. The methodology involves three steps. First, from a genome-scale metabolic reaction network, graph-theoretical analysis is applied to systematically reduce the network to a manageable set of modules. Second, kinetic rate expressions are defined for each module to characterize the initial state of the system during balanced growth. Third, the transition periods following a system perturbation are explained by the generation of metabolically distinct subpopulations. This methodology is illustrated with an application to a batch culture of Chinese hamster ovary cells producing a recombinant therapeutic protein.
Key terms: Cellular dynamics, Metabolic network, Modularity, Metabolic flux analysis, Elementary flux modes, Enzyme kinetics, Parameter estimation, Genetic algorithm, Bayesian network analysis
7.1 Introduction

The living cell is an exceedingly complex system with very many interacting molecular components including genes and other nucleic acids, enzymes and other proteins, and small molecule metabolites. These molecules are "chemically connected" through their shared participation in cellular reactions and regulatory events, giving rise to a "biochemical network." Examples of such networks include gene regulatory circuits, signal cascades, and metabolic reaction networks. Advances in genomics, proteomics, and informatics have generated an increasingly vast database of information on the compositions of these biochemical networks. For many unicellular organisms, genome-scale metabolic models have been assembled that catalogue the types of enzymes present in the cell and thereby define the stoichiometric connections between the metabolites and reactions. However, translating such catalogues into dynamic computational models has remained elusive due to the complexity of biochemical networks and incomplete knowledge of the components' kinetic and regulatory behavior. Current genome-scale or whole-cell models (based on, for example, flux balance analysis) assume conditions of pseudo-steady state and/or optimality to provide snapshots (global descriptions) of observed or desired overall cellular activity. At the other end of the modeling spectrum are mechanistic models consisting of coupled differential rate equations that describe the time profiles of the systems' molecular components. The advantage of these kinetic models is that they can lead to powerful insights into the dynamics of the system, for example, offering explanations and predictions on stability, attainable steady states, and responses to time-varying stimuli. On the other hand, the forms and parameters of the rate equations used for model simulations are often culled from varied sources in the literature, because they are not generally available for whole-cell networks. Moreover, published data sometimes reflect isolated in vitro, rather than in vivo, settings, and thus lack internal consistency. The limited amount of biological knowledge (e.g., mechanism-based rate equations) and dearth of reliable in vivo data on parameters have set practical limits on the scope and scale of kinetic models. Indeed, there are relatively few examples of kinetic models that do not focus on a particular subsystem, such as a metabolic pathway or signaling subnetwork. The goal of this chapter is to present a data-driven, multiresolution modeling strategy that can supplement or complement biological knowledge-driven approaches for developing dynamic models of whole-cell networks. The central premise of this modeling strategy is that a biochemical network may be abstracted as an organized ensemble of modules. Each module may be represented by one or more rate equations to varying degrees of detail depending on available knowledge, data, and the overall modeling goal. Modular partitioning and coarse graining can systematically reduce network complexity and afford estimation of a reasonable number of self-consistent parameters from experimental data. The premise is based on recent developments in topological analyses of cellular networks, which present a strong case for modular organization. The illustrative example used in this chapter is a metabolic reaction network. Metabolic networks are large, consisting of several hundred to a thousand component species.
There are a number of readily accessible databases with comprehensive, species-specific compositional information of metabolic networks. In contrast, there are no comparable sources of data on the mechanisms of enzyme action and corresponding rate equations. Standard forms of metabolic reaction rate equations are generally nonlinear functions of
metabolites and include multiple coefficient parameters. In this regard, the challenges associated with dynamic modeling of metabolic networks are broadly representative.
7.2 Materials

7.2.1 Cell culture
The methodology described herein was applied to an industry-relevant fed-batch process of Chinese hamster ovary (CHO) cells producing a recombinant antibody. Briefly, the basal and feed media used were chemically defined, protein-free, proprietary formulations. Both the cells and media were products of Wyeth BioPharma (Andover, Massachusetts). The cells were seeded at >1 × 10⁶ cells/mL and carried for approximately 2 weeks while maintaining a measured viability of >85%. Samples were taken twice daily and analyzed for viable cell density, viability, pH, osmolarity, O2, CO2, glucose, lactate, ammonia, amino acids, and recombinant antibody. Because the goal of this chapter is to present a broadly applicable modeling method, additional CHO cell culture-specific details are not presented here.
7.2.2 Database
We have made extensive use of the KEGG database [1], which provides a species-specific listing of enzymes, reactions, and metabolites for the organism of interest. An especially useful feature of this database is its ftp site (http://www.genome.jp/kegg/download/ftp.html), which allows data downloads in various file formats.
7.3 Methods

7.3.1 Network reconstruction
1. Generate a genome-scale metabolic reaction network from an annotated database such as KEGG.
2. Following the initial assembly, perform additional manual curation steps as necessary to add missing reactions (e.g., within a linear pathway) and remove pathways that are irrelevant to the metabolic phenotype under investigation (e.g., xenobiotic metabolism). Ensure phenotype-consistent directionality of certain pathways (e.g., macromolecule biosynthesis) and prevent irrelevant cycles among cofactor metabolites (e.g., nucleotide recycling) by imposing reaction irreversibility and reaction coupling.
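As a minimal sketch of how such a reconstruction can be turned into a stoichiometric form in MATLAB, the fragment below assembles an S matrix from a hypothetical three-reaction list; the reaction names and the lumped lower-glycolysis step are placeholders, not the actual KEGG export.

    % Hypothetical reaction fragment; in practice the lists come from the KEGG flat files.
    mets = {'GLC','G6P','F6P','PYR'};
    rxns = struct( ...
        'name', {'HK','PGI','LOWER_GLYCOLYSIS'}, ...
        'sub',  {{'GLC'}, {'G6P'}, {'F6P'}}, ...        % substrates
        'prod', {{'G6P'}, {'F6P'}, {'PYR','PYR'}});     % products (repeat = stoichiometry 2)

    S = zeros(numel(mets), numel(rxns));                % metabolites x reactions
    for j = 1:numel(rxns)
        for s = rxns(j).sub,  S(strcmp(mets, s{1}), j) = S(strcmp(mets, s{1}), j) - 1; end
        for p = rxns(j).prod, S(strcmp(mets, p{1}), j) = S(strcmp(mets, p{1}), j) + 1; end
    end
    disp(array2table(S, 'RowNames', mets, 'VariableNames', {rxns.name}))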
7.3.2 Network reduction
7.3.2.1 Structural reduction
1. Create a directed graph with metabolites as nodes and reactions as edges (Figure 7.1). Directed edges between nodes are established based on reaction involvement. For example, in the reaction A + B → C + D, directed edges are defined from A to C, A to D, B to C, and B to D.
[Figure 7.1: genome-scale network graph of 206 metabolites, 197 reactions, and 659 links; the legend distinguishes carbon and currency metabolites and direct carbon and currency links.]
Figure 7.1 Genome-scale network. From the KEGG genome database, enzyme-catalyzed reactions were collected to form a complete metabolic reaction network for the CHO cell. Pathways included were: glycolysis, PPP, TCA cycle, amino acid metabolism, oxidative phosphorylation, and biomass and recombinant protein synthesis. Cell culture experiments were conducted to rule out alternative or parallel pathways.
2. Remove noncarbon currency and carbon-shuttle metabolites (Table 7.1) to form a carbon-backbone network (Figure 7.2). The effect of this step should be a significant reduction in graph connectivity (Table 7.2).
3. Classify the graph nodes according to the following criteria: input or output if degree = 1, intermediate if degree ≥ 2, and cycle if there exists a path from the node back to itself (a MATLAB sketch of the graph construction and node classification is given after step 5 below).
Table 7.1 Noncarbon Currency and Carbon-Shuttle Metabolites

Noncarbon currency: nucleotide phosphates; nucleoside phosphates; orthophosphate; pyrophosphate; NAD(P)+/NAD(P)H; FAD/FADH2; H+; H2O; O2; NH3; H2O2; sulfate; sulfite; oxidized/reduced ferredoxin; 3'-phosphoadenylyl sulfate/adenosine 3',5'-bisphosphate.

Carbon-shuttle*: ACP; CoA; CO; CO2; HCO3−; THF; THF derivatives; L-glutamate/2-oxoglutarate; L-glutamate/L-glutamine; tetrahydrobiopterin/dihydrobiopterin; S-adenosyl-L-methionine/S-adenosyl-L-homocysteine.

* Metabolite pairs such as L-glutamate/2-oxoglutarate were removed only when they act as carbon shuttles. For example, the pair L-glutamate/2-oxoglutarate is removed from: 4-aminobutanoate + 2-oxoglutarate = succinate semialdehyde + L-glutamate, but not from: L-glutamate + NAD+ + H2O = 2-oxoglutarate + NH3 + NADH + H+.
[Figure 7.2: carbon-backbone network graph of 188 metabolites, 195 reactions, and 225 links; nodes are labeled as input, intermediate, cycle, or output metabolites.]
Figure 7.2 Carbon-backbone network. The network size was significantly reduced to a carbon-backbone network by removing noncarbon currency and carbon-shuttle metabolites. From the directed graph, metabolites were then defined as input or output (degree = 1), intermediate (degree ≥ 2), or cycle (there exists a path from and to itself). These distinctions were used to form stoichiometrically conserved pseudo-reaction modules.
Using these definitions, determine graph-paths from inputs to cycles and outputs, and from inputs and cycles to outputs.
4. Apply elementary flux mode (EFM) analysis to every pair of connected input and output nodes in the graph. The EFM analysis should identify every stoichiometrically conserved EFM between an input and an output node mapped by a graph-path (reaction sequence). In general, one graph-path maps uniquely to one EFM. It should be noted that the EFM algorithm applied to a carbon-backbone network of central carbon metabolism with all external metabolites as inputs and outputs will often result in >40,000 pathways. In contrast, the EFM algorithm applied to the graph-paths will result in far fewer pathways (in our CHO cell network, the number of graph-paths was 36). On occasion, using only one start and one end node as inputs to the EFM algorithm may result in an empty set. When this occurs, an additional node (input, cycle, or output) is required to form a stoichiometrically conserved pathway. In these instances, an additional node is systematically screened as an added input to the EFM algorithm, which then completes an EFM pathway.
Table 7.2 Graph Connectivity

Step  Description                        Metabolites  Reactions  Links
1     Genome-scale network               206          197        659
2     Carbon-backbone network            188          195        225
3     Graph nodes defined                N/A          N/A        N/A
4     EFM pathways defined               N/A          N/A        N/A
5     Pseudo-reaction module network     34           36         N/A
6     Kinetic model network              20           18         N/A
The EFM algorithm can be implemented in MATLAB (Mathworks, Natick, Massachusetts) using Metatool [2]. 5. Finally, for each of the EFM pathways generated, sum the involved reactions, with the noncarbon currency and carbon-shuttle metabolites reintroduced, to form a pseudo-reaction module (Figure 7.3).
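A MATLAB sketch of the graph construction and node classification of steps 1 through 3 is given below, using a hypothetical mini-network rather than the CHO reconstruction; cycle metabolites are flagged as members of nontrivial strongly connected components, and inputs/outputs are identified by having no incoming or outgoing edges, a slight variant of the degree-based rule above.

    % Hypothetical carbon-backbone edges (substrate -> product) for illustration only.
    src = {'A','A','B','C','E','F'};
    dst = {'C','D','C','E','C','A'};          % C -> E -> C forms a cycle; D is a sink, F a source
    G   = digraph(src, dst);

    indeg = indegree(G);  outdeg = outdegree(G);
    bins  = conncomp(G, 'Type', 'strong');    % strongly connected components
    csize = accumarray(bins(:), 1);           % component sizes
    isCycle  = csize(bins(:)) > 1;            % node lies on a directed cycle (self-loops excluded)
    isInput  = (indeg  == 0) & ~isCycle;
    isOutput = (outdeg == 0) & ~isCycle;

    disp(table(G.Nodes.Name, indeg + outdeg, isInput, isOutput, isCycle, ...
         'VariableNames', {'Metabolite','Degree','Input','Output','Cycle'}))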
7.3.2.2 Functional reduction
1. To the network of pseudo-reaction modules, add cellular exchange reactions to account for transport of metabolites across the cell membrane, and generate a stoichiometric reaction matrix, S. For comprehensive functional analysis, it is advised to include the significant currency metabolites (i.e., O2, ATP, NADH, NADPH, FADH2, and NH3). In the CHO network, 25 exchange reactions were added to 36 pseudo-reaction modules, resulting in an S matrix of 35 metabolites by 61 reactions.
2. Quantify a steady-state flux distribution (e.g., balanced growth) using metabolic flux analysis (MFA). Reducing the network using pseudo-reaction modules eliminates reaction segments that are not connected to an input or output metabolite; thus, the MFA problem should be well posed. We recommend a least-squares solution to the MFA problem using constrained optimization (a MATLAB sketch is given after step 4 below). The objective function for this problem is:

Minimize: Σk (vk − vk^obs)², ∀ k ∈ {external fluxes}
Subject to: S · v = 0
[Figure 7.3: three panels (Graph Path, Elementary Mode, Pseudo-Reaction Module) showing the glycolytic route from GLC to PYR collapsed into a single pseudo-reaction module with the currency metabolites (ADP, ATP, NADH) reintroduced.]
Figure 7.3 Network reduction strategy. From the carbon-backbone network, a set of graph paths was defined between all inputs, cycles, and outputs. For each path, the endpoints served as inputs/outputs in an elementary flux modes (EFM) algorithm applied to the carbon-backbone network (center). The reactions in an EFM were then combined to form a stoichiometrically conserved pseudo-reaction module, which included currency metabolites.
where vk and vk^obs are, respectively, the predicted and observed external flux components of v.
3. Inequality constraints can be added (if necessary) based on the thermodynamic feasibility of reaction pathways. The rationale for these constraints is somewhat lengthy and is not included in this chapter. A detailed introduction to thermodynamic pathway constraints can be found in [3]. The process is as follows:
   i. Estimate Gibbs energies of formation (Gfi) for the metabolites in the network using group contribution theory [4].
   ii. Calculate a standard Gibbs free energy change for each reaction (ΔGRXN°).
   iii. Sum the ΔGRXN° values across each stoichiometrically balanced pathway (i.e., EFM) in the network to obtain standard pathway Gibbs free energy change (ΔGPATH°) values.
   iv. Express these values as inequality constraints of the form G · v ≤ 0, where G is a pathway-scaled matrix of ΔGPATH° values.
4. From the steady-state flux distribution, remove reactions and pathways that carry a negligible flux. Negligible fluxes can be determined based on a cutoff value set as a fraction (e.g., 1%) of the median flux value. For the CHO network, removal of such reactions resulted in a final reaction network (to be used for subsequent kinetic modeling) consisting of 20 metabolites and 30 reactions (18 pseudo-reactions and 12 cellular exchange reactions).
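The least-squares MFA problem of step 2 can be solved with lsqlin (Optimization Toolbox), as in the hypothetical sketch below; the four-reaction chain, the measured flux indices, and the bounds are placeholders for the actual CHO pseudo-reaction network.

    % Placeholder stoichiometric matrix (metabolites x reactions): uptake v1, A->B (v2),
    % B->C (v3), and secretion v4; v1 and v4 are the measured exchange fluxes.
    S      = [ 1 -1  0  0;
               0  1 -1  0;
               0  0  1 -1 ];
    nRxn   = size(S, 2);
    obsIdx = [1 4];                          % indices of measured exchange fluxes
    vObs   = [2.0; 1.9];                     % observed exchange rates (hypothetical units)

    C = zeros(numel(obsIdx), nRxn);          % picks the measured components out of v
    C(sub2ind(size(C), 1:numel(obsIdx), obsIdx)) = 1;
    lb = zeros(nRxn, 1);                     % irreversible reactions
    ub = 10 * ones(nRxn, 1);

    % minimize sum_k (v_k - v_k^obs)^2  subject to  S*v = 0  and the flux bounds
    v = lsqlin(C, vObs, [], [], S, zeros(size(S,1), 1), lb, ub);
    disp(v')                                 % estimated steady-state flux distribution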
7.3.3 Kinetic modeling
7.3.3.1 Rate equations
Many options exist for saturable enzyme reaction rate expressions. In this chapter, we use Michaelis-Menten type equations, which are the most commonly used kinetic expressions for relating the concentrations of substrates, inhibitors, and activators to the reaction velocities. Depending on the enzymes involved and data available, other types of rate equations may be more appropriate. We will discuss this point further in Section 7.5.2.
1. Define Michaelis-Menten enzymatic rate equations for each reaction. For example, the rate v for the reaction A + B → C + D is defined as

v = vmax · X · [A]/(Km,A + [A]) · [B]/(Km,B + [B])
where v = reaction rate (mM/day), vmax = maximum balanced growth reaction flux (mmol/10⁹ cells/day), X = viable cell density (10⁶ cells/mL), Km = Michaelis-Menten constant (mM), and [A] and [B] = concentrations of substrates A and B, respectively (mM). Approximating the vmax values from the balanced growth reaction flux (as determined from the MFA analysis in the previous section) will significantly reduce the number of estimated parameters.
2. Define cell growth with Monod kinetics as follows:

dX/dt = μ · X,   μ = μmax · [ATP]/(Km,ATP + [ATP])
where μ = growth rate (1/day), μmax = maximum growth rate (1/day), and [ATP] = intracellular concentration of ATP (mM).
3. Define mass balances on the metabolite concentrations as

dA/dt = S · v

where A = metabolite concentration vector, S = reaction stoichiometric matrix, and v = reaction rate expression vector.
4. Estimate the unknown Km parameters (see Section 7.3.4).
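Steps 1 through 3 translate directly into an ODE right-hand side. The MATLAB sketch below uses a hypothetical two-reaction module (glucose to an intracellular pool P, then P to lactate) with invented vmax, Km, and growth values purely to show the structure; growth is driven by P instead of ATP for simplicity.

    % Hypothetical module: GLC -> P (v1), P -> LAC (v2), plus Monod-type growth on P.
    p = struct('vmax1', 2.0, 'Km1', 1.0, ...     % illustrative values (mmol/1e9 cells/day, mM)
               'vmax2', 1.5, 'Km2', 0.5, ...
               'mumax', 0.8, 'KmP', 0.2);

    S  = [ -1   0;                                % GLC
            1  -1;                                % P
            0   1 ];                              % LAC
    y0 = [20; 0.1; 0; 0.5];                       % [GLC; P; LAC] in mM, X in 1e6 cells/mL

    rates = @(c, X) [ p.vmax1 * X * c(1) / (p.Km1 + c(1));   % Michaelis-Menten, scaled by X
                      p.vmax2 * X * c(2) / (p.Km2 + c(2)) ];
    mu    = @(c) p.mumax * c(2) / (p.KmP + c(2));            % Monod growth rate
    rhs   = @(t, y) [ S * rates(y(1:3), y(4));  mu(y(1:3)) * y(4) ];

    [t, y] = ode45(rhs, [0 3], y0);               % simulate three days of culture
    plot(t, y); legend('GLC','P','LAC','X'); xlabel('time (days)');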
7.3.3.2 Dynamic simulations
For a low-density seed batch culture, balanced growth estimation would be sufficient to characterize the dynamics of the system. However, for a high-density, fed-batch culture such as the one used as an example here, a different approach is required to accurately simulate the transition periods that occur throughout the process. This situation is demonstrated in Figure 7.4, where the balanced growth parameters accurately predict the metabolite profiles through the first 1.5 days. However, at day 1.5 a perturbation to the system occurs, in the form of glutamine depletion, resulting in a significant deviation from the expected balanced growth trajectory. One possible explanation is that, in response to this perturbation, a fraction of the cell culture population metabolically adjusts to the depletion by altering the activity of specific enzymes (e.g., reversal of glutamine synthetase and lactate dehydrogenase).
[Figure 7.4: measured and simulated time courses (days 0 to 3) for viable cell density, antibody, glucose, lactate, glutamine, asparagine, alanine, and ammonia.]
Figure 7.4 Balanced growth simulation. The model for balanced growth (days 0 to 2) included 42 parameters (Km's), which were fit to experimental data using a genetic algorithm. The graphs depict the measured and simulated data; the x-axis is days and the y-axis is concentration. All metabolites were accurately predicted through day 1. At day 1.5 the culture experienced a perturbation that resulted in a transition to a new metabolic state.
The result is a transition period during which a heterogeneous population of cells develops. This is an important modeling assumption, and it is discussed further in the commentary section below. The heterogeneous distribution can be described with a simple transition-state (Markov process; i.e., the probability of transitioning to a future state depends only on the current state and is independent of any past states) model, where a fraction of the population responds quickly to the perturbation with probability k, and another fraction continues to be metabolically active at the same balanced growth rate with probability 1 – k (Figure 7.5). Over time, k → 1 and another steady state is achieved with a new homogeneous population. Such process dynamics can be modeled as follows.
1. Define a perturbation event (observed or hypothesized) that will trigger a metabolic transition, as well as the response(s) of the system to compensate. In the CHO example, the event was glutamine depletion and the responses were reversal of the two aforementioned enzymes.
2. For each event, define a new network with the appropriate response variables adjusted, as well as a Markov probability variable k to represent the fraction of the total population that exists in this metabolic state.
3. Assume the balanced growth phase is a baseline state from which the cell deviates, and set the previously estimated kinetic parameters as constants. Estimate the Km's for the new reactions and the k value using the data only over the transition period. Results of this method applied to the CHO example are shown in Figure 7.6.
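One simple way to encode the population split of step 2 is to blend the balanced-growth rate law and the responded rate law with the Markov weight k, as in the hypothetical single-reaction MATLAB sketch below; the LDH-like reversal and all parameter values are illustrative, not the fitted CHO model.

    % Hypothetical LDH-like step: state S1 runs PYR -> LAC, state S2 runs LAC -> PYR.
    p = struct('vmaxF', 1.2, 'KmPYR', 0.4, ...    % forward (balanced-growth) parameters
               'vmaxR', 0.9, 'KmLAC', 2.0, ...    % reverse parameters fitted over the transition
               'k', 0.6);                         % Markov weight: fraction of cells in state S2

    vF   = @(c) p.vmaxF * c(1) / (p.KmPYR + c(1));   % c = [PYR; LAC]
    vR   = @(c) p.vmaxR * c(2) / (p.KmLAC + c(2));
    vNet = @(c) (1 - p.k) * vF(c) - p.k * vR(c);     % population-averaged net PYR -> LAC flux

    Sr  = [-1; 1];                                   % PYR consumed, LAC produced by the net flux
    rhs = @(t, c) Sr * vNet(c);
    [t, c] = ode45(rhs, [1.5 3], [0.5; 15]);         % transition period, days 1.5 to 3
    plot(t, c); legend('PYR','LAC'); xlabel('time (days)');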
[Figure 7.5: time courses (days 0 to 3) for glutamine, lactate, and serine, each paired with a two-state (S1/S2) reaction scheme for the GLN/GLU, LAC/PYR, and SER/GLY conversions with Markov probabilities k1, k2, and k3.]
Figure 7.5 Markov transition model. At day 1.5 the glutamine concentration approached a low level, resulting in a depletion of some internal metabolites, and a shift to a new (S1 to S2) metabolic steady state to compensate. During this transition, a heterogeneous population developed, with some cells utilizing specific reactions in the forward direction, with a Markov probability = k, and other cells in the reverse direction, with probability 1 – k.
[Figure 7.6: measured and simulated time courses (days 0 to 3) for viable cell density, antibody, glucose, lactate, glutamine, asparagine, alanine, and ammonia over the first transition.]
Figure 7.6 First transition simulation. Setting the initial 42 parameters (from balanced growth) as constant, the transition (days 1.5 to 3) to a new metabolic state was modeled by decreasing the activity of several forward reactions and increasing the activity of the corresponding reversible reactions. The net result was an increase in the depleted internal metabolites; 25 parameters (Km's and Markov probabilities, k's) were fit for this phase.
7.3.4 Parameter estimation
The parameter estimation problem is generally solved using nonlinear optimization. Given a set of coupled differential equations expressing the reaction rate dependences on metabolite concentrations and kinetic coefficient parameters, the objective function for the optimization problem is to minimize the sum-squared differences between the calculated and measured dependent variables (e.g., reaction rates) based on a set of parameter choices. Typical inputs to the problem are the measured or assumed initial values of the independent variables (e.g., metabolite concentrations). Here, we refer to independence in a mathematical, rather than physical, sense. This may be an obvious point, since intracellular metabolite concentrations generally cannot be controlled independently. As with other nonlinear optimization problems, guaranteeing a globally optimal solution is exceedingly difficult, if not impossible. For large-scale problems, the use of gradient-based, local search methods that repeatedly solve the problem with different initial conditions (multistart strategy) generally fails to arrive at satisfactory solutions, often yielding the same local minimum [5]. It is generally agreed that global search methods, while computationally expensive, are likely to yield results that broadly reflect the full range of parameter estimation data. Several such methods have recently been examined, including branch-and-bound [6] and hybrid functional Petri nets [7]. In this chapter, we use genetic algorithms, which are a particular class of evolutionary algorithms. The advantage of this global search heuristic is that it offers reasonable (i.e., exact or approximately exact) solutions even when applied to ill-conditioned problems [8]. Briefly, the individuals of an initial, randomly generated seed population are examined for their ability
to satisfy a predefined fitness function. The algorithm selects the most promising individuals (elite children), along with a user-defined portion of randomly mutated and recombined (crossover) individuals, to seed the next iteration (or generation). The genetic algorithm is implemented as follows.
1. Define the fitness function as

Minimize: Σj (cj − cj^obs)², ∀ j ∈ {external metabolites}

where cj and cj^obs are, respectively, the predicted and observed external metabolite concentrations in the culture medium.
2. Initialize the intracellular concentrations and define the bounds on the parameters. Initial concentrations can be estimated from the literature or measured directly. Parameter bounds can be estimated to span one order of magnitude based on the associated metabolite concentration. For example, in a reaction A → B, if the concentration of A is 0.5 mM, the bounds on the Km parameter of the reaction would be 0.1 mM and 1 mM. 3. Define the number of generations to terminate the algorithm, the mutation function, and crossover fraction. In this work, these values were 500, Gaussian, and 0.8, respectively. The choices for these algorithm parameters depend on the software package. This work used the Genetic Algorithm and Direct Search toolbox for MATLAB to integrate all computing routines, including EFM analysis, flux calculation, model simulation, and parameter estimation, into one software environment.
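A skeleton of this GA setup using MATLAB's Global Optimization Toolbox is sketched below; the simulator, observations, and bounds are placeholders, and note that with bound constraints ga requires a feasibility-preserving mutation function, so mutationadaptfeasible is used here in place of the Gaussian mutation mentioned above.

    % Placeholder simulator: returns predicted external metabolite values for a Km vector.
    simulateModel = @(Km) Km(:)';                 % stand-in; replace with the ODE model simulation
    cObs = [1.0 0.8 0.5];                         % observed external concentrations (hypothetical)

    fitness = @(Km) sum((simulateModel(Km) - cObs).^2);   % step 1: sum of squared residuals

    % Step 2: bounds spanning one order of magnitude around the associated concentrations.
    lb = [0.1 0.1 0.05];
    ub = [1.0 1.0 0.50];

    % Step 3: termination and crossover settings used in this chapter.
    opts = optimoptions('ga', ...
        'MaxGenerations',    500, ...
        'CrossoverFraction', 0.8, ...
        'MutationFcn',       @mutationadaptfeasible);

    [KmBest, resnorm] = ga(fitness, numel(lb), [], [], [], [], lb, ub, [], opts);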
7.4 Data Acquisition, Anticipated Results, and Interpretation

The focus of this chapter is on model development, rather than metabolite data acquisition. We refer the interested reader to recent publications by Nielsen and coworkers, who have developed excellent assay platforms for high-throughput metabolite analysis [9]. This section will therefore limit the comments to anticipated results and interpretation. These comments will be brief, because Section 7.3 presented many of the important observations and quantitative details regarding the expected results through the CHO cell example.
7.4.1 Model network
The expected final result is a dynamic model simulating the time-dependent metabolic behavior of a cell culture. Key intermediate results are the graph models obtained through the systematic reduction and modularization strategy (Figure 7.3). The number of reaction modules (paths) in the reduced model correlates with the number of rate equations, which in turn determines the size of the parameter space. We found that a tenfold reduction in model size was possible for the genome-scale CHO cell network, which initially consisted of about 200 reactions (Table 7.2). This level of reduction is
likely to be typical, because cellular metabolism is generally well conserved across species and cell types. The dynamic model expresses the time-dependent behavior through a set of coupled differential equations. Following data fitting, the results should include rate coefficients and other equation parameters along with the calculated reaction rates and metabolite concentrations. Figure 7.4 shows representative time trajectories of simulated metabolite concentrations plotted against experimental data. The time scale of the simulation necessarily depends on the cell type and the culture behavior of interest. In the case of the CHO cell culture used here as a test system, the time scale was on the order of days. To clarify, this time scale does not refer to the computation time, which is on the order of minutes. The anticipated dynamic range over the course of a simulation is two orders of magnitude for the extracellular metabolite concentrations for a high-density culture exceeding 10⁶ cells/mL. The dynamic range of the intracellular metabolites depends on the enzyme affinity parameter (Km) bounds established by domain knowledge (i.e., cell type-specific biological knowledge). We noted that these bounds significantly influence the convergence of the model during parameter optimization. Fortunately, experimental determination of these bounds is possible through initial concentration measurements on the intracellular metabolites.
7.4.2 Dynamic simulation parameters
The modeling strategy of this chapter involves two types of parameters. One set of parameters directly depends on the form of the kinetic expression, and reflects the sensitivity of the reaction rate to the substrate concentrations. Beyond this basic interpretation, further analysis again depends on the specifics of the model. For example, rate equation parameters of the lin-log form express the elasticity of the enzyme with respect to both substrates and nonsubstrate effectors [10]. The second parameter type represents a probability that the culture gives rise to one or more additional subpopulations with qualitative differences in metabolic behavior. When interpreting the parameter values, we recommend a global analysis of all of the metabolite and reaction rate time profiles after performing multiple (on the order of 10) iterations of model training (data fitting). While GA-based nonlinear optimization generally yields robust results, the heuristic nature of the algorithm cannot guarantee convergence to a globally optimal solution.
7.5 Discussion and Commentary

7.5.1 Modularity
In this chapter, we have outlined steps to develop a dynamic model of cellular metabolism based on annotated genome data and metabolite concentration measurements. The central premise was that cellular networks can be decomposed into recognizable and functionally meaningful modules. In the present case of the CHO cell metabolic network, the modules represented reaction groups or pathways. Our model reduction strategy was largely motivated by several recent developments in graph theoretical analysis of biological network topology. In particular, an emergent theme in this literature is the concept of modular organization [11]. A number of studies have shown that a
biochemical network possesses significant patterns of interconnections representing basic structural units similar to other complex natural networks such as the ecological food web. Such units have been labeled motifs when identified through bottom-up searches [12]. The modularity of biochemical networks has also been explored using top-down approaches that successively divide the system into smaller subnetworks [13]. In these earlier studies, modularity has been determined by analyzing patterns of structural (e.g., reaction stoichiometry-based) connectivity. Until recently, less attention has been paid to functional (e.g., reaction flux-based) connectivity [14]. Determining connectivity relationships solely based on structural information has the drawback that every biochemical interaction is treated equally, regardless of the activity level of that interaction. The modeling strategy presented in this chapter examines both structural and functional modularity in reducing model complexity. In addition to the premise on modularity, two other important assumptions were introduced, which we will discuss in the remainder of this commentary section. The discussion will concentrate on the limitations imposed by the assumptions with respect to generality, as opposed to, for example, particular aspects of simulating CHO cell dynamics. As part of this discussion, we will suggest alternative modeling options as well as future research directions for refining the modeling framework.
7.5.2 Generalized kinetic expressions
In this work, we used Michaelis-Menten type equations, which are hyperbolic functions that reasonably approximate the saturation behavior of many metabolic enzymes, and are thus an appropriate initial choice. However, there are a number of other options, and careful consideration should be made as to the choice of the rate equation form.
1. Commonly used alternatives to the Michaelis-Menten equations include generalized mass-action (or S-system) [15], convenience [16], and lin-log kinetics. Each of these alternatives offers particular advantages. For example, convenience kinetics provides a simple and generalized form for rate expressions of random order enzyme mechanisms, and can include thermodynamic dependencies among parameters. When inhibition, activation, and thermodynamic constraints are not considered, the convenience and Michaelis-Menten kinetic expressions are equivalent. Thus, it is not surprising that, like Michaelis-Menten type expressions, the equations are numerically well behaved and suitable for large-scale parameter estimation and optimization. In lin-log kinetics, the reaction rate is proportional to the enzyme level and a linear sum of nonlinear logarithmic substrate concentrations. While this equation form does not reflect a particular enzyme action mechanism, it can very closely approximate the behavior of a Michaelis-Menten kinetic expression with an appropriate set of parameter choices. One challenge in implementing lin-log kinetics is that the expressions include reference state parameters, preferably obtained at a steady state. In many cases, initial conditions (for internal metabolites) reflecting a steady state may not be readily available, since dynamic culture experiments are more commonly conducted in a batch setting. On the other hand, as techniques for intracellular metabolite measurements rapidly mature, lin-log kinetics may become more attractive. A particularly compelling feature of lin-log kinetics is that its rate equation parameters directly reflect the enzyme elasticities with respect to its substrates (a small numerical comparison of the two rate laws is sketched after point 3 below).
2. Regardless of the form, generalized rate equations may not fully capture kinetic behaviors resulting from the regulatory effects of allosteric modulators. When there is mechanistic knowledge, it is possible to appropriately modify the rate equations on a case-by-case basis, for example by replacing a hyperbolic with sigmoidal function via an ultrasensitivity or cooperativity parameter. Unfortunately, such mechanistic information remains unavailable for many enzymes. Thus, the addition of regulatory variables and parameters may rely on decisions based on ad hoc knowledge. 3. One systematic, data-driven approach to determine whether there are regulatory interactions is through network inference. Various promising approaches for network inference have been described in the recent literature, including singular value decomposition (SVD), independent component analysis (ICA), network component analysis (NCA), network component mapping (NCM), and Bayesian network (BN) inference. Most of these approaches have been developed for gene regulatory circuits, partly because of the comparatively earlier advances in technologies for high-throughput gene expression measurements. A notable earlier study on reconstructing metabolic subnetworks was described by Chan and co-workers, who used an information theory–based learning algorithm [17]. While computationally efficient, this approach does not distinguish between candidate models during the structural learning stage. Rather, categorical decisions are made on conditional dependencies to arrive at a unique structure. Model refinement occurs through subsequent introduction of expert knowledge and hypothesis testing. The decisions on the structure of the network rely on thresholds, which can produce errors when the data size is small. Very recently, we have developed an alternative method based on probability theory (unpublished work [18]). This method systematically (although not exhaustively) assesses various candidate models based on their conditional probabilities given the data. The expert or subjective knowledge is introduced as prior probabilities. Therefore, it is not necessary to subsequently set up and test various hypotheses on the conditional independencies within the learned structure. An attractive feature of our method is that it can routinely update both the structure (conditional independencies between components) and parameters of the learned network as additional data become available. The data requirements are information on network stoichiometry and measurements on metabolic (flux) profiles, and thus completely overlap with the dynamic modeling framework.
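Returning to point 1, the short MATLAB sketch below compares a Michaelis-Menten rate law with its lin-log approximation anchored at a reference state; the parameter values are illustrative, and the scaled elasticity at the reference, Km/(Km + c0), follows directly from the Michaelis-Menten form.

    % Michaelis-Menten rate and its lin-log approximation around a reference state (v0, c0).
    Vmax = 2.0;  Km = 1.0;                 % illustrative parameters
    c0   = 0.8;                            % reference substrate concentration (mM)
    v0   = Vmax * c0 / (Km + c0);          % reference rate
    eps0 = Km / (Km + c0);                 % scaled elasticity of the MM rate at the reference

    c   = linspace(0.05, 5, 200);
    vMM = Vmax .* c ./ (Km + c);
    vLL = v0 .* (1 + eps0 .* log(c ./ c0));    % lin-log rate law

    plot(c, vMM, c, vLL, '--'); xlabel('c (mM)'); ylabel('v');
    legend('Michaelis-Menten', 'lin-log approximation', 'Location', 'southeast');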
7.5.3 Population heterogeneity
In this work, we modeled the apparent change in the overall behavior of the CHO cell culture to reflect a putative rise in a new subpopulation with a different metabolic phenotype. This assumption was based on the observation that a growing fraction of cells in the aging culture failed to exhibit balanced growth even when there was a sufficient supply of nutrients. An alternative interpretation would have been to consider the changing culture behavior as an adaptation, where the entire population, represented by an "average" cell, progressively takes on a different phenotype. Resolving this type of ambiguity will require additional experiments on population characteristics, for example by measuring the distribution of cell cycle states at various times during the culture process.
Troubleshooting Table

Problem: An empty set is obtained for the EFM analysis.
Potential solution: An additional node is required to form a stoichiometrically conserved pathway. Screen additional input, cycle, and output nodes as an added input to the EFM algorithm, which should result in an EFM pathway.

Problem: Uncertainty or poor estimates in kinetic parameters.
Potential solution: Improve the estimates on the parameter bounds by consulting literature data for kinetic parameters or the associated intracellular metabolite concentration. Even better, obtain measurements for the intracellular concentrations.

Problem: Poor fitting of the transition periods.
Potential solution: Change the dependency of the Markov parameter to a metabolite concentration, time, or process parameter.
7.6 Application Notes

In this section, we highlight a few notable features of the CHO cell network simulations. During the transition from balanced growth following glutamine depletion, lactate and ammonia are affected to the greatest extent, while glucose, alanine, asparagine, and other metabolites not shown deviate only slightly. It is hypothesized that the depletion of glutamine results in less carbon entering the TCA cycle via alpha-ketoglutarate and, as a result, less NADH being produced. To compensate, some cells in the population are able to take advantage of the abundant supply of lactate in the culture and reverse the direction of the LDH reaction, oxidizing lactate to pyruvate and generating the needed NADH. The deviation of ammonia, on the other hand, is a consequence of the reversal of glutamine synthetase in an attempt to replenish the depleted glutamine, which is necessary for nucleotide synthesis and antibody production, in addition to central energy metabolism. For this simple case of a single, measurable metabolite being depleted, the rules for determining which reactions would be altered in activity could be determined based on knowledge of the system. Unfortunately, there are many potential perturbations that can occur in a mammalian cell culture system, and often more than one type of perturbation is occurring at a given time. The challenge then becomes how to systematically define which reactions will respond to a specific perturbation. By determining which reactions or metabolites are directly connected to the perturbation, and then which reactions are most adaptable (i.e., those for which reversibility is known or observed), one can begin to define a chain of connectivity for modeling the transition. The final point worth mentioning is the accuracy of the predicted transition rate. It can be observed in Figure 7.6 that, while the first transition profiles for lactate and asparagine are accurate, those for glucose, glutamine, alanine, and ammonia are not. The reason for this is that the Markov transition probability, k, was modeled as a constant. It is more likely, however, that this transition depends on a particular metabolite concentration, time, or other process parameter. By including this dependency, the transition profiles for these metabolites should be more closely simulated.
7.7 Summary Points

• A systematic methodology based on graph theory and pathway analysis was used to reduce a genome-scale metabolic reaction network, without loss of conservation relationships, to a manageable network for kinetic modeling.
• Metabolite and cell density profiles were simulated using Michaelis-Menten kinetic equations and Monod growth kinetics, respectively.
• Kinetic parameters were estimated using a genetic algorithm.
• Deviations caused by external perturbations were modeled by assuming the development of heterogeneous subpopulations, each with distinct metabolic activities, the sum of which contributes to the global activity of the culture.
• The probability of a subpopulation transitioning to a new metabolic steady state was modeled with a Markov process.
Acknowledgments We gratefully acknowledge financial support for RN by Wyeth and a National Science Foundation grant (award # 0829899) to KL.
References
[1] Kanehisa, M., and S. Goto, "KEGG: Kyoto encyclopedia of genes and genomes," Nucleic Acids Res., Vol. 28, No. 1, 2000, pp. 27–30.
[2] von Kamp, A., and S. Schuster, "Metatool 5.0: fast and flexible elementary modes analysis," Bioinformatics, Vol. 22, No. 15, 2006, pp. 1930–1931.
[3] Nolan, R.P., A.P. Fenley, and K. Lee, "Identification of distributed metabolic objectives in the hypermetabolic liver by flux and energy balance analysis," Metab. Eng., Vol. 8, No. 1, 2006, pp. 30–45.
[4] Mavrovouniotis, M.L., "Group contributions for estimating standard Gibbs energies of formation of biochemical compounds in aqueous solution," Biotechnol. Bioeng., Vol. 36, No. 10, 1990, pp. 1070–1082.
[5] Pardalos, P.M., and R.H. Edwin, Handbook of Global Optimization, Vol. 2, London: Kluwer Academic, 2002.
[6] Polisetty, P.K., E.O. Voit, and E.P. Gatzke, "Identification of metabolic system parameters using global optimization methods," Theor. Biol. Med. Model., Vol. 3, 2006, p. 4.
[7] Koh, G., H.F. Teong, M.V. Clement, D. Hsu, and P.S. Thiagarajan, "A decompositional approach to parameter estimation in pathway modeling: a case study of the Akt and Mapk pathways and their crosstalk," Bioinformatics, Vol. 22, No. 14, 2006, pp. e271–e280.
[8] Moles, C.G., P. Mendes, and J.R. Banga, "Parameter estimation in biochemical pathways: a comparison of global optimization methods," Genome Res., Vol. 13, No. 11, 2003, pp. 2467–2474.
[9] Villas-Boas, S.G., J.F. Moxley, M. Akesson, G. Stephanopoulos, and J. Nielsen, "High-throughput metabolic state analysis: the missing link in integrated functional genomics of yeasts," Biochem. J., Vol. 388, Pt. 2, 2005, pp. 669–677.
[10] Liebermeister, W., and E. Klipp, "Bringing metabolic networks to life: convenience rate law and thermodynamic constraints," Theor. Biol. Med. Model., Vol. 3, 2006, p. 41.
[11] Spirin, V., M.S. Gelfand, A.A. Mironov, and L.A. Mirny, "A metabolic network in the evolutionary context: multiscale structure and modularity," Proc. Natl. Acad. Sci. USA, Vol. 103, No. 23, 2006, pp. 8774–8779.
[12] Milo, R., S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, Vol. 298, No. 5594, 2002, pp. 824–827.
[13] Ma, H.W., X.M. Zhao, Y.J. Yuan, and A.P. Zeng, "Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph," Bioinformatics, Vol. 20, No. 12, 2004, pp. 1870–1876.
[14] Yoon, J., Y. Si, R. Nolan, and K. Lee, "Modular decomposition of metabolic reaction networks based on flux analysis and pathway projection," Bioinformatics, Vol. 23, No. 18, 2007, pp. 2433–2440.
[15] Schwacke, J.H., and E. Voit, "Computation and analysis of time-dependent sensitivities in generalized mass action systems," J. Theor. Biol., Vol. 236, No. 1, 2005, pp. 21–38.
[16] Kresnowati, M.T., W.A. van Winden, and J.J. Heijnen, "Determination of elasticities, concentration and flux control coefficients from transient metabolite data using linlog kinetics," Metab. Eng., Vol. 7, No. 2, 2005, pp. 142–153.
[17] Li, Z., and C. Chan, "Inferring pathways and networks with a Bayesian framework," FASEB J., Vol. 18, No. 6, 2004, pp. 746–748.
[18] Yoon, J., "Metabolic network analysis of liver and adipose tissue," Ph.D. dissertation, Department of Chemical and Biological Engineering, Tufts University, 2007.
CHAPTER 8
Steady-State Sensitivity Analysis of Biochemical Reaction Networks: A Brief Review and New Methods

Stefan Streif, Steffen Waldherr, Frank Allgöwer, and Rolf Findeisen

1 Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany
2 Institute for Systems Theory and Automatic Control, Universität Stuttgart, Germany
3 Institute for Automation Engineering, Otto-von-Guericke University, Magdeburg, Germany
E-mail: [email protected]
Abstract
Sensitivity analysis is a valuable tool in the analysis of biological systems. It can be used for many purposes, such as drug target identification, model comparison, and model refinement. In this chapter we review steady-state parametric sensitivity analysis for biochemical reaction networks. As shown, local sensitivity analysis methods might lead to wrong conclusions if the considered system is highly variable over the range of possible parameters. To overcome this problem, we outline two new methods for steady-state sensitivity analysis. The first method is based on an input-output view of the sensitivity question: the parameters of interest are considered as inputs, and their influence on the output is captured by an expansion of the concept of linear cross Gramians to nonlinear systems, the empirical cross Gramian approach. This approach allows one to consider a wider class of systems, provides insight into the sensitivity question based on simulations, and can be expanded to time-varying sensitivity analysis. The second approach is based on a reformulation of the original question as the question of outer approximating the range of possible steady states under uncertainties. Since an outer approximation of the possible steady states is obtained, the method is nonlocal (i.e., global in nature).
Key terms
Parametric sensitivity analysis
Steady state
Biochemical reaction networks
Empirical Gramian
Global infeasibility certificate
8.1 Introduction

Over the past decades, significant advances in the mathematical modeling of biological systems have been achieved. Advances in biological experimental techniques have led to a rapid increase in the size of mathematical models, as well as in the number of available models (see [1]). However, models are not derived without purpose. They often lay the basis for the analysis and understanding of the underlying (biological) principles and are used to identify the key influencing elements.

One of the basic questions in the analysis of biological systems is how the dynamics (e.g., the steady state) changes with respect to changes in parameters (or external inputs). Examples of such parameters are reaction constants in biochemical reaction networks, or association constants. Typically, such an analysis of the influence of parameter changes on the behavior of the system is denoted as (parametric) sensitivity analysis (see [2]). It might be used for several purposes, such as the identification of targets for the design of drugs and therapies, the identification of limiting steps in a metabolic network to achieve a maximum yield of a product, or model comparison [35] or refinement [36]. We focus here on sensitivity analysis for biochemical reaction networks, which form an important class of models for biological processes [3–5].

One classical tool to provide insight into the effect that certain parameters have is metabolic control analysis (MCA) [6, 7]. It basically allows one to analyze the influence of parameters on the behavior of the system close to a certain nominal (parameter) operating point. Under such conditions one can safely assume that the behavior of the system depends linearly on the parameters. However, in biochemical reaction networks one usually faces large parameter variations: in genetic engineering, common techniques like gene knockouts or knock-downs, overexpression, or binding site mutations typically give rise to large parameter variations. In these cases one typically falls back on global sensitivity methods, which are often based on statistical considerations [8–10].

The objective of this contribution is twofold. First, we provide a brief introduction to the issue of parametric steady-state sensitivity analysis for biochemical reaction networks. As shown, existing methods are typically based on local considerations. To overcome the local limitations, we introduce two new approaches for parametric sensitivity analysis. The first approach is based on the concept of controllability and observability Gramians and their expansion to nonlinear systems. It allows one to consider a wider system class and provides a less local insight into the influence of parameters. The second approach is based on an efficient calculation of an outer approximation of the set of all possible steady states for the set of possible parameters. It is based on a reformulation of the problem as an infeasibility problem.

The remainder of the chapter is structured as follows. In Section 8.2 we briefly introduce the considered system class and the question of parametric steady-state sensitivity analysis. Section 8.3 reviews linear sensitivity analysis. In Sections 8.4 and 8.5 we introduce two new approaches for parametric sensitivity analysis that partly overcome the local limitation of existing methods. Both approaches are exemplified using a reversible covalent modification system. Conclusions and a final outlook are provided in Section 8.6.
8.2 Considered System Class and Parametric Sensitivity

We are interested in the modeling and analysis of biochemical reaction networks, which are typically given by sets of reactions of the form:
$$\alpha_1 [S_1] + \cdots + \alpha_{n_s} [S_{n_s}] \rightarrow \beta_1 [P_1] + \cdots + \beta_{n_p} [P_{n_p}] \tag{8.1}$$
Here $S_i$ denotes substrates that are transformed into the products $P_i$. The factors $\alpha_i$ and $\beta_i$ denote the stoichiometric coefficients of the reactants. Typically these networks are modeled by systems of differential equations of the form:

$$\frac{dx}{dt} = Nv(x, p), \qquad x(t_0) = x_0 \tag{8.2}$$
The rate vector $v$ ($v: \mathbb{R}^n \times \mathbb{R}^l \rightarrow \mathbb{R}^m$) depends on the parameters $p \in \mathbb{R}^l$ and on the dependent state variables (concentrations) denoted by $x \in \mathbb{R}^n$. The stoichiometric matrix $N \in \mathbb{R}^{n \times m}$ relates the rate vector to the rate of change of the states. It depends on the coefficients $\alpha_i$, $\beta_i$, and possibly on factors compensating for different units or volumes. $x_0$ denotes the initial concentrations at time $t_0 = 0$. For simplicity we assume in the following that all functions are at least once continuously differentiable with respect to their arguments. Note that this is usually the case for biochemical reaction networks.

There is a large variety of possible reaction models [3] defining the rate vector $v$ and the stoichiometry. Examples are mass action, power law, Michaelis-Menten, and Hill kinetics. We do not go into details here and rather refer to [5, 11]. A common feature of biochemical reaction networks is conservation relationships $L_j$ among the $n$ state variables $x$ of the form $L_j = \sum_{i=1}^{n} \zeta_i x_i$ with nonnegative coefficients $\zeta_i$.
Usually the system of differential equations is treated in its reduced form with n − j state variables (see [12]).
8.2.1 Example system: reversible covalent modification
One classical example of a biochemical reaction network, used in this work to exemplify our considerations, is the reversible covalent modification system [13]. The reaction scheme for this system is given by:

$$[A] + [E_1] \underset{k_2}{\overset{k_1}{\rightleftharpoons}} [C_1] \xrightarrow{k_3} [E_1] + [A^*]$$
$$[A^*] + [E_2] \underset{k_5}{\overset{k_4}{\rightleftharpoons}} [C_2] \xrightarrow{k_6} [E_2] + [A] \tag{8.3}$$
Here enzymes E1 and E2 convert the protein between its two states A and A* with intermediate complexes C1 and C2. Applying mass-action kinetics, the model of the system is given by
$$\frac{d}{dt}\begin{bmatrix} [A] \\ [A^*] \\ [C_1] \end{bmatrix} = \begin{bmatrix} -1 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & -1 & 1 & 0 \\ 1 & -1 & -1 & 0 & 0 & 0 \end{bmatrix} \cdot \begin{bmatrix} k_1 [A]\,(E_{1,tot} - [C_1]) \\ k_2 [C_1] \\ k_3 [C_1] \\ k_4 [A^*]\,(E_{2,tot} - A_{tot} + [A] + [A^*] + [C_1]) \\ k_5\,(A_{tot} - [A] - [A^*] - [C_1]) \\ k_6\,(A_{tot} - [A] - [A^*] - [C_1]) \end{bmatrix} \tag{8.4}$$
with the conservation relationships $E_{1,tot} = [E_1] + [C_1]$, $E_{2,tot} = [E_2] + [C_2]$, and $A_{tot} = [A] + [A^*] + [C_1] + [C_2]$. For the analyses in the following sections, we use the total concentrations $A_{tot} = 1$ and $E_{1,tot} = E_{2,tot} = 0.01$, and the nominal parameter values $k_1 = 10^5$, $k_4 = 5 \cdot 10^4$, $k_2 = k_5 = 1$, and $k_3 = k_6 = 10^3$. The system shows an ultrasensitive behavior with respect to the conversion rate of enzyme $E_2$ represented by parameter $k_6$ [see Figure 8.1(a)]: $[A^*]$ is either 0 or 1 for most values of $k_6$ and changes rapidly for $k_6 \approx 10^3$. Therefore, the parameter $k_6$ dramatically influences the steady state of the system and an appropriate choice of $k_6$ changes the steady state completely from almost 1 to almost 0.
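For readers who want to reproduce this behavior numerically, the following minimal sketch integrates the model (8.4) and sweeps $k_6$. It assumes NumPy and SciPy are available; the helper names (covalent_modification, steady_state_A_star) and the final integration time are choices of this sketch, not taken from the chapter.

```python
# Minimal simulation sketch of the covalent modification model (8.4).
# Assumption: integrating to t = 1000 is long enough to reach steady state
# for the parameter ranges considered here.
import numpy as np
from scipy.integrate import solve_ivp

A_tot, E1_tot, E2_tot = 1.0, 0.01, 0.01
k_nom = dict(k1=1e5, k2=1.0, k3=1e3, k4=5e4, k5=1.0, k6=1e3)  # nominal values

N = np.array([[-1,  1,  0,  0,  0,  1],
              [ 0,  0,  1, -1,  1,  0],
              [ 1, -1, -1,  0,  0,  0]])

def covalent_modification(t, x, p):
    A, A_star, C1 = x
    C2 = A_tot - A - A_star - C1            # conservation relations
    E1, E2 = E1_tot - C1, E2_tot - C2
    v = np.array([p['k1'] * A * E1,         # A  + E1 -> C1
                  p['k2'] * C1,             # C1 -> A  + E1
                  p['k3'] * C1,             # C1 -> E1 + A*
                  p['k4'] * A_star * E2,    # A* + E2 -> C2
                  p['k5'] * C2,             # C2 -> A* + E2
                  p['k6'] * C2])            # C2 -> E2 + A
    return N @ v

def steady_state_A_star(p):
    sol = solve_ivp(covalent_modification, (0.0, 1e3), [A_tot, 0.0, 0.0],
                    args=(p,), method='LSODA', rtol=1e-8, atol=1e-10)
    return sol.y[1, -1]

# Sweep k6 to reproduce the ultrasensitive response of Figure 8.1(a)
for k6 in np.logspace(1, 5, 9):
    print(f"k6 = {k6:9.1f}   [A*]_ss = {steady_state_A_star(dict(k_nom, k6=k6)):.3f}")
```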
8.2.2 Parametric steady-state sensitivity
Parametric sensitivity analysis in general addresses how the nominal behavior of the biochemical reaction network changes with respect to parameter variations. We refer to "sensitivity" as a property/behavior of the considered model [14]. The model is called "sensitive" when the behavior/property of interest is strongly affected by small variations in the parameters or state variables. We assume in the following that the behavior/property of interest is defined in terms of a (virtual) output of (8.2) given by

$$y = h(x, p) \tag{8.5}$$

where $h: \mathbb{R}^n \times \mathbb{R}^l \rightarrow \mathbb{R}^m$.
Figure 8.1 (a) Ultrasensitive steady state response of the reversible covalent modification system with respect to variations of the parameter k6. (b) Linear extrapolation (dotted lines) of steady-state response using linear sensitivity analysis around two nominal steady states (circles).
Remark 1  We limit our attention to the sensitivity with respect to properties/behaviors that are directly given as a function of the states and parameters. In general one might be interested in more complicated properties, such as the frequency of an oscillation, an amplitude, or other properties that might be defined in terms of the complete solution of the dynamical system (see [12, 15]).

With respect to the output/behavior of interest y, one might be interested, for instance, in the variation of the steady state that a system reaches due to parameter variations. This is referred to as steady-state sensitivity and is defined next.

Definition 1 Steady-State Sensitivity  The steady-state sensitivity is the shift $\Delta y_{ss}$ of the output $y = h(x, p)$ due to a perturbation $\Delta p_j$ of parameter $p_j$:

$$\Delta p_j = p_j - p_{j,nom} \;\rightarrow\; \Delta y_{ss} = \lim_{t \to \infty}\, y(x(t), p_{j,nom} + \Delta p_j) - y(x_{nom}(t), p_{j,nom})$$
Next we restrict our attention to variations of the steady state. Another interesting and important question is how y changes over time due to step-wise or time-varying parameter perturbations. Approaches to this question are briefly discussed later and references to the relevant literature are given.

The most straightforward way to perform a sensitivity analysis is simply to simulate the system for different parameter values and to look at the steady-state response. One obtains a continuation diagram [see Figure 8.1(a) for the covalent modification example] that can already lead to useful statements about the parametric influence, steady-state sensitivity, and stability of steady states (bifurcation analysis). However, this is easy only for low-dimensional systems, and it is difficult or not feasible for larger biochemical reaction systems where instability and/or multistationarity might occur.

An approximation of the true steady-state shift due to parameter perturbations can be calculated by extrapolation from the linear approximation of the true solution; this is called linear sensitivity analysis and is explained in detail in the next section. Here we rather want to highlight the differences between local and global sensitivity analysis methods. An extrapolation based on a local measure provides good approximations for perturbations that are close enough to the nominal value [see Figure 8.2(a)]. However, extrapolation of local properties does not necessarily lead to global sensitivity properties due to nonlinearities. Thus, local sensitivity contains only partial information. This becomes evident for large perturbations [see Figure 8.2(b)], where a local approximation does not provide sufficiently good estimates of the steady-state shift. New approaches to sensitivity analysis often try to extend the range of validity of the extrapolations by including second-order or bilinear terms [16, 17]. However, such methods are in general still local.

Global sensitivity methods are especially important, for example, in pharmaceutical problems, where the influence of certain parameters has to be assessed over the complete range of possibilities (e.g., for all possible weights of the patient or for all possible variations in blood pressure). Global sensitivity analysis methods (such as those introduced in Section 8.5) may partly overcome this limitation. The question we seek to answer is: What are the (minimum and maximum) domains in state space that contain valid solutions due to a range of parameter values? In the following section the linear sensitivity approach is considered.
Figure 8.2 Extrapolation of the steady-state shift $\Delta y_{ss}$ due to parameter perturbations $\Delta p_j$ around a nominal steady state ($p_{j,nom}$, $y_{nom}$) using local ($\Delta y_{ss}^L$) versus global ($\Delta y_{ss}^G$) sensitivity analysis methods. (a) Approximation of the steady-state response using linear sensitivity analysis provides good estimates only locally around the nominal steady state. (b) For larger deviations, the steady-state response approximation by the linear sensitivity analysis deviates significantly from the true response. Global sensitivity analysis methods provide better approximations in this case.
8.3 Linear Sensitivity Analysis

In the following two sections we analyze the behavior of system (8.2) locally around its nominal solution. This allows one to make predictions for parameter perturbations that are close enough to their nominal values, or for systems in which the influence of the parameters does not change dramatically over the range of possible conditions. The approach is to perform a Taylor series expansion of $Nv(x, p)$ in $x$ and $p$ around the nominal solution $x_{nom}$, $p_{nom}$:

$$\frac{d\,\Delta x}{dt} = Nv(x_{nom}, p_{nom}) + N\left(\frac{\partial v(x_{nom}, p_{nom})}{\partial x}\,\Delta x + \frac{\partial v(x_{nom}, p_{nom})}{\partial p}\,\Delta p\right) + O(\Delta x, \Delta p)^2 \tag{8.6}$$

$$\Delta y = \frac{\partial h(x_{nom}, p_{nom})}{\partial x}\,\Delta x + \frac{\partial h(x_{nom}, p_{nom})}{\partial p}\,\Delta p + O(\Delta x, \Delta p)^2 \tag{8.7}$$
where $\Delta x = x - x_{nom}$, $\Delta y = y - y_{nom}$, and $\Delta p = p - p_{nom}$. One obtains the linearization of the system if terms of order two and higher are neglected:

$$\frac{d\,\Delta x}{dt} = A\Delta x + B\Delta p \tag{8.8}$$

$$\Delta y = C\Delta x + D\Delta p \tag{8.9}$$
where

$$A = N\frac{\partial v(x_{nom}, p_{nom})}{\partial x}, \quad B = N\frac{\partial v(x_{nom}, p_{nom})}{\partial p}, \quad C = \frac{\partial h(x_{nom}, p_{nom})}{\partial x}, \quad D = \frac{\partial h(x_{nom}, p_{nom})}{\partial p}$$
The linearization allows one to investigate the behavior of the nonlinear system (8.2) due to small/local parameter perturbations. For an asymptotically stable steady state, a straightforward calculation shows that a constant parameter deviation $\Delta p$ results in a steady-state shift $\Delta x_{ss}$ of

$$\Delta x_{ss} = -A^{-1}B\,\Delta p, \qquad \Delta y_{ss} = -CA^{-1}B\,\Delta p \tag{8.10}$$

The asymptotic stability of the steady state directly implies that $A$ is invertible and therefore (8.10) is well defined. Conservation relations, however, introduce zero eigenvalues and the system is then not strictly asymptotically stable. In practice it is therefore necessary to reduce the system before performing the analysis.

Definition 2 Linear Steady-State Sensitivity  The linear steady-state sensitivity of output $y(x, p)$ with respect to constant changes of parameter $p$ is defined as

$$S^L_{y,p} = \lim_{\Delta p \to 0}\frac{\Delta y_{ss}}{\Delta p} = C\frac{\partial x_{ss}}{\partial p} \tag{8.11}$$
The linear steady-state sensitivity $S^L$ can be used to approximate the steady-state response locally around a nominal steady state, as depicted in Figure 8.1(b).

A classical approach to sensitivity analysis is metabolic control analysis (MCA) (see [18] for a review). Two sensitivities are commonly used in MCA. First, the concentration response coefficient, which is equivalent to the steady-state sensitivity given in Definition 2. Second, the flux response coefficient, which measures the linear sensitivity of the rate vector at steady state with respect to the parameters (see [18]). Common in linear sensitivity analysis is the use of relative sensitivities, which measure the impact of a relative change of a parameter. To simplify the presentation, this chapter discusses only the unscaled case.

Remark 2  The analysis presented so far is restricted to the steady-state or asymptotic response. However, the transient or early response of a biochemical reaction network might be important when, for example, timing in the network is essential. In such cases, the linear time-varying sensitivity should be considered instead [12]; this allows one to draw conclusions about temporal sensitivity. If parameters are subjected to external time-varying perturbations, a frequency-domain view should be taken [15, 19].

Linear steady-state sensitivity analysis has become an important tool in the analysis of biochemical reaction networks, because it is well known, easy to use and apply even to large systems, and many further extensions are available. A disadvantage of this method is that it provides good estimates of the steady-state shift only for small parameter perturbations [Figure 8.1(b)]. A further disadvantage is that
(8.10) is only applicable to strictly asymptotically stable systems, which requires the elimination of conservation relations of the system before any analysis. In the next sections we propose two new methods that partially overcome some of the drawbacks of linear, steady-state sensitivity analysis.
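As an illustration of this section, the following sketch evaluates (8.10) and (8.11) for the (already reduced) three-state covalent modification model, with output y = [A*] and parameter k6. The Jacobians A and B are obtained by simple finite differences; the function covalent_modification and the nominal parameter values are taken from the earlier simulation sketch, and find_steady_state is an illustrative helper, not part of the chapter.

```python
# Linear steady-state sensitivity S^L = -C A^{-1} B, cf. (8.10)-(8.11),
# for the reduced covalent modification model with output y = [A*].
import numpy as np
from scipy.integrate import solve_ivp

def find_steady_state(p):
    sol = solve_ivp(covalent_modification, (0.0, 1e3), [1.0, 0.0, 0.0],
                    args=(p,), method='LSODA', rtol=1e-10, atol=1e-12)
    return sol.y[:, -1]

def jacobians(p, x_ss, eps=1e-7):
    f0 = covalent_modification(0.0, x_ss, p)
    n = len(x_ss)
    A = np.zeros((n, n))
    for i in range(n):                              # A = d(Nv)/dx at the steady state
        dx = np.zeros(n)
        dx[i] = eps * max(1.0, abs(x_ss[i]))
        A[:, i] = (covalent_modification(0.0, x_ss + dx, p) - f0) / dx[i]
    dk6 = eps * p['k6']                             # B = d(Nv)/dk6
    B = (covalent_modification(0.0, x_ss, dict(p, k6=p['k6'] + dk6)) - f0) / dk6
    return A, B

p_nom = dict(k1=1e5, k2=1.0, k3=1e3, k4=5e4, k5=1.0, k6=1e3)
x_ss = find_steady_state(p_nom)
A, B = jacobians(p_nom, x_ss)
C = np.array([0.0, 1.0, 0.0])                       # output y = [A*]
S_L = -C @ np.linalg.solve(A, B)
print("S^L of [A*] with respect to k6:", S_L)
```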
8.4 Sensitivity Analysis Via Empirical Gramians

The sensitivity measures presented so far explicitly depend on the linearization of the system under consideration. The expression of the steady-state shift according to (8.10) also requires inversion of the Jacobian A, which causes problems for non-asymptotically stable systems. We present a method to compute the steady-state sensitivity of the nonlinear system under consideration directly from transient simulation data. For the proposed method it is not necessary to linearize the model, to perform a matrix inversion, or to reduce the conservation relations beforehand.

Parametric sensitivity of a dynamic system can be analyzed using notions and methods that deal with input-output systems if one considers the parameters to be variable, possibly time-dependent inputs p(t), and the outputs y to be the variables of interest. A control theoretic notion that deals with the influence of the inputs on the states is controllability. Informally described, a system is controllable if all of its states can be changed arbitrarily by its inputs. The notion of observability deals, informally speaking, with the influence of the states on the outputs. The controllability and observability of a linear system can be quantified (in an L2-optimal sense) by the controllability and observability Gramians, which are introduced next.
8.4.1 Gramians and linear sensitivity analysis
How easily the input p(t) can influence the states can be quantified by the controllability Gramian $W_c$, whereas the observability Gramian $W_o$ is a measure of the state "energy" that is visible in the output. The two Gramians are defined as

$$W_c = \int_{-\infty}^{0} e^{-A\tau}BB^Te^{-A^T\tau}\,d\tau \quad\text{and}\quad W_o = \int_{0}^{\infty} e^{A^T\tau}C^TCe^{A\tau}\,d\tau$$
Clearly, both Gramians capture the asymptotic behavior of the system (since the integrals extend infinitely far into the past or future) as well as its transient behavior. The Gramians can be used for multiple purposes. For example, the analysis of the eigenvalues and corresponding eigenvectors of the controllability and observability Gramians reveals which directions of the system are best controllable and observable (in an L2-optimal sense). Directions corresponding to large eigenvalues indicate the directions that are most sensitive. In particular, states in the kernel of the controllability and observability Gramians are uncontrollable and unobservable, respectively.

However, analyzing controllability and observability separately gives distinct views of the importance of directions in the state space. For an input-output analysis such as sensitivity analysis, a combination of both is required. Moore [20] showed that a straightforward combination can be misleading: for example, the least observable states could be very controllable, so a small input signal could result in a nonnegligible output signal.
A combined view of observability and controllability is the cross Gramian [21, 22], which is defined for a single-input single-output system as

$$W_{co} = \int_{0}^{\infty} e^{A\tau}BCe^{A\tau}\,d\tau \tag{8.12}$$
The cross Gramian is not only related to the controllability and observability Gramians by its similar definition; for single-input single-output systems it also holds [23] that $W_{co}^2 = W_c W_o$, showing that the cross Gramian contains both the controllability and the observability Gramian. One can show that for a single-input single-output system the steady-state system gain with respect to a step input is proportional to the sum of the eigenvalues of the corresponding cross Gramian [22]. This relationship can be used to derive the following result (proof given in [24]), which relates linear steady-state sensitivity analysis to the cross Gramian and sets the stage for the development of the nonlinear expansions derived in the following section.

Theorem 1  The linear steady-state sensitivity of output $y(x, p) = Cx$ with respect to parameter $p$ is related to an appropriately chosen cross Gramian $W_{co}(p, C)$:

$$S^L_{y,p} = C\frac{\partial x_{ss}}{\partial p} = 2\,\mathrm{trace}\,W_{co}(p, C)$$

in which the matrix $W_{co}(p, C)$ is the cross Gramian for the input $p$ and the output $y = Cx$.

With this relation, it is possible to quantify, based on the controllability and observability notions, how the steady-state response of a particular output is influenced by a step perturbation of a particular parameter. Furthermore, it is straightforward to show that the cross Gramian contains as a special case the sensitivity covariance matrix introduced in [25]. The main limitation of the Gramians introduced so far is that they are only applicable and defined for linear systems. In the next section we derive an expansion of the Gramians to nonlinear systems that leads to a new sensitivity measure for nonlinear systems motivated by the linear case.
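Theorem 1 can be checked numerically for any asymptotically stable SISO linear system: the cross Gramian (8.12) is the solution of the Sylvester equation $AW_{co} + W_{co}A = -BC$, and twice its trace should coincide with the steady-state sensitivity $-CA^{-1}B$. The following short sketch (an illustration, not part of the chapter) performs this check for a randomly generated stable system.

```python
# Numerical check of Theorem 1 for a random asymptotically stable SISO system.
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
A -= (np.max(np.linalg.eigvals(A).real) + 1.0) * np.eye(n)  # shift to make A Hurwitz
B = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))

Wco = solve_sylvester(A, A, -B @ C)                # A Wco + Wco A = -B C, cf. (8.12)
gain = (-C @ np.linalg.solve(A, B)).item()         # steady-state sensitivity -C A^{-1} B
print(2.0 * np.trace(Wco), gain)                   # the two values should agree
```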
8.4.2 Empirical Gramians for nonlinear systems
As outlined earlier, the main idea for a new sensitivity measure for nonlinear systems is the consideration of parameters as inputs that lead to changes in the system and thus in the output of interest. This input-output behavior is analyzed using the concept of Gramians. This, however, requires a suitable expansion of the concept of the linear cross Gramian to the nonlinear case. In recent years significant progress has been made in deriving Gramians for nonlinear systems [26]. However, calculating the Gramians explicitly is still a difficult task. Empirical controllability and observability Gramians were suggested by [27] to overcome this problem. Here we follow this approach, derive a method to compute the empirical cross Gramian, and adapt it to sensitivity analysis. The idea is to derive the empirical cross Gramian by averaging the system behaviors that result from systematic parameter perturbations. The perturbations and the averaging are chosen in such a way that, first, the state space and parameter space of interest are
probed and sampled, and, second, the result reduces to the usual cross Gramian (8.12) when applied to a linear system. This ensures consistency between the usual Gramian of an asymptotically stable linear system and the empirical Gramian of the nonlinear system in the limit of infinitely small perturbations. First, we define the perturbations of the parameters and of the state space and the resulting deviations of the behavior from its temporal mean. Then, we define the empirical cross Gramian that will be used for sensitivity analysis.

Definition 3 Systematic System Perturbations  The considered perturbations of the parameter, $p^s(t) = p_{nom} + d_s\,\delta(t)$, are defined by the set

$$S^\sigma = \{d_1, \ldots, d_\sigma;\; d_s \in \mathbb{R},\; d_s > 0,\; s = 1, \ldots, \sigma\}$$

where $\delta(t)$ denotes the impulsive input (Dirac impulse). Perturbations of the initial conditions, $x_0^{rkj} = x_{0,nom} + c_k R_r e_j$, are given by the sets

$$R^\rho = \{R_1, \ldots, R_\rho;\; R_r \in \mathbb{R}^{n\times n},\; R_r^T R_r = I,\; r = 1, \ldots, \rho\}$$

$$K^\kappa = \{c_1, \ldots, c_\kappa;\; c_k \in \mathbb{R},\; c_k > 0,\; k = 1, \ldots, \kappa\}$$

$$E^n = \left\{e_1, \ldots, e_n;\; e_k \in \mathbb{R}^n,\; e_i^T e_j = \begin{cases}1, & i = j\\ 0, & i \neq j\end{cases},\; i, j, k = 1, \ldots, n\right\}$$

Let $\bar{u} = \frac{1}{t}\int_0^t u(\tau)\,d\tau$ denote the temporal mean of a function $u(t)$, and let $\Delta u(t) = u(t) - \bar{u}$ denote the deviation of $u(t)$ from its temporal mean $\bar{u}$. $\Phi_p^s(t)$ denotes the solution/behavior due to the parameter perturbation $p^s(t)$, and $\Phi_{x_0}^{rkj}(t)$ denotes the solution/behavior due to the perturbation $x_0^{rkj}$ of the initial condition with nominal parameters $p_{nom}$.

The set $S^\sigma$ defines scales for the parameter perturbations that are required to investigate the controllability component of the system. Perturbations of the state space are required to account for the observability component. The state space perturbations are parameterized by the sets $R^\rho$ and $K^\kappa$, which define orthogonal coordinate systems and different scales for each perturbed direction in the state space. The perturbation sets should be chosen such that the state and parameter domains of interest are covered. Using these sets and the collected data resulting from the perturbation simulations, a construction for empirical controllability and observability Gramians was proposed in [27]. The advantage of this construction is that the empirical Gramians reduce to the usual Gramians when applied to a linear (time-invariant) system. Sensitivity analysis requires one to take into account both controllability and observability at once. Next we introduce the empirical cross Gramian, which accounts for both.
8.4.3 A new sensitivity measure based on Gramians
We next show how the perturbations introduced in Definition 3 can be used to construct an empirical cross Gramian from simulation data.
Definition 4 Empirical Cross Gramian  Let $S^\sigma$, $R^\rho$, $K^\kappa$, and $E^n$ be given sets as in Definition 3. For system (8.2) with scalar input and scalar output, define the empirical cross Gramian $\hat{W}_{co}$ around the steady state $x_{0,nom}$ with corresponding nominal input $p_{nom}$ by

$$\hat{W}_{co} = \frac{1}{\sigma\kappa}\sum_{s=1}^{\sigma}\sum_{k=1}^{\kappa}\frac{\Psi^{srk}}{d_s c_k} \tag{8.14a}$$

where the entries of the $n \times n$ matrix $\Psi^{srk}$ are given for all $i, j = 1, \ldots, n$ by

$$\Psi^{srk}_{i,j} = \int_0^t e_i^T R_r^T\,\Delta\Phi_p^s(\tau)\;C\,\Delta\Phi_{x_0}^{rkj}(\tau)\,d\tau \tag{8.14b}$$
The definition of the empirical cross Gramian and the choice of the impulsive input may seem somewhat arbitrary and nonintuitive: the controllability component ($\Delta\Phi_p^s$) of the empirical cross Gramian is due to the different input perturbations, whereas the observability component ($\Delta\Phi_{x_0}^{rkj}$) is due to the perturbations of the initial conditions. This again shows that the cross Gramian includes both controllability and observability. Furthermore, the definition was constructed in such a way that the empirical cross Gramian falls back to the usual cross Gramian when applied to a linear asymptotically stable system.

Proposition 1  For any nonempty sets $R^\rho$, $K^\kappa$, and $S^\sigma$, the empirical cross Gramian $\hat{W}_{co}$ of an asymptotically stable linear system (8.8), (8.9) is equal to the usual cross Gramian $W_{co}$ [see (8.12)] for large integration times $t \to \infty$.

Proof: Due to the linearity of (8.8) and (8.9), $x_{nom} = 0$ and $p_{nom} = 0$. Thus, the solutions/behaviors due to the perturbations $p^s(t) = d_s\delta(t)$ and $x_0^{rkj} = c_k R_r e_j$ are given by

$$\Phi_{x_0}^{rkj} = e^{At}c_k R_r e_j \quad\text{and}\quad \Phi_p^s = \int_0^t e^{A\tau}B(d_s\delta(\tau))\,d\tau = e^{At}Bd_s$$

All perturbed trajectories converge to the origin, independently of $s$, $r$, $k$, and $j$. For long integration times, (8.14b) therefore simplifies to

$$\Psi^{srk}_{i,j} = \int_0^\infty e_i^T R_r^T\,(e^{A\tau}Bd_s)\;C\,(e^{A\tau}c_k R_r e_j)\,d\tau$$

and since $R_r$ is orthonormal, we obtain the desired result

$$\Psi^{srk} = d_s c_k \int_0^\infty e^{A\tau}BCe^{A\tau}\,d\tau$$

$$\hat{W}_{co} = \frac{1}{\sigma\kappa}\sum_{s,k}\frac{d_s c_k \int_0^\infty e^{A\tau}BCe^{A\tau}\,d\tau}{d_s c_k} = W_{co}$$
In Theorem 1 we have shown that the cross Gramian is directly related to the linear steady-state sensitivity. We employ this theorem to define an empirical sensitivity measure.

Definition 5 Empirical Cross Gramian-Based Sensitivity Measure  The empirical sensitivity is given by

$$S^E_{y,p} = 2\,\mathrm{trace}\,\hat{W}_{co}$$

where $\hat{W}_{co}$ is the empirical cross Gramian as defined in Definition 4.

Due to Proposition 1, the empirical sensitivity measure $S^E$ is identical to the linear steady-state sensitivity $S^L$ if the conditions of the proposition hold. The calculation of the sensitivity measure is straightforward to implement for nonlinear systems and no linearization is required. It is based on simulation data that is weighted in such a manner that it is consistent with the linear sensitivity of a linear system. While the equivalence to the linear sensitivity is expected in the limit of small perturbations, for larger perturbations better estimates of the steady-state shift, and thus less local statements, are expected. This approach might also be directly applicable to a wider system class than the one that can be handled by linearization. For instance, zero eigenvalues, such as in the case of conservation relations, do not pose any problems for the analysis. Thus, eliminating conservation relations beforehand is not necessary.
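As a rough illustration of how Definition 4 can be evaluated from simulation data, the sketch below computes an empirical cross Gramian for a model with a single scalar parameter treated as the input and a scalar output y = Cx. The interface f(t, x, p) with a scalar p, the rectangular-pulse approximation of the Dirac impulse, the single rotation matrix (rho = 1), and the rectangle-rule time integration are all simplifying assumptions of this sketch, not prescriptions of the chapter.

```python
# Sketch of the empirical cross Gramian (8.14) and the empirical sensitivity
# measure of Definition 5 for a model x' = f(t, x, p) with scalar parameter p.
import numpy as np
from scipy.integrate import solve_ivp

def trajectory(f, x0, p, t_grid):
    sol = solve_ivp(f, (t_grid[0], t_grid[-1]), x0, t_eval=t_grid,
                    args=(p,), method='LSODA', rtol=1e-8, atol=1e-10)
    return sol.y                                       # shape (n, len(t_grid))

def empirical_cross_gramian(f, x_ss, p_nom, C, d_set, c_set, t_grid,
                            dt_pulse=1e-3, R=None):
    n = len(x_ss)
    R = np.eye(n) if R is None else R                  # a single rotation (rho = 1)
    dt = t_grid[1] - t_grid[0]
    W = np.zeros((n, n))
    for d in d_set:                                    # controllability component
        def f_pulse(t, x, p, d=d):                     # p^s(t) ~ p_nom + d*delta(t)
            return f(t, x, p + (d / dt_pulse if t < dt_pulse else 0.0))
        Phi_p = trajectory(f_pulse, x_ss, p_nom, t_grid)
        dPhi_p = Phi_p - Phi_p.mean(axis=1, keepdims=True)
        for c in c_set:                                # observability component
            Psi = np.zeros((n, n))
            for j in range(n):
                Phi_x = trajectory(f, x_ss + c * R[:, j], p_nom, t_grid)
                dy = C @ (Phi_x - Phi_x.mean(axis=1, keepdims=True))
                Psi[:, j] = (R.T @ dPhi_p) @ dy * dt   # rectangle-rule integral (8.14b)
            W += Psi / (d * c)
    return W / (len(d_set) * len(c_set))               # cf. (8.14a)

# Empirical sensitivity measure (Definition 5): S_E = 2 * trace(W_hat_co)
# S_E = 2.0 * np.trace(empirical_cross_gramian(f, x_ss, p_nom, C, d_set, c_set, t_grid))
```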
8.4.4 Example: covalent modification system
Next we consider the example system (8.4) and analyze the properties of the empirical sensitivity measure. Figure 8.3 shows the values of the nonlinear empirical cross Gramian-based sensitivity measure for different development points (i.e., nominal parameter sets around which the perturbations are considered). As can be seen, the results obtained with the nonlinear empirical cross Gramians are largely consistent with the results obtained by linearization for all development points considered. This most probably results from the averaging over the various perturbations and initial conditions. While this is somewhat surprising, it does not limit the applicability of the method; the strength of the approach lies rather in its wider applicability with respect to the system class that can be considered, as explained in the previous section. Capturing the complete range of the influence of the parameters calls not for a method that averages the behavior over all possible parameter variations, but rather for an explicit bounding of the attainable behavior; this is outlined in the next section.
Nominal k6, [A*]      S^L        S^E
5e1,  0.989           -2e-5      -2.4e-5
5e2,  0.975           -4.8e-5    -6.3e-5
8e2,  0.945           -2.2e-4    -3e-4
9e2,  0.906           -7e-4      -9.2e-4
1e3,  0.654           -7e-3      -8.3e-3
50e3, 0.02            -8.2e-9    -8.7e-9
Figure 8.3 Numerical comparison of the empirical cross Gramian-based sensitivity measure with the linear steady-state sensitivity for the covalent modification example (a) and five different development points indicated by circles in (b). Perturbation sets used were $S^\sigma$: $d_s \in [10^{-4}, 10^{-3}]$, $K^\kappa$: $c_k \in [10^{-4}, 1]$, $R = I_n$.
8.5 Sensitivity Analysis Via Infeasibility Certificates

A common use of sensitivity measures is to estimate the steady-state shift that occurs due to parameter variations. However, as pointed out, due to the local nature of most sensitivity measures, the estimated steady-state shift is rarely valid for larger parameter perturbations. The aim of this section is to present an approach that computes reliable outer bounds on the steady states of the biochemical network under a specified parameter uncertainty and thus provides insight into the global influence that parameter changes have on the output or state.

Computing the set of steady states analytically is only possible in very rare cases. Even if an analytical solution for the steady state is known, computing the corresponding set for all possible parameter values may be difficult. Due to this difficulty, nondeterministic approaches are frequently used to solve this problem. Common tools for this kind of analysis are Monte Carlo methods, which are routinely applied in the analysis of uncertain biochemical reaction networks. However, Monte Carlo approaches to the problem at hand typically require that all of the possibly multiple steady states for specific parameter values can be computed explicitly, which is often a difficult task in itself.

The approach presented in this section avoids the direct computation of steady states by making use of the specific properties of biochemical reaction networks as given by (8.2). In particular, it is assumed that the fluxes are modeled using the law of mass action, that is, $v(x, p)$ takes the form

$$v_j(x, p) = p_j\prod_{k=1}^{n} x_k^{\sigma_{jk}} \tag{8.15}$$
for $j = 1, \ldots, m$. The constants $\sigma_{jk}$ are integers representing the stoichiometric coefficient of species $k$ taking part in the $j$th reacting complex. Note that an expansion to more complicated reaction schemes that are described by rational functions, such as Michaelis-Menten kinetics, is easily possible.

The problem under consideration can be formulated as follows. Given a set $P \subset \mathbb{R}^l$ in parameter space, we aim to compute a set $X_s^* \subset \mathbb{R}^n$ that consists of all steady states that
the system (8.2) can attain for parameter values taken from $P$. Mathematically, this is written as

$$X_s^* = \left\{x \in \mathbb{R}^n \;\middle|\; \exists p \in P:\; Nv(x, p) = 0\right\} \tag{8.16}$$
However, $X_s^*$ can rarely be computed directly. Instead, we are looking for an outer bound $X_s \supset X_s^*$ which should be as tight as possible. In order to search for sets of steady states for a given parameter set $P$, we need means to test whether a candidate solution $\hat{X}_s$ obtained in such a search is actually valid or not.
Such a test is readily formulated as a feasibility problem. Moreover, it will turn out that the Lagrangian dual for this feasibility problem allows one to certify given regions in state space as not containing a steady state for any parameter value from the set P. This information can be used to implement an algorithm that constructs outer bounds on the region Xs* of all steady states. For ease of presentation, only hyper-rectangles in the state and parameter space are considered for the sets Xs and P, although the results are readily extended to convex polytopes in general.
8.5.1 Feasibility problem and semidefinite relaxation
The problem of testing whether a given hyper-rectangle $X_s$ in state space contains steady states of the system (8.2), for some parameter values in a given hyper-rectangle $P$ in parameter space, can be formulated as the following feasibility problem [28]:

$$\begin{aligned}
\text{find} \quad & x \in \mathbb{R}^n,\; p \in \mathbb{R}^l\\
\text{s.t.} \quad & Nv(x, p) = 0\\
& p_{j,\min} \le p_j \le p_{j,\max}, \quad j = 1, \ldots, l\\
& x_{i,\min} \le x_i \le x_{i,\max}, \quad i = 1, \ldots, n
\end{aligned} \tag{8.17}$$

The feasibility problem (8.17) can be relaxed to a semidefinite program as follows. In the first step, construct a vector $\xi$ containing the monomials that occur in the reaction flux vector $v(x, p)$ [29]. In the special case where no single reaction has more than two reagents, a starting point for the construction of $\xi$ is

$$\xi^T = (1,\, p_1, \ldots, p_l,\, x_1, \ldots, x_n,\, p_1 x_1, \ldots, p_j x_i, \ldots, p_l x_n)$$

which can usually be reduced by eliminating components that are not required to represent the reaction fluxes. Define $k$ such that $\xi \in \mathbb{R}^k$. Note that this approach is not limited to second-order reaction networks; in more general cases, one has to extend the vector $\xi$ by monomials that are products of several state variables. Using the vector $\xi$, the elements of the flux vector $v(x, p)$ can be expressed as
$$v_j(x, p) = \xi^T V_j \xi, \quad j = 1, \ldots, m \tag{8.18}$$

where $V_j \in S^k$ is a constant symmetric matrix. Using (8.18), the system (8.2) can be written as

$$\dot{x}_i = \xi^T Q_i \xi, \quad i = 1, \ldots, n \tag{8.19}$$

where $Q_i = \sum_{j=1}^{m} N_{ij}V_j \in S^k$ are constant symmetric matrices.
The original feasibility problem (8.17) is thus equivalent to the problem

$$\begin{aligned}
\text{find} \quad & \xi \in \mathbb{R}^k\\
\text{s.t.} \quad & \xi^T Q_i \xi = 0, \quad i = 1, \ldots, n\\
& B\xi \ge 0\\
& \xi_1 = 1
\end{aligned} \tag{8.20}$$

where the matrix $B \in \mathbb{R}^{(2k-2) \times k}$ is constructed to cover the inequality constraints in (8.17). A relaxation to a semidefinite program is found by setting $X = \xi\xi^T$. The resulting nonconvex constraint $\mathrm{rank}\, X = 1$ is omitted in the relaxation. The relaxed version of the original feasibility problem (8.17) is thus obtained as

$$\begin{aligned}
\text{find} \quad & X \in S^k\\
\text{s.t.} \quad & \mathrm{trace}(Q_i X) = 0, \quad i = 1, \ldots, n\\
& \mathrm{trace}(e_1 e_1^T X) = 1\\
& BXe_1 \ge 0\\
& BXB^T \ge 0\\
& X \text{ positive semidefinite}
\end{aligned} \tag{8.21}$$

where $e_1 = (1, 0, \ldots, 0)^T \in \mathbb{R}^k$. The basic relationship between the original problem (8.17) and the relaxed problem (8.21) is that if the original problem is feasible, then the relaxed problem is also feasible.
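For orientation, the relaxation (8.21) can be posed almost verbatim with a modeling tool such as CVXPY. In the sketch below the matrices Q_i and B are assumed to have been built beforehand for the network at hand (they are passed in as plain NumPy arrays); the function name and the elementwise reading of the inequality constraints are choices of this sketch, not of the chapter.

```python
# Sketch of the relaxed feasibility problem (8.21) using CVXPY.
import cvxpy as cp
import numpy as np

def relaxed_feasibility(Q_list, B):
    k = Q_list[0].shape[0]
    X = cp.Variable((k, k), symmetric=True)
    e1 = np.zeros(k); e1[0] = 1.0
    constraints = [X >> 0,                               # X positive semidefinite
                   cp.trace(np.outer(e1, e1) @ X) == 1]  # trace(e1 e1^T X) = 1
    constraints += [cp.trace(Q @ X) == 0 for Q in Q_list]
    constraints += [B @ X @ e1 >= 0,                     # B X e1 >= 0
                    B @ X @ B.T >= 0]                    # B X B^T >= 0 (elementwise)
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve()
    return prob.status

# If the relaxation is reported infeasible, the considered box in state and
# parameter space cannot contain a steady state; Theorem 2 below is the dual
# view of exactly this test.
```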
8.5.2 Infeasibility certificates from the dual problem
The Lagrange dual problem can be used to certify infeasibility of the primal problem (8.21). First, the Lagrangian function $L$ is constructed for the primal problem. We obtain

$$L(X, \lambda_1, \lambda_2, \lambda_3, \nu) = -\lambda_1^T BXe_1 - \mathrm{trace}(\lambda_2^T BXB^T) - \mathrm{trace}(\lambda_3^T X) + \sum_{i=1}^{n}\nu_i\,\mathrm{trace}(Q_i X) + \nu_{n+1}\left(\mathrm{trace}(e_1 e_1^T X) - 1\right)$$

where $\lambda_1 \in \mathbb{R}^{2k-2}$, $\lambda_2 \in S^{2k-2}$, $\lambda_3 \in S^k$, and $\nu \in \mathbb{R}^{n+1}$. Based on the Lagrangian $L$, the dual problem is obtained as

$$\max_{\lambda_1, \lambda_2, \lambda_3, \nu}\;\inf_{X \in S^k}\; L(X, \lambda_1, \lambda_2, \lambda_3, \nu) \qquad \text{s.t.}\quad \lambda_1 \ge 0,\; \lambda_2 \ge 0,\; \lambda_3 \text{ positive semidefinite}$$
which is equivalent to
$$\begin{aligned}
\max \quad & \nu_{n+1}\\
\text{s.t.} \quad & B^T\lambda_2 B + e_1\lambda_1^T B + B^T\lambda_1 e_1^T + \lambda_3 + \sum_{i=1}^{n}\nu_i Q_i + \nu_{n+1}e_1 e_1^T = 0\\
& \lambda_1 \ge 0,\; \lambda_2 \ge 0,\; \lambda_3 \ge 0
\end{aligned} \tag{8.22}$$

It is a standard procedure in convex optimization to use the dual problem in order to find a certificate that guarantees infeasibility of the primal problem [30]. For the problem at hand, this principle is formulated in Theorem 2 [31].

Theorem 2  If the dual problem (8.22) has a feasible solution with $\nu_{n+1} > 0$, then the primal problem (8.17) is infeasible.
8.5.3 Algorithm to bound feasible steady states
In this section, an algorithm to find outer bounds on the steady-state region $X_s$, based on the results obtained in the previous section, is presented. As a basic additional requirement, assume that some upper and lower bounds on the steady states are already known from other means. Let these bounds be given by

$$x_{i,lower} \le x_i \le x_{i,upper}, \quad i = 1, \ldots, n \tag{8.23}$$
In biochemical reaction networks, such bounds typically follow straightforwardly from conservation relationships or from positive invariance of a suitably large region in state space. These bounds may be very loose, though, and the main objective of the presented method is to tighten them as far as possible. To this end, a bisection algorithm that finds the maximum ranges $[x_{j,lower}, x_{j,min}]$ and $[x_{j,max}, x_{j,upper}]$ for which infeasibility can be proven via Theorem 2 is used. The algorithm iterates over $j = 1, \ldots, n$, while the steady-state values $x_i$ for $i \neq j$ are assumed to be located within the interval given by inequality (8.23). For illustration, pseudo-code for computing the lower bound $x_{1,min}$ is given; computation of the upper bound $x_{1,max}$ works in a very similar way.

Algorithm 1 Lower bound maximization by bisection
up_guess ← x1,upper, lo_guess ← x1,lower
next_x1 ← x1,upper
while (up_guess − lo_guess) ≥ tolerance
    use constraint x1,lower ≤ x1 ≤ next_x1
    solve semidefinite program (8.22)
    if optimal value of (8.22) is infinite
        lo_guess ← next_x1
        increase next_x1 by 1/2(up_guess − next_x1)
    else
        up_guess ← next_x1
        decrease next_x1 by 1/2(next_x1 − lo_guess)
    endif
endwhile
x1,min ← lo_guess
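A direct transcription of Algorithm 1 into Python might look as follows. Here certify_infeasible(lo, hi) stands for a hypothetical helper that solves the dual SDP (8.22) with the additional constraint lo ≤ x1 ≤ hi and reports whether the optimal value is unbounded (i.e., some ν_{n+1} > 0 is attainable); how that helper is implemented depends on the SDP solver used and is not specified here.

```python
# Bisection for the certified lower bound x1_min (Algorithm 1).
def lower_bound_by_bisection(x1_lower, x1_upper, certify_infeasible, tol=1e-3):
    up_guess, lo_guess = x1_upper, x1_lower
    next_x1 = x1_upper
    while up_guess - lo_guess >= tol:
        if certify_infeasible(x1_lower, next_x1):
            # no steady state can have x1 in [x1_lower, next_x1]
            lo_guess = next_x1
            next_x1 += 0.5 * (up_guess - next_x1)
        else:
            up_guess = next_x1
            next_x1 -= 0.5 * (next_x1 - lo_guess)
    return lo_guess          # certified lower bound x1_min
```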
Due to the availability of efficient solvers for semidefinite programs and the use of bisection to maximize the interval that is certified as infeasible, Algorithm 1 runs reasonably fast on standard desktop computers. Algorithm 1 is run for all state variables to obtain a hyper-rectangle in state space containing all steady states for the assumed parameter ranges. As demonstrated next, this is relevant information for global sensitivity analysis and allows one to draw conclusions on (parametric) steady-state sensitivity.

The outlined approach for outer bounding the set of all feasible steady states for sets of parameters can be expanded to more general system classes, including systems described by discrete variables such as switching genetic parts. Also, expansions to more general nonlinear systems are possible; for details of these expansions see [32]. Furthermore, based on similar considerations, a general framework for model invalidation, parameter and state estimation, as well as input selection for experimental design for nonlinear systems, specifically biochemical reaction networks, can be derived [33, 34]. In this approach the experimental data is allowed to be available as possibly sparse, uncertain, but (set-)bounded measurements of inputs and outputs. All infeasibility-based approaches have in common that, instead of checking (possibly) many separate points, which might lead to nonconclusive answers, they allow one to check whole parameter and state regions for feasibility.
8.5.4 Example: covalent modification system
As an example, let us compute a steady-state region for the covalent modification scheme (8.3). From the conservation relations and the positive invariance of the positive orthant, we have the steady-state bounds $0 \le [A], [A^*] \le A_{tot}$ and $0 \le [C_1] \le E_{1,tot}$, which are valid for any parameter values. The previously discussed analysis method can be applied to find tighter bounds on the possible steady-state values under specified parameter uncertainties. First, we consider an example where $k_6$ is uncertain between 100 and a varying maximum value. The resulting lower and upper bounds on $[A^*]$ are shown in Figure 8.4.

Figure 8.4 Global sensitivity analysis of the covalent modification example. Upper and lower bounds (black lines) on the steady state of $[A^*]$ for the parameter set $100 \le k_6 \le k_{6,max}$.

As an example of multidimensional parameter uncertainty sets, consider the three uncertainty regions $P_1, P_2, P_3 \subset \mathbb{R}^4$ given by:

• $(k_2, k_3, k_5, k_6) \in P_1 \Leftrightarrow 0.98\,k_{i,nom} \le k_i \le 1.02\,k_{i,nom}$;
• $(k_2, k_3, k_5, k_6) \in P_2 \Leftrightarrow 0.9\,k_{i,nom} \le k_i \le 1.1\,k_{i,nom}$;
• $(k_2, k_3, k_5, k_6) \in P_3 \Leftrightarrow 0.5\,k_{i,nom} \le k_i \le 2\,k_{i,nom}$;

with $i = 2, 3, 5, 6$ in all three cases. Certified upper and lower bounds on $[A^*]$ for this case are given in Table 8.1, together with "inner" bounds on the steady-state set obtained by explicit computation of the steady state for randomly chosen parameter samples. As can be seen from the results, our approach is able to find tight intervals for the steady-state values in all three cases. Note that the absolute bound on the output variation subject to the various parameters easily shows which parameter is the most sensitive.

Table 8.1 Upper (ub) and Lower (lb) Bounds on [A*] for Several Parameter Uncertainty Sets (as given in the text), with Comparison to Extremal Values Taken from 1,000 Random Parameter Samples (mc)

       [A*] lb    [A*] mc,min    [A*] mc,max    [A*] ub
P1     0.356      0.363          0.823          0.838
P2     0.094      0.096          0.936          0.946
P3     0.013      0.013          0.980          0.984
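The "mc" columns of Table 8.1 can be approximated with a simple sampling experiment: draw parameter sets from the uncertainty box, compute the corresponding steady state of (8.4) by long-time integration, and record the extreme values of [A*]. The sketch below does this with uniform sampling; the sampling distribution, and the reuse of the steady_state_A_star helper from the earlier simulation sketch, are assumptions of this illustration.

```python
# Monte Carlo "inner" bounds on the steady state of [A*] for a parameter box.
import numpy as np

rng = np.random.default_rng(0)
k_nom = dict(k1=1e5, k2=1.0, k3=1e3, k4=5e4, k5=1.0, k6=1e3)

def mc_inner_bounds(lo, hi, n_samples=1000):
    values = []
    for _ in range(n_samples):
        p = dict(k_nom)
        for name in ('k2', 'k3', 'k5', 'k6'):           # the uncertain parameters
            p[name] = k_nom[name] * rng.uniform(lo, hi)
        values.append(steady_state_A_star(p))           # helper from the earlier sketch
    return min(values), max(values)

# P1: (0.98, 1.02), P2: (0.9, 1.1), P3: (0.5, 2.0)
print(mc_inner_bounds(0.5, 2.0))
```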
8.6 Discussion and Outlook

Sensitivity analysis is a useful tool for the analysis of mathematical models of biological systems. Commonly, linear, local methods are used to analyze sensitivity with respect to parameter perturbations. It was shown in this chapter that local sensitivity methods might lead to wrong conclusions if the system is highly nonlinear and if large variations in the parameters are considered.

To overcome the problem of locality, we outlined two new methods for steady-state sensitivity analysis. The first method is based on an input-output view of the sensitivity question: parameters are considered as inputs, and variables of interest are considered as outputs. The sensitivity question is then answered by an extension of the concept of linear cross Gramians to nonlinear systems, the empirical cross Gramian approach. This method allows one to consider a wider class of systems and to derive sensitivity statements based on simulations. Further work will focus on the expansion of the Gramian-based approach to the question of nonstationary sensitivity analysis.

A reformulation of the original question of parametric sensitivity as the question of outer approximating the range of possible steady states under parameter uncertainties sets the basis for the second approach. Since an outer approximation of the possible steady states is obtained, the method is nonlocal (i.e., global in nature). In the case of mass action kinetics it was shown that one can find rather close outer bounds on the feasible region using infeasibility certificates. Once the region of possible steady states is found, one has direct insight into the most sensitive outputs for the considered parameter uncertainty. Future research focuses on the expansion of the global sensitivity method to a richer class of systems and on the application of the ideas to model validation and parameter/state estimation.
References

[1] Le Novère, N., B. Bornstein, A. Broicher, M. Courtot, M. Donizelli, H. Dharuri, L. Li, H. Sauro, M. Schilstra, B. Shapiro, J.L. Snoep, and M. Hucka, "BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems," Nucleic Acids Res., Vol. 34, 2006, pp. D689–D691. See also http://www.ebi.ac.uk/biomodels/, last visited May 25, 2008.
[2] Varma, A., M. Morbidelli, and H. Wu, Parametric Sensitivity in Chemical Systems, Cambridge, U.K.: Cambridge University Press, 1999.
[3] Cornish-Bowden, A., Fundamentals of Enzyme Kinetics, 3rd ed., London, U.K.: Portland Press, 2004.
[4] Saez-Rodriguez, J., A. Kremling, and E.D. Gilles, "Dissecting the puzzle of life: Modularization of signal transduction networks," Computers and Chemical Engineering, Vol. 29, 2005, pp. 619–629.
[5] Klipp, E., R. Herwig, A. Kowald, C. Wierling, and H. Lehrach, Systems Biology in Practice: Concepts, Implementation and Application, Weinheim: Wiley-VCH, 2005.
[6] Kacser, H., and J.A. Burns, "The control of flux," Symposia Society for Experimental Biology, Vol. 27, 1973, pp. 65–104.
[7] Fell, D.A., "Metabolic control analysis: a survey of its theoretical and experimental development," Biochemical J., Vol. 286, 1992, pp. 313–330.
[8] Feng, X.J., S. Hooshangi, D. Chen, G. Li, R. Weiss, and H. Rabitz, "Optimizing genetic circuits by global sensitivity analysis," Biophys. J., Vol. 87, No. 4, 2004, pp. 2195–2202.
[9] Robert, C.P., and G. Casella, Monte Carlo Statistical Methods, New York: Springer Verlag, 2004.
[10] Alves, R., and M.A. Savageau, "Systemic properties of ensembles of metabolic networks: application of graphical and statistical methods to simple unbranched pathways," Bioinformatics, Vol. 16, No. 6, 2000, pp. 534–547.
[11] Keener, J., and J. Sneyd, Mathematical Physiology, 2nd ed., Vol. 8 of Interdisciplinary Applied Mathematics, New York: Springer-Verlag, 2001.
[12] Ingalls, B.P., and H.M. Sauro, "Sensitivity analysis of stoichiometric networks: an extension of metabolic control analysis to non-steady state trajectories," J. Theor. Biol., Vol. 222, 2003, pp. 23–36.
[13] Goldbeter, A., and D.E. Koshland, "An amplified sensitivity arising from covalent modification in biological systems," Proc. Natl. Acad. Sci. USA, Vol. 78, No. 11, November 1981, pp. 6840–6844.
[14] Saltelli, A., M. Ratto, S. Tarantola, and F. Campolongo, "Sensitivity analysis practices: Strategies for model-based inference," Reliability Engineering & System Safety, Vol. 91, 2006, pp. 1109–1125.
[15] Ingalls, B.P., "A frequency domain approach to sensitivity analysis of biochemical systems," Journal of Physical Chemistry B, Vol. 108, 2004.
[16] Cascante, M., A. Sorribas, R. Franco, and E.I. Canela, "Biochemical systems theory: increasing predictive power by using second-order derivatives measurements," J. Theor. Biol., Vol. 149, No. 4, April 1991, pp. 521–535.
[17] Streif, S., R. Findeisen, and E. Bullinger, "Sensitivity analysis of biochemical reaction networks by bilinear approximation," Proc. of the Foundations of Systems Biology in Engineering (FOSBE), Stuttgart, Germany, September 2007, pp. 521–526.
[18] Hofmeyr, J.-H.S., "Metabolic control analysis in a nutshell," Proc. International Conference on Systems Biology, Pasadena, CA, November 2000, pp. 291–300.
[19] Yi, T.-M., B.W. Andrews, and P.A. Iglesias, "Control analysis of bacterial chemotaxis signaling," Methods Enzymol., Vol. 422, 2007, pp. 123–140.
[20] Moore, B.C., "Principal component analysis in linear systems: Controllability, observability, and model reduction," IEEE Trans. Autom. Control, Vol. 26, No. 1, 1981, pp. 17–32.
[21] Fernando, K.V., and H. Nicholson, "Stability assessment of two-dimensional state-space systems," IEEE Trans. Circ. Syst., Vol. 32, No. 5, 1985.
[22] Fernando, K.V., and H. Nicholson, "On the structure of balanced and other principle representations of SISO systems," IEEE Trans. Autom. Control, Vol. 28, No. 2, 1983, pp. 228–231.
[23] Laub, A.J., L.M. Silverman, and M. Verma, "A note on cross-Grammians for symmetric realizations," Proceedings of the IEEE, Vol. 71, No. 7, 1983, pp. 904–905.
[24] Streif, S., R. Findeisen, and E. Bullinger, "Relating cross Gramian and sensitivity analysis in systems biology," Proc. of the Mathematical Theory of Networks and Systems (MTNS), Kyoto, Japan, 2006, pp. 437–442.
[25] Sun, C., and J. Hahn, "Parameter reduction for stable dynamical systems based on Hankel singular values and sensitivity analysis," Chemical Engineering Science, Vol. 61, No. 16, 2006, pp. 5393–5403.
[26] Fujimoto, K., and J.M.A. Scherpen, "Nonlinear balanced realization based on singular value analysis of Hankel operators," Proc. 42nd IEEE Conference on Decision and Control, Vols. 1–6, 2003, pp. 6072–6077.
[27] Lall, S., J.E. Marsden, and S. Glavaski, "A subspace approach to balanced truncation for model reduction of nonlinear control systems," International Journal of Robust and Nonlinear Control, Vol. 12, 2002, pp. 519–535.
[28] Kuepfer, L., U. Sauer, and P. Parrilo, "Efficient classification of complete parameter regions based on semidefinite programming," BMC Bioinformatics, Vol. 8, No. 1, January 12, 2007.
[29] Parrilo, P.A., "Semidefinite programming relaxations for semialgebraic problems," Mathematical Programming, Vol. 96, No. 2, May 2003, pp. 293–320.
[30] Boyd, S., and L. Vandenberghe, Convex Optimization, Cambridge, U.K.: Cambridge University Press, 2004.
[31] Waldherr, S., R. Findeisen, and F. Allgöwer, "Global sensitivity analysis of biochemical reaction networks via semidefinite programming," Proc. of the 17th IFAC World Congress, Seoul, Korea, 2008, pp. 9701–9706.
[32] Hasenauer, J., P. Rumschinski, S. Waldherr, S. Borchers, F. Allgöwer, and R. Findeisen, "Guaranteed steady-state bounds for uncertain chemical processes," Proc. Int. Symp. Adv. Control of Chemical Processes (ADCHEM'09), 2009.
[33] Borchers, S., P. Rumschinski, S. Bosio, R. Weismantel, and R. Findeisen, "Model discrimination and parameter estimation via infeasibility certificates for dynamical biochemical reaction networks," Proc. of the 7th MATHMOD Conference, 2009.
[34] Borchers, S., P. Rumschinski, S. Bosio, R. Weismantel, and R. Findeisen, "Model invalidation and system identification of biochemical reaction networks," Proc. 16th IFAC Symposium on Identification and System Parameter Estimation (SYSID 2009), 2009.
[35] Stelling, J., E.D. Gilles, and F.J. Doyle, "Robustness properties of circadian clock architectures," Proc. Natl. Acad. Sci. USA, Vol. 101, No. 36, September 2004, pp. 13210–13215.
[36] del Rosario, R.C.H., F.W. Staudinger, S. Streif, F. Pfeiffer, E. Mendoza, and D. Oesterhelt, "Modelling the Halobacterium salinarum mutant: sensitivity analysis allows choice of parameter to be modified in the phototaxis model," IET Systems Biology, Vol. 1, No. 4, 2007, pp. 207–221.
CHAPTER 9
Determining Metabolite Production Capabilities of Saccharomyces Cerevisiae Using Dynamic Flux Balance Analysis

Jared L. Hjersted and Michael A. Henson

Department of Chemical Engineering, University of Massachusetts, Amherst, MA 01003-9303; phone: 413-545-3481; fax: 413-545-1647; e-mail: [email protected]
Abstract
Dynamic flux balance analysis (DFBA) is a computational approach for analyzing and engineering cellular behavior in dynamic culture environments that predominate in batch and fed-batch biochemical reactors. The basic element of DFBA is a dynamic flux balance model that combines stoichiometric mass balances on intracellular metabolites with dynamic mass balances on extracellular species through substrate uptake kinetics and the cellular growth rate. The development of customized computational tools allows DFBA to address a wide variety of problems in metabolic network analysis and design, including the dynamic simulation of batch and fed-batch bioreactors, the dynamic optimization of fed-batch operating policies, and the in silico design of metabolite overproduction mutants for batch and fed-batch cultures. We focus on the development and application of DFBA techniques for the yeast Saccharomyces cerevisiae.
Key terms
Metabolic models
Flux balance analysis
Dynamic optimization
Metabolic engineering
Batch and fed-batch culture
Saccharomyces cerevisiae
9.1 Introduction

The availability of stoichiometric models of cellular metabolism has enabled the development of computational algorithms for the analysis and design of complex metabolic networks. A stoichiometric model consists of a linear system of flux balance equations that relate metabolic species to their intracellular fluxes through a reaction network [1, 2]. Typically the network contains more unknown fluxes than balanced intracellular species, and the linear system is underdetermined. In flux balance analysis (FBA), the fluxes are resolved by solving a linear programming problem formulated under the assumption that the cell optimally utilizes available resources. FBA has been used extensively for predicting cellular growth and product secretion patterns in microbial systems [3–5]. Extensions of classical FBA allow the redesign of metabolic networks for the overproduction of desired metabolites through gene deletions and insertions, which are implemented by removing or adding intracellular reactions to the network. These computational methods provide metabolic engineering targets that are experimentally testable. In a study with the yeast Saccharomyces cerevisiae, the growth phenotypes of knockout mutants were predicted with a 70%–80% success rate by constraining fluxes associated with these genes [6]. Several computational studies of gene manipulations for metabolite overproduction also have been presented [7–9].

Large-scale production of biotechnological products often is performed in batch and fed-batch bioreactors. An important advantage of fed-batch operation is that substrate levels can be varied transiently to achieve favorable trade-offs between the cellular growth and metabolite production rates. Fed-batch fermentation of S. cerevisiae is an important technology for producing metabolic products such as ethanol [10–12]. However, classical FBA methods assume time-invariant extracellular conditions and generate steady-state predictions consistent with continuous bioreactor operation. Although batch culture experiments often are used to evaluate FBA predictions, the results are strictly valid only for the balanced growth phase.

An alternative approach is to perform metabolic network analysis and design using dynamic extensions of stoichiometric models and classical FBA. Dynamic flux balance models [13–16] are obtained by combining stoichiometric equations for intracellular metabolism with dynamic mass balances on extracellular substrates and products under the assumption that intracellular metabolite concentrations equilibrate rapidly in response to extracellular perturbations [2]. The intracellular and extracellular descriptions are coupled through the cellular growth rate and substrate uptake kinetics, which can be formulated to include regulatory effects such as product inhibition of growth. Therefore, dynamic flux balance models allow the prediction of cellular behavior as the extracellular environment changes with time. Batch culture simulations with dynamic flux balance models have shown good agreement with experimental data [13–15]. Dynamic flux balance analysis (DFBA) refers to computational algorithms in which dynamic flux balance models are used for metabolic network analysis and design.

Dynamic flux balance modeling offers important advantages over alternative transient modeling frameworks.
Because simple unstructured models rely on phenomenological descriptions of cell growth and constant yield coefficients [17], they have limited predictive capability and cannot account for genetic alterations. Metabolic engineering applications of structured kinetic models [18, 19], log-linear kinetic models [20], and cybernetic models [21, 22] are often limited by the lack of parameter values for in vivo enzyme kinetics. Dynamic flux balance modeling provides a practical alternative
for incorporating intracellular structure. Given the availability of a steady-state flux balance model, only a small number of additional parameters are needed to account for the substrate uptake kinetics. On the other hand, a well-documented weakness of classical and dynamic FBA is the difficulty associated with incorporating cellular regulation. This problem has been partially addressed by using gene expression data to constrain regulated fluxes within the metabolic network [23, 24]. DFBA offers the additional possibility of formulating substrate uptake kinetics to account for known regulatory processes. In this chapter, the basic elements of DFBA are presented and illustrated through applications to Saccharomyces cerevisiae dynamic simulation, fed-batch optimization, and in silico metabolic engineering. Following a discussion of stoichiometric modeling, the computational underpinnings of classical and dynamic FBA are discussed. The application of DFBA is illustrated by performing dynamic simulations of fed-batch cultures, dynamic optimization of fed-batch operating policies for ethanol production, and in silico design of ethanol overproduction mutants for pure and mixed substrates.
9.2 Methods

9.2.1 Stoichiometric models of cellular metabolism
Both classical and dynamic flux balance analysis are based on stoichiometric cell models that mathematically represent the biochemical reactions in a metabolic network. A stoichiometric model contains all possible paths from the externally supplied substrates to the biomass constituents and metabolic products [5, 25]. The essential information required to construct a stoichiometric model is a list of participating biochemical species (metabolites), a list of the relevant intracellular reactions involving these species, and the stoichiometric coefficients for every species in each reaction. The intracellular reaction rates are called fluxes and are unknown variables determined from mathematical solution of the stoichiometric model. Also treated as fluxes are the metabolite transport rates across the cell membrane for extracellular substrates (uptake rates) and for secreted metabolic products (secretion rates). Typically substrate uptake rates are known model input variables, while product secretion rates are unknown model output variables calculated along with the intracellular reaction rates. Unless experimental evidence indicates otherwise, each intracellular metabolite is assumed to exhibit negligible accumulation such that the fluxes producing the metabolite must be balanced by the fluxes consuming the metabolite:

a^T v = 0
(9.1)
where v is an n-dimensional column vector containing all the fluxes, with typical units of millimole of metabolite per gram dry weight of biomass per hour (mmol/gDW/h), and a^T is an n-dimensional row vector containing the stoichiometric coefficients of the balanced metabolite for the corresponding reactions. Stoichiometric coefficients are usually positive for reactions that produce the metabolite and negative for reactions that consume the metabolite. Because a single metabolic reaction typically involves a small number of metabolites, most of the coefficients in the a^T vector are zero. The individual stoichiometric equations (9.1) can be gathered to form a matrix equation of the form:
Av = 0
(9.2)
where A is the stoichiometric matrix with m rows corresponding to the number of balanced metabolites and n columns corresponding to the number of fluxes. The matrix entry aij in row i and column j is the stoichiometry of the ith species participating in the jth reaction. Given a set of known fluxes vm obtained either by measurement or specification, the matrix A can be partitioned in terms of the remaining unknown fluxes vc to yield

A_c v_c + A_m v_m = 0  →  A_c v_c = −A_m v_m ≡ b
(9.3)
where Ac and Am are appropriately dimensioned submatrices of A and b is a known column vector of appropriate dimension. As discussed in the following section, the existence and uniqueness of solutions to (9.3) can be determined from properties of the matrix Ac and the vector b. A large number of stoichiometric cell models have been constructed for organisms ranging in complexity from bacteria to mammals [26–28]. These models often are differentiated according to the amount of genomic information utilized in their development. Prior to the wide availability of complete genomic sequences, stoichiometric models were developed from knowledge of metabolic pathways and cellular physiology without regard to the genes involved in the synthesis of enzymes that catalyze the intracellular reactions [29–31]. These small-scale pathway models typically describe primary carbon metabolism and include a lumped description of biomass constituent formation with 100 or fewer metabolites and reactions. More recently, genome-scale stoichiometric models that attempt to account for all known gene-protein-reaction associations have been developed for various organisms [26, 27, 32, 33]. With the increasing availability of genome-scale models, the choice of an appropriate stoichiometric model is often determined by the intended application. We have found that small-scale metabolic models are more computationally efficient when integrated into mathematical programming strategies such as those developed for bioreactor optimization (Section 9.3.3). On the other hand, genome-scale models are more suitable for analysis of metabolic engineering strategies because the gene-protein-reaction associations facilitate the in silico implementation of genetic manipulations such as gene knockouts and gene insertions (Section 9.3.4).
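As a concrete illustration of (9.2) and (9.3), the short sketch below builds a small hypothetical stoichiometric matrix and partitions it into measured and unknown fluxes. Everything in it (the two-metabolite network, the flux labels, and the use of NumPy) is invented for illustration and is not taken from the chapter.

```python
import numpy as np

# Hypothetical toy network: 2 balanced intracellular metabolites (rows) and
# 4 fluxes (columns): v1 = uptake, v2 and v3 = intracellular, v4 = secretion.
A = np.array([
    [1.0, -1.0, -1.0,  0.0],   # metabolite M1: produced by v1, consumed by v2 and v3
    [0.0,  1.0,  1.0, -1.0],   # metabolite M2: produced by v2 and v3, consumed by v4
])

measured = [0]              # suppose the uptake flux v1 has been measured
unknown = [1, 2, 3]         # v2, v3, v4 remain unknown
v_m = np.array([10.0])      # measured uptake rate (mmol/gDW/h)

A_m = A[:, measured]
A_c = A[:, unknown]
b = -A_m @ v_m              # right-hand side of A_c v_c = b, as in (9.3)

# Rank test for existence and uniqueness, as discussed in the next section
A_aug = np.hstack([A_c, b.reshape(-1, 1)])
print("rank(A_c) =", np.linalg.matrix_rank(A_c),
      "rank([A_c b]) =", np.linalg.matrix_rank(A_aug),
      "n_c =", A_c.shape[1])

# rank(A_c) = 2 < n_c = 3, so (9.3) has infinitely many solutions; the
# least-squares call below returns just one (the minimum-norm) flux distribution.
v_c, *_ = np.linalg.lstsq(A_c, b, rcond=None)
print("one admissible flux distribution (v2, v3, v4):", v_c)
```

Because the rank of Ac is smaller than the number of unknown fluxes in this toy case, the least-squares call returns only one of infinitely many admissible flux distributions, which is exactly the degree-of-freedom problem that classical FBA resolves with a cellular objective.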
9.2.2 Classical flux balance analysis
The objective of classical flux balance analysis (FBA) is to solve the matrix equation (9.3) for the unknown fluxes vc. Consider the augmented matrix defined by Ãc ≡ [Ac b]. At least one solution to (9.3) exists if and only if Ac and Ãc have the same matrix rank r, and the solution is unique if and only if r = nc, the dimension of the unknown flux vector vc [2]. Most stoichiometric models satisfy the existence condition but not the uniqueness condition because the number of unknown fluxes is greater than the number of balanced metabolites, rendering the matrix rank r < nc. In this case, (9.3) has infinitely many solutions corresponding to different flux distributions that satisfy the stoichiometric equations. This degree-of-freedom problem can be resolved by measuring a sufficient number of intracellular and/or transport fluxes such that the number of
unknown fluxes is equal to the number of balanced metabolites and the resulting matrix has rank r = nc. This approach typically involves carbon labeling experiments to measure intracellular fluxes [30, 34] and is not well suited for genome-scale models in which rank deficiencies of several hundred are common. More importantly, carbon labeling does not allow a priori prediction of cellular metabolism due to the necessity of collecting data for flux computation. An alternative approach, which is the focus of this chapter, is to assume the existence of a cellular objective and to solve an optimization problem in which the fluxes are distributed to maximize this objective while simultaneously satisfying the stoichiometric equations. The most common cellular objective is growth rate maximization, although other objectives such as maximal ATP production have been considered [35, 36]. Figure 9.1 depicts this computational framework for a very simple example of three calculated fluxes. Physiochemical constraints such as reaction directionality (reversible or irreversible reaction) can be represented as

v_min ≤ v ≤ v_max
(9.4)
where vmin and vmax are vectors containing known lower and upper bounds on the fluxes, respectively. When combined with the stoichiometric equations, the physiochemical constraints bound the flux solution space but typically fail to yield a unique solution for the flux distribution. In an attempt to resolve the degree-of-freedom problem, the cell is assumed to utilize available substrates to achieve maximal growth. The growth rate μ(h−1) is calculated as the weighted sum of the fluxes, with fluxes corresponding to biomass precursors (amino acids, carbohydrates, ribonucleotides, deoxyribonucleotides, lipids, sterols, phospholipids, fatty acids) weighted according to their contribution to the biomass and the remaining fluxes given weights of zero. Values of the weights w (gDW/mmol) are determined from measurement of the biomass composition and they are assumed to remain constant under different conditions, an assumption that has been challenged by experimental data [37]. The resulting optimization problem is known as a linear program (LP) because the objective function and the equality and inequality constraints are linear in the unknown fluxes [38]. A variety of numerical algorithms have been developed for rapid and robust solution of large LPs with many thousands of unknown fluxes and stoichiometric equations [39, 40]. Given a set of substrate uptake rates either measured through experiment or specified for analysis, solution of the LP yields the intracellular flux distribution in the network as well as the maximal growth rate and the transport
rates of secreted products. The solution obtained is unique unless the LP has alternative optima, in which case the same maximal growth rate can be obtained with different flux distributions [41, 42]. We will revisit this issue in Section 9.3. This approach of formulating an LP to resolve the intracellular fluxes subject to stoichiometric and physiochemical constraints is known as classical flux balance analysis (FBA).

Figure 9.1 Classical flux balance analysis for a hypothetical stoichiometric model with three unknown fluxes (vA, vB, vC), stoichiometric matrix A, physiochemical flux constraints vmin and vmax, growth rate μ, and biomass composition weights w.
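The LP just described can be prototyped in a few lines. The sketch below is a hypothetical example: the five-flux network, the biomass weight vector, and the bounds are all invented, and scipy.optimize.linprog is used as a stand-in for the commercial LP solvers referenced later in this chapter.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network: rows = balanced metabolites, columns = fluxes
# v = [glucose uptake, fermentative branch, respiratory branch, biomass, ethanol secretion]
A = np.array([
    [1.0, -1.0, -1.0,  0.0,  0.0],   # substrate node
    [0.0,  1.0,  0.0, -0.5, -1.0],   # fermentative precursor balance
    [0.0,  0.0,  1.0, -1.0,  0.0],   # respiratory precursor balance
])

# Biomass composition weights w (gDW/mmol); only the biomass flux contributes to growth
w = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

# Bounds v_min <= v <= v_max, with the glucose uptake fixed at a specified rate
v_glc = 10.0
bounds = [(v_glc, v_glc), (0, None), (0, None), (0, None), (0, None)]

# linprog minimizes, so the growth objective w.v is maximized by minimizing -w.v
res = linprog(c=-w, A_eq=A, b_eq=np.zeros(A.shape[0]), bounds=bounds, method="highs")
print("maximal growth rate (1/h):", -res.fun)
print("optimal flux distribution:", res.x)
```

The same pattern carries over unchanged to genome-scale matrices with thousands of fluxes; only the size of A, the weight vector, and the bounds change.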
9.2.3 Dynamic flux balance analysis
Classical flux balance analysis allows prediction of cellular growth and product secretion rates for fixed values of the substrate uptake rates. As a result, FBA is strictly applicable only to the balanced growth phase in batch cultures and the steady-state growth phase in continuous cultures. Dynamic flux balance analysis (DFBA) is an extension of classical flux balance analysis that accounts for cell culture dynamics and allows prediction of cellular metabolism in batch and fed-batch fermentations. The basic DFBA framework is depicted in Figure 9.2 for the case of fed-batch culture. The growth rate μ, the intracellular fluxes v, and the product secretion rates vP are computed through solution of the classical FBA problem. Rather than specifying constant substrate uptake rates, the extracellular substrate concentrations S and product concentrations P are used to calculate time-varying substrate uptake rates vS through the uptake kinetics expressions fS. The extracellular concentrations are computed by numerically solving extracellular balance equations for the liquid volume (V), biomass concentration (X), and the substrate and product concentrations given the growth and secretion rates obtained from the LP. Consequently, all the intracellular and extracellular variables are time varying. As compared to alternative dynamic cell modeling approaches, the primary advantages of DFBA are that the increasing availability of flux balance models is fully leveraged and that very little additional information is required for model construction. The principal challenges associated with DFBA are experimental determination of the substrate uptake kinetics [43–45] and numerical solution of the dynamic model [14, 16, 46]. The substrate uptake expressions typically take the form of saturation kinetics with possible terms for inhibition due to product toxicity or a competing substrate in the case of shared transporters. Two general approaches have been proposed for numerical solution of the DFBA problem. The sequential approach involves time discretization of the model
equations such that the substrate uptake kinetics, the FBA LP, and the extracellular balances can be solved separately and sequentially [13, 14, 16]. Given that large LPs can be solved rapidly and robustly, we have developed a simultaneous approach in which the uptake kinetics and the LP are embedded within the extracellular balance solution such that explicit time discretization is avoided and high-performance integration codes can be used directly [46]. We demonstrate application of the simultaneous solution approach in Section 9.3.2.

Figure 9.2 Dynamic flux balance model for a fed-batch bioreactor with substrate concentrations S, product concentrations P, biomass concentration X, reactor liquid volume V, and feed flow rate F. Here vs is the subset of fluxes for substrate uptake, vp is the subset of fluxes for product secretion, and fs is a vector function of the substrate uptake kinetics.
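A minimal sketch of the sequential DFBA strategy described above is given below. It assumes a placeholder inner_fba routine in place of the stoichiometric LP and an uptake expression of the saturation-plus-inhibition form used later in (9.5); all parameter values are illustrative, not the chapter's, and a simple forward Euler step is used for the extracellular balances (9.7)–(9.10).

```python
import numpy as np

def glucose_uptake(G, E, vg_max=20.0, Kg=2.8, Kie=200.0):
    # Saturation kinetics with ethanol inhibition, as in (9.5).
    # Illustrative parameters; concentrations in mmol/l, rate in mmol/gDW/h.
    return vg_max * G / (Kg + G) / (1.0 + E / Kie)

def inner_fba(vg, vo):
    # Placeholder for the flux balance LP: returns growth rate (1/h) and
    # ethanol secretion flux (mmol/gDW/h) at the given uptake rates.
    mu = min(0.05 * vg + 0.01 * vo, 0.45)
    ve = max(1.6 * vg - 0.6 * vo, 0.0)
    return mu, ve

# Extracellular state: V (l), X (g/l), G and E (mmol/l)
V, X, G, E = 0.5, 0.05, 55.0, 0.0
F, Gf, vo = 0.044, 550.0, 4.0         # feed rate (l/h), feed glucose (mmol/l), O2 uptake
dt, t_end = 0.05, 16.0                # Euler step and batch time (h)

for _ in np.arange(0.0, t_end, dt):
    vg = glucose_uptake(G, E)
    mu, ve = inner_fba(vg, vo)        # inner LP evaluated once per time step
    VX, VG, VE = V * X, V * G, V * E  # totals before the step
    V = V + dt * F                                        # (9.7)
    X = (VX + dt * mu * VX) / V                           # (9.8)
    G = max((VG + dt * (F * Gf - vg * VX)) / V, 0.0)      # (9.9)
    E = (VE + dt * ve * VX) / V                           # (9.10)

print(f"final state: V = {V:.2f} l, X = {X:.2f} g/l, G = {G:.1f} mmol/l, E = {E:.1f} mmol/l")
```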
9.3 Results and Interpretation

9.3.1 Stoichiometric models of S. cerevisiae metabolism
A variety of stoichiometric models of S. cerevisiae metabolism that differ according to their complexity and intended use have been presented. Three alternative models have been used in our studies to investigate the relationship between model complexity, prediction capability, and computational efficiency: (1) a small-scale pathway model that describes primary carbon metabolism and the formation of cellular biomass [31, 47]; (2) the two compartment iFF708 genome-scale model with explicit connections between annotated genes and the associated enzyme catalyzed reactions [32]; and (3) the multicompartment iND750 genome-scale model in which the metabolic reactions are localized to seven intracellular compartments [27]. We refer to the small-scale model as iGH99 with the capital letters “GH” referring to the last names of the primary authors and the number “99” referring to the number of intracellular reactions rather than the number of gene-reaction associations as used in the genome-scale models. The general characteristics of each stoichiometric model are summarized in Table 9.1. The number of fluxes reported includes both the number of intracellular reactions and the number of membrane transport fluxes (e.g., the iGH99 model has 99 intracellular reaction fluxes and 30 transport fluxes). The iGH99 model has only a single intracellular compartment and accounts for a comparatively small number of metabolites. Because the iGH99 model does not explicitly include annotated genes, we manually analyzed the reaction set to establish the gene-reaction associations necessary to implement selected metabolic engineering strategies (Section 9.3.3). In addition to dividing the intracellular reactions between the cytosol and mitochondria and including gene-reaction associations, the first generation iFF708 genome-scale model contains a much more extensive list of reactions. The primary enhancements in the second generation iND750 genome-scale model are more extensive reaction localization through inclusion of the cytosol, mitochondria,
Table 9.1 Summary of S. cerevisiae Stoichiometric Models

 | iGH99 | iFF708 | iND750
Genes | — | 708 | 750
Intracellular compartments | 1 | 2 | 7
Metabolites | 98 | 711 | 1,059
Fluxes | 129 | 1,176 | 1,264
Elementally balanced | X | X | √
Charge balanced | X | X | √
Reference | [31] | [32] | [33]
peroxisome, nucleus, endoplasmic reticulum, Golgi apparatus, and vacuole, detailed charge balancing, and full elemental mass balancing with respect to carbon and hydrogen. While stoichiometric coefficients are fixed by the biochemical reactions, the intracellular models have several adjustable parameters associated with cellular energetics and the biomass composition. Energy related parameters correspond to growth and nongrowth associated maintenance. The small-scale iGH99 model does not distinguish between the two types of maintenance and has a single lumped parameter in the biomass formation flux. The genome-scale models iFF708 and iND750 account for nongrowth associated maintenance with a separate flux where the upper and lower bounds of the flux are set to identical values (mmol ATP/gdw/h). Lumped maintenance in the small-scale model and growth associated maintenance in the genome-scale models were specified by adjusting the stoichiometry of ATP consumption in the biomass formation flux. In the original studies, the energy related parameters for each model were determined for a single metabolic state and then assumed to remain constant under different conditions. The biomass composition parameters (w) determine the relative contribution of each precursor to the biomass formation rate (μ). Despite large differences in their metabolic descriptions, all three models utilize roughly the same level of detail for the biomass precursors. In the original studies for the iGH99 and iFF708 models, the biomass composition parameters were determined for a particular metabolic state and then assumed to remain constant. The second generation genome-scale iND750 model utilizes the same biomass composition as the iFF708 model. While metabolic models are formulated and analyzed under the assumption of constant biomass composition, continuous culture experiments with S. cerevisiae have shown that the relative contributions of proteins and carbohydrates to the biomass composition change significantly with varying dilution rate [37]. Known variations in biomass composition at different culture conditions can be incorporated by manipulating the stoichiometric coefficients of the precursors in the biomass formation rate, but the precursor coefficients must be collectively adjusted so that total biomass is conserved [48]. We have applied classical FBA to the iND750 model to investigate steady-state growth and ethanol production characteristics. The MATLAB interface to the LP code MOSEK was used to solve the linear program given a set of glucose and oxygen uptake rates. Figure 9.3 shows the growth rate and ethanol production rate surfaces obtained when the LP was repeatedly solved over a representative grid of glucose and oxygen uptake rates. Although alternative optimal solutions are a well-known problem with FBA [49], we did not encounter this issue in these computations. The surfaces are nontrivial functions of the substrate uptake rates, as demonstrated by the maximum in the ethanol production rate for microaerobic growth conditions. More complicated behavior is observed when additional metabolic products are considered, as demonstrated by the existence of seven metabolic phenotypes in the iFF708 genome-scale model [50]. Figure 9.3 suggests that the development of a comparable unstructured model [17] would minimally require the specification of different metabolic parameters for anaerobic, microaerobic, and aerobic growth. 
While conceptually possible, the development of unstructured models that attempt to reproduce multiple metabolic phenotypes quickly becomes unwieldy. Given the development of the highly efficient dynamic simulation
and optimization techniques discussed in this chapter, there is little motivation to develop unstructured models when a detailed stoichiometric model is available.

Figure 9.3 Surfaces for optimal growth rate (μ) and ethanol production rate (ve) obtained by repeatedly solving the iND750 stoichiometric model over a range of glucose (vg) and oxygen (vo) uptake rates.
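The surfaces in Figure 9.3 are produced by nothing more than repeated LP solutions over a grid of fixed uptake rates. The sketch below illustrates that sweep with an invented toy stoichiometry; scipy.optimize.linprog again stands in for the MOSEK interface mentioned above, and infeasible uptake combinations are simply recorded as NaN.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy stoichiometry (rows = metabolites, cols = fluxes):
# v = [glucose uptake, oxygen uptake, respiration, fermentation, biomass, ethanol secretion]
A = np.array([
    [1.0,  0.0, -1.0, -1.0,  0.0,  0.0],
    [0.0,  1.0, -2.0,  0.0,  0.0,  0.0],
    [0.0,  0.0,  2.0,  0.5, -1.0,  0.0],
    [0.0,  0.0,  0.0,  1.0,  0.0, -1.0],
])
w = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0])     # biomass weights

def max_growth(vg, vo):
    bounds = [(vg, vg), (vo, vo)] + [(0, None)] * 4
    res = linprog(-w, A_eq=A, b_eq=np.zeros(4), bounds=bounds, method="highs")
    if not res.success:
        return np.nan, np.nan
    return -res.fun, res.x[5]                    # growth rate and ethanol secretion

glucose = np.linspace(0.0, 20.0, 21)
oxygen = np.linspace(0.0, 10.0, 11)
growth = np.zeros((len(oxygen), len(glucose)))
ethanol = np.zeros_like(growth)
for i, vo in enumerate(oxygen):
    for j, vg in enumerate(glucose):
        growth[i, j], ethanol[i, j] = max_growth(vg, vo)

print("maximum growth over the grid:", np.nanmax(growth))
print("maximum ethanol secretion over the grid:", np.nanmax(ethanol))
```

Even this crude toy reproduces the qualitative feature highlighted in the text: ethanol secretion is largest at low oxygen uptake, while growth benefits from respiration.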
9.3.2 Dynamic simulation of fed-batch cultures
Using the stoichiometric models described in Section 9.3.1, we have developed dynamic flux balance models for combined aerobic and anaerobic fed-batch growth of S. cerevisiae. Each model consists of intracellular steady-state flux balances coupled to dynamic extracellular mass balances through kinetic uptake expressions for the two substrates (glucose and oxygen). The linear program used to resolve the underdetermined flux balances was formulated as shown in Figure 9.2. The uptake kinetics for glucose (vg) and oxygen (vo) were modeled as

v_g = v_{g,max} \frac{G}{K_g + G} \cdot \frac{1}{1 + E/K_{ie}}

(9.5)

v_o = v_{o,max} \frac{O}{K_o + O}

(9.6)
where G and O are the glucose and dissolved oxygen concentrations, respectively, Kg and Ko are saturation constants, vg,max and vo,max are maximum uptake rates, and Kie is an inhibition constant. The glucose uptake rate follows Michaelis-Menten kinetics with an additional regulatory term to capture growth rate suppression due to high ethanol concentrations [15]. Ethanol uptake was excluded from the model because ethanol consumption is oxidative and only experimentally observed when glucose is nearly exhausted [21], conditions which do not occur in these simulations. The dynamic mass balances on the extracellular environment were posed as

\frac{dV}{dt} = F

(9.7)

\frac{d(VX)}{dt} = \mu V X

(9.8)

\frac{d(VG)}{dt} = F G_f - v_g V X

(9.9)

\frac{d(VE)}{dt} = v_e V X

(9.10)
where V is the liquid volume, X is the biomass concentration, and Gf is the glucose feed concentration, and F is the feed flow rate. The growth rate (μ) and the ethanol secretion flux (ve) were resolved by solution of the inner flux balance model. Although not shown here, analogous equations can be posed for other metabolic byproducts such as glycerol. The dissolved oxygen concentration was treated as an input variable under the assumption that its dynamic profile could be tracked by a suitably designed feedback controller. This simplification was deemed reasonable because anaerobic conditions were used to promote ethanol production during later stages of the batch when high cell densities might limit oxygen mass transfer. Consequently, extracellular oxygen balances were omitted and the dissolved oxygen concentration was simply represented as the percent of saturation, DO = O/Osat, where Osat is the saturation concentration. Nominal model parameter values are listed in Table 9.2. Literature values for a wild-type yeast strain [51] were used for the glucose (vg,max, Kg) and oxygen (vo,max, Ko) uptake kinetic parameters. The glucose inhibition constant with respect to ethanol (Kie) was chosen to give reasonable predictions of experimentally observed glucose, biomass, and ethanol profiles in batch culture with glucose media [21]. The saturation oxygen concentration (Osat) was determined from Henry’s law at 1.0 atm and 30°C. The initial and final liquid volumes (V0, Vf), initial glucose (G0) and biomass (X0) concentrations, feed flow rate (F), glucose feed concentration (Gf), and final batch time (tf) were chosen as representative values for a bench-scale bioreactor available in our laboratory. Dynamic simulations were performed in MATLAB using the code ode23 to integrate the extracellular mass balance equations. The inner linear program was evaluated inside the integration routine along with the dynamic equations using the MATLAB interface
Table 9.2 Parameter Values for Glucose Media Dynamic Simulation

Variable | Value | Reference
vg,max | 20 mmol/gdw/h | [51]
Kg | 0.5 g/l | [51]
vo,max | 8 mmol/gdw/h | [51]
Ko | 0.003 mmol/l | [51]
Osat | 0.30 mmol/l | —
Kie | 10 g/l | —
V0 | 0.5 l | —
G0 | 10.0 g/l | —
X0 | 0.05 g/l | —
F | 0.044 l/h | —
Gf | 100 g/l | —
tf | 16.0 hours | —
Vf | 1.2 l | —
to the linear program (LP) code MOSEK. A possible problem with FBA is the presence of multiple optimal solutions, which implies the existence of an infinite number of different flux distributions that produce the same optimal growth rate [42]. Multiple optimal solutions with respect to the ethanol secretion rate were handled by first solving the LP for the maximum growth rate, and then by fixing the growth rate at this maximum value and resolving the LP for maximum ethanol secretion. This approach allowed variability in the ethanol production rate as a result of multiple optima to be eliminated by selecting the theoretical maximum ethanol production with respect to the maximal growth rate. Figure 9.4 shows the results of a fed-batch simulation with the second generation iND750 genome-scale stoichiometric model. The glucose feed flow rate was held constant during the batch, while a switch from aerobic (50% DO) to anaerobic (0% DO) growth was implemented at 7.7 hours. A rapid increase in the biomass concentration was observed under aerobic growth conditions. The switch to anaerobic growth resulted in a substantially increased ethanol production rate at the expense of biomass production. The switching time (ts) was chosen such that the glucose was nearly exhausted by the end of the batch. Competition between the byproducts glycerol and ethanol was observed after the switch to anaerobic conditions. Although not shown here, similar results were obtained with the iGH99 small-scale pathway model and the first generation iFF708 genome-scale model. Given that the computation time for the dynamic simulation with the iND750 model was only 9 seconds on a 3.0 GHz Pentium IV workstation, there was little motivation to utilize a simpler and more computationally efficient stoichiometric model.
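The two-stage treatment of alternate optima described above (maximize growth first, then fix the growth rate and maximize ethanol secretion) maps directly onto two successive LP solves. The sketch below is a hypothetical illustration: the small network, the flux indices, and the bounds are invented, and scipy.optimize.linprog is used as a stand-in for MOSEK.

```python
import numpy as np
from scipy.optimize import linprog

def lexicographic_fba(A, bounds, i_biomass, i_ethanol, tol=1e-9):
    """Maximize growth, then maximize ethanol secretion at the fixed optimal growth."""
    n = A.shape[1]
    b_eq = np.zeros(A.shape[0])

    # Stage 1: maximize the biomass flux
    c1 = np.zeros(n)
    c1[i_biomass] = -1.0
    res1 = linprog(c1, A_eq=A, b_eq=b_eq, bounds=bounds, method="highs")
    mu_max = res1.x[i_biomass]

    # Stage 2: pin growth at its maximum (within a tolerance) and maximize ethanol
    bounds2 = list(bounds)
    bounds2[i_biomass] = (mu_max - tol, mu_max + tol)
    c2 = np.zeros(n)
    c2[i_ethanol] = -1.0
    res2 = linprog(c2, A_eq=A, b_eq=b_eq, bounds=bounds2, method="highs")
    return mu_max, res2.x[i_ethanol], res2.x

# Toy network: a substrate node feeding a capacity-limited biomass branch, an ethanol
# branch, and a byproduct branch, so the ethanol flux is degenerate at maximal growth.
A = np.array([
    [1.0, -1.0, -1.0, -1.0,  0.0,  0.0,  0.0],
    [0.0,  1.0,  0.0,  0.0, -1.0,  0.0,  0.0],
    [0.0,  0.0,  1.0,  0.0,  0.0, -1.0,  0.0],
    [0.0,  0.0,  0.0,  1.0,  0.0,  0.0, -1.0],
])
bounds = [(5.0, 5.0), (0, 2.0)] + [(0, None)] * 5
mu, ve, v = lexicographic_fba(A, bounds, i_biomass=4, i_ethanol=5)
print(f"growth rate = {mu:.3f}, ethanol secretion = {ve:.3f}")
```

In the toy network the substrate left over after the growth-limited branch can be routed to either the ethanol or the byproduct branch, so it is the second stage that pins down the reported ethanol secretion rate.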
Figure 9.4 Fed-batch simulation profiles for the iND750 stoichiometric model with constant glucose feed rate and a switch in the dissolved oxygen concentration from 50% DO to 0% DO at 7.7 hours, indicated by the vertical line.

9.3.3 Dynamic optimization of fed-batch cultures

The primary operational challenges associated with fed-batch cultures are the determination of the initial substrate concentrations and liquid volume, the feeding policies of
the substrates throughout the batch, and the final batch time. Fed-batch performance can be highly sensitive to these variables due to their complex effects on cellular metabolism. Therefore, model-based optimization is an essential tool for determining fed-batch operating strategies. The transient nature of fed-batch fermentation requires that the optimal operating policy be determined by solving a dynamic optimization problem in which a final time objective (e.g., productivity) is maximized subject to constraints imposed by dynamic model equations [52, 53]. The formulation and solution of dynamic optimization problems for maximizing ethanol productivity in fed-batch S. cerevisiae fermentations have been extensively investigated [54–59]. These studies were based on simple unstructured models with phenomenological descriptions of cell growth and constant yield coefficients. Unstructured models cannot be expected to provide accurate predictions over the wide range of transient conditions observed in fed-batch culture. We have developed dynamic optimization techniques for fed-batch fermentation based on dynamic flux balance models [46]. Our work has focused primarily on the small-scale iGH99 stoichiometric model due to the computationally intensive nature of the optimization problem. The oxygen uptake rate (vo) was modeled as in (9.6), while the glucose uptake rate (vg) followed Michaelis-Menten kinetics with additional inhibitory terms to capture regulatory effects due to high glucose and ethanol concentrations

v_g = v_{g,max} \frac{G}{K_g + G + G^2/K_{ig}} \cdot \frac{1}{1 + E/K_{ie}}

(9.11)
where Kig and Kie are inhibition constants that restrict glucose uptake in the presence of high glucose or ethanol concentrations, respectively. Model parameter values used for fed-batch optimization are listed in Tables 9.2 and 9.3. The value of the inhibition constant Kig was chosen to yield reasonable model predictions. A constant glucose feed concentration Gf was used to avoid the trivial situation where the maximum concentration is always selected by the optimizer. A maximum DO concentration DOmax less than 100% was assumed to account for possible oxygen mass transfer limitations during later stages of the batch due to high biomass concentrations. The model parameter tss represents the time required for equipment maintenance between fed-batch runs. The objective function maximized was the weighted sum of the ethanol productivity and the ethanol yield on glucose. This dual objective allowed the tradeoff between high production rates and efficient substrate usage to be examined. The initial volume V(0) and glucose concentration G(0), the feed flow rate F(t) and dissolved oxygen concentration DO(t) profiles, and the final batch time tf were treated as decision variables. Therefore, the dynamic optimization problem had the form:
\max_{V(0),\, G(0),\, F(t),\, DO(t),\, t_f} \; c_p P(t_f) + c_y Y(t_f)

subject to:
    extracellular balances (9.7)–(9.10)
    uptake kinetics (9.6) and (9.11)
    flux balance LP
    V(0) ≥ 0.5 l,  V(t_f) ≤ 1.2 l
    0 ≤ G(0) ≤ G_f
    0 ≤ DO(t) ≤ DO_max
    F(t) ≥ 0 l/h
    1 h ≤ t_f ≤ 36 h
    X(t), G(t), E(t) ≥ 0 g/l

(9.12)

The ethanol productivity P and the ethanol yield on glucose Y at the final batch time were defined as

P(t_f) = \frac{V(t_f)\, E(t_f)}{t_f + t_{ss}}

(9.13)

Y(t_f) = \frac{V(t_f)\, E(t_f)}{V(0)\, G(0) + \int_0^{t_f} G_f\, F(t)\, dt}

(9.14)

Table 9.3 Additional Model Parameter Values for Dynamic Optimization

Variable | Value
Kig | 10 g/l
Gf | 50 g/l
tss | 6 hours
Osat | 2.53 × 10−4 mol/l
DOmax | 50%
The parameters cp and cy are weights for the productivity and yield objectives, respectively. The bounds on the state variables X(t), G(t), and E(t) were specified to ensure a physically realistic solution. The bounds on the initial and final volumes were chosen for consistency with our experimental system. Lower and upper bounds on the final batch time were included to confine the solution space, but they had no effect on the optimal solutions generated. A number of computational algorithms have been proposed to solve general dynamic optimization problems [60]. Sequential solution methods involve repeated iterations between a dynamic simulation code that integrates the model equations given a candidate feeding policy and a nonlinear programming code that determines an improved feeding policy given the dynamic simulation results. Simultaneous solution methods based on temporal discretization of the dynamic model equations have proven to be more effective due to their ability to handle state dependent constraints and their applicability to large optimal control problems [61, 62]. Our attempts to solve the dynamic optimization problem (9.12) using a sequential method [16] proved unsuccessful due to problem complexity. Therefore, we employed a simultaneous solution method in which the bilevel dynamic optimization problem (9.12) was reformulated as a single level nonlinear program with only algebraic constraints. The procedure required temporal discretization of the extracellular balances (9.7)–(9.10) and replacement of the inner LP with its associated first-order optimality conditions to generate complementarity constraints [63]. Discretization was performed with Radau collocation on finite elements using a monomial basis representation [61, 64] with 61 finite elements and two internal collocation points per element for a total of 184 discretization points. The linear program was enforced only at the beginning of each finite element to reduce the overall problem size. The decision variables DO(t) and F(t) were restricted to change only at the element boundaries.
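Reproducing the simultaneous collocation formulation described above would require a full algebraic modeling environment, but the structure of the (less effective) sequential alternative mentioned in the same paragraph can be sketched compactly. In the hypothetical fragment below, the feed profile is parameterized by a few piecewise-constant levels, simulate_fed_batch is a placeholder surrogate for integration of the dynamic flux balance model, and scipy.optimize.minimize plays the role of the outer nonlinear programming code; none of the names or numbers come from the chapter.

```python
import numpy as np
from scipy.optimize import minimize

def simulate_fed_batch(V0, G0, feed_levels, tf):
    # Placeholder for integration of the dynamic flux balance model (9.7)-(9.10):
    # returns (ethanol productivity P, ethanol yield Y) for the candidate policy.
    # A smooth surrogate is used here purely so the sketch runs end to end.
    feed = float(np.mean(feed_levels))
    P = (G0 * feed * tf) / (1.0 + 0.05 * G0 ** 2 + 5.0 * feed ** 2) / (tf + 6.0)
    Y = 0.45 * P / (P + 0.2)
    return P, Y

def objective(z, cp=1.0, cy=0.0):
    V0, G0, tf = z[0], z[1], z[2]
    P, Y = simulate_fed_batch(V0, G0, z[3:], tf)
    return -(cp * P + cy * Y)          # minimize the negative weighted objective

# Decision variables: V(0), G(0), t_f, and four piecewise-constant feed levels
z0 = np.array([0.5, 10.0, 16.0, 0.05, 0.05, 0.05, 0.05])
bnds = [(0.5, 1.2), (0.0, 50.0), (1.0, 36.0)] + [(0.0, 0.2)] * 4
res = minimize(objective, z0, bounds=bnds, method="L-BFGS-B")
print("optimal decision variables:", np.round(res.x, 3))
print("optimal weighted objective:", -res.fun)
```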
The dynamic optimization problem was solved through the AMPL interface to the nonlinear program solver CONOPT. AMPL is a mathematical programming language that provides analytic Jacobian and Hessian information to the solver through integrated automatic differentiation [65]. CONOPT is a feasible path, multimethod nonlinear program solver based on the generalized reduced gradient method [66]. The optimization problem consisted of 36,049 decision variables and 30,496 algebraic constraints. The computation time necessary to obtain a converged solution varied from 126 to 221 seconds depending on the initialization and objective function used. Subsequent solutions for small changes in the objective function or constraints required only a small fraction of the initial computation time. All computations were performed on a 3.0 GHz Pentium 4 CPU. Solution of the full dynamic optimization problem (9.12) with a larger stoichiometric model such as the iFF708 or iND750 genome-scale network has proven more challenging and is currently under development. We have used the iND750 model to solve a simpler batch optimization problem for the optimal aerobic-anaerobic switching time within a metabolic engineering context (see Section 9.3.4). Figure 9.5 shows the optimal control profiles for the feed flow rate and the dissolved oxygen generated from solution of the dynamic optimization problem for maximization of ethanol productivity (cy = 0 in (9.12)). The calculated optimal state profiles and the simulated profiles obtained from direct simulation of the optimal control profiles also are displayed. Slight differences between the optimal and simulated profiles originated from the approximation of constant fluxes across finite elements used in the optimization problem. The optimal control policy produced an initial glucose concentration
Figure 9.5 Optimal glucose feed (top inset) and dissolved oxygen (bottom inset) profiles for dynamic optimization of ethanol productivity with the iGH99 stoichiometric model and the corresponding simulated and optimal profiles of the biomass, glucose, and ethanol concentrations.
(14.6 g/l) well below its upper bound and no initial glucose feed. The glucose concentration declined until feeding began at t = 7.0 hours. Then the glucose feed flow rate increased over time such that the glucose concentration remained approximately constant until the final volume constraint was encountered at t = 13.4 hours. Analysis of (9.11) revealed that this constant glucose concentration resided very close to the relatively flat maximum in the glucose uptake rate. A sudden switch in the dissolved oxygen from the initial maximum to a final value near zero was observed at t = 8.4 hours. This switch separated an initial aerobic phase of high cell growth from a subsequent microaerobic phase of high ethanol production. The dynamic optimization problem formulated in (9.12) contains a dual objective for ethanol productivity and ethanol yield. A parametric sensitivity analysis was performed to examine the trade-off between these competing objectives. The analysis involved repeated solution of the dynamic optimization problem with a constant value of the productivity weight (cp = 0.81−1) and a wide range of values for the yield weight (0 ≤ cy ≤ 60). A nonzero value of the productivity weight was used to avoid solutions lying at the maximum bound of the final time constraint and exhibiting a dramatic decline in the productivity for an insignificant increase in the yield. Therefore, this strategy produced optimal policies for maximization of ethanol yield where the overall productivity loss was minimized. Figure 9.6 shows that increasing yields were achieved at the expense of decreasing productivities and longer batch times. The productivity versus yield curve represents the locus of achievable optima for the dual objective where the entire area above the curve is unachievable. During calculation of the yield-productivity trade-off curve, the ethanol yield eventually saturated with respect to increasing values of the yield weight (cy). This trend indicated that the yield was at its overall maximum and the productivity was at its maximum with respect to this yield. Figure 9.7 shows the optimal feeding policy that generated this point (circle in Figure 9.6). The calculated optimal state profiles and the simulated profiles obtained from direct simulation of the optimal control profiles are also shown. The results obtained were markedly different from the maximum productivity
Figure 9.6 Trade-off between ethanol productivity and ethanol yield on glucose (left), and the relationship between ethanol yield and the batch time (right) obtained from dynamic optimization with the iGH99 stoichiometric model. The square and the circle correspond to the optimization results shown in Figures 9.5 and 9.7, respectively.
Figure 9.7 Optimal glucose feed (top inset) and dissolved oxygen (bottom inset) profiles obtained from dynamic optimization with the iGH99 stoichiometric model for a combined yield-productivity objective where yield was most heavily weighted, and the corresponding simulated and optimal profiles of the biomass, glucose, and ethanol concentrations.
results (Figure 9.5). While the glucose concentration decreased until feeding began such that a relatively constant glucose concentration was maintained, the combined objective produced a lower initial glucose concentration to increase yield and earlier glucose feeding to achieve the glucose concentration that maximized uptake. The dissolved oxygen concentration profile showed that microaerobic growth conditions were utilized throughout the batch.
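The trade-off curve of Figure 9.6 is traced by re-solving the optimization for a sequence of yield weights cy at fixed cp. The fragment below shows only that outer sweep; the inner optimization is collapsed to a single decision variable acting on an invented smooth surrogate for productivity and yield, purely so the loop structure is visible and executable.

```python
from scipy.optimize import minimize_scalar

def inner_optimum(cy, cp=1.0):
    # Stand-in for the full dynamic optimization (9.12): one feed-level decision
    # variable against an invented smooth surrogate for productivity P and yield Y.
    def P(f):
        return f / (1.0 + 8.0 * f ** 2)
    def Y(f):
        return 0.45 / (1.0 + 4.0 * f)
    res = minimize_scalar(lambda f: -(cp * P(f) + cy * Y(f)),
                          bounds=(0.0, 1.0), method="bounded")
    return P(res.x), Y(res.x)

for cy in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
    P_opt, Y_opt = inner_optimum(cy)
    print(f"cy = {cy:5.1f}:  productivity = {P_opt:.3f},  yield = {Y_opt:.3f}")
```

Even with the surrogate, increasing cy shifts the optimum toward lower productivity and higher yield, mirroring the shape of the trade-off curve discussed above.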
9.3.4 Identification of ethanol overproduction mutants
The production of ethanol from recombinant S. cerevisiae strains has received considerable attention for renewable liquid fuel applications. A recent study [9] revealed novel metabolic engineering targets for improved ethanol production from glucose media based on classical FBA of the iFF708 stoichiometric model. In addition to being limited to steady-state culture conditions, this computational analysis failed to explicitly address the synergistic effects of the biomass and ethanol yields. While all the metabolic engineering strategies considered increased both these yields through modification of the cellular redox balance, the most favorable strategy for enhanced ethanol productivity could not be identified due to the complex relation between these two yields and total ethanol production. Moreover, regulatory processes such as ethanol inhibition of growth that are active under dynamic culture conditions can lead to alternative metabolic engineering strategies that are not easily identifiable from steady-state analysis. We have utilized the iND750 dynamic flux balance model (Section 9.3.2) to develop computational techniques for identifying promising S. cerevisiae mutants for ethanol production in fed-batch culture [67]. The ethanol productivity was defined as the overall rate of ethanol production from the batch:

\frac{(VE)\big|_{t=t_f}}{t_f}

(9.15)
where tf denotes the final batch time. Dynamic optimization of fed-batch ethanol productivity was performed with the switching time between partially aerobic (50% DO; hereafter referred to as aerobic) and anaerobic (0% DO) conditions treated as the only decision variable to simplify the optimization problem. Additionally, a constant feed flow rate and feed glucose concentration, fixed initial conditions, and a fixed final batch time were utilized. The resulting single variable optimization problem was solved with the MATLAB optimal search function fminsearch. We applied steady-state and dynamic FBA to 10 metabolic engineering strategies that included eight gene insertions and two combination gene insertion/overexpression strategies that were previously predicted to enhance anaerobic biomass and ethanol yields when classical FBA was applied to the iFF708 stoichiometric model [9]. The steady-state analysis was performed with the iND750 model to provide a consistent basis for comparing our DFBA results. The 10 strategies were implemented by the addition of reactions to the metabolic network for gene insertions, by the removal of reactions for gene deletions, and by the removal of bound constraints from reactions for gene overexpressions as discussed in [9]. Each inserted reaction was charge and elementally balanced for consistency with the iND750 model. Table 9.4 shows the FBA results obtained with the iND750 model, where the two combination gene insertion/overexpression strategies are denoted Δgdh1 glt1 gln1 and Δgdh1 gdh2 according to the genes manipulated and labels for the eight gene insertions correspond to reaction entries in the KEGG LIGAND database (http://www.genome.jp/). As shown previously with the iFF708 model, all ten manipulations produced enhanced ethanol and growth yields for steady-state anaerobic growth. Under aerobic conditions, only one manipulation generated ethanol and biomass yields that differed from the wild type. Enhanced aerobic ethanol production at the expense of reduced growth was predicted for this strategy. In the original study [9], anaerobic yield enhancements of 4.2–10.4% for ethanol and 5.2–16.5% for biomass were predicted for the 10 manipulations. We found significantly reduced ethanol yield enhancements of 3.4–6.1%. We believe that the additional compartmentalization and full charge balancing of the iND750 model used in our study were the primary causes of the discrepancies with the iFF708 model used in the original study. The lower ethanol yield prediction from the iND750 model was more consistent with experimental data for the R01058 mutant, but both models overpredicted the experimentally observed R01058 growth rate [9]. The results in Table 9.4 demonstrate a notable shortcoming of classical FBA. Given different relative enhancements in anaerobic ethanol and biomass yields, the preferred in silico manipulation for anaerobic ethanol production cannot be directly determined. A similar difficulty is encountered for aerobic growth, where the impact of increased ethanol and decreased biomass yields for
Table 9.4 Steady-State FBA for S. cerevisiae Mutants in Glucose Media

Anaerobic (glucose uptake vg = 5.0 mmol g−1 h−1; oxygen uptake vo = 0.0 mmol g−1 h−1). Wild type: growth rate μ = 0.085 h−1; ethanol yield = 0.424 g/g.

Strategy | Ethanol Yield Increase (%) | Biomass Yield Increase (%) | Reaction Flux (mmol/g/h)
Deletion of gdh1 and overexpression of glt1 and gln1 (Δgdh1 glt1 gln1) | 3.4 | 5.4 | —
Deletion of gdh1 and overexpression of gdh2 (Δgdh1 gdh2) | 3.7 | 11.0 | —
Insertion of NAD-dependent glycine dehydrogenase (R00365) | 3.8 | 18.0 | 1.18
Insertion of NADP-dependent orotate reductase (R01866) | 3.9 | 18.1 | 1.20
Insertion of a transhydrogenase (R00112) | 3.8 | 18.0 | 1.18
Insertion of NADH kinase (R00105) | 6.1 | 5.6 | 0.87
Insertion of NADP-dependent glycerol dehydrogenase (R01039) | 6.1 | 5.6 | 0.88
Insertion of NADP-dependent glycerol 3-phosphate dehydrogenase (R00845) | 3.8 | 18.0 | 1.18
Insertion of NADP-dependent glyceraldehyde-3-phosphate dehydrogenase (R01063) | 3.8 | 18.0 | 1.18
Insertion of nonphosphorylating NADP-dependent glyceraldehyde-3-phosphate dehydrogenase (R01058) | 6.1 | 5.6 | 0.87

Aerobic† (glucose uptake vg = 5.0 mmol g−1 h−1; oxygen uptake vo = 7.84 mmol g−1 h−1). Wild type: growth rate μ = 0.339 h−1; ethanol yield = 0.166 g/g.

Deletion of gdh1 and overexpression of glt1 and gln1 (Δgdh1 glt1 gln1) | 9.5 | −7.4 | —

†Aerobic yields differed from the wild type only for the single strategy reported.
the Δgdh1 glt1 gln1 mutant cannot be quantitatively compared to the wild type with respect to total ethanol production. Consequently, the preferred manipulation for fed-batch ethanol production in which an aerobic growth phase is followed by an anaerobic growth phase cannot be determined without further analysis. The ethanol productivity represents a single measure of fed-batch performance that explicitly incorporates the tradeoff between possibly time-varying ethanol and biomass yields throughout the batch. DFBA results for the sensitivity of the ethanol productivity to the aerobic-anaerobic switching time in fed-batch culture are shown in Figure 9.8. These results were generated for each manipulation strategy by repeated fed-batch simulation with different switching times. The maximal productivities, shown as peaks and checked by solving a single variable optimization problem with the switching time as the decision variable, were used to produce an explicit ranking of the manipulation strategies: (1) R00105/R01039/R01058; (2) R00365/R01866/R00112/R00845/R01063; (3) Δgdh1 gdh2; (4) Δgdh1 glt1 gln1; and (5) wild type. The vertical dotted line at the optimal productivity for the wild type demonstrates that the manipulation strategies have different optimal switching times. Consequently, optimal performance is dependent both on the metabolic engineering strategy and the fed-batch operating policy. This result suggests that attempts to separately optimize the cellular design and the fermentation conditions are likely to produce suboptimal performance. We also assembled a library of 357 gene insertion candidates from the KEGG LIGAND database in an attempt to uncover novel metabolic engineering strategies for ethanol overproduction in fed-batch culture. Only reactions involving species present in the cytosol of the iND750 model were considered. We were able to match 517 of the 575 iND750 cytosolic species to compounds in the KEGG database, and 788 reactions involved only these matched species. The iND750 metabolic network already included 431 of these reactions, yielding a reduced set of 357 reactions corresponding to potential gene insertions. All reactions were assumed reversible unless available experimental data
Figure 9.8 Sensitivity of the ethanol productivity to the aerobic-anaerobic switching time (ts) in fed-batch culture predicted with the iND750 stoichiometric model. The dotted line indicates the optimal switching time for the wild type strain.
or other genome-scale models [1] suggested otherwise. The reactions extracted from the KEGG database were charge and elementally balanced for consistency with the iND750 model. The fed-batch performance of each candidate insertion was assessed by optimizing the aerobic-anaerobic switching time to determine maximal ethanol productivity. Figure 9.9 shows the dynamic screening results where the insertions are labeled by their entries in the KEGG LIGAND database and the eight insertions suggested in [9] are indicated by white bars. In addition to the eight previously analyzed insertions, DFBA identified 21 new insertion strategies with productivity enhancements greater than 3% over the wild type value. The insertions could be grouped into three sets, each with the same aerobic-anaerobic switching time and very similar productivities. The switching time varied only slightly between these three groups. The two new candidate insertions with the highest productivities correspond to expression of a NADP-specific 1-pyrroline-5-carboxylate dehydrogenase (R00708) and a NADP-malic enzyme (R00216). A NAD-specific 1-pyrroline-5-carboxylate dehydrogenase and the same NADP-malic enzyme were already expressed in the mitochondria of the iND750 model, so identification of these cytosolic insertions required a compartmentalized metabolic network model. Both of the proposed insertions maintain a favorable redox balance for ethanol production by generating NADPH, and therefore they represent similar design alternatives to those previously proposed [9].
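The single-variable switching-time searches used throughout this section map directly onto a bounded scalar optimization (the authors used MATLAB's fminsearch; scipy.optimize.minimize_scalar plays the same role below). The ethanol_productivity function here is a placeholder surrogate with an interior optimum; in practice it would run the fed-batch dynamic flux balance simulation of Section 9.3.2 with the aerobic-anaerobic switch at ts and return the final-time productivity.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ethanol_productivity(ts, tf=16.0):
    # Placeholder surrogate: long aerobic phases build biomass but leave little
    # time for anaerobic ethanol formation, giving an interior optimum in ts.
    biomass = 1.0 - np.exp(-0.3 * ts)
    anaerobic_time = max(tf - ts, 0.0)
    return biomass * (1.0 - np.exp(-0.2 * anaerobic_time))

res = minimize_scalar(lambda ts: -ethanol_productivity(ts), bounds=(0.0, 16.0), method="bounded")
print(f"optimal switching time: {res.x:.2f} h, productivity (arbitrary units): {-res.fun:.3f}")
```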
9.3.5 Exploration of novel metabolic capabilities
The computational studies described in Sections 9.3.1 to 9.3.4 illustrate that DFBA can be used to analyze and engineer the ethanol production capabilities of S. cerevisiae in glucose media. Because glucose is a preferred substrate and ethanol is a naturally
Figure 9.9 Dynamic screening of a gene insertion library derived from the KEGG database for optimal fed-batch ethanol productivity predicted with the iND750 stoichiometric model. Results are presented as percentage increase in the ethanol productivity relative to the wild-type strain. Insertions proposed by [9] are shown as white bars. The number indicated to the right of each bar indicates the optimal aerobic-anaerobic switching time.
secreted metabolite, these studies demonstrate the in silico enhancement of native metabolic capabilities. DFBA also can be used to investigate the engineering of novel metabolic capabilities such as the consumption of new substrates and the production of nonnative metabolic products. The wild-type stoichiometric model is expanded by including the intracellular reactions required to describe new substrate metabolism and/or metabolite synthesis, while the extracellular model is augmented with the corresponding substrate kinetics and extracellular mass balances. The resulting dynamic flux balance model can be used to perform in silico studies of novel substrate utilization and/or product formation behavior to guide experimental efforts. Genetic engineering of xylose fermenting S. cerevisiae strains that can grow on media derived from agricultural products is important for the production of renewable liquid fuels [68–71]. We have developed a dynamic flux balance model that describes S. cerevisiae growth and ethanol production on glucose/xylose substrate mixtures [67]. The dynamic model consists of the iND750 stoichiometric model coupled to dynamic
extracellular mass balances through uptake expressions for the three possible substrates (glucose, xylose, and oxygen). The publicly available iND750 model [27] includes a mostly complete description of xylose metabolism such that the associated pathways become active only when a xylose uptake rate is specified. The modification needed for simulation of recombinant xylose utilizing strains was the insertion of the reverse reaction for xylitol dehydrogenase, which increased the number of fluxes to 1,265 for mixed-substrate studies (see Table 9.1). The uptake kinetics for glucose (vg) and oxygen (vo) were modeled as in (9.5) and (9.6), respectively, while the xylose uptake (vz) was chosen as

v_z = v_{z,max} \frac{Z}{K_z + Z} \cdot \frac{1}{1 + E/K_{ie}} \cdot \frac{1}{1 + G/K_{ig}}

(9.16)
where Z is the extracellular xylose concentration, Kz is a saturation constant, vz,max is the maximum uptake rate, and Kie and Kig are inhibition constants. The xylose uptake follows Michaelis-Menten kinetics with additional regulatory terms to capture growth rate suppression due to high ethanol concentrations [15] and inhibited xylose metabolism in the presence of the preferred substrate glucose [71]. For fed-batch operation, the dynamic mass balances on the extracellular environment were posed as in (9.7) to (9.10) with an additional equation for xylose:

\frac{d(VZ)}{dt} = F Z_f - v_z V X
(9.17)
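Equation (9.16) is a Michaelis-Menten term multiplied by two inhibition factors, one for ethanol and one for glucose repression, and it enters the xylose balance (9.17) through vz. A direct transcription into a small function is shown below; the default parameter values follow the Table 9.5 values as interpreted here and should be treated as illustrative rather than authoritative.

```python
def xylose_uptake(Z, E, G, vz_max=32.0, Kz=14.85, Kie=10.0, Kig=0.5):
    """Xylose uptake rate per (9.16): saturation kinetics with ethanol inhibition
    and glucose repression. Concentrations in g/l, rate in mmol/gDW/h."""
    return vz_max * Z / (Kz + Z) / (1.0 + E / Kie) / (1.0 + G / Kig)

# Example: 5 g/l of residual glucose represses xylose uptake by roughly an order of magnitude
print(xylose_uptake(Z=5.0, E=0.0, G=0.0))
print(xylose_uptake(Z=5.0, E=0.0, G=5.0))
```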
Although not shown here, analogous equations were posed for other key metabolic byproducts (glycerol and xylitol). Table 9.5 lists parameter values used for dynamic simulation of the xylose utilizing recombinant S. cerevisiae strain RWB 218, including experimentally derived glucose and xylose uptake kinetic parameters [71]. Literature values for a wild-type S. cerevisiae strain [51] were used for the oxygen uptake kinetic parameters. The fermenter operating conditions were chosen as representative values for our experimental system with equal concentrations of glucose and xylose in the media. Glucose and xylose are believed to be transported by the same family of hexose transporters with glucose being the preferred carbon source [71, 72]. The xylose inhibition constant with respect to glucose (Kig) was chosen to capture the effect of repressed xylose uptake in the presence of glucose [71]. Figure 9.10 shows the results of a fed-batch simulation with constant feeding of a 50%/50% glucose/xylose mixture. The mixed-substrate results were generated with a longer batch time of 20 hours than the pure glucose media simulation (Figure 9.4) due to the xylose utilizing strain having higher saturation constants, a lower maximum glucose uptake rate, and inhibition of xylose uptake in the presence of glucose. Furthermore, a longer aerobic phase was necessary to generate a sufficiently high biomass concentration such that the substrate was mostly consumed by the final batch time. The switch from aerobic to anaerobic conditions at 16 hours was characterized by a significant increase in ethanol production and a sharp decline in biomass production. The xylose concentration increased due to media feeding until decreasing sharply after glucose was nearly exhausted. Glycerol production was insignificant as a result of the limited residual glucose present following the switch to anaerobic conditions. The production rate of
Table 9.5 Parameter Values for Glucose/Xylose Media Dynamic Simulation

Variable | Value | Reference
vg,max | 7.3 mmol/gdw/h | [71]
Kg | 1.026 g/l | [71]
vz,max | 32 mmol/gdw/h | [71]
Kz | 14.85 g/l | [71]
vo,max | 8 mmol/gdw/h | [51]
Ko | 0.003 mmol/l | [51]
Kie | 10 g/l | —
Kig | 0.5 g/l | —
Osat | 0.30 mmol/l | —
V0 | 0.5 l | —
F | 0.035 l/h | —
Gf | 50 g/l | —
Zf | 50 g/l | —
Vf | 1.2 l | —
tf | 20.0 h | —
G0 | 5 g/l | —
Z0 | 5 g/l | —
X0 | 0.05 g/l | —
the byproduct xylitol was much higher than that of the competing byproduct glycerol in the glucose media case (Figure 9.4), which suggested that metabolic engineering strategies are needed to divert carbon from xylitol to ethanol and/or biomass. These fed-batch predictions are in qualitative agreement with experimental batch profiles presented in [71]. Steady-state FBA results with glucose/xylose mixed substrates are presented in Table 9.6 for the ten genetic manipulations suggested in [9] and analyzed for glucose media in Section 9.3.4. While each manipulation was predicted to yield a simultaneous increase
Figure 9.10 Fed-batch simulation profiles for the xylose utilizing S. cerevisiae strain RWB 218 [71] obtained with an extended version of the iND750 stoichiometric model. The glucose/xylose feed was maintained constant and a switch in the dissolved oxygen concentration from 50% DO to 0% DO was implemented at 17.0 hours as indicated by the vertical line.
Table 9.6 Steady-State FBA for S. cerevisiae Mutants in Glucose and Xylose Media

Anaerobic (vg = 2.4, vz = 2.1, vo = 0.0 mmol/g/h). Wild type: μ = 0.073 h−1; ethanol yield = 0.406 g/g.

Label | Ethanol Yield Increase (%) | Biomass Yield Increase (%) | Reaction Flux (mmol/g/h)
Δgdh1 glt1 gln1 | 9.1 | 4.1 | —
Δgdh1 gdh2 | 8.1 | 9.5 | —
R00365 | 12.2 | 15.2 | 0.96
R01866 | 12.3 | 15.3 | 0.98
R00112 | 12.2 | 15.2 | 0.96
R00105 | 9.6 | 4.1 | 0.44
R01039 | 9.6 | 4.1 | 0.45
R00845 | 12.2 | 15.2 | 0.96
R01063 | 12.2 | 15.2 | 0.96
R01058 | 9.6 | 4.1 | 0.45

Aerobic (vg = 2.4, vz = 2.1, vo = 7.84 mmol/g/h). Wild type: μ = 0.319 h−1; ethanol yield = 0.111 g/g.

Δgdh1 glt1 gln1 | 17.7 | −7.6 | —
in the ethanol and biomass yields under anaerobic conditions compared to the wild type, the relative performance of these manipulations could not be determined without DFBA. Compared to glucose media (Table 9.4), higher increases in ethanol yields and smaller increases in biomass yields were predicted. Only the deletion/overexpression Δgdh1 glt1 gln1 mutant differed from the wild type under aerobic conditions (50% DO). The impact of the substantial increase in ethanol yield and the large decrease in biomass yield for aerobic growth was difficult to quantitatively assess, especially when considering fed-batch culture with both aerobic and anaerobic growth phases. The sensitivity of fed-batch ethanol productivities with mixed substrates to the aerobic-anaerobic switching time is shown in Figure 9.11. The predicted productivities were substantially lower than for glucose media (Figure 9.8) due to reduced substrate uptake rates and significant secretion of xylitol as a competing byproduct. The productivity measure allowed an explicit ranking of the manipulation strategies, with the R00112, R00365, R00845, R01063, and R01866 insertions predicted to yield the best performance. These insertions comprised the second highest ranked group for glucose media, demonstrating that the media should be considered simultaneously with the genetic manipulation and the fed-batch operating policy to achieve optimal performance. Unlike the glucose media case, the optimal switching time was relatively insensitive to the manipulation, suggesting that the optimum was most strongly affected by the substrate uptake kinetics. Only the deletion/overexpression Δgdh1 glt1 gln1 required a significantly different switching time, but this manipulation produced a substantially lower productivity due to its reduced aerobic biomass yield. Comparison of these dynamic predictions with the steady-state FBA results (Table 9.6) revealed that manipulations with relatively high biomass yields were most favorable for fed-batch growth on these mixed substrates. Tests with 25%/75% and 75%/25% glucose/xylose mixtures were conducted and similar trends were predicted (not shown). In an effort to reveal novel metabolic engineering strategies for ethanol production from glucose/xylose media, we performed DFBA for mixed substrates to screen the 357 reactions corresponding to potential gene insertions extracted from the KEGG database
Determining Metabolite Production Capabilities of Saccharomyces Cerevisiae
Δ Δ
Figure 9.11 Sensitivity of the ethanol productivity to the aerobic-anaerobic switching time (ts) in fed-batch culture with glucose and xylose media predicted with the iND750 stoichiometric model. The dotted line indicates the optimal switching time for the wild type strain.
(see Section 9.3.4). Figure 9.12 shows that DFBA revealed 15 new insertions that matched the performance of the top five insertions from [9]. The top 25 insertions could be divided into two sets according to their optimal switching time and predicted ethanol productivity. These two sets appeared in both glucose and mixed media analysis, but their relative performance was reversed such that the top five insertions in glucose media were surpassed by the set of 20 insertions in the mixed media. This result emphasizes the importance of explicitly considering the media composition when utilizing DFBA to identify mutants for metabolite overproduction.
9.4 Discussion and Commentary Large-scale production of many important biochemical products is performed in batch and fed-batch bioreactors in which the assumption of balanced growth implicit in classical flux balance analysis (FBA) does not hold. Dynamic flux balance analysis (DFBA) is an extension of FBA that allows the prediction and engineering of cellular metabolism for dynamic cell culture. The core element of DFBA is a dynamic flux balance model that combines a stoichiometric cell model with dynamic mass balances on extracellular substrates and products through experimentally determined substrate uptake kinetics and the calculated growth rate. Our work has focused on the analysis and engineering of Saccharomyces cerevisiae metabolism for enhanced ethanol production in batch and fed-batch culture. We have successfully applied DFBA to the dynamic simulation of batch and fed-batch fermentation, the dynamic optimization of fed-batch operating policies, the in silico identification of ethanol overproducing mutants in dynamic cell culture, and the in silico introduction of novel metabolic capabilities for xylose consumption. 172
9.4
Discussion and Commentary
Figure 9.12 Dynamic screening of a gene insertion library derived from the KEGG database for optimal fed-batch ethanol productivity from glucose and xylose media predicted with the iND750 stoichiometric model. Results are presented as percentage increase in the ethanol productivity relative to the wild-type strain. Insertions proposed by [9] are shown as white bars. The number indicated to the right of each bar indicates the optimal aerobic-anaerobic switching time.
Both FBA and DFBA are based on several assumptions that have not been fully validated through experiment. The most essential and controversial assumption is that cell metabolism is regulated to maximize the cellular growth rate or a similar objective. Computational evidence supporting this hypothesis includes the ability to predict growth phenotypes of knockout mutants with 75–90% accuracy [6, 27] and the qualitative reproduction of biomass and extracellular metabolite profiles in batch cultures [13–15]. Although not presented in this chapter, we have unpublished results for S. cerevisiae that show measured gene expression data is largely captured by the maximal growth objective [48] and that dynamic flux balance models can be parameterized to produce quantitative agreement with batch and fed-batch data [45]. Despite these successes, the maximal growth hypothesis appears to be inappropriate for more complex eukaryotic cells in plants and animals and will remain controversial even for microbes that are the current focus of study. Another key assumption is that the biomass composition remains constant under different growth conditions despite experimental evidence to the contrary [37]. We have unpublished results showing that experimentally deter173
Determining Metabolite Production Capabilities of Saccharomyces Cerevisiae
mined variations in S. cerevisiae biomass composition can introduce errors approaching 10% in FBA and DFBA predictions [48]. An implicit assumption of flux balance analysis techniques is that metabolic engineering through gene deletions and insertions does not affect the substrate uptake rates such that the wild-type values can be used. To our knowledge, this assumption has not been experimentally evaluated by direct comparison of wild-type and mutant metabolism. Given the availability of a suitable stoichiometric cell model and the capability to develop the necessary substrate uptake kinetics, the primary challenges associated with the application of DFBA are computational. Dynamic flux balance model simulation for prediction of batch or fed-batch culture dynamics requires simultaneous solution of the linear program (LP) for growth rate maximization and integration of the extracellular mass balance equations. We have found that the simulation problem can be efficiently and robustly solved by embedding the substrate uptake kinetics and the LP within the extracellular balance solution such that high performance integration codes can be utilized. Fed-batch culture optimization requires the solution of a much more demanding bilevel nonlinear programming problem in which the cellular objective is growth rate maximization and the engineering objective is maximal metabolite production. We have used the small-scale iGH99 stoichiometric model to develop a solution strategy based on reformulating the bilevel programming problem as a single level nonlinear program through temporal discretization of the extracellular mass balance equations and replacement of the LP with its associated first-order optimality conditions to generate complimentarity constraints. Our initial attempts to implement this method with the genome-scale iND750 stoichiometric model have proven unsuccessful due to greater model complexity and increased problem size. We have used a brute force strategy involving enumeration and evaluation to screen a library of candidate gene insertions for enhanced ethanol production. Because this approach is computationally infeasible for screening large libraries and/or multiple gene insertions, extensions of existing mixed-integer linear programming methods [8, 73] that account for culture dynamics are needed. Ultimately, computational strategies that allow simultaneous optimization of the cellular design, media components, and dynamic operating policies for maximization of metabolite production in batch and fed-batch culture should be developed. The main alternative to dynamic flux balance modeling is full kinetic modeling with the enzyme kinetics specified for each intracellular reaction [18, 21, 22, 74, 75]. Advantages of kinetic models include the lack of an assumed cellular objective, the possibility of including regulation at the enzyme level, and the simultaneous prediction of reaction rates and species concentrations. While kinetic models have been fruitfully utilized for the analysis and engineering of individual metabolic pathways [19, 76], the necessity of including enzyme kinetics has severely restricted their application to comprehensive metabolic modeling. Dynamic flux balance models are well suited for this purpose due to the increasing availability of genome-scale stoichiometric models and the minimal requirement that only substrate uptake kinetics are required for model construction. 
DFBA also offers important computational advantages due to the LP formulation of intracellular metabolism. Dynamic simulation of a hypothetical genome-scale kinetic model would require numerical integration of about 1,000 differential equations. The fed-batch and mutant optimization problems discussed in this chapter would quickly become intractable with such kinetic models. Ultimately, the two dynamic modeling approaches may be combined synergistically with full kinetic equations incorporated 174
9.5
Summary Points
for well-characterized primary pathways and stoichiometric equations used for the remaining reactions.
9.5 Summary Points •
Dynamic flux balance analysis (DFBA) is an extension of classic flux balance analysis (FBA) that accounts for cell culture dynamics and allows prediction of cellular metabolism in batch and fed-batch fermentations.
•
The scope of DFBA includes the dynamic simulation of batch and fed-batch cultures, the dynamic optimization of fed-batch operating policies, the in silico identification of metabolite overproducing mutants in dynamic cell culture, and the in silico introduction of novel metabolic capabilities such as the consumption of new substrates and the production of nonnative metabolic products.
•
Both FBA and DFBA are based on the assumption that substrates are consumed and products are produced to maximize the cellular growth rate.
•
Both FBA and DFBA require the availability of a stoichiometric cell model that allows the steady-state prediction of intracellular fluxes to the biomass constituents and metabolic products from uptake rates of the extracellularly supplied substrates.
•
The dynamic flux balance model needed for DFBA is developed by combining the stoichiometric cell model with dynamic mass balances on extracellular substrates and products through experimentally determined substrate uptake kinetics and the calculated growth rate.
•
As compared to alternative approaches such as enzyme kinetic models, the primary advantages of DFBA are that the increasing availability of stoichiometric cell models is fully leveraged and that very little additional information is required for model construction.
•
Batch and fed-batch simulation of a dynamic flux balance model involves simultaneous solution of the linear program for growth rate maximization and integration of the extracellular mass balance equations.
•
The use of DFBA for fed-batch culture optimization requires the solution of a bilevel nonlinear programming problem in which the cellular objective is growth rate maximization and the engineering objective is maximal metabolite production.
•
Applications of DFBA to Saccharomyces cerevisiae demonstrate that maximization of ethanol production capabilities requires simultaneous optimization of the growth media, the metabolic engineering strategy, and the fed-batch operating policy.
Acknowledgments Financial support for Jared L. Hjersted from the UMass Center for Process Design and Control is gratefully acknowledged. The authors acknowledge the contributions of Radhakrishnan Mahadevan (University of Toronto) to the in silico metabolic engineering work presented in Sections 9.3.4 and 9.3.5.
175
Determining Metabolite Production Capabilities of Saccharomyces Cerevisiae
References [1] [2] [3]
[4] [5] [6]
[7] [8] [9]
[10]
[11]
[12] [13]
[14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27]
176
Reed, J.L., I. Famili, I. Thiele, and B.O. Palsson, “Towards multidimensional genome annotation,” Nature Reviews Genetics, Vol. 7, 2006, pp. 130–141. Stephanopoulos, G.N., A.A. Aristidou, and J. Nielsen, Metabolic Engineering: Principles and Methodologies, New York: Academic Press, 1998. Sauer, U., V. Hatzimanikatis, H.P. Hohmann, M. Manneberg, A.P. van Loon, and J.E. Bailey, “Physiology and metabolic fluxes of wild-type and riboflavin-producing Bacillus subtilis,” Appl. Environ. Microbiol., Vol. 62, 1996, pp. 3687–3696. Segre, D., D. Vitkup, and G.M. Church, “Analysis of optimality in natural and perturbed metabolic networks,” Proc. Natl. Acad. Sci. USA, Vol. 99, 2002, pp. 15112–15117. Kauffman, K.J., P. Prakash, and J.S. Edwards, “Advances in metabolic flux analysis.” Curr. Opin. Biotechnol., Vol. 14, 2003, pp. 491–496. Famili, I., J. Forster, J. Nielsen, and B.O. Palsson, “Saccharomyces cerevisiae phenotypes can be predicted by using constraint-based analysis of a genome-scale reconstructed metabolic network,” Proc. Natl. Acad. Sci. USA, Vol. 100, 2003, pp. 13134–13139. Burgard, A.P., and C.D. Maranas, “Probing the performance limits of the Escherichia coli metabolic network subject to gene additions or deletions,” Biotechnol. Bioeng., Vol. 74, 2001, pp. 364–375. Pharkya, P., A.P. Burgard, and C.D. Maranas, “OptStrain: A computational framework for redesign of microbial production systems,” Genome Res., Vol. 14, 2004, pp. 2367–2376. Bro, C., B. Regenberg, J. Forster, and J. Nielsen, “In silico aided metabolic engineering of Saccharomyces cerevisiae for improved bioethanol production,” Metabolic Eng., Vol. 8, 2006, pp. 102–111. Alfenore, S., X. Cameleyre, L. Benbadis, C. Bideaux, J.-L. Uribelarra, G. Goma, C. Molina-Jouve, and S.E. Guillouet, “Aeration strategy: A need for very high ethanol performance in Saccharomyces cerevisiae fed-batch process,” Appl. Microbiol. Biotechnol., Vol. 63, 2004, pp. 537–542. Converti, A., S. Arni, S. Sato, J.C. de Carvalho, and E. Aquarone, “Simplified modeling of fed-batch alcoholic fermentation of sugarcane blackstrap molasses,” Biotechnol. Bioeng., Vol. 84, 2003, pp. 88–95. Nilssen, A., M.J. Taherzadeh, and G. Linden, “Use of dynamic step response for control of fed-batch conversion of lignocellulosic hyrdrolyzates to ethanol,” J. Biotechnol., Vol. 89, 2001, pp. 41–53. Varma, A., and B.O. Palsson, “Stoichiometric flux balance models quantitatively predict growth and metabolic by-product secretion in wild-type Escherichia coli,” Appl. Environ. Microbiol., Vol. 60, 1994, pp. 3724–3731. Mahadevan, R., J.S. Edwards, and F.J. Doyle III, “Dynamic flux balance analysis of diauxic growth in Escherichia coli,” Biophys. J., Vol. 83, 2002, pp. 1331–1340. Sainz, J., F. Ricardo Perez-Correa, and E. Agosin, “Modeling of yeast metabolism and process dynamics in batch fermentation,” Biotechnol. Bioeng., Vol. 81, 2003, pp. 818–828. Gadkar, K.P., F.J. Doyle III, J.S. Edwards, and R. Mahadevan, “Estimating optimal profiles of genetic alterations using constraint-based models,” Biotechnol. Bioeng., Vol. 89, 2004, pp. 243–251. Nielsen, J., and J. Villadsen, Bioreaction Engineering Principles, New York: Plenum Press, 1994. Steinmeyer, D.E., and M.L. Shuler, “Structured model for Saccharomyces cerevisiae,” Chem. Eng. Sci., Vol. 44, 1989, pp. 2017–2030. Vaseghi, S., A. Baumeister, M. Rizzi, and M. Reuss, “In vivo dynamics of the pentose phosphate pathway in Saccharomyces cerevisiae,” Metabolic Eng., Vol. 1, 1999, pp. 128–140. Hatzimanikatis, V., M. Emmerling, U. 
Sauer, and J.E. Bailey, “Application of mathematical tools for metabolic design of microbial ethanol production,” Biotech. Bioeng., Vol. 58, 1998, pp. 154–161. Jones, K.D., and D.S. Kompala, “Cybernetic modeling of the growth dynamics of Saccharomyces cerevisiae in batch and continuous cultures,” J. Biotech., Vol. 71, 1999, pp. 105–131. Varner, J. and D. Ramkrishna, “Metabolic engineering from a cybernetic perspective. 1. Theoretical preliminaries,” Biotechnol. Prog., Vol. 15, 1999, pp. 407–425. Covert, M.W., C.H. Schilling, and B.O. Palsson, “Regulation of gene expression in flux balance models of metabolism,” J. Theor. Biol., Vol. 213, 2001, pp. 73–88. Akesson, M., J. Forster, and J. Nielsen, “Integration of gene expression data into genome-scale metabolic models,” Metabol. Eng., Vol. 6, 2004, pp. 285–293. Palsson, B.O., Systems Biology: Properties of Reconstructed Networks, New York: Cambridge University Press, 2006. Reed, J.L., T.D. Vo, C.H. Schilling, and B.O. Palsson, “An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR),” Genome Biology, Vol. 4, 2003, pp. R54.1–R54.12. Duarte, N.C., M.J. Herrgard, and B.O. Palsson, “Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genome-scale metabolic model,” Genome Res., Vol. 14, 2004, pp. 1298–1309.
Acknowledgments
[28]
[29] [30] [31] [32] [33]
[34]
[35] [36]
[37] [38] [39] [40] [41]
[42] [43]
[44] [45] [46] [47]
[48] [49]
[50] [51]
[52] [53] [54]
Duarte, N.D., S.A. Becker, N. Jamshidi, I. Thiele, M.L. Mo, T.D. Vo, R. Srivas, and B. O. Palsson, “Global reconstruction of the human metabolic network based on genomic and bibliomic data,” Proc. Natl. Acad. Sci., Vol. 104, 2007, pp. 1777–1782. Varma, A., and B.O. Palsson, “Metabolic capabilities of Escherichia coli. II. Optimal growth patterns,” J. Theor. Biol., Vol. 165, 1993, pp. 503–522. Nissen, T.L., U. Schulze, J. Nielsen, and J. Villadsen, “Flux distributions in anaerobic, glucose-limited continuous cultures of Saccharomyces cerevisiae,” Microbiology, Vol. 143, 1997, pp. 203–218. van Gulik, W.M., and J.J. Heijnen, “A metabolic network stoichiometry analysis for microbial growth and product formation,” Biotechnol. Bioeng., Vol. 48, 1995, pp. 681–698. Forster, J., I. Famili, P. Fu, B. O. Palsson, and J. Nielsen, “Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network,” Genome Res., Vol. 13, 2003, pp. 244–253. Schilling, C.H., M.W. Covert, I. Famili, G.M. Church, J.S. Edwards, and B.O. Palsson, “Genome-scale metabolic model of Helicobacter pylori 26695,” J. Bacteriol., Vol. 184, 2002, pp. 4582–4593. Schmidt, K., L.C. Norregaard, B., Pedersen, A. Meissner, J.O. Dussa, J. Nielsen, and J. Villadsen, “Quantification of intracellular metabolic fluxes from fractional enrichment and 13c–13c coupling constants on the isotopomer distribution in labeled biomass components,” Metabol. Eng., Vol. 1, 1999, pp. 166–179. Majewski, R.A., and M.M. Domach, “Simple constrained optimization view of acetate overflow in Escherichia coli,” Biotechnol. Bioeng., Vol. 35, 1990, pp. 732–738. Ramakrishna, R., J.S. Edwards, A. McCulloch, and B.O. Palsson, “Flux balance analysis of mitochondrial energy metabolism: Consequences of systemic stoichiometric constraints,” Am. J. Physiol. Regulatory Integrative Comp. Physiol., Vol. 280, 2001, pp. R695–R704. Lange, H.C., and J.J. Heijnen, “Statistical reconciliation of the elemental and molecular biomass composition of Saccharomyces cerevisiae,” Biotechnol. Bioeng., Vol. 75, 2001, pp. 334–344. Chvatal, V., Linear Programming, New York: W.H. Freeman and Company, 1983. Bixby, R.E., “Implementing the simplex method: The initial basis,” ORSA Journal on Computing, Vol. 4, 1992, pp. 267–284. Roos, C., T. Terlaky, and J.-Ph. Vial, Theory and Algorithms for Linear Optimization: An Interior Point Approach, New York: John Wiley and Sons, 1997. Lee, S., C. Phalakornkule, M.M. Domach, and I.E. Grossmann, “Recursive MILP model for finding all the alternate optima in LP models for metabolic networks,” Comput. Chem. Eng., Vol. 24, 2000, pp. 711–716. Mahadevan, R., and C.H. Schilling, “Effects of alternate optima on constraint-based genome-scale metabolic models,” Metabolic Eng., Vol. 5, 2003, pp. 264–276. Rufleux, P.-A., U. von Stockar, and I.W. Marison, “Measurement of volumetric (OUR) and determination of specific (qO2) oxygen uptake rates in animal cell cultures,” J. Biotechnol., Vol. 63, 1998, pp. 85–95. Gorgens, J.F., and J.H. Knoetze W.H. van Zyl, “Reliability of methods for the determination of specific substrate consumption rates in batch culture,” Biochem. Eng. J., Vol. 25, 2005, pp. 109–112. Hjersted, J.L., and M.A. Henson, “Parameterization and validation of a Saccharomyces cerevisiae dynamic flux balance model with batch and fed-batch experiments,” in preparation. Hjersted, J.L., and M.A. Henson, “Optimization of fed-batch Saccharomyces cerevisiae fermentation using dynamic flux balance models,” Biotechnol. Prog., Vol. 22, 2006, pp. 1239–1248. 
Vanrolleghem, P.A., P. de Jong-Gubbels, W.M. van Gulik, J.T. Pronk, J.P. van Dijken, and S. Heijnen, “Validation of a metabolic network for Saccharomyces cerevisiae using mixed substrate studies,” Biotechnol. Prog., Vol. 12, 1996, pp. 434–448. Hjersted, J.L., and M.A. Henson, “Steady-state and dynamic flux balance analysis of ethanol production by Saccharomyces cerevisiae,” IET Systems Biology, accepted. Phalakornkule, C., S. Lee, T. Zhu, R. Koepsel, M.M. Ataai, I.E. Grossman, and M.M. Domach, “A milp-based flux alternative generation and nmr experimental design strategy for metabolic engineering,” Metab. Eng., Vol. 3, 2001, pp. 124–137. Duarte, N.C., B.O. Palsson, and P. Fu, “Integrated analysis of metabolic phenotypes in Saccharomyces cerevisiae,” BMC Genomics, Vol. 5, 2004, pp. 63–73. Sonnleitner, B., and O. Kappeli, “Growth of Saccharomyces cerevisiae is controlled by its limited respiratory capacity: Formulation and verification of a hypothesis,” Biotechnol. Bioeng., Vol. 28, 1986, pp. 927–937. Johnson, A., “The control of fed-batch fermentation—A survey,” Automatica, Vol. 23, 1987, pp. 691–705. Lubbert, A., and S.B. Jorgensen, “Bioreactor performance: A more scientific approach for practice,” J. Biotechnol., Vol. 85, 2001, pp. 187–212. Banga, J.R., A.A. Alonso, and R.P. Singh, “Stochastic dynamic optimization of batch and semicontinuous bioprocesses,” Biotechnol. Prog., Vol. 13, 1997, pp. 326–335.
177
Determining Metabolite Production Capabilities of Saccharomyces Cerevisiae
[55] [56] [57] [58]
[59] [60] [61] [62] [63] [64] [65] [66] [67]
[68] [69] [70] [71]
[72]
[73]
[74] [75] [76]
Kookos, I.K., “Optimization of batch and fed-batch bioreactors using simulated annealing,” Biotechnol. Prog., Vol. 20, 2004, pp. 1285–1288. Lee, J.-H., “Comparison of various optimization approaches for fed-batch ethanol production,” Appl. Biochem. Biotechnol., Vol. 81, 1999, pp. 91–106. Luus, R., “Application of dynamic programming to differential algebraic process systems,” Comput. Chem. Eng., Vol. 17, 1993, pp. 373–377. Vera, J., P. de Atauri, M. Cascante, and N.V. Torres, “Multicriteria optimization of biochemical systems by linear programming: Application to production of ethanol by Saccharomyces cerevisiae,” Biotechnol. Bioeng., Vol. 83, 2003, pp. 335–343. Wang, F.S., and C.S. Shyu, “Optimal feed policy for fed-batch fermentation of ethanol production by Zymomous mobilis,” Bioproc. Eng., Vol. 17, 1997, pp. 63–68. Biegler, L.T., and I.E. Grossmann, “Retrospective on optimization,” Comput. Chem. Eng., Vol. 28, 2004, pp. 1169–1192. Biegler, L.T., A.M. Cervantes, and A. Wachter, “Advances in simultaneous strategies for dynamic process optimization,” Chem. Eng. Sci., Vol. 57, 2002, pp. 575–593. Cuthrell, J.E., and L.T. Biegler, “Simultaneous optimization and solution methods for batch reactor control problems,” Comput. Chem. Eng., Vol. 13, 1987, pp. 49–62. Raghunathan, A.U., J.R. Perez-Correa, and L.T. Biegler, “Data reconciliation and parameter estimation in flux balance analysis,” Biotechnol. Bioeng., Vol. 84, 2003, pp. 700–709. Bader, G., and U. Ascher, “A new basis implementation for a mixed order boundary value ODE solver,” SIAM J. Sci. Comp., Vol. 8, 1987, pp. 483–500. Fourer, R., D.M. Gay, and B.W. Kernighan, “A modeling language for mathematical programming,” Management Science, Vol. 36, 1990, pp. 519–554. Drud, A.S., “CONOPT—a large scale GRG code,” ORSA Journal on Computing, Vol. 6, 1994, pp. 207–216. Hjersted, J.L., M.A. Henson, and R. Mahadevan, “Genome-scale analysis of Saccharomyces cerevisiae metabolism and ethanol production in fed-batch culture,” Biotechnol. Bioeng., Vol. 97, 2007, pp. 1190–1204. Aristidou, A., and M. Penttila, “Metabolic engineering applications to renewable resource utilization,” Curr. Opin. Biotechnol., Vol. 11, 2000, pp. 187–198. Ostergaard, S., L. Olsson, and J. Nielsen, “Metabolic engineering of Saccharomyces cerevisiae,” Microbiology and Molecular Biology Reviews, Vol. 64, 2000, pp. 34–50. Jeffries, T.W., and Y.-S. Jin, “Metabolic engineering for improved fermentation of pentoses by yeasts,” Appl. Microbiol. Biotechnol., Vol. 63, 2004, pp. 495–509. Kuyper, M., M.J. Toirkens, J.A. Diderich, A.A. Winkler, J.P. van Dijken, and J.T. Pronk, “Evolutionary engineering of mixed-sugar utilization by a xylose-fermenting Saccharomyces cerevisiae strain,” FEMS Yeast Research, Vol. 5, 2005, pp. 925–934. Zaldivar, J., A. Borges, B. Johansson, H.P. Smits, S.G. Villas-Boas, J. Nielsen, and L. Olsson, “Fermentation performance and intracellular metabolite patterns in laboratory and industrial xylose-fermenting Saccharomyces cerevisiae,” Appl. Microbiol. Biotechnol., Vol. 59, 2002, pp. 436–442. Burgard, A.P., P. Pharkya, and C.D. Maranas, “OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization,” Biotechnol. Bioeng., Vol. 84, 2003, pp. 647–657. Nielsen, J., and J. Villadsen, “Modelling of microbial kinetics,” Chem. Eng. Sci., Vol. 47, 1992, pp. 4225–4270. Rizzi, M., M. Baltes, U. Theobald, and M. Reuss, “In vivo analysis of metabolic dynamics in Saccharomyces cerevisiae: II. 
Mathematical model,” Biotechnol. Bioeng., Vol. 55, 1997, pp. 592–608. Rizzi, M., U. Theobald, E. Querfurth, T. Rohrhirsch, M. Baltes, and M. Reuss, “In vivo investigations of glucose transport in Saccharomyces cerevisiae,” Biotechnol. Bioeng., Vol. 49, 1996, pp. 316–327.
Related Resources and Supplementary Electronic Information Center for Microbial Biotechnology (CMB), iFF708 genome-scale metabolic model, http://www.cmb.dtu.dk/Forskning/Software.aspx. Kyoto Encyclopedia of Genes and Genomes (KEGG), LIGAND database of biochemical reactions for various organisms, http://www.genome.jp/kegg/. Systems Biology Research Group, University of California at San Diego, Genome-scale metabolic models for many organisms (including iND750), http://gcrg.ucsd.edu/.
178
CHAPTER
10 Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling 1
2
1
Marc R. Birtwistle , Boris N. Kholodenko , and Babatunde A. Ogunnaike 1
Department of Chemical Engineering, University of Delaware, Newark, DE 19716 Department of Pathology, Anatomy, and Cell Biology, Thomas Jefferson University, Philadelphia, PA 19107 2
Abstract Predicting how different stimuli elicit distinct cell fate decisions is critical for advancement of bioengineering applications such as stem cell medicine and requires understanding the quantitative, dynamic behavior of cellular signal transduction systems. Mathematical modeling has emerged as a useful tool for obtaining such understanding; however, typical signal transduction models are extremely complex, containing hundreds of nonlinear ordinary differential equations and an even larger number of unknown parameters that must be estimated from experimental data. The sheer size of these models makes it computationally impractical to apply traditional experimental design methods in determining appropriate experimental strategies for estimating the model parameters accurately and precisely. In this chapter, we describe a computationally inexpensive, iterative experimental design procedure that allows one to determine how to perturb the system, what to measure, and when to measure it such that the unknown signal transduction model parameters can be identified to specified tolerances. Key terms
Signal transduction Experimental design Parameter identifiability Structural identifiability Impact analysis
179
Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling
10.1 Introduction Experimental design, in the broadest sense, is a methodology for generating experimental protocols that will maximize the information content in the experimental data sets produced thereby. Whether stated explicitly or not, such statistical experimental design strategies are based on assumed mathematical models of the systems of interest. Standard, “alphabetic” optimal experimental designs (A-optimal, D-optimal, E-optimal, and so forth) result from optimizing an appropriate norm of the Fisher Information Matrix (FIM) [1–3]. Although these standard optimal designs work well for systems and models with few parameters and experimental design variables [4–8], since the FIM is a function of the n-by-p parameter sensitivity matrix, where n is the number of data points and p is the number of parameters, even modestly sized models consisting of ~10 parameters and experimental decision variables pose a significant computational challenge. For biological signal transduction models that can contain hundreds of parameters and hundreds of experimental decision variables, applying standard experimental design methodology is computationally impractical. In this chapter, we present an experimental design methodology developed specifically for application to signal transduction models with a large number of unknown parameters and experimental decision variables. With modest computational requirements, the technique allows one to determine how to perturb the system, what to measure, and when to measure it such that the unknown model parameters are identifiable (i.e., determinable to within specified precision tolerances). First, we discuss the basic model structure under consideration, present an overview of parameter estimation, and define the parameter identifiability metrics that form the basis of the proposed experimental design procedure before presenting the procedure itself.
10.1.1
Model structure
We consider nonlinear, ordinary differential equation (ODE) models of the form dx = f(x , t , Θ, u(t )) dt x(t = 0) = x o (Θ, u(t = 0))
(10.1)
where x is an s-dimensional vector of model states, t is time, Θ is an p-dimensional vector of unknown parameters, and u is a c-dimensional vector of inputs. The n-dimensional vector of experimental observations (data), Y, consists of a true but unknown ( value Y, and εe, an n-dimensional vector of experimental errors; that is, ( Y = Y + εe
(10.2)
( $ the vector of model predictions of Y, is a function of the states x, and the nt-dimenY, sional vector of sampling times, ts; that is, Y = g( t s , x) It is related to the vector of true values εm by
180
(10.3)
10.1
( $ +ε Y=Y m
Introduction
(10.4)
where εm is an n-dimensional vector of unknown model mismatch errors. Combining (10.2) and (10.4) gives $ +ε +ε Y=Y e m
(10.5)
showing that both experimental and model mismatch errors contribute to observed discrepancies between model predictions and experimental data. When experimental errors dominate model mismatch errors, we have $ +ε Y≈Y e
(10.6)
Thus, under such conditions in which structural model uncertainties are not as significant as measurement noise, the residual vector of differences between model predictions and experimental data, defined as $ e = Y−Y
(10.7)
will be a reasonable realization of εe. Now, given an error model for εe, one can test whether e is a valid realization εe, and hence establish the validity of (10.6) and the adequacy of the proposed model for describing the experimental data. This is a standard assumption in classical statistical modeling, and the parameter identifiability metrics and experimental design techniques proposed below are also predicated on the validity of (10.6). Methods for performing such model adequacy tests are widely available [1, 9, 10], of course, but discussing them is outside the scope of this chapter.
10.1.2
Parameter estimation
Given the model described above, the objective in parameter estimation is to determine $ are the “best” possible represen$ such that the model predictions Y a set of parameters Θ tation of the experimental observations, Y. Out of several criteria by which parameter sets are judged to be “best,” the two most widely used are the maximum likelihood (ML) and weighted least squares (LS) criteria [11]. With weighted LS, the objective function to be minimized is φ = e T We
(10.8)
where φ is the weighted sum of squared residuals (or the weighted residual norm), where W is an n-by-n weighting matrix. When the experimental errors follow a zero-mean multivariate normal distribution, ε e ~ N( 0 , VY )
(10.9)
where VY is the error covariance matrix, setting W equal to VY and minimizing the $ weighted sum of squared residuals leads to minimum variance parameter estimates Θ [1],
181
Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling
$ min( φ = e T VY e ) → Θ = Θ
(10.10)
In particular, when VY is diagonal, the LS criterion is equivalent to the ML criterion [1]. To employ this objective function in practice, VY must be known, and it can be estimated from the data Y given a sufficient number of replicates. Given (10.9) and (10.10), the parameter covariance matrix, VΘ can be approximated by [1] VΘ ≈ ( Z T VY−1 Z )
−1
(10.11)
where Z is the parameter sensitivity matrix, defined by Zij ≡
∂ ei ∂ Θj
= $ Θ= Θ
∂ Yi ∂ Θj
(10.12) $ Θ= Θ
The approximation in (10.11) becomes more accurate as the variance of the measurements and the residual norm decrease [1, 12]. The inverse of the parameter covariance matrix (ZTVY–1Z) is commonly referred to as the FIM. It is quite common for the parameter sensitivity matrix to be ill-conditioned, which impairs one’s ability to obtain reasonable values for Vθ. To avoid this ill-conditioning, use of scaled quantities (denoted by ∼) is preferred: ~ ~ ~ −1 ~ ~ ~$ ~ ~ $ = ( V1 2 )Y $ ;e ≡ Y Θj ≡ Θj Θj ; Y ≡ ( VY1 2 ) Y;Y − Y ; Zij ≡∂ ei ∂ Θj = ∂ Y$i ∂ Θj Y
(10.13)
All the experimental data Y are scaled by the matrix square root of VY, which has two desirable effects: (1) the scaled measurement units are dimensionless, and (2) the scaled measurement covariance matrix is the identity matrix,
((V
( )
~ ~ VY ≡ Y = var = (V
1 2 Y
)
−1
1 2 Y
VY ( V
)
1 2 Y
−1
)
Y = ( VY1 2 ) var( Y)( VY1 2 )
)
−1 T
−1
−1 T
(10.14)
=I
All parameters Θ are scaled by their best fit values, which nondimensionalizes the parameter units and scales them all to order one. The resulting scaled parameter covariance matrix is given by
(
~ ~ ~ ~ VΘ ≈ Z T VY Z
10.1.3
) = (Z Z ) −1
~T ~
−1
(10.15)
Identifiability metrics and conditions
Minimization of the objective function in (10.10) leads to a best fit response and the $ with covariance matrix V . In this section we propose metrics and conparameter set Θ, Θ
ditions for identifiability which can be used to assess the quality of these parameter estimates. We consider two main classes of identifiability: 182
10.1
Introduction
1. Structural Identifiability. A model M(Θ,Y,x) is structurally identifiable if the elements of the parameter set Θ can be uniquely estimated from noise-free measurements Y. 2. Parameter Identifiability. A parameter set Θ of a model M(Θ,Y,x) is identifiable if it can be estimated to within a specified precision from an experimental data set Y. These definitions are general in that they can be considered globally (with respect to $ Because we are the entire parameter space), or locally (in a neighborhood around Θ). dealing with nonlinear systems, global identifiability may differ from local identifiability. However, evaluation of global identifiability requires a combinatorial search over the entire parameter space, which is computationally intensive. As the main purpose of this work is to reduce the computational requirements for experimental design, we therefore focus here on local identifiability metrics. We note however that any identifiability metric, local or global, is compatible with our experimental design procedure.
10.1.3.1 Local structural identifiability Consider a first-order Taylor series expansion of the model predictions around the best-fit parameter values, ~ ~ $ =Z ΔY
~ $ Θ= Θ
~ ΔΘ
(10.16)
$ Solving (10.16) for ΔΘ where Δ denotes a difference from the vector value when Θ = Θ. gives
(
~ ~ ~ ΔΘ = Z T Z
)
−1
~ ~$ Z T ΔY
(10.17)
~ This equation shows that to find a unique local solution for ΔΘ, the parameter sensitiv~ ~ ~T ~ ity matrix Z must be nonsingular so that Z Z is invertible. For Z to be nonsingular it must be of full rank, and therefore a condition for local structural identifiability is given by
( )
~ rank Z = p
(10.18)
~ ~ If the rank of Z is less than p, then the rank (Z) − p parameters that have no independent effect on the observables must be held constant in the current estimation problem.
10.1.3.2 Local parameter identifiability Given the error model defined by (10.6) and (10.9), an (1-α)-level confidence interval ~ for parameter Θj is given by δj = t αn − p
~ eT ~ e ~ VΘjj n− p
(10.19)
183
Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling
where t αn − p is a two-tailed t-distribution statistic evaluated with n-p degrees of freedom at ~ confidence level (1-α), and δj is the confidence interval (Θj ± 1 + δj ) [13]. Let ~ κ j be the ~ ~ specified precision for parameter Θ , such that we desire Θ ± 1 + κ . A parameter is identij
j
j
fiable if δj ≤ ~ κj
(10.20)
This equation states that if the parameter confidence interval is less than the specified parameter tolerance, it is identifiable.
10.1.4
Overview of the experimental design procedure
As explained previously, application of conventional experimental design techniques to large signal transduction modeling problems is computationally impractical. As a solution to this problem, we propose an iterative procedure for generating an experimental design whose implementation will yield an experimental data set Y that can be used to identify the model parameter values to specified precisions. While our procedure does not guarantee a unique design, it does guarantee an adequate design that is experimentally feasible to implement. The proposed procedure for experimental design is illustrated as a flowchart in Figure 10.1, and is described in detail in the methods section (Section 10.2). Over the course of this procedure, an initial experimental design, which is a comprehensive design that encompasses the feasible ranges of all perturbations and all measurements, is trimmed down to the implemented design, which contains only essential experiments needed for parameter identifiability. Parameter identifiability is tested several times during the experimental design process to determine whether more or fewer experiments are needed. Impact analysis, which identifies experiments that are the most valuable for
Initial Perturbation and Measurement Design
Design Reduction
Resources and other constraints
Identifiability Analysis Design Modification
No
Identifiability Analysis
Identifiable?
Yes
Identifiable?
No Yes Impact Analysis
Design Modification
Design Implementation
Figure 10.1 Experimental design for parameter identifiability procedure. See Section 10.2 for a detailed description of the procedure.
184
10.2
Methods
parameter identifiability, is central to the process, as the impact analysis results are used to determine the implemented design. Before going into the details of the experimental design procedure in the methods section (Section 10.2), it first is useful to list the characteristics of signal transduction models that dictate what can be done experimentally. In terms of traditional experimental design nomenclature, we must first identify the factors, or what perturbations can be made, and the responses, or what can be measured. Typical factors and responses are shown in Table 10.1, although this list may change slightly from model to model. Factors and responses can be either continuous or categorical. If a quantity is continuous, it can take any real number value within a particular range. If a quantity categorical, it can only take a finite number of discrete values.
10.2 Methods Here, we describe in detail how each step of the experimental design process shown in Figure 10.1 is performed. First, the purpose of each step and how to implement it are described. This description is followed by a simple, numbered list of tasks required to complete each step.
10.2.1
Initial perturbation and measurement design
10.2.1.1 Purpose and implementation The initial design is intended to explore the experimental design variable space comprehensively, while later steps in the design process identify subsets of the initial design that yield the most informative experiments. Thus, the initial designs are made to be as large as possible, only to be trimmed later. The initial perturbation design is determined using a factorial design, which requires setting levels for each factor and then permuting over combinations of levels for all the factors [2]. When feasible we recommend using a full factorial design, but it is recog-
Table 10.1 Classes of Factors and Responses for Typical Signal Transduction Model Factors
Name
Type
Description
Ligand type Ligand input sequence
Categorical Continuous
The different ligands that can be used to perturb the system of interest The dynamic profile of each ligand’s concentration. In this chapter we consider rectangular pulses, which are characterized by a magnitude and a duration Used to knock-down the level of a particular protein The different pharmaceutical inhibitors that target species in the system The concentration of each pharmaceutical inhibitor Total amount of a particular protein in the system of interest Amount of signaling-related post-translational modification of proteins in the system of interest. Examples include phospho-threonine, phospho-serine, phospho-tyrosine, and ubiquitylation, but there may be many more or less depending on the system Amount of signaling-related protein-protein association in the system of interest
1
Responses
siRNA Inhibitor type Inhibitor concentration Total protein abundance Post-translational modification
Categorical Categorical Continuous Continuous Continuous
Protein-protein association
Continuous
1 Although in principal the amount of siRNA-mediated protein knock-down can be adjusted, typical applications use the “all-or-nothing” approach where the protein is knocked-down as much as possible. It is in this sense that siRNA is considered categorical, but it can be made continuous if it is desired.
185
Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling
nized that in some instances when there are multiple ligands and/or siRNA targets this may not be reasonable. For such cases fractional factorial designs are recommended. To set the levels for each factor, first the feasible ranges/values for each factor are determined. The levels for categorical factors are set to each discrete value the variable can take. For continuous factors that span several orders of magnitude (e.g., ligand concentrations), we recommend that levels be determined using logarithmic spacing within the feasible range. For continuous factors that have narrower feasible ranges (e.g., pulse widths), we recommend that levels be determined using uniform spacing within the feasible range. Although the frequency of both uniform and logarithmic spacing will depend on experimental and computational considerations particular to each model, three levels are recommended as a minimum, as this will allow detection of potentially nonlinear relationships between the factors/responses and their impact (i.e., a quantification of how informative a particular measurement is; high impact means that a particular measurement greatly reduces parameter covariances. Impact metrics are defined in a subsequent section.). To determine the initial measurement design, first the measurement technology(ies) is selected. This selection is based both on what technology(ies) is available and the responses of the system of interest. Given the measurement technology(ies), a set of feasible measurements is constructed, and a sampling frequency upper bound(s) is determined. The initial measurement design consists of making all feasible measurements at the sampling frequency upper bound, in response to every condition in the initial perturbation design.
10.2.1.2 Procedure 1. Identify the factors and set their levels. 2. Identify the available measurement technology(ies). 3. Construct the set of feasible measurements. 4. Determine the sampling frequency upper bound(s). 5. Use a factorial design to construct the initial perturbation and measurement designs.
10.2.2
Identifiability analysis
10.2.2.1 Purpose and implementation Identifiability analysis is performed at several steps of the design procedure. Its purpose is to determine whether the parameter values are identifiable given: (1) a potential set of experimental data, (2) current parameter values, and (3) parameter tolerances. The first step in identifiability analysis is to test structural identifiability. To do this, the design (initial or reduced, depending on the stage of the process) is implemented in $ and the parameter silico to calculate the model predictions of the experimental data, Y, sensitivity matrix Z. The next step is to calculate the scaled parameter sensitivity matrix ~ ~ Z; however, to calculate Z one must know the experimental data covariance matrix, VY, which will not be known yet because the experiments have not been performed. To solve this problem, a conservative (large) estimate for VY based on typical values for the ~ measurement technology should be used. Finally, the rank of Z is calculated and compared to the total number of parameters to evaluate whether the model is structurally identifiable. 186
10.2
Methods
If the model is not structurally identifiable, then parameters having no independent $ should be held constant. In many cases, such parameters are easily identified effects on Y as they correspond to columns of Z that contain all zeros. In some cases, however, there are no columns in Z containing only zeros, yet the model is not structurally identifiable. In these cases, QR decomposition (MATLAB function QR) of Z can be used to identify the problematic parameters. Parameters corresponding to zero-valued diagonal elements of the resulting upper triangular matrix R are the problematic parameters that should be fixed. Note that the “economy size” QR decomposition (see MATLAB help file for the function QR), which is computationally less expensive than the full QR decomposition, is sufficient for these purposes. If the model is structurally identifiable, then parameter identifiability is tested. This requires calculation of the parameter covariance matrix, VΘ , and the parameter confidence intervals, δ, and specification of the parameter tolerances, ~ κ. To calculate confidence intervals, one must calculate the confidence interval “pre-factor” β, where β ≡ t αn − p
~ eT ~ e n− p
(10.21)
However, because the experiments have not yet been carried out, both the residuals, e, and the number of measurements, n, are unknown, and it is not possible to calculate β. To address this issue, below we derive an approximation for the expected value of β that can be used for identifiability analysis during the experimental design process. The number of measurements must at least be equal to the number of parameters, and the measurements must be replicated at least three times to estimate VY. Under these conditions, n = 3p, which gives β = t α3 p − p
~ ~ eT ~ e eT ~ e = t α2 p 3p − p 2p
(10.22)
The sum of squared residuals, ~ eT ~ e, follows a χ-squared distribution with 3p degrees of freedom [1], and therefore has an expected value of 3p, giving β = t α2 p
3p 3 = t α2 p 2p 2
(10.23)
Because we are dealing with models of high parameter dimension, the number of degrees of freedom for evaluating the t-distribution statistic, 2p, will be high such that t is near the asymptotic value. Taking a standard 95% confidence level (α = 0.05) gives β = 165 .
3 ≈2 2
(10.24)
Thus, one can approximate β = 2 for purpose of testing parameter identifiability in the experimental design process. Finally, parameter identifiability is tested by comparing the calculated confidence intervals to the parameter tolerances.
187
Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling
10.2.2.2 Procedure $ and Z. 1. Simulate implementation of the proposed experimental design to calculate Y 2. Propose a conservative approximation for the experimental data covariance matrix VY based on known aspects of the measurement technology. ~ 3. Calculate the scaled parameter sensitivity matrix Z. ~ 4. Calculate the rank of Z to test for structural identifiability. 5. If the model is not structurally identifiable, fix parameters that have no independent effects on the observables. 6. Calculate the approximate parameter covariance matrix VΘ . 7. Calculate the parameter confidence intervals. 8. Specify the parameter tolerances ~ κ. 9. Test for parameter identifiability.
10.2.3
Impact analysis
10.2.3.1 Purpose and implementation Impact analysis is the hub of the experimental design procedure, and it is used to determine which experiments have the greatest impact on the parameter variances (i.e., which experiments are the most informative). We propose three distinct impact metrics for this task: absolute sensitivity coefficients, net impacts, and importance coefficients. These metrics give slightly different measures of impact, and which is most useful is situation dependent. Here, we only define these metrics; the results of the case studies provide insight into the pros and cons of choosing experiments based on these different metrics. ~ Absolute sensitivity coefficients are simply the absolute values of the elements of Z, ~ s ij = Zij
(10.25)
where sij is the absolute sensitivity coefficient for potential measurement i and parameter j. A high absolute sensitivity coefficient means that the simulated value of measurement i is greatly affected by parameter j. Thus, only small changes in parameter j cause large changes in residual i. In that sense, measurements with high absolute sensitivities “lock down” parameter values and therefore have high impact. The net impact and importance coefficient are both based on singular value decomposition of the parameter sensitivity matrix, ~ Z = SΣR T
(10.26)
where S and R are unitary matrices and Σ is a diagonal matrix of singular values. Substitution of (10.26) into the first-order Taylor series expansion in (10.16) yields ~ ~ $ = SΣR T ΔΘ ΔY and rearrangement gives
188
(10.27)
10.2
Methods
~ $ = ΩΨ ΔY ⎡ Ω1 ⎤ ~ Ψ ≡ R T ΔΘ; Ω ≡ SΣ = ⎢ M ⎥ ⎢ ⎥ ⎢⎣Ω n ⎥⎦
(10.28)
where Ψ is an orthogonal “eigenparameter” set, and each row vector Ωi composing Ω gives the strengths with which each measurement i affects all the eigenparameter directions. We define the net impact of measurement i as p
ρ i ≡ Ωi =
∑ (S Σ )
2
ij
jj
(10.29)
j =1
The net impact of a particular measurement i will be large if it is a major component of ~ high singular value directions. Since the square root of the singular values of Z are equal ~T ~ ~T ~ to the eigenvalues of Z Z , and the largest eigenvalues of Z Z (the FIM) denote the eigenparameter directions that have the smallest variance, high net impact measurements significantly reduce parameter variances. Now consider a slightly different rearrangement of (10.27), ~ $ = ΣΨ S T ΔY
(10.30)
The RHS of (10.30) gives the singular value-weighted orthogonal parameter directions, while the LHS of (10.30) describes how each measurement contributes to each singular value-weighted parameter direction. We define the importance coefficient of measurement i for eigenparameter j as ωij ≡ SjiT = Sij
(10.31)
The importance coefficient measures how much a particular measurement i matters for determining the eigenparameter j. As S is a unitary matrix, the norm of each column is equal to 1, and therefore any importance coefficient will be between 0 and 1. Importance coefficients closer to 1 denote higher impact. These three impact metrics are calculated for each potential measurement in a design. To analyze these impact data, two different methods can be used: rank analysis and main effects analysis. Again, pros and cons of using these different analysis methods will be illustrated in the case studies; here we only provide basics on how to perform the analyses. In rank analysis, the potential measurements are ordered according to the metric of interest. If the impact metric is the net impact, ranking is straightforward since each potential measurement is described by a single metric. However, if the impact metric is the absolute sensitivity coefficient or the importance coefficient, each experiment/parameter combination has an impact measure, and therefore a single experiment does not have a unique impact. This apparent difficultly, however, actually gives a beneficial flexibility because parameter-specific information can be incorporated into the ranking. To do this, an experiment rank vector for each parameter is constructed, which results in p different ranked impact metric vectors, μj (a vector of absolute sensitivity 189
Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling
coefficients or importance coefficients). Then, the top experiments from every sorted vector are chosen until a fixed fraction φj of each vector’s norm is accounted for, such that φj ≥
μcj μj
∀j = 1, K , p
(10.32)
where j denotes a parameter index and μc denotes the impact metric vector for the chosen experiments. By choosing experiments in this way, the dimension of each μcj can be different, and therefore more experiments can be allocated to model parameters that are difficult to identify. In main effects analysis, the means of the metric of interest for different classes of factors and responses, or main effects, are calculated and then analyzed to identify generally informative experiment characteristics. Factors and responses having the largest main effects have the highest impact, and experiments containing these high-impact factors and responses should be chosen first. There are many possible ways to analyze the main effects; the case studies present some examples.
10.2.3.2 Procedure 1. Calculate the absolute sensitivity coefficients, net impacts, and importance coefficients. 2. Perform rank analysis. 3. Calculate the main effects of each factor and response for each impact metric. 4. Perform main effects analysis.
10.2.4
Design modification and reduction
10.2.4.1 Purpose and implementation During the experimental design process, the results of identifiability and impact analysis are used to modify and/or reduce the design. Although this part of the design process is highly situational and model dependent—and is best illustrated through the case studies presented below—there are some generalities that can be discussed regarding the initial design. It is entirely possible that parameter identifiability issues will arise with the initial design, and there are two options for dealing with such a scenario. One option is to expand the initial design, if possible. This may be done by considering alternative measurement technologies, higher sampling frequencies, or additional levels for factors. Alternatively, one can fix the unidentifiable parameters, excluding them from the experimental design process. Which option to select is highly situation dependent, and is best decided on a case by case basis. In many cases the unidentifiable model parameters will not be important (small sensitivity) for controlling the quantities of interest in the particular model, and in this sense fixing these parameters would often be reasonable. We stress, however, that parameter unidentifiability does not imply biological unimportance; rather, an unidentifiable parameter is not important for the measured variables according to the model.
190
10.2
Methods
10.2.4.2 Procedure Based on the results of identifiability and impact analysis, expand or reduce the currently considered experimental design. As this part of the procedure is highly situational and model dependent, we refer the reader to the case studies in subsequent sections for examples of how to perform design reduction and modification.
10.2.5
Design implementation
10.2.5.1 Purpose and implementation There is a remaining fundamental issue with the proposed experimental design strategy: the design is based upon the current values of the model parameters, which are not yet known. To resolve this issue we propose using an iterative, sequential design and estimation approach, where only a small subset of the reduced experimental design is implemented at each step in the iteration (Figure 10.2) [2]. For such an approach, Atkinson and Donev recommend that the square root of the total number of measurements in the design should be implemented [2]. At each iteration, the cumulative experimental data are used to refine the parameter estimates, and the current parameter estimates are then used to propose the next round of experiments. The process is repeated until the model agrees reasonably with the experimental observations and the unknown parameters are identifiable. The initial parameter estimates, Θ0, can be obtained by a variety of means, a discussion of which is outside the scope of this chapter. Regardless of how these initial estimates are obtained, however, it is essential to start with some parameter values [14]. The initial experimental data vector, Y0, may consist of literature data and/or preliminary experimental data; however, it is not essential to begin with experimental data. In the case that Y0 is empty, the first parameter estimation step is skipped and the procedure begins with the first experimental design. It is important to note that the parameter estimation steps are not trivial, and they are also an area of active research [13, 15–19].
10.2.5.2 Procedure
1. Select a small subset of the proposed experimental design for implementation. The square root of the number of measurements is recommended as a rough guideline, but more or fewer experiments can be selected depending on the available experimental resources.
2. Implement the selected subset of the experimental design.
3. Perform parameter estimation using all available experimental data.
4. Propose a new experimental design based on the updated parameter set.

Figure 10.2 Sequential parameter estimation/experimental design strategy.
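To make the sequential strategy concrete, the following MATLAB-style sketch shows one way to organize the loop of Figure 10.2. The helper functions estimateParams, designExperiments, runExperiments, and converged, as well as theta0, Y0, and maxIter, are hypothetical placeholders; only the control flow follows the procedure above.

```matlab
% Sketch of the sequential design/estimation loop of Figure 10.2. All helper
% functions and initial values here are hypothetical placeholders.
theta = theta0;                                    % initial parameter estimates
Y     = Y0;                                        % initial data (may be empty)
for iter = 1:maxIter
    if ~isempty(Y)
        theta = estimateParams(theta, Y);          % refit using all data collected so far
    end
    design = designExperiments(theta);             % reduced design, ranked by impact
    nImpl  = ceil(sqrt(numel(design)));            % implement ~sqrt(N) measurements per round
    Y      = [Y; runExperiments(design(1:nImpl))]; % append the new experimental data
    if converged(theta, Y)                         % model/data agreement and identifiability
        break
    end
end
```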
10.3 Data Acquisition, Anticipated Results, and Interpretation
We present here a detailed, step-by-step application of the experimental design procedure to a previously published model of Erythropoietin (Epo)-induced signal transducer and activator of transcription 5 (STAT5) signaling in BaF3-EpoR cells over a 60-minute time course [20]. This model is chosen because it is small (five parameters and four states), and thus model complexity does not convolute illustration of the experimental design procedure. In the subsequent Application Notes section (Section 10.4), we apply our procedure to a larger, more relevant model of TGF-β signal transduction. Overall, the goal of this case study is to illustrate how, by using the experimental design procedure, Swameye and coworkers could have performed fewer experiments while maintaining parameter identifiability. The STAT model differential equations are

$$
\begin{aligned}
\frac{dx_1}{dt} &= -k_1 x_1 \mathrm{EpoR_A} + 2 k_4 x_3(t-\tau) \\
\frac{dx_2}{dt} &= -k_2 x_2^2 + k_1 x_1 \mathrm{EpoR_A} \\
\frac{dx_3}{dt} &= -k_3 x_3 + 0.5\,k_2 x_2^2 \\
\frac{dx_4}{dt} &= -k_4 x_3(t-\tau) + k_3 x_3
\end{aligned}
\tag{10.33}
$$

where EpoRA, the amount of active Epo receptor, is the model input (experimentally determined in [20]); x1, x2, x3, and x4 are the model states, which correspond to different STAT5 species; and k1, k2, k3, k4, and τ are the unknown parameters to be estimated. The experimental observations of the system, which are made at the time points indicated in Table 10.2, are

$$
y_1 = k_5\,(x_2 + 2 x_3), \qquad y_2 = k_6\,(x_1 + x_2 + 2 x_3)
\tag{10.34}
$$

where y1 and y2 correspond to cytoplasmic tyrosine-phosphorylated STAT5 and total cytoplasmic STAT5, respectively, and k5 and k6 are nuisance parameters that are fixed prior to parameter estimation. These nuisance parameters are unit conversion factors that relate the arbitrary measurement units to a common unit for state concentrations in the model. We simulate the model in MATLAB using the delay differential equation solver "dde23," with initial conditions x1(0) = 1 and x2(0) = x3(0) = x4(0) = 0. The model input EpoRA is calculated by linear interpolation of the experimental data reported by Swameye and coworkers (see Table 10.2). For implementation, the input is simulated as an additional model state, with its time derivative equal to the slope of the linear interpolation.
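As a concrete illustration, the following MATLAB sketch simulates the model with "dde23" and evaluates (10.34) at the measurement times. For brevity, the EpoRA input is interpolated with "interp1" inside the right-hand side rather than carried as an extra state; the data vectors are those of Table 10.2.

```matlab
% Minimal dde23 simulation of the STAT5 model (10.33)-(10.34). The EpoRA input
% is linearly interpolated from the Table 10.2 data; as a simplification, it is
% evaluated directly rather than simulated as an additional state.
k   = [0.021 2.46 0.1066 0.10658];   % k1..k4
tau = 6.4;                           % delay (min)
k5  = 39;  k6 = 0.95;                % nuisance (scaling) parameters

tData   = [0 2 4 6 8 10 12 14 16 18 20 25 30 40 50 60];
epoData = [0.00 8.54 14.30 44.70 58.40 50.30 45.70 33.30 ...
           36.30 19.30 19.80 18.70 3.04 1.45 0.68 0.99];
EpoRA = @(t) interp1(tData, epoData, t, 'linear');

rhs = @(t, x, Z) [ -k(1)*x(1)*EpoRA(t) + 2*k(4)*Z(3);       % Z(3) = x3(t - tau)
                   -k(2)*x(2)^2        + k(1)*x(1)*EpoRA(t);
                   -k(3)*x(3)          + 0.5*k(2)*x(2)^2;
                   -k(4)*Z(3)          + k(3)*x(3) ];

sol = dde23(rhs, tau, [1; 0; 0; 0], [0 60]);   % constant history: x1(0)=1, others 0
x   = deval(sol, tData);                       % states at the measurement times

y1 = k5*(x(2,:) + 2*x(3,:));                   % simulated pSTAT, (10.34)
y2 = k6*(x(1,:) + x(2,:) + 2*x(3,:));          % simulated tSTAT
```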
Table 10.2 Experimental Data Reported by [20]

Time Point (min) | EpoRA | EpoRA Slope | pSTAT | tSTAT
0 | 0.00 | 4.27 | 1.08 | 1.00
2 | 8.54 | 2.88 | 10.01 | 0.93
4 | 14.30 | 15.20 | 24.80 | 0.79
6 | 44.70 | 6.85 | 27.40 | 0.78
8 | 58.40 | −4.05 | 26.50 | 0.70
10 | 50.30 | −2.30 | 23.40 | 0.65
12 | 45.70 | −6.20 | 21.70 | 0.59
14 | 33.30 | 1.50 | 22.10 | 0.59
16 | 36.30 | −8.50 | 24.20 | 0.64
18 | 19.30 | 0.25 | 22.10 | 0.64
20 | 19.80 | −0.22 | 23.00 | 0.69
25 | 18.70 | −3.13 | 22.50 | 0.69
30 | 3.04 | −0.16 | 23.20 | 0.76
40 | 1.45 | −0.08 | 14.40 | 0.81
50 | 0.68 | 0.03 | 8.67 | 0.92
60 | 0.99 | N/A | 7.96 | 0.97

The nuisance parameters k5 and k6 are assumed to be 39 and 0.95, respectively, while k1 = 0.021 min−1, k2 = 2.46 min−1 mol−1, k3 = 0.1066 min−1, k4 = 0.10658 min−1, and τ = 6.4 min, as reported by [20].
10.3.1 Step 1: Initial perturbation and measurement design
The first step in the experimental design process is to propose initial perturbation and measurement designs. In this example, we treat the experimental data set used by Swameye and coworkers as the initial perturbation and measurement design. Translating their experimental data into our nomenclature, for the initial perturbation design they selected a single ligand type (Epo), a single ligand input sequence (step input), and no pharmaceutical inhibitors or siRNA. For the initial measurement design, they chose two responses, cytoplasmic tyrosine phosphorylated STAT5 (pSTAT) and total cytoplasmic STAT5 (tSTAT), to be observed with the measurement technology of immunoblotting. Both of these responses are measured with a 2-minute frequency in the first 20 minutes after ligand stimulation, and subsequently at 25, 30, 40, 50, and 60 minutes.
10.3.2 Step 2: Identifiability analysis
The next step in the experimental design process is to perform identifiability analysis using the simulated initial perturbation and measurement design. The first step in performing identifiability analysis is to calculate the scaled parameter sensitivity matrix Z̃. To do this, we need estimates of the parameter sensitivities zij and the data covariance matrix VY. We assume that the data covariance matrix is diagonal, and based on the data reported in [20] we estimate that the variance for pSTAT is 4 and for tSTAT is 0.2 (both in arbitrary measurement units). To calculate the inverse of the matrix square root of VY, the functions "inv" and "sqrt" in MATLAB are used.
To estimate the parameter sensitivities, we perform six different time course simulations using the same EpoRA input function: one with the nominal parameter values and one for each parameter with its value increased by 1% of its nominal value. Finite forward differences (Δyi/ΔΘj) are used to calculate the parameter sensitivities, which are then scaled according to (10.13) to obtain Z̃. After calculating the scaled parameter sensitivity matrix, the next step in identifiability analysis is testing for structural identifiability, which involves evaluating the rank of Z̃. Using the function "rank" in MATLAB, we find that the rank of Z̃ is five. Thus, Z̃ is of full rank and the model is structurally identifiable. Since the model is structurally identifiable, we move to testing parameter identifiability. In their original study, Swameye et al. calculated 1σ parameter confidence intervals using likelihood contours to conclude that their estimated parameter values were identifiable (Table 10.3). To provide a fair comparison of the confidence intervals defined in (10.19) with those of Swameye et al., we used the parameter standard deviations as the confidence intervals, neglecting the first two terms on the RHS of (10.19) (the t distribution statistic and the residual norm). Setting the parameter tolerances κ̃ to 0.3 (±30%), our identifiability analysis also indicated that all parameters were identifiable (Table 10.3). Table 10.3 also indicates that there is reasonable agreement between our confidence intervals and those calculated by Swameye et al., despite their being calculated using different methods.
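The finite-difference and rank calculations described above can be sketched as follows. Here simulateY is a hypothetical helper that returns the stacked 32 × 1 vector of simulated pSTAT and tSTAT values at the Table 10.2 time points, and the scaling shown (rows weighted by measurement precision, columns by the nominal parameter values) is one common convention standing in for the chapter's (10.13).

```matlab
% Sketch: scaled sensitivity matrix and structural identifiability test.
% simulateY is a hypothetical helper; the scaling convention is an assumption.
theta0 = [0.021 2.46 0.1066 0.10658 6.4];          % k1, k2, k3, k4, tau
y0     = simulateY(theta0);                        % 32 x 1 nominal measurement vector
varY   = [4*ones(16,1); 0.2*ones(16,1)];           % variances: pSTAT then tSTAT

nP = numel(theta0);
Z  = zeros(numel(y0), nP);
for j = 1:nP
    thetaB    = theta0;
    thetaB(j) = 1.01*theta0(j);                    % 1% bump in parameter j
    Z(:,j)    = (simulateY(thetaB) - y0) / (0.01*theta0(j));   % forward difference
end
Ztilde = diag(1./sqrt(varY)) * Z * diag(theta0);   % scale rows by precision, columns by theta

fprintf('rank of Ztilde: %d of %d parameters\n', rank(Ztilde), nP);
```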
10.3.3 Step 3: Impact analysis
Since the initial design gives desirable parameter identifiability results, the next step in the experimental design procedure is impact analysis. Using the already calculated scaled sensitivity matrix Z̃, we use the MATLAB function "abs" to calculate the absolute sensitivity coefficients and "svd" for the singular value decomposition of Z̃ (required to calculate the net impacts and importance coefficients). Note that the "economy size" svd as described by the MATLAB help files is adequate for our purposes. Before performing rank-based impact analysis, it is informative to analyze how these different impact metrics vary for the different measurements and measurement time points [Figure 10.3(a–c)]. All metrics show a similar trend that y1, pSTAT, in general has more impact than y2, tSTAT, implying that pSTAT measurements have a much greater effect on parameter variances than tSTAT measurements. However, there are differences between these impact metrics in terms of measurement time points. Long time points (>40 minutes) have high absolute sensitivity coefficients and net impacts, while high importance coefficients are distributed more evenly between short, mid, and long-term measurement time points for different parameters.
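In code, the building blocks named above amount to two calls; the formulas that turn these factors into net impacts and importance coefficients are defined earlier in the chapter and are not repeated here.

```matlab
absSens   = abs(Ztilde);            % absolute sensitivity coefficients, one per measurement/parameter
[U, S, V] = svd(Ztilde, 'econ');    % economy-size SVD used to form the net impacts
                                    % and importance coefficients
```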
Table 10.3 Parameter Values and Identifiability for the Swameye STAT Model

Name | Nominal Value | Conf. Interval [20] | Conf. Interval (Initial Design) | Conf. Interval (Rank-Based Reduced Design) | Conf. Interval (Main Effects-Based Reduced Design)
k1 | 0.021 min−1 | +0.004/−0.003 | ±0.0023 | ±0.0030 | ±0.0023
k2 | 2.46 min−1 mol−1 | +1.7/−1.0 | ±0.3598 | ±0.4147 | ±0.3673
k3 | 0.1066 min−1 | +0.03/−0.022 | ±0.0176 | ±0.0196 | ±0.0183
k4 | 0.10658 min−1 | +0.0016/−0.0024 | ±0.022 | ±0.0267 | ±0.0242
τ | 6.4 min | +0.5/−2.6 | ±0.9757 | ±1.0468 | ±0.9825
Furthermore, Figure 10.3(a, b) shows that the impact rankings are parameter dependent: the time points having the highest impacts are different for each parameter. Thus, although these different impact metrics give similar information in terms of which responses have the highest impact, there are differences between them in terms of which time points have more impact. These time point impact differences are manifested in the results of rank-based impact analysis as shown in Figure 10.3(d), which depicts how the number of identifiable parameters (again to ±30%) depends on the number of measurements chosen using rankings based on the different impact metrics. Figure 10.3(d) reveals that using importance coefficients as the impact metric yields the most efficient design, with all five parameters being identifiable with only 11 of the original 32 measurements. Thus, including short time points in the design, as dictated by importance coefficients, results in improved parameter identifiability. While importance coefficients are clearly the best impact metric to use with this rank-based analysis, absolute sensitivity coefficient-based designs perform nearly as well as importance coefficient-based designs for a small number of measurements. Net impact-based designs are overall the worst performers, being only slightly better than choosing experiments randomly. However, net impacts and absolute sensitivity coefficients perform equally well for identifying all five of the parameter values, both needing 15 of the original 32 measurements. Although there are clear differences in parameter identifiability based on the impact metric used to reduce the design, experimental design based on any of these impact metrics could have saved Swameye et al. from making more than half of their measurements.
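A sketch of the rank-based selection behind Figure 10.3(d) is given below. The vector impact is assumed to hold one impact value per measurement (for example, importance coefficients), and the confidence-interval surrogate keeps only the standard-deviation term of (10.19), as described above.

```matlab
% Sketch of rank-based design reduction: add measurements in order of impact
% and count how many parameters meet the +/-30% tolerance. 'impact' and
% 'Ztilde' are assumed to be available from the impact analysis step.
kappa = 0.3;                                        % identifiability tolerance
[~, order] = sort(impact, 'descend');               % rank measurements by impact
nIdent = zeros(numel(order), 1);
for m = 1:numel(order)
    Zm = Ztilde(order(1:m), :);                     % keep only the top-m measurements
    if rank(Zm) < size(Zm, 2)
        continue                                    % not yet structurally identifiable
    end
    ci = sqrt(diag(inv(Zm'*Zm)));                   % std-only confidence interval surrogate
    nIdent(m) = sum(ci < kappa);                    % parameters identifiable to +/-30%
end
plot(1:numel(order), nIdent)
xlabel('# of measurements'), ylabel('# of identifiable parameters')
```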
Figure 10.3 Impact metrics and rank-based impact analysis for the Swameye STAT Model. (a) Absolute sensitivity coefficients. (b) Importance coefficients. (c) Net impacts. (d) The number of identifiable parameters versus the number of proposed measurements for rank-based experiment selection. For panels (a) and (c), note the different y-axis scales for y1 and y2.
Figure 10.4 Main effects-based impact analysis of the Swameye STAT model. (a, b) Mean absolute sensitivity coefficients. (c, d) Mean importance coefficients. (e, f) Mean net impacts.
The results of main effects-based impact analysis are shown in Figure 10.4 for completeness, although for this simple case study they do not give additional insight into the experimental design problem. Here, we observe similar trends to those described above: pSTAT generally has more impact than tSTAT [Figure 10.4(b, d, f)], absolute sensitivity coefficients and net impacts imply that long time points have more impact than short time points [Figure 10.4(a, e)], and importance coefficients imply that impact is distributed among all time points.
10.3.4 Step 4: Design reduction
Design reduction using rank-based analysis results is straightforward: choose the smallest number of measurements that allows all the model parameter values to be identified. Thus, based on Figure 10.3(d) we choose importance coefficients as the impact metric and select the top 11 ranked measurements. These 11 measurements are listed in Table 10.4 and, interestingly, do not include any tSTAT measurements; an entire class of measurements, tSTAT, is unnecessary and was eliminated using our procedure. Design reduction using main effects-based analysis results is more ad hoc, as it requires reduction based on interpretation of the data presented in Figure 10.4. Since, as pointed out above, pSTAT measurements have more impact than tSTAT measurements, it is reasonable to include only pSTAT measurements in the reduced design.
Table 10.4 Reduced Designs for the Swameye STAT Model

 | Rank-Based Design | Main Effects-Based Design
Measurements | pSTAT | pSTAT
Time Points | 4, 8, 10, 12, 14, 18, 20, 25, 30, 40, 60 minutes | 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 25, 30, 40, 50, 60 minutes
Furthermore, since the 40-, 50-, and 60-minute measurement time points have high net impacts and absolute sensitivity coefficients, these points should also be included in the reduced design. Importance coefficients imply that several different time points have large impact on different parameters; however, it is not clear which time points should be excluded, since they all have reasonably high impact for different parameters. Therefore, we include all time points in the main effects analysis-based reduced design (Table 10.4).
10.3.5 Step 5: Identifiability analysis
After proposing a reduced design, the next step is to perform identifiability analysis on the reduced design. As part of the rank-based analysis, we performed identifiability analysis with the different reduced designs and found that all five parameters are identifiable to ±30%. The confidence intervals are shown in Table 10.3 and, as can be seen, are quite close to the intervals obtained when considering all 32 measurements. Similar results are obtained with the main effects-based reduced design, with all five parameters being identifiable to ±30% with very small increases in confidence intervals relative to the initial design. As we will not implement any experiments in this example, this is the final step of the analysis. The results of this case study clearly show that experimental design could have saved a significant amount of experimental effort for Swameye and coworkers. For this small model, rank-based analysis using importance coefficients was the most effective experimental design strategy. However, this should not be taken as a generality, and it is strongly recommended that all options (rank-based and main effects-based analysis, importance coefficients, absolute sensitivity coefficients, and net impacts) be explored until this experimental design procedure has been applied to a wide variety of models and general trends are established.
10.4 Application Notes
In this section we illustrate the experimental design procedure by applying it to a practically relevant signal transduction modeling problem. The model, whose equations can be found in Supplementary Tables 10.1 and 10.2 at the end of this chapter and which is shown schematically in Figure 10.5, describes how the ligand transforming growth factor β (TGF-β) induces formation of nuclear Smad2-Smad4 protein complexes over an 8-hour time course [21]. The model contains 37 unknown parameters, for which we have preliminary, nominal values (Table 10.5). However, we do not have any preliminary experimental data.
Figure 10.5 Schematic diagram of the TGF-β-induced SMAD signaling model.
10.4.1 Step 1: Initial perturbation and measurement design
To construct the initial perturbation design, which is summarized in Table 10.6, we have to identify the factors and then set their levels. There is only one ligand, TGF-β, for which we consider pulse-chase, or rectangular pulse, input sequences consisting of a magnitude and a duration. For both magnitude and duration we consider three levels uniformly spaced across physiologically relevant scales. We consider siRNA knock-down of either Smad2 or Smad4, but not of the receptors, since receptor knock-down would lead to a trivial signaling response (no signaling). We do not consider any pharmacological inhibitors at the present time; however, it may be of interest to consider inhibiting nuclear export with Leptomycin B in a future round of experimental design. For the initial measurement design, we consider immunoblotting as the measurement technology. Based on commercial availability of antibodies we consider absolute measurements of nine feasible responses that provide reasonable coverage of the model states (Table 10.7). It is important to recognize how the measurements are related to model state variables through an observation function, which is typically not a "one-to-one" relationship, but a sum over several model states. We list these observation functions in the right-hand column of Table 10.7, and encourage the reader to look over these functions in detail. We choose 5 minutes as a sampling frequency upper bound. Although one can certainly decrease this sampling frequency upper bound, 5 minutes provides a very fine resolution over the 8-hour model time course and as such represents a reasonable compromise.
Table 10.5 Unknown Parameters in the TGF-β Signaling Model

Index | Parameter | Reaction Step | Value | Unit | δIDa | δIDb
1 | k1a | ligand binding | 6.60E−03 | molecule−1·min−1 | 0.0497 | 0.06
2 | k1d | dissociation | 2.98E−01 | min−1 | 0.1093 | 0.1616
3 | k2a | association (RI-RII*) | 6.60E−03 | molecule−1·min−1 | 0.0894 | 0.1039
4 | k2d | dissociation | 2.98E−01 | min−1 | 0.1315 | 0.1432
5 | k3int | internalization (Rc) | 3.95E−01 | min−1 | 0.0496 | 0.059
6 | k4a | association (Rc-S2) | 1.50E−04 | molecule−1·min−1 | 0.0695 | 0.0615
N/A | k4d | dissociation | 9.71E−01 | min−1 | 3087 | N/A
7 | k5cat | turnover (pS2) | 4.48E+04 | min−1 | 0.1676 | 0.2125
8 | k6a | association (pS2-S4) | 6.00E−03 | molecule−1·min−1 | 0.3413 | 0.3908
9 | k6d | dissociation | 1.46E+03 | min−1 | 0.3451 | 0.4031
10 | k7imp | nuclear import (pS2S4) | 8.10E−01 | min−1 | 0.1132 | 0.1618
11 | k8dp | dephosphorylation (pS2S4) | 2.52E−02 | min−1 | 0.0096 | 0.0334
12 | k9d | dissociation (S2-S4) | 1.01E−01 | min−1 | 0.01 | 0.0112
13 | k10imp | nuclear import (S2) | 1.62E−01 | min−1 | 0.1624 | 0.3125
14 | k10exp | nuclear export (S2) | 3.48E−01 | min−1 | 0.1583 | 0.3039
15 | k11imp | nuclear import (S4) | 2.01E−02 | min−1 | 0.0822 | 0.1803
16 | k11exp | nuclear export (S4) | 1.74E−01 | min−1 | 0.0813 | 0.183
17 | k12syn | protein synthesis (RII) | 8.00E+00 | molecule·min−1·cell−1 | 0.0165 | 0.0428
18 | k12deg | degradation (RII) | 2.80E−02 | min−1 | 0.0655 | 0.1127
19 | k13syn | protein synthesis (RI) | 8.00E+00 | molecule·min−1·cell−1 | 0.0172 | 0.0455
20 | k13deg | degradation (RI) | 2.80E−02 | min−1 | 0.0547 | 0.0893
21 | k14syn | protein synthesis (S2) | 2.74E+01 | molecule·min−1·cell−1 | 0.0151 | 0.0332
22 | k14deg | degradation (S2) | 6.46E−04 | min−1 | 0.0107 | 0.0272
23 | k15syn | protein synthesis (S4) | 5.00E+01 | molecule·min−1·cell−1 | 0.0158 | 0.0575
24 | k15deg | degradation (S4) | 1.20E−03 | min−1 | 0.0073 | 0.0286
25 | k16deg | constitutive deg (Rc) | 2.80E−02 | min−1 | 0.0667 | 0.0895
26 | k16lid | ligand-induced deg (Rc) | 3.95E−01 | min−1 | 0.538 | 0.5861
27 | k17imp | nuclear import (pS2) | 5.03E−01 | min−1 | 0.0182 | 0.0362
28 | k18a | association (pS2-S4) | 1.67E−04 | molecule−1·min−1 | 0.0344 | 0.0852
29 | k18d | dissociation | 9.09E−01 | min−1 | 0.0279 | 0.0412
30 | k19dp | dephosphorylation (pS2) | 2.52E−02 | min−1 | 0.0079 | 0.0223
31 | k20lid | ligand-induced deg (pS2) | 5.40E−03 | min−1 | 0.0214 | 0.0505
32 | k21int | internalization (RII) | 3.95E−01 | min−1 | 0.0705 | 0.1087
33 | k21rec | recycling (RII) | 3.95E−02 | min−1 | 0.0555 | 0.068
34 | k22int | internalization (RI) | 3.95E−01 | min−1 | 0.0728 | 0.1011
35 | k22rec | recycling (RI) | 3.95E−02 | min−1 | 0.0545 | 0.066
36 | k23rec | recycling (Rc) | 3.95E−02 | min−1 | 0.0028 | 0.0035

a Confidence intervals based on the initial design.
b Confidence intervals based on Design D.
Table 10.6 Initial Perturbation Design for the TGF-β Signaling Model

Factor | Level
Ligand type | TGF-β
Ligand input sequence | Magnitudes: 1, 5, 10 ng/mL; Durations: 1, 4, 8 hours
siRNA | None, Smad2, Smad4
Pharmaceutical inhibitors | None
Table 10.7 Considered Responses for the TGF-β Signaling Model (a)

Response (Abbreviation) | Observation Function
Nuclear Smad2 (Nuc. Smad2) | S2S4Nuc + pS2S4Nuc + S2Nuc + pS2Nuc
Cytoplasmic Smad2 (Cyt. Smad2) | S2Cyt + pS2Cyt + pS2S4Cyt
Phosphorylated Nuclear Smad2 (Nuc. pSmad2) | pS2S4Nuc + pS2Nuc
Phosphorylated Cytoplasmic Smad2 (Cyt. pSmad2) | pS2Cyt + pS2S4Cyt
Nuclear Smad4 (Nuc. Smad4) | S4Nuc + pS2S4Nuc + S2S4Nuc
Cytoplasmic Smad4 (Cyt. Smad4) | S4Cyt + pS2S4Cyt
Phosphorylated Smad2-Smad4 Complex (pSmad2-Smad4) | pS2S4Cyt + pS2S4Nuc
Total TGF-β Type 1 Receptor (R1) | R1 + RC + RCIn + RCIn-S2Cyt + R1In
Total TGF-β Type 2 Receptor (R2) | R2 + R2TGF + RC + RCIn + RCIn-S2Cyt + R2In

a Abbreviations correspond to the nomenclature shown in Supplementary Tables 10.1 and 10.2.
We note that this sampling frequency upper bound and/or the considered responses can be changed in the initial "Design Modification" step of the procedure if necessary. Using a factorial design gives 27 (3 magnitudes × 3 durations × 3 siRNA conditions) distinct input perturbation conditions. In response to all of these perturbations we simulate the nine responses at each of the 97 time points. This gives a total of 23,571 simulated measurements (27 × 97 × 9) in the initial design.
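The bookkeeping for this factorial design is straightforward to script; the sketch below simply enumerates the 27 perturbation conditions and counts the simulated measurements (variable names are illustrative only).

```matlab
% Sketch: enumerating the initial factorial perturbation design of Table 10.6.
doses      = [1 5 10];                    % TGF-beta magnitude (ng/mL)
durations  = [1 4 8];                     % pulse duration (hours)
siRNAs     = {'none', 'Smad2', 'Smad4'};  % knock-down conditions
tSample    = 0:5:480;                     % 5-minute sampling over 8 hours (97 points)
nResponses = 9;                           % responses listed in Table 10.7

[iD, iT, iS] = ndgrid(1:numel(doses), 1:numel(durations), 1:numel(siRNAs));
perturbations = [reshape(doses(iD), [], 1), ...
                 reshape(durations(iT), [], 1), ...
                 reshape(iS, [], 1)];     % 27 x 3 list of conditions (siRNA as index)

nMeas = size(perturbations, 1) * numel(tSample) * nResponses;   % 27*97*9 = 23,571
fprintf('%d perturbation conditions, %d simulated measurements\n', ...
        size(perturbations, 1), nMeas);
```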
10.4.2 Step 2: Identifiability analysis
The next step in the experimental design process is to perform identifiability analysis using the simulated initial perturbation and measurement design, which involves first calculating the scaled parameter sensitivity matrix Z̃. As in the STAT model example above, we calculate the parameter sensitivities using forward finite differences based on simulations with 1% bumps to each of the model parameters. Model trajectories are calculated using the MATLAB function "ode15s" for numerical integration of ODE systems. To simulate the effects of siRNA, we reduced the synthesis rate for the species being knocked down (k14syn for Smad2 and k15syn for Smad4) to 10% of its nominal value. We assume that the data covariance matrix VY is diagonal and that each measurement's standard deviation is 20% of its nominal value. Using these considerations to calculate the scaled parameter sensitivity matrix Z̃, we found that the rank of Z̃ is 37, and thus the model is structurally identifiable. Since this is the first experimental design for this model, we consider generous tolerances κ̃ of 0.6 (±60%), and find that one parameter, k4d, is not identifiable (Table 10.5). This parameter, which characterizes dissociation of the Active Receptor Dimer-Smad2 Complex, is not even close to being identifiable, with a confidence interval of approximately 3,000-fold the nominal value. We therefore fix this parameter at the nominal value rather than modifying the initial design, and proceed with the experimental design considering the other 36 parameters.
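A sketch of one such simulation is shown below. Here tgfbOdes, params, and x0 are hypothetical stand-ins for an implementation of the rate and balance equations in Supplementary Tables 10.1 and 10.2, its nominal parameter set, and its initial condition; the siRNA treatment and the 1% parameter bump follow the text.

```matlab
% Sketch: simulating one perturbation condition of the TGF-beta model with
% ode15s. tgfbOdes, params, and x0 are hypothetical placeholders.
p = params;
p.k14syn = 0.1 * p.k14syn;                 % Smad2 siRNA: synthesis reduced to 10%

dose = 1;  duration = 60;                  % 1 ng/mL TGF-beta for 1 hour (minutes)
ligand = @(t) dose * (t <= duration);      % rectangular pulse input

tSample = 0:5:480;                         % 8-hour time course, 5-minute sampling
[~, x] = ode15s(@(t, x) tgfbOdes(t, x, p, ligand(t)), tSample, x0);

% One 1% bump for the forward-difference sensitivity with respect to k1a:
pB = p;  pB.k1a = 1.01 * pB.k1a;
[~, xB] = ode15s(@(t, x) tgfbOdes(t, x, pB, ligand(t)), tSample, x0);
```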
10.4.3 Steps 3 to 5: Impact analysis, design reduction, and identifiability analysis
The overall goal of steps 3 to 5 is to reduce the initial design such that we retain identifiability of the 36 model parameters, but simultaneously arrive at a relatively inexpensive design. The results of this part of the procedure not only give a proposed reduced design for implementation, but also give insight into the pros and cons of (1) rank versus main effects analysis and (2) the three different impact metrics.
The results of rank-based impact analysis are shown in Figure 10.6, where the number of identifiable parameters is plotted versus the number of measurements included in reduced designs based on different impact metrics. Overall, most of the parameters can be identified with relatively few measurements (<500), but identifying all the parameter values requires a large number of measurements, regardless of the impact metric. In general, importance coefficient designs outperform absolute sensitivity coefficient and net impact designs; however, absolute sensitivity coefficient designs are slightly better for small (<200 measurements) or large (>2,000 measurements) designs.
Net impact designs perform surprisingly badly in all cases, with random measurement choice performing better for designs comprising fewer than ~1,000 measurements. Although based on this rank analysis one can choose a reduced design comprising approximately 5,000 measurements that allows for identification of the 36 unknown parameters [Figure 10.6(b)], it is also of interest to find an inexpensive design. In general, the two experimental characteristics that lead to a large experimental cost are a large number of input perturbations and high-frequency measurements. Table 10.8 shows how 500-measurement designs based on the different impact metrics perform in terms of these expensive design characteristics. Although net impact designs contain slightly fewer input perturbations and high-frequency measurements than importance or absolute sensitivity coefficient designs, we see that, regardless of the impact metric, these relatively small rank analysis designs contain several high-frequency measurements and input perturbations. Having fewer high-frequency measurements and input perturbations accounts for the poor performance of net impact designs. Choosing rank analysis reduced designs that allow for identification of all 36 model parameters (~5,000 measurements) will only lead to even more high-frequency measurements. Thus, one drawback of choosing reduced designs based on rank analysis is high experimental cost. Figures 10.7 through 10.9 show the results of main effects analysis using, as the impact metric, either net impacts (Figure 10.7), absolute sensitivity coefficients (Figure 10.8), or importance coefficients (Figure 10.9).
Figure 10.6 Rank-based impact analysis for the TGF-β model. Both panels (a) and (b) plot the number of identifiable parameters versus the number of measurements in a reduced design. (a) Behavior with a small number of measurements in the reduced design. (b) Behavior with a large number of measurements in the reduced design.
Table 10.8 Rank Analysis-Based Reduced Design Characteristics (a)

 | Absolute Sensitivity Coefficients | Net Impact | Importance Coefficients
Number of input perturbations | 27 of 27 | 24 of 27 | 27 of 27
Number of responses measured | 7 of 9 | 4 of 9 | 6 of 9
Number of responses measured with high frequency (b) | 4 of 9 | 2 of 9 | 5 of 9

a Based on a 500-measurement design.
b A high-frequency measurement is defined as having more than 15 time points.
For net impact, while low TGF-β dose [Figure 10.7(a)], short TGF-β duration [Figure 10.7(b)], and Smad4 siRNA [Figure 10.7(c)] have the largest main effects and therefore the highest impact, the main effects for the other perturbation conditions are nearly as large; for practical purposes, all of these perturbation conditions have essentially equivalent net impact.
Figure 10.7 Main effects of factors and responses based on net impact. (a) Ligand concentration. (b) Ligand stimulation duration. (c) siRNA. (d) Response type. (e) Measurement time point.
Figure 10.8 Main effects of factors and responses based on absolute sensitivity coefficients. Parameter numbers correspond to indices as shown in Table 10.5. (a) Ligand concentration. (b) Ligand stimulation duration. (c) siRNA. (d) Response type. (e) Measurement time point.
Different response characteristics, however, have drastically different impacts. Total type II receptor (R2) and cytoplasmic pSmad2 are clearly the highest-impact measurements [Figure 10.7(d)], and measurements at time zero and at times close to eight hours have much higher impact than measurements at short/mid-times [Figure 10.7(e)]. Comparing Figure 10.8 to Figure 10.9 shows an important advantage of using importance coefficients rather than absolute sensitivity coefficients for main effects analysis: since importance coefficients are all scaled between zero and one, it is much easier to visualize the impact trends. Thus, for the sake of the current analysis we use only the importance coefficient results, as they are much easier to interpret. However, we note that for this case study a careful analysis of the absolute sensitivity coefficients leads to conclusions similar to those discussed next. As opposed to the parameter-averaged impact that net impacts quantify, absolute sensitivity and importance coefficients give parameter-specific impacts, which in some instances differ from the general trends. Figure 10.9(a–c) shows that although, as observed before, there are no perturbation conditions with dominant impact, low TGF-β doses for short durations have significantly higher impact on parameters 14 to 19 than on other parameters. Thus, one might choose to perturb the system with a low TGF-β dose for a short duration.
Figure 10.9 Main Effects of factors and responses based on importance coefficients. Parameter numbers correspond to indices as shown in Table 10.5. (a) Ligand concentration. (b) Ligand stimulation duration. (c) siRNA. (d) Response type. (e) Measurement time point.
Additionally, since there are no significant impact differences between the siRNA perturbation conditions, one might choose no siRNA, which is experimentally easier than using siRNA. Figure 10.9(d, e) shows that which response characteristics have the highest impact is parameter dependent. The high net impact of total type II receptor measurements is attributable to only a few parameters, while the high net impact of cytoplasmic pSmad2 is distributed among many parameters. Thus, one might choose to measure cytoplasmic pSmad2 before total type II receptor despite the fact that total type II receptor has greater net impact. Each response has particular parameters on which it has high impact, even though it may have low net impact: for example, cytoplasmic Smad4 on parameter 9 and nuclear Smad4 on parameter 6. Therefore, based on these results it is difficult to say which responses should be excluded from the reduced design. In terms of time points, Figure 10.9(e) shows that for some parameters it is better to measure at short times, for some it is better to measure at long times, and for yet others it is good to measure at both short and long times. Thus, one might choose only short and long time point measurements, and exclude mid-time measurements.
Based on these main effects analysis results, one might propose the following reduced design: 1 ng/mL TGF-β for 1 hour, no siRNA, and all responses at 0, 5, 15, and 30 minutes and 7 and 8 hours (Design A; Table 10.9). Design A, while being simple to implement experimentally and comprising only 64 measurements, unfortunately yields only 13 of 36 identifiable parameters (Table 10.9). This result implies that considering only the main effects when reducing the design is not adequate; interactions between design characteristics are important for parameter identifiability. How can Reduced Design A be improved? To answer this question we draw insight from the rank analysis reduced designs, which all include a large number of input perturbations and high-frequency measurements. Thus, one might suspect that to improve Design A we need to include more input perturbations and/or high-frequency measurements. We therefore investigate two more designs, B and C, that include, respectively, a single input perturbation with all responses measured at high frequency, or all input perturbations with all responses measured at low frequency. Table 10.9 shows that, as expected, both of these designs lead to an increase in the number of identifiable parameters; however, not all of the parameters are identifiable for either design. Although Table 10.9 shows that Design C yields more identifiable parameters than does Design B, it also comprises approximately double the number of measurements. These results imply that both of these design features, a multitude of input perturbations and high-frequency measurements, are desirable, and perhaps essential, for parameter identifiability. This leads us to propose Design D, a hybrid of Designs B and C combining all input perturbations, low-frequency system-wide measurements, and high-frequency measurement of the high-impact response cytoplasmic pSmad2. Design D indeed yields 36 identifiable parameters, does so with just over 4,000 measurements, and the parameter confidence intervals are quite close to those of the initial design (Table 10.5). Although Design D comprises a large number of measurements, it does so with only one high-frequency measurement, and as such is much less expensive than a comparable rank analysis reduced design.
10.5 Discussion and Commentary
In this chapter we have presented an experimental design strategy for parameter identifiability which, as opposed to traditional and recently proposed methods, is computationally feasible to apply to large signal transduction models using currently available technology. To obtain this computational feasibility, there were naturally trade-offs with other desired design features: robustness and optimality.
Table 10.9 Main Effects Analysis-Based Reduced Designs

 | Initial Design | Reduced Design A | Reduced Design B | Reduced Design C | Reduced Design D
Number of identifiable parameters | 36 | 13 | 23 | 33 | 36
Number of measurements | 23,814 | 64 | 882 | 1,701 | 4,158

A: 1 ng/mL; 1 hour; no siRNA; all responses at 0, 5, 15, 30 minutes, 7 hours, 8 hours.
B: 1 ng/mL; 1 hour; no siRNA; all responses at all time points.
C: All perturbations; all responses at 0, 5, 15, 30 minutes, 7 hours, 8 hours.
D: All perturbations; all responses minus Cyt. pSmad2 at 0, 5, 15, 30 minutes, 7 hours, 8 hours; Cyt. pSmad2 at all time points.
By considering only local parameter identifiability, robustness of the design to parameter uncertainty was sacrificed, and since no optimization is carried out over the entire experimental design space, acceptance of suboptimal designs is possible. Although our methods do not produce robust or optimal designs, they do produce adequate and experimentally feasible designs, and thus represent an important practical solution to an otherwise computationally infeasible experimental design problem. Implementing the experimental design in a sequential manner (see Figure 10.2) dampens the potential impact that individual designs in the iterative process have on the final outcome, and as such addresses the robustness and optimality issues. Furthermore, it is important to note that although in this chapter we only consider local parameter identifiability with a particular type of confidence interval, there is a whole host of identifiability tests and methods, all of which are compatible with our proposed experimental design procedure [22, 23].

The parameter identifiability metrics rely on the assumption that the experimental errors are multivariate normally distributed. Although raw data from many quantitative biological experimental techniques are not normally distributed, this does not mean that our methods cannot be used with these data. Rather, modelers must be aware of whether their data are normally distributed and, if they are not, transform the data so that they are. Procedures for such data transformation are well known in microarray analysis [24], and may be adaptable to other forms of experimental data. Along these lines, recent work by Kreutz and coworkers provides an excellent treatment of this problem for immunoblotting data [25]. They show how raw immunoblotting data are not normally distributed, but provide a mixed-effects error model for transformation of the raw data into normally distributed data.

We proposed three different metrics for quantifying the impact of potential measurements: absolute sensitivity coefficients, net impacts, and importance coefficients. An important advantage of absolute sensitivity and importance coefficients is that they provide parameter-specific impact, which can differ from the overall impact that the net impact quantifies. This parameter-specific impact led to sensitivity and importance coefficient-based rank analysis designs having a greater number of identifiable parameters for a particular number of measurements than net impact-based rank analysis designs. However, the better performance of absolute sensitivity and importance coefficient designs came at higher experimental cost, since they included more high-frequency measurements. In most cases importance coefficient designs outperformed absolute sensitivity coefficient designs, most likely because importance coefficient designs by definition use an orthogonal basis for selecting experiments, which is a universally desirable feature. However, which impact metric is most desirable is situation dependent, and it is not clear whether the results from the case studies in this chapter are generally applicable to signal transduction models. As such, all of the impact metrics should be included in any experimental design analysis until such general understanding has been established.
We also proposed two different methods for impact analysis, rank-based analysis and main effects-based analysis. While rank-based analysis is straightforward to apply, designs based on such analysis result in high experimental cost, combining a multitude of input perturbations with high-frequency measurement of several responses. Although main effects analysis is more ad hoc and difficult to generalize, these designs were not as costly as rank analysis-based designs, since they included fewer input perturbations and high-frequency measurements.
Importantly, interactions between factors and responses had to be accounted for when reducing designs based on main effects. Main effects analysis also revealed that while the impact of perturbation characteristics for particular parameters tended to follow general impact trends, the impact of response characteristics (what to measure and when to measure it) was highly parameter dependent.

The TGF-β model case study showed that although approximately half of the model parameters can be identified with relatively little experimental effort, identification of all the model parameters requires a large amount of experimental data, much more than is typical for such modeling studies. Our results indicated that both high-frequency measurements and diverse combinations of input perturbations are important design features. One of the top designs in terms of the number of identifiable parameters per experimental cost, Design D from Table 10.9, combined a diverse set of input perturbations with system-wide low-frequency measurements and high-frequency measurement of the informative species cytoplasmic pSmad2. Unfortunately, such a design would be extremely laborious, if not impossible, to implement with conventional immunoblotting techniques. Alternatively, different measurement technologies are better suited to provide these data. Quantitative mass spectrometry is well suited to provide the low-frequency, system-wide measurements [26–28], and live-cell fluorescence is well suited to provide high-frequency measurements of particular proteins [29].
10.6 Summary Points

• Although the proposed experimental design procedure for parameter identifiability does not produce robust or optimal designs, it does produce adequate and experimentally feasible designs, representing a practical solution to an otherwise computationally impractical experimental design problem. Implementing the experimental design in a sequential manner helps to address the robustness and optimality issues.
• The experimental design methodology is compatible with any nonlinear ordinary differential equation model that does not have significant model-experiment mismatch errors.
• Structural identifiability should always be tested for first, and parameters that have no independent effects on the observables should be held constant. Any parameter identifiability test is compatible with the proposed experimental design procedure.
• Rank-based impact analysis is straightforward to apply, but designs based on rank analysis result in high experimental cost. Although main effects-based impact analysis is more ad hoc and difficult to generalize, these designs typically are not as costly as rank analysis-based designs.
• Only small subsets of proposed experimental designs should be implemented, and the experimental design procedure should be performed iteratively with parameter estimation steps in a sequential manner.
Acknowledgments
MRB acknowledges Erik Welf for numerous helpful discussions and critical reading of this chapter in various stages of its development, and Seung-Wook Chung for providing the TGF-β signaling model.
References
[1] Bard, Y., Nonlinear Parameter Estimation, New York: Academic Press, 1974.
[2] Atkinson, A., and A. Donev, (eds.), Optimum Experimental Designs, Oxford, U.K.: Clarendon Press, 1992.
[3] Draper, N., and W. Hunter, "Design of experiments for parameter estimation in multiresponse situations," Biometrika, Vol. 53, 1966, pp. 525–533.
[4] Asprey, S., and S. Macchietto, "Statistical tools for optimal dynamic model building," Computers and Chemical Engineering, Vol. 24, 2000, pp. 1261–1267.
[5] Asprey, S., and S. Macchietto, "Designing robust optimal dynamic experiments," Journal of Process Control, Vol. 12, 2002, pp. 545–556.
[6] Chen, B., S. Bermingham, A. Neumann, H. Kramer, and A. Asprey, "On the design of optimally informative experiments for dynamic crystallization process modeling," Ind. Eng. Chem. Res., Vol. 43, 2004, pp. 4889–4902.
[7] Gadkar, K.G., R. Gunawan, and F.J. Doyle, 3rd, "Iterative approach to model identification of biological networks," BMC Bioinformatics, Vol. 6, 2005, p. 155.
[8] Kutalik, Z., K.H. Cho, and O. Wolkenhauer, "Optimal sampling time selection for parameter estimation in dynamic pathway modeling," Biosystems, Vol. 75, 2004, pp. 43–55.
[9] Hill, W., W. Hunter, and D. Wichern, "A joint design criterion for the dual problem of model discrimination and parameter estimation," Technometrics, Vol. 10, 1968, pp. 145–160.
[10] Stewart, W., T. Henson, and G.E.P. Box, "Model discrimination and criticism with single-response data," AIChE Journal, Vol. 42, 1996, pp. 3055–3062.
[11] van Riel, N.A., "Dynamic modelling and analysis of biochemical networks: mechanism-based models and model-based experiments," Brief Bioinform., Vol. 7, 2006, pp. 364–374.
[12] Yue, H., et al., "Insights into the behaviour of systems biology models from dynamic sensitivity and identifiability analysis: a case study of an NF-kappaB signalling pathway," Mol. Biosyst., Vol. 2, 2006, pp. 640–649.
[13] Rodriguez-Fernandez, M., J.A. Egea, and J.R. Banga, "Novel metaheuristic for parameter estimation in nonlinear dynamic biological systems," BMC Bioinformatics, Vol. 7, 2006, p. 483.
[14] Box, G.E.P., and H.L. Lucas, "Design of Experiments in Non-Linear Situations," Biometrika, Vol. 46, 1959, pp. 77–90.
[15] Chou, I.C., H. Martens, and E.O. Voit, "Parameter estimation in biochemical systems models with alternating regression," Theor. Biol. Med. Model, Vol. 3, 2006, p. 25.
[16] Kuepfer, L., U. Sauer, and P.A. Parrilo, "Efficient classification of complete parameter regions based on semidefinite programming," BMC Bioinformatics, Vol. 8, 2007, p. 12.
[17] Matsubara, Y., S. Kikuchi, M. Sugimoto, and M. Tomita, "Parameter estimation for stiff equations of biosystems using radial basis function networks," BMC Bioinformatics, Vol. 7, 2006, p. 230.
[18] Moles, C.G., P. Mendes, and J.R. Banga, "Parameter estimation in biochemical pathways: a comparison of global optimization methods," Genome Res., Vol. 13, 2003, pp. 2467–2474.
[19] Tucker, W., Z. Kutalik, and V. Moulton, "Estimating parameters for generalized mass action models using constraint propagation," Math. Biosci., Vol. 208, 2007, pp. 607–620.
[20] Swameye, I., T.G. Muller, J. Timmer, O. Sandra, and U. Klingmuller, "Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by databased modeling," Proc. Natl. Acad. Sci. USA, Vol. 100, 2003, pp. 1028–1033.
[21] Chung, S.-W., et al., "Quantitative modeling and analysis of the transforming growth factor beta signaling pathway," Biophys. J., Vol. 96, No. 5, 2009, pp. 1733–1750.
[22] Antoniewicz, M.R., J.K. Kelleher, and G. Stephanopoulos, "Determination of confidence intervals of metabolic fluxes estimated from stable isotope measurements," Metab. Eng., Vol. 8, 2006, pp. 324–337.
[23] Hengl, S., C. Kreutz, J. Timmer, and T. Maiwald, "Data-based identifiability analysis of non-linear dynamical models," Bioinformatics, Vol. 23, 2007, pp. 2612–2618.
[24] Huang, S., and Y. Qu, "The loss in power when the test of differential expression is performed under a wrong scale," J. Comput. Biol., Vol. 13, 2006, pp. 786–797.
[25] Kreutz, C., et al., "An error model for protein quantification," Bioinformatics, Vol. 23, 2007, pp. 2747–2753.
[26] Blagoev, B., et al., "A proteomics strategy to elucidate functional protein-protein interactions applied to EGF signaling," Nat. Biotechnol., Vol. 21, 2003, pp. 315–318.
[27] Blagoev, B., S.E. Ong, I. Kratchmarova, and M. Mann, "Temporal analysis of phosphotyrosine-dependent signaling networks by quantitative proteomics," Nat. Biotechnol., Vol. 22, 2004, pp. 1139–1145.
[28] Dengjel, J., et al., "Quantitative proteomic assessment of very early cellular signaling events," Nat. Biotechnol., Vol. 25, 2007, pp. 566–568.
[29] Fujioka, A., et al., "Dynamics of the Ras/ERK MAPK cascade as monitored by fluorescent probes," J. Biol. Chem., Vol. 281, 2006, pp. 8917–8926.
Supplementary Table 10.1 Rate Equations for the TGF-β Model

Index | Rate Equation
1 | v1 = k1a[TGFβ][RII] − k1d[TGFβ:RII]
2 | v2 = k2a[TGFβ:RII][RI] − k2d[RC]
3 | v3 = k3int[RC]
4 | v4 = k4a[RCin][S2cyt] − k4d[RCin:S2cyt]
5 | v5 = k5cat[RCin:S2cyt]
6 | v6 = k6a[pS2cyt][S4cyt] − k6d[pS2S4cyt]
7 | v7 = k7imp[pS2S4cyt]
8 | v8 = k8dp[pS2S4nuc]
9 | v9 = k9d[S2S4nuc]
10 | v10 = k10imp[S2cyt] − k10exp[S2nuc]
11 | v11 = k11imp[S4cyt] − k11exp[S4nuc]
12 | v12 = k12syn − k12deg[RII]
13 | v13 = k13syn − k13deg[RI]
14 | v14 = k14syn − k14deg[S2cyt]
15 | v15 = k15syn − k15deg[S4cyt]
16 | v16 = (k16deg + k16lid)[RC]
17 | v17 = k17imp[pS2cyt]
18 | v18 = k18a[pS2nuc][S4nuc] − k18d[pS2S4nuc]
19 | v19 = k19dp[pS2nuc]
20 | v20 = k20lid[pS2nuc]
21 | v21 = k21int[RII] − k21rec[RIIin]
22 | v22 = k22int[RI] − k22rec[RIin]
23 | v23 = k23rec[RCin]

Supplementary Table 10.2 Differential Equations for the TGF-β Model

d[RII]/dt = −v1 + v12 − v21 + v23
d[RI]/dt = −v2 + v13 − v22 + v23
d[RCin]/dt = v3 − v4 + v5 − v23
d[S2cyt]/dt = −v4 − v10 + v14
d[pS2S4cyt]/dt = v6 − v7
d[S2S4nuc]/dt = v8 − v9
d[S4nuc]/dt = v9 + v11 − v18
d[pS2nuc]/dt = v17 − v18 − v19 − v20
d[RIin]/dt = v22
d[TGFβ:RII]/dt = v1 − v2
d[RC]/dt = v2 − v3 − v16
d[RCin:S2cyt]/dt = v4 − v5
d[pS2cyt]/dt = v5 − v6 − v17
d[pS2S4nuc]/dt = v7 − v8 + v18
d[S2nuc]/dt = v9 + v10 + v19
d[S4cyt]/dt = −v6 − v11 + v15
d[RIIin]/dt = v21
CHAPTER 11
Parameter Identification with Adaptive Sparse Grid-Based Optimization for Models of Cellular Processes

Maia M. Donahue,1 Gregery T. Buzzard,2 and Ann E. Rundell1
1 Weldon School of Biomedical Engineering, Purdue University
2 Department of Mathematics, Purdue University
Abstract
Identifying parameter values in mathematical models of cellular processes is crucial in order to ascertain if the hypotheses reflected in the model structure are consistent with the available experimental data. Due to the uncertainty in the parameter values, partially attributed to the necessary model abstraction of any cellular process, parameters are pragmatically estimated by varying their values to minimize a cost function that represents the difference between the simulated results and available experimental data. Local searches for these parameter values rarely result in an adequate fit of the model to the data since the optimization gets caught in a local minimum near the initial guess. Typically, larger regions of the parameter space must be searched for acceptable parameter values. Most of the global algorithms use stochastic sampling of the parameter space; however, these methods are not computationally efficient and cannot guarantee convergence. Alternatively, adaptive sparse grid-based optimization samples the parameter space in a more systematic manner and employs selective evaluations of the cost function at support nodes to build an error-controlled interpolated approximation of the cost function from basis functions.
Key terms: Global optimization, Parameter estimation, MAPK, Systems biology, Genetic algorithm
11.1 Introduction
Increasingly, mathematical models are being used to provide insight into cellular processes [1, 2]. The construction of these models is hampered by the sheer number of participating chemical species, the uncertainty and complexity of the interconnected signaling networks, and the complicated regulation of the genetic events within a living cell. Out of necessity, the model structure must explicitly represent only the dominant events and processes for a specific application. Determining the dominant events and processes a priori is usually not trivial; hence, the first step of determining if the model structure is suitable for the specific application typically depends on finding model parameters that produce simulations consistent with available experimental data and observations.

Most model parameter values are not known accurately, due to both experimental issues and omission of nonessential process details in the model. Experimentally, it is difficult to measure the concentrations, rates, and diffusion of elements within a living intact cell [3–5]. Enzyme-substrate association constants and kinase activity rates can sometimes be determined in a test tube, but there is no guarantee that these rates are the same inside a crowded cellular environment [6]. Furthermore, the inevitable abstraction of the process being modeled causes the majority of model parameters to incorporate the net effect of a multitude of events. As a result, parameter values typically can be determined only through optimization that minimizes the difference between the simulated model output and the experimental data. This process is straightforward for linear models through linear programming. However, most optimization tools are challenged by these models since they can be highly nonlinear.

In rare cases where very good estimates of parameter values are available, a local search can be adequate to find parameter values that minimize the differences between model simulations and experimental data, which is quantified as the value of a cost function. A local search starts from an initial point and finds the direction that allows the largest decrease in model/data mismatch. The search ends at the nearest minimum in that direction, as shown in Figure 11.1. As Figure 11.1 demonstrates, the result of the local search is dependent on the initial point. Two of the three example points are caught in the nearest local minimum, a consequence termed the local minimum trap. If possible, a local search will modify the parameter values from the initial set to improve fitting; however, it cannot move out of the local minimum trap to find a global minimum.

In contrast, global searches consider the entire parameter space when locating the global minimum. Many global optimization methods can be used to solve the parameter identification problem, but these can also suffer from the local minimum trap as well as from poor convergence rates [7], and are computationally costly. For a subset of systems, the problem can be solved using deterministic optimization methods that transform the problem into a convex function or the difference between convex functions [8]. In these systems, achieving the global minimum can be guaranteed [7, 8]. As the applicability of these deterministic strategies is limited for complex, nonlinear models, many researchers resort to global optimization techniques that sample the parameter space in a stochastic manner.
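For reference, a least-squares cost function and a local search take only a few lines in MATLAB. Here simulateModel, tData, yData, and the initial guess are hypothetical placeholders; whether the Nelder-Mead search in fminsearch reaches the global minimum depends entirely on the starting point, as illustrated by points A, B, and C in Figure 11.1.

```matlab
% Sketch: sum-of-squared-residuals cost function and a local (Nelder-Mead) search.
% simulateModel, tData, and yData are hypothetical placeholders.
costFun = @(theta) sum((simulateModel(theta, tData) - yData).^2);

theta0   = [0.1 2.0 0.05];                % initial guess (illustrative values)
thetaFit = fminsearch(costFun, theta0);   % local search; may stop in a local minimum
```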
Existing popular stochastic methods include simulated annealing, genetic algorithms (GA), and multiple shooting strategies. The GA, for example, uses evolution-based strategies to modify a population of parameter sets, with a higher probability of keeping sets with high fitness (low cost function values) than those with low fitness; the retained parameter sets become parent sets that are randomly combined to create children sets for evaluation in the next iteration [9].
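A minimal GA sketch is shown below (this is not the MATLAB Global Optimization Toolbox). For brevity it keeps the fitter half of the population deterministically rather than probabilistically, then fills the other half with uniform-crossover, Gaussian-mutation offspring; costFun, lb, and ub (1 × d bound vectors) are assumed, and implicit expansion (MATLAB R2016b or later) is used.

```matlab
% Minimal genetic-algorithm sketch; costFun, lb, and ub are assumed inputs.
nPop = 40;  nGen = 60;  d = numel(lb);  sigma = 0.05*(ub - lb);
pop = lb + rand(nPop, d).*(ub - lb);                 % random initial population
for gen = 1:nGen
    cost = zeros(nPop, 1);
    for i = 1:nPop, cost(i) = costFun(pop(i,:)); end
    [~, order] = sort(cost);  pop = pop(order, :);   % low cost = high fitness
    parents = pop(1:nPop/2, :);                      % truncation selection
    kids  = parents(randi(nPop/2, nPop/2, 1), :);    % pick random parent pairs
    mates = parents(randi(nPop/2, nPop/2, 1), :);
    mask  = rand(nPop/2, d) < 0.5;                   % uniform crossover
    kids(mask) = mates(mask);
    kids = kids + sigma.*randn(nPop/2, d);           % Gaussian mutation
    kids = min(max(kids, lb), ub);                   % respect the bounds
    pop  = [parents; kids];
end
cost = zeros(nPop, 1);
for i = 1:nPop, cost(i) = costFun(pop(i,:)); end
[~, iBest] = min(cost);  bestTheta = pop(iBest, :);  % best parameter set found
```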
Figure 11.1 This plot illustrates the concept of a local minimum trap for a one-dimensional parameter space. A local search of this space is initialized at three different points: A, B, and C. Searches starting at points A and B will find a local minimum of the cost function, while a search starting at point C results in the global minimum.
Due to the probabilistic nature of these stochastic methods, the parameter set values and corresponding cost function value can vary considerably from run to run. In addition, these stochastic global optimization methods are computationally expensive and do not necessarily converge to a solution; hence, it can take a long time to discover that no solution exists.

Alternatively, the entire parameter space can be searched using a grid algorithm. Grid algorithms divide the parameter space in a patterned manner and evaluate the cost function at each grid point; see Figure 11.2. Local or global searches can be initiated from one or more of the best grid points to find acceptable parameter estimates. As the dimension of the uncertain parameter space increases, the number of model evaluations required to cover the entire space increases exponentially for optimization with an evenly spaced (full) or pseudo-randomly spaced (Latin hypercube sampling, LHS) grid [11]. Randomly spaced samples are not recommended due to inefficient clustering in some areas and inadequate sampling in others. However, adaptive sparse grid schemes avoid this exponential increase in points by selective positioning of support nodes.
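The sketch below builds a basic Latin hypercube sample by stratified permutation (the Statistics Toolbox function lhsdesign could be used instead), evaluates the cost at every grid point, and refines the best point with a local search; costFun, lb, and ub are assumed.

```matlab
% Sketch: Latin hypercube sampling of the parameter space plus local refinement.
nSamples = 200;  d = numel(lb);
U = zeros(nSamples, d);
for kDim = 1:d
    U(:,kDim) = (randperm(nSamples)' - rand(nSamples, 1)) / nSamples;  % one stratum per sample
end
grid = lb + U.*(ub - lb);                            % scale to the parameter bounds

cost = zeros(nSamples, 1);
for i = 1:nSamples, cost(i) = costFun(grid(i,:)); end
[~, iBest] = min(cost);
thetaFit = fminsearch(costFun, grid(iBest, :));      % local search from the best grid point
```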
11.1.1 Adaptive sparse grid interpolation
Recently, sparse grid interpolation approaches have been developed that support deterministic global optimization for the minimization of functions with bounded mixed derivatives [12]. These methods are currently being refined for efficiently solving problems of larger dimension, with more than 10 uncertain parameters [13].
Figure 11.2 Examples of grids sampling a two-dimensional parameter space: (a) latin hypercube sampling; (b) full uniform grid; and (c) Chebyshev sparse grid, generated with the Sparse Grid toolbox [10].
Sparse grid interpolation techniques were originally developed to reduce the computational cost of evaluating multivariate integrals [11, 14, 15]; a thorough review of sparse grid-based interpolation and integration is given in [16]. Adaptive sparse grid-based optimization utilizes the error-controlled interpolant as a surrogate of the cost function to search for the minimum. The process for optimizing with sparse grids requires model evaluations at selected grid points (support nodes) strategically positioned within the uncertain parameter space. An interpolated function is created by combining basis functions at the support nodes to approximate the cost function over the entire uncertain parameter space. The search for the best parameter values is performed on this interpolated function; a search along a polynomial-based interpolation function is typically significantly faster than a search involving repeated numerical integrations of the model (see Figure 11.3). Using sparse grid interpolation, the number of actual model evaluations is limited to just the number of support nodes. However, since the sparse grid technique relies on an approximation, the best results are usually obtained when a local search using the actual model is performed starting from the optimal values identified by the interpolation function. This adaptive sparse grid-based optimization approach and its computational efficiency rely heavily upon the construction of the interpolating function and the selection of the support nodes. In brief, the construction of the interpolating function is based upon tensor products of univariate interpolations of the function $f$ at the support nodes $x^{i_k} \in X^{i_k}$ for $k \in [1, d]$, with basis functions $a_{x^{i_k}}$:
\left(U^{i_1} \otimes \cdots \otimes U^{i_d}\right)(f) = \sum_{x^{i_1} \in X^{i_1}} \cdots \sum_{x^{i_d} \in X^{i_d}} \left(a_{x^{i_1}} \otimes \cdots \otimes a_{x^{i_d}}\right) \cdot f\left(x^{i_1}, \ldots, x^{i_d}\right)
In the sparse grid approach introduced by Smolyak, the interpolating function is obtained by summing a selected set of such tensor products. The computational efficiency of this method results both from the fact that this selected sum requires a relatively small set of support nodes and from the fact that the sets of support nodes are nested as the interpolation depth increases. This nesting property greatly reduces the number of required function evaluations by reusing the support nodes upon increased sampling refinement of the grid for a higher interpolation depth.
Figure 11.3 Comparison of meshes created by actual cost function evaluations and from an interpolated cost function for a two-parameter search of a MAPK model [17]. (a) This mapping of the cost function was created from a 100 × 100 evenly spaced grid of parameter sets, for a total of 10,000 model evaluations. A local search on the best mapping point returned the actual parameter values, with an additional 18 model evaluations. (b) The 53 adaptive sparse support nodes used to create the interpolated function, generated by the Sparse Grid toolbox [10]. (c) An evenly spaced 100 × 100 grid of parameter sets was created and evaluated by the interpolated function, creating an identical mapping to (a) that only required the 53 model evaluations used to create the support nodes. A local search from the best support node took an additional 66 model evaluations to return the actual parameter values.
The interpolation depth is the degree of the polynomial, k, which the univariate interpolation function can exactly match. It has been shown that the error of the interpolating function depends strongly upon the degree of the bounded mixed derivative (smoothness) and only weakly on the dimension of the problem, $O(N^{-k}(\log N)^{(k+1)(d-1)})$, where N is the number of function evaluations performed on the sparse grid at the support nodes [15]. Hence, these methods are considered nearly optimal (up to a logarithmic factor) [15] and are significantly better than quasi-Monte Carlo algorithms, $O(N^{-1}(\log N)^{d})$ [18]. A uniform sparse grid cannot avoid a logarithmic dependence of the error on the dimension; however, adaptive sparse grids sample most along the dimensions of greatest importance, as ascertained by the ability of samples in a given direction to decrease the estimated interpolation error (Figure 11.4) [18]. This "problem-adjusted refinement" [19] most effectively reduces the computational cost of optimization for models whose roughness is confined to a subset of the dimensions of the uncertain space, and it does no worse than the uniform sparse grid methods. The adaptive sparse grid-based optimization method is deterministic, so the numerical values of the identified parameters and the quality of the results will not differ from one run to the next. Furthermore, it is anticipated that the quality of the results will improve with an increased sample size, since the error of the interpolant approximation of the cost function mapping on the parameter space decreases with large N.
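As an illustrative aside (not code from the Sparse Grid toolbox), the sketch below generates the nested one-dimensional Chebyshev-extrema (Clenshaw-Curtis) node sets on [0, 1] that many sparse grid constructions build upon and verifies that each refinement level reuses all nodes of the previous one; the node formula is the standard one and is stated here as an assumption about any particular toolbox implementation.

% One-dimensional node sets X^i on [0,1]: a single midpoint at level 1 and
% m_i = 2^(i-1) + 1 Chebyshev extrema at level i > 1. Because each set
% contains the previous one, refining the interpolation depth reuses every
% support node that has already been evaluated.
levels = 4;
x = cell(1, levels);
x{1} = 0.5;
for i = 2:levels
    m    = 2^(i - 1) + 1;
    j    = 1:m;
    x{i} = 0.5 * (1 - cos(pi * (j - 1) / (m - 1)));
end
for i = 2:levels
    % every node of level i-1 reappears at level i (up to round-off)
    assert(all(ismembertol(x{i - 1}, x{i}, 1e-12)));
end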
11.2 Experimental Design

The method described in this chapter determines, in a relatively efficient manner, the optimal parameter values to fit a model to all available experimental data. A number of factors must be considered to formulate the problem as a parameter identification experiment utilizing sparse grid-based interpolation. These factors involve ascertaining the dimension of the uncertain parameter space, the size of the uncertain parameter space, the form of the cost function, and the selection of the basis functions for the interpolating function. The dimension of the uncertain parameter space must be determined; that is, the number of parameter values to be found needs to be established.
Figure 11.4 Examples of two-dimensional adaptive Chebyshev sparse grids, with increasing degree of adaptivity from left to right (0%, 60%, and 99%), generated with the Sparse Grid toolbox [10]. This figure demonstrates that the parameter along the x-axis is more important for decreasing the interpolation error than the parameter along the y-axis.
It is desirable to keep this dimension as small as possible; initially assign the parameter values for which reasonable estimates exist. For instance, total numbers of molecules or concentrations of certain elements may be experimentally established or obtained from other published models. In cases where there are too many parameters for which there is no good estimate of their values, the dimension of the problem can be reduced by conducting a local or global sensitivity analysis [20] about some initial starting guess to ascertain which parameters should be targeted for fitting the data [see the troubleshooting section (Section 11.6)] [21, 22]. The initial starting guess values can be roughly estimated by back-of-the-envelope calculations or obtained from published models of similar reactions or processes. Parameters with the lowest sensitivity ranks can be neglected and fixed at these initial guesses.

For each remaining parameter, which will be labeled as "uncertain," an estimated initial search range must be provided. The product of these search ranges defines the span of the uncertain parameter space. In our experience, a search range spanning an order of magnitude below and above the initial guess is, in most cases, large enough. The search for the potential parameter values should typically be conducted on the log of the uncertain parameter space to spread the support nodes more evenly over ranges that vary by orders of magnitude.

For parameter identification, the optimization problem typically minimizes a cost function that penalizes differences between the model simulations and the experimental data. The most commonly used cost function, when quantitative experimental data is available, is the weighted least-squares error:

F(p) = \log\left( \sum_{j=1}^{q} \sum_{i=1}^{n_j} w_{ij} \left[ y_j(p, t_i) - \hat{y}_{j,i} \right]^2 \right)   (11.1)
where q is the number of states with experimental data, $n_j$ is the number of experimental time points for state j, $\hat{y}_{j,i}$ is the datum for state j at time point i, $y_j(p, t_i)$ is the simulated model output for state j at time $t_i$ for parameter set p, and $w_{ij}$ is the weight for that point. For this construction of the cost function, it is important that the simulated output, $y_j(p, t_i)$, be converted into the same units as the experimental data (i.e., numbers of molecules, concentration, percent of total, and so forth). The weights are used to normalize the data and/or to incorporate confidence information about the data points: the confidence in the experimental data is typically taken into account by making the weights the reciprocal of the standard deviation of the experimental data at each time point and state, while normalizing by the maximum value of the data or simulations for each state is typical when the values of the states differ significantly in magnitude. When quantitative data is not available, a qualitative cost function can be constructed that, on a smooth scale, penalizes or rewards attributes of the simulations. It is important that the cost function be continuous in order for the interpolating function to approximate it accurately without large numbers of support nodes. Abrupt jumps, caused, for instance, by if-else statements, will severely increase the interpolation error, because the cost function is interpolated with continuous basis functions. It is also recommended to search over log space; taking the log of the cost function will increase its smoothness.
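As one hypothetical realization of (11.1), the sketch below implements a cost-function m-file for an ODE model. The model file mapk_ode, the observation indices, and the data and weight matrices are placeholders for the user's own problem; the only requirements carried over from the text are that the simulated output and the data share the same units and that the log of the weighted squared error is returned.

function F = costfun(p, tdata, ydata, w, obs_idx, y0)
% Weighted least-squares cost of (11.1), returned on a log scale.
%   p       - vector of uncertain parameter values
%   tdata   - experimental time points (vector of length nj)
%   ydata   - nj x q matrix of measurements (one column per measured state)
%   w       - nj x q matrix of weights (e.g., reciprocal standard deviations)
%   obs_idx - indices of the model states that were measured
%   y0      - initial condition of the model
sol  = ode45(@(t, y) mapk_ode(t, y, p), [0, max(tdata)], y0);   % hypothetical model file
ysim = deval(sol, tdata)';                  % simulated states at the data times
err  = ysim(:, obs_idx) - ydata;            % model/data mismatch (same units assumed)
F    = log(sum(sum(w .* err.^2)));          % log of the weighted squared error
end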
A wide variety of basis functions can be used to support the construction of the interpolation function on the sparse grid, including piecewise linear functions, Chebyshev polynomials, polynomial chaos [23], and multiwavelet formulations [24]. Though the choice of basis function changes the placement of the support nodes, the construction of the interpolant is the same. Barthelmann et al. compared the two most popular choices, piecewise linear and Chebyshev polynomial interpolation [15], and concluded from theory and computation that if the function to be approximated is three (or more) times differentiable, then polynomial basis functions are better in the sense that the interpolation converges to the correct answer more quickly as the number of support nodes increases. If the function to be interpolated is discontinuous, then convergence is slow for both. In general, the authors of [15] recommend Chebyshev polynomial interpolation; this chapter is therefore written under the assumption that Chebyshev interpolation will be used.
11.3 Materials

The materials needed to apply this adaptive sparse grid-based optimization method for model parameter identification from available experimental data are described in Table 11.1. The specific implementation discussed in this chapter requires MATLAB and the Sparse Grid toolbox (http://www.ians.uni-stuttgart.de/spinterp) [10]; however, the method can be implemented with alternative coding packages, such as C++. The required materials are therefore described generically, with specifics provided in parentheses. For the examples in this chapter, a published four-state ordinary differential equation (ODE) model [17] of the mitogen-activated protein kinase (MAPK) cascade was used. For these illustrations, mock experimental data was generated by model simulation; the mock data consisted of seven time points of the simulations for two of the four states.
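The mock data set can be produced along these lines (a sketch only; mapk_ode, the nominal parameter vector p_true, the initial condition y0, and the assumed ordering of the observed states are placeholders standing in for the published model of [17]):

% Simulate the model with the nominal parameters and keep seven time points
% for two of the four states as mock "experimental" data.
tdata   = linspace(0, 25, 7);          % seven sample times over 0-25 minutes
obs_idx = [1 2];                       % indices of the two observed states (assumed)
sol     = ode45(@(t, y) mapk_ode(t, y, p_true), [0, 25], y0);
ydata   = deval(sol, tdata)';          % all states at the sample times
ydata   = ydata(:, obs_idx);           % keep only the "measured" states
w       = ones(size(ydata));           % uniform weights for the mock data
save mockdata tdata ydata obs_idx w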
Table 11.1 Materials Needed for Optimization with Adaptive Sparse Grid Interpolation

Hardware: Computer capable of running the preferred model simulation software (MATLAB).

Software: Simulation software (MATLAB). Adaptive sparse grid algorithm (the Sparse Grid toolbox [10], installed, initialized, and modified slightly to store the best grid point for future use: the function spcgsearch was modified by inserting the code
    pb = x; cf_pb = fprev; save pb pb cf_pb
at line 106, which is immediately after the lines
    % Determine optimization start point
    [x, fval] = spgetstartpoint(z, xbox, options);
    fprev = fval;
). A local search algorithm such as the conjugate gradient method (fmincon).

Model code: Model and cost function files written in the preferred software format (m-files). These files should output the cost function value for the model evaluated at a given set of uncertain parameter values.

Data: Experimental data to fit the uncertain model parameters. A local and global sensitivity analysis can help ascertain whether the available data are sufficient to identify the uncertain model parameters [21, 22]. The data must be accessible by the preferred software (.mat file).

Specifics are provided in parentheses.
11.4 Methods

The general method for parameter identification with sparse grid-based optimization is outlined below, with example code provided in Figure 11.5 using MATLAB and the Sparse Grid toolbox [10]; a code sketch of the same workflow is also given after step 10.

General Procedure

1. The objective of step 1 is to specify the range over which the search will be conducted for the uncertain parameter value in each dimension. Create a matrix containing the lower bound and upper bound for each uncertain parameter (typically, an order of magnitude lower and an order of magnitude higher than the initial point, respectively). Specific: the matrix should have size d × 2, where d is the dimension, that is, the number of uncertain parameters.

2. The objective of step 2 is to select the basis function type and establish the desired grid size and type. As the support node locations are a function of the basis functions used to create the interpolating function, the basis function type must be indicated. To constrain the computational time, we recommend setting the maximum number of grid points to 50 to 500 times the number of uncertain parameters, depending on how long the model simulations take. One could instead specify a minimum interpolation error to achieve, but this can require a significant number of model evaluations, which is unknown a priori. We highly recommend enabling dimension adaptivity, since the computational effort required is no worse than that for a uniform grid, but for some models it can be significantly more efficient. The degree of adaptivity can be modified to ensure a moderate coverage of the uncertain parameter space if desired. Specific: use spset to set these options.

3. The objective of this step is to evaluate the cost function value at each support node and to use these values to create the interpolating function. This requires an iterative solution that adds and locates support nodes in the sparse grid so as to continuously improve the accuracy of the interpolating function until the maximum number of grid points has been reached or the minimum relative or absolute error tolerance has been achieved. In addition, sort the grid points by cost function value (low to high) and determine the number of unique points per parameter. The sorted grid points and number of unique points can be used for further analysis. Specific: use spvals to construct the grid and the interpolating function from the basis functions, and use the sort function to sort the grid points.

4. The objective of step 4 is to use the interpolated function from the previous step to estimate the "optimal" parameter set. A search is performed on the interpolated function, which serves as a surrogate for the cost function, to find the parameter values that minimize the interpolated estimate of the cost function. Specific: use the appropriate search function for the basis functions selected: spcgsearch for Chebyshev polynomials. This algorithm will find the grid point with the lowest cost function value and run a local search on the interpolated function about that point. The result of this search is denoted pi, with an interpolated cost function value of cfpi.

5. The objective of this step is to refine the sparse grid interpolated "optimal" parameter set, pi, by searching for the nearest minimum of the actual cost function about the "optimal" point found in step 4. A local search using the original cost function and model is performed about pi. This will result in a candidate parameter set denoted pil, with a cost function value of cfpil. Specific: run fminunc from pi, calling the cost function file.
Figure 11.5 Example code for implementing optimization with adaptive sparse grid interpolation using the Sparse Grid toolbox [10].
6. The objective of step 6 is to identify an alternative candidate for the "optimal" parameter set by starting from the support node with the lowest cost function value, denoted pb, with a cost function value of cfpb. Load the data file containing the best grid point and, if the point differs from the returned optimal point, run a local search using the original cost function about pb. This will result in a candidate parameter set denoted pbl, with a cost function value of cfpbl. Specific: load the data file pb.mat and run fminunc from the point pb, calling the cost function file.

7. The parameter set from these two local searches with the lowest cost function value is considered the optimal parameter set:

p^* = \begin{cases} p_{il}, & \text{if } cf_{pil} < cf_{pbl} \\ p_{bl}, & \text{if } cf_{pbl} < cf_{pil} \end{cases}

8. Examine the resulting simulation for consistency and feasibility (i.e., both quantitative and qualitative fit with the experimental data). The objective of this step is to confirm that minimizing the cost function resulted in an acceptable fit to the experimental data. If the fit is acceptable, the optimization process is complete.

9. The objective of step 9 is to search other areas of the cost function space with low values, besides the one containing the best support node, if step 8 is not successful. Determine the distance between the sorted grid points with the lowest cost function values (for instance, the lowest 1%), where distance can be defined as the sum of the absolute percent change in each parameter over all parameters. Run local searches on the cost function from the points farthest from the best support node. If one of these searches results in an acceptable fit, the optimization process is complete.

10. If step 9 is unsuccessful, consult the troubleshooting section (Section 11.6) and consider increasing the maximum number of grid points by 2 to 10 times. Save the previous grid as "z," add 'PrevResults', z to the options, increase 'MaxPoints', and return to step 2.
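Because Figure 11.5 is reproduced only as an image, the sketch below outlines steps 1 to 7 in code form. It is a rough template, not the authors' script: the option and field names (e.g., 'GridType', 'DimensionAdaptive', 'DimadaptDegree', 'KeepGrid', 'KeepFunctionValues', z.grid, z.fvals) follow the Sparse Grid toolbox documentation [10] and should be checked against the installed version, costfun and its data arguments are the hypothetical cost function sketched in Section 11.2, and p0 is the initial parameter guess.

% Step 1: search range, one row [lower upper] per uncertain parameter,
%         spanning an order of magnitude around the initial guess (log10 scale).
range = [log10(p0(:)) - 1, log10(p0(:)) + 1];
d     = size(range, 1);
cost  = @(logp) costfun(10.^logp, tdata, ydata, w, obs_idx, y0);

% Step 2: Chebyshev basis, dimension adaptivity, and a cap on the grid size.
options = spset('GridType', 'Chebyshev', 'DimensionAdaptive', 'on', ...
                'DimadaptDegree', 0.9, 'FunctionArgType', 'vector', ...
                'MinPoints', 50 * d, 'MaxPoints', 200 * d, ...
                'KeepGrid', 'on', 'KeepFunctionValues', 'on');

% Step 3: evaluate the cost function at the support nodes and build the interpolant.
z = spvals(cost, d, range, options);
[fsorted, order] = sort(z.fvals{1});              % grid points sorted by cost (assumed field names)
gridsorted = z.grid{1}(order, :);

% Step 4: search the interpolant, which serves as a surrogate of the cost function.
[pi_log, cf_pi] = spcgsearch(z);

% Steps 5-7: local searches on the actual cost function from the surrogate optimum
% and from the best support node saved by the modification listed in Table 11.1.
[pil, cf_pil] = fminunc(cost, pi_log(:)');
load pb pb cf_pb
[pbl, cf_pbl] = fminunc(cost, pb(:)');
if cf_pil < cf_pbl, p_opt = 10.^pil; else, p_opt = 10.^pbl; end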
11.5 Data Acquisition, Anticipated Results, and Interpretation

The anticipated result of the method described above is a parameter set that acceptably fits the available experimental data. Whether or not the returned parameter set is adequate, the function spvals of the Sparse Grid toolbox [10] returns a structure containing a significant amount of information that may be helpful for understanding the returned "optimal" set of parameter values as well as information about the cost function values mapped onto the uncertain parameter space. This information includes the number and locations of the grid points, the cost function values at those points, the minimum and maximum cost function values, the degree of adaptivity, estimated errors, and the computational time. From this information, it takes only a few extra steps to extract other useful information, namely the sorted grid points and unique points.
11.5.1 Sorted grid points
Grid points sorted from lowest to highest corresponding cost function value are returned in step 3 of the methods section (Section 11.4) (see Figure 11.5). Reviewing these points as a sorted list can provide some insight. For instance, this method can reveal disparate, equally valid areas of the uncertain parameter space, as shown in Figure 11.6. In this example, three parameters of a MAPK model [17] were fitted to an incomplete data set, consisting only of the MAPK data [dots in Figure 11.6(b)]. The resulting three-dimensional grid of the cost function values on the parameter space indicated two disparate regions with equally low cost functions [circled in Figure 11.6(a)]. Simulations with a set of parameter values from each region [Figure 11.6(b)] were nearly identical for three of the four states; however, the simulation of MAPKK (blue) showed a distinct difference in its peak. This information suggests that, in order to determine which, if either, of the parameter sets is valid, experimental data for MAPKK, particularly at the 15-minute time point, is required. The sorted grid points can also be used to determine the size of the parameter space that results in acceptable dynamics, termed the acceptable space, which can reveal properties of the model, such as the amount of confidence that can be placed in the chosen parameter values [26, 27]. In addition, as noted in step 9 of the methods section (Section 11.4), in the event that the returned parameters are not adequate, the sorted grid points can provide alternative starting points for additional local or global optimizations that refine the solution.
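One way to screen the sorted grid points for disparate low-cost regions (using the distance measure suggested in step 9 of Section 11.4) is sketched below; gridsorted is assumed to be the matrix of support nodes sorted by ascending cost function value, as produced in the workflow sketch after Section 11.4.

% Rank the lowest-cost grid points by their distance from the best support
% node, where distance is the sum of absolute percent changes over all
% parameters. Distant low-cost points hint at disparate "optimal" regions.
nlow   = max(2, ceil(0.01 * size(gridsorted, 1)));   % lowest ~1% of the grid points
plow   = gridsorted(1:nlow, :);
pbest  = plow(1, :);
dist   = sum(abs(plow - repmat(pbest, nlow, 1)) ./ repmat(abs(pbest) + eps, nlow, 1), 2);
[~, far]   = sort(dist, 'descend');
candidates = plow(far, :);     % alternative starting points for further local searches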
11.5.2 Unique points
With adaptive sparse grids, the number of unique points is the number of distinct locations of grid points along a parameter direction. This value can be obtained by applying the MATLAB unique function to a parameter's grid points (see Figure 11.5, step 3).
Figure 11.6 An example of a three-dimensional search of a MAPK model [17] that revealed two parameter sets that fit the mock data equally well, but predicted different dynamics for another model element. (a) Three-dimensional adaptive grid, generated with the Sparse Grid toolbox [10] and color-coded by cost function (red: high; blue: low). The two circled areas have similar cost functions when only the mock MAPK data is fitted. (b) Simulation results with parameter sets from each “optimal” area (solid: center area; dotted: right area). While three of the four state simulations (red: MAPK; green: Raf; black: Rkip) are similar, different MAPKK (blue) dynamics are predicted, suggesting that MAPKK data would be required to distinguish between the two parameter sets.
For three or fewer dimensions, the number of unique points can also be seen by plotting the grid, as shown in Figure 11.7. Unique points correlate with each parameter's importance to increasing the accuracy of the interpolant. This information is valuable because it indicates which parameters required the highest resolution for the interpolation. A use for unique points in aiding the optimization process is described in the troubleshooting section (Section 11.6.1) and demonstrated in the application notes (Section 11.8).
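A sketch of how the unique points can be counted and, for a three-parameter search, inspected visually; gridpts stands for the npoints-by-d matrix of support nodes and cf for the corresponding cost function values (both obtainable as in the workflow sketch after Section 11.4).

% Count the distinct support-node locations along each parameter direction.
d = size(gridpts, 2);
nunique = zeros(1, d);
for k = 1:d
    nunique(k) = numel(unique(gridpts(:, k)));
end

% For a three-parameter search, plot the grid with each support node
% color-coded by its cost function value (compare Figure 11.7).
if d == 3
    scatter3(gridpts(:, 1), gridpts(:, 2), gridpts(:, 3), 20, cf, 'filled');
    xlabel('log(parameter 1)'); ylabel('log(parameter 2)'); zlabel('log(parameter 3)');
    colorbar
end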
11.5.3 Unstable points
In the process of creating the sparse grid, the algorithm may return, or end with, integration errors when the model is integrated with particular parameter sets. In some cases, this error may be due to improper range setting (see Section 11.6) for certain parameters. For instance, a given search range could allow a parameter to take a value that results in a division by zero. These unstable points should be carefully evaluated, as they often reveal weaknesses in the model structure that may need revision to ensure that the model is stable over the allowable parameter ranges.
11.5.4 Interpretation and conclusions
As stated above, if the optimization process is successful, one can conclude that the parameter values found are adequate for fitting the model to the experimental data. However, one cannot conclude that these values are physically correct or even unique. If the process is unsuccessful, one should examine the model structure to determine whether or not it is capable of recreating the experimental data. One method for examining model structure is a parameter sensitivity analysis. Conducting a sensitivity analysis not only quantifies the sensitivity of the output with respect to the model parameter values but also provides information for directing the parameter fitting [21, 22]. The output of a sensitivity analysis helps to identify dominant processes or elements and to recognize events/elements that can be considered negligible, including parameters whose values have little impact on fitting the experimental data [21, 22].

Figure 11.7 This adaptive sparse grid (generated with the Sparse Grid toolbox [10]) of a three-parameter search of a MAPK model [17] demonstrates the concept of unique points. The parameter on the z-axis has three unique points, the y-parameter has five, and the x-parameter has 513. The points are color-coded by cost function (black and red: high; blue: low).
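Where a quick local screen is wanted before committing to a global method such as extended FAST [28], normalized finite-difference sensitivities of the cost function can be estimated as sketched below. This is a generic illustration, not the analyses of [21, 22]; cost is assumed to be a handle mapping a parameter vector to the scalar cost function value (for instance, a wrapper around the costfun sketch of Section 11.2), and p0 is the nominal (best-guess) parameter vector.

% Normalized local sensitivity of the cost function to each parameter,
% estimated by central finite differences about the nominal point p0.
np = numel(p0);
S  = zeros(1, np);
F0 = cost(p0);
for k = 1:np
    dp    = 0.01 * p0(k);               % 1% perturbation of parameter k
    pp    = p0;  pp(k) = p0(k) + dp;
    pm    = p0;  pm(k) = p0(k) - dp;
    S(k)  = (cost(pp) - cost(pm)) / (2 * dp) * p0(k) / F0;
end
[~, rank_idx] = sort(abs(S), 'descend');     % parameters ranked by local sensitivity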
11.6 Troubleshooting

For the cases where the methods described above do not result in a parameter set that allows the model simulations to acceptably fit the experimental data, the Troubleshooting Table lists some suggested approaches for dealing with common issues. The simplest and first approach is to increase the number of support nodes, as described in step 10 of the methods section (Section 11.4). In addition to these general recommendations, two special cases are considered: small problems (three or fewer uncertain parameters) and large problems (10 or more uncertain parameters), since certain troubleshooting techniques are more helpful, or only applicable, for problems with specific dimensions of the uncertain parameter space.
11.6.1 Troubleshooting special cases: small and large problems
Small problem: three or fewer parameters

1. The objective of this step is to make use of the ability to visually inspect the cost function in the parameter space when the space has three or fewer dimensions. The cost function itself can be visualized using a mesh for two-dimensional problems and a plot for one-dimensional problems, and the grid can be visualized for one- to three-dimensional spaces using a scatter plot. Visually examine the sparse grid (in the one-, two-, or three-dimensional case) and the cost function plot (in the one- or two-dimensional case):

i. Sparse grid: evaluate the interpolated function at each of the grid points and then plot the results. Specific: use scatter or scatter3, with the color of each point corresponding to the cost function value at the point.

ii. Cost function: create a one- or two-dimensional mapping of the cost function. Specific: use plot or mesh as appropriate.
2. The objective of this step is to analyze the plots from step 1 to determine whether or not an appropriate search space was used. For instance, in the example of a two-parameter search of a MAPK model [17] shown in Figure 11.8(a–c), it can be seen that the optimal point lies beyond the lower bounds for both parameters. Both the mesh and the grid suggest that the ranges should be shifted down. In the next search, the ranges were corrected and the results are shown in Figure 11.8(d–f); the improvement in fit can be seen in the simulation results [Figure 11.8(d)]. Therefore, for this step, analyze the plots and update the parameter ranges as needed. If nothing can be concluded from the plots, refer to the Troubleshooting Table.

Large problem: 10 or more parameters

This troubleshooting solution takes advantage of the information contained in the number of unique points per parameter, which is calculated in the general procedure (see step 3 of the methods section and Figure 11.5). Typically, these extra steps using the number of unique points are not necessary for smaller problems; optimization issues with smaller problems are more commonly due to an issue described in the Troubleshooting Table.
Troubleshooting Table

Issue: Method takes too long.
Suggestion(s): Decrease the number of maximum grid points or the number of evaluations allowed by the local search algorithm.

Issue: Method returns integration errors.
Suggestion(s): Examine parameter ranges; typically, certain parameters cannot be zero. Have parameter values automatically recorded when integration errors occur and then examine them to determine areas of instability. As appropriate, alter parameter ranges to avoid these areas or modify the model structure to eliminate the problem. Do not artificially set the cost function to an arbitrarily high value when these points occur, as this will interfere with the adaptive algorithm and with the interpolation.

Issue: Parameter sets with low cost function values do not result in a fit to the data.
Suggestion(s): Redesign the cost function to more accurately reflect the data. Consider changing the weights of the LSE cost function or adding qualitative goals to the cost function.

Issue: Method returns lower or upper bounds for some parameters.
Suggestion(s): Expand the ranges of these parameters beyond the boundaries, if possible.

Issue: Method does not produce an acceptable fit.
Suggestion(s): Increase the maximum number of allowable support nodes. Decrease the problem dimension: run a global sensitivity analysis, such as extended FAST [28], fix the least sensitive parameters at the best guess, and search for the remainder. Consider an alternative model structure: the current structure may be incapable of producing the desired dynamics.
For a definition and description of unique points, see Section 11.5. In brief, the number of unique points for a parameter is the number of unique locations of grid points for that parameter, and it correlates with the importance of the parameter for decreasing the interpolation error. Parameters with the lowest number of unique points are the least important (or can easily be fit with low-degree basis functions, such as cubic polynomials).

1. This step assumes that, at a minimum, steps 1 to 9 of the methods section have been completed and have not resulted in an acceptable fit of the model simulations to the data. For the parameters with the lowest number of unique points, set their values to the corresponding values of the best grid point, pb, returned by step 6. With the Sparse Grid toolbox [10], the lowest possible number of unique points is three when using Chebyshev polynomial basis functions, as the lowest interpolation depth allowed is three.

2. Start a new, lower-dimensional adaptive grid search for the remainder of the parameters, centering the ranges on the corresponding values of pb and following the steps in the methods section. If the dimension of the new problem is three or fewer parameters, then examine the resulting grid for range appropriateness as in the small-problem procedure above (if necessary, repeat the search with adjusted ranges). Save the returned best grid point, denoted pb,new.

3. Create a new initial point by replacing the appropriate values of pb with the corresponding values of pb,new; denote this new initial point pb,initial. Perform a local search on the cost function starting from pb,initial, resulting in the parameter set pbl with a cost function value of cfpbl.

4. If the fit is acceptable, the optimization is complete. If not, try increasing the maximum number of grid points by two to ten times and repeating the procedure.
Figure 11.8 An example of a small (two-dimensional) parameter search of a MAPK model [17]. In this example, the initial search range did not include the optimal parameter values to match the mock data (blue and red stars). (a) The fit for the returned parameters (solid lines) compared to the mock data (stars) over the initial search range. (b) The mesh of the corresponding cost function. (c) The adaptive sparse grid, generated with the Sparse Grid toolbox [10] and color-coded by cost function (red: high; blue: low). The search range is shifted lower, based on the mesh and grid from the original search (b, c). (d) The fit for the returned parameters (solid lines) compared to the mock data (dots) with the shifted search range. (e) The mesh of the corresponding cost function. (f) The corresponding adaptive sparse grid, generated with the Sparse Grid toolbox [10] and color-coded by cost function (red: high; blue: low).
11.7 Discussion and Commentary

High-dimensional nonlinear models are becoming common in biomedical applications because of their usefulness in understanding biological processes, predicting behaviors, and developing therapies. However, identifying appropriate model parameters is challenging. As parameters are typically unknown, they are most often fitted to limited experimental data. Parameter optimization is a well-researched field, and many algorithms exist, including local and global as well as stochastic and deterministic approaches. Local algorithms are of little use for nonlinear models since their results are highly dependent on the starting location. Global algorithms are computationally expensive and typically carry no guarantee of finding, or even converging to, the global minimum [29]. Exceptions exist: smooth, twice-differentiable functions that are convex/concave, or that can be converted into convex/concave problems, can be solved with global deterministic approaches such as branch and bound [7, 8]. However, it is highly unlikely that a cost function based on numerical integration of large, highly nonlinear models of cellular processes will be of that form. Until recently, the alternatives have been global, stochastic algorithms such as the genetic algorithm, which have no guarantee of convergence, or LHS/full-grid initialization of local searches, which grows exponentially with dimension.

The alternative presented in this chapter employs an adaptive sparse grid-based search. The adaptive sparse grid is designed to map the cost function onto the uncertain parameter space using interpolation with basis functions (typically polynomials) at support nodes. The error between the interpolating function and the cost function decreases with increasing numbers of support nodes. This method has two benefits: grid points are placed in the most important locations (importance being defined by [30] as requiring higher-level polynomials to reduce the error of the interpolating function), and the interpolant serves as a surrogate cost function. The former acts like an informed LHS/full grid: entire dimensions can be largely neglected if they are easy to fit with low-degree polynomials, slowing the growth in the number of model evaluations needed with dimension. The latter allows searches without additional model evaluations, which, depending on the interpolation accuracy, can find an optimal point very close to the global minimum.

Like the stochastic and local methods, adaptive sparse grid optimization does not guarantee finding the global minimum. However, unlike those methods, it does return valuable information about the uncertain parameter space, as described in Section 11.5. For instance, the unique points give an indication of parameter importance and can be used to improve the adaptive sparse grid search. In addition, as shown in Section 11.8, adaptive sparse grid-based optimization can result in larger, more consistent decreases in cost function values with increasing numbers of model evaluations than the GA, even when the GA is followed by a local search.

Complicating factors in optimization searches include parameter correlations (where changes in one parameter can compensate for changes in another) and low parameter sensitivities (where changes in a parameter have little effect on the model output or cost function). As a result, some parameters may not be identifiable from a set of experimental data [22].
The recognition of these parameters can play a key role in parameter identification, as they should be neglected, and fixed at some best-guess value, until further information can be obtained. Neglecting parameters decreases the dimension of the search, thereby increasing the likelihood of finding the global minimum in the fewest number of model evaluations. In the authors' experience (data not shown), parameters with very low sensitivity coefficients (as determined by extended FAST [28]) are more easily fit with low-degree polynomials; therefore, these parameters will have fewer unique points.
However, this correlation will not always hold and is certainly not guaranteed for all problems. Future work will explore altering the adaptive scheme for selecting sparse grid points to incorporate information on the parameter sensitivities; this is expected to further facilitate parameter identification.
11.8 Application Notes

The MAPK model published by Wolkenhauer et al. [17] was used as an example to demonstrate the described methods. Mock data was generated for two (MAPK and MAPKK) of the four elements (the remaining two are Raf and Rkip) by simulating the model with the published parameter values and taking seven time points for each element from 0 to 25 minutes. The posed parameter identification problem attempted to identify all 18 model parameters from this mock data set. The results and computational efficiency of the adaptive sparse grid-based optimization method are compared to those of the GA. The sparse grid method, due to symmetry, automatically evaluates the center point of the parameter space; therefore, in order to avoid biasing the sparse grid towards the actual parameter values, a new center point was created by selecting a random initialization point within an order of magnitude above or below the actual values. The uncertain parameter search range for both the sparse grid and the GA was assigned with the lower limit set to an order of magnitude smaller than this initial point and the upper limit set to an order of magnitude larger.
11.8.1 Comparison of adaptive sparse grid and GA-based optimization
The resulting cost function value (the least squared error, or LSE) was calculated for the adaptive sparse grid-based optimization method and the GA for increasing numbers of model evaluations; the results are shown in Figure 11.9. For the adaptive sparse grid method, steps 1 to 7 of the methods section were followed to achieve the resulting cost function value (LSE), with the total number of model evaluations being the sum of the number of grid points and the number of evaluations performed by the local search(es). For the GA method, the GA was run at least five times (because of its stochastic nature, each outcome is different), each run followed by a local search; the number of model evaluations in this case is the sum of the evaluations used by the GA and the local searches. The results of the local searches were averaged, and the error bars in Figure 11.9 represent the standard deviations of the results. To illustrate the differences between the GA and the adaptive sparse grid performance, an example is given below in which the maximum number of model evaluations for each method was limited to approximately 5,000. In addition, the use of the number of unique points per parameter, as described under "Large problem" in Section 11.6, is demonstrated.
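For reference, GA runs of the kind summarized in Figure 11.9 can be scripted roughly as follows. This is a sketch only: ga and its options require the MATLAB Global Optimization Toolbox, cost is assumed to map a log10-scaled parameter vector to the cost function value (as in the workflow sketch after Section 11.4, where the cost is the log of the LSE), lb and ub bound the log10 parameter space, and the population and generation settings shown are placeholders rather than the values used by the authors.

% Repeated GA runs, each followed by a local search, for comparison with the
% adaptive sparse grid result at a similar number of model evaluations.
nruns = 5;
lse   = zeros(nruns, 1);
gaopt = optimoptions('ga', 'PopulationSize', 50, 'MaxGenerations', 100, 'Display', 'off');
for r = 1:nruns
    pga       = ga(cost, numel(lb), [], [], [], [], lb, ub, [], gaopt);
    [~, flog] = fminunc(cost, pga);     % local refinement of the GA result
    lse(r)    = exp(flog);              % undo the log taken inside the cost function
end
fprintf('GA + local search: LSE = %.3g +/- %.3g\n', mean(lse), std(lse));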
11.8.2 Adaptive sparse grid-based optimization
An 18-dimensional, 2,017-point grid was created, resulting in an optimal point with a cost function value of 1.98E4. Eleven of the 18 parameters were found to have three unique points each; the remaining seven had five to nine each. The parameters that had three unique points were set at the returned values.
Figure 11.9 For a MAPK model [17], a comparison of the performance, indicated by the least squared error (LSE) between the model simulations and the mock data set, of the adaptive sparse grid-based optimization (blue) and the GA (red). The adaptive sparse grid method consistently performed better than the GA for larger numbers of model simulations. The GA results are the average of at least five runs, with the error bars representing the standard deviation of the results. The adaptive sparse grid method followed steps 1 to 7 of the methods section while the GA was run with an increasing number of allowed generations and/or population sizes, followed by a local search on the result.
A second, seven-dimensional grid with 2,031 points was created to search over the remaining parameters. The returned optimal point had a cost function value of 1.48E4. The values of the 11 parameters returned by the first grid and the seven parameters returned by the second grid were combined into an initial point for a local search on the actual cost function. This local search, using 532 model evaluations, returned an optimal point with a cost function value of 1.20E4. The resulting simulations with this parameter set are shown in Figure 11.10(a); the simulations are slightly shifted from the mock data but otherwise are quite similar and consistent with the observed trends in the mock data. A total of 4,580 model evaluations were used. To improve the fit, the next step would be to increase the number of model evaluations by 5 to 10 times.
11.8.3 Genetic algorithm
The GA was run five times, and a local search was run from each returned point. With an average of 5,517 model evaluations (5,000 for the GA and an average of 517 for the local search), this method returned an average cost function value of 8.57E4, with a standard deviation of 1.33E4. The results are shown in Figure 11.10(b). Figure 11.9 suggests that the fit could be improved by increasing the number of model evaluations, but again, the GA would have to be run multiple times in order to have a reasonable chance of seeing an improved fit. This 18-dimensional example demonstrates that the adaptive sparse grid-based optimization improves the fit of the model simulations to the experimental data set with increasing numbers of model evaluations, while the GA fitness improved on average but required multiple runs to assure this progress. The example also demonstrates the utility of the unique points identified by the adaptive sparse grid approach: the parameters least important for reducing the error of the interpolant were temporarily fixed while the most important parameters were identified in a subsequent search. This sparse grid process provided information to refine the parameter identification problem and led to an acceptable solution, while the GA failed to identify a reasonable solution even with multiple attempts.
Figure 11.10 Fitting the 18 parameters of a MAPK model [17] to a mock data set (stars) using approximately 5,000 model evaluations. Red: MAPK; blue: MAPKK. (a) Results of adaptive sparse grid-based optimization. (b) Results of five independent GA runs followed by local searches. The inability of the GA to find acceptable parameter values prematurely suggests the modeled hypotheses may be inconsistent with the experimental data.
This inability of the GA to find acceptable parameter values may be inappropriately interpreted to mean that the modeled hypotheses are inconsistent with the experimental data. However, with the same number of total model evaluations, the adaptive sparse grid-based optimization was able to find parameter values that supported the modeled hypotheses.
11.9 Summary Points

The adaptive sparse grid-based optimization approach described herein has a number of advantages over other stochastic global optimization techniques.
• There can be a large variance in the resulting parameter values over multiple runs of a stochastic optimization approach, whereas the adaptive sparse grid-based optimization approach will always return the same parameter values when posed with the same maximum number of model evaluations.

• As the number of model evaluations used by the adaptive sparse grid-based optimization increases, the error of the interpolant approximation of the cost function mapping on the parameter space decreases, so eventually the parameter values returned will minimize the cost function value; the probabilistic sampling of the parameter space by stochastic optimization methods, in contrast, provides no assurance that the solution improves with more supporting model evaluations.

• The interpolant mapping of the cost function on the uncertain parameter space and the unique points generated during the adaptive sparse grid search may provide insight to refine and improve the identification process.

• While the GA method can lead to incorrectly discarding a model hypothesis due to its inability to find well-fitting model parameters, adaptive sparse grid-based optimization allows a more thorough examination of the parameter space for a better evaluation of the appropriateness of the model structure.
Acknowledgments

This work was supported in part by a National Science Foundation Graduate Research Fellowship.
References

[1] Dharuri, H., L. Endler, N. Le Novere, R. Machne, B. Shapiro, C. Li, C. Laibe, and N. Rodriguez, "Database of Systems Biology Markup Language Models," http://www.ebi.ac.uk/biomodels/, 2006–2008. Accessed June 17, 2008.
[2] Orton, R. J., "Compilation of useful modeling links including model databases," http://www.brc.dcs.gla.ac.uk/projects/bps/links.html. Accessed June 17, 2008.
[3] Aldridge, B. B., J. M. Burke, D. A. Lauffenburger, and P. K. Sorger, "Physicochemical modelling of cell signalling pathways," Nature Cell Biology, Vol. 8, November 2006, pp. 1195–1203.
[4] Wolkenhauer, O., and M. Mesarovic, "Feedback dynamics and cell function: Why systems biology is called Systems Biology," Mol. Biosyst., Vol. 1, May 2005, pp. 14–16.
[5] Sachs, K., O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan, "Causal protein-signaling networks derived from multiparameter single-cell data," Science, Vol. 308, April 22, 2005, pp. 523–529.
[6] Balaban, R. S., "Modeling mitochondrial function," Am. J. Physiol. Cell Physiol., Vol. 291, 2006, pp. 1107–1113.
[7] Eposito, W. R., and P. W. Zandstra, "Global optimization for the parameter estimation of differential algebraic systems," Ind. Eng. Chem. Res., Vol. 39, 2000, pp. 1291–1310.
[8] Pardalos, P., H. E. Romeijn, and H. Tuy, "Recent developments and trends in global optimization," Journal of Computational and Applied Mathematics, Vol. 124, 2000, pp. 209–228.
[9] Goldberg, D. E., Genetic Algorithms in Search, Optimization, and Machine Learning, Reading, MA: Addison-Wesley, 1989.
[10] Klimke, A., and B. Wohlmuth, "Algorithm 847: spinterp: Piecewise multilinear hierarchical sparse grid interpolation in MATLAB," ACM Transactions on Mathematical Software, Vol. 31, 2005.
[11] Bungartz, H.-J., and S. Dirnstorfer, "Higher order quadrature on sparse grids," ICCS 2004, 2004, pp. 394–401.
[12] Ferenczi, I., "Global Optimization Using Sparse Grids," Technische Universität München, 2005, p. 140.
[13] Klimke, A., "Sparse grid surrogate functions for nonlinear systems with parameter uncertainty," Proceedings of the 1st International Conference on Uncertainty in Structural Dynamics, 2007, pp. 159–168.
[14] Gerstner, T., and M. Griebel, "Numerical integration using sparse grids," Numerical Algorithms, Vol. 18, 1998, pp. 209–232.
[15] Barthelmann, V., E. Novak, and K. Ritter, "High dimensional polynomial interpolation on sparse grids," Advances in Computational Mathematics, Vol. 12, 2000, pp. 213–288.
[16] Bungartz, H.-J., and M. Griebel, "Sparse grids," Acta Numerica, Vol. 13, 2004, pp. 147–296.
[17] Wolkenhauer, O., S. N. Sreenath, P. Wellstead, M. Ullah, and K. H. Cho, "A systems- and signal-oriented approach to intracellular dynamics," Biochem. Soc. Trans., Vol. 33, June 2005, pp. 507–515.
[18] Gerstner, T., and M. Griebel, "Dimension-adaptive tensor-product quadrature," Computing, Vol. 71, 2003, pp. 65–87.
[19] Klimke, A., "Uncertainty modeling using fuzzy arithmetic and sparse grids," Universität Stuttgart, 2006.
[20] Zheng, Y., and A. Rundell, "Comparative study of parameter sensitivity analyses of the TCR-activated Erk-MAPK signaling pathway," IEE Systems Biology, in press, 2006.
[21] Donahue, M. M., W. Zhang, M. Harrison, J. Hu, and A. E. Rundell, "Employing optimization and sensitivity analyses tools to generate and analyze mathematical models of T cell signaling events," Data Mining, Systems Analysis and Optimization in Biomedicine, Gainesville, FL, 2007, pp. 43–63.
[22] Yue, H., M. Brown, J. Knowles, H. Wang, D. S. Broomhead, and D. B. Kell, "Insights into the behaviour of systems biology models from dynamic sensitivity and identifiability analysis: a case study of an NF-kappaB signalling pathway," Mol. Biosyst., Vol. 2, December 2006, pp. 640–649.
[23] Xiu, D. B., and J. S. Hesthaven, "High-order collocation methods for differential equations with random inputs," SIAM Journal on Scientific Computing, Vol. 27, 2005, pp. 1118–1139.
[24] Le Maitre, O. P., H. Najm, P. Pebay, R. Ghanem, and O. M. Knio, "Multi-resolution-analysis for uncertainty quantification in chemical systems," SIAM J. Sci. Comput., Vol. 19, 2007, pp. 864–889.
[25] Wolkenhauer, O., M. Ullah, P. Wellstead, and K. H. Cho, "The dynamic systems approach to control and regulation of intracellular networks," FEBS Lett., Vol. 579, March 21, 2005, pp. 1846–1853.
[26] Brown, K. S., and J. P. Sethna, "Statistical mechanical approaches to models with many poorly known parameters," Physical Review E, Vol. 68, 2003.
[27] Gutenkunst, R. H., F. P. Casey, J. J. Waterfall, C. R. Myers, and J. P. Sethna, "Extracting falsifiable predictions from sloppy models," Ann. N.Y. Acad. Sci., Vol. 1115, 2007, pp. 203–211.
[28] Saltelli, A., S. Tarantola, and K. P.-S. Chan, "A quantitative model-independent method for global sensitivity analysis of model output," Technometrics, Vol. 41, 1999.
[29] Moles, C. G., P. Mendes, and J. R. Banga, "Parameter Estimation in Biochemical Pathways: A Comparison of Global Optimization Methods," Genome Res., Vol. 13, 2003, pp. 2467–2474.
[30] Bungartz, H.-J., and S. Dirnstorfer, "Multivariate quadrature on adaptive sparse grids," Computing, Vol. 71, 2003, pp. 89–114.
Related sources and supplementary information

The Sparse Grid toolbox for MATLAB used in this chapter is available at http://www.ians.uni-stuttgart.de/spinterp/.
CHAPTER 12
Reverse Engineering of Biological Networks

Heike E. Assmus, Sonja Boldt, and Olaf Wolkenhauer
Systems Biology and Bioinformatics, Department of Computer Science, University of Rostock, Albert Einstein Str. 21, 18051 Rostock, Germany; e-mail:
[email protected] and
[email protected], www.sbi.uni-rostock
Abstract

The function of each single living cell, as well as the complexity of any living system, is the result of the interactions taking place between networks of biological entities, such as proteins, metabolites, genes, cells, organisms, groups of organisms, and so on. It is one of the foremost aims of biological research to identify and study these networks. Investigating biological networks means discovering not only what components they are made of, but also which of these components actually interact, what kind of interactions they share, and, finally, how all of this results in the biological processes that we observe, be it in the test tube or in the environment around us. In this chapter, we introduce approaches to infer the structure and functionality of biological networks, starting briefly with some logical and statistical methods but then focusing on those that involve differential equations. Furthermore, the final section highlights some possibilities for the exploration of the inferred networks.
Key terms
Reverse engineering
Network inference
Network structure
Network topology
Differential equations
Power law modeling
Parameter estimation
Network biology
12.1 Introduction: Biological Networks and Reverse Engineering

12.1.1 Biological networks
In biological systems, a variety of networks can be distinguished on different hierarchical and organizational levels. Networks that occur inside the cell, the smallest unit of any organism (and equivalent to the organism in the case of unicellular organisms), are called cellular networks. Metabolic networks are networks of small biochemical molecules (the metabolites) that are connected by biochemical reactions (i.e., conversions between the metabolites, which are usually catalyzed by enzymes, a special class of proteins). The predominant outcomes of metabolism are energy exchange (ATP production/consumption) and de novo synthesis of the biomolecules that the cell needs for self-maintenance or for communication with other cells. Figure 12.1(a) shows the glycolytic pathway as an example. Signal transduction networks are networks of signaling proteins. The connections between them are also biochemical reactions, often reversible modifications of the proteins contributing to the signaling, such as phosphorylation and dephosphorylation. Scaffolding proteins help to spatially organize these modifications, and metabolites also play a role in signaling by providing the energy needed to modify the proteins. One of the most comprehensively studied signaling pathways to date is the EGFR-signaling map [1]. Another well-known example is the canonical Wnt-pathway [see Figure 12.1(b)]. It highlights the main features of a signaling pathway, which usually begins with an extracellular signal that is transferred through the cell membrane into the cytosol, where it is relayed and results in the activation or deactivation of a transcription factor.
Figure 12.1 There are different types of cellular networks. (a) Glycolytic pathway as an example for metabolic networks. (b) The canonical Wnt-pathway as an example for signal transduction networks.
The third type of cellular network is the gene regulatory network. A gene regulatory network (GRN) is the structured set of information necessary to specify when and how one gene or a group of genes is to be expressed. GRNs contain connections that describe the regulatory relationships between genes, and they include interactions between proteins and DNA [Figure 12.2(a)]. Often, only the genes are shown [Figure 12.2(b)]. Sometimes, the pathways that interconnect them are also included. These pathways consist of proteins that are expressed as the result of switched-on genes (gene products), parts of the signaling network (pathways with few or many steps and with transcription factors at their end), and finally the genes that are regulated by these transcription factors (target genes). A gene product can itself be a transcription factor and directly regulate the next gene(s), but usually the regulation is indirect (i.e., the gene product is secreted and acts as a signal to a membrane receptor, which starts a new signaling cascade that culminates in the activation and/or translocation to the nucleus of yet another transcription factor).

Established cellular networks are collected in (curated) databases. One widely used database is KEGG (www.genome.jp/kegg/pathway.html); it currently holds about 120 metabolic pathway maps consisting of an average of 17 reactions each. Some other well-known databases for metabolic and signaling pathways are BioCarta (www.biocarta.com), Reactome (www.reactome.org), BioCyc (www.biocyc.org), and MetaCyc (www.metacyc.org). The Pathway Interaction Database (PID; pid.nci.nih.gov) covers signaling pathways only, and it currently contains 87 curated human signaling pathways. For some organisms, there are databases that provide transcription or gene regulatory network information, such as EcoCyc-DB (ecocyc.org) and RegulonDB (regulondb.ccg.unam.mx) for E. coli. Eukaryotic transcription-regulating proteins and sequences are collected in TRANSFAC. The 2008 update of the molecular biology database collection lists over 1,000 databases, and more than 20 of them are for pathways [2]. Pathguide (www.pathguide.org) contains information about 240 biological pathway resources.

The data needed to infer (and populate) networks differs for each network type.
Figure 12.2 Gene regulatory networks are cellular networks. (a) An example of a gene regulatory network with two components. Protein 1 activates expression of gene 2 by binding to its promoter, and it inhibits its own expression by the same mechanism. Regulatory regions are usually located upstream of the genes' coding regions. Besides promoter regions, they also include operators (for more than one gene) as well as enhancers and silencers (which can also lie downstream). (b) In the summarized depiction of the same network, genes are nodes, and regulation is represented by directed sign-labeled arrows.
mass spectrometry, and so on, whereas gene regulatory networks are often based on data from microarray experiments, and the input for signal transduction networks (i.e., data describing the activation status of a set of proteins) comes from various sources (e.g., Western blots or microarrays). To be sufficient for network inference, the data that is used to infer network structure must reflect the time course of changes in a network (time series data) or show the response of the network to several different perturbations. How such data is experimentally generated is described briefly in Section 12.2, which focuses on large-scale high-throughput experimental techniques. Furthermore, besides these cellular networks, there are networks that describe the interactions of cells with each other and networks that describe the interactions between whole populations of microorganisms, plants, or animals. Examples of the former are neuronal networks, which can be depicted in wiring diagrams, such as the synaptic connectivity map published by Hall and Russell in 1991 [3] showing the overall pattern by which the tail neurons of the nematode (roundworm) C.elegans interact. Ecological networks, also called food webs, are examples of networks describing the interactions between species populations. The components of such noncellular networks are larger than the molecules that form cellular networks, which makes them more amenable to direct observation. Synaptic connectivity maps can be determined, for example, by microscopy and photography of the neuronal architecture, or by tracking the propagation of a signal through a combination of labeling, microscopy, and voltage-clamp recordings (a method for studying synaptic transmission). Nevertheless, both types of networks may also be inferred by using approaches that are described in Section 12.3 (i.e., by deducing the underlying network structure from its observed output). This chapter focuses on cellular networks only, although some of the described approaches can equally be applied to the other types of biological networks that were just mentioned.
12.1.2
Network representation
For visualizing biological networks, the network components can be represented by simplified depictions or even reduced to a labeled symbol. It is more complicated to illustrate the interactions between the components of the network, because such an illustration should convey not only which components interact but also some qualitative information about the interaction. Different schematic representations of biological networks are in use (see, e.g., Figure 12.1). One of the simplest is the representation of networks as graphs. A graph representation reduces the elements of a network to nodes (vertices, junctions) and their pair-wise
Figure 12.3 Networks can be represented in many ways (e.g., as graphs or matrices). (a) Undirected graph with three nodes and corresponding adjacency matrix. (b) Directed acyclic graph with five nodes and corresponding adjacency matrix.
relationships or interactions to edges (arcs, lines) that connect pairs of nodes [see Figure 12.3(a, b)]. The nodes of cellular systems may be genes, mRNA, proteins, or other molecules. Usually, one node represents one component of the network, and an edge represents the interaction between two of the components. Edges can be undirected (i.e., they simply connect two network components [Figure 12.3(a)]) or directed (i.e., they also imply some sort of causal or other asymmetry between the two connected components [Figure 12.3(b)]). In synaptic wiring diagrams, for example, the edges are always directed, as the synapses that form the connections between two neurons clearly define them as pre- and postsynaptic. One way of describing gene regulatory networks is as a directed graph. In a graph representation of GRNs, the nodes are genes and the edges describe the relations between pairs of genes [see again Figure 12.2]. Different types of functions may be associated with these edges to indicate some regulatory relationship between genes and to describe the restrictions controlling the flow of information in the GRN. Sometimes, two different types of nodes are used for the transcription factors and for the genes they regulate. Graphs are not the only way to depict networks of interactions between biological entities. In fact, for metabolic networks they are less suitable, unless one chooses to employ bipartite graphs, which include the enzymes that catalyze the reactions and thereby allow showing reactions with more than one substrate and/or product. Stoichiometry matrices offer a very concise way of representing metabolic networks. They also allow stoichiometric analysis (see Section 12.4.3). Process diagrams are suited to represent metabolic and signaling networks [4]. Kohn maps are another possibility for schematically representing biological networks [5]. The latter two differ in their illustrative power and each has certain limitations. The Systems Biology Graphical Notation (SBGN) initiative is working on the standardization of the schematic representation of essential biochemical and cellular processes that are studied in systems biology (www.sbgn.org).
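In computational work, a graph is often stored as its adjacency matrix. As a minimal sketch (assuming Python with NumPy), the directed graph of Figure 12.3(b), using the edge set that is listed again for this graph in Section 12.3.4, can be encoded as follows:

```python
import numpy as np

# Directed graph of Figure 12.3(b): five nodes, edges E = {(1,3), (1,4), (2,4), (4,5)}
nodes = [1, 2, 3, 4, 5]
edges = [(1, 3), (1, 4), (2, 4), (4, 5)]

# Adjacency matrix: A[i-1, j-1] = 1 if there is a directed edge from node i to node j
A = np.zeros((len(nodes), len(nodes)), dtype=int)
for src, dst in edges:
    A[src - 1, dst - 1] = 1
print(A)

# For an undirected graph, as in Figure 12.3(a), the matrix would be symmetric
A_undirected = ((A + A.T) > 0).astype(int)
```

The caption of Figure 12.3 makes the same point: the graph and its adjacency matrix are two views of the same network.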
12.1.3
Motivation and design principles
When cellular networks, such as the ones given in the above examples, are published, are included in databases, become accepted knowledge, and are taught to students, they are the consolidated result of a long process that led from numerous empirical observations and measurements to a theoretical model of how their individual components interact. Through this process, data is translated into predictive models. But why is this network identification so important? Besides the general curiosity of human beings to learn about the world around them and to investigate how things are related, there is a practical need to establish models of the objects or processes in question, either in our mind, or in a kind of scheme or verbal description, or as a detailed formal or mathematical representation. These theoretical models are necessary in order to subject them to systematic probing and analysis, to find common or unique design principles and learn from these principles, to form and test hypotheses on the model networks, and ultimately, to apply the lessons learned by doing so to the real system (i.e., manipulate the real biological system to achieve a certain desired behavior). This can be in order to prevent or cure diseases (development of vaccines, drugs, therapies), to heal or regenerate (applications in regenerative medicine),
or to manipulate organisms or our environment (biotechnological or agricultural applications). Reverse engineering, as defined and discussed in this chapter, is the process of network inference. It comprises the identification of a network that represents a biological system, the formal description of this network (the examples given in Section 12.1.1 are such formal network descriptions; they are the result of extensive reverse engineering), and the discovery of organizing principles in the networks that bring about the properties and behavior of living cells and enable them to meet the demands for robustness in uncertain environments. Just as an engineer designs a device according to the tasks that it has to perform, in reverse engineering the goal is the reverse, namely to discover the design that is behind the characteristics of the “device” (i.e., the cell) for which evolution was the engineer. The task of mapping an unknown network is known as reverse engineering [6]; it is illustrated in Figure 12.4. Metabolic, signaling, and gene regulatory networks are very complex, in that they usually contain many loops. In fact, these feed-forward and feed-back loops are among the major design principles that enable the wealth of biological functions a cell can accomplish. Biological network topology and design principles are discussed in more detail in Section 12.4.2.
12.1.4
Reverse engineering
Biological networks have to be deduced from empirical observations, and the amount of experimental data available is growing ever faster with the advent of high-throughput screening techniques and improvements in the detection and quantification of ever smaller molecules and of changes in their amounts. There are two sometimes distinct, sometimes complementary tasks to tackle in order to gain an understanding of the functional interactions between genes, proteins, and metabolites: (1) structural identification (i.e., to ascertain network structure or topology), and (2) identification of dynamics to determine interaction details (e.g., transition rules or rate equations and kinetic parameters). A metabolic network such as the glycolytic pathway shown in Figure 12.1(a) was determined in a tedious step-by-step process. In a classical (or twentieth century) small-scale low-throughput way, each enzyme in the pathway was studied separately by
Figure 12.4 Before and after reverse engineering: a network of five components with all possible interactions (left) and the result of network inference (right). (Figure inspired by [7].)
classical enzyme assays. The network was then built by assembling all the individual enzymatic reactions together in one pathway. This is where systems biology entered the stage. Systems biology is about investigating, and ultimately understanding, the organizing principles of biological systems and how they bring about the observed behavior. It also comes with a change of focus from few components with few interactions to huge networks with many components and many interactions. Systems biology of the twenty-first century takes up the challenge of looking at (whole) networks instead of just collecting components to complete a parts list. Over the last decade, new technologies and experimental methods have been developed that enable acquisition of large data sets containing genomic, proteomic, and metabolic information describing the state of a cell. Employing them, more and more of the cellular components (and their concentrations and concentration changes) can be monitored and captured at once. Among these large-scale high-throughput methods are microarrays, time of flight mass spectrometry (MS-TOF), and yeast two-hybrid screening. These modern experimental techniques, in combination with an increase in computing power, have made the analysis of larger networks feasible. We have long reached a point where the inferred networks, including their features and behavior, can no longer be grasped as a whole by the human mind alone but need to be investigated with the help of computers. In the following section, some experimental techniques are introduced briefly. They are used by researchers in the life sciences to generate empirical (cell-)biological data that is the basis for network inference. Section 12.3 deals with theoretical approaches applied to this data to infer biological networks. After the networks have been inferred, they can be analyzed (i.e., their organizing principles uncovered and their behavior simulated). This is discussed in Section 12.4. The final section of the chapter is a summary and comparison of the various approaches.
12.2 Material: Time Series and Omics Data

New technologies enable the acquisition of large data sets containing genomic, proteomic, and metabolic information that describe the state of a cell. The measured quantities are metabolite concentrations, protein concentrations, protein activation status, and mRNA levels, to name the most studied ones. Time series are series of m discrete measurements of n (state) variables. They generally take the following form:
$$\tilde{D} = \left( \tilde{X}_i(t_j) \right)_{i=1,\ldots,n;\; j=1,\ldots,m}$$

where $\tilde{X}_j = \left[ \tilde{X}_1(t_j), \ldots, \tilde{X}_n(t_j) \right]^T$ is the state vector and contains the quantities of all measured variables at the jth observed time point, and $\tilde{X}_i = \left[ \tilde{X}_i(t_1), \ldots, \tilde{X}_i(t_m) \right]$ is the time series for just one variable.
Instead of various time points, tj, there can also be just one single measurement but several differing conditions in which the system is investigated.
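For concreteness, a minimal sketch (assuming Python with NumPy; the numerical values are hypothetical) of how such a data set can be organized, with one row per measured variable and one column per time point:

```python
import numpy as np

# Hypothetical data set D: n = 3 measured variables observed at m = 4 time points
t = np.array([0.0, 10.0, 20.0, 30.0])       # observation times t_1, ..., t_m
D = np.array([[1.00, 0.80, 0.55, 0.40],     # time series of variable X_1
              [0.10, 0.35, 0.60, 0.70],     # time series of variable X_2
              [0.50, 0.52, 0.49, 0.51]])    # time series of variable X_3

state_at_t2 = D[:, 1]   # state vector at the second observed time point
series_X1 = D[0, :]     # time series of variable X_1 alone
```

The same layout applies when the columns are different experimental conditions instead of time points.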
Applying large-scale and high-throughput methods to determine the metabolite content of cells is called metabolomics, for proteins it is called proteomics, and for mRNA it is called transcriptomics.
12.2.1
Metabolomics
Analysis of the primary and secondary metabolism on a large scale (called metabolomics) includes the complete characterization of the metabolites that occur in a cell (called the metabolome) at a certain time, and the quantification of these metabolites at different time points, or under varying (environmental or stress) conditions, or for different strains/mutants. Metabolome databases are, for example, the HMDB (www.hmdb.ca) for human metabolomics data, and the Golm Metabolome Database (GMD; csbdb.mpimp-golm.mpg.de/csdbd/gmd/gmd.html) for plant metabolomics data. Experimental analysis of the metabolome comprises two main steps: separation and detection. The most common methods for separation are gas chromatography (GC), high performance liquid chromatography (HPLC), and capillary electrophoresis (CE). For detection, they are mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy. The most common combination of methods for metabolome analysis is gas chromatography interfaced with mass spectrometry (GC/MS; also the first to be developed), followed by LC/MS and by method combinations with NMR. The concluding identification of the metabolites in a sample is facilitated by the fast-growing libraries of metabolite spectra, such as METLIN (metlin.scripps.edu). As of now, several hundred metabolites can be measured exactly by these modern techniques, providing a metabolic profile of a cell that gives an instantaneous snapshot of its physiology. Metabolic profiles have proved to be useful in optimizing and controlling fermentation processes and production flows, but they can also be used to infer cellular metabolic networks. The procedure for this task, namely calculating the correlation for each pair of metabolites and using it to construct a metabolic network, is explained in Section 12.3.3.
12.2.2
Proteomics and protein interaction networks
The proteome is the expressed protein complement of a genome, and it can be larger than the number of genes in the genome because of splice variants and post-translational modifications. The human proteome—a catalog of all proteins in the human body—is for many life scientists the next key step after the success of the Human Genome Project. Compared to the transcriptome, the proteome is less dynamic. For detection of the proteome, or at least meaningful fractions of it, several proteomic techniques exist, such as HPLC and MALDI-TOF mass spectrometry. ELISA and Western blotting are used for small-scale systems. One aspect of proteomics is the interactions proteins can undergo. Proteins interact with metabolites, with RNA or DNA, or with other proteins. In the last case, the interaction is called protein-protein interaction (PPI). The entirety of physical interactions between the proteins of a cell forms a cellular network that is the subject of research, experimentally as well as by network inference. One experimental technique for detection of protein-protein interactions at a large scale is automated yeast two-hybrid (Y2H)
screening [8]. Y2H screens together with complementary methods have already provided PPI networks for, for example, E.coli, S.cerevisiae, C.elegans, and D.melanogaster. The human protein interaction network is still being reconstructed, and recent progress in identifying all human PPIs can be tracked by accessing protein interaction databases that collect and store available PPI data—for example, IntAct (www.ebi.ac.uk/intact) or HPRD (www.hrpd.org). Some databases include experimentally validated as well as computationally predicted PPIs (e.g., by homology). The database Unified Human Interactome (UniHI; www.mdc-berlin.de/unihi) combines the data from many other databases. Y2H tests for binary interactions (between two proteins, bait and prey). It produces a yes/no observation. This is an ideal prerequisite for a graph representation of PPI data (with proteins as nodes and interactions as edges) and for subjecting it to graph-theoretical analysis (see Section 12.4.1). Unfortunately, PPI sets acquired in a large-scale high-throughput manner still contain a high percentage of false negatives (transient or low-affinity interactions may not be detected) as well as false positives, and thus come with uncertainty about whether the PPI actually occurs in vivo. Besides technical false positives, there are biological false positives: the two proteins found to interact may not encounter each other in vivo, depending on their abundance in the cell as well as on whether they are localized in the same compartments or expressed at the same time (spatial/temporal coexistence). Therefore, the method is usually combined with other methods, to validate the PPIs and assign a confidence score, in order to verify the network derived from PPI data. Another challenge is further characterization of the interactions. Y2H detects a physical association between two proteins, but what kind of interaction is it exactly? Is it a permanent or transient complex formation, a substrate-product relationship (one protein modifying another, as in the case of phosphorylation-dephosphorylation, where one interaction partner is a kinase or phosphatase), scaffolding (in which case the scaffold protein has a plethora of interactors), or something else altogether? Protein-protein interaction data is, nevertheless, useful as prior or additional knowledge when inferring signaling networks [9]. The domain composition of proteins forms the basis for a reduced form of protein interaction network, namely a domain interaction network that can be built from all known domain-domain interactions (DDIs) [10].
12.2.3
Transcriptomics
One major constituent in the process of gene expression is the messenger RNA (mRNA), a single-stranded RNA molecule that is the transcribed copy of a gene. It carries the sequence information of a gene out of the nucleus, where it is translated into its final product, the protein. The entirety of all mRNAs in the cell at a certain time point is called the transcriptome. The composition of the transcriptome is extremely dynamic. It highly depends on the internal and external (environmental) conditions of the cell. Nutrient conditions or physical stresses, such as a rapid increase in temperature, are examples of environmental factors that influence the composition of the transcriptome. Transcriptomics is the research field dealing with the characterization and analysis of the cell's transcriptome. It explores the dynamics of the transcriptome and the mechanism
regulating mRNA production (i.e., how genes are up- and down-regulated in response to external signals or how they interact with transcription factors). The transcriptome of a cell is studied by experimental methods that measure the presence and amount of specific mRNAs at a certain time point. Several different methods for monitoring the expression levels of large gene sets exist, but currently the most popular one is transcriptome detection by DNA microarrays. Microarrays are a high-throughput technology that exploits the fact that mRNA molecules hybridize specifically to complementary DNA copies. Those DNA copies, representing genes, are attached to an array, each copy forming a so-called spot. After the RNA is hybridized to the array, the expression levels of up to thousands of genes can be measured simultaneously by calculating the amount of mRNA bound to each spot. The measurement of a huge number of genes in a single experiment is called expression profiling. Many different array platforms and formats are available that implement the described array principle. The two most common platforms are the two-color array and the Affymetrix GeneChip technology. In Figure 12.5, the experimental methodology of a two-color array is illustrated. Another high-throughput approach is the Serial Analysis of Gene Expression (SAGE). The advantage of this method in comparison to the array technology is that it allows the exact measurement of any mRNA, known or unknown. Thus, new mRNAs can be identified. Two further methods for studying the transcriptome that are not considered high-throughput approaches are Northern blotting and real-time PCR.
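As an illustration of the data a two-color experiment yields per spot (a sketch with hypothetical, background-corrected intensities; the actual normalization pipeline of a given platform is more involved), the two channel intensities are commonly summarized as log2 ratios per gene:

```python
import numpy as np

# Hypothetical spot intensities for four genes:
# red channel = mutant sample, green channel = wild type sample
genes = ["geneA", "geneB", "geneC", "geneD"]
red   = np.array([1500.0,  300.0,  800.0, 4000.0])
green = np.array([ 750.0,  600.0,  820.0, 1000.0])

# log2 ratio per gene: > 0 means higher expression in the mutant,
# < 0 means higher expression in the wild type
log_ratios = np.log2(red / green)
for gene, ratio in zip(genes, log_ratios):
    print(f"{gene}: {ratio:+.2f}")
```

Collecting such log ratios over many arrays (conditions or time points) yields the gene expression matrix referred to in Figure 12.5.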
12.3 Approaches for Inference of Biological Networks

Some known and well-studied networks, such as the central carbohydrate metabolism (see paragraphs on glycolysis in Sections 12.1.2 and 12.1.4), were the result of a slow and painstaking step-by-step process, in which each biochemical reaction was detected and studied separately. Modern experimental technologies generate data that allows for deducing more comprehensive cellular interaction networks from fewer experiments. An example is the protein interaction networks built from the sets of PPIs detected by Y2H screening (see Section 12.2.2). These are static networks, but what is of even greater
Figure 12.5 Workflow of a microarray experiment. Two yeast cultures are grown, one mutant strain and one wild type strain, their mRNA is isolated, transcribed into cDNA, and labeled with different fluorescent dyes. The cDNA is then mixed and hybridized to a microarray. Each spot on the microarray corresponds to a gene; its fluorescence reflects the relative mRNA concentrations. The microarray is scanned and the resulting intensity values are stored in a gene expression matrix. (Figure reproduced with permission from [11].)
interest are networks that also consider the dynamics of interactions, and thus capture the behavior of the biological systems that the networks represent. For inferring such networks from experimental data, several approaches exist. These theoretical approaches use data of the kind described in Section 12.2 but also data from small-scale low-throughput experiments. Instead of building, or rather assembling, networks from prior knowledge of individual interactions, network inference involves an indirect deduction process. In the following, some approaches are looked at more closely: a genome sequence based approach, a discrete logical method (Section 12.3.2 on Boolean networks), and statistical and probabilistic methods (correlation metrics and Bayesian networks in Sections 12.3.3 and 12.3.4); the focus is then on approaches that use differential equations (Section 12.3.5). Besides these, further approaches have been developed and are used—for example, approaches based on fuzzy logic/mathematics (e.g., fuzzy inference engines [12]), difference equations (e.g., the LASSO tool [13]), stochastic ODEs, information theory, and numerical modeling with neural networks. They are not discussed in more detail in this chapter, and we refer the reader to the other chapters of this book or to textbooks.
12.3.1
Genome-scale metabolic modeling
This approach provides models of the complete metabolism of a cell. In contrast to the slow or “low-throughput” way of manually assembling a metabolic network from single biochemical reactions studied by enzymology, the development of genome-scale models is a more global approach. It is based on the sequenced genome of an organism, and on the assumption that all biochemical reactions that this organism can perform are encoded in the genome, namely through the genes that code for the enzymes that catalyze metabolic reactions. Analogously, the same is assumed for the transport processes between cell organelles (across organelle membranes) or between intra- and extra-cellular space (across the cell membrane). This approach, thus, reflects the central dogma of molecular biology. By taking into account all genes detected on the genome that encode enzymes and transporters, one can build a complete metabolic model. This approach is impaired by the fact that not all genes are annotated or some are annotated incorrectly (most annotations are made by homology to already annotated genes in other organisms' genomes), and as a result the network may have disconnected subnetworks or “orphan” or “dead-end” metabolites (see [14] for a discussion of problems with this approach). Up to now, genome-scale metabolic models have been published for several unicellular organisms (e.g., the bacteria M.tuberculosis [15], H.salinarum [16], and the yeast S.cerevisiae [17, 18]). Usually, these models of the entire cellular metabolism comprise several hundred or even more than a thousand biochemical reactions and transport steps (see Figure 12.6). There are initiatives to also provide genome-scale metabolic models for unicellular algae, as representatives of plant metabolism. So far, these networks are explored mainly by subjecting them to stoichiometric analysis or flux balance analysis (see Section 12.4.3). Nevertheless, the assignment of rate equations to each reaction with appropriate kinetic parameters, the prerequisite for dynamic simulations, will become feasible in the future.
Figure 12.6 Genome-scale metabolic models have been published for various organisms. G–number of genes incorporated in the model; R–number of reactions; M–number of metabolites. (Figure from [19]. Reproduced by permission of The Royal Society of Chemistry.)
Adding to the problem of incomplete or false annotation mentioned earlier, the cells of multicellular organisms often exhibit only part of the metabolism that is coded in their genome, and modeling specific cell types (e.g., human erythrocytes or hepatocytes) thus requires an adjustment of the above approach. (This can be true for microorganisms as well, if they show very different metabolism for different environmental conditions or different developmental states, such as the diauxic shift in E.coli or yeast.) For more details on genome-scale metabolic models, we refer the reader to Chapter 6.
12.3.2
Boolean networks
Boolean networks were first used by Stuart Kauffman in the 1960s to study randomly constructed networks consisting of binary-state nodes, which represented genes [20]. More than 40 years later, this network type is still used to infer biological networks in the field of reverse engineering. Gene regulatory networks, especially, are often reconstructed from gene expression data by employing Boolean networks. In general, gene regulatory networks modeled as Boolean networks are directed graphs, in which each node represents one gene. Nodes can adopt only two different states, namely 0 and 1. Consequently, one main feature of Boolean networks is the existence of discrete state values. A node with state value 0 represents the inactive form of the gene represented by the node, which means that the gene is currently not expressed. In contrast, a node with state value 1 stands for the active form, indicating that the gene is expressed. In addition to the binary representation of the genes, each gene of the network influences the behavior or state of one or several other genes. Those interactions, illustrated by directed edges in the Boolean network, are modeled by Boolean (logical) functions. Each node/gene is assigned one of those functions, such that the state of each particular gene (0 or 1) at time point t + 1 depends on the states at time point t of the genes regulating it. At each time step, all genes are updated synchronously according to their assigned function, which means that all genes transition to a new state. A table consisting of all possible state values before and after transition is called a state transition table (see Figure 12.7). If the network structure is unknown, reverse engineering methods have to be defined that detect Boolean relations from gene expression data sets measured at at least two time points. The first step of this procedure is always the discretization of the experimentally obtained gene expression profiles into “ON” and “OFF” states by using a numerical threshold. The discrete values are written in the state transition table. It is used to compute the Boolean functions with which the expression profiles can be described. Finding the Boolean functions is the most demanding step, for which different algorithms were developed in recent years. Two popular methods that address this problem are the REVEAL [21] and the BOOL-2 [22, 23] algorithms. In order to detect the gene interactions, REVEAL analyzes the mutual information between the input and output states of the measured data. An additional extension of that algorithm allows constructing multistate models, so that the network states are no longer limited to 0 and 1. The second algorithm, BOOL-2, detects those Boolean functions which explain the influence of input states on corresponding nodes with a probability higher than a certain threshold [24].
(a) [Graphical illustration of the interactions among the three nodes X1, X2, and X3.]

(b) State transition table:

Input states (at timepoint t)    Output states (at timepoint t+1)
X1  X2  X3                       X1  X2  X3
0   0   0                        0   0   1
0   0   1                        0   1   1
0   1   0                        1   0   1
0   1   1                        1   1   1
1   0   0                        0   0   0
1   0   1                        0   0   0
1   1   0                        1   0   0
1   1   1                        1   0   0

(c) Boolean functions:
X1 = X2
X2 = NOT X1 AND X3
X3 = NOT X1
Figure 12.7 A Boolean network can be represented differently. Shown is a network that consists of three interacting nodes only: (a) graphical illustration of the interactions; (b) state transition table of all possible state values; and (c) Boolean functions, one for each node.
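A minimal sketch (assuming Python) that implements the three Boolean functions of Figure 12.7(c) and reproduces the state transition table of Figure 12.7(b) by synchronously updating all nodes:

```python
from itertools import product

# Boolean functions of Figure 12.7(c); states are 0 (not expressed) or 1 (expressed)
def update(state):
    x1, x2, x3 = state
    new_x1 = x2                      # X1 = X2
    new_x2 = int((not x1) and x3)    # X2 = NOT X1 AND X3
    new_x3 = int(not x1)             # X3 = NOT X1
    return (new_x1, new_x2, new_x3)  # synchronous update of all nodes

# Enumerate the state transition table over all 2^3 possible input states
print("input (t) -> output (t+1)")
for state in product([0, 1], repeat=3):
    print(state, "->", update(state))
```

In network inference the direction is reversed: the table is given (after discretizing measured expression profiles), and the update functions have to be found.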
One main problem in finding those Boolean functions is that the computational complexity grows exponentially with the number of nodes within the network. For this reason many existing methods are limited by the number of arguments of each function, which means that the number of genes influencing each other is limited [24]. For k arguments, there exist $2^k$ possible input states and altogether $2^{2^k}$ possible Boolean functions. As a direct result of the model structure of Boolean networks, there are advantages and disadvantages to using Boolean networks for gene network inference. As described earlier, the discretization of gene states is a central feature of Boolean networks. On the one hand, discretization is tantamount to a loss of information, and this information reduction, together with the need to limit the number of arguments, may lead to erroneous network inference results. On the other hand, discretization may be an advantage if one tries to reconstruct a network when only noisy expression data is available. Two extensions of the basic Boolean network are the probabilistic Boolean networks and the temporal Boolean networks. In a probabilistic Boolean network, the stochastic nature of gene expression and the noise of experimental data are considered by introducing probabilistic features into the network behavior. Thus, in a probabilistic Boolean network, each node can be assigned more than one Boolean function. Each function is chosen according to a certain probability. The function with the highest probability will determine the state of the gene at the next time point. There are several publications that provide further information about probabilistic Boolean networks and their application in gene network inference [25, 26]. In contrast to the above, a temporal Boolean network offers the possibility of modeling latency periods between the expression of a gene and the observation of its effect. Therefore, in a temporal Boolean network, the state of a gene at time point t + 1 need not depend only on gene states at time point t; it can be controlled by a Boolean function of gene states at several earlier time points. Temporal Boolean networks are described in detail in [27].
Other types of networks can deal with continuous data as well. They are described in Sections 12.3.3 and 12.3.5.
12.3.3
Network topology from correlation or hierarchical clustering
The construction of reaction networks from correlation metrics is a statistical method. It comprises the calculation of the correlation for each pair of metabolites from metabolomics data (see Section 12.2.1), and uses the obtained correlation metrics to construct metabolic pathways. It interprets metabolic profiles in terms of the underlying biochemical network of reactions and regulations. The rationale behind this is that correlated metabolites have a good probability of being functionally related (i.e., being substrate and product of one and the same enzyme-catalyzed reaction) or being linked by only a few steps in a metabolic network. Measurement data for the detection of correlations between different metabolites can be obtained by conducting several experiments with different setups, for example by varying the environmental conditions or by periodically forcing the system by changing one variable over time. The middle panel in Figure 12.8 depicts exemplary metabolite versus metabolite scatter-plots, in which each dot corresponds to a simultaneous measurement of two metabolite concentrations (in arbitrary units) within a single sample. The relationship between metabolites is assessed using the Pearson correlation coefficient $r_{X,Y}$. The formula for an empirical correlation coefficient, given two series of n data points $(X_i, Y_i)$, reads:
$$r_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \cdot \sigma_Y} = \frac{\frac{1}{n} \sum_{i=1}^{n} \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right)}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \cdot \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2}}$$
The correlation coefficients range from −1 to +1, and are close to zero in case of no detectable correlation. They can be visualized by colors in so-called heat maps. Figure 12.8 illustrates the workflow from metabolomics data via correlation metric to correlation network (sometimes also called association network).
Figure 12.8 The workflow for network inference using correlation metrics is shown in a schematic overview. Metabolite correlations are derived from metabolomics data (left and middle panel). Two metabolites are connected in a correlation network if their pair-wise correlation exceeds a given threshold. A more detailed description can be found in [28].
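A minimal sketch (Python with NumPy; the metabolite profiles are hypothetical) of the workflow in Figure 12.8: compute all pair-wise Pearson correlation coefficients and connect two metabolites whenever the absolute correlation exceeds a chosen threshold:

```python
import numpy as np

# Hypothetical metabolomics data: rows = samples (conditions), columns = metabolites
metabolites = ["M1", "M2", "M3", "M4"]
X = np.array([[1.0, 2.1, 0.5, 3.0],
              [1.5, 3.0, 0.4, 2.8],
              [2.0, 4.2, 0.6, 1.5],
              [2.4, 4.9, 0.5, 1.1],
              [3.1, 6.3, 0.7, 0.9]])

# Matrix of pair-wise Pearson correlation coefficients (columns as variables)
R = np.corrcoef(X, rowvar=False)

# Edges of the correlation (association) network: |r| above a chosen threshold
threshold = 0.8
edges = [(metabolites[i], metabolites[j], round(R[i, j], 2))
         for i in range(len(metabolites))
         for j in range(i + 1, len(metabolites))
         if abs(R[i, j]) >= threshold]
print(edges)
```

The choice of threshold directly controls how dense the inferred network is and is therefore part of the analysis rather than a fixed constant.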
Computational metabolic modeling can do more than capture the already-known pathway structure. In a study of Chlamydomonas reinhardtii, it allowed the identification of missing enzymatic links [29]. It can also lead to the proposal of enzymes not yet annotated in this particular organism or the proposal of new, previously unknown connections between intermediates (hypothetical enzymes). A comparative genomics tool for discovering unknown metabolic pathways in organisms is pathway inference through pattern matching. In this technique, known pathways are modeled as biological functionality graphs of gene ontology (GO)-based functions of enzymes (pathway functionality templates); these are used to locate frequent functionality patterns, and through pattern matching this allows one to infer previously unknown pathways in metabolic networks [30]. Hierarchical clustering is a similar method for inferring gene networks from gene expression profiles. Relationships among genes are represented by a tree whose branch lengths reflect the degree of similarity between genes, as assessed, for example, by a pair-wise similarity function such as the Pearson correlation coefficient. The rationale behind the use of correlation is that correlated genes may be functionally related. A review of inferring GRNs through clustering of gene expression data can be found in [31]. On the other hand, some claim that it is useful for finding coexpressed genes but not for network inference [32]. Clustering expression data into groups of genes that share profiles allows for grouping functionally related genes but does not order pathway components according to physical or regulatory relationships [33]. Network inference through clustering is not only applied to gene expression profiles but also to protein profiles or metabolic profiles. Here, more generally, a profile relates the measured component to the different conditions that were applied. This allows one to apply the method to time series data, for example, of metabolites [34, 35]. Modeling signal transduction networks becomes feasible by integrating protein-protein interaction and gene expression data, as shown in an example in S.cerevisiae [33]. The method was further advanced by [36].
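As a sketch of hierarchical clustering of expression profiles (assuming Python with SciPy; the profiles are hypothetical), using 1 minus the Pearson correlation as the pair-wise distance between genes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical expression profiles: rows = genes, columns = conditions/time points
genes = ["g1", "g2", "g3", "g4"]
E = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.1, 2.1, 2.9, 4.2],
              [4.0, 3.0, 2.0, 1.0],
              [3.9, 3.1, 1.8, 1.2]])

# Pair-wise distance = 1 - Pearson correlation between gene profiles
corr = np.corrcoef(E)
dist = squareform(1.0 - corr, checks=False)

tree = linkage(dist, method="average")            # the tree (dendrogram) over the genes
labels = fcluster(tree, t=2, criterion="maxclust")
print(dict(zip(genes, labels)))                   # genes grouped by profile similarity
```

As noted above, such clusters group coexpressed (and possibly functionally related) genes; they do not by themselves order pathway components or assign regulatory direction.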
12.3.4
Bayesian networks
A Bayesian network is a probabilistic description of a regulatory network. It is a marriage of probability theory and graph theory in which dependencies between variables are expressed graphically. It is named after Bayes' theorem, which is used for the calculation of conditional probabilities and reads:

$$P(X_1 \mid X_2) = \frac{P(X_2 \mid X_1) \cdot P(X_1)}{P(X_2)}$$

where $P(X_1 \mid X_2)$ is the conditional probability of $X_1$, given that $X_2$ is true. According to basic probability theory, the joint probability can be factored as a product of conditional probabilities such that

$$P(X_1, X_2) = P(X_2 \mid X_1) \cdot P(X_1) = P(X_1 \mid X_2) \cdot P(X_2)$$

A Bayesian network is a graphical model for probabilistic relationships among a set of continuous or discrete random variables $X_i$. This relationship is encoded by two components.
12.3
Approaches for Inference of Biological Networks
The first component is a directed acyclic graph G(V, E) consisting of a set V = {X1, …, Xn} of nodes and a set E of the directed edges between these nodes. $X_i \rightarrow X_j$ means that $X_j$ belongs to the children ch($X_i$) of $X_i$ or, in other words, $X_i$ belongs to the parents pa($X_j$) of $X_j$. Thus, $X_j$ is a descendant of $X_i$ if there is a path from $X_i$ to $X_j$. (In acyclic graphs, no node is a descendant of itself.) The nodes in a Bayesian network represent measured variables of interest (e.g., genes or proteins). The edges represent informational or causal dependencies among the variables. For example, if i is a gene, then $X_i$ will describe the expression level of i. The second component is the relationship between the variables, which is described by a set P of n conditional probability distributions of the form $f_i(X_i \mid pa(X_i))$. From the Markov assumption (i.e., each $X_i$ is conditionally independent1 of its nondescendants given its parents) it follows that the distribution f(X) factorizes with reference to the graph, and the joint probability distribution can be decomposed into

$$f(X) = \prod_{i=1}^{n} f_i(X_i \mid pa(X_i))$$

Hence, the joint distribution emerges from the relationship of the parents of each random variable as well as the conditional distributions $P = \{ f_i(x_i \mid pa(x_i)) \}_{i=1,\ldots,n}$ [37]. In the example network shown in Figure 12.3(b), the random variables $X_i$ are the five nodes V = {X1 = 1, X2 = 2, X3 = 3, X4 = 4, X5 = 5}, and the set of edges is E = {(1,3); (1,4); (2,4); (4,5)}. The joint probability distribution of G(V, E), then, has this form:

$$P(X) = P(1) \times P(2) \times P(4 \mid 1,2) \times P(3 \mid 1) \times P(5 \mid 4)$$

In order to reverse-engineer a Bayesian network model of a gene regulatory network, one must find the directed acyclic graph G (i.e., the regulators of each transcript) that best describes the gene expression data D, where D is assumed to be a steady state data set. For this, all possible graphs G must be evaluated for the probability that the data D has been generated by the graph G. In case of previous knowledge from the biological background (i.e., if the classification of genes into functional groups is known), this can be integrated by means of an a priori description, and the search space reduced. To traverse the search space of all possible graphs, heuristics are used. Three algorithms for Bayesian network inference are Variable Elimination, Likelihood Weighting, and Gibbs Sampling. These algorithms are implemented in the Mocapy Toolkit, Bayes Net Toolbox, and Deal [24]. In summary, Bayesian networks are suitable for statistical models with measurement errors and minimal parameterization. For the nodes, discrete and continuous variables are possible. The advantage is that samples with missing values and latent variables can be integrated. The disadvantage of Bayesian network modeling is that mutual dependencies (cycles) between variables cannot be modeled. Cycles, especially feedback loops, are a pattern often encountered in biological and other regulatory networks, and they are fundamental to their operation. Bayesian network analysis is often used to infer gene regulatory networks, but there are also examples of inferring signaling networks [38].
1 Here, the conditional independence of random variables $X_i$ and $X_j$, given a random variable $X_k$, means that $P(X_i, X_j \mid X_k) = P(X_i \mid X_k) \times P(X_j \mid X_k)$.
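To make the factorization above concrete, a small sketch (Python; the conditional probability tables for the binary variables are hypothetical and serve only to illustrate the decomposition) for the network of Figure 12.3(b):

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables X1..X5 of Figure 12.3(b);
# the joint probability factorizes as P(X) = P(1)*P(2)*P(4|1,2)*P(3|1)*P(5|4)
p1 = 0.3                                             # P(X1 = 1)
p2 = 0.6                                             # P(X2 = 1)
p3_given_1 = {0: 0.2, 1: 0.9}                        # P(X3 = 1 | X1)
p4_given_12 = {(0, 0): 0.1, (0, 1): 0.4,
               (1, 0): 0.5, (1, 1): 0.8}             # P(X4 = 1 | X1, X2)
p5_given_4 = {0: 0.3, 1: 0.7}                        # P(X5 = 1 | X4)

def bern(p_true, value):
    return p_true if value == 1 else 1.0 - p_true

def joint(x1, x2, x3, x4, x5):
    return (bern(p1, x1) * bern(p2, x2) *
            bern(p3_given_1[x1], x3) *
            bern(p4_given_12[(x1, x2)], x4) *
            bern(p5_given_4[x4], x5))

# Because every factor is a proper conditional distribution,
# the factored joint sums to one over all 2^5 states (up to floating point)
print(sum(joint(*s) for s in product([0, 1], repeat=5)))
```

Structure learning, as described above, would search over alternative parent sets for each node and score how well the corresponding factorization explains the data.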
The normal Bayesian method works with data sets from experiments with differing conditions or differing cell strains, but not with time series data. An extension for time series data is provided by dynamic Bayesian methods, which allow cycles and therefore also feedback [39]. By means of dynamic Bayesian networks, an extension to longitudinal data (observed over time as well as over space) and to feedback modeling is possible. In dynamic Bayesian networks, nodes are allowed to be repeated over time. A dynamic Bayesian network is a general state-space model to describe stochastic dynamic systems. In comparison, a Bayesian network structure corresponds to a first-order Markov process with states defined by the variables Xt [see Figure 12.9(a)], whereas a dynamic Bayesian network structure corresponds to a second-order Markov process [see Figure 12.9(b)]. Two special cases of dynamic Bayesian networks are hidden Markov models and Kalman filter models. Hidden Markov models are temporal probabilistic models in which the state of the process is described by a single discrete random variable. This is the simplest type of dynamic Bayesian network. Kalman filter models have the same topology as hidden Markov models. All nodes are assumed to have linear-Gaussian distributions. Kalman filter models are the simplest continuous dynamic Bayesian networks. Finally, a Bayesian network with both static and dynamic nodes is called a partially dynamic Bayesian network, also known as a temporal Bayesian network [39]. The software Banjo is based on the Bayesian network formalism and implements both Bayesian networks and dynamic Bayesian networks [40]. It can infer gene networks from steady state gene expression data or from time series gene expression data.
12.3.5
Ordinary differential equations
If determination of the quantitative interactions between the components of a network is an important issue, differential equations come into the picture. In this approach, networks of metabolic reactions, signaling networks, and gene regulatory networks are described by a system of ordinary differential equations (ODEs) of the form:

$$\frac{dX}{dt} = f(X(t), p)$$

where $X = [X_1, \ldots, X_n]^T$ is the state vector containing the amounts/concentrations, activities, or expression levels of all components in the network (with nonnegative values $X \in \mathbb{R}^n_+$), and where $p = [p_1, \ldots, p_n]^T$ is the parameter vector containing all adjustable parameters of the biological system under consideration, such as rate constants. The function f(X, p) determines the dynamics of the network given the states and parameters.
Figure 12.9 Two types of Bayesian network structures: (a) normal Bayesian network structure; and (b) dynamic Bayesian network structure.
In cases of small molecular concentrations and/or low levels of diffusion, partial differential and stochastic equations may be required, but this is outside the scope of this chapter. Uncovering the differential equations that best describe a biological system directly from observations is a challenging task. When modeling the behavior of biological networks with differential equations, there is a basic formula that describes the rate of change of the amount/concentration of a single variable (also called species) as a (generally nonlinear) function of the state of the variables in the system and of the set of parameters. This basic formula or rate equation reads:

$$\frac{dX_i}{dt} = \sum_j \sigma_{ij} \cdot \gamma_j \cdot \prod_k X_k^{g_{jk}}$$

where the $\sigma_{ij}$ are the stoichiometric coefficients ($\sigma_{ij} \in \mathbb{Z}$), $\gamma_j$ is the rate constant ($\gamma_j \in \mathbb{R}_+$), and $g_{jk}$ is the kinetic order ($g_{jk} \in \mathbb{R}$). The stoichiometric coefficient $\sigma_{ij}$ is nonzero only if there is a direct interaction that relates the states $X_i$ and $X_j$. The $\gamma_j$ and $g_{jk}$ are parameters. Assembly of the dynamic information for all species in the modeled system results in a system of ODEs with usually one equation per species. The rate equation for each species contains terms with positive and terms with negative stoichiometric coefficients. These can be summed into the formation or synthesis flux $v_i^+$ and the degradation flux $v_i^-$, and the rate equation then reads:

$$\frac{dX_i}{dt} = v_i^+(X) - v_i^-(X)$$

The rate equations may obey further restrictions with regard to the kinetic orders (see Figure 12.10). Models containing the various types of rate equations are classified into conventional kinetic models and power law models. Section 12.3.5.1 will focus on network inference on the basis of conventional kinetic models, on a small scale. It is followed by a section on power-law modeling and one on automated reverse engineering, and it concludes with a section on parameter estimation.
12.3.5.1 Identification of small-scale biochemical networks

The method for identifying the dynamic interactions between biochemical components within the cell, as proposed by [42], considers the system in the neighborhood of a steady state. It assumes that the system behaves linearly for small variations around this steady state. The interactions in a linear system can be described in the form of an interaction matrix, or Jacobian matrix J. Considering that the system is in the vicinity of a steady state $X_0$, assuming that the system behaves linearly around $X_0$, and thus truncating the Taylor expansion of $\frac{dX}{dt} = f(X(t), p)$, gives

$$\frac{d\Delta X}{dt} = \left. \frac{\partial f}{\partial X} \right|_{X_0} \Delta X(t) = J \, \Delta X(t)$$

where $\Delta X = X(t) - X_0$. The procedure is to experimentally determine the elements of J by systematic perturbations of the system (perturb concentrations periodically or in pulses, or perturb parameters), and then deduce the network from this matrix.
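To make the perturbation idea concrete, a sketch (Python with NumPy; the three-component linear system and the perturbations are hypothetical, and this is not the exact estimation procedure of [42]) of recovering J from perturbation-response data by linear least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" Jacobian of a 3-component system (used only to simulate data)
J_true = np.array([[-1.0,  0.5,  0.0],
                   [ 0.0, -0.8,  0.3],
                   [ 0.4,  0.0, -1.2]])

# Apply small perturbations dX around the steady state and "measure" the resulting
# rates of change d(dX)/dt = J dX, here with a little added measurement noise
dX = 0.05 * rng.standard_normal((10, 3))                   # 10 perturbation experiments
dXdt = dX @ J_true.T + 1e-3 * rng.standard_normal((10, 3))

# Linear least-squares estimate of J from the perturbation data
B, *_ = np.linalg.lstsq(dX, dXdt, rcond=None)
J_est = B.T
print(np.round(J_est, 2))   # nonzero entries indicate (directed) interactions
```

With fewer perturbation experiments than network components the problem becomes underdetermined, which is the practical limitation discussed below.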
Figure 12.10 Kinetic models can be classified according to the form of their rate equations [41].
In principle, one perturbation experiment is sufficient to determine J, but in practice it is preferable to perform more than one experiment and to combine the obtained measurement results to estimate the elements of the Jacobian matrix. The method is based on linear least-squares estimation. It can deal with data obtained under perturbations of any system parameter, not only concentrations of specific components. It requires that the number of samples equals at least the number of network components, and hence, the method is restricted to relatively small networks. Another problem may arise because in real experiments it will be difficult to apply perturbations small enough to remain close to the steady state, as assumed. Realistic experimental settings usually involve a considerable perturbation of the variables (e.g., changes in expression levels of proteins by up- or down-regulation of genes would be more on the order of −50% or +100%, and so forth). Two more ODE-based approaches for inference of the structure of biomolecular networks are the method described by [43] and the method proposed by [44]. In the Kholodenko method, the network nodes are modules that have only one output and contain at least one intrinsic parameter that can be directly perturbed without the intervention of other nodes or parameters. It represents a distinct and fresh approach to the problem of identifying a gene network based on stationary experimental data [45], and it can reveal functional interactions even when the components of the system are not all known. The Sontag method is complementary to the Kholodenko method. It is based on time series measurements, which is an advantage when steady state data are not available. Here, the basic concept is to analyze the direct effect of a small change in one network node on the activity of another node, while keeping all remaining nodes (variables) “frozen.” This method also enables quantification of the interaction strength, and thus, can also be of use when the network structure is already known [44].
Finally, a combination of the two approaches is described in [45]. This approach is based on stationary and/or temporal data obtained from parameter perturbations, and it unifies the previous approaches of Kholodenko and Sontag [43, 44]. It also aims at improving experimental design by giving guidance on which parameters to perturb and to what extent to perturb them, and by answering questions of sample rate and sample time selection. One methodology capable of identifying the correct functional interaction structure with only a few sampling points through relatively simple computations is the one described by [7]. It uses only simple algebra based on the Mean Value Theorem. The authors also provide guidelines for an experimental design capable of supporting this methodology by taking proper measurements of the direct influences among the network nodes.
12.3.5.2 Power law modeling

A type of model in which real numbers are allowed for the kinetic orders (as opposed to conventional kinetic models, which allow only integers; see Figure 12.10) is the power law model [46–48]. This is seen as an advantage by modelers who prefer this type of model for modeling biological processes, because of the real conditions within a cell. The cytoplasm, for example, is an extremely inhomogeneous reaction space. Numerous biochemical reactions take place simultaneously and between molecules of very different size, geometry, and complexity. The fact that all these molecules basically take up all available space inside the cell, which by no means resembles the conditions assumed for classical reaction kinetics (free diffusion of reactants, and the proportionality of reaction rate to the probability of collision between reacting molecules), is emphasized by the notion of molecular crowding [49, 50]. Admitting real-valued kinetic orders can account for inhomogeneity, volume exclusion effects, and so on. Thus, it allows accounting for the properties of the cytoplasm (or other compartment) and makes the approach more acceptable and valid. The power law modeling approach is derived from fractal kinetic theory [51]; high values for kinetic orders correspond to more spatial restriction [Figure 12.11(a)] (i.e., to more molecular crowding). The most remarkable property of power law models is that a power law term can vary from the description of an inhibitory process to the description of cooperativity by just modifying the value of the kinetic orders [41]. Negative values for the kinetic order represent inhibition, while a zero indicates that the variable does not affect the described process [see Figure 12.11(b)]. When positive values are considered for a kinetic order, several alternatives are possible: a kinetic order equal to one means that the system is reproducing a perfectly linear (conventional kinetic) behavior; values between zero and one represent a saturation-like behavior for the rate modeled; finally, with values higher than one the rate equation models cooperative processes. This property allows the modeler to evaluate different hypotheses concerning the nature of interactions without modifying the formal structure of the equations. This property of power-law models has been exploited in recent times in gene networks where no initial information is available about the nature of the interactions between the compounds of the network or even about whether these interactions actually exist [23, 53–55].
Figure 12.11 The power law modeling approach allows for noninteger kinetic orders. (a) Fractal-like kinetic orders for a single-reactant biomolecular reaction. (Figure adopted from [52].) (b) Values for kinetic order and implications for the associated process.
In Kikuchi et al. [53], a conventional parameter estimation genetic algorithm was refined to investigate the structure of unknown gene networks using dynamical data and S-system models [48], a subclass of power-law models. The objective function contains a term that drives to zero those kinetic orders which the method detects to have no actual influence on the dynamics (described by the experimental data) of the other state variables of the model. In this way, the nonexistent interactions are automatically discarded and the actual structure of the network can be elucidated. In summary, in the beginning a fully connected network is assumed, in which each variable affects every other, and in the course of the parameter estimation (i.e., optimizing the set of parameters so that the model reproduces the measured data), links are removed if their kinetic order is estimated to be zero. So, in this approach, both the network structure and the reaction parameters are inferred at the same time. A disadvantage is the high number of parameters that need to be estimated. For a network with n state variables, there are n(n + 2) parameters in the initial network. This may cause a dimensionality problem, where too many network nodes are met by too few available sampling points. One way to reduce the dimensionality problem may be to incorporate prior knowledge (i.e., exclude certain interactions or start with an initial guess of the network structure taken from one of the pathway databases) (see Section 12.1.2).
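For illustration, a sketch (Python with SciPy; the two-variable system and all parameter values are hypothetical) of simulating an S-system, in which each rate is the difference of two products of power-law terms with real-valued kinetic orders:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical 2-variable S-system:
# dX_i/dt = alpha_i * prod_j X_j**g_ij  -  beta_i * prod_j X_j**h_ij
alpha = np.array([2.0, 1.0])
beta  = np.array([1.0, 1.0])
g = np.array([[0.0, -0.5],     # negative kinetic order: X2 inhibits production of X1
              [0.5,  0.0]])    # value between 0 and 1: saturation-like dependence on X1
h = np.array([[0.6, 0.0],
              [0.0, 0.4]])

def s_system(t, X):
    production  = alpha * np.prod(X ** g, axis=1)   # power-law production terms
    degradation = beta  * np.prod(X ** h, axis=1)   # power-law degradation terms
    return production - degradation

sol = solve_ivp(s_system, (0.0, 20.0), y0=[1.0, 0.5], dense_output=True)
print(sol.y[:, -1])   # approximate steady-state concentrations
```

In a structure-inference setting of the kind described above, the entries of g and h would themselves be estimated, and entries driven to zero would remove the corresponding links.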
12.3.5.3 Automated reverse engineering

The method of automated reverse engineering allows one to automatically infer the equations for a nonlinear coupled dynamical system directly from time series data of an unknown/hidden biological system [56], assuming that the time series of all variables is observable. Both the structure and the parameters of the ordinary differential equations can be determined. This is achieved by an evolutionary-algorithm-like procedure that includes the evolving of model structures and the evolving of tests to disprove as many of the candidate models as possible [57]. The method applies a basic estimation-exploration algorithm that comprises three phases: (1) in the experimental phase, time series data is obtained from the target system by experiments; (2) in the model phase, a population of symbolic models is generated
from basic operators and operands—the models that satisfy the data and explain the observed behavior are called candidate models; (3) in the test phase, new sets of conditions are generated that induce maximal disagreement in the predictions of the candidate models and so disambiguate competing models—these are called intelligent tests. The target system is then (in a new experiment) perturbed according to the best test in order to extract new behavior from the hidden system. Thus, an iterative cycle of hypothesis formation and experiment is created that is characteristic of the scientific method. The equations that form the models are represented as nested subexpressions, as is shown for an example differential equation in Figure 12.12(a). This provides a hierarchical description of the equations that can be visualized (e.g., in a tree [see Figure 12.12(b)]). Through this encoding, the expressions can become very large, and the depth of the trees should be limited by a maximum depth of subexpressions, which then also defines the maximal allowed model complexity. To initialize the algorithm to a hidden system, the experimenter must specify the number of components (variables) of the system, how they can be observed, the possible number of operators (algebraic, analytic, or other domain-specific functions) and operands (variables and parameter ranges) that can be used to describe the behavior, relationships between these operands, the allowed model complexity (maximum depth of subexpressions), the space of possible tests (the sets of initial conditions of the variables), and the termination criteria (e.g., generations, fitness, or run time). Candidate models can now be generated by “growing” trees. The trees are grown by randomly selecting symbols from the operators and operands. The probability of selecting from each set depends on the maximal allowed complexity. Three main features of the algorithm are partitioning, automated probing, and snipping. Partitioning allows the algorithm to model equations describing each variable separately, even though their behavior may be coupled. The candidate models of several variables are integrated and evolved one equation at a time. The references to other variables are replaced with real system data. This reduces the search time and increases the accuracy. Automated probing is an algorithm that automatically generates tests that seek out previously unseen behavior of the target system. This forces the models to evolve faster towards the target system. It is a kind of intelligent testing and active learning process.
Figure 12.12 Models are represented as nested subexpressions in the automated reverse engineering approach. (a) Encoding of differential equations as nested subexpressions (prefix notation). (b) Description of the nested subexpression as hierarchical trees. The maximal complexity of a model is determined by the depth of its tree.
Snipping automatically simplifies and restructures models during optimization. As mentioned earlier, the equations are represented as symbolic trees with nested subexpressions. During the evolution of the model population, the subexpressions can become very large, which is a symptom of over-fitting. Snipping replaces subexpressions that reach a certain maximum depth of nesting with a constant value, provided the range of values the subexpression takes is sufficiently small. It is a kind of Occam's razor process that improves the accuracy and human readability of the evolved equations. The advantage of automated reverse engineering is that both the structure and the parameters of the ordinary differential equations can be determined. Furthermore, it is applicable to any system that can be described by sets of nonlinear ordinary differential equations. The disadvantage is that it assumes that the time series of all variables are observable, or, to put it the other way around, only observed species are included in the inferred model. Automated reverse engineering could become an important method for discovering hidden interactions in molecular pathways.
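The sketch below illustrates the nested-subexpression (tree) encoding and the role of the maximum depth; it is a toy illustration under assumed operator and operand sets, not the published implementation of [56, 57].

```python
import random
import operator

# Assumed operator and operand sets; in the published method these are
# specified by the experimenter for the system at hand.
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
VARS = ['x', 'y']

def grow(depth, max_depth):
    """Randomly grow a nested subexpression (prefix tree) up to max_depth."""
    if depth >= max_depth or random.random() < 0.3:
        # Leaf node: a variable or a random constant operand.
        return random.choice(VARS + [round(random.uniform(-2, 2), 2)])
    op = random.choice(list(OPS))
    return (op, grow(depth + 1, max_depth), grow(depth + 1, max_depth))

def evaluate(expr, env):
    """Evaluate a tree against the variable values given in env."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env[expr] if isinstance(expr, str) else expr

# One candidate right-hand side for, e.g., dx/dt, limited to depth 4.
rhs = grow(0, max_depth=4)
print(rhs)
print(evaluate(rhs, {'x': 1.0, 'y': 0.5}))
```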
12.3.5.4 Parameter estimation
In the power-law approach as well as in automated reverse engineering, the networks are inferred complete with their set of parameters, but some approaches identify only the network structure in a first step. In this case, the system's parameters still need to be estimated. The estimation of the parameters of a differential equation model, in particular from time series data, can be formulated as an optimization problem: given a differential equation model dX/dt = f(X(t), p) and measured time series data D = (X̃_i(t_j)), i = 1, ..., n, j = 1, ..., T, the task is to estimate values for the parameter vector p by minimizing an objective function F(p, D). A common choice for the objective function is the sum of squared errors between measurement and model prediction. The problem can in general be formulated as a prediction problem with m observations having outcomes X̃(t_1), X̃(t_2), ..., X̃(t_m), and p predictors X_ij, i = 1, ..., n, j = 1, ..., p. Established parameter estimation methods are, for example, multiple shooting [58], Markov chain Monte Carlo, or evolutionary algorithms such as genetic algorithms [59].

For kinetic parameters, there is sometimes an alternative to parameter estimation. It is possible to populate a model with kinetic parameters that are specifically measured in experiments designed for this purpose, or with parameters retrieved from databases such as BRENDA (www.brenda-enzymes.org) or SABIO-RK (sabio.villa-bosch.de/SABIORK). Such databases collect the kinetic parameters that are published in the scientific literature. Usually, not all parameters can be found in either way, but the use of already-known parameters at least reduces the dimensionality of the parameter estimation problem significantly.
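As a minimal illustration of this least-squares formulation, the sketch below fits a hypothetical one-variable model dx/dt = p1 − p2·x to synthetic "measurements"; the model, parameter values, and noise level are assumptions chosen for demonstration only.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

# Hypothetical model dx/dt = p1 - p2*x fitted to synthetic noisy data;
# in practice, D would be the experimentally measured time series.
t_obs = np.linspace(0.0, 10.0, 15)
p_true = np.array([1.0, 0.5])

def simulate(p, x0=0.0):
    sol = solve_ivp(lambda t, x: p[0] - p[1] * x,
                    (t_obs[0], t_obs[-1]), [x0], t_eval=t_obs)
    return sol.y[0]

rng = np.random.default_rng(0)
data = simulate(p_true) + rng.normal(0.0, 0.05, t_obs.size)

# F(p, D): residuals between model prediction and measurement;
# least_squares minimizes their sum of squares.
fit = least_squares(lambda p: simulate(p) - data,
                    x0=[0.5, 0.2], bounds=([0, 0], [10, 10]))
print(fit.x)   # estimated parameters, close to p_true for this toy example
```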
12.4 Network Biology—Exploring the Inferred Networks
Once a network is determined by one of the methods introduced in the previous sections or by any other method, it can be compared to other known networks, analyzed (in its entirety or by decomposition), or its dynamics simulated. This interrogation is
done in order to understand the relation between network topology and the ability of the system to exhibit certain kinds of dynamic behavior. Comparative biological network analysis, applied by contrasting networks of different species or under different conditions, facilitates the validation and interpretation of inferred networks, as well as the study of questions about network evolution [60]. Some often-applied methods of network analysis are specific to networks that are represented as graphs: calculation of network measures and network classification (discussed in Section 12.4.1), and network decomposition and detection of motifs, modularity, and hierarchical organization (discussed in Section 12.4.2). Also, mainly for metabolic networks, there is structural analysis: if the network stoichiometry is known and a steady state can be assumed, many interesting conclusions can be drawn from an analysis of the matrix of stoichiometric coefficients (Section 12.4.3). And finally, there is the simulation of network dynamics (Section 12.4.4). Network properties that are of interest when exploring inferred networks are: basic performance, ability to withstand trauma (i.e., structural robustness), versatility of allowed interconnections, reusability of components and/or modules, and response to changed conditions.
12.4.1 Graph theory
One possibility to retrieve meaningful information encoded in known biological networks is to apply graph theory. Graph theory is a branch of mathematics concerned with the theoretical analysis and comparison of graphs and with the topology of the complex networks represented by them. It offers a variety of measures for studying the structural properties of already inferred networks and for comparing the architectural features of different networks, provided that the networks can be represented as graphs. One of the most basic characteristics of a network (and easy to evaluate) is the degree distribution. The degree k of a node specifies how many edges connect this node to other nodes. In a biological network, it thus indicates the number of interaction partners of a protein, gene, or metabolite. The degree distribution P(k) is defined as the probability that a specific node has exactly degree k. It has been shown that simple random graphs lead to a connectivity distribution that follows a Poisson distribution. Random graphs, developed in the 1960s, are built by starting with a certain number of unconnected nodes in the first step and connecting each pair of nodes with an edge with the same probability in the second step. A basic research result of recent years is that many real networks do not show this Poisson connectivity distribution known from random networks. Instead, P(k) of real networks, including biological networks, is frequently found to follow a scale-free distribution, P(k) ~ k^(−γ). Networks with a scale-free degree distribution are also called scale-free networks. The topology of these networks implies that the network contains a large number of nodes with a small degree and a small number of nodes with a high degree. The highly linked nodes of the network are called hubs and play a central role in the robustness of a network, an important aspect of real network behavior. It has been shown that scale-free networks are highly robust against random failures, such as the removal of randomly selected nodes. If a node of a network is removed randomly, the probability is high that a node with a small degree is chosen. The removal of
such a node, or several such nodes, would probably not affect the network's integrity at all. Consequently, the malfunction of proteins that do not play a central role in a protein interaction network (i.e., non-hubs) will usually not lead to abnormal behavior of the network. A further important property of scale-free networks is that they are ultra-small, which means that the average path length of these networks is extremely small. The average path length is the average number of edges along the shortest paths for all possible pairs of nodes and is therefore a measure of the transport efficiency in the network. One effect of this ultra-small property in biological networks could be that local perturbations or signal changes in the network can reach the whole network very quickly. Further examples of biological networks that seem to exhibit a scale-free topology are given in [61]. Besides the degree distribution, the clustering coefficient C is another important network measure, which is often used to interrogate and to compare different complex graphs. The clustering coefficient C_i of a node i is the ratio of the actual number of edges between its neighbors to the maximal number of possible edges between its neighbors. The mean over the clustering coefficients of all nodes is called the average clustering coefficient
⟨C⟩, and C(k) is defined as the average clustering coefficient of all nodes with degree k. Based on C, ⟨C⟩, and C(k), conclusions can be drawn about the tendency of nodes to form highly connected groups of nodes, called clusters. Indeed, it has been shown that real-world networks exhibit a high C compared to random networks. For example, in protein networks the clustering of nodes is often observed. Proteins often interact with each other or form protein complexes in order to fulfill a specific cell function in a modular manner. In inferred protein networks, such interacting proteins form clusters. In this section only the most elementary measures are described; for further detailed reading, the recent book by Uri Alon [6] is recommended. The graph theoretical analysis of biological networks is one major branch of network biology, and interesting discoveries continue to be made—for example, a more recent one is the importance of bottlenecks (defined as proteins with a high betweenness centrality—that is, network nodes that have many shortest paths going through them) as the key dynamic components of networks [62]. Besides the topology, which can be analyzed with the network measures mentioned, each real network is characterized by its own set of motifs. The definition of motifs, and particularly their meaning in biological networks, is discussed in the next section.
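The basic measures just introduced can be computed directly from a graph representation; the sketch below uses the networkx library on a synthetic scale-free-like graph as a stand-in for an inferred network (the graph and its parameters are illustrative assumptions).

```python
import networkx as nx

# Synthetic stand-in for an inferred (undirected) interaction network.
G = nx.barabasi_albert_graph(n=200, m=2, seed=1)

# Degree distribution P(k).
degrees = [d for _, d in G.degree()]
P_k = {k: degrees.count(k) / G.number_of_nodes() for k in sorted(set(degrees))}
print("P(k):", P_k)

# Average clustering coefficient <C> and average shortest path length.
print("<C> =", nx.average_clustering(G))
print("average path length =", nx.average_shortest_path_length(G))
```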
12.4.2 Motifs and modules
Each complex network consists of many different subgraphs. A subgraph is a particular pattern of interconnection of two or more nodes. The number of possible subgraphs grows exponentially with the number of nodes that are part of the pattern, and for a given number of nodes there are more subgraphs with directed edges than subgraphs consisting of undirected edges (see Figure 12.13). Undirected subgraphs are also called graphlets [63]. A special kind of subgraph is the so-called network motif.
Figure 12.13 Subgraphs are particular patterns within the overall network. Shown are all subgraphs with three or four nodes. (a) All undirected subgraphs with three nodes (top) and with four nodes (bottom). (b) All directed subgraphs with three nodes (top). In the case of four nodes, there are already 199 possible subgraphs [6], so they are not shown here.
The term network motif was introduced for patterns of interconnections that occur in complex networks at numbers significantly higher than those in randomized networks [64]. Thus, motifs are over-represented subgraphs within a network. For measuring the statistical significance or over-representation, in a first step the frequency of the considered motif in the real network is determined. Different ways exist for calculating the frequency: one can count all occurrences of the motif in the network, or one can count only occurrences that have disjoint edges, or even disjoint edges and nodes. In the second step, randomized networks are generated and the frequency of the motif within these randomized versions of the network is determined. Again, there are different randomization methods, depending on which properties of the real network (e.g., the number of nodes and edges or the node degrees) should be preserved in the random networks. Finally, there are two very common ways to express the statistical significance: the Z-score and the p-value. The Z-score indicates how many standard deviations an observed frequency lies above or below the mean. For our purpose, the Z-score is thus calculated as the difference between the motif frequency in the real network and its mean frequency in a set of randomized networks, divided by the standard deviation of the frequencies over the random networks. In contrast, the p-value represents the probability that a motif occurs in a random network at least as often as in the target network. For statistical significance this probability should be lower than 0.01. It has to be mentioned that the statistical significance of a motif depends strongly on how the frequency and the random networks are calculated [65]. Each real network is characterized by its own set of motifs. It has been shown that networks which fulfill similar functions often have similar motif sets [64]. For example, it was found that the gene regulatory networks of E. coli and S. cerevisiae have a similar small set of motifs [66, 67]. Within the E. coli gene regulatory network, three highly significant motifs were found by these researchers: the feed-forward loop (FFL), the single input module (SIM), and the dense overlapping regulon (DOR) (Figure 12.14).
Figure 12.14 Network motifs found in the E. coli transcriptional regulation network. Left: In an FFL motif, a transcription factor X1 regulates a second transcription factor X2, and both jointly regulate an operon X3. Middle: The SIM motif comprises a single transcription factor X1 that regulates a set of operons X2, ..., Xn. X1 is usually auto-regulatory. All regulations are of the same sign. No other transcription factor regulates the operons. Right: In the DOR motif, a set of operons Xn+1, ..., Xm are each regulated by a combination of a set of input transcription factors X1, ..., Xn.
The FFL is the most prominent and best-studied motif of biological networks, and it has been found to be part of almost all biological systems. In order to understand the function of network motifs, their dynamics are simulated using mathematical modeling. The FFL has been studied this way [64, 67, 68]. One major result is that a coherent FFL produces an output signal only if the input signal is in some way persistent. Consequently, for a transient activity of the input signal, caused for example by noise or fluctuating external signals, the FFL will exhibit no output signal [67]. Another promising approach for analyzing the function or behavior of a network motif is to determine its dynamic stability. The stability represents the probability that a motif returns to a steady state after small-scale perturbations. More detailed information about modeling the local stability of a network motif can be found in [69]. A common property of real networks seems to be the clustering of motifs into motif clusters or modules. A motif cluster is an aggregation of several motifs into a higher-order structure. Within a cluster all nodes are highly connected with each other, but there are only few edges to other nodes or clusters in the network. It is assumed that most cellular functions are carried out by clusters or aggregated clusters [61]. The tendency of nodes to form clusters and the scale-free topology of most real networks result in a network topology called hierarchical networks [70]. In recent years, several software tools were developed for detecting and analyzing network motifs; Mfinder [71], MAvisto [72], and FANMOD [73] are programs that fulfill this task.
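For illustration, the sketch below counts feed-forward loops in a directed graph and computes a Z-score against an ensemble of simple density-matched random graphs; this baseline does not preserve node degrees, unlike the randomization used in dedicated motif tools such as Mfinder or FANMOD, and the example graph is synthetic.

```python
import networkx as nx
import numpy as np
from itertools import permutations

def count_ffl(G):
    """Count feed-forward loops (X1 -> X2, X1 -> X3, X2 -> X3)."""
    return sum(1 for x1, x2, x3 in permutations(G.nodes, 3)
               if G.has_edge(x1, x2) and G.has_edge(x1, x3)
               and G.has_edge(x2, x3))

def ffl_z_score(G, n_random=50, seed=0):
    """Z-score of the FFL count versus directed G(n, p) random graphs."""
    rng = np.random.default_rng(seed)
    n, p = G.number_of_nodes(), nx.density(G)
    rand = [count_ffl(nx.gnp_random_graph(n, p, directed=True,
                                          seed=int(rng.integers(1_000_000))))
            for _ in range(n_random)]
    return (count_ffl(G) - np.mean(rand)) / np.std(rand)

# Small synthetic regulatory-like network for demonstration.
G = nx.gnp_random_graph(25, 0.12, directed=True, seed=2)
G.add_edges_from([('A', 'B'), ('A', 'C'), ('B', 'C')])  # plant one extra FFL
print("FFL count:", count_ffl(G), "Z-score:", round(ffl_z_score(G), 2))
```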
12.4.3 Stoichiometric analysis
Stoichiometric analysis (also termed structural analysis) can be applied to biochemical reaction networks where the stoichiometries of the reactions are known. The stoichiometry matrix N is then used to derive conservation relationships, enzyme subsets (or the reaction correlation coefficient ϕ), elementary modes, and so on. An advantage of structural analysis is that it requires no information on concentrations of species or on volumes of compartments, nor any knowledge of kinetic parameters of enzymatic or nonenzymatic reactions. It can be performed on large-scale metabolic networks and even on the genome-scale networks mentioned earlier (see Section 12.3.1).
Every model that consists of a list of biochemical reactions is also represented through its stoichiometry matrix N. The elements of N, the stoichiometric coefficients of the reactions, relate the rate of change of the concentration of each network component to the rates of the reactions that produce or consume the component:

dX/dt = Nv(X, p)

Knowledge about the structure of a metabolic network (there are also attempts to apply this method to signaling networks [74]), reflected by N, and details of its reactions' reversibility are sufficient to perform a stoichiometric analysis on this reaction network. It is important to realize that the steady state assumption Nv = 0 is the premise for most of the concepts in structural modeling, but not for all (e.g., the conservation relationships hold true at every point in time). By analyzing N, one can determine a variety of model properties that could not be found by any other means:

1. Conserved moieties: Sets of internal metabolites with a fixed total concentration [75]; metabolites that contribute to such a moiety are not free to take on every concentration but depend on the concentrations of the other contributing metabolites.
2. Enzyme subsets: Groups of enzymes that operate jointly in fixed flux proportions at steady state [76].
3. Elementary modes: Minimal sets of reactions that can operate at steady state with all irreversible reactions proceeding in the appropriate direction [77]; the concept of elementary modes provides a mathematical tool to define and comprehensively describe all metabolic routes that are stoichiometrically feasible for a certain reaction network; Schuster et al. [78] gave an overview, a calculation algorithm, and an example for this concept.

The three concepts introduced above are not the only consequences arising out of the structure of a modeled network. Other contributions of structural analysis have been proposed: connectivity of metabolites [79], metabolic flux analysis [80], minimal cut sets [81], and the reaction correlation coefficient ϕ [82]. Applications to the genome-scale metabolic models (introduced in Section 12.3.1) are discussed in [83], and several more small-scale examples can be found in [84]. Based on a network's stoichiometric structure, there are further theoretical methods for systems analysis: flux balance analysis (FBA) and energy balance analysis (EBA). In Chapter 9, dynamic flux balance models are explained.
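As a small illustration of what can be read off N alone, the sketch below computes the right and left null spaces of a toy stoichiometry matrix with scipy; the network (a linear pathway with exchange reactions) and its matrix are assumptions chosen for demonstration, not taken from the cited models.

```python
import numpy as np
from scipy.linalg import null_space

# Toy network: A -> B -> C, plus exchange reactions supplying A and removing C.
# Rows = metabolites (A, B, C), columns = reactions (v_in, v1, v2, v_out).
N = np.array([[1, -1,  0,  0],
              [0,  1, -1,  0],
              [0,  0,  1, -1]])

# Right null space: flux vectors v with N v = 0 (steady-state flux modes).
# For this linear pathway, all four reactions must carry equal flux, i.e.,
# they form a single "enzyme subset".
K = null_space(N)
print("steady-state flux basis:\n", np.round(K, 3))

# Left null space: conservation relations c with c^T N = 0
# (none exist for this open toy network).
C = null_space(N.T)
print("number of conservation relations:", C.shape[1])
```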
12.4.4 Simulation of dynamics, sensitivity analysis, control analysis
Once the topology of a biological network is established and the transition rules or rate equations are defined, analysis and simulation of the dynamics of the network are possible. This can be done on a discrete time scale or on a continuous one. Continuous dynamic modeling, or conventional biochemical network modeling, relies on solving a system of differential equations. Usually, numerical methods and
software packages that provide them are needed for this, as well as for stability or sensitivity analysis. Discrete dynamic modeling is, for example, the simulation of abstract network flow or information flow, the dynamic simulation of Boolean networks, or Petri nets (not further discussed herein). The concepts of abstract network flow and information flow are conceptually simple and are introduced briefly before the dynamic simulation of Boolean networks is explained.

One of the latest developments in signaling research is to view intracellular signaling as propagation in a complex network rather than as isolated pathways [85]. In line with this are approaches such as the simulation of abstract network flow [86] or of information flow [87]. Here, the connections between network nodes are all considered to be of equal consequence (i.e., no weights), and, starting from one seed node, the propagation of a signal in the network is studied. This can be the overall reach after a certain number of time steps, or, in the case of hierarchical networks or networks that contain nodes with special features (e.g., TFs in protein-protein interaction networks), it can be the number of time steps it takes to reach nodes in a specific layer or nodes with the special features. The signal can also be split and equally distributed between all nodes that are reached in the next step. In this way, biological networks can be compared to random networks, attractors can be detected, and so on, all with relatively low computational effort, and the dynamic flow can be simulated even for an organism's entire known protein interaction network.

Dynamics in Boolean networks
In general, gene regulatory networks modeled as Boolean networks (BN) are directed graphs in which each node represents one gene. Nodes can adopt only two different states, namely 0 and 1. Consequently, Boolean networks are characterized by their restriction to discrete state values. A node with state value 0 represents the inactive form of the gene represented by the node, which means that the gene is currently not expressed. In contrast, a node with state value 1 stands for the active form, indicating that the gene is expressed. In addition to the binary representation of the genes, each gene of the network influences the behavior or state of one or several other genes. These interactions, illustrated by directed edges in the Boolean network, are modeled by Boolean functions (Boolean variables connected by the logical operators AND, OR, and NOT). Each node/gene is assigned one of these functions, such that the state of each particular gene (0/1 or off/on) at time point t + 1 depends on the states at time point t of the genes regulating it. At each time step, all genes are updated synchronously. In an extended version of Boolean modeling, not only logical operators but other functions, such as the sum of all input states, are allowed. The rules for state transitions then also define the threshold of the function for transitions from one state (input state) to another state (output state). This approach was applied to dynamically model the cell-cycle regulatory network of budding yeast [88]. Instead of modeling the gene activation status, one can also model the status of the proteins that are the products of the genes. The budding yeast cell-cycle network is a simple dynamic model containing 11 proteins.
The protein states X_i (i = 1, ..., 11) in the next time step are determined by the protein states in the present time step via the rules given in a state transition table [e.g., in Figure 12.15(b)], where the a_ij denote the weights of the edges, with a_ij = 1 for a green/solid arrow from protein j to protein i and a_ij = −1 for a red/dashed arrow from j to i (the edges may also be equipped with noninteger weights).
Figure 12.15 Discrete dynamical model of the cell-cycle network of fission yeast (S. pombe) [89]. (a) Network of fission yeast cell-cycle regulation, with nodes Start, SK, Cdc2/Cdc13, Ste9, Rum1, Slp1, Cdc2/Cdc13*, Wee1/Mik1, Cdc25, and PP. The nodes denote threshold functions, representing the switching behavior of regulatory proteins. Arrows represent interactions between proteins: a_ij = 1 for an activating interaction (green/solid link) from node j to node i, a_ij = −1 for an inhibiting (red/dashed) link from node j to node i, and a_ij = 0 for no interaction at all. (b) State transition table for the fission yeast cell-cycle model. The rules in the table define how the states of the nodes are updated (in parallel) in discrete time steps.
Note that in this model the interactions are modeled by a sum function instead of pure logical operators; the threshold for transitions is 0, and only if the weighted sum equals 0 is the previous state retained. The network was simulated for all 2,048 (2^11) initial states, and the results show that most of the simulations converge to one single attractor (i.e., a state vector that always reproduces itself in the next time step), which was then related to the stationary G1 phase of the cell cycle. An analogous Boolean model for the biochemical network that controls cell-cycle progression in fission yeast (S. pombe) [89], shown in Figure 12.15(a), successfully predicted the time sequence of protein activation along its cell cycle. The authors compare their results with a much more complicated ODE-based model that requires extensive parameter tuning and conclude by encouraging further modeling experiments with this quite minimalistic approach, as it may prove a quick route to predicting biologically relevant dynamical features of genetic and protein networks in the living cell [89].
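A minimal sketch of this kind of synchronous threshold update is given below; the four-node network, its weights, and the tie rule follow the general scheme described above but are otherwise hypothetical and much smaller than the published cell-cycle models.

```python
import numpy as np
from itertools import product

# Hypothetical 4-node threshold network in the spirit of the cell-cycle
# models: a[i, j] = +1 (activation), -1 (inhibition), 0 (no interaction)
# for the edge from node j to node i.
a = np.array([[ 0,  0, -1,  0],
              [ 1,  0,  0, -1],
              [ 0,  1,  0,  0],
              [-1,  0,  1,  0]])

def step(state):
    """Synchronous update: positive weighted sum -> 1, negative -> 0,
    zero -> previous state retained."""
    s = a @ state
    new = state.copy()
    new[s > 0] = 1
    new[s < 0] = 0
    return new

# Follow every one of the 2^4 initial states to its attractor.
basin = {}
for bits in product([0, 1], repeat=4):
    state, seen = np.array(bits), []
    while tuple(state) not in seen:
        seen.append(tuple(state))
        state = step(state)
    # Fixed points are tallied by basin size; limit cycles are keyed by
    # the state at which the trajectory enters them.
    basin[tuple(state)] = basin.get(tuple(state), 0) + 1

print(basin)
```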
Sensitivity analysis, control analysis, and simulations
ODE models, once they are put together (i.e., their structure defined and populated with parameter values and with initial values for all variables), can be analyzed in several different ways: by control analysis [90] or sensitivity analysis [91, 92], in order to detect crucial steps or parameters, or by stability analysis [93], in order to evaluate the robustness of the network. For more details regarding sensitivity analysis, we refer the reader to Chapter 8. Furthermore, ODE models can be simulated by using software solutions that allow numerical integration. Thus, the time course and behavior of the system can be investigated for different initial or environmental conditions, predictions can be made, and hypotheses about the biological system that is described by the network can be formulated, leading to new experiments to test them. Well-known examples for the modeling and simulation of biochemical networks are the various erythrocyte (red blood cell) metabolic models [94, 95] and models of microbial and plant metabolism [84]. Many metabolic and some signaling network models can be found and even interactively explored (including control and sensitivity analysis) in the model database JWS Online (jjj.biochem.sun.ac.za).
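The sketch below shows the basic workflow of numerical simulation plus a crude finite-difference sensitivity check for a hypothetical two-species kinetic model; it is not one of the cited erythrocyte or plant models, and the rate law and parameter values are assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical two-species model: substrate S is converted to product P
# with Michaelis-Menten kinetics; P is degraded with first-order kinetics.
def rhs(t, y, vmax, km, kdeg):
    s, p = y
    v = vmax * s / (km + s)
    return [-v, v - kdeg * p]

params = {"vmax": 1.0, "km": 0.5, "kdeg": 0.3}   # assumed values

def final_product(p):
    sol = solve_ivp(rhs, (0.0, 20.0), [2.0, 0.0],
                    args=(p["vmax"], p["km"], p["kdeg"]),
                    t_eval=np.linspace(0.0, 20.0, 50))
    return sol.y[1, -1]                          # product level at t = 20

# Crude local sensitivity: relative change in the final product level per
# relative change in each parameter (one-sided finite differences).
base = final_product(params)
for name in params:
    bumped = dict(params, **{name: params[name] * 1.01})
    sensitivity = (final_product(bumped) - base) / base / 0.01
    print(f"normalized sensitivity w.r.t. {name}: {sensitivity:+.3f}")
```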
12.5 Discussion and Comparison of Approaches
Reverse engineering is an appropriate tool to increase our knowledge of biological systems, and it is capable of creating predictive models of biological systems that also capture environmental factors affecting system responses. In this chapter, several methods for network inference and analysis were introduced that enable the identification of network structure and the detection of motifs and modules within these networks. The overall topology of the networks may ensure robustness to component failure by redundancy.

The choice of approach very much depends on the type of cellular network. This is reflected also in the fact that most published reviews of network inference methods focus on one type of network: gene networks [24, 32, 96] or metabolic networks [97]. Also crucial for the choice of approach is the type of data that is available (i.e., which biological entities (molecules) are the network components) and how the data was generated (i.e., time series data or steady state data for different conditions). When designing new experiments, the questions must be: What is the (minimal) information required to uncover the network? When or how often should the measurements take place [98], and what should be measured? Bansal et al. [32] compared several approaches: Bayesian networks, information-theoretic methods, and ODE-based ones. In the following, the features, advantages, and disadvantages of the various methods described in this chapter are summarized and compared.

Boolean networks (see Section 12.3.2) are deterministic. For each node, a Boolean relation has to be defined that explains the influences of the input states (the states of the nodes that have an edge directed at the node in question) on the node's state. This is summarized in the so-called state transition table. The discretization is a central feature of this method: it is useful when only noisy data is available, but it implies a loss of information. Also, from the computational point of view, complexity grows exponentially with the number of nodes. Boolean network modeling can be applied to data sets measured at a minimum of two time points. In most cases, Boolean networks are used to infer interactions between genes in terms of gene regulatory networks. The inference of regulatory interactions between genes from experimental data collected in microarray experiments is a major challenge. Genome expression analysis involves the use of oligonucleotide or cDNA microarrays to measure, in a parallel fashion, the mRNA levels
of as many genes as possible in a genome (see Section 12.2.3). Many techniques are being developed to analyze these experimental measurements in order to disclose the main gene interactions at a given moment.

Bayesian networks (see Section 12.3.4) are directed acyclic graphs that best describe a given set of steady state data, and they are not limited to discrete variables like Boolean networks. A major disadvantage of Bayesian networks is the restriction to acyclic graphs, as biological networks contain loops as one of their main features. Also, this network type cannot be used to model time series data. Bayesian network modeling is suitable for statistical models with measurement errors and minimal parameterization. The advantage of Bayesian networks is that samples with missing values and latent variables can be integrated. An extension of Bayesian networks is the dynamic Bayesian network, which allows the use of time series data and the modeling of feedback loops as well.

ODE modeling can be applied to reconstructing cellular networks from time series of gene expression, signaling, and metabolite data. Time series data should be available, though, as data from experiments with differing conditions also imply different parameters. Power-law models can give a very pragmatic description of a biological network, because noninteger kinetic orders are allowed and these can be interpreted in a biological sense (inhibition, activation, and so forth), but they have the disadvantage that the problem to be solved is of very high dimensionality, because many parameters need to be estimated in order to obtain even the network structure. The genome-scale metabolic modeling described in Section 12.3.1 requires that the genome of the organism is available; also, it provides structural models only.

A priori information is a useful supplement to the standard numerical data coming from an experiment. It lessens the computational complexity problem that can arise if all interactions are to be inferred. If prior knowledge is available, it should be incorporated. An example of prior knowledge is protein-protein interaction data (see Section 12.2.2). Often, there is prior knowledge available for metabolic networks and signaling networks, but the structure of gene regulatory networks is usually largely unknown in advance. The confidence in already-established metabolic networks (e.g., for E. coli, yeast, A. thaliana, and so on) is higher, but additional theoretical analysis can lead to deeper insight (see Section 12.3.1).

This chapter's emphasis has been on gene regulatory and metabolic networks, with less detail on signaling networks. The time scale of signaling events is usually much shorter, and obtaining the necessary data for application of the inference methods is more difficult. One possibility for discovering signaling pathways that has not yet been mentioned in this chapter is alignment with known pathways in other species [99]. For signaling, see also Chapter 10.

Once a network is inferred, there are several possibilities to analyze it in more detail. Graph theory (see Section 12.4.1) provides many measures with which the overall topology and the robustness of the network can be determined. The detection of network motifs such as feed-forward loops (see Section 12.4.2) is one way to gain a deeper understanding of the dynamics of the inferred network. A survey of the field of network biology can be found in [65].
In summary, there are different approaches from which a researcher can choose; in the previous sections, they were introduced and their advantages and disadvantages discussed. The choice of approach taken really depends on the experimental data available,
the type of network one wants to look at, and the questions that are to be tackled by analyzing it. One last comment: As we have seen, studies are usually restricted to one type of cellular network (metabolic, signaling, or gene regulation), but the outlook for the future is that these networks on the different levels need to be combined into one integrated cellular network [100] in order to get a true understanding of the inner workings of the cell.
12.6 Summary Points

• Reverse engineering is the task of mapping an unknown network. It comprises either or both of the following tasks:
  • Network inference: the process of derivation/assignment of the interaction structure;
  • Parameter estimation: the qualitative, quantitative, or dynamic description of the interactions (populating the structure with transition rules and rate equations, parameters, and so forth).
• Reverse engineering of biological networks is important because it allows the building of predictive models (as opposed to statistical models) of the studied biological systems, which is one major goal in view of future applications in medicine or biotechnology.
• There are many methods available for network inference, but the choice depends on the type of network (i.e., the biological molecules that form the network) and on the type of data that is available (i.e., from time series or perturbation experiments), as well as on the amount of data.
• Among the methods are those that use differential equations. The rapid development of further ODE-based methods in the last decade, together with the ongoing increase in access to computational power in laboratories, holds promise for the future.
• To improve reverse engineering, there is a need for guidelines for appropriate experimental design; that is, optimization (minimization of the number or cost) of the experiments so that different network topologies can be distinguished.
• For an understanding of the dynamics of an inferred network, one has to employ further analysis steps, such as motif detection and mathematical simulation of the inferred network.
Acknowledgments
The authors have received financial support from the German Federal Ministry of Education and Research (BMBF) grant 01GR0475 as part of the National Genome Research Network (NGFN-2). Furthermore, we thank Sylvia Haus for her help in the initial preparation of the manuscript and with the section on Bayesian network modeling, Peter Raasch for assistance in preparing the figures, and Julio Vera Gonzalez for discussions and input on power-law modeling.
References
[1] Oda, K., M. Matsuoka, A. Funahashi, and H. Kitano, "A comprehensive pathway map of epidermal growth factor receptor signaling," Molecular System Biology, Vol. 1, No. 1, May 2005, pp. msb4100014-E1–msb4100014-E17.
[2] Galperin, M.J., "The Molecular Biology Database Collection: 2008 update," Nucl. Acids Res., Vol. 36, Database issue, November 2007, pp. D2–D4.
[3] Hall, D.H., and R.L. Russell, "The posterior nervous system of the nematode Caenorhabditis elegans: serial reconstruction of identified neurons and complete pattern of synaptic interactions," Journal of Neuroscience, Vol. 11, 1991, pp. 1–22.
[4] Kitano, H., A. Funahashi, Y. Matsuoka, and K. Oda, "Using process diagrams for the graphical representation of biological networks," Nature Biotechnology, Vol. 23, No. 8, August 2005, pp. 961–966.
[5] Kohn, K.W., M.I. Aladjem, S. Kim, J.N. Weinstein, and Y. Pommier, "Depicting combinatorial complexity with the molecular interaction map notation," Mol. Syst. Biol., Vol. 2, No. 51, October 2006.
[6] Alon, U., An Introduction to Systems Biology, New York: Chapman and Hall/CRC Mathematical and Computational Biology Series, 2006.
[7] Cho, K.-H., H.S. Choi, and S.M. Choo, "Unraveling the functional interaction structure of a biomolecular network through alternate perturbation of initial conditions," J. Biochem. Biophys. Methods, Vol. 70, No. 4, June 2007, pp. 701–707.
[8] Uetz, P., "Two-hybrid arrays," Curr. Opin. Chem. Biol., Vol. 6, No. 1, February 2008, pp. 57–62.
[9] Ratushny, V., and E.A. Golemis, "Resolving the network of cell signaling pathways using the evolving yeast two-hybrid system," BioTechniques, Vol. 44, No. 5, April 2008, pp. 655–662.
[10] Schlicker, A., C. Huthmacher, F. Ramirez, T. Lengauer, and M. Albrecht, "Functional evaluation of domain–domain interactions and human protein interaction networks," Bioinformatics, Vol. 23, No. 7, April 2007, pp. 859–865.
[11] Schlitt, T., and A. Brazma, "Current approaches to gene regulatory network modeling," BMC Bioinformatics, Vol. 8, Suppl. 9, September 2007.
[12] Wolkenhauer, O., Data Engineering, New York: John Wiley & Sons, 2001.
[13] van Someren, E.P., B.L.T. Vaes, W.T. Steegenga, A.M. Sijbers, K.J. Dechering, and M.J.T. Reinder, "Least absolute regression network analysis of the murine osteoblast differentiation network," Bioinformatics, Vol. 22, No. 4, February 2006, pp. 477–484.
[14] Poolman, M.G., B.K. Bonde, A. Gevorgyan, H.H. Patel, and D.A. Fell, "Challenges to be faced in the reconstruction of metabolic networks from public databases," IEE Proc. Syst. Biol., Vol. 153, No. 5, September 2006, pp. 379–384.
[15] Beste, D.J.V., T. Hooper, G. Stewart, B. Bonde, C. Avignone-Rossa, M.E. Bushell, P. Wheeler, S. Klamt, A.M. Kierzek, and J. McFadden, "GSMN-TB: a web-based genome-scale network model of Mycobacterium tuberculosis metabolism," Genome Biol., Vol. 8, R89, May 2007.
[16] Gonzalez, O., S. Gronau, M. Falb, F. Pfeiffer, E. Mendoza, R. Zimmer, and D. Oesterhelt, "Reconstruction, modeling & analysis of Halobacterium salinarum R-1 metabolism," Mol. BioSyst., Vol. 4, No. 2, February 2008, pp. 148–159.
[17] Forster, J., I. Famili, P. Fu, B.Ø. Palsson, and J. Nielsen, "Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network," Genome Research, Vol. 13, No. 2, February 2003, pp. 244–253.
[18] Duarte, N.C., M.J. Herrgard, and B.Ø. Palsson, "Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genome-scale metabolic model," Genome Research, Vol. 14, No. 7, July 2004, pp. 1298–1309.
[19] Kim, H.U., T.Y. Kim, and S.Y. Lee, "Metabolic flux analysis and metabolic engineering of microorganisms," Mol. BioSyst., Vol. 4, No. 2, 2008, pp. 113–120.
[20] Kauffman, S.A., "Metabolic stability and epigenesis in randomly constructed genetic nets," J. Theoret. Biol., Vol. 22, No. 3, March 1969, pp. 437–467.
[21] Liang, S., S. Fuhrman, and R. Somogyi, "REVEAL: A general reverse engineering algorithm for inference of genetic network architectures," Pacific Symp. Biocomputing, Vol. 3, 1998, pp. 18–29.
[22] Akutsu, T., S. Miyano, and S. Kuhara, "Identification of genetic networks from a smaller number of gene expression patterns under the Boolean network model," Pac. Symp. Biocomput., 1999, pp. 17–28.
[23] Akutsu, T., S. Miyano, and S. Kuhara, "Inferring qualitative relations in genetic networks and metabolic pathways," Bioinformatics, Vol. 16, No. 8, April 2000, pp. 727–734.
[24] Cho, K.-H., S.-M. Choo, S.H. Jung, J.-R. Kim, H.-S. Choi, and J. Kim, "Reverse engineering of gene regulatory networks," IET Systems Biol., Vol. 1, 2007, pp. 149–163.
[25] Shmulevich, I., E.R. Dougherty, S. Kim, and W. Zhang, "Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks," Bioinformatics, Vol. 18, No. 2, February 2002, pp. 261–274.
[26] Shmulevich, I., E.R. Dougherty, and W. Zhang, "From Boolean to probabilistic Boolean networks as models of genetic regulatory networks," Proc. IEEE, Vol. 90, No. 11, November 2002, pp. 1778–1792.
[27] Silvescu, A., and V. Honavar, "Temporal Boolean network models of genetic networks and their inference from gene expression time series," Complex Systems, Vol. 13, 2001, pp. 54–70.
[28] Steuer, R., "Review: On the analysis and interpretation of correlations in metabolomic data," Brief Bioinformatics, Vol. 7, No. 2, June 2006, pp. 151–158.
[29] May, P., S. Wienkoop, S. Kempa, B. Usadel, N. Christian, J. Rupprecht, J. Weiss, L. Recuenco-Munoz, O. Ebenhöh, W. Weckwerth, and D. Walther, "Metabolomics- and proteomics-assisted genome annotation and analysis of the draft metabolic network of Chlamydomonas reinhardtii," Genetics, Vol. 179, May 2008, pp. 157–166.
[30] Cakmak, A., and G. Ozsoyoglu, "Mining biological networks for unknown pathways," Bioinformatics, Vol. 23, No. 20, August 2007, pp. 2775–2783.
[31] D'haeseleer, P., S. Liang, and R. Somogyi, "Genetic network inference: from co-expression clustering to reverse engineering," Bioinformatics, Vol. 16, No. 8, August 2000, pp. 707–726.
[32] Bansal, M., V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo, "How to infer gene networks from expression profiles," Mol. Syst. Biol., Vol. 3, No. 78, February 2007.
[33] Steffen, M., A. Petti, J. Aach, P. D'haeseleer, and G. Church, "Automated modelling of signal transduction networks," BMC Bioinformatics, Vol. 3, No. 34, November 2002.
[34] Arkin, A., and J. Ross, "Statistical construction of chemical reaction mechanisms from measured time-series," J. Chem. Phys., Vol. 99, No. 3, January 1995, pp. 970–979.
[35] Arkin, A., P. Shen, and J. Ross, "A test case of correlation metric construction of a reaction pathway from measurements," Science, Vol. 277, No. 5330, August 1997, pp. 1275–1279.
[36] Scott, J., T. Ideker, R.M. Karp, and R. Sharan, "Efficient algorithms for detecting signaling pathways in protein interaction networks," J. Comp. Biol., Vol. 13, No. 2, March 2006, pp. 133–144.
[37] de Jong, H., "Modeling and simulation of genetic regulatory systems: a literature review," J. Comput. Biol., Vol. 9, No. 1, 2002, pp. 67–103.
[38] Sachs, K., O. Perez, D. Pe'er, D.A. Lauffenburger, and G.P. Nolan, "Causal protein-signaling networks derived from multiparameter single-cell data," Science, Vol. 308, No. 5721, April 2005, pp. 523–529.
[39] Murphy, K., "Dynamic Bayesian networks: representation, inference and learning," Ph.D. Dissertation, U.C. Berkeley, Computer Science Division, 2002.
[40] Yu, J., V.A. Smith, P.P. Wang, A.J. Hartemink, and E.D. Jarvis, "Advances to Bayesian network inference for generating causal networks from observational biological data," Bioinformatics, Vol. 20, No. 18, July 2004, pp. 3594–3603.
[41] Vera, J., E. Balsa-Canto, P. Wellstead, J.R. Banga, and O. Wolkenhauer, "Power-law models of signal transduction pathways," Cellular Signalling, Vol. 19, No. 7, July 2007, pp. 1531–1541.
[42] Schmidt, H., K.-H. Cho, and E.W. Jacobsen, "Identification of small scale biochemical networks based on general type system perturbations," FEBS J., Vol. 272, No. 9, July 2005, pp. 2141–2151.
[43] Kholodenko, B.N., A. Kiyatkin, F.J. Bruggeman, E. Sontag, H.V. Westerhoff, and J.B. Hoek, "Untangling the wires: a strategy to trace functional interactions in signaling and gene networks," PNAS, Vol. 99, No. 20, October 2002, pp. 12841–12846.
[44] Sontag, E., A. Kiyatkin, and B.N. Kholodenko, "Inferring dynamic architecture of cellular networks using time series of gene expression, protein and metabolite data," Bioinformatics, Vol. 20, No. 12, August 2004, pp. 1877–1886.
[45] Cho, K.-H., S.-M. Choo, P. Wellstead, and O. Wolkenhauer, "A unified framework for unraveling the interaction model structure of a biochemical network using stimulus-response data," FEBS Letters, Vol. 579, No. 20, 2005, pp. 4520–4528.
[46] Savageau, M.A., "Biochemical systems analysis: II. Steady state solutions for an n-pool system using a power-law approximation," J. Theor. Biol., Vol. 25, December 1969, pp. 370–379.
[47] Savageau, M.A., "Biochemical systems analysis: III. Dynamic solutions using a power-law approximation," J. Theor. Biol., Vol. 26, February 1970, pp. 215–226.
[48] Voit, E.O., Computational Analysis of Biochemical Systems: A Practical Guide for Biochemists and Molecular Biologists, Cambridge, U.K.: Cambridge University Press, 2000.
[49] Takahashi, K., S.N.V. Arjunan, and M. Tomita, "Space in systems biology of signaling pathways – towards intracellular molecular crowding in silico," FEBS Lett., Vol. 579, No. 8, March 2005, pp. 1783–1788.
[50] Grima, R., and S. Schnell, "A mesoscopic simulation approach for modeling intracellular reactions," J. Stat. Phys., Vol. 128, No. 1–2, 2006, pp. 139–164.
[51] Kopelman, R., "Fractal reaction kinetics," Science, Vol. 241, No. 4873, September 1988, pp. 1620–1626.
[52] Savageau, M.A., "Michaelis-Menten mechanism reconsidered: implications of fractal kinetics," J. Theor. Biol., Vol. 176, September 1995, pp. 115–124.
[53] Kikuchi, S., D. Tominaga, M. Arita, K. Takahashi, and M. Tomita, "Dynamic modeling of genetic networks using genetic algorithm and S-system," Bioinformatics, Vol. 19, No. 5, March 2003, pp. 643–650.
[54] Veflingstad, S.R., J. Almeida, and E.O. Voit, "Priming nonlinear searches for pathway identification," Theoretical Biology and Medical Modelling, Vol. 1, September 2004, p. 8.
[55] Kimura, S., K. Ide, A. Kashihara, M. Kano, M. Hatakeyama, R. Masui, N. Nakagawa, S. Yokoyama, S. Kuramitsu, and A. Konagaya, "Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm," Bioinformatics, Vol. 21, No. 7, April 2005, pp. 1154–1163.
[56] Bongard, J., and H. Lipson, "Automated reverse engineering of nonlinear dynamical systems," PNAS, Vol. 104, No. 24, 2007, pp. 9943–9948.
[57] Bongard, J., and H. Lipson, "Nonlinear system identification using coevolution of models and tests," IEEE Trans. Evol. Comput., Vol. 9, No. 4, August 2005, pp. 361–384.
[58] Pfeifer, M., and J. Timmer, "Parameter estimation in ordinary differential equations using the method of multiple shooting," IET Syst. Biol., Vol. 1, No. 2, March 2007, pp. 78–88.
[59] Liebermeister, W., and E. Klipp, "Biochemical networks with uncertain parameters," IEE Proc. Syst. Biol., Vol. 152, No. 3, September 2005, pp. 97–107.
[60] Sharan, R., and T. Ideker, "Modeling cellular machinery through biological network comparison," Nature Biotechnol., Vol. 24, No. 4, April 2006, pp. 427–433.
[61] Barabasi, A.-L., and Z.N. Oltvai, "Network biology: understanding the cell's functional organization," Nature Reviews Genetics, Vol. 5, No. 2, February 2004, pp. 101–113.
[62] Yu, H., P.M. Kim, E. Sprecher, V. Trifonov, and M. Gerstein, "The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics," PLoS Comput. Biol., Vol. 3, No. 4, April 2007, p. e59.
[63] Przlj, N., D.G. Corneil, and I. Jurisica, "Modeling interactome: scale-free or geometric?" Bioinformatics, Vol. 20, No. 18, July 2004, pp. 3508–3515.
[64] Milo, R., S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, Vol. 298, No. 5594, October 2002, pp. 824–827.
[65] Junker, B.H., et al., Analysis of Biological Networks, New York: Wiley-Interscience, 2006.
[66] Lee, T.I., N.J. Rinaldi, F. Robert, D.T. Odom, Z. Bar-Joseph, G.K. Gerber, N.M. Hannett, C.T. Harbison, C.M. Thompson, I. Simon, J. Zeitlinger, E.G. Jennings, H.L. Murray, D.B. Gordon, B. Ren, J.J. Wyrick, J.-B. Tagne, T.L. Volkert, E. Fraenkel, D.K. Gifford, and R.A. Young, "Transcriptional regulatory networks in Saccharomyces cerevisiae," Science, Vol. 298, No. 5594, October 2002, pp. 799–804.
[67] Shen-Orr, S., R. Milo, S. Mangan, and U. Alon, "Network motifs in the transcriptional regulation network of Escherichia coli," Nature Genetics, Vol. 31, No. 1, May 2002, pp. 64–68.
[68] Mangan, S., and U. Alon, "Structure and function of the feed-forward loop network motif," Proc. Natl. Acad. Sci., Vol. 100, No. 21, October 2003, pp. 11980–11985.
[69] Prill, R.J., P.A. Iglesias, and A. Levchenko, "Dynamic properties of network motifs contribute to biological network organization," PLoS Biology, Vol. 3, No. 11, October 2005, e343.
[70] Clauset, A., C. Moore, and M.E.J. Newman, "Hierarchical structure and the prediction of missing links in networks," Nature, Vol. 453, No. 7191, 2008, pp. 98–101.
[71] Kashtan, N., S. Itzkovitz, R. Milo, and U. Alon, "Mfinder tool guide," Technical report, Department of Molecular Cell Biology and Computer Science & Applied Mathematics, Weizmann Institute of Science, 2002.
[72] Schreiber, F., and H. Schwöbbermeyer, "MAVisto: A tool for the exploration of network motifs," Bioinformatics, Vol. 21, No. 17, September 2005, pp. 3572–3574.
[73] Wernicke, S., and F. Rasche, "FANMOD: A tool for fast network motif detection," Bioinformatics, Vol. 22, No. 9, May 2006, pp. 1152–1153.
[74] Papin, J.A., and B.Ø. Palsson, "Topological analysis of mass-balanced signaling networks: a framework to obtain network properties including crosstalk," J. Theoret. Biol., Vol. 227, No. 2, March 2004, pp. 283–297.
[75] Hofmeyr, J.-H. S., "Steady state modelling of metabolic pathways: a guide for the prospective simulator," Comp. Appl. Biosci., Vol. 2, 1986, pp. 5–11.
[76] Pfeiffer, T., I. Sanchez-Valdenebro, J.C. Nuno, F. Montero, and S. Schuster, "METATOOL: for studying metabolic networks," Bioinformatics, Vol. 15, No. 3, 1999, pp. 251–257.
[77] Schuster, S., and C. Hilgetag, "On elementary flux modes in biochemical systems at steady state," J. Biol. Syst., Vol. 2, 1994, pp. 165–182.
[78] Schuster, S., C. Hilgetag, J. Woods, and D. Fell, "Reaction routes in biochemical reaction systems: algebraic properties, validated calculation procedure and example from nucleotide metabolism," J. Math. Biol., Vol. 45, No. 2, August 2002, pp. 153–181.
[79] Wagner, A., and D.A. Fell, "The small world inside large networks," Proc. Biol. Sci., Vol. 268, No. 1478, September 2001, pp. 1803–1810.
[80] Avignone-Rossa, A.J., A. White, P. Kuiper, M. Postma, M. Bibb, and M. Teixeira de Mattos, "Carbon flux distribution in antibiotic-producing chemostat cultures of Streptomyces lividans," Metabolic Engineering, Vol. 4, No. 2, April 2002, pp. 138–150.
[81] Klamt, S., and E. Gilles, "Minimal cut sets in biochemical reaction networks," Bioinformatics, Vol. 20, No. 2, January 2004, pp. 226–234.
[82] Poolman, M.G., C. Sebu, M.K. Pidcock, and D.A. Fell, "Modular decomposition of metabolic systems via null-space analysis," J. Theoret. Biol., Vol. 249, No. 4, December 2007, pp. 691–705.
[83] Jamshidi, N., and B.Ø. Palsson, "Formulating genome-scale kinetic models in the post-genome era," Mol. Syst. Biol., Vol. 4, No. 171, March 2008.
[84] Schuster, S., et al., "Modeling and simulating metabolic networks," in T. Lengauer, (ed.), Bioinformatics: From Genomes to Therapies, New York: Wiley, 2007.
[85] Friedman, A., and N. Perrimon, "Genetic screening for signal transduction in the era of network biology," Cell, Vol. 128, No. 2, January 2007, pp. 225–231.
[86] Kaluza, P., M. Ipsen, M. Vingron, and A.S. Mikhailov, "Design and statistical properties of robust functional networks: A model study of biological signal transduction," Phys. Rev. E, Vol. 75, 2007, R015101.
[87] Stojmirovic, A., and Y.-K. Yu, "Information flow in interaction networks," J. Comput. Biol., Vol. 14, No. 8, October 2007, pp. 1115–1143.
[88] Li, F., T. Long, Y. Lu, Q. Ouyang, and C. Tang, "The yeast cell-cycle network is robustly designed," PNAS, Vol. 101, No. 14, April 2004, pp. 4781–4786.
[89] Davidich, M.I., and S. Bornholdt, "Boolean network model predicts cell cycle sequence of fission yeast," PLoS ONE, Vol. 3, No. 2, 2008, e1672.
[90] Wolkenhauer, O., M. Ullah, P. Wellstead, and K.-H. Cho, "The dynamic systems approach to control and regulation of intracellular networks," FEBS Lett., Vol. 579, No. 8, 2005, pp. 30–34.
[91] Alves, R., and M. Savageau, "Comparing systemic properties of ensembles of biological networks by graphical and statistical methods," Bioinformatics, Vol. 16, No. 6, June 2000, pp. 527–533.
[92] Alves, R., and M. Savageau, "Systemic properties of ensembles of metabolic networks: application of graphical and statistical methods to simple unbranched pathways," Bioinformatics, Vol. 16, No. 6, June 2000, pp. 534–547.
[93] Strogatz, S., Nonlinear Dynamics and Chaos, Boulder, CO: Westview Press, 2000.
[94] Joshi, A., and B.Ø. Palsson, "Metabolic dynamics in the human red-cell. Part I: A comprehensive kinetic model," Journal of Theoretical Biology, Vol. 141, No. 4, 1989, pp. 515–528.
[95] Ni, T.C., and M.A. Savageau, "Application of biochemical systems theory to metabolism in human red blood cells," J. Biol. Chem., Vol. 271, No. 14, 1996, pp. 7927–7941.
[96] Bornholdt, S., "Systems biology: Less is more in modeling large genetic networks," Science, Vol. 310, No. 5747, October 2005, pp. 449–451.
[97] Crampin, E.J., S. Schnell, and P.E. McSharry, "Mathematical and computational techniques to deduce complex biochemical reaction mechanisms," Prog. Biophys. Mol. Biol., Vol. 86, No. 1, September 2004, pp. 77–112.
[98] Kutalik, Z., H.-H. Cho, and O. Wolkenhauer, "Optimal sampling time selection for parameter estimation in dynamic pathway modeling," BioSystems, Vol. 75, No. 1–3, 2004, pp. 43–55.
[99] Kelley, B.P., R. Sharan, R.M. Karp, T. Sittler, D.E. Root, B.R. Stockwell, and T. Ideker, "Conserved pathways within bacteria and yeast as revealed by global protein network alignment," PNAS, Vol. 100, No. 20, September 2003, pp. 11394–11399.
[100] Lee, J.M., E.P. Gianchandani, J.A. Eddy, and J.A. Papin, "Dynamic analysis of integrated signaling, metabolic, and regulatory networks," PLoS Comput. Biol., Vol. 4, No. 5, May 2008, p. e1000086.
CHAPTER 13
Transcriptome Analysis of Regulatory Networks
Katy C. Kao, Linh M. Tran, and James C. Liao
Department of Chemical Engineering, Texas A&M University
Rosetta Inpharmatics LLC
Department of Chemical and Biomolecular Engineering, University of California, Los Angeles
Abstract
The coordinated activities of transcription factors are responsible for changes in the gene expression of the cell upon a shift in the environment. The ability of a transcription factor to bind to its DNA targets, and therefore to enact a change in expression, is the result of signal transduction cascades generated in response to environmental stimuli. Thus, the change in transcript abundance is indicative of changes in the cellular state of the organism. With high-throughput genomic tools now readily available, large-scale transcriptome data can be generated in order to obtain "snapshots" of the transcriptional activity of the cell. In simple transcriptional regulatory networks, where one gene is regulated by one transcription factor, one can simply infer the activity of the transcription factor from the expression of the gene it regulates. However, most transcriptional regulatory networks are complex, with multiple connections between transcription factors and genes (Figure 13.1). Thus, systems biology approaches are required for the deconvolution of transcriptome data to infer the activities of transcription factors. The transcription factors identified to be involved under the condition of interest can be used to generate testable hypotheses regarding the cellular or metabolic responses to environmental perturbations, such as drug effects, and may identify new drug targets.
Key terms
Transcription factor activities
Network component analysis
DNA microarrays
13.1 Introduction
The coordinated expression of genes defines the cellular and metabolic state of the cell. With increasing knowledge of the transcriptional regulatory network, it is now possible to infer the state of the cell via transcriptional profiling. Cells adapt to environmental changes by altering their cellular metabolism via signal transduction cascades, resulting in the ultimate activation or deactivation of a set of DNA binding proteins called transcription factors. Once activated, transcription factors bind to specific regions of DNA to either positively or negatively regulate transcription. Although some transcription factors are activated by their synthesis, others require post-translational modification or ligand binding, which enables DNA binding via conformational changes. The expression of the genetic repertoire of a cell is dictated by the collective activities of these transcription factors, which regulate whether or not specific regions in the genome are transcribed and, if so, to what degree they are transcribed. Each transcript can be regulated by more than one transcription factor, whose combined activities determine when, where, and how much a given gene is expressed. While more than one gene can be regulated by the same transcription factor, the effects of the transcription factor can differ depending on the target gene. A set of genes that is regulated by the same transcription factor is defined as a regulon. Figure 13.1 shows a cartoon of a transcriptional regulatory network.

Until the mid-1990s, the transcriptional expression of only a handful of genes could be assayed at a time, such as through traditional northern blot analyses. In 1995, Patrick Brown's lab at Stanford University developed the first DNA microarray [1], on which thousands of sequences of interest can be spotted on a single glass microscope slide and assayed to determine relative expression levels. These microarrays revolutionized the
Figure 13.1 Hypothetical transcriptional regulatory network. Green circles represent transcription factors. Brown circles represent the activity of transcription factors (TFA), which can be high or low depending on the cellular state. Yellow squares are genes. The waves under the genes represent changes in their transcript abundance. The arrows represent the connections between transcription factors and the genes they regulate. The heavier the arrow, the stronger the effect of the transcription factor on the target gene. The color of the arrows represents the effect of the transcription factor, either as a positive or a negative regulator, on the target gene.
molecular biology field, enabling researchers to assay the expression of thousands or tens of thousands of genes in a single experiment. The use of DNA microarrays has thus greatly accelerated our ability to generate data regarding specific transcriptional regulatory networks. Moreover, the recently developed ultra high-throughput sequencing (UHTS) technology allows scientists to sequence cDNA libraries (RNA-seq) at unprecedented levels and generate tens of millions of sequence reads per experiment. In addition, with the recently developed chromatin immunoprecipitation DNA microarray (ChIP-chip) and the more recent chromatin immunoprecipitation sequencing (ChIP-seq) technologies, the DNA binding sites of transcription factors can now be determined in a high-throughput manner. With the available information on the connectivities of the transcriptional regulatory network, and the ability to assay the transcriptome on a genome-wide scale, we can start to use the measured transcriptional profiles to gain further understanding of cellular behavior. Utilizing DNA microarrays or the new RNA-seq methods, one can obtain the expression of all known (and novel) transcripts in an organism under any condition of interest. However, the underlying physiological perturbations that result in the gene expression profiles cannot always be easily determined, due to the complexity of the transcriptional regulatory network. Certain signaling molecules already have assays developed for their determination, and thus can be used in conjunction with the gene expression profiles to show their specific role(s) in changing the metabolic state of the cell in response to these perturbations. Unfortunately, the activities of the majority of transcription factors cannot be determined experimentally. Thus, systems biology approaches are necessary to determine the cellular perturbations associated with the environmental condition. Since each transcript level is determined by the combined activities of its regulators, it is possible to deconvolute the transcriptional profiles from DNA microarrays using Network Component Analysis (NCA) [2] and obtain the activity profiles of transcription factors. Based on which transcription factors are more active or less active under the conditions of interest, we can gain insight into how specific signaling or cellular pathways are perturbed. This information will help us to, for example, better understand the underlying metabolic and cellular responses to perturbations, such as a drug effect, and may lead to identification of potential drug targets.
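Conceptually, NCA models the matrix of measured log-expression ratios E (genes x samples) as the product of a connectivity-constrained control-strength matrix A (genes x TFs) and a TF activity matrix P (TFs x samples). The following R sketch is only a toy illustration of this matrix-decomposition idea on simulated data; it is not the published NCA algorithm (which adds identifiability criteria and normalization [2]), and all names, dimensions, and values below are invented for the example.

# Toy illustration of the decomposition idea behind NCA: E ~ A %*% P,
# with A constrained to the known connectivity pattern C.
set.seed(1)
genes <- 20; tfs <- 3; samples <- 6
C <- matrix(rbinom(genes * tfs, 1, 0.4), genes, tfs)        # hypothetical connectivity (0/1)
P_true <- matrix(rnorm(tfs * samples), tfs, samples)        # hypothetical "true" TF activities
A_true <- C * matrix(runif(genes * tfs, 0.5, 2), genes, tfs)
E <- A_true %*% P_true + matrix(rnorm(genes * samples, sd = 0.1), genes, samples)

A <- C * 1.0                                                # initial guess respecting connectivity
for (iter in 1:200) {
  P <- solve(t(A) %*% A, t(A) %*% E)                        # update activities given A
  for (i in 1:genes) {                                      # update A row-wise, only where C != 0
    idx <- which(C[i, ] != 0)
    if (length(idx) > 0) {
      Pi <- P[idx, , drop = FALSE]
      A[i, idx] <- solve(Pi %*% t(Pi), Pi %*% E[i, ])
    }
  }
}
sapply(1:tfs, function(j) cor(P[j, ], P_true[j, ]))         # recovered activity profiles (up to scale and sign)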
13.2 Methods

13.2.1 Materials

General equipment
1. Microcentrifuge
2. Floor centrifuge
3. Clinical benchtop centrifuge
4. Fluorometer or spectrophotometer

Resources
1. MATLAB
2. Connectivity information for known transcription factors
13.2.2 Cell harvesting

Grow the cells under the condition of interest and harvest them. The cells must be harvested in such a manner that the effect of harvesting on the transcriptional profile is kept to a minimum. A filtration method is described below for yeast and E. coli.

13.2.2.1 For yeast
1. Attach a 0.45-μm analytical test filter funnel (Nalgene) to a vacuum manifold.
2. Fill a 50-ml conical tube approximately half full with liquid nitrogen, placed in a bucket of dry ice.
3. Quickly filter the culture through the filter funnel.
4. Snap off the top of the filter funnel, carefully peel off the filter containing the cell cake with a pair of clean tweezers, and immediately place it in the conical tube.
5. Allow the liquid nitrogen to evaporate, then tighten the lid and store at −80°C until ready for use.

13.2.2.2 For E. coli
1. Attach a 0.45-μm analytical test filter funnel (Nalgene) to a vacuum manifold.
2. Fill a 50-ml conical tube with 5 ml of RNAlater.
3. Quickly filter the culture through the filter funnel, as in step 3 for yeast.
4. Snap off the top of the filter funnel, peel off the filter containing the cell cake, and place it in the conical tube, as in step 4 for yeast.
5. Make sure the filter cake is completely submerged in RNAlater and mix to resuspend the cells, then store at −80°C until ready for use.
13.2.3 RNA purification

Reagents
2X RNA loading buffer (recipe from New England Biolabs), stored at –20°C:
   0.02% Bromophenol Blue
   26% Ficoll (w/v)
   14 M urea
   4 mM EDTA
   180 mM Tris-borate, pH 8.3 @ 25°C

Depending on the type of cells used, the RNA isolation method may vary. The methods described below are for yeast and E. coli. Care should be taken whenever working with RNA to ensure the integrity of the sample, as ribonucleases (RNases) are very stable and do not require cofactors to function. Since our skin contains RNases, gloves should be worn whenever working with RNA. Make sure all reagents used are made with molecular biology grade or DEPC-treated water. Use filtered tips that are certified nuclease-free. Purified RNA should be stored at −80°C.

13.2.3.1 For yeast
Before starting, set the temperature of a water bath to 60°C.
1. Thaw the sample on ice.
2. Add 5 ml of RNA buffer and 5 ml of acid phenol to each conical tube.
3. Incubate in a water bath at 60°C for 1 hour, vortexing vigorously every 10 to 20 minutes.
4. Place on ice for 10 minutes.
5. Centrifuge in a floor centrifuge at 10,000 rpm for 10 minutes at 4°C.
6. Transfer the aqueous phase (top layer) to a new 15-ml conical tube.
7. Add 5 ml of acid phenol to the sample and vortex well.
8. Centrifuge in a benchtop centrifuge for 10 minutes at 3,500 rpm.
9. Transfer the aqueous phase to a new 15-ml conical tube.
10. Add 5 ml of chloroform to the sample and vortex well.
11. Centrifuge for 10 minutes at 3,500 rpm.
12. Transfer the aqueous phase to a new 50-ml conical tube (you should extract approximately 4.5 ml of aqueous solution).
13. Add 450 μl of 3M sodium acetate (pH = 5.2) and 10 ml of 100% ethanol.
14. Mix and precipitate overnight at –20°C.
15. Spin down in a floor centrifuge at 10,000 rpm for 20 minutes at 4°C to pellet the RNA.
16. Discard the supernatant and rinse the pellet with ice cold 70% ethanol to remove salt (make sure not to disturb the pellet).
17. Air dry the pellet (using a sterile pipette tip attached to a vacuum to remove as much liquid as possible will greatly expedite the drying process).
18. The pellet should start to clear as it dries; resuspend the pellet in 200 to 500 μl of nuclease-free water and transfer to a 1.5-ml microcentrifuge tube.

13.2.3.2 For E. coli
1. Thaw the sample on ice.
2. Spin down the sample at 8,000 rpm at 4°C for 5 minutes.
3. Remove the filter and the RNAlater solution.
4. Add 800 μl of TE + 1 mg/ml lysozyme to dissolve the cell wall.
5. Add 80 μl of 10% SDS to lyse the protoplasts and denature proteins.
6. Heat at 64°C for 1 to 2 minutes to lyse the cells (the solution turns clear upon complete lysis).
7. Add 88 μl of 1M sodium acetate at pH = 5.2 to precipitate out the nucleic acids.
8. Add 960 μl of water-saturated acid phenol at pH = 4.3 to isolate total RNA from other cellular components (low-pH phenol is used because RNA is stable at low pH and DNA moves to the organic phase at low pH).
9. Invert the tubes 10 times.
10. Incubate at 64°C for 6 minutes, inverting the tube 10 times every 40 seconds.
11. Chill on ice for 5 to 10 minutes.
12. Centrifuge at 14,000 rpm for 10 minutes at 4°C.
13. Recover the aqueous phase (top layer), which contains the RNA, without disturbing the thick white interphase.
14. Add an equal volume of chloroform to the recovered RNA (this removes residual phenol).
15. Mix by inverting 10 times.
16. Centrifuge at maximum speed for 5 minutes at 4°C.
17. Recover the aqueous phase (~600 μl) into a 15-ml conical tube.
18. Follow the instructions from the RNeasy Midi kit (Qiagen) to clean up the RNA; elute twice with 200 μl of RNase-free water, ending with 350 μl of total RNA.
13.2.3.3 DNaseI digestion to remove residual DNA
1. Mix the following components in a microcentrifuge tube:
   350 μl    Total RNA
   39 μl     10X React 3 buffer (New England Biolabs)
   0.5 μl    RNaseOUT RNase inhibitor (Invitrogen) (20 units)
   0.5 μl    DNaseI (Invitrogen)
   390 μl    Total
2. Incubate at 37°C for 20 to 30 minutes.
3. Add 390 μl of acid phenol/chloroform (pH = 4.3) to denature the enzymes.
4. Recover the aqueous layer.
5. Add 39 μl of 3M sodium acetate (pH = 5.2).
6. Mix well by vortexing.
7. Add 1,080 μl (2.5X volume) of ice cold 100% ethanol to precipitate out the RNA.
8. Place at −80°C for >30 minutes to precipitate out as much RNA as possible.
9. Spin at 4°C for 15 to 20 minutes.
10. Air dry after removing as much supernatant as possible using a sterile pipette tip attached to a vacuum.
11. Resuspend in 10 to 20 μl of nuclease-free water.
13.2.3.4 Quantify and check RNA quality
1. Quantify by fluorometer with RNA-specific dyes (Qubit by Invitrogen is a good system) or by spectrophotometer.
2. Run 1 μg of total RNA on a 1% agarose gel in 1X TBE:
   i. Mix RNA with an equal volume of RNA loading dye.
   ii. Denature at 65°C for 10 minutes, then immediately chill on ice.
   iii. Run the gel to make sure the 28S and 18S (yeast) and the 23S and 16S (E. coli) ribosomal RNA bands are sharp and not smeary.
13.2.4 Transcriptional profiling using DNA microarrays

Several different types of DNA microarrays are widely used currently, including homemade spotted arrays and manufactured arrays such as Affymetrix GeneChips and Agilent and NimbleGen arrays. Depending on the array platform used, different labeling and hybridization protocols apply. The labeling and hybridization protocol described in the first part is for spotted arrays using amino-silane-coated glass slides. In this protocol, the fluorescent dyes are incorporated during the cDNA synthesis process using a reverse transcriptase, where the reference is labeled with either cy3 or cy5 and the experimental sample is labeled with the other.

Reagents
1. Labeling dNTP mix:
   i. 0.5 mM dATP, dCTP, dGTP
   ii. 0.2 mM dTTP
2. Hybridization chambers
3. Staining dishes
4. Microarray scanner
5. Hybridization and washing stations (Affymetrix)
6. Additional reagents and manufacturers are listed under each protocol section
13.2.4.1 Labeling
1. Combine the following in 12 μl total volume:
   1.5 μl    Random hexamer (stock solution at 3 μg/μl)
   10 μl     Total RNA sample (total 30 μg)
2. Denature RNA at 70°C for 10 minutes.
3. Immediately chill on ice.
4. Add the following together:
   11.5 μl   30 μg total RNA + 4.5 μg random hexamer
   5 μl      5X first strand buffer
   2.5 μl    DTT
   2.5 μl    Labeling dNTP mix
   2.5 μl    Total of 2.5 nmol of cy3-dUTP or cy5-dUTP
   25 μl     Total volume
5. Preheat the solution at 42°C for 2 minutes.
6. Add 2 μl of Reverse Transcriptase II (RNase H–) (200 U/μl) and mix well.
7. Label at 42°C for 1 to 2 hours in the dark for the cDNA synthesis.
8. Stop the reaction by adding 2.5 μl of EDTA (pH = 8.0) and mix well.
9. Hydrolyze the RNA by adding 5 μl of 1N sodium hydroxide, mix well, and incubate at 65°C for 40 minutes.
10. Add 150 μl of TE buffer (pH = 8.0) to each of the cy3- and cy5-labeled cDNA samples.
11. Combine the cy3- and cy5-labeled samples, then remove unincorporated dyes and concentrate using a Microcon-30 column to approximately 2 μl:
   i. Apply the sample to the Microcon-30 column.
   ii. Spin at maximum speed in a microcentrifuge for 12 minutes at room temperature and discard the flow-through.
   iii. Add 300 μl of TE buffer to the column.
   iv. Repeat steps ii and iii.
   v. Spin at maximum speed for 12 minutes.
   vi. Check the color of the flow-through; if the flow-through is not clear or nearly clear, repeat the washing.
   vii. Invert the column into a new microcentrifuge tube and recover the sample by centrifugation for 1 minute (the recovered sample should be a dark purple color).
13.2.4.2 Hybridization
1. Make the hybridization solution (for a 22 × 40-mm printed microarray slide):
   15 μl     Formamide (kept at 4–8°C)
   4.5 μl    20X SSC (filtered through a 0.22-μm filter)
   3 μl      10% SDS (filtered through a 0.22-μm filter)
   3 μl      10X Denhardt's solution (prevents nonspecific hybridization)
   4.5 μl    Blocking DNA (1:1 yeast tRNA (10 μg/μl) : salmon sperm DNA (10 μg/μl))
2. Add the hybridization solution to the labeled sample and denature at 95°C for 3 minutes.
3. Let stand at room temperature for 5 minutes.
4. Collect the contents by brief centrifugation.
5. Carefully pipette 25 μl of sample onto the active side of the DNA microarray slide (usually the side with the barcode, but this may differ depending on the array manufacturer).
6. Using clean tweezers, place a glass cover slip over the sample (be careful not to introduce any bubbles; if bubbles are visible under the cover slip, carefully lift one corner of the cover slip and the bubbles will usually migrate to the edge and be removed).
7. Fill the appropriate holes in the hybridization chamber with 3X SSC or water to keep the sample hydrated during hybridization.
8. Place the slide inside the hybridization chamber and make sure the chamber is securely tightened, then hybridize for 12 to 16 hours (overnight) in a 42°C water bath.
13.2.4.3 Washing and scanning
1. Remove the slide from the hybridization chamber and dip it in a staining dish containing 0.2X SSC + 0.1% SDS (filtered with a 0.22-μm filter) several times to allow the cover slip to fall off (never forcefully pry the cover slip off, as this may damage the array).
2. Wash in 0.2X SSC + 0.1% SDS (filtered with a 0.22-μm filter) for 5 minutes with occasional agitation (be careful not to touch the active side of the slide against anything, to prevent scratching).
3. Immediately transfer to a new staining dish containing 0.2X SSC and wash for 5 minutes with occasional agitation (do not allow the slide to dry during transfer, as this may result in streaking on the array).
4. Immediately transfer to a new staining dish containing 0.02X SSC and wash for 5 minutes with occasional agitation.
5. Quickly dry in a tabletop centrifuge at < 2,000 rpm for 5 minutes.
6. Scan using a DNA microarray scanner following the manufacturer's operating instructions (most scanners allow the user to modify the PMT gain or laser sensitivity; to maximize the dynamic range of the array signals, make sure the resulting microarray image contains only a few saturated spots).
The second microarray protocol is for Affymetrix GeneChip arrays. These are one-channel arrays, where the samples are labeled with biotin, and the reference and experimental samples are hybridized to separate arrays.
13.2.4.4 For mRNA
1. Enrich poly(A) RNA from 1 mg of total RNA using the Oligotex mRNA Midi kit (Qiagen) following the manufacturer's instructions.
2. Repeat the poly(A) RNA enrichment for a total of two rounds of enrichment.
3. Remove residual DNA from the sample by treating the poly(A) RNA with the Turbo DNA-free kit (Ambion) following the manufacturer's instructions. Take care not to carry over any inactivating agents in the last step.
4. Quantify the poly(A) RNA concentration using a fluorometer or a spectrophotometer.
5. Mix the following together:
   x μl      9 μg of poly(A) RNA
   1.5 μl    3 μg/μl random hexamer (total 4.5 μg)
   y μl      Molecular biology grade water
   125 μl    Total
6. Incubate at 70°C for 10 minutes.
7. Immediately chill on ice.
8. Mix the following on ice:
   125 μl    Poly(A) RNA and random hexamer
   40 μl     5X first strand buffer (Invitrogen)
   20 μl     0.1 M DTT (Invitrogen) (0.01 M final)
   5 μl      10 mM dNTP mix (0.25 mM final)
   10 μl     Superscript II reverse transcriptase (200 U/μl) (Invitrogen) (2,000 U final)
   200 μl    Total volume
9. Carry out the reaction at 42°C for 1 hour.
10. Remove RNA by treating with the following:
   3 μl      RNase cocktail (Applied Biosystems) (final 15 U RNase A and 60 U RNase T1)
   6 μl      RNase H (New England Biolabs) (final 30 U)
11. Incubate for 20 minutes at 37°C.
12. Purify by extraction with 210 μl of buffer-saturated phenol:chloroform (1:1) solution.
13. Mix well.
14. Centrifuge at maximum speed for 2 minutes.
15. Extract the aqueous phase (top layer).
16. Add 21 μl of 3M sodium acetate (pH 5.2) and mix well.
17. Add 500 μl of 100% ethanol.
18. Mix well and precipitate at –20°C for at least 30 minutes.
19. Spin at 4°C for 10 minutes to pellet the first strand cDNA.
20. Remove the supernatant.
21. Rinse twice with 500 μl of ice cold 80% ethanol.
22. Air dry.
23. Resuspend in 25 μl of molecular biology grade water.
24. Quantify using a fluorometer or spectrophotometer (you will usually obtain 2 to 3 times as much cDNA as input total RNA).
25. Take 4.5 μg of cDNA and fragment to 50 to 100 bp using DNaseI (this takes some trial and error, as DNaseI activity may differ slightly depending on the batch and age of the enzyme):
   i. Mix together the following in 50 μl total volume on ice:
      a. 4.5 μg of cDNA;
      b. 5 μl of 10X OnePhorAll buffer (Amersham Pharmacia);
      c. 1.5 μl of 50 mM CoCl2;
      d. DNaseI (start with 0.1 U).
   ii. Fragment at 37°C for 5 minutes (best done in a thermal cycler).
   iii. Inactivate the DNaseI at 99°C for 15 minutes (best done in a thermal cycler).
   iv. Run 5 μl of the sample on a 2% agarose gel along with unfragmented cDNA (usually a smear between approximately 100 and 1,600 bp) and make sure the DNaseI-digested cDNA has a size distribution between 50 and 100 bp.
26. Label via 3' end labeling with biotin using Terminal Transferase:
   45 μl     Digested cDNA
   3.5 μl    1 nmol/μl Biotin-11-ddATP (Perkin Elmer) (0.07 mM final concentration)
   1 μl      400 U/μl Terminal Transferase (Roche Applied Science)
27. Label at 37°C for 2 hours.
13.2.4.5 For total RNA
1. Remove residual DNA from the sample by treating the total RNA with the Turbo DNA-free kit (Ambion) following the manufacturer's instructions. Take care not to carry over any inactivating agents in the last step.
2. Quantify the total RNA concentration using a fluorometer or a spectrophotometer.
3. Mix the following together:
   x μl      20 μg of total RNA
   9 μl      3 μg/μl random hexamer (total 9 μg)
   y μl      Molecular biology grade water
   120 μl    Total
4. Incubate at 70°C for 10 minutes.
5. Immediately chill on ice.
6. Mix the following on ice:
   120 μl    Total RNA and random hexamer
   40 μl     5X first strand buffer (Invitrogen)
   20 μl     0.1 M DTT (Invitrogen) (0.01 M final)
   10 μl     10 mM dNTP mix (0.5 mM final)
   10 μl     Superscript II reverse transcriptase (200 U/μl) (Invitrogen) (2,000 U final)
   200 μl    Total volume
7. Carry out the reaction at 42°C for 1 hour.
8. Remove RNA by following steps 10 to 24 from the mRNA protocol.
9. Fragment 15 μg of cDNA using 0.6 units of DNaseI and end label with terminal transferase following steps 25 to 27 from the mRNA protocol.
13.2.4.6 Hybridization Hybridize cDNA to GeneChip arrays, wash, and scan following manufacturer’s instructions.
13.3 Data Acquisition, Anticipated Results, and Interpretation

13.3.1 Acquisition of DNA microarray data
The signal intensities in the cy3 (green) and cy5 (red) channels for each spot (gene) represented on the microarray are proportional to the gene expression ratios between the two samples. Several software packages are available for retrieving the signal intensity data for each spot (gene) on the array; examples include Imagene (Biodiscovery), GenePix Pro (Molecular Devices, formerly Axon), and free software such as ScanAlyze, developed by Mike Eisen's group at Berkeley, and Spotfinder from TIGR. Most analysis software has efficient gridding algorithms for finding each spot. However, it is important for each researcher to manually look over the spots identified by the algorithm, as some arrays may have shifted grids/spots that the program is not able to find accurately. Most software will output both the mean and the median signal intensities for each spot and for the background surrounding the spot. It is better to use the median intensity, since it is less sensitive to dust and other small defects on the array. Output the signal and background intensities from the program, then use the array definition files provided by the manufacturer of the array to match the annotation to each spot (usually output as coordinates within the metagrid and the grid). Some image analysis software allows users to input an array definition file, in which case the output will already contain the information for each spot (e.g., gene name, sequence, function). For Affymetrix GeneChips, follow the manufacturer's instructions for image analysis.
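As a purely hypothetical illustration of this step, the short R sketch below computes background-corrected log2 ratios from a spot-level intensity table. The file name and column names (cy5_median, cy5_bg_median, cy3_median, cy3_bg_median, gene_id) are placeholders for whatever your image analysis software exports, and it assumes the experimental sample was labeled with cy5.

spots <- read.delim("spot_intensities.txt", stringsAsFactors = FALSE)   # hypothetical export

floor_val <- 1                                           # avoid log of zero or negative values
cy5 <- pmax(spots$cy5_median - spots$cy5_bg_median, floor_val)
cy3 <- pmax(spots$cy3_median - spots$cy3_bg_median, floor_val)

spots$log2_ratio <- log2(cy5 / cy3)                      # experimental (cy5) over reference (cy3)
write.table(spots[, c("gene_id", "log2_ratio")], "raw_log_ratios.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)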
13.3.2 Normalization
Due to variations in the amount of starting RNA, cDNA synthesis efficiency, and dye-specific and laser-specific effects, the raw data need to be normalized in order to obtain an accurate ratio between the two samples. There are a variety of different normalization methods. In general, without external controls, Lowess normalization is considered a reasonable method for normalizing microarray data. Freely distributed software, such as lcDNA, developed as a collaboration between the groups of Wing Wong (Stanford University) and James Liao (UCLA), and MIDAS from TIGR, has options for filtering bad spots and for normalizing the data. For Affymetrix GeneChips, software such as the Affymetrix GeneChip Operating Software or dChip is used for normalization, quality control, and obtaining normalized ratios of the data. Multiple replicates are necessary to assess the quality of the data and to assign significance to the expression ratios obtained for each gene in each experiment. Significance Analysis of Microarrays (SAM) from the Tibshirani group at Stanford and lcDNA are both good software packages for
finding genes that are significantly induced or repressed. With both of these, the genes that have log ratios statistically significantly different from 0 are considered to be either induced or repressed.
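For readers who want to see the idea behind the Lowess normalization mentioned above rather than use lcDNA or MIDAS directly, the following R sketch fits an intensity-dependent trend to MA-transformed data with base R's loess and subtracts it. The simulated intensities and the span value are arbitrary illustrative choices; this is not the lcDNA or MIDAS implementation.

set.seed(3)
cy3 <- 2^rnorm(5000, mean = 10, sd = 2)                              # simulated background-corrected intensities
cy5 <- cy3 * 2^(0.3 * (log2(cy3) - 10) + rnorm(5000, sd = 0.3))      # with an intensity-dependent dye bias

M <- log2(cy5 / cy3)                       # log ratio
A <- 0.5 * log2(cy5 * cy3)                 # average log intensity

fit <- loess(M ~ A, span = 0.4, degree = 1)
M_norm <- M - fitted(fit)                  # remove the intensity-dependent trend

plot(A, M, pch = ".", main = "Before normalization"); abline(h = 0, col = "red")
plot(A, M_norm, pch = ".", main = "After Lowess normalization"); abline(h = 0, col = "red")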
13.3.3 Network Component Analysis (NCA)
NCA utilizes knowledge of transcription factor (TF)-gene interactions, called connectivity information, within the organism of interest to deconvolute gene expression data and obtain an estimate of the activities of the TFs. The approach is well suited to well-studied organisms like E. coli and yeast (S. cerevisiae), since extensive connectivity information is required for the success of this type of analysis. The method has recently been applied to mammalian organisms such as mouse (M. musculus) and human (H. sapiens), whose connectivity information is available in databases such as TRANSFAC and the Transcriptional Regulatory Element Database (TRED). NCA is particularly useful for examining the effects of drugs over a time course or during an environmental switch. The software can be downloaded from http://www.seas.ucla.edu/~liaoj/download.htm. Figure 13.2 summarizes the steps of the approach. The following paragraphs describe the analysis procedure in detail.
1. Preprocess the inputs. The approach requires two inputs: the connectivity information and the gene expression data in log ratio form.
1.1 Connectivity information: The information for E. coli and S. cerevisiae can be downloaded at the above Web site; however, it may not be up to date, so more recent connectivity information may need to be added. In general, the connectivity information is formatted in tabular form, with the first row and the first column listing the names of TFs and genes, respectively. If a connection exists between TF j and gene i, the cell corresponding to row i and column j is filled with any nonzero number; otherwise, it is filled with 0. Note that the connectivity information for a gene cannot be described by more than one row.
Figure 13.2 Flowchart of the NCA procedure. The preprocessed connectivity information and gene expression data are matched, an NCA-compliant network is selected (along with random networks for the permutation test), NCA is applied to obtain the TFAs and their null distributions, and the perturbed TFs are identified.
The connectivity information file can be saved in Excel or tab-delimited text format.
1.2 Gene expression data: Similar to the connectivity information, the gene expression data acquired after normalization are arranged in tabular form, with the header row and column providing the experiment and gene names, respectively. Note that the gene identity system in the expression data must be consistent with that in the connectivity information. The expression profile of a gene must also be represented by one row. If a gene has multiple probesets, the input expression profile can be the average of all probesets or of the best correlated probesets. Genes with missing data points should be eliminated from the data. Since NCA is applied to deconvolute gene expression in log ratio form, data from Affymetrix array chips, which are single-channel arrays, must be converted to log ratios. Therefore, one sample, such as the wild-type or t = 0 sample, is selected as the reference for comparison to the other samples. In general, the signal intensities are already in logarithmic form after being normalized by Robust Multi-array Average (RMA) or the Affymetrix GeneChip software, which means the log ratios are calculated by subtracting the data of the reference sample from the others. It is recommended that biological repeat arrays be averaged first, before calculating the log ratios, to filter out extreme log ratio values. The generated log ratios can then be treated like those from two-channel arrays (a short R sketch illustrating both input formats follows this procedure). The gene expression data can be saved in Excel or tab-delimited text format. Both files are then imported into the MATLAB workspace through the NCA GUI toolbox.
2. Match gene expression and connectivity information on a genome scale. This task can be performed by the GUI toolbox to obtain the network composed of genes having both connectivity information and gene expression.
3. Select the NCA-compliant network. The GUI toolbox can select the NCA-compliant network based on all TFs (the default) or on specific TFs selected by the user.
4. Deconvolute the network gene expression to obtain the profiles of TF activities (TFAs) by NCA. Different initial guesses (e.g., for the A matrix) are used to robustly estimate the activities of TFs with the NCA numerical algorithm. It is recommended to use n > 10 different initial guesses.
5. Permutation tests to assess the statistical significance of TFAs (optional). The goal of this step is to determine which TFs are statistically significantly perturbed compared to the reference conditions. In this analysis, the null distributions of TFAs are built by applying network component analysis to n (>50) random networks whose gene expression profiles are randomly sampled from whole-genome transcriptome data, while the connectivity information is kept the same as in the real network. The z-statistic is used to calculate the p-values of the TFAs obtained in step 4.
6. Export the profiles of TFAs to a text or Excel file for further studies. Note that the NCA method cannot identify whether a TF is an activator or a repressor, so the entire activity profile of a TF can be flipped by multiplying it by −1 if the user knows the biological information.
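The sketch below illustrates, with made-up gene and TF names, the two inputs described in step 1: a tab-delimited connectivity table, and log ratios computed from single-channel data by averaging biological replicates and subtracting the reference (t = 0) sample. It assumes RMA-normalized log2 intensities and is only a format illustration, not part of the NCA toolbox.

# 1) Connectivity table: rows = genes, columns = TFs, nonzero = known interaction.
conn <- data.frame(Gene = c("geneA", "geneB", "geneC"),
                   TF1  = c(1, 1, 0),
                   TF2  = c(0, 1, 1))
write.table(conn, "connectivity.txt", sep = "\t", quote = FALSE, row.names = FALSE)

# 2) Log ratios from single-channel arrays (hypothetical RMA-normalized log2 values).
expr <- matrix(rnorm(3 * 6, mean = 8), nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"),
                               c("t0_r1", "t0_r2", "t1_r1", "t1_r2", "t2_r1", "t2_r2")))
timepoint <- sub("_r[0-9]+$", "", colnames(expr))        # group biological replicates
avg <- sapply(unique(timepoint), function(tp) rowMeans(expr[, timepoint == tp, drop = FALSE]))
logratios <- avg[, c("t1", "t2")] - avg[, "t0"]          # subtract the reference (t0) sample
write.table(data.frame(Gene = rownames(logratios), logratios),
            "expression_logratios.txt", sep = "\t", quote = FALSE, row.names = FALSE)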
13.4 Discussion and Commentary

It is very important to have not only technical repeats of DNA microarray data, but also biological repeats. Technical repeats are the same sample hybridized to separate arrays, which captures slide-to-slide variation and differences in labeling and hybridization efficiencies. Biological repeats, however, are required to obtain biologically relevant results for the transcriptome. It is important to obtain statistical information on the data instead of using a strict fold cutoff, as some gene expression changes can be statistically significantly perturbed yet still not meet an arbitrary threshold cutoff. In addition, keep in mind that depending on the statistical algorithm and stringency used, different sets of genes can be selected as induced or repressed. If the experiment is looking for a small set of targets, a more stringent analysis (lower false discovery rate or higher confidence level) can be used. However, if the experiment is looking for an overall trend, then a more relaxed analysis can be used. The NCA toolbox provides three different numerical algorithms for decomposing the transcriptome data. Each one has its own advantages and disadvantages because of the tradeoff between computation time and the stability of the solutions. For example, the algorithm using QR factorization is fast, but its solutions are unstable, while the one using Tikhonov regularization is slow but provides stable solutions. The orthogonal algorithm has medium speed, and its solutions are less stable compared to the regularization algorithm. Therefore, if the data are less noisy and composed of many (>30) arrays, the orthogonal algorithm is the best candidate. The regularization algorithm is preferable if the number of data points is limited or if the connectivity information might contain many false positive and false negative connections. In addition, NCA should be applied to data collected from similar environmental conditions or tissues, because the approach assumes that the TF-gene relationships remain constant over the whole dataset. The NCA solutions become less stable when the method is applied to a dataset that combines data collected from different environmental conditions (e.g., rich media and heat shock) or tissues, because the TF-gene interactions are condition and tissue dependent. Since gene expression changes can occur rapidly upon shifting to a new environment (e.g., time course data following the addition of a drug), if the activities of transcription factors known to be transiently involved do not appear to be affected, it is possible that the sampling time scale should be reduced to capture any potentially interesting transient signaling. As with the analysis of any computationally predicted outcomes, it is important to provide experimental validation for the important conclusions drawn. For example, if a transcription factor is predicted to be active in a particular condition, then the use of a strain deleted for that transcription factor should result in a corresponding change in the TF activity profile obtained using NCA. Alternatively, if an assay is available for the detection of the TF activity, such as quantification of ligand concentration or of the active form of the transcription factor, then it should be used as validation.
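The following R sketch illustrates the point about using statistical significance rather than a fixed fold-change cutoff. It is not SAM or lcDNA, and the replicate log-ratio matrix is simulated purely for illustration.

set.seed(2)
logratio_reps <- matrix(rnorm(1000 * 4, sd = 0.5), nrow = 1000)  # genes x biological replicates
logratio_reps[1:50, ] <- logratio_reps[1:50, ] + 1               # spike in some induced genes

pvals <- apply(logratio_reps, 1, function(x) t.test(x)$p.value)  # is the log ratio different from 0?
padj  <- p.adjust(pvals, method = "BH")                          # Benjamini-Hochberg FDR adjustment

significant <- which(padj < 0.05)                                # genes called induced or repressed
length(significant)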
Troubleshooting Table

Problem: Degradation of RNA samples
Possible explanation: RNase contamination
Potential solutions: Use filtered pipette tips; clean the bench and all equipment with an RNase remover, such as RNaseZap (Applied Biosystems, Foster City, California).

Problem: Error or warning when running NCA
Possible explanation: Missing data points in the microarray or connectivity data
Potential solutions: Eliminate genes with missing data, or fill missing values using an imputation algorithm, such as K-nearest-neighbor (KNN) imputation.

Problem: Activity profile nonzero for a disrupted transcription factor
Possible explanation: The disrupted TF needs to be specified in NCA
Potential solutions: After step 3 in Section 13.3.3, select "Select TF Constraints," add the TF and the appropriate microarray datasets to the list of available constraints. Then click "Save Constraints," and run NCA.

13.5 Application Notes

When cells are subjected to an environmental or genetic perturbation, it is difficult to determine the coordinated regulatory responses to these perturbations. Network Component Analysis of a series of time course microarray data successfully determined the transcription factors involved when E. coli transitions from utilizing glucose to acetate as the sole carbon source. The transient activities of key transcription factors elucidated the coordinated regulatory response of the cell during this metabolic switch. NCA analysis helped to identify a key intracellular metabolite responsible for the prolonged growth lag phase during this carbon source transition in a mutant deficient in a gluconeogenic gene [3]. The role of the intracellular metabolite in the observed phenotype of the mutant was confirmed via experimental validation.
13.6 Summary Points
• It is important to have multiple biological and technical replicates of each microarray experiment to ensure statistical significance for genes called induced or repressed.
• The connectivity information between transcription factors and the genes they regulate must be available for this analysis.
• This analysis is best used with time course data, as NCA is designed to determine the transient activities of transcription factors.
• The user can control the sign of the transcription factor activities if the regulator is known to be an activator or a repressor.
References
[1] Schena, M., D. Shalon, R.W. Davis, and P.O. Brown, "Quantitative monitoring of gene-expression patterns with a complementary-DNA microarray," Science, Vol. 270, No. 5235, 1995, pp. 467–470.
[2] Liao, J.C., R. Boscolo, Y.L. Yang, L.M. Tran, C. Sabatti, and V.P. Roychowdhury, "Network component analysis: Reconstruction of regulatory signals in biological systems," Proc. Natl. Acad. Sci. USA, Vol. 100, No. 26, 2003, pp. 15522–15527.
[3] Kao, K.C., L.M. Tran, and J.C. Liao, "A global regulatory role of gluconeogenic genes in Escherichia coli revealed by transcriptome network analysis," Journal of Biological Chemistry, Vol. 280, No. 43, 2005, pp. 36079–36087.
CHAPTER 14
A Workflow from Time Series Gene Expression to Transcriptional Regulatory Networks

Rajanikanth Vadigepalli
Daniel Baugh Institute for Functional Genomics and Computational Biology, Department of Pathology, Anatomy, and Cell Biology, Room 381 JAH, Thomas Jefferson University, 1020 Locust Street, Philadelphia, PA 19107; phone: (215) 955-0576; fax: (215) 503-2636; e-mail: [email protected]
Abstract An integrated approach is presented here to analyze the time series microarray gene expression data that extends beyond the list of differentially expressed genes and focuses on the characterization of their transcriptional regulation. In the present approach, the differentially expressed genes are identified through a local false discovery rate and the resultant time series data is analyzed in a robust clustering scheme. The expression clusters are then analyzed using the PAINT bioinformatics tool to uncover shared transcriptional regulation. This integrated approach enables transformation of descriptive data on gene expression to functional mechanisms underlying regulation of the observed dynamic profiles.
Key terms
Gene expression data analysis
Transcriptional regulatory network analysis
Computational negative control
Clustering
Promoter analysis
Gene regulation
Network analysis
14.1 Introduction

In a typical global gene expression profiling study considering the dynamic response of a biological system, microarrays are used to monitor changes in gene expression at different time points under a perturbation as compared to paired controls at each time point. These time points provide information on the course of gene regulatory events during the response. An integrated approach is presented here to analyze time series microarray gene expression data that extends beyond the list of differentially expressed genes and focuses on the characterization of their transcriptional regulation, which is one of the key mechanisms by which protein expression changes are controlled. In this approach, the differentially expressed genes are identified through an ANOVA [1, 2] and local false discovery rate (FDR) based approach [3], and the resultant time series data is analyzed in a robust clustering scheme termed computational negative control [4–6]. The expression clusters are then analyzed using the PAINT bioinformatics tool [1, 2, 6–13] to uncover shared transcriptional regulation potentially shaping the observed dynamical expression patterns. In a typical clustering approach, a key consideration is that the number of clusters is user-specified (e.g., as in K-means and so forth), and hence there could be genes that are considered "incorrectly clustered" for a given number of partitions. The computational negative control approach overcomes this limitation by scanning a range of user-specified numbers of clusters and choosing the maximum number of patterns that are well distinguishable from clustering of randomized data. The original expression time series is permuted to generate a randomized data set that is comparable in data range and other overall statistics to the original data set. The set of clusters of original gene expression time series that are well distinguishable from the randomized data is chosen for subsequent transcriptional regulatory network analysis using PAINT.
Candidate transcription factors (TFs) responsible for the differential expression profiles of the dynamically responsive genes are characterized using the Promoter Analysis and Interaction Network Toolset (PAINT) software, available online at http://www.dbi.tju.edu/PAINT [8]. The concept driving the analysis in PAINT is that many coexpressed genes share regulatory elements, typically TF binding sites (or transcriptional regulatory elements; TREs), in their promoters, leading to coregulation. PAINT uses bioinformatics in combination with robust statistical approaches to identify the significantly enriched TREs in the promoters of the genes of interest (e.g., gene groups from cluster analysis of expression data). The TRE enrichment is based on a higher-than-random frequency as determined by a Fisher's Exact Test. A key aspect of the analysis is the unbiased approach, which considers all known TF binding sites as equally probable candidates and winnows down the list of TFs from hundreds to a relatively small panel of TFs that could play a role under these experimental conditions. PAINT can also be used to simultaneously analyze multiple groups of genes (e.g., from cluster analysis of multicondition microarray data). In this case, the TRE enrichment analysis is performed for each individual cluster as compared to the specified reference as well as to the entire input list itself (i.e., all clusters combined).
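As an aside, the kind of over-representation test PAINT performs can be illustrated with base R's fisher.test. The counts below are invented solely for illustration and do not come from any real PAINT output.

tre_in_cluster   <- 12    # cluster promoters containing the TRE (hypothetical)
cluster_size     <- 40
tre_in_reference <- 150   # reference promoters containing the TRE (hypothetical)
reference_size   <- 2000

counts <- matrix(c(tre_in_cluster, cluster_size - tre_in_cluster,
                   tre_in_reference, reference_size - tre_in_reference),
                 nrow = 2,
                 dimnames = list(c("TRE present", "TRE absent"),
                                 c("cluster", "reference")))
fisher.test(counts, alternative = "greater")$p.value   # p-value for over-representation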
This functionality is employed in the workflow detailed below to analyze the clusters of differential gene expression time series data. In PAINT, the clustered data is visualized as a matrix layout with the hierarchical tree structure aligned to the rows and the columns of the Feasnet. The zeros in the matrix are shown in black, and the nonzero entries in the Feasnet are colored based on the p-value of the corresponding TRE. The brightest shade of red represents a low p-value (most significantly over-represented in the Feasnet). Conversely, the brightest shades of cyan represent smaller p-values for under-representation in the observed Feasnet, indicating more significantly under-represented TREs. This image can optionally represent the cluster index of each gene, where such cluster indices are generated from other sources such as expression or annotation-based clustering. With such visualization, it is straightforward to explore the relationship between expression-based clusters and those based on cis-regulatory pattern (i.e., the Feasnet).
Different aspects of the integrated workflow presented above have been applied to study microarray gene expression time series data in a number of biological problems. ANOVA has been widely used in the analysis of microarray gene expression data [14, 15]. The computational negative control approach has been used to study gene expression dynamics during erythroid development [5] and rat liver regeneration [4]. PAINT has been employed in studying coordinated gene regulation in a wide range of systems including neuronal differentiation, neuronal adaptation, blood cell development, retinal injury, brain stroke, bladder inflammation, and liver regeneration [4, 5, 9–12, 16–18].
14.2 Materials 1. Normalized microarray gene expression time series data file (user-provided). Organize the normalized microarray data in a tab-delimited plain text file with the first column containing the Gene Identifier (e.g., any one of Accession Number, Affymetrix Probeset ID, Clone ID, and so forth), and the remaining columns corresponding to the normalized data values, one column per sample. The columns should be ordered with samples for each time point grouped together in adjacent columns, and within each time point, data from biological replicates are ordered as treatment and control samples in adjacent columns. Name this file as ArrayTimeSeriesData.txt. An example data file based on the cDNA microarray time series data described in [4] is available in the Supplement (exArrayTimeSeriesData.txt). To use the example file, copy it to a new plain text file with the name ArrayTimeSeriesData.txt. 2. Experimental Design Matrix for ANOVA (user-provided). Organize this information in a tab-delimited plain text file with three rows corresponding to three factors (Treatment, Timepoint, and AnimalPair). The first column should contain the row headers Treatment, TimePoint, AnimalPair (one per row), and the remaining columns correspond to the samples organized in the same order of columns as the normalized microarray data file (Item 1). The values in the table indicate the particular level for each of the three factors (e.g., Table 14.1). Name this file as DesignData.txt. An example data file is available in the Supplement (exDesignData.txt). To use the example file, copy it to a new plain text file with the name DesignData.txt. 3. Statistical analysis software. Download and install the R Project for Statistical Computing available at http://www.r-project.org. 4. R scripts for differential gene expression analysis. The required files for differential gene expression analysis using mixed-effects ANOVA, false discovery rate analysis, and the input data files (Items 1 and 2) are available in the Supplement
(DiffExpAnalysis.R, localFDRanalysis.R, and IdentifyDiffExpGenes.R). Save these files in the same directory as the input data files.

Table 14.1 A Typical Experimental Design Matrix Required for Mixed-Effects ANOVA Analysis of Time Series Gene Expression Microarray Data

Treatment    P    C    P    C    P    C    P    C    P    C    P    C
TimePoint    1h   1h   1h   1h   1h   1h   2h   2h   2h   2h   2h   2h
AnimalPair   P1   P1   P2   P2   P3   P3   P4   P4   P5   P5   P6   P6

P and C represent perturbation and control, respectively.

5. R script for robust cluster analysis. The required R code for performing robust clustering using the computational negative control approach is available in the Supplement (robustClustering.R).
6. Gene Level Identifier Resources. The SOURCE tool for conversion between various gene identifiers can be found at http://source.stanford.edu. Ensembl gene identifiers can be obtained using the BioMart function at http://www.ensembl.org/Multi/martview.
7. Gene List Input Data File. This is a single column list of gene identifiers, one identifier per line, in a plain text file. This file is generated as part of the analysis procedure detailed below, in formatting the statistical analysis results for promoter analysis using PAINT. An example file available in the Supplement (exGeneList.txt) is needed only if the differential gene expression analysis is bypassed to directly start with promoter analysis.
8. Cluster Membership Data File. This is a tab-delimited plain text file with two columns. The first column must contain one gene identifier per row, and the second column must contain a corresponding single word alphanumeric cluster label. This file is generated as part of the analysis procedure detailed below, in formatting the statistical analysis results for promoter analysis using PAINT. An example file available in the Supplement (exClusterInfo.txt) is needed only if the differential gene expression analysis is bypassed to directly start with promoter analysis.
9. Transcription Factor Binding Site Data. The descriptions of transcription factor binding sites for use with PAINT are provided in two forms from Biobase International. A publicly available database is available through http://www.gene-regulation.com. A commercially licensed version is available from http://www.biobase-international.com/. PAINT requires users to obtain an account with either of these resources in order to perform binding site analysis. The professional version of TRANSFAC contains a significantly higher number of TREs and TFs than the public version and hence significantly improves the analysis.
10. PAINT: The latest version of the Promoter Analysis and Interaction Network Toolset is available at http://www.dbi.tju.edu/PAINT [8]. The original version is described in [13] and a subsequent update is detailed in [7].
The example files noted above are based on the cDNA microarray time series data described in [4]. In the Web-based PAINT, existing analyses can be retrieved using a unique job key provided for each analysis. The PAINT results are presented in a hyperlinked report and are available for download as a single compressed file for offline perusal. Nomenclature for this chapter includes bold italic for on-screen text, bold for buttons, and Courier font for files, folders, and software code for execution in the R program. A short R sketch for confirming that the two user-provided input files load as expected is given at the end of this section.
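The following optional R sketch (not part of the Supplement scripts) simply checks that the two user-provided files load with the expected layout. It assumes both files include a header row of sample labels; adjust the read.delim arguments if your files differ.

arrayData <- read.delim("ArrayTimeSeriesData.txt", row.names = 1, check.names = FALSE)
design    <- read.delim("DesignData.txt", row.names = 1, check.names = FALSE,
                        stringsAsFactors = FALSE)

dim(arrayData)                                # genes x samples
design                                        # Treatment / TimePoint / AnimalPair per sample
stopifnot(ncol(arrayData) == ncol(design))    # one design column per array sample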
14.3 Methods

The methods outlined below describe transcriptional regulatory network analysis of gene expression time series data, based on differential gene expression from ANOVA followed by a false discovery rate (FDR) analysis, robust clustering, and promoter analysis of the grouped genes using PAINT. The term Project Directory below refers to the directory where the input data files (Items 1 and 2) as well as the scripts for the differential gene expression and clustering analysis (Items 4 and 5) are located. The workflow detailed below assumes that all of these files are located in the same directory. If such a setup is not preferred, researchers with advanced proficiency in R can modify the scripts and the code detailed below to specify the corresponding locations as appropriate.
14.3.1 Identification of differentially expressed genes

14.3.1.1 Mixed-effects ANOVA of the normalized time series data
1. Start the R program for statistical computing.
2. Modify the following line of code to specify the full path to the Project Directory. Paste it at the R prompt and hit Enter.
setwd("C:/Users/username/research/project/")
3. Paste the following line of code at the R prompt and hit Enter.
source("DiffExpAnalysis.R")
4. After the above script runs without errors, a new tab-delimited plain text file named DiffExpRawPvalues.txt containing the raw p-values and differential gene expression time series data is generated in the Project Directory. (A sketch of the kind of per-gene mixed-effects model this script fits is given after these steps.)
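For readers who want to see what a per-gene mixed-effects ANOVA of this design can look like, the hedged sketch below fits one gene with the nlme package, treating Treatment and TimePoint as fixed effects and AnimalPair as a random effect. DiffExpAnalysis.R in the Supplement remains the authoritative implementation; the sketch assumes the input files described in Section 14.2 include header rows.

library(nlme)

arrayData <- read.delim("ArrayTimeSeriesData.txt", row.names = 1, check.names = FALSE)
design    <- read.delim("DesignData.txt", row.names = 1, check.names = FALSE,
                        stringsAsFactors = FALSE)

one_gene <- data.frame(expr       = as.numeric(unlist(arrayData[1, ])),
                       Treatment  = factor(unlist(design["Treatment", ])),
                       TimePoint  = factor(unlist(design["TimePoint", ])),
                       AnimalPair = factor(unlist(design["AnimalPair", ])))

fit <- lme(expr ~ Treatment * TimePoint, random = ~ 1 | AnimalPair, data = one_gene)
anova(fit)   # F-tests for Treatment, TimePoint, and their interaction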
14.3.1.2 Local false discovery rate analysis
5. Paste the following line of code at the R prompt and hit Enter.
source("localFDRanalysis.R")
6. Paste the following line of code at the R prompt and hit Enter.
pi.not
7. Proceed to the next step if the pi.not value is greater than 0.7. Otherwise, see Note 14.5.1.
8. The above script generates a plot of the local and overall FDR (e.g., Figure 14.1). The x-axis represents genes in the order of the raw p-values, from the most to the least significant. The y-axis represents the FDR values for the two curves shown (local FDR within a window of 50 successive genes, and overall FDR).
9. Consider a local FDR threshold of 0.3. Paste the following line of code at the R prompt and hit Enter in order to add a horizontal line to the plot at the threshold value of 0.3.
abline(h=0.3)
10. If the local FDR values for the next 50 to 100 genes lie above the horizontal line for the chosen local FDR threshold, skip Step 11 and proceed to Step 12.
11. If the local FDR values for the next 50 to 100 genes lie below the threshold line, increase or decrease the threshold value by 0.01 to 0.05 and repeat Step 9 with the
new threshold value. Refer to Note 14.5.2 on how to choose an appropriate threshold.
12. Use the local FDR threshold identified as providing a reasonable opportunity cost to determine the differentially expressed genes in the time series experimental data. Estimate the range of the approximate number of genes (value on the x-axis) corresponding to this threshold (e.g., 300 to 325 in Figure 14.1).
13. In the following lines of code, replace the number 0.3 with the threshold identified in Step 12, paste the code at the R prompt, and hit Enter.
# Replace 0.3 with the threshold value from Step 12.
threshold <- 0.3
approxGenesStart <- 300
approxGenesEnd <- 325
source("IdentifyDiffExpGenes.R")
14. After the above script runs without errors, a tab-delimited plain text file named DiffExpGeneData.txt containing the differential gene expression time series data and the raw and adjusted p-values will be generated based on the chosen local FDR threshold.
14.3.2 Robust clustering of differential gene expression time series data using the computational negative control approach
15. Paste the following line of code at the R prompt and hit Enter.
source("robustClustering.R")
Figure 14.1 Relationship between overall FDR, local FDR, and the number of predicted differentially expressed genes (x-axis: number of genes, ordered by raw p-value; y-axis: false discovery rate; curves: local FDR and overall FDR). We chose a 30% local FDR as a threshold, resulting in 309 differentially expressed genes (corresponding to a 21.4% overall FDR). Additional genes selected would be at a higher "opportunity cost" as the local FDR is higher than 30% for the next 100 genes. (Reproduced from [4].)
16. The above script generates a plot of the clustering results using Partitioning Around Medoids (also known as K-medoids) with Pearson correlation as the dissimilarity metric (e.g., Figure 14.2), for between 2 and 10 clusters. Identify the number of clusters that is significantly distinct from the randomized data (e.g., 6 clusters in Figure 14.2). Refer to Note 14.4.1 for further information on how to choose an appropriate number of clusters. (A conceptual sketch of this computational negative control comparison is given after these steps.)
17. In the following lines of code, replace the number 6 with the number of clusters identified in Step 16, paste the code at the R prompt, and hit Enter.
# Replace 6 with the number of clusters from Step 16.
numClusters <- 6
save.clustering.results(numClusters)
18. After the above code runs without errors, two tab-delimited plain text files named GeneList.txt and ClusterInfo.txt will be generated. These form the input files to the transcriptional regulatory network analysis using PAINT, as detailed below.
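The conceptual core of the computational negative control comparison implemented in robustClustering.R can be sketched as follows; the data matrix is simulated, the cluster package (PAM and silhouette widths) is used, and the plotted quantity mirrors Figure 14.2(b). This is an illustration only, not the Supplement script.

library(cluster)

set.seed(1)
diffExpr <- matrix(rnorm(300 * 4), nrow = 300)             # hypothetical genes x timepoints matrix
d_real <- as.dist(1 - cor(t(diffExpr)))                    # Pearson-correlation dissimilarity

permuted <- t(apply(diffExpr, 1, sample))                  # permute each gene's time course
d_perm  <- as.dist(1 - cor(t(permuted)))

ks <- 2:10
sil_real <- sapply(ks, function(k) pam(d_real, k)$silinfo$avg.width)
sil_perm <- sapply(ks, function(k) pam(d_perm, k)$silinfo$avg.width)

plot(ks, (sil_real - sil_perm) * ks, type = "b",
     xlab = "Number of clusters", ylab = "Difference in SC x number of clusters")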
14.3.3 Transcriptional regulatory network analysis using PAINT

14.3.3.1 Generation of PAINT-compatible input files
The following two steps are necessary only if the above Steps 1 to 18 in the differential gene expression analysis are skipped to directly proceed to the promoter analysis using PAINT.
Figure 14.2 Assessment of the gene expression clustering results using the computational negative control (CNC) approach [6]. (a) For each specified number of clusters (x-axis), the cluster quality metric, the silhouette coefficient (SC, y-axis), is evaluated for the actual expression data and compared to that from the randomly permuted data. (b) The difference in SC from (a) multiplied by the number of clusters shows a marked decrease at more than six clusters, indicating that the SC is no longer distinct from that of the randomized data. (Reproduced from [4].)
19. The starting point for PAINT is a file containing the list of differentially expressed genes. The file should be a single-column plain text file, with each row listing a gene identifier. All the identifiers in the file need to be of the same type, for example, GenBank Accession Number. The file GeneList.txt generated in Step 18 conforms to this file format. If Steps 1 to 18 were skipped, copy the example file exGeneList.txt to a new plain text file with the name GeneList.txt.
20. The Cluster Membership Data File is a tab-delimited plain text file with two columns. The first column must contain one gene identifier per row, and the second column must contain a corresponding single-word alphanumeric cluster label. The file ClusterInfo.txt generated in Step 18 conforms to this file format. If Steps 1 to 18 were skipped, copy the example file exClusterInfo.txt to a new plain text file with the name ClusterInfo.txt.
14.3.3.2 Identification of over-represented transcription factor binding sites In the steps below, the GeneList and ClusterInfo from Step 18 (or Steps 19 and 20, as appropriate) will be used to retrieve promoter sequences, analyze them using TRANSFAC Public (or Professional), build a Feasnet, and analyze this matrix as compared to a reference Feasnet to derive hypotheses on over-represented TREs in each of the identified coexpression time series clusters. 21. Use a Web browser to open the Web page http://www.dbi.tju.edu/PAINT. 22. Click on the button Start New Analysis on the PAINT main page. 23. Select appropriate Organism Name, 2000 for the Desired upstream length, Clone ID for Gene Identifier type, Gene Identifiers List for Upload text file of type. 24. Click the Browse button to locate and select the file GeneList.txt on the computer from the Project Directory. 25. Select the check box next to TFRetriever. 26. Select MATCH (TRANSFAC Public) for TRE finding program. Refer to Note 14.5.5 on the issues involved in the choice of the TRE finding programs. 27. Enter the username and password for logging into the Web site for TRANSFAC Public http://www.gene-regulation.com (or Transfac Pro at http://www.biobaseinternational.com). 28. Select Minimize False Positives for the MATCH filter option. 29. Select 1.00 for the Core similarity threshold. Check the box for Find TREs on complementary strand? 30. Click Execute Feasnet Builder at the end of the Feasnet Builder form. A new page will be loaded indicating the status of the analysis. Note the job key at the top of the status page for access at a later time, as the analysis might take considerable time depending on the size of the gene list. 31. After the FeasnetBuilder step is completed without errors, the highlighted status text at the top of the page will be replaced by a link to the ZIP file containing all the results including the status page. 32. After successful completion of FeasnetBuilder, the PAINT status page indicates the number of promoters that were retrieved (refer to Note 14.5.6 on how redundancies in the gene list are handled), the promoter sequences of specified length in the 294
FASTA format, and also a link to a list of genes for which the promoter sequences were not found in the PAINT database. Next, the status page indicates whether the gene list was split into multiple parts for processing using MATCH. Links to the actual HTML output from MATCH are provided next to each of the split sequence files. Lastly, the overall Feasnet corresponding to the input Gene List is given next to the text Feasnet file.
33. After successful completion of the FeasnetBuilder step, the PAINT status page shows a link to the follow-up Feasnet Analysis and Visualization. Click on the link indicated by the text Click here to continue with Feasnet Analysis and Visualization.
34. On the Feasnet Analysis page, the parameters corresponding to the Feasnet, Organism, Upstream sequence length, Gene Identifier type, TRE finding program, Core similarity threshold, and TREs on the complementary strand included? will be automatically set based on the data entered earlier in the FeasnetBuilder step.
35. Under Clustering Options, check both boxes, corresponding to TREs based on the promoters they are present on and genes based on the TREs present on their promoters, to hierarchically cluster the PAINT analysis results.
36. Under Select Reference Feasnet(s) for significance analysis of TREs, check the box corresponding to the microarray used in the project. If this information is unknown, select All promoter sequences in PAINT database. If your microarray is not listed, refer to Note 14.5.7 for information on how to choose or generate an appropriate reference Feasnet.
37. Check the box next to Generate filtered Gene-TRE networks based on TRE over-representation. Under this text, select 0.30 for the parameter Only those TREs of FDR-based adjusted p-value <=, and 0.05 for the parameter Only those TREs of raw p-value <=. Refer to Note 14.5.8 for information on these two thresholds employed in the analysis.
38. Click the Browse button for the parameter Gene cluster information file to locate and select the ClusterInfo.txt file in the Project Directory. Click the Execute Feasnet Analyzer/Viewer button at the end of the form. The job status page will be loaded indicating the status of the analysis. The job key will be the same as earlier, as this is merely a continuation of the PAINT analysis.
39. Once the Analysis and Visualization step is complete, the highlighted text at the top of the status page will be replaced by a link to the ZIP file containing all the results, including the status page.
40. The results from the TRE enrichment analysis are under the headings Significance of TRE occurrence (in clusters compared to a reference) and Significance of TRE occurrence (in individual clusters compared to the list). Links to the specific reference used, p-values for over-representation, and the Feasnet images are provided. Under the subheading Hypothesis Gene-TRE network, links are provided to the filtered Feasnet data and images based on the p-value thresholds specified in Step 37. Links to the Network image and Graphviz source file are also given. Refer to Note 14.4.2 for information on how to interpret the PAINT results.
14.4 Data Acquisition, Anticipated Results, and Interpretation
14.4.1 Selection of number of clusters
The Silhouette Coefficient (SC) plotted in Step 16 (example in Figure 14.2) indicates the "quality" of the partitioning into multiple clusters [4–6]. These values range from 0 to 1, with a value of zero corresponding to no separation between clusters and a value of 1 indicating perfect partitioning. The randomized partitioning is obtained by permuting the original data set and applying the same clustering algorithm that was used on the original data. The difference in SC between the actual and randomized data sets is often highest at a low number of clusters; however, it may still be informative to separate the data further to derive more specific temporal gene expression patterns. The plot generated in Step 16 [e.g., Figure 14.2(b)] therefore considers the product of the difference in SC and the number of clusters, allowing this trade-off to be examined over a range of cluster numbers and an optimal value to be identified. The example results in Figure 14.2(b), based on [4], show a marked decrease beyond six clusters, indicating that the SC of the actual data is no longer distinct from that of the randomized data.
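To make the computational negative control comparison concrete, the following R sketch computes the average silhouette width of PAM clusters for both the actual and a randomized data set over a range of cluster numbers and plots the weighted difference. It is an illustration only, not the script distributed with this chapter: the matrix expr (genes by time points) is assumed to exist, and permuting each gene's time course is just one simple randomization scheme (see [6] for the computational negative control procedure itself).

## Illustrative sketch of the cluster-number selection idea (not the chapter's script).
## 'expr' is assumed to be a genes x time-points matrix of differential expression values.
library(cluster)

avg_sil <- function(mat, k) {
  d   <- as.dist(1 - cor(t(mat)))   # Pearson correlation distance between genes
  fit <- pam(d, k = k)              # Partitioning Around Medoids
  fit$silinfo$avg.width             # average Silhouette Coefficient
}

set.seed(1)
ks       <- 2:10
sc_real  <- sapply(ks, function(k) avg_sil(expr, k))
expr_rnd <- t(apply(expr, 1, sample))   # one simple randomization: permute each time course
sc_rand  <- sapply(ks, function(k) avg_sil(expr_rnd, k))

delta <- sc_real - sc_rand
plot(ks, delta * ks, type = "b", xlab = "Number of clusters",
     ylab = "(SC difference) x (number of clusters)")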
14.4.2 PAINT result interpretation for gene coexpression clusters
The hypothesized Gene-TRE network from the enrichment analysis (Step 40) indicates which TREs are significantly over-represented in the promoters corresponding to each gene expression cluster relative to the promoters in the reference (Figure 14.3). Because these TREs occur significantly more often than expected by chance, they are strong candidates for further experimental validation. When PAINT is used to analyze the gene expression time series clusters, the results indicate whether any of the binding sites found on the promoters of differentially expressed genes are diagnostic of a specific coexpression cluster. The desired result is therefore a TRE that is statistically enriched in one or a few of the clusters, but not in all of them. Cluster-enriched TREs appear on the Feasnet image as a vertical collection of red boxes aligned with the boundaries of that cluster's gene list. The enrichment can be visualized more easily by graphing the significance score, -log10(p-value), for each TRE of interest in each gene expression cluster. The enrichment p-values of TREs in each cluster can be obtained from the Significance of TRE occurrence (in clusters compared to a reference) section of the PAINT output, following the Over-representation link, as either raw or overall FDR-adjusted p-values. The biological inference is that these specific TREs, and their cognate TFs, are specifically involved in the regulation of the corresponding coexpression clusters of genes.
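For example, a bar chart of these significance scores for a single TRE across clusters takes only a few lines of R. The tabular layout assumed below (a tab-delimited file with columns Cluster, TRE, and p_adj, compiled from the PAINT results ZIP) and the matrix name V$CREB_01 are illustrative placeholders, not documented PAINT formats; adapt them to the actual output files.

## Hedged sketch: plot -log10(p) for one TRE across the coexpression clusters.
## The input file layout is an assumption about how the PAINT over-representation
## output might be tabulated.
tre_pvals <- read.delim("tre_overrepresentation.txt", stringsAsFactors = FALSE)

tre_of_interest <- "V$CREB_01"   # hypothetical TRE/matrix name
sub <- subset(tre_pvals, TRE == tre_of_interest)

barplot(-log10(sub$p_adj), names.arg = sub$Cluster,
        xlab = "Coexpression cluster",
        ylab = expression(-log[10](italic(p))),
        main = tre_of_interest)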
14.5 Discussion and Commentary
The methodology presented above is motivated by the need to make sense of the coordinated changes in gene expression observed as a function of time in a typical transcriptional profiling experiment. The present approach uses multifactorial ANOVA followed by a robust clustering scheme to uncover nonrandom temporal patterns among the differentially expressed gene profiles. These clusters of coregulated genes are then analyzed in PAINT for statistically enriched shared binding sites on their promoters, yielding hypotheses about the transcription factors implicated in driving the observed coregulation. Several aspects of the present approach require careful consideration of the choices available at each stage, as discussed below.
Figure 14.3 Analysis of gene expression time series data from [4]. (a) Cluster analysis of the differential expression temporal profiles. The data were clustered using Partitioning Around Medoids with Pearson correlation as the distance metric and k = 6 (the optimal number obtained from the results shown in Figure 14.1). Each row corresponds to a gene and each column to one of the four time points (1, 2, 4, and 6 hours post partial hepatectomy). Lines demarcate the cluster boundaries. (b) The six clusters from (a) were analyzed for over-represented TF binding sites in the corresponding promoters using PAINT; the resulting regulatory interaction matrix is shown. Rows represent promoters and columns represent TFs. Each binding site for a TF on a promoter is marked red or gray, depending on whether the frequency of that binding site in that cluster is statistically significantly over-represented or not, respectively. Binding sites for several TFs are enriched in distinct expression clusters. Lines indicate the mapping between the gene groups in the expression map and the corresponding promoter sets in the regulatory interaction matrix. (Reproduced from [4].)
14.5.1 Estimation of nondifferentially expressed genes (pi.not value)
The pi.not value represents the estimated fraction of nondifferentially expressed genes in the data [19]. The R scripts presented here include an estimate of pi.not as detailed in [3]. A pi.not value of less than 0.7 may indicate problems with normalization of the data, as the assumption underlying microarray normalization (that most genes on any given array do not change between treatment and control) may not hold. If this occurs, return to the appropriate normalization procedure used to generate the prerequisite data and redo the analysis from Step 1.
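As an illustration of the quantity being checked, a simple pi.not estimate can be computed from the per-gene ANOVA p-values. The estimator below (the fraction of p-values above a tuning value lambda, rescaled by 1 - lambda) is a common generic choice and is not necessarily identical to the estimator implemented in the chapter's scripts [3]; pvals is assumed to hold one p-value per gene.

## Illustrative pi.not (pi0) estimate from ANOVA p-values; a generic stand-in,
## not necessarily the estimator of [3]. 'pvals' is assumed to exist.
estimate_pi0 <- function(pvals, lambda = 0.5) {
  # Under the null, p-values are uniform, so the density above lambda
  # reflects mostly nondifferentially expressed genes.
  mean(pvals > lambda) / (1 - lambda)
}

pi.not <- estimate_pi0(pvals)
if (pi.not < 0.7) {
  warning("pi.not < 0.7: check the normalization of the raw expression data")
}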
14.5.2 Threshold for local false discovery rate analysis
The local FDR estimator employed here produces a nonmonotonic result (Figure 14.2). A set of heuristics for choosing a p-value threshold given a local FDR threshold is given in [3]; a high-level summary is presented here. If the local FDR estimate crosses the threshold only once, choose the corresponding number of genes from the x-axis. Otherwise, start at the lowest gene number at which the threshold is crossed; if the local FDR estimate for the next 50 to 100 genes remains above the threshold for most of those genes, choose the corresponding number of genes from the x-axis. If that is not the case, consider the next intersection point and repeat the above procedure until a satisfactory cutoff is reached at which most of the selected genes have a local FDR below the threshold. The local FDR metric should be used in conjunction with the overall FDR metric shown in the plot (Steps 8 to 12). The local FDR gives an upper bound on useful overall FDR threshold values; if the local FDR threshold identified above corresponds to an unacceptably high overall FDR, a more stringent threshold should be considered, yielding a smaller set of differentially expressed genes albeit with a smaller estimated number of false positives.
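The interplay between the two metrics can be visualized with a rough R sketch such as the one below, which plots a simple windowed local FDR estimate alongside the Benjamini-Hochberg overall FDR for the ranked p-values. The windowing scheme shown is a simplification for illustration, not the windowed estimator of [3], and pvals and pi.not are assumed to be available from the earlier steps.

## Simplified sketch relating local and overall FDR for ranked ANOVA p-values.
## 'pvals' (one p-value per gene) and 'pi.not' are assumed from earlier steps.
m     <- length(pvals)
p_srt <- sort(pvals)

# Overall FDR (Benjamini-Hochberg adjusted p-values)
overall_fdr <- p.adjust(p_srt, method = "BH")

# Windowed local FDR: expected null p-values in the window / window size
win <- 100
local_fdr <- sapply(seq_len(m), function(i) {
  lo <- max(1, i - win %/% 2); hi <- min(m, i + win %/% 2)
  expected_null <- pi.not * m * (p_srt[hi] - p_srt[lo])
  min(1, expected_null / (hi - lo + 1))
})

plot(seq_len(m), local_fdr, type = "l", xlab = "Number of top-ranked genes",
     ylab = "Estimated FDR")
lines(seq_len(m), overall_fdr, lty = 2)
legend("bottomright", c("local FDR (windowed)", "overall FDR (BH)"), lty = 1:2)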
14.5.3 Format of gene identifiers
Starting from the microarray gene expression data, Clone IDs or GenBank accession numbers are the preferred Gene Identifiers. The SOURCE resource noted in the materials section (Section 14.2) may be used to convert the data from alternative identifiers, and almost all commercial microarray platforms provide annotation software for this conversion. PAINT uses the UniGene database to map the Clone IDs to the corresponding Entrez Gene IDs and then uses the Ensembl cross-reference annotation to obtain the corresponding unique set of Ensembl Gene IDs.
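When the expression data are keyed to platform-specific probe identifiers, a simple table merge against the platform annotation is usually sufficient for the conversion. In the sketch below, the file names (platform_annotation.txt, DifferentialGenes.txt) and the column names (ProbeID, CloneID) are assumptions standing in for whatever the SOURCE export or the vendor annotation actually provides.

## Hedged sketch: convert probe identifiers to Clone IDs via an annotation table.
anno  <- read.delim("platform_annotation.txt", stringsAsFactors = FALSE)
genes <- read.delim("DifferentialGenes.txt",  stringsAsFactors = FALSE)

mapped    <- merge(genes, anno[, c("ProbeID", "CloneID")], by = "ProbeID")
clone_ids <- unique(mapped$CloneID)   # these feed into GeneList.txt / ClusterInfo.txt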
14.5.4 Cluster size issues
Based on the results from multiple studies, it is recommended that the Gene List contain at least 30 genes in each cluster. Otherwise, the results for those clusters are difficult to interpret in a robust fashion. Experience indicates that small inaccuracies (< 10%) in the clustering algorithms do not significantly influence the results.
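A quick sanity check of the cluster sizes can be performed directly on the ClusterInfo.txt file, for example as below; the two-column, headerless, tab-delimited layout is assumed here to match the format prepared for PAINT.

## Hedged sketch: check that each coexpression cluster has at least ~30 genes.
cluster_info <- read.delim("ClusterInfo.txt", header = FALSE,
                           col.names = c("GeneID", "Cluster"))
sizes <- table(cluster_info$Cluster)
print(sizes)
if (any(sizes < 30)) {
  warning("Some clusters have fewer than 30 genes; consider fewer, larger clusters")
}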
14.5.5 TRANSFAC version issues
To perform transcriptional regulatory network analysis using PAINT, users need licensed access to either the public or the professional version of the TRANSFAC database, neither of which is affiliated with the PAINT development team. The public version is available online at http://www.gene-regulation.com following a free registration process; access to the professional version is available through http://www.biobase-international.com. The login and password required in the analysis step are used only to interact with the appropriate Web servers; this keeps license management with the database providers while giving PAINT users a choice of versions. The professional version of TRANSFAC contains a significantly larger number of TREs and TFs than the public version, and hence its use significantly improves the analysis results.
14.5.6 Annotation redundancy in the gene list and multiple promoters
Often, several gene identifiers in the input Gene List map to the same Ensembl Gene. PAINT builds the full cross-referenced list of Ensembl Genes corresponding to the input Gene List and uses the unique Ensembl Gene ID list in subsequent analysis. In addition, because of the nature of the cross-references between the Entrez Gene and Ensembl databases, a few of the gene identifiers in the input Gene List may individually map to more than one Ensembl Gene. In these cases, PAINT includes all the mapped Ensembl Genes in the analysis. In the rare case in which an identifier maps to five or more Ensembl Genes, it is recommended that the identifier be excluded from the analysis by removing it from the files GeneList.txt and ClusterInfo.txt.
14.5.7 Reference Feasnet selection/generation
The selection of an appropriate reference set is key to deriving meaningful hypotheses in the PAINT analysis. Comparing the experimental Feasnet to the entire genome gives erroneous results if the input gene list is derived from a microarray that does not span the entire genome or is specific to a particular tissue or disease. PAINT contains prebuilt reference files for All Ensembl Promoters and for Affymetrix arrays; references for other commercial arrays are in the process of being added. If the reference for the microarray of interest is not listed in the Feasnet Analysis and Visualization form (Step 36), the microarray Gene List must first be processed in the Feasnet Builder to obtain a microarray Feasnet (Steps 21 through 32, with the appropriate plain text file named microarrayGeneList.txt prepared as specified in Step 19 and used in Step 24).
14.5.8 Multiple testing correction in PAINT
The raw p-values in each over-representation analysis are corrected for multiple testing using an overall FDR estimate. As a first option, the FDR-adjusted p-values should be used to identify the significantly over-represented TREs. In some cases, however, this correction is either inappropriate or overly conservative (due to correlations among TREs) and may yield few or no results. In such cases, the raw p-value based results in PAINT can be used in a discovery mode. While this alternative may produce a set of hypotheses with a high estimated proportion of false positives, in practice it amounts to prioritizing the validation experiments based on individually enriched TREs. The primary role of the presented computational workflow is to generate a reasonable set of candidates for experimental validation. Hence, when multiple testing correction yields few or no results, the alternative raw p-value based approach is the next best option available.

Troubleshooting Table

Problem: pi.not value is less than 0.7
Explanation: Indicates too many differentially expressed genes. May be caused by errors in normalization of the raw gene expression data.
Potential Solutions: Revisit the normalization procedure to check for bias in the data.

Problem: Too few genes passing reasonable FDR thresholds
Explanation: High variability in the expression data.
Potential Solutions: Power analysis to estimate the required number of replicates.

Problem: No meaningful clusters in the data
Explanation: Time series may not be well defined.
Potential Solutions: Consider adding time points to the experimental design.

Problem: Too few TFs passing reasonable FDR thresholds in PAINT
Explanation: Cluster sizes may not be reasonable for statistically significant results, or the promoter length considered is short.
Potential Solutions: Consider larger-sized clusters, or increase the promoter length analyzed in PAINT.
14.6 Application Notes
The integrated approach presented here, starting from microarray gene expression time series data, has been successfully employed to study blood cell development [5] and liver regeneration [4]. In [4], gene expression data were obtained from regenerating rat liver at 1, 2, 4, and 6 hours following partial hepatectomy. The excised liver tissue from each rat at 0 hours served as the within-animal control at each time point. Microarray data were obtained using cDNA arrays with ~9,000 clones spotted on glass slides. Mixed-effects ANOVA and local FDR analysis, as detailed in the methods section (Section 14.3), yielded a total of 309 genes at a local FDR threshold of 0.3, corresponding to an approximately 20% overall FDR (Figure 14.1). The robust clustering using the CNC approach detailed above yielded six clusters that are well separated from randomized data (Figure 14.2). These clusters represent early responsive genes as well as genes that are differentially regulated at later time points. Approximately half of the differential regulation consisted of up-regulation of a number of genes at the 6-hour time point [clusters 5 and 6 in Figure 14.3(a)]. The PAINT analysis identified 22 TFs as enriched (overall FDR < 30%) in individual clusters with distinct temporal patterns [Figure 14.3(b)]. Some of these TFs (e.g., NF-κB, HNF-1, CREB, C/EBP, GATA, and ATF) are known from previous studies to be involved in the early phase of liver regeneration, whereas others (e.g., AP2a, LEF1, PAX6) are known to contribute to the regulation of cellular processes related to proliferation and differentiation (refer to [4] for details). Several of these predicted TFs were experimentally validated for differential DNA binding activity dynamics [4]. These results demonstrate that relevant functional information on the transcriptional regulatory processes active in early liver regeneration can be obtained from PAINT analysis of clustered microarray gene expression time series data.
14.7 Summary Points
The methodology detailed in this chapter describes an integrated workflow for microarray gene expression data analysis using:
1. A mixed-effects ANOVA approach to quantify differential expression across multiple time points.
2. A local false discovery rate based approach to choose a suitable threshold for identifying differentially expressed genes.
3. A robust clustering approach termed computational negative control for determining distinct dynamical expression patterns that are well separated from randomized partitioning.
4. A sensitive bioinformatics approach using PAINT software for developing hypotheses on transcriptional regulators potentially shaping the observed gene expression dynamics.
Acknowledgments
This work was supported by National Institutes of Health grants AA016919, HL088283, and HL087361.
References
[1] Pavlidis, P., "Using ANOVA for gene selection from microarray studies of the nervous system," Methods, Vol. 31, 2003, pp. 282–289.
[2] Scholtens, D., A. Miron, F.M. Merchant, A. Miller, P.L. Miron, J.D. Iglehart, and R. Gentleman, "Analyzing factorial designed microarray experiments," J. Multivariate Anal., Vol. 90, 2004, pp. 19–43.
[3] Khan, R.L., R. Vadigepalli, G. Gao, and J.S. Schwaber, "A windowed local fdr estimator providing higher resolution and robust thresholds," arXiv:q-bio/0702044v1, 2007.
[4] Juskeviciute, E., R. Vadigepalli, and J. Hoek, "Temporal and functional profile of the transcriptional regulatory network in the early regenerative response to partial hepatectomy in the rat," BMC Genomics, Vol. 9, No. 1, 2008, p. 527.
[5] Keller, M.A., S. Addya, R. Vadigepalli, B. Banini, K. Delgrosso, H. Huang, and S. Surrey, "Transcriptional regulatory network analysis of developing human erythroid progenitors reveals patterns of co-regulation and potential transcriptional regulators," Physiol. Genomics, Vol. 28, No. 1, 2006, pp. 114–128.
[6] Pearson, R.K., T. Zylkin, J.S. Schwaber, and G.E. Gonye, "Analytical evaluation of clustering results using computational negative controls," Proc. 4th Soc. Indust. Appl. Math. Int. Conf. Data Mining, 2004, pp. 188–199.
[7] Gonye, G.E., P. Chakravarthula, J.S. Schwaber, and R. Vadigepalli, "From promoter analysis to transcriptional regulatory network prediction using PAINT," in Methods in Molecular Biology: Gene Function Analysis, M. Ochs, (ed.), Totowa, NJ: Humana Press, 2007, pp. 49–68.
[8] PAINT: Promoter Analysis and Interaction Network Toolset, http://www.dbi.tju.edu/PAINT.
[9] Pratt, C.H., R. Vadigepalli, P. Chakravarthula, G.E. Gonye, N.J. Philp, and G.B. Grunwald, "Transcriptional regulatory network analysis during epithelial-mesenchymal transformation of retinal pigment epithelium," Mol. Vis., Vol. 14, 2008, pp. 1414–1428.
[10] Saban, M.R., H.L. Hellmich, M. Turner, N.B. Nguyen, R. Vadigepalli, D.W. Dyer, R.E. Hurst, M. Centola, and R. Saban, "The inflammatory and normal transcriptome of mouse bladder detrusor and mucosa," BMC Physiol., Vol. 6, No. 1, 2006, p. 1.
[11] Stevens, S.L., B. Gopalan, M. Minami, C.E.H. Erdmann, C.A. Harrington, W.R. Cannon, R.P. Simon, and M.P. Stenzel-Poore, "LPS preconditioning provides neuroprotection via reprogramming of cellular responses to stroke," Soc. Neuroscience Abstr., 2004, p. 457.14.
[12] Vadigepalli, R., H. Hao, G.M. Miller, H. Liu, and J.S. Schwaber, "EGFR-induced circadian-time dependent gene regulation in suprachiasmatic nucleus," Neuroreport, Vol. 17, No. 13, 2006, pp. 1437–1441.
[13] Vadigepalli, R., P. Chakravarthula, D.E. Zak, J.S. Schwaber, and G.E. Gonye, "PAINT: a promoter analysis and interaction network generation tool for genetic regulatory network identification," Omics, Vol. 7, No. 3, 2003, pp. 235–252.
[14] Churchill, G.A., "Using ANOVA to analyze microarray data," Biotechniques, Vol. 37, No. 2, 2004, pp. 173–177.
[15] Kerr, M.K., and G.A. Churchill, "Statistical design and the analysis of gene expression microarray data," Genet. Res., Vol. 77, No. 2, 2001, pp. 123–128.
[16] Addya, S., M.A. Keller, K. Delgrosso, C.M. Ponte, R. Vadigepalli, G.E. Gonye, and S. Surrey, "Erythroid-induced commitment of K562 cells results in clusters of differentially expressed genes enriched for specific transcription regulatory elements," Physiol. Genomics, Vol. 19, No. 1, 2004, pp. 117–130.
[17] Dozmorov, M.G., K.D. Kyker, R. Saban, N. Knowlton, I. Dozmorov, M.B. Centola, and R.E. Hurst, "Analysis of the interaction of extracellular matrix and phenotype of bladder cancer cells," BMC Cancer, Vol. 6, 2006, p. 12.
[18] Zak, D.E., H. Hao, R. Vadigepalli, G.M. Miller, B.O. Ogunnaike, and J.S. Schwaber, "Systems analysis of circadian time dependent neuronal epidermal growth factor receptor signaling," Genome Biol., Vol. 7, No. 6, 2006, p. R48.
[19] Broberg, P., "A comparative review of estimates of the proportion unchanged genes and the false discovery rate," BMC Bioinformatics, Vol. 6, 2005, p. 199.
About the Editors
Arul Jayaraman is an assistant professor in chemical engineering and biomedical engineering at Texas A&M University. He received a Ph.D. in chemical engineering from the University of California at Irvine in 1998 and did his postdoctoral training at the Center for Engineering in Medicine at Massachusetts General Hospital from 1998 to 2000. Dr. Jayaraman's research interests are in systems biology of inflammation and interkingdom signaling in host-pathogen interactions.
Juergen Hahn is an associate professor in chemical engineering at Texas A&M University. He received a Ph.D. in chemical engineering from the University of Texas, Austin, and did his postdoctoral training at RWTH Aachen. Dr. Hahn's research interests include systems biology of signal transduction networks and process modeling and analysis.
List of Contributors Frank Allgöwer Institute for Systems Theory and Automatic Control Universität Stuttgart Pfaffenwaldring 9 70550 Stuttgart, Germany e-mail: [email protected]
Rolf Findeisen Institute for Automation Engineering Otto-von-Guericke University Universitätsplatz 2 D-39106 Magdeburg, Germany e-mail: [email protected]
Anand R. Asthagiri Division of Chemistry and Chemical Engineering California Institute of Technology Mail Code 210-41 Pasadena, CA 91125 USA e-mail: [email protected]
Juergen Hahn Department of Chemical Engineering Texas A&M University 3122 TAMU College Station, TX 77843 USA e-mail: [email protected]
Heike E. Assmus Systems Biology and Bioinformatics Department of Computer Science University of Rostock Albert Einstein Str. 21 18051 Rostock, Germany e-mail: [email protected]
Nan Hao Department of Pharmacology University of North Carolina Chapel Hill, NC 27599 USA
Marc R. Birtwistle University of Delaware Department of Chemical Engineering Newark, DE 19716 USA
Jason M. Haugh Department of Chemical & Biomolecular Engineering North Carolina State University Box 7905, Engineering Building I 911 Partners Way Raleigh, NC 27695 USA e-mail: [email protected]
Sonja Boldt Systems Biology and Bioinformatics Department of Computer Science University of Rostock Albert Einstein Str. 21 18051 Rostock, Germany
Michael A. Henson Department of Chemical Engineering University of Massachusetts 686 North Pleasant Street Amherst, MA 01003 USA e-mail: [email protected]
Gregery T. Buzzard Department of Mathematics Purdue University West Lafayette, IN 47907 USA
Jared L. Hjersted Department of Chemical Engineering University of Massachusetts Amherst, MA 01003 USA
Christina Chan Department of Chemical Engineering & Materials Science 1257 Engineering Building Michigan State University East Lansing, MI 48824 USA e-mail: [email protected]
Zuyi Huang Department of Chemical Engineering Texas A&M University 3122 TAMU College Station, TX 77843 USA
Murat Cirit Department of Chemical & Biomolecular Engineering North Carolina State University Box 7905, Engineering Building I 911 Partners Way Raleigh, NC 27695 USA Ertugrul Dalkic Department of Biochemistry and Molecular Biology Michigan State University East Lansing, MI 48824 USA Maia M. Donahue Weldon School of Biomedical Engineering Purdue University, West Lafayette, IN 47907 USA Timothy C. Elston Department of Pharmacology University of North Carolina Chapel Hill, NC 27599 USA
Arul Jayaraman Department of Chemical Engineering and Biomedical Engineering Texas A&M University 3122 TAMU College Station, TX 77843 USA e-mail: [email protected] Katy C. Kao Department of Chemical Engineering Texas A&M University 3122 TAMU College Station, TX 77843 USA e-mail: [email protected] Boris N. Kholodenko Department of Pathology, Anatomy, and Cell Biology Thomas Jefferson University Philadelphia, PA 19107 USA Jin-Hong Kim Division of Chemistry and Chemical Engineering California Institute of Technology Mail Code 210-41 Pasadena, CA 91125 USA
Kyongbum Lee Department of Chemical & Biological Engineering Tufts University Medford, MA 02155 USA e-mail: [email protected] James C. Liao Department of Chemical and Biomolecular Engineering University of California at Los Angeles Los Angeles, CA 90095 USA Colby Moya Department of Chemical Engineering Texas A&M University 3122 TAMU College Station, TX 77843 USA Ryan Nolan Wyeth BioPharma Andover, MA 01810 USA Babatunde A. Ogunnaike Department of Chemical Engineering University of Delaware Newark, DE 19716 USA e-mail: [email protected] Ann E. Rundell Weldon School of Biomedical Engineering Purdue University West Lafayette, IN 47907 USA e-mail: [email protected] Ranjan Srivastava Department of Chemical Engineering University of Connecticut 191 Auditorium Road, Unit 3222 Storrs, CT 06269 USA e-mail: [email protected] Stefan Streif Max Planck Institute for Dynamics of Complex Technical Systems Sandtorstr. 1 39106 Magdeburg, Germany Linh M. Tran Department of Chemical and Biomolecular Engineering University of California at Los Angeles Los Angeles, CA 90095 USA
Rajanikanth Vadigepalli Daniel Baugh Institute for Functional Genomics Department of Pathology Thomas Jefferson University 1020 Locust Street Philadelphia, PA 19107 USA [email protected] Steffen Waldherr Institute for Systems Theory and Automatic Control Universität Stuttgart Pfaffenwaldring 9 70550 Stuttgart, Germany [email protected] Chun-Chao Wang Department of Chemical & Biomolecular Engineering North Carolina State University Box 7905, Engineering Building I 911 Partners Way Raleigh, NC 27695 USA Xuewei Wang Department of Chemical Engineering & Materials Science Michigan State University East Lansing, MI 48824 USA Olaf Wolkenhauer Systems Biology and Bioinformatics Department of Computer Science University of Rostock Albert Einstein Str. 21 18051 Rostock, Germany e-mail: [email protected] Ming Wu Department of Computer Science and Engineering Michigan State University East Lansing, MI 48824 USA Xuerui Yang Department of Chemical Engineering & Materials Science Michigan State University East Lansing, MI 48824 USA Necmettin Yildirim Department of Pharmacology University of North Carolina Chapel Hill, NC 27599 USA
Index importance, 256 method, 254–56 partitioning, 255 snipping, 256 See also Reverse engineering
A Absolute sensitivity coefficients, 188, 203, 206 Activated GFP, 47, 50 Adaptive Chebyshev sparse grids, 215 Adaptive sparse grid, 223 Adaptive sparse grid-based optimization, 211–31 anticipated results, 221–24 application notes, 228–30 computational efficiency, 214 data acquisition, 221–24 as deterministic, 215 discussion and commentary, 227–28 error-controlled interpolant, 214 example code, 219–20 experimental design, 215–17 GA-based optimization comparison, 228 general procedure, 218–21 interpretation, 223–24 introduction to, 212–15 materials, 217 parameter space sampling, 211 search range, 218 sorted grid points, 222 summary points, 230–31 troubleshooting, 224–26 troubleshooting table, 225 unique points, 222–23 unstable points, 223 Adaptive sparse grid interpolation, 213–15 AHDC1, 87 ANOVA Experimental Design Matrix, 289 multifactorial, 296 use of, 289 A priori information, 265 Arrow diagram models, 58 Automated probing, 255 Automated reverse engineering, 254–56 advantage, 256 candidate models, 255 defined, 254
B Balanced growth simulation, 118 Basis functions, 216 Bayesian networks, 248–50 analysis, 249 defined, 248 as directed acyclic graphs, 265 dynamic, 250 as graphical model, 248 interference, 124 reverse engineering, 249 for statistical models, 249 structure, 250 Bidirectional search algorithm illustration, 42 illustrated, 40 image analysis based on, 38–41 procedure, 39–40 Biochemical reaction networks bound chemical steady states, 144 sensitivity analysis, 129–46 Biological networks, 234–36 approaches for inference, 242–56 Bayesian, 248–50 Boolean, 245–47 comparative analysis, 257 describing, 237 design principles, 237–38 discussion and comparison of approaches, 264–66 genome-scale metabolic modeling, 243–45 graph theory, 257–58 hierarchical, 260 inferred, 256–64 material, 239–42 metabolomics, 240 307
Boolean (continued) motifs and modules, 258–60 ordinary differential equations, 250–56 proteomics, 240–41 representation, 236–37 reverse engineering, 233–66 scale-free, 258 static, 242–43 stoichiometric analysis, 260–61 summary points, 266 topology, 247–48 transcriptomics, 241–42 types of, 234–35 visualizing, 236 Biomass composition, 156 Biomolecular networks, 252 BOOL-2 algorithm, 245 Boolean networks, 245–47 as deterministic, 264 directed edges in, 262 dynamics in, 262 probabilistic, 246 reverse engineering and, 245 temporal, 246 BRENDA, 256 C Canonical Wnt-pathway, 234 Carbon-backbone network, 115 Carbon-shuttle metabolites, 114 Cell harvesting, 274 Cellular metabolism, stoichiometric models, 151–52 Cellular network modeling, 111–26 anticipated results, 121–22 application notes, 125 cell culture, 113 data acquisition, 121–22 database, 113 dynamic simulation parameters, 122 generalized kinetic expressions, 123–24 interpretation, 121–22 introduction to, 112–13 kinetic, 117–20 materials, 113 methods, 113–21 model network, 121–22 modularity, 122–23 parameter estimation, 120–21 population heterogeneity, 124–25 summary points, 126 troubleshooting table, 125 Cellular networks carbon-backbone, 115 defined, 235 308
design principles, 237–38 functional reduction, 116–17 genome-scale, 114 reconstruction, 113 reduction, 113–17 structural reduction, 113–16 Chebyshev polynomials, 216, 217, 218 Classical flux balance analysis (FBA), 152–54 Clustering gene expression, 292–93 of motifs, 260 network inference before, 248 Cluster size, 298 Collective fitting approach, 64 CONOPT, 162 Conserved moieties, 261 Continuous dynamic modeling, 261–62 Control analysis, 263–64 Controllability, 136 Corrected FRET (FRETc), 27 Correlation coefficients, 247 Correlation networks, 247 Cost function, 216 evaluation of, 218 interpolant mapping, 230 searching areas of, 221 surrogate, 227 Covalent modification system, 145–46 Covariance matrix parameter, 182 scaled measurement, 182 scaled parameter, 182 Cross Gramian defined, 137 empirical, 137–38, 139 for input and output, 137 See also Gramians Crosstalk data-driven model to characterize, 67 inhibition of PI3K affects, 69 Cytotoxicity management, 79 D DAPI images, 5 Data-driven modeling, 57–72 computational analysis of signal specificity, 69–72 data processing, 60–62 examples, 64–72 experimental data types, 59–60 introduction to, 58 model complexity and, 63 normalization, 60–62 parameter specification and estimation, 63–64
principles of, 59–64 with quantitative data, 62–63 systematic analysis of crosstalk, 64–69 Data processing, 60–62 Degree distribution, 82 Design reduction, 190–91 experimental design procedure, 196–97 main effects analysis-based, 205 procedure, 191 purpose and implementation, 190 rank analysis-based, 202 See also Signal transduction modeling Deterministic models, 62 Digital images, mathematical description, 37–38 Direct Search toolbox, 121 Discrete dynamic modeling, 262, 263 Discretization, 161 Domain-domain interactions (DDIs), 241 Dual problem, 143–44 Dynamic flux balance analysis (DFBA), 149–75 advantage, 154 assumptions, 173, 175 classical FBA versus, 154–55 defined, 149, 150 discussion and commentary, 172–75 fed-batch cultures, 157–64, 175 methods, 151–55 model illustration, 154 in novel metabolic capabilities, 168 results and interpretation, 155–72 for Saccharomyces cerevisiae, 149–75 scope of, 175 for sensitivity of ethanol productivity, 166 stoichiometric models, 151–52 summary points, 175 Dynamic flux balance model, 150 alternative, 174 batch, 175 fed-batch, 175 parameterization, 173 Dynamic optimization problem, 162 Dynamic simulations fed-batch cultures, 157–59 kinetic modeling, 118–20 parameters, 122 E Electroporation efficiencies, 30 Electroporation of TF reporter plasmids, 23–26 clonal screening, 25 clonal selection, 24 into 3T3-L1 preadipocytes, 23–24
PPARy activation monitoring, 25 See also Transcription factor (TF) reporter Elementary flux mode (EFM) algorithm, 115–16 analysis, 115 Elementary modes, 261 Empirical cross Gramian, 137–38, 139 defined, 139 nonlinear, 140 sensitivity measure, 140 See also Gramians Endoplasmic reticulum (ER) stress, 89 Energy balance analysis (EBA), 261 Enzyme-linked immunosorbent assays (ELISAs), 60 Enzyme subsets, 261 Epidermal growth factor (EGF), 3 ERK graded response to, 8 samples, 7 stimulation, 7 Ethanol productivity, 160, 161 for aerobic-anaerobic switching time, 167, 172 DFBA results for sensitivity of, 166 ethanol yield trade-off, 163 fed-batch, 171 overproduction mutants, 164–67 Eukaryotic transcription-regulating proteins, 235 Experimental data types, 59–60 Experimental design methodology, 207 Experimental design procedure design reduction, 196–97 factors, 185 feasibility, 184 identifiability analysis, 193–94, 197 impact analysis, 194–96 initial perturbation and measurement design, 193 overview, 184–85 responses, 185 Express GFP, 34 Extracellular-regulated kinase (ERK), 1 activation in HepG2 cells, 26–28 graded response to EGF, 8 F False discovery rate (FDR), 287 analysis threshold, 297–98 Benjamin-Hochberg, 82 local estimator, 297, 298 FANMOD, 260 Feasibility problem defined, 142 semidefinite relaxation and, 142–43 309
Index
Feasnet, 299 Fed-batch cultures dynamic optimization of, 159–64 dynamic simulation of, 157–59 Fed-batch ethanol productivities, 171 Fed-batch operating policy, 171 Feed-forward loop (FFL), 259–60 Fibroblasts, 66 Fisher Information Matrix (FIM), 180 Fluorescence intensity, 43, 50 Fluorescence microscopy imaging, 5–6 Fluorescence resonance energy transfer (FRET), 2, 60 control plasmid development, 22–23 corrected (FRETc), 27 element selection, 18–19 in kinase activity monitoring, 12 occurrence of, 12 signals, 2 spectral overlap during, 18 Fluorescent microscopy image analysis anticipated results, 46–50 application notes, 50–53 based on K-means clustering and PCA, 41–43 based on wavelets and bidirectional search, 38–41 data acquisition, 46–50 image intensity, 43–45 interpretation, 46–50 introduction to, 34–35 inverse problem solution, 47–50 methods, 38–46 method selection, 46 model development, 46–47 preliminaries, 35–38 procedure comparison, 45–46 summary and conclusions, 53 Fluorescent protein cloning, 20–21 PCR, 19–20 Flux analysis theory, 98–99 Flux balance analysis (FBA), 99, 152–54, 261 assumptions, 173, 175 classical, 152–54 optimization problem, 153, 155 for stoichiometric model, 153 See also Dynamic flux balance analysis (DFBA) Fractal kinetic theory, 253 Free fatty acid (FFA) concentration in plasma, 78 cytotoxicity, 75, 77 intracellular metabolic pathways, 76 types of, 78 310
Fus3 cross-inhibition models, 71 G Gas chromatography (GC), 240 GeneChip Operating Software, 281 Gene expression clustering, 293 profiles, 79–80, 82, 83 time series, 287–300 Gene identifiers, 298 Gene pairs, synergy scores, 82, 83 GenePix Pro, 281 Gene regulatory networks, 235 Genetic algorithms (GA), 121, 212, 229–30 finding parameter values, 229–30 fitness, 229 optimization comparison, 228–29 Genome expression analysis, 264–65 Genome-scale models, 243–45 cellular network, 112 metabolic, 96, 243–45 Genome-scale network, 114 Gibbs free energy change, 117 Global algorithms, 227 Global sensitivity analysis, 133, 145, 216 Glucose media dynamic simulation, 158 Glycerol production, 169 GNU Linear Programming Kit (GLPK), 105 Gramians controllability, 136 cross, 137 linear sensitivity analysis and, 136–37 for nonlinear systems, 137–38 sensitivity measure based on, 138–40 uses, 136 Graphs, 237 connectivity, 115 directed acyclic, 265 subgraphs, 258, 259 theory, 257–58 Green fluorescent protein (GFP), 2 activated, 47, 50 express, 34 formation for cell line, 34 Green fluorescent protein (GFP) reporter systems, 12 anticipated results, 28 application notes, 23–28 buffers and reagents, 13 cell and bacterial culture, 13 cells, 34 cloning, 14 data acquisition, 28 discussion and commentary, 29–30 illustrated, 15
Index
interpretation, 28 kinase reporter development, 17–23 materials, 13–14 methods, 14–23 microscopy, 14 principles of, 35–36 summary points, 30 3T3-L1 cell culture, 14 transcription factor reporter development, 14–17 troubleshooting table, 30 Growing trees, 255 H HepG2 cells, 79 Hidden Markov models, 250 Hierarchical clustering, 248 Hierarchical networks, 260 High performance liquid chromatography (HPLC), 240 Hub genes, 82, 85–89 Hybridization, 278 Hybrid models, 62 I Identifiability classes, 182–83 metrics and conditions, 182–84 parameter, 183–84, 187 structural, 183 Identifiability analysis, 186–88, 200 in experimental design process, 193–94, 197 purpose, 186–87, 188 steps, 186 Image analysis based on K-means clustering and PCA, 41–43 based on wavelets and bidirectional search, 38–41 for fluorescent microscopy, 51 goal, 38 mathematical description of, 37–38 Image contrast, wavelets and, 39 Imagene, 281 Immunoblotting, 198 Immunofluorescence (IF) staining, 2 Impact metrics, 189, 195 net, 206 parameter-specific, 206 Impact analysis, 188–90, 200–201 defined, 189 effects-based, 196 experimental design procedure, 194–96 main effects-based, 206
methods, 206 procedure, 190 purpose and implementation, 188–90 rank-based, 195, 201, 206, 207 See also Signal transduction modeling Importance coefficients, 203, 204, 206 Independent component analysis (ICA), 124 Infeasibility certificates from dual problem, 143–44 sensitivity analysis via, 141–46 Inferred networks, 256–64, 266 Information theory, 243 Information theory-based scores, 80 Initial perturbation measurement design constructing, 198 experimental design process, 193 immunoblotting, 198 procedure, 186 purpose and implementation, 185–86 See also Signal transduction modeling INSIG2, 88 Intracellular signaling pathways, 1 Inverse Laplace transform, 49–50 Inverse problem for determining TF concentrations, 47–50 solving, 47–50 K KEGG database, 166, 167 defined, 235 gene insertion library from, 168 LIGAND, 167 Kholodenko method, 252 Kinase reporter development, 17–23 fluorescent protein cloning, 20–21 fluorescent protein PCR, 19–20 FRET-based, 17 FRET control plasmid development, 22–23 FRET element selection, 18–19 functionality, 17 linker oligonucleotide development/annealing, 21 linker region cloning, 22 See also Green fluorescent protein (GFP) reporter systems Kinases, 11 Kinetic modeling, 117–20, 252 dynamic simulations, 118–20 rate equations, 117–18 See also Cellular network modeling K-means clustering, 36–37 in dynamic profile determination, 50 fluorescent cell regions/clusters by, 45 image analysis based on, 41–43 311
Index
K-means clustering (continued) key idea, 37 principle, 36 procedure steps, 36–37 K-medians, 293 L Labeling, 277 Lagrange dual problem, 143–44 Laplace transformation, 48–49 application, 48–49 inverse, 49–50 LASSO tool, 243 Levenberg-Marquardt method, 64 Linear program (LP), 153 Linear sensitivity analysis, 134–36 defined, 135 disadvantages, 135–36 Gramians and, 136–37 relative sensitivities, 135 See also Sensitivity analysis Linker oligonucleotide development/annealing, 21 region cloning, 22 Local parameter identifiability, 183–84 Local sensitivity analysis, 216 Local structural identifiability, 183 M MACF1, 87–88 Main effects-based impact analysis, 206, 207 MAPKs. See mitogen-activated protein kinases Markov Chain Monte Carlo, 256 Markov transition model, 119 Mass spectrometry (MS), 240 Mathematical modeling, 57, 179, 212 MATLAB delay differential equation solver, 192 EFM algorithm implementation in, 116 optimal search function, 165 MAvisto, 260 Mean Value Theorem, 253 Mechanistic model complexity, 62 MEK activation comparator (MAC), 69 Meshes, comparison of, 214 Metabolic control analysis (MCA), 130 Metabolic flux analysis (MFA), 98–99, 116 Metabolic modeling, 95–107 anticipated results, 105–6 applications of, 96 benefits, 96 data acquisition, 105–6 discussion and commentary, 106–7 feasible solution determined, 105–6 312
flux analysis theory, 98–99 genome-scale, 96 implementation, 95 interpretation, 105–6 introduction to, 96–98 materials and methods, 98–105 model development, 99–100 no feasible solution determined, 106 objective function, 100–104 optimization, 104–5 summary points, 107 uses, 95 Metabolic networks defined, 234 reconstructed, 97 Metabolic profiles, 240 Metabolites carbon shuttle, 114 gene selection and, 80 integrating, 83 mass balances, 118 measurements, 80 trends, 80, 81 Metabolomics, 240 Metatool, 116 METLIN, 240 Metropolis algorithm, 65–66 Mfinder, 260 Michaelis-Menten enzymatic rate equations, 117 alternatives, 123 as hyperbolic functions, 123 Michaelis-Menten kinetics, 169 Microarrays, 242 cDNA, 264 data acquisition, 281 experiment workflows, 242 transcriptional profiling with, 276–81 Mitogen-activated protein kinases (MAPKs), 57, 214 cascades, 64 components, 69 experimental data, 222 Fus3, 70 ODE model, 217 two-parameter search, 224, 226 Modularity, cellular network modeling, 122–23 Monod kinetics, 117 Monte Carlo optimization, 70 MOSEK, 159 Motifs, 258–59 clustering of, 260 defined, 258 dynamic stability, 260
Index
in E. coli transcriptional regulation network, 260 network characterization, 259 MRNAs, 242, 279–80 Multiple shooting strategies, 212 Multiwavelet formulations, 216 N Negative control experiments, 50 Net impacts, 206 Network component analysis (NCA), 124, 273, 282–83 defined, 282 procedure, 282–83 solutions, 284 toolbox, 284 uses, 282 Network component mapping (NCM), 124 Network reconstruction, 113 Nonalcoholic steatohepatitis (NASH), 76 Normalization, 60–62, 281–82 biological variability, 61 population endpoint measurements, 61 purpose, 60–61 Nuclear magnetic resonance (NMR), 240 O Objective functions, 100–104 across multiple scales, 103 choices, 100–102 determination, 102–3 evaluation, 102–3 with largest value, 103 maximization of ATP production rate, 100 maximization of biomass production, 100 minimization of, 182 minimization of ATP production rate, 100 minimization of nutrient uptake rate, 100 minimization of redox potential production rate, 100 multiple simultaneous, 104 steady state flux distribution, 116 See also Metabolic modeling Observability, 136 ODRPACK, 64 Optimization adaptive sparse grid-based, 211–31 of fed-batch cultures, 159–64, 174 global methods, 212 searches, 227 tissue and cell function, 111 Ordinary differential equations (ODEs), 97, 217, 250–56 automated reverse engineering, 254–56 control analysis simulation, 261–64 defined, 250
dynamics simulation, 261–64 form, 250 parameter estimation, 256 power law modeling, 253–54 sensitivity analysis simulation, 261–64 small-scale biochemical network identification, 251–53 stochastic, 243 system of, 251 Organismal scale, 104 Over-representation, 296 P PAINT, 287 availability, 290 clustered data, 288 cross-referenced list, 299 defined, 288 in gene regulation, 289 multiple testing correction in, 299–300 prebuilt reference files, 299 results, 290 results interpretation, 296 transcriptional regulatory network analysis with, 293–95 uses, 288 Parameter estimation cellular network modeling, 120–21 signal transduction modeling, 181–82 Parameter identifiability defined, 183 local, 183–84 metrics, 206 testing, 187 See also Identifiability Parameters covariance matrix, 182 identification, 211 sensitivity matrix, 180, 182 Parametric sensitivity analysis, 132–33 Partitioning, 255 Pathway Genome Database (PGDB), 99 Pathway Interaction Database (PID), 235 Pearson Correlation, 293 Permutation tests, in synergy significance evaluation, 82 Phenotype, 77, 83 Phenotype-specific gene network, 75–90 anticipated results, 82 application notes, 83–89 cell culture and reagents, 79 cytotoxicity management, 79 data acquisition, 82 discussion and commentary, 83 experimental design, 78–79 313
Index
Phenotype-specific gene network (continued) fatty acid salt treatment, 79 gene expression profiling, 79–80 gene selection based on metabolite trends, 80 hub genes in network, 85–89 interpretation, 82 introduction, 76–77 materials, 79 metabolites measurements, 80 methods, 79–82 network topology evaluation, 82 reconstruction, 83 summary points, 89–90 synergy network characteristics, 84–85 synergy scores calculation, 80–82 synergy significance evaluation, 82 topological analysis, 78 troubleshooting table, 84 Phosphoinositide 3-kinase (PI3K), 57, 66 Phosphorylated ERK (ppERK) antibody labeling of, 4–5 average nuclear intensities, 7 fluorescence microscopy imaging of, 5–6 measurements, stimulation for, 4 nuclear, 6, 7 Phosphorylated STAT5 (pSTAT), 193, 194, 196 Platelet-derived growth factor (PDGF) dose, 68 independent signaling modes, 69 receptors, 66, 68 stimulation, 69 Polynomial chaos, 216 Power law modeling, 253–54 from fractal kinetic theory, 253 illustrated, 254 properties, 253 Principal component analysis (PCA), 33 application illustration, 43 defined, 37 in dynamic profile determination, 50 fluorescent cell regions/clusters by, 45 image analysis based on, 41–43 motivation for using, 37 Probabilistic Boolean networks, 246 Process diagrams, 237 Protein kinase C (PKC), 12 Protein-protein interaction (PPI), 240 data analysis, 241 network, 87 validating, 241 Proteomics, 240–41 Q Quantitative immunofluorescence, 1–9 314
anticipated results, 6–7 data acquisition, 6–7 discussion and commentary, 8 experimental design, 3 interpretation, 6–7 introduction to, 2–3 materials, 3–4 methods, 4–6 statistical guidelines, 7 summary points, 8–9 troubleshooting table, 9 Quantitative mass spectrometry, 207 Quasi-Monte Carlo algorithm, 215 R Rank analysis, 189 Rank-based impact analysis, 195, 201, 206, 207 Ras-dependent pathways, 57 Ras/Erk pathway, 66 Rate equations generalized, 123–24 kinetic modeling, 117–18 Michaelis-Menten, 117, 123 Relative sensitivities, 135 REVEAL algorithm, 245 Reverse engineering, 238–39, 266 automated, 254–56 Bayesian network, 249 before/after, 238 Boolean networks and, 245 defined, 238 improving, 266 Reversible covalent modification, 131–32 RNA purification, 274–76 transcriptional profiling for, 280–81 S SABIO-RK, 256 Saccharomyces cerevisiae dynamic simulation, 151 fed-batch cultures, 157–64 fed-batch fermentation, 150 growth phenotypes of knockout mutants, 150 for renewable liquid fuel applications, 164 steady-state FBA, 171 steady-state FBA mutants, 166 stoichiometric models of, 155–57 Scaffolding proteins, 234 Scale-free networks, 258 Semidefinite relaxation, 142–43 Sensitivity analysis, 129–46 discussion and outlook, 146
Index
via empirical Gramians, 136–41 global, 133, 145, 216 via infeasibility certificates, 141–46 introduction to, 130 linear, 134–36 local, 216 parametric steady-state sensitivity, 132–33 purposes, 130 reversible covalent modification, 131–32 simulation of, 263–64 system class and, 131–34 uses, 129 Serial Analysis of Gene Expression (SAGE), 242 SH3RF2, 88–89 Shortest path length, 82 Signal specificity in yeast, 69–72 Signal transducer and activator of transcription 5 (STAT5) signaling, 192 Signal transduction, 34 cascades, 271 in eukaryotic cells, 58 networks, 2, 234 Signal transduction modeling, 179–207 anticipated results, 192–97 application notes, 197–205 classes of factors and responses, 185 data acquisition, 192–97 design implementation, 191–92 design modification and reduction, 190–91 discussion and commentary, 205–7 experimental design procedure, 184–85 identifiability analysis, 186–88 identifiability metrics/conditions, 182–84 impact analysis, 188–90 initial perturbation and measurement design, 185–86 interpretation, 192–97 introduction to, 180–85 methods, 185–92 parameter estimation, 181–82 structure, 180–81 summary points, 207 Silhouette Coefficient (SC), 296 Simulated annealing, 212 Single-cell endpoint measurements, 60 Single-cell kinetic measurements, 60 Single input module (SIM), 259 Singular value decomposition (SVD), 124 Snipping, 256 Sontag method, 252 Sorted grid points, 222 Source tool, 290
Sparse Grid toolbox, 217, 223 Static networks, 243 Steady-state flux distribution, 116, 117 Steady states bound feasible, 144–45 computing, 141 shift, 134 Steady-state sensitivity defined, 133 parametric, 132–33 See also Sensitivity analysis Stochastic sampling, 211 Stoichiometric analysis biological networks, 260–61 defined, 260 properties found by, 261 Stoichiometric models, 150 of cellular metabolism, 151–52 classical FBA for, 153 of S. cerevisiae metabolism, 155–57 wild-type, 168, 169 Structural identifiability defined, 183 local, 183 parameters and, 187 testing, 207 See also Identifiability Subgraphs, 258, 259 Subpopulation fractions, 111 Sum of squared differences (SSD), 70 Synergy analysis, 75–90 defined, 80 significance evaluation, 82 Synergy networks degree distribution of, 85 distribution of shortest path lengths in, 85 hub genes in, 85–89 topographical characteristics, 84–85 topology analysis, 90 Synergy scores calculation of, 80–82 gene pairs, 82, 83 range, 82 Systematic system perturbations, 138 Systems biology, 239 Systems biology Graphical Notation (SBGN) initiative, 237 Systems Biology Markup Language (SBML), 99 T Temporal Boolean networks, 246 3T3-L1 cell culture, 14 315
Time of flight mass spectrometry (MS-TOF), 239 Time series gene expression, 287–300 analysis, 297 anticipated results, 296 application notes, 300 data acquisition, 296 differentially expressed gene identification, 291–92 discussion and commentary, 296–300 interpretation, 296 introduction to, 288–89 materials, 289–90 methods, 291–95 normalized, 289 number of clusters, 296 robust clustering, 292–93 summary points, 300 transcriptional regulatory network analysis with PAINT, 293–95 Total cytoplasmic STAT5 (tSTAT), 193, 194, 196 Total internal reflection fluorescence (TIRF), 60 Transcriptional profiling, 276–81 with DNA microarrays, 276–81 hybridization, 278, 281 labeling, 277 for mRNA, 279–80 for total RNA, 280–81 washing and scanning, 278–79 Transcriptional Regulatory Element Database (TRED), 282 Transcriptional regulatory networks analysis with PAINT, 293–95 application notes, 284–85 cell harvesting, 274 discussion and commentary, 284 DNA microarray data acquisition, 281 illustrated, 272 introduction to, 272–73 materials, 273 methods, 273–81 NCA, 282–83 normalization, 281–82 profiling with DNA microarrays, 276–81 RNA purification, 274–76 summary points, 285 troubleshooting table, 285 Transcription factors (TFs), 11
activities (TFAs), 283 binding elements, cloning, 16–17 binding states, identification of, 15–16 candidate, 288 concentrations, inverse problem for determining, 47–50 DNA binding sites, 273 GFP-based, 12 identification, 271 master controllers of, 64 response element, 14 Transcription factor (TF) profiles computation of, 33–53 concentration, 50 damped oscillation, 50 Transcription factor (TF) reporter cloning TF binding elements into, 16–17 defined, 14 development, 14–17 electroporation of plasmids, 23–26 See also Green fluorescent protein (GFP) reporter systems Transcriptome analysis, 271–85 Transcriptomics, 241–42 defined, 241–42 methods for studying, 242 TRANSFAC database, 298 Troubleshooting tables adaptive sparse grid-based optimization, 225 cellular network modeling, 125 green fluorescent protein (GFP) reporter systems, 30 phenotype-specific gene network, 84 quantitative immunofluorescence, 9 transcriptional regulatory networks, 285 U Unique points, 222–23 Unstable points, 223 W Washing and scanning, 278–79 Wavelets, 36 image analysis based on, 38–41 in image contrast, 39 Whole-cell models, 112 Z Z-score, 259