Disease Markers: Cancer Genomics


Author: Robert L. Strausberg


Disease Markers

Editor-in-Chief
Sudhir Srivastava, Ph.D, MPH
Office address: Chief, Cancer Biomarkers Research Group, Division of Cancer Prevention, National Cancer Institute
E-mail: [email protected]

Correspondence address:
Disease Markers, c/o IOS Press, Inc., 5795-G Burke Centre Parkway, Burke, VA 22015, USA
Tel.: +1 703 323 5554; Fax: +1 703 323 3668; E-mail: [email protected]
Disease Markers, c/o IOS Press, Nieuwe Hemweg 6B, 1013 BG Amsterdam, The Netherlands
Tel.: +31 20 688 33 55; Fax: +31 20 620 24 19; E-mail: [email protected]

Editors

D.S. Alberts Tucson, AZ, USA

P.G. Anker Geneva, Switzerland

S.G. Baker Bethesda, MD, USA

B. Bapat Toronto, Canada

D.K. Bol Princeton, NJ, USA

B. Levin Houston, TX, USA

P. Boffetta Lyon, France

L.A. Liotta Bethesda, MD, USA

H.B. Burke Valhalla, NY, USA

D. Lo Hong Kong

W Russa Madison, WI, USA

R.G.H. Cotton Fitzroy, VIC, Australia

G.J. Downing Bethesda, MD, USA

Z. Feng Seattle, WA, USA

J. Greenman Hull, UK

S.M. Hanash Ann Arbor, MI, USA

G. Haroske Dresden, Germany

M. Harrington Pasadena, CA, USA

N.K. Hayward Herston, QLD, Australia

D.E. Henson Rockville, MD, USA

Houston, TX, USA

H. Magdelenat Paris, France

H. Malm Chicago, IL, USA

L. Mao Houston, TX, USA

F. Olopade Chicago, IL, USA

R.B. Parekh Abingdon, UK

G. Rennert Haifa, Israel

R. Schleimer Baltimore, MD, USA

J.W. Shay Dallas, TX, USA

I. Shoulson Rochester, NY, USA

M. Steel St. Andrews, UK

P.E. Barker Gaithersburg, MD, USA

W.N. Hittelman Houston, TX, USA

R.L. Strausberg Bethesda, MD, USA

M.J. Birrer Rockville, MD, USA

S. Kaneko Kanazawa, Japan

E. Tahara La Jolla, CA, USA

T.M. Block Doylestown, PA, USA

M. von Knebel Doeberitz Heidelberg, Germany

H. Vainio Lyon, France

Aims and Scope

The journal publishes original research findings (and reviews solicited by the Editor) on the subject of the identification of markers associated with the disease processes whether or not they are an integral part of the pathological lesion. The disease markers may be a genetic host factor predisposing to the disease or the occurrence of cell-surface markers, enzymes or other components, either in altered forms, abnormal concentrations or with abnormal tissue distribution. This journal is designed to provide a forum for publications dealing with original observations in this developing field on any aspect of the general topic including:

• Identification of new genetic or non-genetic markers (e.g. cell-surface antigens, serum proteins, intra- and extra-cellular enzymes, cytogenetic markers and DNA sequences).
• Population studies of new and existing markers, designed to elucidate information on their normal distribution as well as that in disease states.
• Amplification of knowledge about existing markers.
• Family studies of markers in disease.
• New techniques for identification and/or isolation of important marker molecules.
• Use of monoclonal antibodies for the definition of molecular structures associated with disease markers.
• Identification of disease-associated abnormalities in DNA using recombinant DNA techniques, gene cloning and DNA restriction enzyme fragment polymorphisms.
• Identification of markers identifying malignantly transformed neoplastic cells, including precancerous lesions.

Published quarterly

0278-0240/01/$8.00

Printed in The Netherlands


Guest-Editorial

Talkin' Omics

The 1990s will be remembered as the decade when advances in biomedical research launched the genomics era. While new information and technologies are clearly important products of the genomics revolution, perhaps most important is a change in mindset of how we pursue scientific discovery. We are no longer satisfied to study a gene or gene product in isolation, but rather we strive to view each gene within the complex circuitry of a cell. Understanding how genes and their products interact will open many exciting avenues in biological and biomedical research. In rapid succession, this new mindset has invigorated the analysis of all molecular entities, from the genome, to transcripts (transcriptome) and proteins (proteome). And it is clear that this is just the beginning of the omics revolution.

While the understanding and treatment of many diseases will be impacted by omics, arguably the greatest biomedical opportunity for discovery is cancer. As a family of diseases, all cancer results from changes in the genome. The genomic changes take many forms, from point mutations, to amplifications and deletions, to translocations. Cancers in particular body sites (breast, prostate, brain) display a multitude of different changes in the genomic blueprint that can result in disease. As we move toward classification schemes based on molecular signatures, omics approaches provide the opportunity to search for differences and similarities in those tumors. Moreover, cancer is a temporal disease, usually developing over many years from an accumulation of changes occurring within the genome. Therefore, the opportunity exists to define events not only occurring in tumors, but also in precursor stages of disease that might be most amenable to intervention.

The complexity of molecular events within the genome is reflected and amplified by the diversity of transcripts and proteins within a cell. A variety of transcripts can be derived from the same gene, and the encoded proteins can be modified extensively to fulfill specific biological functions within a cell. The omics revolution has challenged researchers to integrate the study of the genome, transcriptome, and proteome, for this is the most promising approach to attaining a comprehensive omic view of the molecular circuitry within a cell.

In this issue of Disease Markers, we are fortunate to have contributions from some of the leaders of the omics approach. While many of the articles feature cancer research, we hope that the more general applicability of the described approaches is apparent. We have tried to make this inaugural omics special issue of Disease Markers provocative and informative, and we hope that it captures the excitement that has led to the description of omics research as revolutionary.

Robert L. Strausberg, Ph.D.
Cancer Genomics Office
National Cancer Institute
31 Center Drive
Bethesda, MD 20892, USA
Tel.: +1 301 496 1550
Fax: +1 301 496 7807
E-mail: [email protected]

Disease Markers 17 (2001) 39 ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved


Using Serial Analysis of Gene Expression to identify tumor markers and antigens

Gregory J. Riggins
Duke University Medical Center, Durham, NC 27710, USA
Tel: +1 919 684 3250; Fax: +1 919 681 2796; E-mail: [email protected]

Tumor markers and antigens are normally highly expressed in malignant tissue, but not in the surrounding normal tissue. Serial Analysis of Gene Expression (SAGE) is a technology that counts mRNA transcripts and can be used to find those genes most highly induced in malignant tissues. SAGE produces a comprehensive profile of gene expression and can be used to search for tumor biomarkers in a limited number of samples. Public sources of SAGE data, in particular through the Cancer Genome Anatomy Project, increase the value of this technology by making a large source of information on many tumors and normal tissues available for comparison. Although the perfect tumor-specific gene does not exist, the differences in gene expression between tumor and normal can be exploited for therapeutic or diagnostic purposes.

Disease Markers 17 (2001) 41-48 ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved

1. Introduction

During tumor growth, the pattern of expressed genes in the tumor diverges from that of the surrounding normal tissue. Simple knowledge of which tumor genes are induced compared with normal tissues can aid in locating clinically useful markers or antigens, even though the molecular basis of the altered expression is typically unknown. Frequently, genes over-expressed in tumors have been sought as markers for early detection. Tumor markers are used to indicate the presence of cancer, or to follow response to therapy. Typically, tumor markers are assayed by the detection of protein from serum or other accessible body fluids or tissues. Tumor markers are clearly useful, but there is a lack of good markers for most cancers where early detection is warranted.

Proteins over-expressed relative to normal tissue have a second important practical use as Tumor Specific Antigens (not present in any normal tissue) or Tumor Associated Antigens (expression in some normal cells). Tumor antigens may indeed be the same protein as a tumor marker, but their purpose is therapeutic rather than diagnostic. Toxic antibodies directed against tumor antigens on the cell surface or in the extracellular matrix may kill enough cancer cells to be therapeutic [21]. This approach ideally requires the cell surface protein to be uniquely expressed on all the tumor cells, but not expressed in any normal cells that would be in contact with the antibody during treatment. Also promising is a 'tumor vaccine' approach where the goal is to direct immune defenses toward the tumor by 'educating' host cells with tumor-derived material [4]. Expression of the marker on the cell surface is not a requirement of this system, but successful systemic administration of a tumor vaccine may require a much higher expression of the tumor antigen in the tumor compared to vital cells throughout the body. Either of these immune-based therapies would benefit from the discovery of new tumor antigens.

Another class of differentially expressed genes in cancer is prognostic markers. Recently, many groups have sought to classify tumors by gene expression pattern, in addition to histopathology. Large-scale gene expression analysis, in particular using DNA microarrays, has been used successfully to classify tumors by RNA expression patterns. This has helped further separate aggressive from non-aggressive tumors and, in some cases, has helped predict response to therapy.

Tumor markers, tumor antigens and prognostic markers (cancer biomarkers) have great potential for clinical application, but there has been a lack of high-quality markers for the various types of cancers [38]. Finding a candidate marker has frequently been the by-product of other studies and not the initial intent of the research.
Furthermore, generating the expression profile for each suspect gene has often relied on time-consuming techniques, such as Northern Blotting, in situ hybridization, or immunohistochemistry. Fortunately, the advent of large-scale gene expression analysis and information technology has accelerated biomarker discovery. In particular, Serial Analysis of Gene Expression (SAGE) [33] can decipher complex expression patterns and be helpful for locating biomarkers.

This review will focus on the use of SAGE for locating clinically useful cancer-induced genes. SAGE is a technology that identifies and counts nearly all expressed transcripts from an RNA sample. Due to the quantitative and comprehensive nature of the data, it is particularly good for locating tumor markers. Transcripts found by SAGE to be induced to high levels normally encode proteins that are also expressed at high levels, making this approach useful for locating candidate tumor antigens as well. Several examples of using SAGE to locate cancer markers and antigens have already been published, but the potential of this technology to discover disease biomarkers is only beginning to be realized.

2. What is SAGE?

Serial Analysis of Gene Expression (SAGE) is a sequence-based approach that produces and counts the transcripts expressed within a group of cells [35]. With SAGE, a 10 base-pair 'tag' sequence plus a specific four base-pair restriction site is used to distinguish and identify transcripts. These tags are ligated and cloned into a sequencing vector, allowing the serial analysis of multiple transcripts using an automated sequencer. The number of times a particular tag is observed in a tag population made from one mRNA sample (a SAGE library) is used to determine the relative abundance of each transcript in the mRNA sample. The counts of each transcript are stored in a computerized database and are used to make statistical comparisons between libraries. Numerous reviews have been published describing the technology [13,17,24,34]. Many studies have generated comprehensive analyses from a diversity of tissues, in particular malignant tissues.

A detailed protocol can be obtained through the SAGE Home Page from the Johns Hopkins Oncology Center (http://www.sagenet.org). The technology is patented by Johns Hopkins University and licensed to Genzyme Molecular Oncology (Framingham, MA) but is freely available to academia and nonprofit organizations for research purposes. Further information on the license agreements for commercial applications can be obtained directly from Genzyme (www.genzyme.com/sage/welcome.htm).

An outline of SAGE is shown in Fig. 1. The construction of a SAGE library starts with the purification of polyadenylated mRNA, which is subsequently converted into double-stranded cDNA using a biotin-labeled oligo-dT primer during first-strand synthesis so that the cDNA can be recovered. A frequent-cutting anchoring enzyme, usually NlaIII, defines the position in the transcripts from which the sequence tags are derived. After digestion of the generated cDNA with the anchoring enzyme, the 3'-terminal cDNA fragments are bound to streptavidin-coated beads. Next, an oligonucleotide linker containing recognition sites for a type IIs restriction enzyme (tagging enzyme) is ligated to the bound cDNA fragments. Type IIs enzymes cut at a defined distance away from the recognition site. The SAGE tags are then released from the bound cDNA by cleavage with the tagging enzyme (usually BsmFI) and dimerized by a tail-to-tail blunt-end ligation. The linked ditags (102 bp) are then amplified and digested with the same anchoring enzyme used for the initial digestion of the double-stranded cDNA. The resulting ditag fragments of 26-28 bp all have cohesive ends and can therefore be linked together to form concatemers and cloned into a plasmid for sequencing.

Modifications to the SAGE library construction have been made to allow libraries to be constructed with smaller starting samples. A 'microSAGE' procedure reported to assay small numbers of purified endothelial cells has worked robustly in our hands and in other laboratories [30]. This procedure starts with total RNA or cell lysates and conjugates the mRNA in the sample to magnetic beads, where subsequent cDNA synthesis and digests are performed. This increase in efficiency allows libraries to be made from as little as 1 µg of total RNA.

The final experimental step in the SAGE analysis is sequencing of the library. The data files generated by automated sequencing of the plasmids with ligated SAGE tags are analyzed by software that extracts and counts the occurrence of each tag. In a typical SAGE experiment, sequencing 2,000 inserts can produce over 50,000 tags, producing a sensitive level of detection. Cumulative databases can be formed, and bioinformatics applied to find transcripts with a particular pattern of expression. Therefore, SAGE data derived from a series of tumor and normal tissues can be queried directly for transcripts that are highly expressed in tumor, but not in normal tissue.
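The tag extraction and counting step described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the SAGE software itself: the function names, the accepted ditag length window (20 to 24 bp between two anchoring sites), and the decision to skip ditags with ambiguous base calls are all hypothetical choices made for the example.

```python
from collections import Counter

ANCHOR = "CATG"  # recognition site of the anchoring enzyme NlaIII
TAG_LEN = 10     # tag length described in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_tags(read, min_ditag=20, max_ditag=24):
    """Return the tags found in one concatemer read.

    min_ditag and max_ditag are assumed bounds on the ditag length
    between two anchoring sites; adjust them to the protocol in use.
    """
    tags = []
    for ditag in read.upper().split(ANCHOR):
        if not min_ditag <= len(ditag) <= max_ditag:
            continue  # too short or too long to be a ditag
        if set(ditag) - set("ACGT"):
            continue  # skip ditags with ambiguous base calls
        # The first tag reads forward from the left anchoring site.
        tags.append(ditag[:TAG_LEN])
        # The second tag lies on the opposite strand at the right end.
        tags.append(reverse_complement(ditag[-TAG_LEN:]))
    return tags


def count_tags(reads):
    """Tally tag occurrences over all reads of one SAGE library."""
    counts = Counter()
    for read in reads:
        counts.update(extract_tags(read))
    return counts
```

Because the two tags of a ditag are joined tail to tail, the second tag has to be read from the opposite strand, which is why the sketch reverse-complements the last ten bases of each ditag.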


Fig. 1. Outline of SAGE approach for transcript profiling.

3. Assessment of SAGE technology

Since SAGE identifies and counts transcripts by nucleic acid sequence, it is frequently regarded as an accurate means for large-scale expression profiling. SAGE transcript levels are expressed as a fraction of the total transcripts counted, not relative to another experiment, standard, or housekeeping gene, which avoids error-prone normalization between experiments. The standardized nature of SAGE data makes cumulative data sets possible and historical comparisons valid. An additional strength of SAGE is that it determines expression levels directly from an RNA sample. It is not necessary to have a DNA probe arrayed to assay each gene, as with chip technology. This allows SAGE to identify genes that are not included in an array [23] and avoids the infrastructure necessary to create and read large DNA arrays.

This flexibility comes with some disadvantages. The number of samples that can be processed using SAGE is small compared to DNA arrays, since it takes about two weeks of skilled labor to construct a SAGE library. Analysis of hundreds of samples by SAGE in a single laboratory is not a practical option for the technology in its present form. However, when an in-depth and quantitative profile is desired for a small number of samples, the extra work involved in creating a SAGE library can certainly be justified. To date, SAGE has been successful for determining the differentially expressed transcripts in well-controlled experimental systems [1,6-8,10,16,23,39]. This type of data is often complementary to the typical use of DNA arrays in cancer research, a wide survey of many patient tumor samples. However, the increasing amount of public SAGE data makes it possible to build rapidly upon the work of others to locate the genes of interest.
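Expressing each tag as a fraction of the total tags counted is simple arithmetic, and a brief Python sketch may make the point about comparing libraries of different sizes concrete. The function name and the reporting scale of 100,000 tags are assumptions chosen for illustration; the text above does not prescribe a particular scale.

```python
def tags_per_scale(counts, scale=100_000):
    """Convert raw tag counts into counts per `scale` total tags.

    counts maps each tag sequence to its raw count in one library.
    The scale of 100,000 is an illustrative choice, not a standard
    taken from the text.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no tags")
    return {tag: n * scale / total for tag, n in counts.items()}


# Two hypothetical libraries sequenced to different depths.
deep = {"AAAAACCCCC": 50, "other": 49_950}     # 50,000 tags in total
shallow = {"AAAAACCCCC": 25, "other": 24_975}  # 25,000 tags in total

# The raw counts differ, but the relative abundance is the same.
print(tags_per_scale(deep)["AAAAACCCCC"])
print(tags_per_scale(shallow)["AAAAACCCCC"])
```

Both libraries report the same abundance for the tag even though one was sequenced twice as deeply, which is the property that allows cumulative data sets to be compared without reference to a housekeeping gene.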

4. Public sources of SAGE data

One advantage of SAGE technology is that public data can be easily downloaded or queried online. Links to SAGE data and SAGE resources are listed in Table 1. This data is a valuable resource for comparing internally generated expression data, or for mining of novel cancer biomarkers [15,28]. The Cancer Genome Anatomy Project (CGAP) specializes in creating databases and resources for cancer research [26,31,32] and has produced this large source of public SAGE information. CGAP adopted SAGE technology starting in 1998 with the introduction of the SAGEmap web site [12]. On-line tools built specifically to handle SAGE data [12,14] allow users to make statistically based comparisons between libraries to find differentially expressed genes using the 'xProfiler', or to download data for local analysis. SAGE tags can be 'mapped' to UniGene clusters via SAGEmap, making the identification of a gene from a differentially expressed tag easier. However, the UniGene mapping is not always accurate, and efforts are underway to produce a more accurate mapping of tags to transcript data. The SAGE data generated through this project is also used to create a 'Digital Northern' tool, where the expression level of a particular gene can be determined for each of the tissues used to make SAGE libraries. To date, over four million valid transcript tags have been processed from nearly 100 different malignant and normal cell types.

5. Bioinformatics

For handling SAGE data, most investigators rely on the SAGE software generated by Ken Kinzler and coworkers to process raw data. This software extracts tag sequences from raw sequence data and tabulates the counts in a database. The software also makes comparisons between libraries of tags and calculates the statistical significance of differences based on Monte-Carlo simulations [40]. Additionally, the software helps create a relational database by extracting tags, gene names and gene information from sequence databases. The program uses this information to match tags to known genes or ESTs. Additional 'tag to gene' mapping information can be downloaded from the NCBI via the SAGEmap website (Table 1). The SAGE software is freely available to non-commercial users of the technology and can be obtained via SAGEnet (Table 1). Investigators who plan to use SAGE technology for commercial purposes should contact Genzyme Molecular Oncology for a license agreement.
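To give a sense of how a Monte-Carlo comparison between two libraries can work, the Python sketch below estimates a p-value for one tag. It is not the algorithm implemented in the SAGE software, whose details are not given here; it is one simple resampling scheme chosen for illustration. The null hypothesis assumed is that each observation of the tag falls into a library with probability proportional to that library's total tag count, and all names and defaults are hypothetical.

```python
import random


def monte_carlo_p(count_a, total_a, count_b, total_b,
                  n_sim=10_000, seed=0):
    """Estimate a two-sided p-value for one tag seen in two libraries.

    count_a, count_b: observed counts of the tag in libraries A and B.
    total_a, total_b: total tags counted in each library.
    """
    rng = random.Random(seed)  # seeded so that results are repeatable
    pooled = count_a + count_b
    p_a = total_a / (total_a + total_b)
    observed = abs(count_a / total_a - count_b / total_b)

    hits = 0
    for _ in range(n_sim):
        # Redistribute the pooled observations between the libraries.
        sim_a = sum(1 for _ in range(pooled) if rng.random() < p_a)
        sim_b = pooled - sim_a
        if abs(sim_a / total_a - sim_b / total_b) >= observed:
            hits += 1
    # Adding one to both terms keeps the estimate above zero.
    return (hits + 1) / (n_sim + 1)
```

A call such as `monte_carlo_p(100, 50_000, 5, 50_000)` asks how often chance alone would split 105 observations as unevenly as 100 versus 5 between two equally sized libraries. The smallest p-value the estimate can report is 1 / (n_sim + 1), so more simulations are needed to resolve very small values.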

6. Verification of genes identified by SAGE

After a gene expression profile has been obtained on a set of RNA samples, it is desirable to experimentally confirm the expression differences and to extend the analysis to other samples. Normally a small set of interesting genes has been identified using DNA arrays or SAGE, and several other techniques are more efficient for assaying this smaller set of genes. In addition, each gene expression technique has inherent errors, and an independent method is required for validating the original expression levels. Although Northern Blotting is a time-consuming approach, it is still a useful and accurate way to confirm profiling data for a limited number of genes. When a good antibody is available for the gene of interest, Western blotting and immunohistochemistry are reliable methods for confirming expression changes. This approach is advantageous in particular when the endpoint is knowledge of protein levels rather than mRNA levels.

Real-Time PCR, sometimes called 'quantitative' or 'fluorescent' PCR, has gained popularity for rapid follow-up and confirmation of profiling data [15,25]. Expression determination by real-time PCR is based on continuous fluorescent monitoring of PCR products [20,36,37] from a reverse transcriptase-generated cDNA template. The number of cycles required to PCR-amplify a product to a certain level decreases linearly with the logarithm of the amount of starting template and can therefore be used to determine starting mRNA levels accurately. Normally a serially diluted known sample is used for a standard curve to interpolate concentrations of unknown samples.
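The standard-curve interpolation used in real-time PCR can be sketched in a few lines of Python. The sketch assumes the usual linear relationship between the threshold cycle (commonly written Ct, a term not used in the text above) and the base-10 logarithm of the starting quantity. The function names and the dilution values are hypothetical and serve only as an example.

```python
import math


def fit_standard_curve(quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantity_from_ct(ct, slope, intercept):
    """Interpolate the starting quantity of an unknown from its Ct."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical ten-fold dilution series of a known sample.
dilutions = [1, 10, 100, 1000]         # relative starting quantity
cts = [30.0, 26.68, 23.36, 20.04]      # measured threshold cycles

slope, intercept = fit_standard_curve(dilutions, cts)
unknown = quantity_from_ct(25.02, slope, intercept)
print(round(slope, 2), round(unknown, 1))
```

The negative slope reflects the fact that a sample with more starting template reaches the detection threshold in fewer cycles.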


Table 1
Links to SAGE data and information

Web site: SAGEmap
URL: www.ncbi.nlm.nih.gov/SAGE
Description: Large RNA expression database from CGAP based on SAGE profiles of malignant and normal cells.

Web site: SAGEnet (Kinzler-Vogelstein Lab)
URL: www.sagenet.org
Description: SAGE database, protocols, references and additional links.

Web site: Genzyme Molecular Oncology
URL: www.genzyme.com/sage/welcome.htm
Description: SAGE information and applications for commercial users of the technology.

Web site: Cancer Genome Anatomy Project (CGAP)
URL: cgap.nci.nih.gov
Description: CGAP homepage with links to expression databases and cancer research resources.

To look at protein levels of many samples simultaneously, a tissue microarray system has been developed [11,19,29]. This system allows up to one thousand small tissue samples, taken with a narrow-gauge biopsy needle, to be arrayed in a single block of tissue. This block of tissue can then be used to produce hundreds of slides that can be probed by immunohistochemistry or other means. In this way a standard set of the same samples can be probed for expression levels of many different genes. A digital imaging system is used to record and read the data. Although robotics are now employed to array the tissues, many good-quality samples must still be collected, and the region of interest for each biopsy must be identified by a pathologist. The results must also be scored in some fashion by signal intensity, which is done manually at this point in the technology's development. Finally, a good antibody that works in fixed tissue is needed for each gene of interest. This approach has the potential to make gene expression correlations with a vast archive of preserved tumor material.

7. Examples of cancer biomarker discovery

7.1. Colon cancer

The first application of SAGE to human tissues was to colon cancer [40]. Comparing colon tumors to normal colon epithelium showed that less than 1.5% of the transcripts were differentially expressed. Many genes elevated in cancer represented products known to be involved in growth and proliferation, while genes found in the normal colon were often related to differentiation. SAGE was also used more recently to locate candidate biomarkers for metastasis in colon cancer, using cell lines as a model [22].

7.2. Ovarian cancer

Ovarian cancer treatment would benefit from tumor markers capable of early detection, since most ovarian cancers have already metastasized at the time of diagnosis. In order to locate ovarian cancer markers, a total of 385,000 transcripts from ten different ovarian libraries were analyzed by SAGE [9]. From this data, transcripts were identified that were high in all three primary ovarian cancers and low in all three nonmalignant specimens. A total of 27 genes were identified that met these criteria and that were over-expressed more than 10-fold in ovarian tumors. Interestingly, a majority of those genes were predicted to encode membrane or secreted proteins, making them candidates for biomarkers or tumor targeting. Many of the genes for secreted proteins encoded protease inhibitors.

7.3. Tumor vascular endothelium

Endothelial cells provide the blood supply to solid tumors and are intimately involved in supporting their growth. The tumor antigens located on tumor endothelial cells could provide an excellent target for anti-tumor therapy. SAGE was used to identify genes differentially expressed between the endothelial cells from either normal colon or colon adenocarcinoma [30]. The study detected 79 different genes differentially expressed between these tissues, including 46 that were specifically elevated in tumor-associated endothelial cells. On the basis of these results, it was suggested that endothelium growing in a tumor is more like developing endothelium, and that these differences may be clinically relevant. Nine SAGE tags elevated in the tumor corresponded to novel, uncategorized genes. These genes were named tumor endothelial markers (TEMs) and designated TEM-1 to TEM-9. Further experiments confirmed the tumor endothelium-specific expression of these genes, not only for colorectal tumors but also for other major tumor types. These TEMs or other genes identified in this study may become targets of anti-angiogenic therapies.

7.4. Brain cancer

SAGE has been used to study the most common adult malignant brain tumor, Glioblastoma Multiforme (GBM). The first SAGE analysis of GBM compared over 200,000 transcript tags from primary GBMs and normal brain cortex [12]. Approximately 1% of the genes detected were differentially expressed and included angiogenesis factors such as vascular endothelial growth factor, cell cycle regulators, and transcription factors. This data was also used by the Cancer Genome Anatomy Project to help start the public SAGEmap database and is available online at this site. Cancer-induced genes mined from this data were further tested using real-time PCR, Western and Northern Blotting to see if candidate tumor markers could be located [15]. Most of the tumor over-expressed genes predicted by SAGE could be confirmed in a subset of glioblastomas. In general, any particular antigen was highly expressed in at most about one-third of the GBMs tested, likely due to the molecular heterogeneity of this cancer. However, in combination, 75% of the tumors had at least one antigen that was strongly expressed and not present in a panel of normal neural tissues. Two antigens were located that coded for cell surface proteins and may be useful for targeting gliomas with antibody-based therapy.

Investigators have also used SAGE to study a rat C6 glioma cell model [5]. Over-expressed genes were found that were related to invasion and cell-surface interactions. The SAGE results were confirmed by Northern Blotting. Brain tumors other than GBM have been studied by expression profiling. The major malignant pediatric brain tumor, medulloblastoma, has been studied by SAGE [18]. Detailed SAGE expression profiles are also available for medulloblastomas and a variety of gliomas at the CGAP SAGEmap database [12].

7.5. Pancreatic cancer

Pancreatic cancer is another major tumor type that would benefit greatly from an effective means of early detection. SAGE was used early on to profile the genes expressed in pancreatic cancer, although it was not possible to perform a SAGE analysis on the corresponding normal pancreatic ductal epithelium [40]. Despite this limitation, an effective tumor marker for pancreatic cancer, TIMP-1, was found; it was particularly effective when used in conjunction with CA19-9 and carcinoembryonic antigen [41]. Clustering algorithms, first developed for DNA array data, were applied to SAGE expression data of pancreatic cancer [27]. In this study a group of invasion- and metastasis-specific genes was identified that may be useful as diagnostic or therapeutic markers for pancreatic cancer.

7.6. Breast cancer

SAGE was used to compare breast carcinoma cells and normal breast epithelium [2]. The gene 14-3-3 sigma was found to be reproducibly repressed in breast carcinoma. The response of breast cancer cell lines to the effects of estrogen has also been studied using SAGE [1,10]. Among the genes found were WISP-2 (a Wnt-1 inducible signaling protein) and five novel genes (E2IG1-5).
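The kind of selection described for the ovarian cancer screen in Section 7.2 (high in every tumor library, low in every nonmalignant library, with a fold-change cut-off) can be expressed as a simple filter. The Python sketch below is only an illustration: the exact criteria used in the cited study are not given here, so the comparison of the weakest tumor value against the strongest normal value, the floor that avoids division by zero, and all names and example values are assumptions.

```python
def candidate_markers(profiles, min_fold=10.0, floor=1.0):
    """Return tags over-expressed in every tumor library.

    profiles maps a tag to a pair (tumor_levels, normal_levels) of
    normalized expression values, one value per library.
    floor is an assumed minimum level substituted for very low
    normal values so that the ratio is always defined.
    """
    hits = []
    for tag, (tumor, normal) in profiles.items():
        weakest_tumor = min(tumor)
        strongest_normal = max(max(normal), floor)
        if weakest_tumor / strongest_normal > min_fold:
            hits.append(tag)
    return sorted(hits)


# Hypothetical normalized levels in three tumor and three normal libraries.
profiles = {
    "TAG_A": ([120, 95, 200], [2, 0, 5]),    # high in all tumors
    "TAG_B": ([120, 3, 200], [2, 0, 5]),     # low in one tumor
    "TAG_C": ([50, 60, 70], [10, 20, 5]),    # under 10-fold difference
}
print(candidate_markers(profiles))
```

Requiring the weakest tumor value to exceed the strongest normal value is deliberately strict; it trades sensitivity for a shorter candidate list that is cheaper to validate by an independent method.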

8. Discussion and future goals

SAGE is currently one of the most useful methods for profiling as many of the expressed transcripts in a population of cells as possible. It provides perhaps the best chance to obtain an accurate and comprehensive picture of expressed transcripts in a particular tissue, although the technique is time-consuming and laborious for multiple samples. Fortunately, the growing amount of public data makes it possible to search for candidate tumor biomarkers directly, or to augment private datasets with public data.

The first challenge is to determine how to find, from complex gene expression data, the best candidates for tumor markers or antigens. Improved bioinformatics and computational methods allow the data to be queried more easily, but much progress is still needed to integrate SAGE and other sources of molecular information in a meaningful way. Validation of candidate biomarkers at the RNA level is now much quicker with the use of real-time PCR techniques. In situ hybridization or immunohistochemistry can be used to determine if all cells within a tumor are expressing the marker or if there is some small population of normal cells that highly expresses the gene of interest. When it is necessary to screen large sample sets for protein levels, immunohistochemistry using tissue microarrays can provide a rapid means [11]. Various improvements in proteomic technology may also eventually provide a means to assay proteins on a level as comprehensive as currently available for mRNA [3].

One overall conclusion that can be made from gene expression profiling of cancer is that tumors, even with identical histopathology, are very heterogeneous at the expression level. Although this makes it difficult to select a molecular-based therapy using just histology, application of gene expression measurements to clinical samples will make it possible to identify the tumor that expresses the antigens for which a therapy is available.

The rate-limiting step for tumor marker application or discovery is still the work required to show that the marker will be clinically useful. It is, therefore, important that the best candidate markers or antigens can be predicted with some degree of accuracy from gene expression data. It still remains to be seen if the candidate markers or antigens discovered initially by SAGE will produce useful clinical tests or therapies. Although this process will take several or many years, it seems appropriate to use the most comprehensive data sets possible and careful validation of the candidates prior to embarking on the laborious task of further developing a tumor-specific gene for clinical use.

References

[1] A.H. Charpentier, A.K. Bednarek, R.L. Daniel et al., Effects of estrogen on global gene expression: identification of novel targets of estrogen action, Cancer Res 60 (2000), 5977-5983.
[2] A.T. Ferguson, E. Evron, C.B. Umbricht et al., High frequency of hypermethylation at the 14-3-3 sigma locus leads to gene silencing in breast cancer, Proc Natl Acad Sci USA 97 (2000), 6049-6054.
[3] D. Figeys and D. Pinto, Proteomics on a chip: promising developments, Electrophoresis 22 (2001), 208-216.
[4] E. Gilboa, S.K. Nair and H.K. Lyerly, Immunotherapy of cancer with dendritic-cell-based vaccines, Cancer Immunol Immunother 46 (1998), 82-87.
[5] J.M. Gunnersen, V. Spirkoska, P.E. Smith, R.A. Danks and S.S. Tan, Growth and migration markers of rat C6 glioma cells identified by serial analysis of gene expression, Glia 32 (2000), 146-154.
[6] T.C. He, C. Rago, H. Hermeking et al., Identification of c-MYC as a target of the APC pathway, Science 281 (1998), 1509-1512.
[7] T.C. He, T.A. Chan, B. Vogelstein and K.W. Kinzler, PPARdelta is an APC-regulated target of nonsteroidal anti-inflammatory drugs, Cell 99 (1999), 335-345.
[8] H. Hermeking, C. Lengauer, K. Polyak et al., 14-3-3 sigma is a p53-regulated inhibitor of G2/M progression, Mol Cell 1 (1997), 3-11.
[9] C.D. Hough, C.A. Sherman-Baust, E.S. Pizer et al., Large-scale serial analysis of gene expression reveals genes differentially expressed in ovarian cancer, Cancer Res 60 (2000), 6281-6287.
[10] H. Inadera, S. Hashimoto, H.Y. Dong et al., WISP-2 as a novel estrogen-responsive gene in human breast cancer cells, Biochem Biophys Res Commun 275 (2000), 108-114.
[11] J. Kononen, L. Bubendorf, A. Kallioniemi et al., Tissue microarrays for high-throughput molecular profiling of tumor specimens, Nat Med 4 (1998), 844-847.
[12] A. Lal, A.E. Lash, S.F. Altschul et al., A public database for gene expression in human cancers, Cancer Res 59 (1999), 5403-5407.
[13] A. Lal, I.-M. Siu and G. Riggins, Serial Analysis of Gene Expression: Probing transcriptomes for molecular targets, Current Opinion in Molecular Therapeutics 1 (1999), 720-726.
[14] A.E. Lash, C.M. Tolstoshev, L. Wagner et al., SAGEmap: A public gene expression resource, Genome Res 10 (2000), 1051-1060.
[15] W.T. Loging, A. Lal, I.-M. Siu et al., Identifying potential tumor markers and antigens by database mining and rapid expression screening, Genome Res 10 (2000), 1393-1402.
[16] S.L. Madden, E.A. Galella, J. Zhu, A.H. Bertelsen and G.A. Beaudry, SAGE transcript profiles for p53-dependent growth regulation, Oncogene 15 (1997), 1079-1085.
[17] S.L. Madden, C.J. Wang and G. Landes, Serial analysis of gene expression: from gene discovery to target identification, Drug Discov Today 5 (2000), 415-425.
[18] E.M.C. Michiels, E. Oussoren, M. Van Groenigen et al., Genes differentially expressed in medulloblastoma and fetal brain, Physiol Genomics 1 (1999), 83-91.
[19] H. Moch, P. Schraml, L. Bubendorf et al., High-throughput tissue microarray analysis to evaluate genes uncovered by cDNA microarray screening in renal cell carcinoma, Am J Pathol 154 (1999), 981-986.
[20] T.B. Morrison, J.J. Weis and C.T. Wittwer, Quantification of low-copy transcripts by continuous SYBR Green I monitoring during amplification, Biotechniques 24 (1998), 954-958, 960, 962.
[21] R.G. Panchal, Novel therapeutic strategies to selectively kill cancer cells, Biochem Pharmacol 55 (1998), 247-252.
[22] A. Parle-McDermott, P. McWilliam, O. Tighe, D. Dunican and D.T. Croke, Serial analysis of gene expression identifies putative metastasis-associated transcripts in colon tumour cell lines, Br J Cancer 83 (2000), 725-728.
[23] K. Polyak, Y. Xia, J.L. Zweier, K.W. Kinzler and B. Vogelstein, A model for p53-induced apoptosis, Nature 389 (1997), 300-305.
[24] J. Powell, SAGE. The serial analysis of gene expression, Methods Mol Biol 99 (2000), 297-319.
[25] M.S. Rajeevan, S.D. Vernon, N. Taysavang and E.R. Unger, Validation of array-based gene expression profiles by real-time (kinetic) RT-PCR, J Mol Diagn 3 (2001), 26-31.
[26] G.J. Riggins and R. Strausberg, Genome and genetic resources from the Cancer Genome Anatomy Project, Hum Mol Genet (2001), in press.
[27] B. Ryu, J. Jones, M.A. Hollingsworth, R.H. Hruban and S.E. Kern, Invasion-specific genes in malignancy: serial analysis of gene expression comparisons of primary and passaged cancers, Cancer Res 61 (2001), 1833-1838.
[28] D. Scheurle, M.P. DeYoung, D.M. Binninger, H. Page, M. Jahanzeb and R. Narayanan, Cancer gene discovery using digital differential display, Cancer Res 60 (2000), 4037-4043.
[29] P. Schraml, J. Kononen, L. Bubendorf et al., Tissue microarrays for gene amplification surveys in many different tumor types, Clin Cancer Res 5 (1999), 1966-1975.
[30] B. St Croix, C. Rago, V. Velculescu et al., Genes expressed in human tumor endothelium, Science 289 (2000), 1197-1202.


[31] R.L. Strausberg, K.H. Buetow, M.R. Emmert-Buck and R.D. Klausner, The cancer genome anatomy project: building an annotated gene index, Trends Genet 16 (2000), 103-106.
[32] R.L. Strausberg, C.A. Dahl and R.D. Klausner, New opportunities for uncovering the molecular basis of cancer, Nat Genet 15 (1997), 415-416.
[33] V.E. Velculescu, S.L. Madden, L. Zhang et al., Analysis of human transcriptomes, Nat Genet 23 (1999), 387-388.
[34] V.E. Velculescu, B. Vogelstein and K.W. Kinzler, Analysing uncharted transcriptomes with SAGE, Trends Genet 16 (2000), 423-425.
[35] V.E. Velculescu, L. Zhang, B. Vogelstein and K.W. Kinzler, Serial analysis of gene expression, Science 270 (1995), 484-487.
[36] C.T. Wittwer, M.G. Herrmann, A.A. Moss and R.P. Rasmussen, Continuous fluorescence monitoring of rapid cycle DNA amplification, Biotechniques 22 (1997), 130-131, 134-138.
[37] C.T. Wittwer, K.M. Ririe, R.V. Andrew, D.A. David, R.A. Gundry and U.J. Balis, The LightCycler: a microvolume multisample fluorimeter with rapid temperature control, Biotechniques 22 (1997), 176-181.
[38] J.T. Wu, Review of circulating tumor markers: from enzyme, carcinoembryonic protein to oncogene and suppressor gene, Ann Clin Lab Sci 29 (1999), 106-111.
[39] J. Yu, L. Zhang, P.M. Hwang, C. Rago, K.W. Kinzler and B. Vogelstein, Identification and classification of p53-regulated genes, Proc Natl Acad Sci USA 96 (1999), 14517-14522.
[40] L. Zhang, W. Zhou, V.E. Velculescu et al., Gene expression profiles in normal and cancer cells, Science 276 (1997), 1268-1272.
[41] W. Zhou, L.J. Sokoll, D.J. Bruzek et al., Identifying markers for pancreatic cancer by gene expression analysis, Cancer Epidemiol Biomarkers Prev 7 (1998), 109-112.


Cancer proteomics: The state of the art

Paul C. Herrmann (a), Lance A. Liotta (a) and Emanuel F. Petricoin III (b,*)

(a) Clinical Proteomics Program, Laboratory of Pathology, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
(b) Clinical Proteomics Program, Division of Therapeutic Proteins, CBER, Food and Drug Administration, Bethesda, MD 20892, USA

Now that the human genome has been determined, the field of proteomics is ramping up to tackle the vast protein networks that both control and are controlled by the information encoded by the genome. The study of proteomics should yield an unparalleled understanding of cancer as well as an invaluable new target for therapeutic intervention and markers for early detection. This rapidly expanding field attempts to track the protein interactions responsible for all cellular processes. By careful analysis of these systems, a detailed understanding of the molecular causes and consequences of cancer should emerge. A brief overview of some of the cutting edge technologies employed by this rapidly expanding field is given, along with specific examples of how these technologies are employed. Soon cellular protein networks will be understood at a level that will permit a totally new paradigm of diagnosis and will allow therapy tailored to individual patients and situations.

Keywords: Proteomics, laser capture microdissection, cancer, signal transduction

* Address for correspondence: Emanuel F. Petricoin III, Tissue Proteomics Unit, Division of Therapeutic Products, Center for Biologics Evaluation and Research, FDA, Bldg. 29A Room 2B02, 8800 Rockville Pike, Bethesda, MD 2089, USA. E-mail: petricoin@cber.fda.gov.
Disease Markers 17 (2001) 49-57 ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved

1. Introduction

The human genome is now mapped [1,2]. For the first time in history it is possible to take the full measure of human heredity. Genomics is impacting science with seemingly endless possibilities; however, the challenges presented by cancer continue to be quite daunting. Cancer still lacks a definition based on molecular criteria alone, and a completely robust correlation between cancer and DNA-based changes has not been found [3]. While the genome provides the underlying blueprint of life, an information archive, the proteins do the work of the cell. Most licensed therapeutics and diagnostics work by targeting or analyzing the proteome. The recent publication of the network of protein interactions in the yeast Saccharomyces cerevisiae demonstrates the vast complexity buried in protein interaction networks and unrecoverable from the genome itself [4]. In the absence of an understanding of protein changes, the information contained in the genome yields only a limited view of the full repertoire of tantalizing leads for effective new drug targets and markers for early disease detection present in a cell.

The relatively new field of proteomics seeks to expand this view. Just as the genome denotes the entire DNA code of a cell, the proteome denotes the entire protein complement of a cell, quantitatively as well as qualitatively [5]. The aforementioned yeast example illustrates proteome complexity: a single network of 1,548 proteins encompassing 2,358 interactions was discovered, in addition to several smaller networks [4]. Through concurrent evaluation of multiple variations in this proteome, a better understanding of cell function and malfunction may be gleaned.

Proteomics, from the very core, tackles a greater number of variables than does genomics. There are 20 coded amino acids rather than 4 nucleotides. Multiple copies of individual proteins exist that vary with cellular environment as well as cell stage. To further complicate the study, there is no known way to amplify specific proteins other than by increasing transcription; the luxury of a PCR-like method for amplifying a single protein molecule does not exist.

At first glance it might appear that simply monitoring mRNA levels would yield proteomic information. Recently, two pharmaceutical companies have entered into a collaboration based on this premise [6]. Hooper et al. demonstrated an example of such a study, analyzing the effect of commensal gut flora on endothelial mRNA expression in mice using cDNA methods. The authors noted that the mRNA levels of 105 transcripts changed more than two-fold after colonization with


B. thetaiotaomicron. Seventy-one of these transcripts were assigned to known genes, while 34 transcripts were from uncharacterized genes [7]. While producing very interesting and valuable findings, this approach does not address the actual changes at the protein level. Protein concentration is not determined by mRNA level alone, and translation does not occur at the same rate on all mRNA molecules present. Post-translational processing and degradation also run at variable rates [8]. In order to understand proteins and protein interactions, the peptide molecules must be studied directly.
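The point that mRNA level alone does not fix protein concentration can be illustrated with a deliberately simple kinetic sketch. It assumes, purely for illustration, first-order translation and degradation, so that dP/dt = k_translation * M - k_degradation * P and the steady-state protein level is P = k_translation * M / k_degradation. The model, the rate constants and the gene labels below are hypothetical and are not taken from the studies cited here.

```python
def steady_state_protein(mrna: float, k_translation: float, k_degradation: float) -> float:
    """Steady-state protein level for an assumed first-order model.

    dP/dt = k_translation * mrna - k_degradation * P = 0
    =>  P = k_translation * mrna / k_degradation
    """
    if k_degradation <= 0:
        raise ValueError("k_degradation must be positive")
    return k_translation * mrna / k_degradation


# Two hypothetical genes with identical mRNA abundance but different
# translation and degradation rates (arbitrary units).
genes = {
    "gene_A": {"mrna": 100.0, "k_translation": 2.0, "k_degradation": 0.25},
    "gene_B": {"mrna": 100.0, "k_translation": 0.5, "k_degradation": 1.0},
}

protein_levels = {name: steady_state_protein(**params) for name, params in genes.items()}

for name, level in protein_levels.items():
    print(f"{name}: mRNA = {genes[name]['mrna']:.0f}, protein = {level:.0f}")
```

Under these assumed rates the two genes would look identical in an mRNA survey yet differ sixteen-fold at the protein level, which is the kind of discrepancy that direct protein measurement is meant to capture.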

2. Disease markers for diagnosis and tailored treatment

Although still in its infancy, and currently heavily focused on techniques and methods, the field of proteomics already promises many applications to clinical medicine. These fall under two main categories: the diagnosis of disease states, and the discovery of treatment targets. In the realm of disease marker discovery, major advances are occurring already, thanks to this new field. Examples include the recent documentation of potential new markers for invasive breast carcinoma, including Proliferating Cell Nuclear Antigen and some members of the stress protein family [9], a pair of markers for lung adenocarcinoma, TA01/TA02 [10], and the observation by our group of a decrease in Annexin 1 in prostate cancer [11]. As the understanding of protein interactions and networks improves, markers more closely tied to disease specifics should surface.

When a more complete understanding of which factors are causative in disease is gained, new targets for treatment should emerge. An early example of new target discovery is the identification of 25 signaling targets of the MAP Kinase pathway by proteomic analysis. Only five of these had been previously characterized as MKK/ERK effectors [12]. While a very early result, this report illustrates the power of proteomics in unraveling the interplay of multiple variables.

The two main categories of diagnosis and treatment targets can be brought together to specifically tailor treatment for individual patients. Tailored treatment is already being utilized in the care of Hodgkin's disease and is being considered for coronary artery disease [13,14]. Rosenthal and Schwartz have published some criteria to be used in establishing links between genomic variations and disease in the field of patient-tailored therapy. They require that the change in genetics must cause an alteration at the protein level, the beneficial and harmful phenotypes must have apparent clinical differences, the hypothesis linking genotype and phenotype must be convincing, and the number of exemplary cases must be sufficient to draw conclusions [15]. The field of proteomics is well equipped to satisfy such criteria, all of which require an evaluation of protein levels and changes. In the future, changes at the protein level alone, without definable genomic alterations, may be sufficient for individual patient-tailored therapy [16,17].

3. Separation technology

The first step in proteomic evaluation is choosing an appropriate specimen to study. The most easily obtained samples consist of tissue homogenates. These have the advantage of large size, but the disadvantage of heterogeneity. A bulk tumor may consist of cancerous cells, histologically diverse normal cells and stromal components such as vasculature and lymphocytes [16]. The effects of disease may be diluted or masked by the non-cancerous components, and changes in the non-cancerous components may be mistaken for disease markers. One method that partially circumvents these problems is the production of cell line cultures, which yields a homogeneous population of cells but may not accurately reflect the proteomic state of the original tumor in the actual patient. A recent study demonstrated only a 20% similarity in the proteomes between cell lines and laser capture microdissected tumor epithelium. In the same study, similarity between tumor and normal tissue obtained from a single patient, and even between other patients, was near 95% [17]. The ideal material for evaluation should be procured in a patient-matched manner, isolating tumor and normal tissue from the same specimen when possible. By comparison of such patient-matched material, the true effects of disease can be isolated from interpersonal differences. When tissue separation is carried out on a microscopic scale, diseased cells can be specifically selected and then compared with specifically selected non-diseased cells from the same individual organ. Such a method of tissue separation has been made possible by the invention of the Laser Capture Microscope (LCM) (Fig. 1) [18,19]. After appropriate fixation and staining of a specimen on a standard microscope slide, the slide is placed on the LCM stage and a region of interest is delineated. A cap with a film of low melting temperature plastic is placed over the sample and, at the push of a button, an adjustable circle of 6-30 micron


Fig. 1. Illustration of the laser capture microdissection (LCM) system. The transfer film is melted by a pulse from the laser beam and adheres to the cells of interest for transfer. Illustration is from [51].

diameter is melted onto the sample by a laser. When the cap is picked up, the tissue over which the circle was melted adheres to the cap. The adherent cells can then be lysed by standard methods [18]. The advantages of this technique are numerous. Particularly noteworthy are low energy activation, which leaves the cell's original proteome unaltered, accuracy of tissue removal, and the very small tissue quantities required for separation. While the technique is convenient to use, procuring large quantities of tissue, on the order of 200,000 cells or more, can be prohibitively time-consuming. Another microscopic tissue dissection system that has been described utilizes UV light to "blow away" tissue that is not of interest. The remaining tissue is then transferred to an appropriate medium by a laser pulse [20]. One method of tissue sampling makes use of a tissue array. Several specimens are placed adjacent to each other on a microscope slide for concurrent evaluation, as illustrated by Kononen et al. Specimens are obtained

by core drilling a donor block with a thin-walled, sharpened stainless steel tube of 0.6 mm diameter. Several such core samples are then placed in an array pattern in a wax block, which is sectioned sequentially, thereby producing a polka-dot tissue array on a glass slide. The tissue is then fixed and analyzed by standard immunohistochemical means [21]. The method permits the concurrent evaluation of protein expression in many specimens. However, the heterogeneity of tumor specimens dictates that occasionally samples will be excised that contain no tumor cells at all, and the histologic diversity of the sample dilutes the observable effects of disease. Furthermore, analysis of protein content is limited by antigen retrieval, the inherent subjectivity of immunohistochemistry and the inability to perform analysis on rare cell types such as microscopic premalignant lesions.

Fig. 2. Illustration of protein lysate array technique. Selected microdissected cells are lysed and spotted directly onto a nitrocellulose membrane. The membrane is then probed with antibodies specific to proteins of interest. Illustration is from [22].

Recently, a new "reverse lysate array" technology has been described by Paweletz et al. (Fig. 2), which may provide a more thorough approach to translational multiplexed analyte analysis. The authors applied lysates of microdissected esophageal material by pin array to nitrocellulose membranes. The membranes were then probed with specific antibodies for the phosphorylation status of the signal proteins AKT and ERK, illustrating pro-survival pathways at the cancer invasion front. The study demonstrated very high protein sensitivity. Protein quantities found in less than 4 x 10^-4 cell equivalents were detected. Only about 1000 molecules were necessary for detection, and concentration sensitivity was demonstrated through dilution curves [22]. This technology will prove very valuable for detecting low quantities of protein. By using longitudinal patient-matched microdissected material, comparison can be made between normal, low-grade and high-grade premalignant lesions, and diseased tissue from the same

patient without masking by histologic diversity. Probing the arrays with antibodies provides specific protein information, and arraying of samples permits very high throughput. The technique's only limitations lie in the time required to procure samples and the need for antibodies previously made and purified against known proteins. However, once acquired, microdissected cell lysate libraries from as few as 2000 cells can be used to produce several hundred arrays, each of which can be probed with a specific antibody of interest recognizing a new biomarker for early disease detection, a surrogate endpoint for therapeutic efficacy, or even a new therapeutic target.

In the absence of specific antibodies, protein separation is one of the most important steps in the entire process. Nonspecifically detectable proteins that are inseparable are unobservable. No method has been found that will separate the proteome in its entirety in a single step; methods must be used in series to separate specific parts of the proteome for analysis. Techniques such as multiplexed tandem liquid and affinity chromatography followed by MS-MS nanoESI mass spectrometry currently require protein concentrations that leave the minor components of the human proteome undetectable [23]. In the future, however, this technology may ultimately provide a non-gel-based solution to proteome mining.

Currently, the predominant technique for protein purification and separation in proteomics is two-dimensional (2D) gel electrophoresis (Fig. 3). Proteins are first separated along one axis according to charge in an isoelectric focusing step. The gel is then exposed to an electrical gradient over a perpendicular axis, along which proteins migrate according to the inverse of their molecular weights [24]. Better separation is achieved than by a traditional gel based on either technique alone, but it is clear that there is not a one-to-one correlation between spots and proteins. Although new approaches such as the development of "zoom gels" to expand the separation range of the technology are increasing resolving capacity, current size limitations and insensitive staining methods place restrictions on the required quantities and physical characteristics of the proteins separable by this technique [23]. 2D-gel electrophoresis will probably remain useful in magnification and resolution of specific regions of the proteome, but will always have limitations precluding high-throughput assessment of the entire proteome simultaneously on a single gel format.

In the past, many investigators have analyzed lysates from cell lines [25-28] and human tissue [29-35] by 2D-PAGE to look at tumor-specific alterations in protein expression for new marker and target discovery. Image databases were developed to map proteins expressed in specific cell types and at defined stages of tumor progression [36-38]. All of these annotations are derived from cell type-enriched human tissue. The recent ability to identify new potential disease markers from actual laser capture microdissected cells from stained human tissue specimens has enabled the analysis of protein expression not only in the affected diseased epithelium, but also in the surrounding stroma, normal epithelium, and, importantly, the premalignant lesions [39-41]. 2D-gel profiles from LCM-procured patient-matched normal and cancer epithelium have enabled the discovery of several new potential marker candidates for prostate and esophageal cancers. Intriguingly, these proteins were not detected in stromal cells procured from the same patient tissue sections.
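The two-step separation described in this section can be pictured as placing each protein at a coordinate given by its isoelectric point on one axis and the logarithm of its molecular weight on the other. The sketch below is only a cartoon under stated assumptions: the protein names, the pI and molecular weight values, and the resolution thresholds are hypothetical, and a real gel depends on many factors that are not modeled. It illustrates why spots and proteins need not correspond one to one, since proteins that are close in both dimensions land in the same spot.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# (name, isoelectric point, molecular weight in kDa); all values hypothetical.
Protein = Tuple[str, float, float]


def assign_spots(
    proteins: List[Protein],
    pi_resolution: float = 0.1,       # assumed resolvable pI difference
    log_mw_resolution: float = 0.02,  # assumed resolvable log10(MW) difference
) -> Dict[Tuple[int, int], List[str]]:
    """Group proteins into gel 'spots' by binning pI and log10(MW)."""
    spots: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for name, pi, mw_kda in proteins:
        x = math.floor(pi / pi_resolution)                    # first dimension
        y = math.floor(math.log10(mw_kda) / log_mw_resolution)  # second dimension
        spots[(x, y)].append(name)
    return dict(spots)


sample = [
    ("protein_1", 5.43, 42.0),
    ("protein_2", 5.46, 42.5),  # nearly identical pI and size to protein_1
    ("protein_3", 5.44, 18.0),  # similar pI, much smaller
    ("protein_4", 8.95, 42.2),  # similar size, very different pI
]

spots = assign_spots(sample)
for coordinate, names in sorted(spots.items()):
    print(coordinate, names)
```

In this toy sample, four proteins yield three spots: the third and fourth proteins are each resolved by one of the two dimensions, while the first two co-migrate in both.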


Fig. 3. Illustration of 2-D gel electrophoresis. In step one, the proteins are separated along a narrow strip of gel on the basis of pI in an isoelectric focusing step. In step two, the gel strip from step one is applied to a larger gel and separation is made based on molecular

4. Protein analysis technology

Once the proteins are separated, methods of analysis must be brought to bear on the separated entities. Protein sequencing by Edman degradation is the most specific method of analysis, but it requires a large quantity and high purity of protein [42]. Proteins removed from 2D gels are being sequenced by this method, since the information available has not been exhausted within the current technical limitations, but other methods are needed.

Mass spectral analysis recently has been explored as a detection and protein identification method. When coupled to separation techniques, the resulting technology can be very powerful and will probably be one of the main avenues of future exploration. After appropriate molecule charging, the mass spectrometer detects molecules on the basis of their mass-to-charge ratio (Fig. 4). In the time-of-flight configuration, molecules are charged and accelerated through an electric field, and a recording is made of how long they take to travel a specified distance and strike a detector. The longer the time, the more massive the particle relative to its charge. Mass accuracy in the range of a few parts per million is possible through recent innovations. A more sensitive method consists of monitoring the radio frequency (rf) of a circulating population of charged particles in a cyclotron. Fourier transform of the rf signal yields the individual mass-to-charge ratios of the members of the population. This technique has an extremely low detection limit, but instruments are currently very expensive [43,44].

Fig. 4. Conceptual illustration of mass spectrometry. An ionized particle will have a trajectory through a magnetic field dependent on its mass to charge ratio. By allowing charged particles to pass through such a field a separation is achieved which can be analyzed by a variety of detectors.

Mass spectral analysis can yield sequence information, though the complete sequence cannot be determined in all cases [43-45]. Pattern analysis shows which ions or fragments contain ammonia- or water-losing species. Ammonia can be lost from the N-terminal amino acid, lysine or arginine, and water can be lost from serine or threonine. The fragment containing the N-terminus is identified and the fragments are reconstructed by computer. Proteins can then be functionalized with deuterium or with reactive groups of known mass, such as acetyl groups, and the resulting data used to further specify the protein sequence. The comparison of fragment fingerprints with databases of known proteins shortens the entire process considerably. Consequently, as more proteins are discovered and characterized, mass spectral analysis will improve.

Another advantage of mass spectrometry is that it can be used as a separating technique, allowing analysis of an inhomogeneous sample. Tandem mass spectrometry uses the mass spectrometer to isolate an ionized protein. The isolated ionized protein is subsequently fragmented through a second charging cycle and the resulting fragment pattern is analyzed for structural information [43-45].

Particles must be charged to be observable by mass spectrometry. The charging process has a separating effect, so picking the appropriate method allows detection of variable parts of the proteome. Electrospray ionization is accomplished by putting the molecules in a solvent in which ions are generated. The specimen is sprayed into an electric field under vacuum. In the vacuum, the uncharged solvent evaporates away, gently concentrating charge onto ionizable molecules, which are then analyzed [43-45]. Matrix-Assisted Laser Desorption Ionization Time-of-Flight (MALDI-TOF) charges particles through excitation of the matrix by a laser [43]. The matrix then transfers energy to the species contained within it. Surface-Enhanced Laser Desorption Ionization Time-of-Flight (SELDI-TOF) utilizes a similar phenomenon, except that it has a unique protein baiting technology coupled to it on the front end, enabling the selection and purification of classes of proteins before MALDI-based analysis. Investigators have successfully coupled this technology to LCM to discover new disease marker patterns and to perform molecular fingerprinting of stages of human prostate cancer, as well as rapid profiling of colon, esophageal, breast, and ovarian cancer [46]. An example of the results of these studies is shown in Fig. 5.

Fig. 5. Illustration of molecular profiling and fingerprinting using SELDI-TOF coupled with laser capture microdissection (LCM), showing protein patterns that are unique to each human cancer type. A density plot of the mass chromatogram is shown as a protein "bar code". Selected tissue is lysed and the lysate applied to a H4 reverse-phase chip. The chip is then analyzed by SELDI methodology. Illustration is from [46].

SELDI has recently been used in a variety of applications, including protein profiling in a search for soft tissue regeneration genes [47], monitoring Alzheimer's beta-amyloid production [48] and analysis of the proto-oncogene TCL1 as an Akt kinase co-activator [49]. Mendrinos et al. also recently used SELDI in the discovery of urine protein biomarkers in bladder cancer patients [50]. Both MALDI and SELDI are powerful members of the growing list of proteomic technologies enabling the discovery of disease markers and therapeutic targets.
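The two detection principles described in this section reduce to simple physical relations, which the following sketch evaluates. For an idealized linear time-of-flight instrument, an ion of mass m and charge ze that is accelerated through a potential V and then drifts a distance d arrives after t = d * sqrt(m / (2 z e V)). In a cyclotron, the same ion circulates in a magnetic field B at frequency f = z e B / (2 pi m). The drift length, accelerating voltage, field strength and masses below are illustrative assumptions, not parameters of any instrument discussed here, and the model ignores initial velocity spread, time spent in the acceleration region, and any ion optics.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per dalton


def flight_time(mass_da: float, charge: int,
                voltage: float = 20_000.0, drift_length: float = 1.0) -> float:
    """Drift time in seconds for an idealized linear time-of-flight tube.

    voltage (V) and drift_length (m) are assumed example values.
    """
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage  # z * e * V
    velocity = math.sqrt(2.0 * kinetic_energy / mass_kg)
    return drift_length / velocity


def cyclotron_frequency(mass_da: float, charge: int,
                        field_tesla: float = 7.0) -> float:
    """Cyclotron frequency in hertz; field_tesla is an assumed example value."""
    mass_kg = mass_da * DALTON
    return charge * ELEMENTARY_CHARGE * field_tesla / (2.0 * math.pi * mass_kg)


for mass in (1_000.0, 4_000.0, 16_000.0):
    t_us = flight_time(mass, charge=1) * 1e6
    f_khz = cyclotron_frequency(mass, charge=1) / 1e3
    print(f"m/z {mass:>8.0f}: flight time {t_us:6.1f} us, cyclotron {f_khz:7.2f} kHz")
```

Flight time grows with the square root of the mass-to-charge ratio, so a four-fold heavier ion of the same charge takes twice as long to arrive. Cyclotron frequency falls in inverse proportion to the mass-to-charge ratio, which is why a Fourier transform of the recorded signal can resolve the individual species in a mixed population.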

5. Looking back, looking forward

As we look back over the last decade, many changes are apparent in the understanding of microbiology and biochemistry. The level of detail to which the various aspects of normal and aberrant cellular function, cell signaling, respiration, division, and death are understood is many times greater than it was even a few years ago. The explosion in biotechnology and in the products produced for detection and treatment of disease is only now beginning in earnest. The completion of the genome project will only serve to expand the coverage and is now ushering in the next step toward understanding the cellular basis of disease. Proteomics, because of its unique position for the elucidation of the components that make up the actual molecular targets for therapy and disease markers, stands poised to take up and carry on the progress made to date.



Analysis of expression patterns: The scope of the problem, the problem of scope

Yidong Chen (a), Zohar Yakhini (b), Amir Ben-Dor (b), Edward Dougherty (c), Jeffrey M. Trent (a) and Michael Bittner (a,*)
(a) Cancer Genetics Branch, National Human Genome Research Institute, NIH, Bethesda, MD 20892, USA
(b) Chemical and Biological Systems Department, Agilent Laboratories, Palo Alto, CA 94304, USA
(c) Department of Electrical Engineering, Texas A & M University, College Station, TX 77843, USA
(*) Address for correspondence: Michael Bittner, NHGRI/NIH, Building 49, Room 4A52, 9000 Rockville Pike, Bethesda, MD 20850, USA. Tel.: +1 301 496 7980; Fax: +1 301 402 3241; E-mail: mbittner@nhgri.nih.gov.

Disease Markers 17 (2001) 59-65 ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved

Studies of the expression patterns of many genes simultaneously lead to the observation that even in closely related pathologies, there are numerous genes that are differentially expressed in consistent patterns correlated to each sample type. The early uses of the enabling technology, microarrays, were focused on gathering mechanistic biological insights. The early findings now pose another clear challenge: finding ways to use this kind of information effectively to develop diagnostics.

1. Introduction

The profiling of transcription patterns is a central and long-standing tool in molecular biology. Recently, it has become possible to gather this kind of data for many genes simultaneously, allowing a wider view of the transcriptional activity of a particular cell type or tissue [1,2]. The fundamental assumption concerning the utility of such transcript abundance profiles is that transcription profiles convey information about the processes operating in a cell of a given type or state. The view obtained is limited to those parts of operations directly influencing the transcriptional activity of the cell. Still, it is obvious that even a clear and complete listing of the variances in transcription between the operations of a healthy cell and those of a diseased cell of the same type would provide useful information. At a minimum, those differences that are most extreme could provide useful markers to differentiate the diseased cells from the healthy ones in routine diagnostic testing. In some cases, this information could conceivably be much more useful. In the field of oncology, our particular area of interest, the variances might point to some difference that could be exploited to kill or terminally differentiate the cancerous cells through some form of treatment.

These tantalizing possibilities have led many researchers to pursue the development of methods for gathering and analyzing expression data. The experiences gained through these early efforts have begun to outline useful approaches to collecting and analyzing expression profile data and to identify the most serious obstacles to realizing the desired benefits. This review will summarize some of the strengths of the data viewing and analysis approaches used to date, and sketch some of the approaches and limitations to the development of more powerful forms of analysis.

2. Methods of analysis

2.1. Clustering/correlation

As might be expected, the first attempts to harness the information provided in profiles centered on the most readily detected form of relationship in these kinds of observations: simple correlation of similarities [3]. The typical way to view similarities is to perform a clustering operation. This type of comparison is meant as a very preliminary way to look at data, a way of discovering trends. The methods do not offer any estimate of how the variance arising from the biology of the system or from the measurement procedure will affect the reproducibility of the resulting groupings. Additionally, the similarity measurements used frequently incorporate averaging and normalization steps that make it impossible to compare the resulting sets of similarity measurements in a quantitative fashion.
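The correlate-then-cluster idea described above can be sketched as follows. This is a minimal illustration only: the gene names, the expression values, the 0.9 threshold and the helper names `pearson` and `correlation_clusters` are hypothetical, and simple single-linkage grouping stands in for whichever clustering method a given study actually used.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / (sd_x * sd_y)


def correlation_clusters(profiles, threshold=0.9):
    """Group genes whose profiles correlate at or above `threshold`.

    Single linkage: two genes share a cluster if a chain of
    above-threshold correlations connects them (union-find).
    """
    names = list(profiles)
    parent = {g: g for g in names}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for g in names:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical profiles measured across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [10.0, 20.0, 30.0, 40.0],  # same shape as geneA, ten-fold larger
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 4.2, 1.9],
}
clusters = correlation_clusters(profiles)
```

In this toy example geneA and geneB differ ten-fold in magnitude yet fall in the same cluster, because the centering and scaling built into the correlation measure discard absolute scale. This is one concrete form of the normalization effect noted above.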


Y Chen et al. / Analysis of expression patterns: The scope of the problem, the problem of scope

In spite of these drawbacks, correlation and clustering have a rich history in mathematics and engineering, and a large and growing number of the extant approaches to clustering data are being examined in the context of expression profile analysis. They provide a very quick way to see the most evident relationships and, when supplemented with other forms of analysis or connections to prior knowledge, can help identify differences in expression worth further consideration.

2.2. Gene by gene correlation

At the start of efforts to gather transcription profiles, there were two simple expectations for clustering results, both based on historical knowledge of transcriptional regulation. The first was that genes responding to a given type of signal, such as a fundamental change in metabolism, would be identifiable in data from a sample series that spanned such a transition, due to the similarity of their transcriptional response to the signal. This was quickly demonstrated in yeast undergoing a diauxic shift [4]. Many subsequent experiments in a variety of systems have demonstrated that correlating similarities of response can be a very powerful way of grouping genes involved in processes such as serum response [5] or cell cycling [6,7]. Very complex data sets containing mixtures of developmental time-course and mutant analysis series have been shown to allow clustering along functional lines [8]. A further strength of this form of correlation is its ability to group genes whose expression is driven by the same kinds of cis-regulatory sites. Demonstrations of the clustering of genes sharing known or new cis-regulatory sites have been presented in several studies of transcription dynamics associated with the yeast cell cycle [6,9].

While these results demonstrate the competence of clustering to disclose explicit co-regulation, both these studies and one examining transcription during meiosis [10] suggest cautious expectations for the extent of utility of clustering in disclosing cis-regulatory sharing. Even in yeast, it appears to be difficult to identify a large fraction of the motifs that must be in place. Clustering paired with promoter sequence searching is most useful for those elements that have the least ambiguity and the greatest length. This signal-to-noise constraint makes the cluster-and-search strategy likely to be most useful in identifying genes sharing already discovered cis-regulatory elements in multi-cellular organisms. This is due to their large genome size, their typically short and variable cis-regulatory sequences and their highly combinatorial use of these regulators (extensively reviewed by Davidson in [11]).

The results of a number of systematic informatics efforts can also be expected to enhance the kinds of insight that correlation can provide. Various efforts to associate the currently available knowledge of genes' functions and relationships with other genes are under development. One approach that has already reached a fairly high level of sophistication is the production of gene ontologies [12]. This is a large, collaborative effort that seeks to provide curated information about the biological role of genes in many different organisms using a unitary set of classifiers (controlled vocabulary). As the human gene ontology becomes more complete, it can be used to provide a summary view of the various biological activities represented in a particular cluster. Known genes in other species could supplement this process by suggesting that a human EST of unknown function that is related by sequence to the known gene may have a function that makes sense in the context of the cluster in which it falls. Similar aid in deciphering function may become available from a less supervised, indexing approach to evaluating gene functions. One example of this type of approach is the High-density Array Pattern Interpreter program (http://array.ucsd.edu/tools.htm). This program uses controlled terminology hierarchies, based on the National Library of Medicine's Medical Subject Headings, to delineate how genes have been described in the published literature. In general, any further characterization that can be associated with a gene is likely to improve the odds of estimating whether a gene with a biologically interesting expression pattern should be further studied as a candidate marker or target.

2.3. Sample by sample correlation

The second a priori expectation for expression patterns was that it would be possible to discern differing types of healthy and pathological cells by considering the overall profile of similarities and differences across many genes. There have been many demonstrations of the use of overall transcription similarity measures to group samples into their previously known classes [13-15]. There have also been demonstrations of the practicality of using this strategy to discover classes within samples that could not be subdivided in this fashion by conventional measurements [16-18]. Surveying the uses of this strategy, it has become apparent that there are both differing biological bases driving the separations and a wide spectrum in the number of genes that strongly contribute to the separation.

One clear component of the separation between cell types, a "tissue of origin" component, arises from the particular differentiation state of the cells being studied. It is easy to imagine that vastly different cell types, such as muscle versus neuron, would show considerable differences in their expression of specialized gene products and be readily classified based on those differences. It was less easy to predict that even lymphomas arising from close relatives in the B-cell lineage have a large number of transcriptional differences that can be easily exploited for classification [16]. Some other ways in which surprising degrees of variation have been seen are evident in a study of tumor material from breast cancer [17]. A very clear finding of this study was the distinctiveness of individual tumors. Primary tumors and their metastases were found to have the highest degree of similarity in this study, which encompassed a variety of breast tumor types. That a tumor growing at a site and time distant from its primary is much more similar to its primary than to another metastasis arising from the same type of tissue implies that there is a large "space" of transcriptional settings available to a developing tumor. It also implies that the "position" the tumor occupies in that space is well separated from that chosen by other tumors of similar origin.

As the volume of expression space inhabited by cells of different types has a significant impact on the ease, resolution and reliability of discrimination between diseases that can be achieved via expression profiling, it is worth looking at the amount and magnitude of variance between samples. It is difficult to get a feel for this from the processed data usually presented in papers, since the most common representation is a color scale, which tends to understate the differences.
A much different intuition is conveyed by a scatter plot comparing the average intensities of the two fluorescent cDNA probes hybridized at each immobilized detector on a cDNA array. Three such plots are presented in Fig. 1. Panels A and B show that when an mRNA pool derived from either a melanoma cell line or a myeloid cell line is profiled against itself, there is very little variance in the intensity of any of the genes detected. Panel C shows the contrasting result of widespread variation, ranging from minor to major differences, when these different mRNA pools are compared to each other. This is not an exceptional result; it is the typical outcome. It appears that every cell state has unique settings for the expression levels of most of the genes expressed in the cell, including both the ubiquitously expressed genes and the genes specific to particular states. This level of idiosyncrasy of expression differences provides both vast opportunities and complications for the identification of useful disease markers. The likelihood of finding a small set of accurate, discriminative markers that could be pressed into service in a traditional format such as immunohistochemistry increases with the number of differentially expressed genes available; however, one must be ready to sift many candidates for consistency and tolerance to noise.

2.4. Simple gene ranking methods

When looking at the numerous correlates between genes and samples that result from a large array study, one is frequently overwhelmed by the many possibly interesting genes that it would make sense to study, based on the relevance of their known biological activities to the system being explored. To further reduce the number of candidates, one can choose a variety of filters based on the pattern of gene expression in the various sample types. Two simple tools have been described for carrying out such analysis. One, the "Weighted List" method, is based on estimating the compactness within sample groups and the separation between sample groups that the expression values of a given gene or genes would produce [15,18]. This approach is conceptually related to the standard statistical F- and t-tests. A geometric interpretation of the weight value produced by this analysis is presented in Fig. 2. The other method, Threshold Number of Misclassification (TNoM), is based on finding the minimal error rate of separation for ranked samples. The samples are ranked according to the expression values a gene produces, and then separated at the point that produces the fewest misclassifications [18-20].

Both of these methods are based on the supposition that those genes whose action is of strong consequence to the biological differences between the samples will accurately separate them. The Weighted List method emphasizes the average extent to which expression values separate the samples, making it a useful tool for finding genes with relatively large shifts in expression between sample types. The TNoM method emphasizes the integrity of the separation, making it useful for finding genes that accurately separate the samples but do not have as large ratio shifts. Genes that produce particularly clear discriminations achieve high scores with both measurements. Both methods of analysis are tested by forming permuted sample groups of the


Fig. 1. Scatter-plots of average channel intensity per gene. The average red (y-axis) and green (x-axis) intensities at each immobilized gene detector element on an array of approximately 7000 genes are plotted. A) RNA from cell line ML1 used for both channels. B) RNA from cell line UACC903 used for both channels. C) RNA from cell line UACC903 used for red channel, RNA from cell line ML1 used for green channel. (From [22]).

same size as the authentic sample sets, but with randomized membership. Running and scoring a thousand such permuted sets provides an empirical estimate of the most extreme weight or TNoM value expected in a random collection of biological samples, and hence a useful estimate of the threshold that a value must pass to be considered significant. Figure 3 is a diagram showing this kind of analysis applied to a fairly homogeneous subset of 19 melanomas versus 12 melanomas having much greater diversity in their expression profiles [18]. The black line depicts the actual number of genes able to separate the samples with the indicated level of accuracy (on the x-axis). The gray line depicts the expected number of such separating genes when 19 samples are chosen uniformly at random and designated as a class. The error bars indicate the 95% confidence interval for these numbers, under the same stochastic model. The difference between the authentic sample curve and the permuted sample/theoretical curve shows that there are many genes whose expression pattern aligns with the sample sets in a very non-random way. A similar differential is seen with the Weighted List results. A sharp overabundance of informative or highly separating genes has also been observed in other studies [14-16].

Other approaches to finding highly discriminative genes include ones where methods similar or identical to those used in formal statistical sample classification are explored. Examples include studies of differences in gene expression between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) [14], and between breast tumors arising in patients with or without the breast cancer predisposing mutations BRCA1 or BRCA2 [15]. The methods utilized in these studies explored the results of allowing larger numbers of genes to participate to varying extents in a decision about what class a given sample was in. Such studies provide another way of probing the robustness of the differentiation in expression patterns. An estimate of the consistency and robustness of the differentiation is achieved by serially building the decision function using all of the samples save one, for each of the N possible sample sets of size N-1, a process known as leave-one-out cross-validation. The results may be further queried to determine whether a significant fraction of identical deciding genes are employed in all sets and whether the classification is approximately equally accurate in all cases. Forming permuted sample groups with randomized membership and reconstituting and re-scoring a classifier, as above, allows estimation of the significance of the achieved classification.

These and many other approaches to finding discriminating genes for further study in mechanistic or diagnostic settings are in the early phases of development. A more refined sense of their practical utility will emerge as experimental determination of the importance of the high scoring genes to the phenotypic differences between the sample sets is carried out.

2.5. Classification

The ability to employ microarray methodology to carry out formal diagnostic classification of tissue is a reasonable long-term goal, given the demonstrated ability of the method to discern differences in the patterns of gene expression between normal and diseased


Fig. 2. Weighted Discriminator Method. Assuming $K$ categories (or clusters) for a set of samples, a discriminative weight for each gene can be evaluated by $w = \mathrm{Average}(BD) / (\mathrm{Average}(WD) + \alpha)$, where Average(BD) is the average of the between-cluster Euclidean distance for all pairs of clusters (a total of $K(K-1)/2$ pairs), and Average(WD) is the weighted average of the within-cluster distance (weighted by the number of samples in the cluster). The within-cluster distance is the average distance of all pairs of samples in the cluster. $\alpha$ is a small constant that prevents a zero denominator. (See http://www.nhgri.nih.gov/DIR/Microarray/discriminative.html).
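The weight defined in the Fig. 2 caption can be sketched for a single gene as follows. This is an illustrative reading of the caption, not the authors' implementation. Two points are assumptions: the between-cluster distance is taken as the distance between cluster means, which the caption does not specify, and for one gene the Euclidean distance reduces to an absolute difference. The function name, the default value of the small constant (called `alpha` here) and the example values are hypothetical.

```python
from itertools import combinations


def discriminative_weight(groups, alpha=1e-6):
    """Weight w = Average(BD) / (Average(WD) + alpha) for one gene.

    `groups` is a list of clusters, each a list of that gene's
    expression values in the samples of the cluster.
    """
    if len(groups) < 2:
        raise ValueError("at least two clusters are required")

    # Assumption: between-cluster distance = distance between cluster means.
    means = [sum(g) / len(g) for g in groups]
    between = [abs(a - b) for a, b in combinations(means, 2)]  # K(K-1)/2 pairs
    avg_bd = sum(between) / len(between)

    # Within-cluster distance: mean distance over all sample pairs in a
    # cluster, then averaged across clusters weighted by cluster size.
    total = sum(len(g) for g in groups)
    avg_wd = 0.0
    for g in groups:
        pairs = [abs(a - b) for a, b in combinations(g, 2)]
        within = sum(pairs) / len(pairs) if pairs else 0.0
        avg_wd += len(g) * within / total

    return avg_bd / (avg_wd + alpha)


# Hypothetical values: one gene separating two groups cleanly, one not.
w_tight = discriminative_weight([[1.0, 1.2, 0.8], [3.0, 3.1, 2.9]])
w_loose = discriminative_weight([[1.0, 3.0, 2.0], [2.5, 1.5, 3.5]])
```

A gene with compact, well-separated groups receives a much larger weight than one whose groups overlap, which matches the emphasis on the average extent of separation described in the text.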

tissue, and between differing types of diseased tissue. At present there appear to be two main obstacles to making profiling a sufficiently practical form of diagnostic to find wide use. The first difficulty is an analytical one. How can very good candidate diagnostic panels be rapidly developed from profile data? The ideal panel would be one that used a very small number of genes, each of which provided at least some unique information (i.e., information that was not equivalent to the contribution of the other genes) and which was relatively insensitive to the levels of biological variance and measurement noise routinely encountered. The problems associated with finding small classifier gene sets that meet robustness and uniqueness criteria, using expression-profiling data, have been concisely reviewed by Dougherty [21]. The general basis of the problem is that expression studies tend to be carried out as surveys aimed at developing insight into the biological mechanics of pathology. In studies of human tissue the goal has been to sample the largest number of genes possible with the limited number of tissue samples and microarrays available. A consequence of this strategy has been that in most cases, there are neither sufficient numbers of samples nor sufficient numbers of replications of data sets to get good estimates of the error rates over the general population of the various genes in classification. Given small sample sets and large numbers of genes being sampled, it becomes possible to identify many small sets of genes for which the estimated error of classification is zero. In many cases, this estimate will not be markedly improved by small-sample-number validation procedures, such as leave-one-out cross-validation. As was mentioned in the Gene Ranking section above, there are


Fig. 3. Threshold Number of Misclassifications data from melanoma study [18]. The black line shows the number of genes in the original data set capable of producing the given number of misclassifications. The gray line is the result if the samples in the sets containing 19 and 12 members are permuted. Error bars show the calculated 95% confidence interval for the same size data set if gene expression behavior is independent and random relative to the samples.
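The TNoM score and the permutation comparison summarized in Fig. 3 can be sketched as follows. This is a minimal illustration under stated assumptions: class labels are coded as 0 and 1, expression values are assumed to have no ties, and the permutation routine records only the best (lowest) score found among all genes for each shuffled labeling. The function names and example values are hypothetical and are not taken from the cited studies.

```python
import random


def tnom(values, labels):
    """Threshold Number of Misclassifications for one gene.

    Samples are ranked by expression value; every cut point and both
    orientations (class 1 above or below the cut) are tried, and the
    smallest number of misclassified samples is returned.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranked = [labels[i] for i in order]
    n = len(ranked)
    total_pos = sum(ranked)
    best = n
    pos_left = 0  # class-1 samples ranked below the current cut
    for cut in range(n + 1):
        neg_left = cut - pos_left
        pos_right = total_pos - pos_left
        neg_right = (n - cut) - pos_right
        # Orientation 1: below the cut is class 0, above is class 1.
        # Orientation 2: the reverse.
        best = min(best, pos_left + neg_right, neg_left + pos_right)
        if cut < n:
            pos_left += ranked[cut]
    return best


def permuted_best_scores(gene_values, labels, n_perm=1000, seed=0):
    """Best TNoM over all genes for each of `n_perm` shuffled labelings."""
    rng = random.Random(seed)
    shuffled = list(labels)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        scores.append(min(tnom(v, shuffled) for v in gene_values))
    return scores


# Hypothetical data: three genes measured in six samples (3 vs 3).
labels = [0, 0, 0, 1, 1, 1]
genes = [
    [0.1, 0.4, 0.35, 0.8, 0.9, 1.2],  # separates the classes perfectly
    [0.1, 0.9, 0.35, 0.4, 0.8, 1.2],  # one sample on the wrong side
    [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],   # classes interleaved
]
observed = [tnom(v, labels) for v in genes]
null_scores = permuted_best_scores(genes, labels, n_perm=200)
```

Comparing `observed` against the distribution in `null_scores` indicates whether a low misclassification count is better than would be expected from randomly assigned sample groups of the same sizes.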

many genes whose differences in expression pattern are aligned with differences in sample type in a very simple way, being more highly expressed in one sample type than the other. This produces a considerable overlap of the information content in relation to sample type in these sets of genes, with the attendant problem that combining these genes in a classifier can easily lead to decreased performance via increased noise. There is therefore an urgent need for some readily computable analysis of the data that will help identify the most noise-resistant and least redundant classifier gene sets.

In addition to the problems of designing a classifier and choosing the genes that will provide the highest accuracy in the classifier, there are pragmatic problems that further complicate the use of expression profiling as a diagnostic. The primary analyte in the technique is mRNA, which is much less stable than DNA or protein, placing considerable constraints on sample collection. The methods of converting the mRNA into a species that can be detected and scored are very sensitive to the integrity of the mRNA and to contaminants that copurify with the mRNA during sample preparation. The technique, as now practiced, requires significantly more cells than many diagnostics, and could be confounded by the presence or variable content of cells other than disease cells in the sample. Were the technology to be used in diagnosis, its focus would need to be shifted from breadth of examination toward precision. The starting point for practical diagnosis would be a small, specialized set of genes, not as many genes as possible, with sufficient replicates of this set to provide the required degree of measurement precision.
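The small-sample concern raised in this section, that many genes can show zero estimated error by chance and that leave-one-out cross-validation does not remove the problem, can be illustrated with a short simulation. This is a sketch under stated assumptions only: every "gene" is pure Gaussian noise with no relationship to the class labels, the classifier is a simple nearest-class-mean rule on a single gene, and the gene count, class sizes, seed and function names are hypothetical choices made for illustration.

```python
import random


def loocv_errors(values, labels):
    """Leave-one-out errors of a nearest-class-mean rule on one gene."""
    errors = 0
    for held in range(len(values)):
        sums = {0: 0.0, 1: 0.0}
        counts = {0: 0, 1: 0}
        # Build class means from every sample except the held-out one.
        for i, (v, c) in enumerate(zip(values, labels)):
            if i != held:
                sums[c] += v
                counts[c] += 1
        means = {c: sums[c] / counts[c] for c in (0, 1)}
        # Assign the held-out sample to the class with the nearer mean.
        predicted = min((0, 1), key=lambda c: abs(values[held] - means[c]))
        if predicted != labels[held]:
            errors += 1
    return errors


def count_chance_classifiers(n_genes=2000, n_per_class=4, seed=1):
    """Count pure-noise genes that reach zero leave-one-out error."""
    rng = random.Random(seed)
    labels = [0] * n_per_class + [1] * n_per_class
    perfect = 0
    for _ in range(n_genes):
        noise = [rng.gauss(0.0, 1.0) for _ in labels]
        if loocv_errors(noise, labels) == 0:
            perfect += 1
    return perfect


chance_hits = count_chance_classifiers()
```

With only a handful of samples per class and thousands of genes screened, some noise-only genes are expected to classify every held-out sample correctly. This is why a zero cross-validated error on a small sample set is weak evidence on its own, and why a narrower gene set with more replicates is the more natural starting point for a diagnostic.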

3. Conclusion

Expression profiles can be seen to provide a rich source of data on the differential expression of genes between cell states. Early results have demonstrated that it is possible to find many genes that exhibit state-dependent patterns of expression, even between closely related pathologies. The expression studies carried out to date have been too limited in scope to provide the large amounts of data needed for confident design of classifiers based on expression data; however, even with limited data the trends are encouraging. Technological improvements will continue to increase the precision and reproducibility of measurement that can be achieved. Larger studies designed to support the development of disease markers will no doubt be undertaken. In the shorter term, a good analytic method for identifying robust candidate classifier gene panels based on smaller sample numbers could be developed. With such a tool, it may be possible to use limited information to construct immunohistochemical assays, usable within the sphere of current diagnostic practice.

References

[1] M. Schena, D. Shalon, R.W. Davis and P.O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270(5235) (1995), 467-470.
[2] D.J. Lockhart, H. Dong and M.C. Byrne et al., Expression monitoring by hybridization to high-density oligonucleotide arrays, Nat Biotechnol 14(13) (1996), 1675-1680.
[3] M.B. Eisen, P.T. Spellman, P.O. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc Natl Acad Sci USA 95(25) (1998), 14863-14868.
[4] J.L. DeRisi, V.R. Iyer and P.O. Brown, Exploring the metabolic and genetic control of gene expression on a genomic scale, Science 278(5338) (1997), 680-686.
[5] V.R. Iyer, M.B. Eisen and D.T. Ross et al., The transcriptional program in the response of human fibroblasts to serum, Science 283(5398) (1999), 83-87.
[6] P.T. Spellman, G. Sherlock and M.Q. Zhang et al., Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Mol Biol Cell 9(12) (1998), 3273-3297.
[7] P. Tamayo, D. Slonim and J. Mesirov et al., Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc Natl Acad Sci USA 96(6) (1999), 2907-2912.
[8] A. Ben-Dor, R. Shamir and Z. Yakhini, Clustering gene expression patterns, J Comput Biol 6(3-4) (1999), 281-297.
[9] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho and G.M. Church, Systematic determination of genetic network architecture, Nat Genet 22(3) (1999), 281-285.
[10] M. Primig, R.M. Williams and E.A. Winzeler et al., The core meiotic transcriptome in budding yeasts, Nat Genet 26(4) (2000), 415-423.
[11] E.H. Davidson, Genomic Regulatory Systems in Development and Evolution, Academic Press, London, 2001.
[12] M. Ashburner, C.A. Ball and J.A. Blake et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat Genet 25(1) (2000), 25-29.
[13] J. Khan, R. Simon and M. Bittner et al., Gene expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays, Cancer Res 58(22) (1998), 5009-5013.
[14] T.R. Golub, D.K. Slonim and P. Tamayo et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286(5439) (1999), 531-537.
[15] I. Hedenfalk, D. Duggan and Y. Chen et al., Gene-expression profiles in hereditary breast cancer, N Engl J Med 344(8) (2001), 539-548.
[16] A.A. Alizadeh, M.B. Eisen and R.E. Davis et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403(6769) (2000), 503-511.
[17] C.M. Perou, T. Sorlie and M.B. Eisen et al., Molecular portraits of human breast tumours, Nature 406(6797) (2000), 747-752.
[18] M. Bittner, P. Meltzer and Y. Chen et al., Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature 406(6795) (2000), 536-540.
[19] A. Ben-Dor, L. Bruhn and N. Friedman et al., Tissue classification with gene expression profiles, J Comput Biol 7(3-4) (2000), 559-583.
[20] A. Ben-Dor, N. Friedman and Z. Yakhini, Scoring genes for relevance, Palo Alto, Agilent Technologies, 1999.
[21] E.R. Dougherty, Small sample issues for microarray-based classification, Comp Funct Genom 2 (2001), 28-34.
[22] Y. Jiang, J. Lueders, A. Glatfelter, C. Gooden and M. Bittner, Profiling human gene expression with cDNA microarrays, in: Current Protocols in Human Genetics, N. Dracopoli, ed., John Wiley & Sons, New York, 2000.


Alternative spliced transcripts as cancer markers

Otavia L. Caballero^b, Sandro J. de Souza^b, Ricardo R. Brentani^a,b and Andrew J.G. Simpson^b,*

^a Hospital do Cancer AC Camargo, Sao Paulo, Brazil
^b Ludwig Institute for Cancer Research, Sao Paulo, Brazil

Eukaryotic mRNAs are transcribed as precursors containing their intronic sequences. These are subsequently excised and the exons are spliced together to form mature mRNAs. This process can lead to transcript diversification through the phenomenon of alternative splicing. Alternative splicing can take the form of one or more skipped exons, variable position of intron splicing or intron retention. The effect of alternative splicing in expanding the protein repertoire might partially underlie the apparent discrepancy between gene number and the complexity of higher eukaryotes. It is likely that more than 50% of human genes produce more than one transcript form. Many cancer-associated genes, such as CD44 and WT1, are alternatively spliced. Variation of the splicing process occurs during tumor progression and may play a major role in tumorigenesis. Furthermore, alternatively spliced transcripts may be extremely useful as cancer markers, since it appears likely that normal and tumor tissue contrast more strikingly in their usage of alternatively spliced transcript variants than in their general levels of gene expression.

* Address for correspondence: Dr. Andrew J.G. Simpson, Ludwig Institute for Cancer Research, Rua Prof. Antonio Prudente 109 - 4o andar, Sao Paulo, SP 01509-010, Brazil. Tel.: +55 11 2704922; Fax: +55 11 2707001; E-mail: [email protected].

Disease Markers 17 (2001) 67-75. ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved.

1. Introduction

The improved management of human cancer will depend on early detection, more accurate prognostic assessment and the availability of a larger selection of therapeutic agents. There is no doubt that the current intense exploitation of the human transcriptome will make a fundamental contribution in each of these areas. We know that cancer is the result of surprisingly subtle changes in overall gene expression caused either by mutation of key regulatory genes or by epigenetic phenomena such as aberrant methylation. Quantification of gene expression using microarrays or the serial analysis of gene expression (SAGE) indicates that only some 5% of genes have altered expression levels in tumors as compared with corresponding normal tissues. The percentages that exhibit down regulation and up regulation are approximately equal. Moreover, the majority of differentially expressed genes encode proteins associated with basic cellular functions and in particular reflect the altered proliferative state of the malignant cell. They are thus neither causal agents of the tumor nor attractive drug targets. Alteration of the levels of expression is also on the whole rather modest, with very few genes consistently showing more than 5-fold differences between normal and tumor samples. Because the alterations are so limited, using these genes as tumor markers requires sensitive and accurate detection techniques as well as very well defined tissue samples.

Many human genes remain to be identified owing to the draft stage of the available human genome sequence, the impossibility of accurate gene prediction using currently available computational tools and the lack of sufficient transcript sequence data. Microarrays, in particular, are only as useful as their DNA content permits. Thus, as we complete the inventory of human genes and place them on microarrays, examples of genes that are more fundamentally transcriptionally regulated may be discovered. These are likely to be rather poorly expressed genes, however, as most genes that exhibit higher levels of expression are already fully defined. Again, these changes will be relatively difficult to use in clinical assays.

The complete definition of the human transcriptome will not only permit the eventual global assessment of differential gene expression but will also lead to the identification of all the different transcript forms that can originate from the same gene. It is possibly in this area that the most exciting prospects for harnessing the power of genomics in the fight against cancer lie. Possibly the majority of genes in the human genome produce alternative transcripts that, when translated, lead to protein products with altered structure and function. Crucially, there is already evidence that such alternative transcript generation, through the process of alternative splicing as described below, can accompany the process of tumorigenesis. Indeed, the generation of alternatively spliced transcripts may prove to be one of the causal steps in the development of some cancers. In this instance, particular splicing forms may be completely present or completely absent in a tumor as opposed to the corresponding normal tissue, rendering them highly suitable both as cancer markers and as potential drug targets. Although this area of research still remains to be explored in any detail (indeed, it is likely that the vast majority of human transcript types still remain unknown to us), the preliminary data available raise exciting possibilities.

2. Alternative transcript splicing

Eukaryotic mRNAs are transcribed as precursors containing their intronic sequences. Subsequently, these are excised and the exons are spliced together to form mature mRNAs. This process can lead to diversification through the phenomenon of alternative splicing. Variation in mRNA structure takes many different forms [1,2] that include the use of cryptic donor and acceptor splicing sites, exon skipping and the use of intronic sequence as an exon. Also, the positions of either 5' or 3' splice sites can shift to make exons longer or shorter. In addition to these changes in splicing, alterations in the transcriptional start site or the polyadenylation site also allow production of multiple mRNAs from a single gene (Fig. 1). The consequences of alternative splicing range from switching expression of a protein on and off, by excluding and including stop codons, to structural and functional diversification of protein products. Usually, the process is highly regulated so that particular splicing patterns occur only under particular conditions. It is becoming clear that alternative splicing has an extremely important role in expanding the protein repertoire and might therefore partially underlie the apparent discrepancy between gene number and the complexity of higher eukaryotes. Indeed, alternative splicing can generate more transcripts from a single gene than the total number of genes in an entire genome [3]. As yet, however, for the vast majority of alternative splicing events their functional significance remains unknown [4].
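As an illustration of how exon skipping alone multiplies the transcripts available from one gene, the following minimal sketch enumerates every combination of included and skipped cassette exons. The four-exon gene and the exon names are hypothetical, and the model deliberately ignores cryptic splice sites, intron retention and alternative start or polyadenylation sites.

```python
from itertools import product


def enumerate_isoforms(exons, cassette):
    """Return every exon combination obtained by including or skipping each cassette exon.

    `exons` is the ordered list of exon names in the pre-mRNA and `cassette`
    is the set of exons that may be skipped. All other exons are always kept.
    """
    optional = [e for e in exons if e in cassette]
    isoforms = []
    for choice in product([True, False], repeat=len(optional)):
        keep = dict(zip(optional, choice))
        # Constitutive exons are absent from `keep`, so they default to True.
        isoforms.append(tuple(e for e in exons if keep.get(e, True)))
    return isoforms


# Hypothetical four-exon gene in which E2 and E3 are cassette exons.
gene = ["E1", "E2", "E3", "E4"]
isoforms = enumerate_isoforms(gene, cassette={"E2", "E3"})
for iso in isoforms:
    print("-".join(iso))
```

With n independent cassette exons the sketch yields 2^n exon combinations, which is an upper bound only: as noted above, splicing is usually tightly regulated, so not every combination need occur in a cell.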

The effect of altered mRNA splicing on the structure of the encoded protein can be profound [1,2]. In some transcripts, whole functional domains are added to or subtracted from the protein coding sequence. In other systems, the introduction of an early stop codon can result in a truncated protein (transforming a membrane-bound protein into a soluble one, for example) or in an unstable mRNA. Alternative splicing is also commonly used to control the inclusion of particular short peptide sequences within a longer protein. These optional sequence cassettes range from one to hundreds of amino acids in length, and have many specific effects on the activity of a protein product. Changes in splicing have been shown to determine the ligand binding of growth factor receptors and cell adhesion molecules, and to alter the activation domains of transcription factors [1,2]. Furthermore, the splicing pattern of an mRNA may determine the subcellular localization of the encoded protein, the phosphorylation of proteins by kinases or the binding of an enzyme by its allosteric effector. Determining how these sometimes subtle changes in the sequence affect protein function is a crucial question in many different problems in developmental and cell biology, including control of apoptosis, neuronal connectivity, cell contraction and tumor progression [4].

The amount of variation that can be generated from a single gene through alternative splicing is astonishing. An example is the Drosophila melanogaster DSCAM gene [5]. Each transcript of the gene contains 24 exons that encode an axon guidance receptor. However, the gene contains an array of potential alternatives for exons 4, 6, 9 and 17. These exons are used in a mutually exclusive way, with 12 alternatives for exon 4, 48 for exon 6, 33 for exon 9 and 2 for exon 17. Thus, alternative splicing can potentially generate more than 38,000 transcripts from this gene!

How do cells choose between specific splicing pathways? The information content in a splicing site is limited. The most accurate computer programs achieve an identification rate of approximately 50% [2]. Mutations that destroy splice sites or create new ones are responsible for 15% of all human genetic disease. The initial commitment of an mRNA to splicing involves a series of interactions between the mRNA and several other RNAs and proteins. A family of proteins critical to splicing is the serine-arginine family of splicing factors (SR proteins) [6]. These proteins seem to act as bridges between the mRNA and several other protein factors.

Fig. 1. The RNAs of some genes follow patterns of alternative splicing, whereby a single gene gives rise to more than one mRNA sequence. The majority of genes are transcribed into RNA giving rise to a single type of transcribed mRNA. Alterations in the polyadenylation site allow production of multiple mRNAs. Exons can be substituted, added or deleted. Introns that are normally excised can be retained in the mRNA. The positions of either 5' or 3' splice sites can shift to make exons longer or shorter.

Alignment of both full length transcripts and expressed sequence tags (ESTs) has provided a minimum estimate that 35% of human genes exhibit alternatively spliced products [7]. The same rate was observed by Hanke et al. [8] in a set of 475 human proteins that were aligned to the human EST databases. It is widely thought, however, that this value is probably an underestimate, since the transcript sequence information is derived from a limited number of tissues and developmental stages and as yet covers only a fraction of the human transcriptome. Lander et al. [9], for example, have found that 59% of all genes mapped on chromosome 22 have more than one splicing variant. Even so, the rate of alternative splicing already detected in human genes seems to be considerably higher than in C. elegans, where approximately 20% of all genes have at least 2 splicing variants. It is quite common to find human genes with dozens of splicing variants, including the neurexins, N-cadherins and potassium channels. Thus the current estimate of ~30,000 human genes may translate to hundreds of thousands of proteins.
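The DSCAM figure quoted in this section follows from multiplying the numbers of mutually exclusive alternatives at each variable exon. The short sketch below reproduces that arithmetic using only the counts given in the text. The comparison with the ~30,000 human gene estimate is added here purely for scale and is not a claim made in the source about Drosophila.

```python
from math import prod

# Mutually exclusive alternatives per variable exon, as quoted in the text.
dscam_alternatives = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# One alternative is chosen independently at each position, so the counts multiply.
dscam_isoforms = prod(dscam_alternatives.values())
print(dscam_isoforms)  # 38016

# For scale only: the human gene estimate quoted in the text.
human_gene_estimate = 30_000
print(dscam_isoforms > human_gene_estimate)  # True
```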

3. Alternative splicing in cancer cells

Many cancer-associated genes are alternatively spliced. Loss of fidelity or variation of the splicing process occurs during tumor progression and may well play a major role in tumorigenesis [10-14]. In addition, controlled switching to specific splicing alternatives may also occur. Transcript variants that only occur in tumors are both potential novel drug targets and potential diagnostic markers. In addition, their detailed analysis may prove crucial to our eventual understanding of the phenomena of malignancy. Probably the best characterized of the known cancer-associated genes that exhibit cancer-associated alternative splicing are CD44 and WT1. These two genes represent two quite different situations. CD44, like the Drosophila DSCAM gene, is apparently designed to exhibit considerable variability and has a large number of variably used exons. WT1, on the other hand, has far fewer alternatives, but these are of extreme importance to the malignant process.

3.1. CD44

The CD44 gene generates a family of molecules consisting of many isoforms [15]. CD44 proteins are single chain molecules comprising an N-terminal extracellular domain, a membrane proximal region, a transmembrane domain, and a cytoplasmic tail. The extracellular domain is glycosylated. CD44 is the principal hyaluronic acid (HA) receptor, although the molecule can bind other ligands, in some cases with low affinity. The CD44 proteins bind extracellular matrix glycoproteins, such as collagens and fibronectin. The CD44 gene has only been detected in higher organisms and the amino acid sequence of most of the molecule is highly conserved between mammalian species.

The molecular diversity of this glycoprotein is generated by both post-translational modification and differential exon utilization (Fig. 2). CD44 is encoded by a single gene composed of 20 exons, located on the short arm of chromosome 11, spanning approximately 50 kb of human DNA [16]. The first 5 exons, coding for the extracellular domain, are designated the 5' constant region, whereas the next 10 exons are subjected to alternative splicing. This generates a variable region containing different exon combinations [15]. Variable region exons are designated V1 to V10. Exons 16 and 17 are the first two constant exons of the 3' constant region and they, together with part of exon 5, encode the membrane proximal region of the extracellular domain (with optional inclusion of variant exons). The next domain is the hydrophobic transmembrane region, which is encoded by exon 18 of the 3' constant region. The cytoplasmic domain is also subjected to alternative splicing. Differential utilization of exons 19 and 20 generates the short version (3 amino acids) and the long version (70 amino acids) of the cytoplasmic tail, respectively. The first 3 amino acids, common to both tails, are encoded by exon 18. The DNA sequence of exon 19 carries a long poly(A+T) tract, possibly causing instability in the mRNA of the short version. The additional amino acids of the long cytoplasmic domain are encoded by exon 20. The long version of the cytoplasmic tail is much more abundant than the shorter version [15]. The most abundant version of CD44, the standard form, lacks the entire variable region, with exon 5 of the 5' constant region being directly spliced to exon 16 of the 3' constant region [15,17]. Individual cells can simultaneously express different isoforms [15].

The major physiological role of CD44 is to maintain organ and tissue structure via cell-cell and cell-matrix adhesion, but other isoforms can also participate in cell traffic, lymph node homing, presentation of chemokines and growth factors to traveling cells and transmission of growth signals [18].
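The exon layout of CD44 described above can be summarized in a small data model. The sketch below is a toy representation built only from the exon numbering given in the text (exons 1-5, V1-V10 as exons 6-15, exons 16-18, and exon 19 or 20 for the tail). The function name `cd44_exons` and its parameters are hypothetical, and the count of 2^10 variable-exon combinations assumes, purely for illustration, that each variant exon can be included or skipped independently.

```python
# Exon numbering follows the description in the text: 20 exons in total.
FIVE_PRIME_CONSTANT = [1, 2, 3, 4, 5]
VARIABLE = {f"V{i}": 5 + i for i in range(1, 11)}  # V1..V10 map to exons 6..15
THREE_PRIME_CONSTANT = [16, 17, 18]  # membrane proximal region and transmembrane domain
SHORT_TAIL_EXON, LONG_TAIL_EXON = 19, 20


def cd44_exons(variants=(), long_tail=True):
    """Return the ordered exon numbers of a CD44 transcript in this toy model."""
    variable = sorted(VARIABLE[v] for v in variants)
    tail = LONG_TAIL_EXON if long_tail else SHORT_TAIL_EXON
    return FIVE_PRIME_CONSTANT + variable + THREE_PRIME_CONSTANT + [tail]


# Standard CD44: exon 5 spliced directly to exon 16, long cytoplasmic tail.
standard = cd44_exons()
# A variant isoform carrying only variant exon V6.
v6 = cd44_exons(variants=["V6"])

print(standard)
print(v6)
# Upper bound on variable-region combinations under the independence assumption.
print(2 ** len(VARIABLE))
```

Keeping the constant and variable regions as separate structures mirrors the way the isoforms are named in the literature cited here (CD44 standard versus CD44 V6, V3-10 and so on), so a variant can be specified by its V exons alone.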
The physiological functions of CD44 indicate that the molecule has characteristics that are consistent with it playing a role in the metastatic spread of tumors. Many studies have detailed the pattern of CD44 splicing and the transcript abundance in tumors. It has been found that changes in CD44 expression (mainly up-regulation, occasionally down-regulation, and frequently alteration in the pattern of isoforms expressed) are associated with a wide variety of cancers and the degree to which they spread. This is not universal, however; in some types of cancer, the CD44 pattern remains unchanged. Most importantly, the expression of CD44 has been shown to correlate with the progression and prognosis of some malignant tumors.

Fig. 2. Representation of CD44 genomic organization and examples of alternatively spliced transcripts. CD44 is encoded by a single gene composed of 20 exons, spanning approximately 50 kb of human DNA. The first 5 exons, coding for the extracellular domain, are designated the 5' constant region, whereas the next 10 exons, designated V1 to V10, are subjected to alternative splicing. Exons 16 and 17 are the first two constant exons of the 3' constant region and they, together with part of exon 5, encode the membrane proximal region of the extracellular domain (EC) (with optional inclusion of variant exons). The next domain is the hydrophobic transmembrane region (TM), which is encoded by exon 18 of the 3' constant region. The cytoplasmic domain (CT) encoded by exons 19 and 20 is also subjected to alternative splicing. (A) to (E) Alternative splicing enriches the CD44 repertoire. The standard, ubiquitously expressed isoform of CD44 does not contain sequences encoded by the variant exons. Differential utilization of exons 19 and 20 in the standard CD44 generates the long version (70 amino acids) (B) and the short version (3 amino acids) (C) of the cytoplasmic tail. The first 3 amino acids, common to both tails, are encoded by exon 18. Numerous variant isoforms of CD44 containing different combinations of exons V1-V10 inserted into the extracellular domain can be expressed (D and E). EC: extracellular domain, TM: transmembrane domain, CT: cytoplasmic tail.

Recent studies have shown that CD44 is involved in two of the three steps of the invasive cascade: adhesion to the extracellular matrix and cell motility [19]. CD44 may contribute to malignancy through changes in the regulation of HA recognition, the recognition of new ligands and/or other new biological functions of CD44 that remain to be discovered [15,18,20]. CD44 proteins can bind growth factors and present them to their authentic high-affinity receptors, and thus promote proliferation and invasiveness of cells. This mode of action could account for the tumor-promoting action of CD44 proteins. The second mode of action of CD44 proteins comes into play when cells reach confluent growth conditions. Under specific conditions, binding of another ligand, the ECM component hyaluronate, leads to the activation of the tumor suppressor protein merlin and its binding to the CD44 cytoplasmic tail. The activation of merlin confers growth arrest, so-called contact inhibition. This function of CD44 proteins defines them as tumor suppressors, but the type of action of CD44 on a given cell will depend on the isoform pattern of CD44 expressed [21].

Additional evidence of the importance of CD44 in tumorigenesis is that metastatic potential can be conferred on non-metastasizing cell lines by transfection with specific CD44 variants [18]. Furthermore, the introduction of antisense CD44 cDNA down-regulates expression of overall CD44 isoforms and inhibits tumor growth and metastasis in highly metastatic colon carcinoma cells [22]. Moreover, it has been shown in animal models that injection of reagents interfering with CD44-ligand interaction (e.g., CD44 standard or CD44v-specific antibodies) inhibits local tumor growth and metastatic spread. These findings suggest that CD44 may confer a growth advantage on some neoplastic cells and, therefore, could be used as a target for cancer therapy [15].

Whereas some tumors, such as gliomas, exclusively express standard CD44, other neoplasms, including gastrointestinal cancer, bladder cancer, uterine cervical cancer, breast cancer and non-Hodgkin's lymphomas, also express CD44 variants. In prostate cancer, down-regulation of both CD44 standard and CD44V6 was related to high T classification, metastasis, high Gleason score, DNA aneuploidy, high S-phase fraction, high mitotic index, perineural growth, a dense amount of tumor infiltrating lymphocytes, poor survival and unfavorable prognosis [23]. The correlation between lymph node metastasis and the expression of standard-type CD44 in cancer cells was examined immunohistologically in samples of superficially invasive colorectal cancer. In cases of invasive colorectal cancer, the loss of standard-type CD44 expression in the invaded area is a sensitive marker for metastasis to the lymph nodes [24].

The expression of CD44 isoforms has also been evaluated in breast infiltrating lobular carcinomas in a panel of 39 tumors. The expression of membranous and cytoplasmic CD44s, V3, V5, V6, V7 and V3-10 was analyzed in the infiltrating cells by immunohistochemical staining. The protein positive tumors showed membranous and/or cytoplasmic staining with all antibodies used except for CD44V7, which only displayed cytoplasmic staining. Cytoplasmic expression of CD44V3 and membranous expression of V6 were significantly associated with alveolar, classical/alveolar and mucinous/alveolar carcinomas. Furthermore, in alveolar, classical/alveolar and mucinous/alveolar carcinomas, cytoplasmic staining of CD44V5 was correlated with lymph node negative patients, whereas membranous V5 was correlated with lymph node positive patients. In classical, classical/trabecular and trabecular carcinomas, expression of membranous CD44s was significantly correlated with lymph node status [25]. The serum levels of different soluble CD44 molecules (CD44 standard form and CD44 splice variant V6) were measured with an enzyme immunoassay method in venous blood samples preoperatively collected from 100 patients with invasive breast carcinoma. Preoperative serum soluble CD44 V6 was found to be closely related to distant metastases and TNM staging, indicating that CD44 V6 may have prognostic value in breast carcinoma.

It is becoming clear that CD44 variants may be useful as diagnostic or prognostic markers in some human malignant diseases. However, the data are conflicting, and further studies are needed to establish the prognostic value of CD44 and its variant isoforms. Furthermore, the precise function of CD44 in the metastatic process and the degree of its involvement in human malignancies have yet to be established.
Nevertheless, the studies cited above provide one of the most advanced instances of the systematic study of the association between particular alternatively spliced isoforms of a protein and malignancy, and provide a proof of principle that alternatively spliced transcripts can act as cancer markers.

3.2. WT1

Wilms tumor (WT) or nephroblastoma is a pediatric kidney cancer arising from pluripotent embryonic renal precursors. Multiple genetic loci have been linked to Wilms tumorigenesis; positional cloning strategies have led to the identification of the WT1 tumor suppressor gene at chromosome 11p13 [26]. WT1 encodes a zinc finger transcription factor that is inactivated in the germline of children with genetic predisposition to Wilms tumor and in a subset of sporadic cancers. When present in the germline, specific heterozygous dominant-negative mutations are associated with severe abnormalities of renal and sexual differentiation, pointing to an essential role of WT1 in normal genitourinary development [27].

WT1 encodes a DNA binding protein that is thought to act as a transcriptional regulator. Exons 1-6 of WT1 encode domains involved in transcriptional regulation, dimerization, and possibly RNA recognition, whereas exons 7-10 encode the four zinc fingers of the DNA-binding domain. Four isoforms of WT1 are formed by alternate RNA splicing [28] (Fig. 3), but in total, twenty-four potential protein isoforms may be synthesized owing to the contribution of two alternative splicing regions, corresponding to the whole of exon 5 (17 amino acids) and to the last three codons of exon 9 (KTS), respectively [28]; a site of RNA editing at codon 281 in exon 6 (a T to C transition producing a leucine to proline substitution) [29]; a non-AUG initiation codon resulting in WT1 proteins with a higher molecular weight [30]; and an internal AUG initiation codon resulting in WT1 proteins with a lower molecular weight [31]. Biochemical and genetic evidence is accumulating that the WT1(-KTS) and WT1(+KTS) isoforms have different functions. WT1(-KTS) behaves as a transcription factor and in vitro can regulate several genes expressed during kidney development, including IGF2, PDGFA, EGFR, PAX-2, and WT1 [32]. However, the precise physiological and functional significance of this regulation is still unknown.
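One way to see where the figures of four and twenty-four WT1 isoforms come from is to enumerate the combinations explicitly. The sketch below assumes that the two splicing choices and the RNA editing site each contribute two states, and that the upstream non-AUG start, the internal AUG start and a standard AUG start are three mutually exclusive alternatives. That last reading, and the label "standard AUG", are assumptions made here to reproduce the count and are not spelled out in the text.

```python
from itertools import product

exon5 = ["+exon5", "-exon5"]          # inclusion or skipping of exon 5 (17 amino acids)
kts = ["+KTS", "-KTS"]                # last three codons of exon 9
editing = ["unedited", "edited"]      # RNA editing at codon 281 in exon 6
start = ["upstream non-AUG", "standard AUG", "internal AUG"]  # assumed exclusive starts

# Splicing alone gives the four isoforms shown in Fig. 3.
splice_isoforms = list(product(exon5, kts))
# Adding RNA editing and the choice of initiation codon gives the full set.
all_isoforms = list(product(exon5, kts, editing, start))

print(len(splice_isoforms))  # 4
print(len(all_isoforms))     # 24
```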
The most frequent splice variants include the additional 17 amino acids inserted N-terminal to the first zinc finger through the inclusion of exon 5. Insertion of the KTS tripeptide has a profound effect on both the DNA-binding affinity and the specificity of WT1. The WT1(-KTS) isoform binds to a 9-bp early growth response protein (EGR-1) consensus site with high affinity, whereas the +KTS splice variant binds to the same site 10- to 20-fold more weakly [33]. Moreover, the different KTS+/- WT1 isoforms localize to distinct compartments in the nucleus, suggesting that these two different forms of the protein have different functions. WT1(-KTS) colocalizes with other transcription factors, whereas the more abundant +KTS isoform colocalizes and is physically associated with splice factors, where it potentially functions through RNA binding [34]. Thus, the presence or absence of the KTS insert in the third linker modulates both the DNA-binding affinity and the functional distribution of WT1 within the nucleus. Indeed, it has been suggested that differences in DNA-binding affinity of the +KTS and -KTS splice variants might dictate the pattern of nuclear localization, with the tighter-binding -KTS isoform being preferentially compartmentalized with the DNA.

Fig. 3. Representation of WT1 genomic organization and alternatively spliced transcripts. The exons of WT1 (1-10) are shown in open boxes, while the introns and alternative splice sites are shaded. The carboxyl terminus of this protein contains four zinc finger domains. Four different transcripts (I to IV) are the result of alternative splicing. The most prevalent is isoform I.

The balance between isoforms with and without the 17-amino acid insertion seems to affect the regulation of proliferation, differentiation, and apoptosis and the prevention of tumor formation [35,36]. In Wilms tumors, changes in WT1 expression were found in 90% of unilateral unifocal WT cases, with 63% showing splicing alterations. Disruption of exon 5 splicing was the most frequent alteration, but alteration of exon 9 KTS splicing, with an increase in the amount of isoforms with the KTS domain, has also been observed in some tumors [12]. These results raise the possibility that regulation of splicing is an important factor in the development of the genitourinary system, and that tumors may arise through aberrant splicing. WT1 isoform imbalance may be involved in various types of cancer because it has also been reported in breast tumors [37]. The distinct functional properties of WT1 isoforms and tumor-associated variants may shed light on the link between normal organ-specific differentiation and malignancy.

The WT1 gene contrasts starkly with CD44, since very discrete and limited transcript variability appears to have a profound effect on protein function. Given the very high percentage of human genes that exhibit alternative splicing of this type, it is clear that a huge, and as yet very poorly explored, wealth of biological information exists that could profoundly alter our understanding of cancer and our ability to detect and treat this devastating disease.

4. Conclusion

These two examples provide the first glimpse of the enormous potential importance of alternative splicing to the understanding, detection and treatment of human cancer. Alternate transcripts lend themselves well to laboratory-based detection. For example, one can envisage microarrays specifically composed of exons that we know to be alternatively spliced, which could prove significantly more powerful than other microarray types for cancer diagnosis and prognosis. On a more specific basis, individual transcript variants of particular relevance to the disease can be readily detected and quantified by established PCR technologies. Furthermore, since the alternative transcripts lead to altered protein structure, it should be possible to generate monoclonal antibodies that distinguish between isoforms. These will be applicable both in immunohistochemical analysis and in ELISA determinations. Given the relatively small number of genes in the human genome, there is widespread expectation that the complexity of transcript variants may prove to be the key to understanding the functioning of the human body. We predict that they might be equally crucial in the development of complex genetic diseases such as cancer. The systematic exploration of this possibility over the current decade may thus be one of the most important routes to the development of the next generation of cancer markers.

References

[1] A.J. Lopez, Alternative splicing of pre-mRNA: developmental consequences and mechanisms of regulation, Annu Rev Genet 32 (1998), 279-305.
[2] C.W. Smith and J. Valcarcel, Alternative pre-mRNA splicing: the logic of combinatorial control, Trends Biochem Sci 25 (2000), 381-388.
[3] D.L. Black, Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology, Cell 103 (2000), 367-370.
[4] B.R. Graveley, Alternative splicing: increasing diversity in the proteomic world, Trends Genet 17 (2001), 100-107.
[5] D. Schmucker, J.C. Clemens and H. Shu et al., Drosophila Dscam is an axon guidance receptor exhibiting extraordinary molecular diversity, Cell 101 (2000), 671-684.
[6] R. Tacke and J.L. Manley, Determinants of SR protein specificity, Curr Opin Cell Biol 11(3) (1999), 358-362.
[7] L. Croft, S. Schandorff, F. Clark, K. Burrage, P. Arctander and J.S. Mattick, ISIS, the intron information system, reveals the high frequency of alternative splicing in the human genome, Nat Genet 24 (2000), 340-341.
[8] J. Hanke, D. Brett and I. Zastrow et al., Alternative splicing of human genes: more the rule than the exception? Trends Genet 15 (1999), 389-390.
[9] E.S. Lander, L.M. Linton and B. Birren et al., Initial sequencing and analysis of the human genome, Nature 409 (2001), 860-921.
[10] B. Kwabi-Addo, F. Ropiquet, D. Giri and M. Ittmann, Alternative splicing of fibroblast growth factor receptors in human prostate cancer, Prostate 46 (2001), 163-172.
[11] T.I. Orban and E. Olah, Expression profiles of BRCA1 splice variants in asynchronous and in G1/S synchronized tumor cell lines, Biochem Biophys Res Commun 280 (2001), 32-38.
[12] D. Baudry, M. Hamelin and M.O. Cabanis et al., WT1 splicing alterations in Wilms' tumors, Clin Cancer Res 6 (2000), 3957-3965.
[13] M. Mixon, F. Kittrell and D. Medina, Splice variant expression of CD44 in patients with breast and ovarian cancer, Oncol Rep 8 (2001), 145-151.
[14] M. Hori, J. Shimazaki, S. Inagawa, M. Itabashi and M. Hori, Alternatively spliced MDM2 transcripts in human breast cancer in relation to tumor necrosis and lymph node involvement, Pathol Int 50 (2000), 786-792.
[15] D. Naor, R.V. Sionov and D. Ish-Shalom, CD44: structure, function, and association with the malignant process, Adv Cancer Res 71 (1997), 241-319.
[16] G.R. Screaton, M.V. Bell, D.G. Jackson, F.B. Cornelis, U. Gerth and J.I. Bell, Genomic structure of DNA encoding the lymphocyte homing receptor CD44 reveals at least 12 alternatively spliced exons, Proc Natl Acad Sci USA 89 (1992), 12160-12164.
[17] I. Stamenkovic, A. Aruffo, M. Amiot and B. Seed, The hematopoietic and epithelial forms of CD44 are distinct polypeptides with different adhesion potentials for hyaluronate-bearing cells, EMBO J 10 (1991), 343-348.
[18] R.J. Sneath and D.C. Mangham, The normal structure and function of CD44 and its role in neoplasia, Mol Pathol 51 (1998), 191-200.
[19] A. Herrera-Gayol and S. Jothy, Adhesion proteins in the biology of breast cancer: contribution of CD44, Exp Mol Pathol 66 (1999), 149-156.
[20] J. Lesley, R. Hyman, N. English, J.B. Catterall and G.A. Turner, CD44 in inflammation and metastasis, Glycoconj J 14 (1997), 611-622.
[21] P. Herrlich, H. Morrison and J. Sleeman et al., CD44 acts both as a growth- and invasiveness-promoting molecule and as a tumor-suppressing cofactor, Ann N Y Acad Sci 910 (2000), 106-118.
[22] N. Harada, T. Mizoi and M. Kinouchi et al., Introduction of antisense CD44S cDNA down-regulates expression of overall CD44 isoforms and inhibits tumor growth and metastasis in highly metastatic colon carcinoma cells, Int J Cancer 91 (2001), 67-75.
[23] S. Aaltomaa, P. Lipponen, M. Ala-Opas and V.M. Kosma, Expression and prognostic value of CD44 standard and variant v3 and v6 isoforms in prostate cancer, Eur Urol 39 (2001), 138-144.
[24] T. Asao, J. Nakamura and Y. Shitara et al., Loss of standard type of CD44 expression in invaded area as a good indicator of lymph-node metastasis in colorectal carcinoma, Dis Colon Rectum 43 (2000), 1250-1254.
[25] H.S. Berner and J.M. Nesland, Expression of CD44 isoforms in infiltrating lobular carcinoma of the breast, Breast Cancer Res Treat 65 (2001), 23-29.
[26] K.M. Call, T. Glaser and C.Y. Ito et al., Isolation and characterization of a zinc finger polypeptide gene at the human chromosome 11 Wilms' tumor locus, Cell 60 (1990), 509-520.
[27] S.B. Lee and D.A. Haber, Wilms tumor and the WT1 gene, Exp Cell Res 264 (2001), 74-99.
[28] D.A. Haber, R.L. Sohn, A.J. Buckler, J. Pelletier, K.M. Call and D.E. Housman, Alternative splicing and genomic structure of the Wilms tumor gene WT1, Proc Natl Acad Sci USA 88 (1991), 9618-9622.

O.L. Caballero et al. / Alternative spliced transcripts as cancer markers

[29] P.M. Sharma, M. Bowman, S.L. Madden, F.J. Rauscher 3rd and S. Sukumar, RNA editing in the Wilms' tumor susceptibility gene WT1, Genes Dev 8 (1994), 720-731.
[30] W. Bruening and J. Pelletier, A non-AUG translational initiation event generates novel WT1 isoforms, J Biol Chem 271 (1996), 8646-8654.
[31] V. Scharnhorst, P. Dekker, A.J. van der Eb and A.G. Jochemsen, Internal translation initiation generates novel WT1 protein isoforms with distinct biological properties, J Biol Chem 274 (1999), 23456-23462.
[32] A. Menke, L. McInnes, N.D. Hastie and A. Schedl, The Wilms' tumor suppressor WT1: approaches to gene function, Kidney Int 53 (1998), 1512-1518.
[33] W.A. Bickmore, K. Oghene, M.H. Little, A. Seawright, V. van Heyningen and N.D. Hastie, Modulation of DNA binding specificity by alternative splicing of the Wilms tumor wt1 gene transcript, Science 257 (1992), 235-237.
[34] S.H. Larsson, J.P. Charlieu and K. Miyagawa et al., Subnuclear localization of WT1 in splicing or transcription factor domains is regulated by alternative splicing, Cell 81 (1995), 391-401.
[35] S.M. Hewitt and G.F. Saunders, Differentially spliced exon 5 of the Wilms' tumor gene WT1 modifies gene function, Anticancer Res 16 (1996), 621-626.
[36] C. Englert, X. Hou and S. Maheswaran et al., WT1 suppresses synthesis of the epidermal growth factor receptor and induces apoptosis, EMBO J 14 (1995), 4662-4675.
[37] G.B. Silberstein, K. Van Horn, P. Strickland, C.T. Roberts Jr. and C.W. Daniel, Altered expression of the WT1 Wilms tumor suppressor gene in human breast cancer, Proc Natl Acad Sci USA 94 (1997), 8132-8137.


Searching for pharmacogenomic markers: The synergy between omic and hypothesis-driven research

John N. Weinstein*

Laboratory of Molecular Pharmacology, National Cancer Institute, Bethesda, MD, USA

With 35,000 genes and hundreds of thousands of protein states to identify, correlate, and understand, it no longer suffices to rely on studies of one gene, gene product, or process at a time. We have entered the "omic" era in biology. But large-scale omic studies of cellular molecules in aggregate rarely can answer interesting questions without the assistance of information from traditional hypothesis-driven research. The two types of science are synergistic. A case in point is the set of pharmacogenomic studies that we and our collaborators have done with the 60 human cancer cell lines of the National Cancer Institute's drug discovery program. Those cells (the NCI-60) have been characterized pharmacologically with respect to their sensitivity to > 70,000 chemical compounds. We are further characterizing them at the DNA, RNA, protein, and functional levels. Our major aim is to identify pharmacogenomic markers that can aid in drug discovery and design, as well as in individualization of cancer therapy. The bioinformatic and chemoinformatic challenges of this study have demanded novel methods for analysis and visualization of high-dimensional data. Included are the color-coded "clustered image map" and also the MedMiner program package, which captures and organizes the biomedical literature on gene-gene and gene-drug relationships. Microarray transcript expression studies of the 60 cell lines reveal, for example, a gene-drug correlation with potential clinical implications - that between the asparagine synthetase gene and the enzyme-drug L-asparaginase in ovarian cancer cells.

Keywords: Microarray, genomics, proteomics, omics, cancer, cell line, pharmacology, pharmacogenomics, molecular marker, cancer therapy, clustered image map, MedMiner

* Address for correspondence: Dr. J.N. Weinstein, M.D., Ph.D., Bldg. 37, Rm 4E-28, NIH, 9000 Rockville Pike, Bethesda, MD 20892, USA. Tel.: +1 301 4969571; Fax: +1 301 4020752; E-mail: [email protected]. Disease Markers 17 (2001) 77-88 ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved

1. Introduction

A prediction: Future historians of science will refer to the turn of the millennium as a watershed, the start of a Golden Age of biomedical science [1]. They will note - in passing and without much excitement - the half-century prodromal period after Watson-Crick in 1953, during which increasingly powerful techniques were developed to study one gene or gene product at a time and during which the foundations of high-throughput molecular biology were laid down. But they will be distinctly impressed by completion of the DNA sequences of small organisms just before the turn of the century and quasi-completion of the human sequence soon after it. Beyond simply the sequence, they will focus on development at this time of large databases on transcript and protein expression patterns, single nucleotide polymorphisms, chromosomal aberrations, and epigenetic changes. They will appreciate the increasing integration of these massive new molecular biology databases with those from structural and combinatorial chemistry, x-ray crystallography, magnetic resonance spectroscopy, high-throughput screening, two-hybrid and fluorescence energy transfer studies of protein-protein interaction, epidemiological studies, and the clinic.

All of these developments - which are rapidly transforming our ability to identify and use molecular markers of disease - reflect what can be termed "omic" research [1-3]. Omic research includes studies in genomics, proteomics, transcriptomics, CHOmics (for the carbohydrates), kinomics (for the kinases), and methylomics (for epigenetic methylations and imprinting), among many others. It also includes compound forms like pharmacogenomics, functional genomics, structural genomics, and pharmacomethylomics [3]. Notions such as immunomics, metabolomics, toxicomics, literomics, and ecogenomics have been introduced, not entirely in jest. It's not that we really need


J.N. Weinstein et al. / Searching for pharmacogenomic markers: The synergy between omic and hypothesis-driven research

more jargon, but, aside from any amusement value, the omic terminology can be a useful shorthand - and it is at least etymologically respectable. Webster's dictionary defines "-ome" as an abstract entity, group, or mass, so omic research in biology is the study of entities in aggregate - DNA, RNA, protein, or other molecular components of a cell, tissue, or organism.

The substantive point here is that omic research requires a different mind-set from the more traditional study of one gene, gene product, or process at a time [1-3]. One generally ends up knowing a little about a lot, rather than a lot about a little. Often, the databases of molecular information are generated without knowing what about them will prove most valuable, but that fact in no way obviates the need for careful design and rigorous attention to experimental detail. In a sense, the guiding hypothesis in omic research relates to information and its utility, rather than to biological specifics. But anyone who does omic research quickly realizes its dependence on traditional one-at-a-time hypothesis-driven studies. The former type of research establishes context in a world of 35,000 genes and hundreds of thousands of interesting protein states; the latter identifies what data to generate and which relationships in the final database are worth further pursuit.

This synergy between traditional and omic approaches to biology is reflected in the way we identify and validate molecular markers of disease and molecular markers for therapy. The aim of this article is to illustrate that synergy through our studies with the drug discovery and development program of the National Cancer Institute (NCI). The NCI's cell-based screen, in which > 70,000 chemical compounds plus natural products have been tested one at a time and independently over the last 11 years, provides a unique opportunity complementary to the study of clinical tumors. Cancer cell lines clearly are not the same as cancer cells in vivo.
Even primary cultures from tumors are artificial in that they have been removed from their natural state and society in the body. But cultured cells do at least circumvent many of the logistical, technical, ethical, and conceptual difficulties that complicate work with clinical materials, and one can step into the same stream multiple times. Most of our present understanding of basic molecular pharmacology has come from studies in cultured cells, not from clinical materials. However, projecting in the other direction - from cultured cells toward the clinic - is more dangerous. One can hope to find clues with which to formulate hypotheses for further study.

2. The NCI-60 panel of human cancer cell lines

In 1990, the NCI Developmental Therapeutics Program (DTP) began operation of what was then considered a rather high-throughput screen, in which compounds are tested for their ability to inhibit growth of 60 different human cancer cell lines (the NCI-60) in culture [4-9]. Included currently are melanomas (8 cell lines), leukemias (6), and cancers of breast (8), prostate (2), lung (9), colon (7), ovary (6), kidney (8), and central nervous system (6) origin. The assay is a simple one. The cells are incubated with various concentrations of drug for 48 hours, and growth inhibition is then assessed using a sulforhodamine B assay for the amount of protein in the well. Fifty percent growth inhibitory concentrations (GI50s) and other indices of potency are then read from the resulting dose-response curves.

The top section of Figure 1 shows a highly schematic view of this part of the NCI drug discovery-development process. The compounds have come largely from synthetic chemistry and natural product sources, but biologicals and combinatorial libraries are also being tested. In recent years, the role of this process has changed progressively from primary screening to secondary testing as compounds have, increasingly, been selected for the assay on the basis of interesting prior information, and as molecular screens have been established in the program. This cell-based strategy for drug discovery was originally based on the hypothesis that selective activity in vitro against cancer cell lines from a particular organ would predict selective activity against the corresponding tumor types in humans. For present purposes, however, we will avoid the endless arguments about the best way to screen or test for anticancer agents and focus on the screen as a generator of profile data on the potencies of compounds tested and the drug sensitivities of the 60 cell types.
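As an illustration of how a GI50 might be read from a dose-response curve, the following minimal sketch interpolates linearly on a log-concentration axis. The function name `gi50`, the toy data, the expression of growth as percent of untreated control, and the log-linear interpolation are all assumptions made for illustration; the text does not specify the curve-reading procedure used in the screen.

```python
import math

def gi50(concentrations, growth_percent):
    """Estimate the concentration giving 50% growth (assumed: % of control).

    Interpolates linearly in log10(concentration) between the two tested
    concentrations that bracket 50% growth. Returns None if 50% is not
    crossed within the tested range.
    """
    points = sorted(zip(concentrations, growth_percent))
    for (c_lo, g_lo), (c_hi, g_hi) in zip(points, points[1:]):
        crosses = (g_lo - 50.0) * (g_hi - 50.0) <= 0 and g_lo != g_hi
        if crosses:
            frac = (g_lo - 50.0) / (g_lo - g_hi)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_c
    return None

# Hypothetical data: molar concentrations and growth as % of control.
conc = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]
growth = [98.0, 90.0, 60.0, 40.0, 10.0]
estimate = gi50(conc, growth)
print(estimate)
```

In this toy curve, growth crosses 50% halfway (on the log scale) between 1e-6 and 1e-5 M. A compound that never reaches 50% inhibition in the tested range yields `None` rather than an extrapolated value.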
Patterns of activity against the NCI-60 have proved predictive at the molecular level: they often provide incisive information on mechanisms of action and also on molecular targets and modulators of activity within the cancer cells. The patterns of activity were first analyzed using the COMPARE algorithm developed by the late K.D. Paull [5,10,11]. Given one compound as a "seed", COMPARE searches the database of agents screened and generates a list of those most similar to the seed in their patterns of activity against the 60 cell lines. Similarity in pattern generally indicates similarity in mechanism of action, mode of resistance, and molecular structure [10-14]. This form of analysis has been


Fig. 1. Schematic of the NCI-60 screen and profiling system, with associated databases of activities (A), molecular structure descriptors of the compounds tested (S), and molecular "targets" in the cells (T). The T-database includes measurements of one target at a time and aggregate (omic) measurements at the DNA, mRNA, and protein levels. Conceptually, there is also a clinical features database (C), not shown here. The informatics challenge is to analyze and understand each of these databases separately, then to integrate them with each other and with public information resources to address pharmacogenomic questions. Modified from [14].
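The seed-based search performed by COMPARE can be sketched as a ranking of compounds by the similarity of their activity patterns across the cell lines. The sketch below assumes Pearson correlation as the similarity measure and uses invented compound names and potencies; it is not the COMPARE implementation, whose details are given in the cited references.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def compare_like(seed, database, top=3):
    """Rank all other compounds by pattern similarity to the seed."""
    seed_pattern = database[seed]
    scores = [
        (pearson(seed_pattern, pattern), name)
        for name, pattern in database.items()
        if name != seed
    ]
    return sorted(scores, reverse=True)[:top]

# Hypothetical activity patterns (e.g. -log10 GI50) over five cell lines.
activity_db = {
    "seed": [5.0, 6.0, 7.0, 5.5, 6.5],
    "analog": [5.1, 6.2, 6.9, 5.4, 6.6],
    "unrelated": [6.0, 6.1, 5.9, 6.0, 6.1],
    "opposite": [7.0, 6.0, 5.0, 6.5, 5.5],
}
ranking = compare_like("seed", activity_db)
for score, name in ranking:
    print(f"{name}: r = {score:+.3f}")
```

The compound whose pattern tracks the seed most closely is listed first; one with a mirror-image pattern falls to the bottom of the list.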

applied productively to topoisomerase 2 inhibitors [15], pyrimidine biosynthesis inhibitors [16], and tubulin-active compounds [17,18], among many other classes of agents. We have used back-propagation neural networks and predictive methods from classical statistics to find ways in which the patterns of activity could indeed predict a compound's mechanism of action [12]. More detailed information on the relationship between pattern and mechanism has come from a variety of other statistical and artificial intelligence techniques [13,14,19-26].

3. Structure, activity, and target databases

The bottom half of Fig. 1 shows three types of databases that arise from the NCI-60 screen [14]: (A) contains the activity patterns, (S) contains molecular structural features of the tested compounds, and (T) contains characteristics of the cells that may be targets or modulators of drug activity or may be neither. The chemical structures in (S) can be coded in terms of any set of 1-, 2-, or 3-dimensional molecular structure descriptors, or a combination thereof. The NCI's Drug Information System (DIS) contains structural builds for ~500,000 molecules, including most of the > 70,000 tested to date ([27] and D. Zaharevitz, et al., unpublished). This database provides a basis for pharmacophoric searches; if a tested compound is found to have an interesting pattern of activity, its structure can be used to search for similar molecules in the DIS database that have not been tested.

More pertinent for present purposes is the target (T) database, each row of which defines the 60-cell line pattern of a measured cell characteristic [14]. Many laboratories at the NCI and elsewhere have been assessing these targets one at a time (or a restricted class at a time). The list includes oncogenes, tumor suppressor genes, molecules of the cell cycle and apoptotic pathways, drug resistance-mediating transporters, metabolic enzymes, cytokine receptors, heat shock proteins, telomerase, DNA repair enzymes, intracellular signaling molecules, and components of the cytoarchitecture. But a number of years ago, we decided to take a broader-brush, omic approach to characterization of these cells - at the DNA, RNA, and protein levels. We started where any molecular pharmacologist would, given a choice: with the proteins.

4. Pharmacoproteomics and the NCI-60

In collaboration with Leigh Anderson (Large Scale Biology, Inc.), we [28] assessed patterns of protein expression by two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) with detection by colloidal


Fig. 2. Proteomic profiling of the NCI-60 cell lines [28]. A: 2-D gels were run for duplicate harvests of the 60 cell lines, the aim being to index spots across the 60 and develop quantitative patterns of protein expression analogous to those for compound activities. B: Computer-processed pseudocolor image of the central section of a master gel based on breast cancer line MCF-7. The spots are modeled as bivariate Gaussians of Coomassie blue intensity, and spot "volumes" are calculated as the integral of intensity over the area of the spot. Red indicates spots in a quality-controlled database of 151; blue indicates additional spots in the overall cross-indexed database of 1,014 spots. C: Clustered image map (CIM) showing cell-cell relationships in terms of patterns of protein expression. Red indicates positive Pearson correlation coefficient; blue indicates negative Pearson correlation coefficient. The cells are clustered in the same order on both axes, so there is, by definition, 100% correlation on the main diagonal. This is a $T^{T} \cdot T$ CIM (see discussion of CIMs later in text). Modified from Buolamwini, et al. (in preparation).
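The spot "volume" mentioned in the caption can be illustrated with a short calculation. The sketch below assumes an axis-aligned (uncorrelated) bivariate Gaussian, for which the integral of intensity has the closed form 2π · amplitude · σx · σy, and checks that value against a simple grid sum. The parameter values are invented, and this is not the Kepler package's procedure.

```python
import math

def spot_intensity(x, y, amplitude, x0, y0, sx, sy):
    """Intensity of an axis-aligned bivariate Gaussian spot at (x, y)."""
    zx = (x - x0) / sx
    zy = (y - y0) / sy
    return amplitude * math.exp(-0.5 * (zx * zx + zy * zy))

def analytic_volume(amplitude, sx, sy):
    """Closed-form integral of the spot intensity over the plane."""
    return 2.0 * math.pi * amplitude * sx * sy

def numeric_volume(amplitude, x0, y0, sx, sy, steps_per_sigma=20, n_sigma=6):
    """Grid-sum approximation of the same integral over +/- n_sigma."""
    hx, hy = sx / steps_per_sigma, sy / steps_per_sigma
    n = 2 * n_sigma * steps_per_sigma
    total = 0.0
    for i in range(n + 1):
        x = x0 - n_sigma * sx + i * hx
        for j in range(n + 1):
            y = y0 - n_sigma * sy + j * hy
            total += spot_intensity(x, y, amplitude, x0, y0, sx, sy)
    return total * hx * hy

# Hypothetical spot: peak intensity 100, widths 2 and 3 pixels.
exact = analytic_volume(100.0, 2.0, 3.0)
approx = numeric_volume(100.0, 50.0, 80.0, 2.0, 3.0)
print(exact, approx)
```

Because the volume scales with both peak height and spot width, two spots with the same peak intensity can represent quite different amounts of protein.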

Coomassie blue and image processing by the Kepler program package. Figure 2 summarizes that project, which established an early link between the enterprise of proteome research [29,30] and the molecular pharmacology of cancer. The database generated consisted of 1,014 indexed and quantitated protein spots, of which 151 were quality-controlled over all 60 cell lines and incorporated into a primary data set for analysis [28]. The database was informationally coherent in the sense that different harvests of the same cell line were more highly correlated with each other in expression pattern than were parallel harvests of different cell lines. That is, the signal-to-noise ratio was sufficiently high to permit meaningful clustering of the cell lines on the basis of their patterns of protein expression. For this purpose, the 2-D gel spots were quantitated in terms of

spot "volume", intensity of staining integrated over the area of the computer-processed spot image. The bottleneck in the project turned out to be identification of the spots. It was possible to distinguish meaningful patterns of association between spots or between cell types without knowing the identities of the spots. But for most purposes, including the search for molecular markers, spot identity proved crucial. For the identification, we developed our own version of a rapid MALDI-TOF mass spectrometric technique based on peptide mapping [31]. The essential steps in the method included in-gel digestion of the proteins with combinations of proteases, purification of the peptides, analysis by MALDI-TOF mass spectrometry, and peptide fingerprinting. We used the method to identify a number of spots but soon realized that it was not the


job of a small academic laboratory to identify hundreds of proteins in that way. Accordingly, we decided to move on to mRNA expression profiling and wait for high-throughput proteomics to catch up. The wait has been longer than I expected. Despite numerous promising techniques, most of them based on mass spectrometry for detection, there still does not seem to be a complete solution to the proteomic profiling of mixtures as complex as those of mammalian cells. Even the nature and magnitude of the challenge become harder and harder to define, given the increasing focus on alternative splicings, post-translational modifications, and extensive, complex family relationships among proteins and their domains. We will all await with interest the results of ongoing large-scale proteomic efforts in the public and private sectors.

5. Transcriptomics and the NCI-60

Most drug targets are proteins, and, clearly, proteomic status cannot be inferred or predicted from data on the RNA. Not yet, at least. Complicating factors include the complexities of translational regulation, posttranslational modifications, and differing patterns of protein metabolism and degradation. However, mRNA expression levels are a useful second best, and the technology for determining them is considerably more advanced than it is for proteins. Most important, it is easier to establish identities. We have performed gene expression profiling studies of the NCI-60 using cDNA microarrays [32,33] with the Brown/Botstein laboratory at Stanford University and Affymetrix oligonucleotide chips [34] with the Lander/Golub group at the Whitehead Institute. The cDNA microarray studies profiled approximately 8,000 distinct genes using the two-color methodology [32,33]. Figure 3 shows hierarchical clustering of the cells based on gene expression patterns (left) and on drug sensitivities (right). In each case, the cells group in part by organ of origin but in part according to other principles. It was a surprise, though perhaps it should not have been, that the two clusterings are very different. The correlation of correlations between them [33] is only +0.21. At least one reason is that particular gene products, for example mdr1/Pgp, can influence the activities of many drugs across organ of origin categories but, being only single genes, have little effect on the clustering by gene expression pattern. We have since gone on to cross-compare the cDNA array and oligonucleotide chip databases gene by gene and establish a robust database of > 2,000 transcripts for which results from the two very different technologies are reasonably concordant across the 60 cell types (J.K. Lee, et al., in preparation). This concordance set is as well validated as any gene expression database of which we are aware. Conceptually, it is almost as if one had done northern blots or real-time RT-PCR studies for all of the genes across 60 cell lines to validate the cDNA array results. The drug and cDNA gene expression databases used in this study, along with tools of analysis, can be found at our web site, http://discover.nci.nih.gov. The oligonucleotide chip data will appear there soon. Additional data and the COMPARE program can be found at the DTP's web site, http://www.dtp.nci.nih.gov.

6. Color-coded clustered image maps (CIMs)

One useful and compact way to represent patterns in the data from "high-dimensional" datasets such as gene expression profiles is what we have termed the "clustered image map" (CIM) (sometimes called a clustered "heat map"). The principle is illustrated in Fig. 4 for gene expression over the 60 cell lines. We developed CIMs in the early 1990s for data on drug activities, target expression levels, gene expression values, and proteomic profiles [13,14,28,33]. The clustering of both axes (or sometimes only one if there is another organizing principle for the second axis) puts like together with like and brings out patterns. A red-green color scheme for the CIM has been popularized by our collaborators [35]. A flexible program for producing CIMs can be found at our web site, http://discover.nci.nih.gov. The gene-cell CIM in Fig. 4 is simple in that, in terms of Fig. 1, it involves only a single database, T. If we want to assess relationships between drug activity and gene expression, it is necessary to map the A database into the T database, which can be done most straightforwardly by multiplying A by the transpose of T and normalizing so that entries in the product matrix ($A \cdot T^{T}$) are Pearson correlation coefficients [14,33]. Figure 5 shows such a drug-target CIM. Alternatively, CIMs can be formed by multiplying a database (i.e., matrix) times its own transpose to produce a symmetrical product matrix [13,14,28,36]. For example, the $T^{T} \cdot T$ CIM expresses the correlation of each cell type with each other cell type in terms of pattern of expression, as in Fig. 2(C). Each point and each patch of color in a CIM (such as that in Fig. 5) represents a possible story. But how can one determine whether a patch represents a causally


Fig. 3. Clustering of cells in two ways: based on cDNA microarray gene expression patterns (A) and on drug sensitivity patterns (B). The two clusterings are very different, the overall "correlation of correlations" being only +0.21. *Indicates parental and transfectant cell lines from the pleural effusion of a breast cancer patient but expressing the proteins and transcripts characteristic of melanoma (as discussed in [32,33]). Average linkage clustering and a correlation coefficient similarity metric were used in this analysis. Modified from [33].
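One plausible reading of the "correlation of correlations" is sketched below: compute all pairwise cell-cell Pearson correlations from the gene expression data, do the same from the drug activity data, and then correlate the two lists of cell-pair values. The exact definition is given in [33]; the function names and the small matrices here are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cell_pair_correlations(matrix):
    """matrix[i][j] = feature i (gene or drug) measured in cell line j.

    Returns the correlations for every unordered pair of cell lines,
    in a fixed order (upper triangle of the cell-cell matrix).
    """
    columns = [list(col) for col in zip(*matrix)]
    return [
        pearson(columns[i], columns[j])
        for i in range(len(columns))
        for j in range(i + 1, len(columns))
    ]

def correlation_of_correlations(expression, activity):
    """Agreement between cell-cell similarity in the two data types."""
    return pearson(cell_pair_correlations(expression),
                   cell_pair_correlations(activity))

# Hypothetical data: 3 genes x 4 cell lines and 3 drugs x 4 cell lines.
expression = [[1, 2, 8, 9], [2, 1, 9, 8], [5, 6, 1, 2]]
activity = [[4, 9, 5, 8], [3, 8, 4, 9], [7, 2, 6, 1]]
r = correlation_of_correlations(expression, activity)
print(round(r, 3))
```

A value near +1 would mean that cell lines similar in gene expression are also similar in drug sensitivity; a low value, as reported in the caption, means the two views group the cells differently.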

interesting story, an epiphenomenal correlation (which still may identify a useful molecular marker), or statistical coincidence? The statistical robustness of association can be assessed in various ways, for example by using the bootstrap [37] to obtain approximate confidence limits on the estimated correlation coefficient and to test the null hypothesis that the true correlation is zero. But Fig. 5, which represents a small set of drugs and a relatively small set of genes, still reflects about 160,000 drug-gene pairs. By definition, 5% of these pairs (i.e., 8,000 of them) would appear to be statistically significant at the P = 0.05 level even if the data were just noise. There are too many false positives. If this "multiple comparisons" problem is taken into account by making a Bonferroni correction (which assumes statistical independence), then almost all of the true correlations will be thrown out. There are too many false negatives. Other, more sophisticated corrections can be made but, ultimately, in this type of situation, the statistics can take one only so far. We are left with a long list of gene-drug (or gene-gene) correlations, each of which must be assessed for its biological sense. This problem is most acute for database associations such as those considered here, but it also


7. Organizing the literature on gene-gene and gene-drug correlations: MedMiner and EDGAR

Fig. 4. Illustration of the principle of the clustered image map (CIM) for compact representation of high-dimensional data. Red in this case indicates high gene expression; blue indicates low expression. The clustering of like with like on both axes brings out pattern. (In some other cases, only one axis may be clustered if there is some other organizing principle for the other axis.) See [13,14,28,33,35].

pertains to the simplest binary experiments in which, for example, a malignant cell type or tissue is compared with its normal counterpart. Even with enough replicates to obviate the question of statistical significance, such experiments typically produce lists of hundreds of genes that differ in expression, and one is left to figure out which differences have biological plausibility. This is where synergy between omic research and hypothesis-driven studies of particular genes and drugs becomes necessary. To figure out where to look in the massive databases that arise from the former, we generally need to make use of the latter. That can mean experiments done after the fact, it can mean plumbing rich public databases such as those of the NCI's Cancer Genome Anatomy Project [38,39], or it can mean laboriously searching the extant literature. Because literature searching quickly becomes tedious, we developed web-based text-mining and literature-organizing tools, MedMiner [40] and EDGAR [41], to facilitate the process.
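The multiple-comparisons arithmetic discussed for Fig. 5 can be made concrete with a short sketch. It uses the 118 compounds and 1,376 genes given in the Fig. 5 caption, computes the number of pairs expected to pass P = 0.05 by chance and the corresponding Bonferroni threshold, and then shows a simple percentile bootstrap for one correlation coefficient. The percentile method, the resample count, and the synthetic data are assumptions for illustration, not the procedure used in the study.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Multiple comparisons: counts taken from the Fig. 5 caption.
n_pairs = 118 * 1376                      # drug-gene pairs
alpha = 0.05
expected_false_positives = alpha * n_pairs
bonferroni_threshold = alpha / n_pairs    # per-pair P needed after correction

def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence limits for the correlation of x and y."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample cell lines
        try:
            stats.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue                                  # degenerate resample
    stats.sort()
    lo = stats[int((1 - level) / 2 * len(stats))]
    hi = stats[int((1 + level) / 2 * len(stats)) - 1]
    return lo, hi

# Synthetic example: 60 "cell lines" with a strong built-in association.
gene = [float(i) for i in range(60)]
drug = [g + 10.0 * math.sin(g) for g in gene]
lo, hi = bootstrap_ci(gene, drug)
print(n_pairs, expected_false_positives, bonferroni_threshold)
print(lo, hi)
```

With roughly 160,000 pairs, about 8,000 would look significant by chance alone, while the Bonferroni threshold falls near 3 × 10^-7, which is why the text turns to biological plausibility and the literature to sort through the candidates.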

MedMiner, which is publicly available at our web site (http://discover.nci.nih.gov), can be used for gene, gene-gene, gene-drug, drug-drug, or more general literature queries. Input can include gene accession numbers, gene names, drug NSC numbers, drug names, and/or free text (e.g., "apoptosis" or "transport"). In the case of microarray analysis, the user can specify a list of arrayed genes. MedMiner uses a combination of GeneCards from the Weizmann Institute, PubMed from the National Library of Medicine (NLM), syntactic analysis, truncated-keyword filtering of relationals, and user-controlled sculpting of a Boolean query to generate key sentences from the pertinent abstracts. Those sentences are then organized so that the user can access the most pertinent ones directly by clicking on a relevance-term. Whole abstracts deemed to be of interest can then be accessed fluently and dropped into a "shopping basket" for display or for automated entry into an EndNote library. Experienced users have estimated that MedMiner speeds up 5- to 10-fold the process of capturing and organizing the literature from PubMed searches on lists of gene-gene and gene-drug relationships [40]. MedMiner is fast enough and transparent enough for real-world use on the Web, but it by no means captures all of the information that is theoretically available in the free text of an abstract. Natural language processing (NLP) is one of the great intellectual challenges, and a number of attempts are being made to harness NLP principles for omic studies. Our own effort in this direction is EDGAR (Extraction of Data on Genes and Relationships), a software tool for semantic analysis and organization of the literature relevant to our studies in the molecular pharmacology of cancer [41]. Many different approaches can be used to extract factual assertions from biomedical text.
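The idea of pulling key sentences from abstracts by combining gene and drug names with truncated relational keywords can be illustrated with a toy filter. Everything in the sketch below is hypothetical: the stem list, the function name, the sentence splitter, and the example abstract are invented, and the code does not reproduce MedMiner's actual query construction or its use of GeneCards and PubMed.

```python
import re

# Hypothetical truncated keywords ("relationals"); each stem matches
# several word forms, e.g. "inhibit" covers inhibits/inhibited/inhibitor.
RELATIONAL_STEMS = ("inhibit", "induc", "resist", "sensitiv", "regulat")

def key_sentences(abstract, gene, drug, stems=RELATIONAL_STEMS):
    """Return (matched_stems, sentence) for sentences naming both the
    gene and the drug and containing at least one relational stem."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    hits = []
    for sentence in sentences:
        low = sentence.lower()
        if gene.lower() in low and drug.lower() in low:
            matched = [stem for stem in stems if stem in low]
            if matched:
                hits.append((matched, sentence))
    return hits

# Invented example text, not taken from any real abstract.
abstract = (
    "Asparagine synthetase (ASNS) expression was measured in cell lines. "
    "ASNS expression was correlated with sensitivity to L-asparaginase. "
    "Cells were cultured for 48 hours."
)
hits = key_sentences(abstract, "ASNS", "L-asparaginase")
for stems_found, sentence in hits:
    print(stems_found, "->", sentence)
```

Grouping the retained sentences by the stem that matched gives a rough analogue of browsing by relevance term, although simple substring matching of this kind will miss synonyms and negation.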
Methods used include syntactic parsing, processing of statistical and frequency information, and rule-based decision-making (reviewed in [41]). EDGAR draws on all of these, using a stochastic part of speech tagger in support of an underspecified syntactic parser. Fully general semantic analysis is unrealizable, so we had to develop suitable restricted ontologies and controlled vocabularies. The goal was to extract factual assertions in the form of first order predicate calculus statements about the relationships between genes and drugs in cancer therapy. EDGAR is strong on the identification of


J.N. Weinstein et al. / Searching for pharmacogenomic markers: The synergy between omic and hypothesis-driven research

Fig. 5. Clustered image map (CIM) relating activity patterns of 118 tested compounds to the expression patterns of 1,376 genes in the 60 cell lines. Included in addition to the gene expression levels are data for 40 molecular targets assessed one at a time in the cells. A red point (high positive Pearson correlation coefficient) indicates that the agent tends to be more active (in the two-day assay) against cell lines that express more of the gene; a blue point (high negative correlation) indicates the opposite tendency. Genes were cluster-ordered on the basis of their correlations with drugs (mean-subtracted, average-linkage clustered with correlation metric); drugs were clustered on the basis of their correlations with genes (mean-subtracted, average-linkage clustered with correlation metric). Sharp edges of the colored patches reflect deep forks in the corresponding cluster tree. Insert A shows a magnified view of the region around the point (white circle) representing the correlation between the dihydropyrimidine dehydrogenase gene and 5-fluorouracil. Insert B is an analogous magnified view for the asparagine synthetase gene and the drug L-asparaginase. Modified from [33].
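The gene-by-drug correlation matrix that underlies a clustered image map such as Fig. 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' software: the function name `gene_drug_correlations` and the toy arrays are hypothetical, and the sketch covers only the Pearson correlation step, not the average-linkage cluster ordering of rows and columns.

```python
import numpy as np

def gene_drug_correlations(expression, activity):
    """Pearson correlation of every gene with every drug across cell lines.

    expression: array of shape (n_genes, n_cell_lines)
    activity:   array of shape (n_drugs, n_cell_lines)
    Returns an (n_genes, n_drugs) matrix of correlation coefficients.
    Profiles that are constant across cell lines are not handled here.
    """
    expression = np.asarray(expression, dtype=float)
    activity = np.asarray(activity, dtype=float)
    # Mean-subtract each gene profile and each drug profile across the cell lines.
    e = expression - expression.mean(axis=1, keepdims=True)
    a = activity - activity.mean(axis=1, keepdims=True)
    # Scale each profile to unit length so that a dot product equals Pearson's r.
    e /= np.linalg.norm(e, axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    return e @ a.T

# Toy data (hypothetical): 2 genes and 2 drugs measured in 4 cell lines.
toy_expression = np.array([[1.0, 2.0, 3.0, 4.0],
                           [4.0, 3.0, 2.0, 1.0]])
toy_activity = np.array([[2.0, 4.0, 6.0, 8.0],
                         [1.0, 0.0, 1.0, 0.0]])
toy_cim = gene_drug_correlations(toy_expression, toy_activity)
```

In a display like Fig. 5, each entry of such a matrix would be color-coded (red for high positive, blue for high negative correlation) after the rows and columns have been reordered by clustering.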

"referential" (i.e., noun-related) relationships, weaker with respect to "relational" (i.e., verb-related) ones. Interpretation of the referential vocabulary in EDGAR is based on NLP tools and knowledge sources developed at NLM. The primary knowledge source supporting EDGAR is the Unified Medical Language System (UMLS) Metathesaurus, a compilation of more than 600,000 concepts from controlled vocabularies in the biomedical sciences. We tested EDGAR's capability by applying it to a set of 383 literature abstracts related to drug resistance mechanisms. The results, expressed in a cluster tree with 383 leaves, showed considerable coherence by drug and mechanism of action [41]. That was achieved without the manual reading of a single abstract. EDGAR is Web-based but not yet fast enough or transparent enough for public use. It illustrates, however, both the potential and the challenges of automated literature analysis in omic studies.

8. Pharmacogenomic markers

The two white rectangles on the gene expression vs. drug sensitivity CIM in Fig. 5 indicate stories with


likely causal significance on the basis of literature information.

8.1. Dihydropyrimidine dehydrogenase and 5-fluorouracil

5-Fluorouracil (5-FU), an antimetabolite drug often used against colorectal and breast cancer, can inhibit both RNA processing and thymidylate synthesis. Dihydropyrimidine dehydrogenase (DPYD), the rate-limiting enzyme in uracil and thymidine catabolism, is also rate-limiting for 5-FU catabolism. Hence, high DPYD levels might be expected to decrease the activity of 5-FU. Consistent with this hypothesis, we found a highly significant negative correlation (-0.53) between DPYD gene expression and 5-FU potency against the 60 cell lines [33]. On closer examination, we found that 14 of the 18 low-expressers of DPYD (more than 4-fold lower than the reference pool) are sensitive or highly sensitive to 5-FU. Perhaps not coincidentally, given the clinical use of 5-FU against colon cancer, all of the colon-derived cell lines (7 out of 7) were sensitive to 5-FU and low in DPYD expression. Previous studies of DPYD correlations in clinical materials have been difficult to interpret, but these microarray data suggest further study of DPYD as a pharmacogenomic marker [33].

8.2. Asparagine synthetase (ASNS) and L-asparaginase

Many acute lymphoblastic leukemias (ALL) lack asparagine synthetase (ASNS) and therefore must scavenge exogenous L-asparagine to survive (see Fig. 6). This dependence is exploited by treating ALL and other lymphoid malignancies with bacterial L-asparaginase, which depletes extracellular L-asparagine and selectively starves the cancer cells. As shown in Fig. 7, we found a moderately high negative correlation (-0.44; bootstrap 95% confidence interval -0.59 to -0.25) between expression of the ASNS gene and L-asparaginase sensitivity in the 60 cell lines [33]. But we also knew to look specifically at the leukemic subpanel, and there the correlation was a striking -0.98 (bootstrap 95% confidence interval -1.00 to -0.93). This value survived even a Bonferroni correction for the statistical multiple comparisons problem. Furthermore, the two ALL-derived lines expressed the lowest levels of ASNS mRNA and were the most sensitive to L-asparaginase, as might have been expected. These results supported


the possible use of ASNS as a marker for clinical decisions about L-asparaginase therapy [33]. The next question was obvious: Would any other cell line panel show a similar correlation? The answer was "yes", though not as strongly. The correlation coefficient for the ovarian lines was -0.88 (confidence interval -0.99 to -0.23) [33]. Early clinical trials done with a scattering of solid tumors showed occasional responses to L-asparaginase in melanoma, chronic granulocytic leukemia, lymphosarcoma, and reticulum cell sarcoma but not in other tumor types (see [33] for references). The microarray findings support a closer look at L-asparaginase therapy for solid tumors, particularly for a low-ASNS subset of ovarian cancers. The preferred material for a clinical trial would be the polyethylene glycol-modified form of L-asparaginase, which shows much better pharmacokinetic and immunological properties than does the native bacterial form of the enzyme. Studies of asparagine synthetase/L-asparaginase correlations in clinical materials are underway in collaboration with D. Von Hoff and his research group at the Arizona Cancer Center.
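The statistics quoted in Sections 8.1 and 8.2 (a Pearson correlation, a bootstrap confidence interval, and a Bonferroni correction) can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say which bootstrap variant was used, so the percentile bootstrap shown here is an assumption, the function names are hypothetical, and the data are synthetic stand-ins, not the NCI-60 measurements.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for Pearson's r.

    Paired observations (here, cell lines) are resampled with replacement.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = rng.integers(0, n, size=n)
        # Skip degenerate resamples in which either variable is constant.
        if np.ptp(x[idx]) == 0 or np.ptp(y[idx]) == 0:
            continue
        estimates.append(pearson_r(x[idx], y[idx]))
    lower, upper = np.percentile(
        estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return float(lower), float(upper)

def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """True if p_value passes the Bonferroni-corrected threshold alpha / n_tests."""
    return p_value < alpha / n_tests

# Synthetic illustration only (not the NCI-60 data).
rng = np.random.default_rng(1)
expression = rng.normal(size=60)                       # stand-in for log2 expression
sensitivity = -0.5 * expression + rng.normal(size=60)  # stand-in for -log(GI50)
r = pearson_r(expression, sensitivity)
ci_low, ci_high = bootstrap_ci(expression, sensitivity)
```

With a small subpanel (for example, six leukemia lines), the bootstrap interval is typically much wider than for the full panel of 60, which is one reason the multiple-comparisons correction matters when many gene-drug pairs are screened.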

9. Concluding remarks

As indicated by the foregoing examples, omic and hypothesis-driven research should be seen as synergistic, not mutually exclusive. But there is a paradox: the easiest associations to identify in an omic database are the least interesting: ones that have been identified previously. Next easiest to identify are those that, with hindsight, make biological or pharmacological sense. Hardest are those that would be most exciting: the unexpected, the paradigm shifters. These tend to get lost among the multitude of false positives. The problem is most acute for cross-database comparisons, less so but still considerable for binary experimental designs and time-course studies. In this paper, I have emphasized the effort to find markers of sensitivity to a treatment. One can also ask a complementary question about the molecular consequences of therapy. Both omic and hypothesis-driven studies to address the latter type of question are ongoing in our own and many other laboratories [42]. Another type of synergy deserves at least brief mention. Gene expression profiling is in vogue at the moment, but, clearly, no single type of molecular information can capture all of the pharmacological and toxicological phenomena relevant to drug discovery and selection of therapy. Data on DNA sequence, transcript


Fig. 7. Relationship between asparagine synthetase transcript expression and chemosensitivity of the NCI-60 to L-asparaginase. Each point represents one of the 60 cell types. Leukemia and ovarian points are larger, open circles. Main effects have been removed for both cells and drugs. Hence, a -log(GI50) value of 1 for sensitivity indicates a 10-fold higher than average sensitivity of the cell line to the agent. The asparagine synthetase expression level is plotted as the relative log2 abundance of the asparagine synthetase transcript. A value of +2 indicates 4-fold higher expression than in the reference pool.

expression, protein expression, chromosomal aberrations, chromosomal copy number changes, single nucleotide polymorphisms, promoter methylation, and molecular interactions, inter alia, can all contribute to our understanding. But each provides only partial insight. As our laboratory and collaborators combine these different classes of information for the NCI-60, it becomes progressively more apparent that they are synergistic.

Acknowledgments

I am very grateful to the many members of my research group, the Genomics and Bioinformatics Group in the Laboratory of Molecular Pharmacology, NCI, for their individual and collective contributions to the projects touched on here. Included prominently have been U. Scherf, M. Waltham, W.C. Reinhold, T.G. Myers, Y. Zhou, L.H. Smith, L. Tanabe, J.K. Lee, S. Richman, F. Gwadry, S. Kim, Ajay, H. Kouros-Mehr, J. Alexander, S. Daoud, S. Nishizuka, and K. Bussey. Other principal collaborators in various facets of the proteomic and transcriptomic work have included Y. Pommier, K.W. Kohn, D. Ross, M. Eisen, P.O. Brown, D. Botstein, D. Shalon, D. Lashkari, E. Liu, L. Miller, E. Lander, T. Golub, D. Slonim, H. Coller, P. Tamayo, J. Staunton, and N.L. Anderson. I am especially grateful to members of the NCI Developmental Therapeutics Program, who have made this molecular targets effort work. Prominent in this regard are E.A. Sausville, M. Grever, A. Monks, D.A. Scudiero, D. Zaharevitz, R. Camalier, S. Holbeck, and K.D. Paull. In particular, the late Kenneth Paull laid the foundations for the informatics of this program, and our subsequent contributions simply follow in his footsteps.

References

[1] J.N. Weinstein and J.K. Buolamwini, Molecular targets in cancer drug discovery: Cell-based profiling, Current Pharmaceutical Design 6 (2000), 473-483.
[2] J.N. Weinstein, Fishing Expeditions, Science 282 (1998), 627-628.
[3] J.N. Weinstein, Pharmacogenomics: Teaching old drugs new tricks, New England J. Med. 343 (2000), 1408-1409.
[4] M.R. Boyd, The future of new drug development, in: Current Therapy in Oncology, J.E. Neiderhuber, ed., Philadelphia, Decker, 1992, pp. 11-22.
[5] M.R. Boyd and K.D. Paull, Some practical considerations and applications of the National Cancer Institute in vitro anticancer drug discovery screen, Drug Development Research 34 (1995), 91-109.
[6] M.C. Alley, D.A. Scudiero, A. Monks, M.L. Hursey, M.J. Czerwinski, D.L. Fine, B.J. Abbot, J.G. Mayo, R.H. Shoemaker and M.R. Boyd, Feasibility of drug screening with panels of human tumor cell lines using a microculture tetrazolium assay, Cancer Res. 48 (1988), 589-601.
[7] A. Monks, D.A. Scudiero, P. Skehan, R.H. Shoemaker, K. Paull, D. Vistica, C. Hose, J. Langely, P. Cronise, A. Vaigro-Wolff, M. Grey-Goodrich, H. Campbell, J. Mayo and M.R. Boyd, Feasibility of a high flux anticancer drug screen using a diverse panel of cultured human tumor cell lines, J. Natl. Cancer Inst. 83 (1991), 757-766.
[8] M.R. Grever, S.A. Schepartz and B.A. Chabner, The National Cancer Institute: Cancer drug discovery and development program, Seminars in Oncol. 19 (1992), 622-638.
[9] S.F. Stinson, M.C. Alley, W.C. Kopp, H.H. Fiebig, L.A. Mullendore, A.F. Pittman, S. Kenney, J. Keller and M.R. Boyd, Morphological and immunocytochemical characteristics of human tumor cell lines for use in a disease-oriented anticancer drug screen, Anticancer Res. 12 (1992), 1035-1053.
[10] K.D. Paull, R.H. Shoemaker, L. Hodes, A. Monks, D.A. Scudiero, L. Rubinstein, J. Plowman and M.R. Boyd, Display and analysis of patterns of differential activity of drugs against human tumor cell lines: Development of mean graph and COMPARE algorithm, J. Natl. Cancer Inst. 81 (1989), 1088-1092.
[11] K.D. Paull, E. Hamel and L. Malspeis, Prediction of biochemical mechanism of action from the in vitro antitumor screen of the National Cancer Institute, in: Cancer Chemotherapeutic Agents, W.E. Foye, ed., American Chemical Soc. Books, 1993, pp. 1574-1581.
[12] J.N. Weinstein, K.W. Kohn, M.R. Grever, V.N. Viswanadhan, L.V. Rubinstein, A.P. Monks, D.A. Scudiero, L. Welch, A.D. Koutsoukos, A.J. Chiausa and K.D. Paull, Neural computing in cancer drug development: predicting mechanism of action, Science 258 (1992), 447-451.
[13] J.N. Weinstein, T.G. Myers, J.K. Buolamwini, K. Raghavan, W. van Osdol, J. Licht, V.N. Viswanadhan, K.W. Kohn, L.V. Rubinstein, A.D. Koutsoukos, A.P. Monks, D.A. Scudiero, N.L. Anderson, D. Zaharevitz, B.A. Chabner, M.R. Grever and K.D. Paull, Predictive statistics and artificial intelligence in the US National Cancer Institute's drug discovery program for cancer and AIDS, Stem Cells 12 (1994), 13-22.
[14] J.N. Weinstein, T.G. Myers, P.M. O'Connor, S.H. Friend, A.J. Fornace, K.W. Kohn, T. Fojo, S.E. Bates, L.V. Rubinstein, N.L. Anderson, J.K. Buolamwini, W.W. van Osdol, A.P. Monks, D.A. Scudiero, E.A. Sausville, D.W. Zaharevitz, B. Bunow, V.N. Viswanadhan, G.S. Johnson, R.E. Wittes and K.D. Paull, An information-intensive approach to the molecular pharmacology of cancer, Science 275 (1997), 343-349.
[15] R.J. Sapolsky and R.J. Lipshutz, Mapping genomic library clones using oligonucleotide arrays, Genomics 33 (1996), 445-456.
[16] E.S. Cleveland et al., Biochem. Pharmacol. 49 (1995), 947.
[17] K.D. Paull, C.M. Lin, L. Malspeis and E. Hamel, Cancer Res. 52 (1992), 3892.
[18] R. Bai, K.D. Paull, C.L. Herald, L. Malspeis, G.R. Pettit and E. Hamel, Halichondrin B and homohalichondrin B, marine natural products binding in the vinca domain of tubulin. Discovery of tubulin-based mechanism of action by analysis of differential cytotoxicity data, J. Biol. Chem. 266 (1991), 15882-15889.
[19] B.A. Chabner, J.N. Weinstein, K.D. Paull and M.R. Grever, Cell line-based screening for new anticancer drugs, in: Cancer Treatment, an Update, P. Banzet, F. Holland, D. Khayat and M. Weil, eds, Springer-Verlag, Paris, 1994, pp. 10-16.
[20] A.D. Koutsoukos, L.V. Rubinstein, D. Faraggi, S. Kalyandrug, J.N. Weinstein, K.D. Paull, K.W. Kohn and R.M. Simon, Discrimination techniques applied to the NCI in vitro antitumor drug screen: Predicting biochemical mechanism of action, Statistics in Medicine 13 (1994), 719-730.
[21] L.M. Shi, T.G. Myers, Y. Fan and J.N. Weinstein, Application of Genetic Function Approximation to the QSAR Study of Anticancer Ellipticine Analogs, Fifth Conference on Current Trends in Computational Chemistry, 1996, pp. 1-5.
[22] L.M. Shi, T.G. Myers, Y. Fan, P.M. O'Connor, K.D. Paull, S.H. Friend and J.N. Weinstein, Mining the National Cancer Institute's Anticancer Drug Screen Database: cluster analysis of ellipticine analogs with p53-inverse and central nervous system-selective patterns of activity, Mol. Pharmacol. 53 (1998), 241-251.
[23] L.M. Shi, Y. Fan, T.G. Myers, K.D. Paull and J.N. Weinstein, Mining the NCI anticancer drug discovery databases: Genetic function approximation for the quantitative structure-activity relationship study of anticancer ellipticine analogs, J. Chem. Inf. Comput. Sci. 38 (1998), 189-199.
[24] W.W. van Osdol, T.G. Myers, K.D. Paull, K.W. Kohn and J.N. Weinstein, Use of the Kohonen self-organizing map to study the mechanisms of action of chemotherapeutic agents, J. Natl. Cancer Inst. 86 (1994), 1853-1859.
[25] W.W. van Osdol, T.G. Myers and J.N. Weinstein, Neural network techniques for the informatics of cancer drug discovery, Methods in Enzymology 321 (2000), 369-395.
[26] S.E. Bates, A.T. Fojo, J.N. Weinstein, T.G. Myers, M. Alvarez, K.D. Paull and B.A. Chabner, Molecular targets in the National Cancer Institute drug screen, J. Cancer Res. Clin. Oncol. 121 (1995), 495-500.
[27] G.W.A. Milne, M.C. Nicklaus, J.S. Driscoll, S. Wang and D. Zaharevitz, National Cancer Institute drug information system 3D database, J. Chem. Inf. Comput. Sci. 34 (1994), 1219-1224.
[28] T.G. Myers, M. Waltham, G. Li, J.K. Buolamwini, D.A. Scudiero, L.V. Rubinstein, K.D. Paull, E.A. Sausville, N.L. Anderson and J.N. Weinstein, A protein expression database for the molecular pharmacology of cancer, Electrophoresis 18 (1997), 647-653.
[29] N.L. Anderson, J.-P. Hofmann, A. Gemmell and J. Taylor, Global approaches to quantitative analysis of gene-expression patterns observed by use of two-dimensional gel electrophoresis, Clin. Chem. 30 (1984), 2021-2036.
[30] N.G. Anderson and N.L. Anderson, Twenty years of two-dimensional electrophoresis: Past, present and future, Electrophoresis 17 (1996), 443-453.
[31] G. Li, M. Waltham, E. Unsworth, A. Treston, J. Mulshine, N.L. Anderson, K.W. Kohn and J.N. Weinstein, Rapid protein identification from two-dimensional polyacrylamide gels by MALDI mass spectrometry, Electrophoresis 18 (1997), 647-653.
[32] D.T. Ross, U. Scherf, M.B. Eisen, C.M. Perou, P. Spellman, V. Iyer, C. Rees, S.S. Jeffrey, M. Van de Rijn, M. Waltham, A. Pergamenschikov, J.C.F. Lee, D. Lashkari, D. Shalon, T.G. Myers, J.N. Weinstein, D. Botstein and P.O. Brown, Systematic variation in gene expression patterns in human cancer cell lines, Nature Genetics 24 (2000), 227-235.
[33] U. Scherf, D.T. Ross, M. Waltham, L.H. Smith, J.K. Lee, K.W. Kohn, W.C. Reinhold, T.G. Myers, D.T. Andrews, D.A. Scudiero, M.B. Eisen, E.A. Sausville, Y. Pommier, D. Botstein, P.O. Brown and J.N. Weinstein, A gene expression database for the molecular pharmacology of cancer, Nature Genetics 24 (2000), 236-244.
[34] J.E. Staunton, D.K. Slonim, H.A. Coller, P. Tamayo, M.J. Angelo, J. Park, U. Scherf, J.K. Lee, J.N. Weinstein, J.P. Mesirov, E.S. Lander and T.R. Golub, Chemosensitivity prediction by transcriptional profiling, Proc. Natl. Acad. Sci. USA, in press.
[35] M.B. Eisen, P.T. Spellman, P.O. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA 95 (1998), 14863-14868.
[36] P.S. Shenkin and Q. McDonald, Cluster analysis of molecular conformations, J. Comput. Chem. 15 (1994), 899-916.
[37] B. Efron and G. Gong, A leisurely look at the bootstrap, the jackknife, and cross-validation, Amer. Statistician 37 (1983), 36-38.
[38] R.L. Strausberg, C.A. Dahl and R.D. Klausner, New opportunities for uncovering the molecular basis of cancer, Nature Genetics 15 (1997), 415-416.
[39] R.L. Strausberg, The Cancer Genome Anatomy Project: Building a new information and technology platform for cancer research, in: Early Detection of Cancer, E.A. Srivastava, ed., Elsevier, 1998.
[40] L. Tanabe, L.H. Smith, J.K. Lee, U. Scherf, L. Hunter and J.N. Weinstein, MedMiner: An internet tool for mining information, with application to gene expression profiling, BioTechniques 27 (1999), 1210-1217.
[41] T.C. Rindflesch, L. Tanabe, J.N. Weinstein and L. Hunter, EDGAR: Drugs, genes and relations from the biomedical literature, Pac. Symp. Biocomput. (2000), 517-528.
[42] Y. Zhou, F.G. Gwadry, W.C. Reinhold, L. Miller, L.H. Smith, U. Scherf, E. Liu, K.W. Kohn, Y. Pommier and J.N. Weinstein, Transcriptional regulation of mitotic genes by camptothecin-induced DNA damage: Microarray analysis of dose- and time-dependent effects, Cancer Res., submitted.


Candidate genes and single nucleotide polymorphisms (SNPs) in the study of human disease

Stephen Chanock*
National Cancer Institute, Gaithersburg, MD 20877, USA

The genomic revolution has generated an extraordinary resource, the catalog of variation within the human genome, for investigating biological, evolutionary and medical questions. Together with new, more efficient platforms for high-throughput genotyping, this catalog makes it possible to begin to dissect genetic contributions to complex trait diseases, specifically examining common variants, such as the single nucleotide polymorphism (SNP). At the same time, these tools will make it possible to identify determinants of disease with the expectation of eventually tailoring therapies based upon specific profiles. However, a number of methodological, practical and ethical issues must be addressed before the analysis of genetic variation becomes a standard of clinical medicine. The currents of variation in human biology are reviewed here, with a specific emphasis on future challenges and directions.

Keywords: Variation, genome, genetic, mutation, disease susceptibility

* Address for correspondence: Stephen Chanock, M.D., Immunocompromised Host Section, Pediatric Oncology Branch and The Advanced Technology Center, National Cancer Institute, 8717 Grovemont Circle, Gaithersburg, MD 20877, USA. Tel.: +1 301 435 7559; Fax: +1 301 402 3134; E-mail: [email protected]. Disease Markers 17 (2001) 89-98 ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved

1. Introduction

With the completion of the first composite map of the human genome, an extraordinary resource has become available to investigate the role of genetic variation in human diseases [1,2]. While the total number of genes is roughly 35,000, fewer than initially estimated, it is nonetheless a formidable task to organize and catalog the differences between any two genomes. The concept of a single human genome was useful for constructing a first-generation map, but it is now clear that there is no such entity. In fact, any two human genomes are estimated to differ by approximately 0.1% or less. Remarkably, it is within this tiny fraction of a genome, namely the collection of sequence variations, that we find an opportunity to decipher genetic determinants of disease susceptibility and outcome. Furthermore, the catalog of variations represents an unprecedented resource for investigating evolutionary and migratory events in human history. The combination of technical advances in genetic analyses with the enormous resource of genetic information has set the stage for a new age in the study of human disease.

2. Variation in the human genome: The predominance of SNPs

The most common sequence variation in the human genome is the substitution of a single base, commonly referred to as a single nucleotide polymorphism (SNP). By definition, a SNP is a variation in sequence with a frequency of greater than 1% in at least one population. It has been estimated that an individual carries millions of SNPs, reflecting enormous sequence diversity across all human chromosomes [3, 4]. Initial estimates of the density of SNPs across the genome suggest that the average frequency of SNPs is between 1 per 1.3 kb and 1 per 1.9 kb overall [4-6]. It is notable that the density of SNPs varies between regions of chromosomes, as well as between chromosomes [7, 8]. These differences reflect a spectrum of selective pressures on genes as well as complex rates of mutation and recombination, which also vary greatly across the genome. Less frequent variants (i.e., with less than 1% frequency) might also be informative in the mapping of complex-trait disorders as well as familial diseases. By definition, a mutation results in a significant phenotype, whereas a SNP confers mild or no phenotypic changes. It is more than likely that some rarer variants


S. Chanock / Candidate genes and single nucleotide polymorphisms (SNPs) in the study of human disease

act more like SNPs, conferring a mild phenotypic difference in rarer disorders. The challenge lies in identifying infrequent variants from public databases and applying them to directed studies of rarer diseases (e.g., childhood cancer). The study of SNPs, which are, by definition, stable genetic changes in the genome, permits us to look closely at the footprints of past generations. Thus, the study of SNPs within families or pathways of genes also offers a perspective on prior events in human evolution, ones that have shaped the diversity we appreciate today [9]. Several factors in combination have shaped the diversity we now observe as SNPs, fixed variations in gene sequence. One of the major contributing factors is selective pressure in response to challenges within discrete populations. Random mutation at each base in the 3.1 billion base pair genome is not equally tolerated; in the extreme, some changes are not compatible with survival (e.g., premature stop codons in critical genes). Since the random mutation rate is roughly 2 x 10^-8, it is expected that each base has been mutated many times in the history of human evolution. So far, initial surveys indicate that transitions are more common than transversions, suggesting that deamination contributes to a higher likelihood of mutation, especially at sites of CpG dinucleotides [10]. Based on these assumptions and the data emanating from the SNP Consortium, it has been posited that there are as many as 11 million SNPs per person [3]. Historically, the first single nucleotide polymorphisms were identified during 'gene-centric' studies, detected by restriction fragment length polymorphism (RFLP) and used to analyze single genes in population-based studies. The RFLP was later replaced by the simple tandem repeat (STR) as the preferred unit for genetic studies. STRs are highly polymorphic allelic repeats of 2, 3 and 4 nucleotide units strewn evenly throughout the genome, but with substantially lower frequency than SNPs [11]. Rarer variations, namely large-scale deletions, substitutions or duplications, arise with lower frequency and are useful for study of informative genes [12]. In some cases, a SNP is in linkage with a more complex, stable variant. Interestingly, some of the strongest associations identified so far have been between complex variants and disease outcomes, and not simple SNPs, though many use the term SNP to indicate more than a single nucleotide variation. Adequately spaced STRs have been used to scan the whole genome for markers in linkage disequilibrium with a trait. Using PCR amplification technology, it is possible to amplify a collection of STRs distributed across the genome and map monogenic disorders to a unique locus using linkage analysis in family pedigrees. This approach has been employed successfully in the search for rare, familial mutations with high penetrance (in other words, when the phenotypic expression of the rare mutation is quite significant). For more complex disorders, in which multiple genes each contribute a small effect, the whole genome scan approach has been more problematic for several reasons. It is difficult to discriminate the effect of single genes in complex disorders, which, by definition, are characterized by interactions between multiple genes. So far, inadequate coverage of the genome with mapped SNPs together with technical challenges have undermined the effectiveness of this approach. Previously, a whole genome scan identified one or more genetic markers within a specific region of a chromosome but did not pinpoint the specific gene or the significant change(s) in a gene's sequence that ultimately alter the phenotype. Recently, many investigators have turned back to SNPs, planning to capitalize on several important features. First, SNPs are more abundant than STRs. Second, they are more stable than short nucleotide repeats. The high degree of variation in STRs represents a formidable barrier to population-based studies, such as those frequently employed in candidate genetic association studies. Lastly, a subset of SNPs (see below) affect the biological properties or expression of a gene product, thus providing a functional component that can neatly tie in with in vitro studies. In the future, the catalog of 'functional' SNPs will be useful in mapping complex trait diseases, while at the same time providing insights into the mechanisms of disease or treatment outcomes.
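Two of the definitions used in Section 2 can be sketched in code: the working definition of a SNP (a variant with a frequency above 1% in at least one population) and the distinction between transitions and transversions. This is a minimal illustration under stated assumptions; the function names and the population labels in the usage note are hypothetical and do not come from any of the databases discussed here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_snp(variant_freqs, threshold=0.01):
    """Apply the working definition: frequency above 1% in at least one population.

    variant_freqs maps a population label to the variant's frequency there.
    """
    return any(freq > threshold for freq in variant_freqs.values())

def substitution_class(ref, alt):
    """Classify a single-base substitution as a transition or a transversion.

    A transition swaps purine for purine (A<->G) or pyrimidine for pyrimidine
    (C<->T); every other single-base change is a transversion.
    """
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref == alt or ref not in valid or alt not in valid:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

For example, `substitution_class("C", "T")` returns `"transition"`, the class of change associated with deamination at CpG sites, and `is_snp({"population_1": 0.004, "population_2": 0.03})` returns `True` because the 1% threshold is exceeded in one population.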

3. A book of past lives: The catalog of SNPs

It is safe to say that we are still in the initial phase of discovery with respect to SNPs. There are a number of web-based tools designed to interrogate public databases in search of SNPs (see Table 1). The National Center for Biotechnology Information is curating a public website, dbSNP, for deposition of SNPs and for linking the SNPs to other web tools useful for investigating genetic information in silico [13]. Already, nearly 1.4 million possible SNPs have been deposited in the SNP Consortium Database for public use, with more expected in the months to come (http://snp.cshl.org/) [4].



Table 1. Selected public resources for SNP analysis

Public databases of SNPs
  PubMed (www.ncbi.nlm.nih.gov/entrez/query.fcgi): Published literature (search by gene). Resource for searching published data.
  dbSNP (www.ncbi.nlm.nih.gov/SNP/): NCBI database of deposited SNPs. Central repository of SNPs.
  The SNP Consortium (snp.cshl.org/): Database of predicted SNPs. Deposited in dbSNP.

Web-based tools for identifying SNPs
  Cancer Genome Anatomy Project-GAI (lpg.nci.nih.gov/GAI/): NCI-based SNP Discovery Project. Gene lists, tools for SNP analysis including predicting non-synonymous SNPs.
  SNP pipeline (lpgws.nci.nih.gov:82/perl/snp/snp_cgi.p): CGAP-GAI search of EST/Unigene. Search EST sequences for SNPs.
  Leelab SNP Database (www.bioinformatics.ucla.edu/snp/): UCLA search of EST/Unigene. Search EST sequences for SNPs.

Databases of SNPs with annotation
  HG Base (Hgbase.cgr.ki.se/): International database. Repository for SNPs.
  Immunology-SNP database (www-dcs.nci.nih.gov/pedonc/ISNP/): Curated collection of immunologically significant SNPs.
  University of Utah Genome Center GeneSNPs (http://www.genome.utah.edu/genesnps.old/): SNP database of known genes with SNPs. Curated collection of SNPs derived from public databases.

The importance of validating SNPs cannot be overemphasized. Sophisticated web tools have been developed to search public databases of expressed sequence tags (ESTs) in search of putative SNPs [14,15]. Recently, it has been suggested that roughly three-fourths of the SNPs in two large public databases can be validated at a high frequency (> 20% for the minor allele) [14-16]. Others have reported that using in silico tools yields relatively high specificity, but low sensitivity in SNP validation [17]. Still, candidate SNPs need to be validated in genomic assays that amplify the specific locus of interest from a unique chromosomal location, determine that the variant exists by one of several technologies (see below) and demonstrate its Mendelian inheritance. So far, this approach has been biased towards SNPs that arise in unique regions of the genome. On the other hand, using PCR technology, it is difficult to amplify regions replete with a high density of redundancies, where informative SNPs (or STRs) could still be positioned. While most SNPs are neutral and do not alter the phenotype, there is a subset of SNPs that are predicted to change the phenotype of the gene. These are the most interesting targets for investigating the contribution of SNPs to disease. Traditionally, SNPs in the coding region have been of particular interest because an amino acid substitution can alter the function of a protein. These are known as non-synonymous coding SNPs and are less common than synonymous SNPs in the coding region (i.e., those that do not result in the substitution of an amino acid). Early studies have confirmed that non-synonymous coding SNPs are rarer, and thus support the hypothesis that greater selective pressure is required to effect a functionally significant amino acid substitution [5,6]. The same can be argued on behalf of variants critical for the regulation of a gene's expression. In the rush to catalog SNPs, many groups have concentrated on identifying non-synonymous SNPs and, in some cases, developing sophisticated web tools to browse public databases to display in silico translation predictions of validated SNPs (http://lpgws.nci.nih.gov:82/perl/snp2ref). On the other hand, it is harder to identify SNPs that alter the regulation of a gene, namely its expression. This class of SNPs is particularly important in complex pathways, such as an immunological cytokine network or the coagulation cascade, where small differences in expression can have pleiotropic effects downstream, in either amplification or dampening of a response pathway [18]. Since the field of promoter analysis and gene regulation is highly specific to unique sequence motifs, it is unlikely that a comparable search program will facilitate identification of informative promoter SNPs. As mentioned above, there is great interest in the cataloging of "functionally" important SNPs, namely, those SNPs that might directly alter the phenotypic expression of the gene. The field is beginning to address the importance of correlating phenotype with SNPs, specifically within pathways or collections of genes that fit a biological paradigm. Based upon the number of genes in the human genome (roughly 35,000), a SNP density of 1 in 1.8 kb and the average size of the coding region, 5' and 3' UTRs (1.34 kb, 300 bp and 770 bp, respectively), it is possible to estimate that there are perhaps as few as 55,000 or as many as 250,000 functionally interesting SNPs. The challenge is to find these SNPs and apply them in well-designed molecular epidemiology studies. A public effort at the National Cancer Institute, the Genetic Annotation Initiative of the Cancer Genome Anatomy Project, is re-sequencing the entire coding region and 5'/3' untranslated regions of candidate genes of importance to cancer biology (http://lpg.nci.nih.gov/GAI/). This directed approach has been employed by other groups in an effort to validate common SNPs in genes of immediate importance to one or more common complex disorders (e.g., diabetes, neurodegenerative diseases or cancer) [5,6]. Taking the next step, several web tools have been developed to curate SNPs within biological pathways; one such example is the first generation of the Immunology-SNP database (http://www-dcs.nci.nih.gov/pedonc/ISNP/), which catalogs polymorphisms, primarily SNPs, in genes of immunological importance. Comparable efforts are underway to develop virtual links between in vitro biological observations, SNPs and genetic association studies. In parallel, an international web-based effort is underway to annotate genes, linking them with function in specific pathways and families; this is known as the Gene Ontology Project (http://www.geneontology.org) [19]. There is no doubt that this resource will be invaluable in the search to map SNPs, both as genetic markers and, in informative cases, as modifiers of disease outcome. The SNP Consortium has deliberately sought to validate over 100,000 SNPs throughout the genome, without a bias towards coding or regulatory SNPs [4].
This collection will be invaluable for large scale linkage studies and, in selected circumstances, genetic association studies. A formidable number of SNPs, at least several hundred thousand, is required to provide a SNP density of one every 3 kb; this is based upon inferred linkage disequilibrium calculated for the whole genome [20-22]. As mentioned above, linkage disequilibrium may not be evenly present throughout the genome. Consequently, predictions for the number of SNPs required to conduct whole genome scans in population-based studies have been decreased, but not by an order of magnitude. Another, more directed approach is to annotate SNPs in the roughly 35,000 known genes. This approach, known as an 'intelligent SNP' scan, favors identifying functionally significant SNPs. However, this bias, as attractive as it is for interrogating genes, does not provide sufficient coverage for linkage analysis of intergenic regions, which could still influence expression or, less likely, function. Nonetheless, in the future, one could imagine a more refined form of an 'intelligent scan', one that uses validated 'functional' SNPs to conduct scans of the genome in well-defined population-based studies. For that matter, one can envision studies that analyze only SNPs derived from biologically critical pathways.
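The distinction drawn in this section between synonymous and non-synonymous coding SNPs can be illustrated with a short sketch of an in silico translation check. This is a minimal illustration in Python using the standard genetic code; it is not the method used by any of the web-tools cited, and the function name `classify_coding_snp` and its arguments are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons ordered
# T, C, A, G at each position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, alt_base):
    """Classify a single-base change within one codon (illustrative helper).

    codon: reference codon on the coding strand, e.g. "GAG"
    position: 0-based index (0, 1 or 2) of the changed base
    alt_base: the variant base at that position
    Returns (reference amino acid, variant amino acid, label).
    """
    codon = codon.upper()
    alt_base = alt_base.upper()
    if len(codon) != 3 or any(base not in BASES for base in codon):
        raise ValueError("codon must be three DNA bases")
    if position not in (0, 1, 2) or alt_base not in BASES:
        raise ValueError("position must be 0-2 and alt_base one of T, C, A, G")
    variant = codon[:position] + alt_base + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[variant]
    # A change to or from a stop codon is also reported as non-synonymous.
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, label
```

For instance, `classify_coding_snp("GAG", 1, "T")` reports a glutamate-to-valine substitution as non-synonymous, whereas `classify_coding_snp("GAA", 2, "G")` leaves glutamate unchanged and is reported as synonymous. A sketch of this kind says nothing about regulatory SNPs, which is the harder problem noted above.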

4. Applying SNPs to the study of disease: The changing landscape

There is ample evidence to suggest that in common complex trait diseases, genetic factors contribute to the disease process, but it is most likely that multiple genes contribute to disease susceptibility. Although the effect of any single variant is probably small, combinations of SNPs, either as haplotypes or between distant genes, may coordinately contribute to disease risk. Still, the net effect of each SNP is small, making it especially difficult to identify key contributors. There have been a few examples in which linkage studies have localized a variant later shown to confer risk (e.g., APOE and Alzheimer's disease) [23,24]. Still, the utility of whole genome scans is limited by the low penetrance of many common variants.

Genetic association studies have emerged as the primary method of studying the effect of SNPs on disease outcomes, whether it is susceptibility to a disease or differences in phenotypic expression of a known gene. A genetic association study is designed to evaluate the contribution of one or more SNPs to well-defined clinical endpoints. Success is predicated on accurate determination of clinical outcomes in a well-characterized population-based case control or cohort study. In either a cohort or case control study, statistical significance is determined by comparing the distribution of genotypes in two groups, one of which serves as a "control", in order to evaluate the effect of a variant on a clinical endpoint. Usually, these comparisons are conducted in population-based studies consisting of unrelated subjects. Family-based analysis can be useful, too, especially in transmission disequilibrium studies.

Until recently, genetic association studies have been limited to studies involving one or a few candidate genes, using classical epidemiological tools designed to evaluate the effect of a gene and its variant(s) in a population-based case control or cohort study. Typically, an interesting variant was discovered in a gene and studied in a population-based study if a plausible mechanism could be advanced. The emerging field of molecular epidemiology has been driven by the candidate gene approach, namely, designing studies with known variants in which prior biology or association studies justified study.

The landscape has changed dramatically with the explosion of knowledge. The current problem lies in extracting useful and informative SNPs from the public databases, which include millions of putative SNPs. In this regard, the gap between the identification of a variant and an understanding of its biological significance is widening at an accelerated pace. It is the annotation and possible functional significance that highlights the importance of looking at candidate SNPs that will be of immediate interest to investigators conducting population-based studies. Most experts would agree that until the cost and efficiency of high-throughput platforms for SNP analysis are markedly improved, the candidate gene(s) approach will be preferred. Already, new technologies have moved the field from the interrogation of single SNPs to as many as several dozen SNPs in a well-defined population study.

The choice of SNPs for a new study should take into consideration several important factors. One is the frequency of the SNP in the population of study; there is great diversity in the frequency of common SNPs between different populations. Second, a plausible mechanism should underlie the choice of genes or families of genes. In other words, a sound biological basis is important in designing population-based genetic association studies. The literature is filled with small pilot studies that are not confirmed by subsequent, better designed studies. Two major reasons account for this recurrent problem: small cohorts and poorly tracked collection of data points, often retrospective in execution.
Pilot studies are useful to identify the likely candidate genes to be validated in larger, more focused studies. It can be argued that false positives in pilot studies are greatly preferred to false negatives, mainly because subsequent studies will determine the validity of a finding, whereas a false negative means missing a potentially informative SNP. A series of stringent studies is required before the results of clinical association studies can be applied to clinical decision making [25]. It is also notable that common polymorphisms, especially SNPs, are population specific and have to be viewed in the context of a particular population. Recently, some groups have argued that population stratification (i.e., a heterogeneous mixture of individuals) does not represent a major stumbling block to the execution of well-designed case control studies [26]. Still, the importance of defining the population is evident in interpreting the significance of a SNP. For example, individuals who are heterozygous for the sickle cell mutation in the malaria belt of Western Africa are protected from complications of the infection. In other regions of the world, where the selective pressure of malaria no longer exists, the mutation is viewed differently, as the cause of a life-threatening monogenic disorder. The context of the variant defines whether it is a 'balanced polymorphism', as in regions of Africa, or a highly penetrant, deleterious mutation.

Genetic association studies can be used to address two fundamentally different questions. The first seeks to identify genetic variants that influence susceptibility to a complex, multi-factorial disorder. The second addresses differences in outcomes within a disease population, including differences in responses to medications. The latter is known as pharmacogenomics, a burgeoning field that promises to revolutionize medical treatment in the future [25].
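The comparison of genotype distributions between cases and controls described in this section can be sketched as a Pearson chi-square test on a 2 x 3 table of genotype counts. This is only one of several tests used in practice and is offered as an illustration, not as the method of any study cited here; the counts in the usage note are invented, the function name is hypothetical, and issues such as covariates, stratification and multiple comparisons are not addressed.

```python
import math


def genotype_chi_square(case_counts, control_counts):
    """Pearson chi-square test comparing genotype counts in two groups.

    case_counts, control_counts: counts of the three genotypes, in the
    same order (e.g. common homozygote, heterozygote, rare homozygote).
    Returns (chi-square statistic, p-value for 2 degrees of freedom).
    """
    table = [list(case_counts), list(control_counts)]
    if len(table[0]) != 3 or len(table[1]) != 3:
        raise ValueError("expected three genotype counts per group")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every group and every genotype needs a non-zero total")

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    # For 2 degrees of freedom the chi-square survival function
    # has the closed form exp(-x / 2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value
```

With invented counts of (30, 50, 20) in cases and (50, 40, 10) in controls, the statistic is about 9.44 and the p-value falls below 0.01. With the small cohorts typical of pilot studies, the expected counts for the rare homozygote can be too low for this approximation to hold, which is one concrete way in which small studies produce unreliable findings.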

5. SNPs and susceptibility to disease

Candidate gene selection based upon a proposed role for the variant in disease susceptibility has assumed a central role in conducting genetic association studies. More often than not, the genes are chosen a priori, based upon a hypothesis, and studied in a suitable population-based study. The net effect of any single SNP, especially for a common disorder, is generally small, making it difficult to conduct linkage analysis, even with low confidence levels [27]. Still, a number of important studies have demonstrated the utility of examining candidate genes in a range of diseases, including cancer and diabetes [28-30]. In some circumstances, the contribution of a functionally important SNP is population specific; this has been elegantly shown with a population-specific TNF SNP and susceptibility to malaria in a region of West Africa [31]. Common diseases, such as cancer or hypertension, have the advantage that sufficiently large cohorts are available to examine the question in different populations. In rarer disorders, the collection of adequate numbers of subjects with firm clinical endpoints is a daunting challenge. After a sufficient number of studies have been published, it is possible to conduct case-series meta-analyses to assess the net effect of the variant. For example, the importance of two separate genetic variants has been shown in bladder cancer [29,32,33]. With respect to both genes, the odds ratio is greater than one but less than two, indicating a modest but real effect.
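To make the notion of an odds ratio between one and two concrete, the sketch below computes an odds ratio and an approximate 95% confidence interval from a 2 x 2 table of carriers and non-carriers among cases and controls. It uses the standard log-based (Woolf) interval, which is an assumption made here for illustration and not necessarily the method of the meta-analyses cited; the function name is hypothetical and the counts in the usage note are invented.

```python
import math


def odds_ratio_with_ci(case_carriers, case_noncarriers,
                       control_carriers, control_noncarriers, z=1.96):
    """Odds ratio and approximate confidence interval for a 2 x 2 table.

    z is the normal quantile for the desired interval (1.96 for 95%).
    Returns (odds ratio, lower bound, upper bound).
    """
    cells = (case_carriers, case_noncarriers,
             control_carriers, control_noncarriers)
    if any(count <= 0 for count in cells):
        raise ValueError("all four cells must be positive for this method")

    odds_ratio = (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers)
    # Standard error of the log odds ratio (Woolf's method).
    se_log_or = math.sqrt(sum(1.0 / count for count in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

With invented counts of 120 carriers and 80 non-carriers among cases, and 100 of each among controls, the odds ratio is 1.5 and the lower bound of the interval lies only just above one. An effect of this size is easily missed or overstated in a small study, which is why pooling results across studies is attractive.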

6. Understanding outcomes in a disease: SNPs as disease modifiers

It has long been appreciated that among individuals with a monogenic disorder, such as cystic fibrosis or chronic granulomatous disease, there can be heterogeneity in the disease course. Even within families, outcomes vary greatly between members with an identical primary mutation, suggesting that other genetic factors modify outcomes. Already, a number of seminal studies have identified modifying SNPs in different populations afflicted with a life-threatening monogenic disorder. If this trend is confirmed in further studies, our notion of a monogenic disorder will have to be "modified" to include an appreciation of secondary genes, which influence outcomes. In the future, genetic counseling might include analysis of secondary modifying genes, particularly in an effort to institute preventive measures that avoid or ameliorate serious complications.

It is notable that under stress conditions, such as a primary immunodeficiency, common polymorphisms, mainly SNPs, can act as modifying genes for specific outcomes. In chronic granulomatous disease, a primary immunodeficiency of innate immunity (i.e., neutrophils and monocytes fail to effectively kill invading organisms), individuals who inherit a variant of one of the low affinity Fc gamma receptors (FCGR2A or FCGR3B) or of the myeloperoxidase gene (MPO) are at increased risk for immunologically mediated gastrointestinal complications [34]. In a preliminary analysis, combinations of the three informative variants conferred a higher risk for this type of complication. This pilot study illustrates how the phenotypic significance of a SNP in a pathway not directly affected by the primary mutation can be intensified, and it provides an opportunity to investigate the contribution of genes in vivo [35]. Certainly, it is possible that we will better appreciate the relative importance of genes within a pathway or biological process by examining the effects of their variants.
Simply put, it is possible to observe the relative importance of specific genes by studying the effects of their variations in large population-based studies.

By no means should this paradigm be restricted to monogenic disorders. The study of host genetic factors has been extremely informative in dissecting the molecular events associated with both susceptibility to and progression of infection with HIV-1. Interestingly, the same variants that increase the likelihood of acquiring infection also appear to accelerate the course of disease overall. Much of this work has focused on chemokines and chemokine receptors, namely CCR2, CCR5, SDF1 and RANTES, which are critical co-factors for infection with HIV [36-39]. Disease progression has also been associated with specific polymorphisms of the MHC complex. In particular, heterozygosity at the class I MHC loci, namely HLA-A, B and C, was shown to provide a selective advantage, slowing progression of HIV infection [40]. Furthermore, genetic factors that alter the risk for developing one of several life-threatening complications of HIV-1 infection have been identified by genetic association studies. For example, a common variant of the low affinity Fc gamma receptor, FCGR3A, increases the risk for developing Kaposi's Sarcoma (KS) in HIV-infected men who are also co-infected with the Kaposi's Sarcoma herpes virus, KSHV [41,42]. KS afflicted roughly 20% of infected men in the USA prior to the development of highly active antiretroviral therapy.

7. Pharmacogenomics

The term pharmacogenomics refers to the field of study seeking an association between pharmacological phenotypes and common genetic variations, namely SNPs. The underlying principle is that genetic variations can account for differences in response to drugs. This is not to ignore other reasons for an adverse or inadequate response to a drug. Certainly, the failure to respond can also be attributed to allergies, drug-drug interactions and erroneous prescriptions (either inappropriate dosings or medications). Still, the SNP revolution offers an opportunity to generate genetic profiles that will be useful in the choice of medications. This pertains not only to choosing the best drug, based upon a genetic pattern, but also to avoiding selected drugs in individuals in whom the risk for a serious adverse event is high.

From a practical point of view, the applicability of this approach has been demonstrated in principle, with several key examples. Dissecting the role of different steps at the genetic level has been guided by deconstruction of the fate of medicinal drugs. These steps include the drug target, transport and metabolism. So far, pilot studies have not evaluated the drug as a foreign antigen or a possible allergen. For instance, it has been shown that patients homozygous for the G17R variant in the beta-2 adrenergic receptor are at risk for exacerbation of asthmatic attacks when using albuterol treatment [43,44]. An example of a variant with a frequency of less than 1% but with important clinical consequences is the TPMT variant (roughly 0.3%). Rare individuals who are homozygous for the TPMT variant develop severe toxicity when treated with azathioprine therapy for either leukemia or auto-immune diseases [45,46]. Thus, clinical decisions can be based upon knowledge of the genetic profile, which in the future could allow physicians to tailor therapies on an individual basis.

It is conceivable that SNP studies will also define suitable targets for therapeutic intervention. Though the field is early in its genesis, it is notable that studies in monogenic disorders could generate important insights that give rise to specific therapies. For example, in cystic fibrosis, several groups have confirmed that the presence of a common, non-synonymous SNP in the mannose binding lectin gene, MBL2, which is located on a separate chromosome from the CFTR gene, is an adverse risk factor for pulmonary outcomes. Informative MBL2 variants (clustered in exon 1) have an overall frequency of approximately 35% and alter both function and circulating levels of the protein, a C-type collectin critical for recognition of pathogens. In the setting of CF, a variant that is phenotypically insignificant in the normal host impacts pulmonary defenses against pathogenic bacteria [47]. Based on the available data, recombinant mannose binding lectin could be given to CF patients who co-inherit one of the common MBL2 variants. Directing therapy, even in a rare, monogenic disorder, holds tremendous potential for treatment in the future.
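The TPMT example turns on the difference between the frequency of a variant allele and the frequency of individuals who are homozygous for it. Under Hardy-Weinberg equilibrium, an assumption introduced here that the text does not state, the two are related as in the sketch below. The text does not specify whether the figure of roughly 0.3% refers to the allele or to homozygous individuals, so the usage note shows both readings without choosing between them; the function names are hypothetical.

```python
import math


def genotype_frequencies(variant_allele_frequency):
    """Hardy-Weinberg genotype frequencies for a biallelic locus."""
    q = variant_allele_frequency
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    p = 1.0 - q
    return {
        "homozygous_common": p * p,
        "heterozygous": 2.0 * p * q,
        "homozygous_variant": q * q,
    }


def allele_frequency_from_homozygotes(homozygote_frequency):
    """Variant allele frequency implied by a homozygote frequency."""
    if not 0.0 <= homozygote_frequency <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    return math.sqrt(homozygote_frequency)
```

If 0.3% is read as the allele frequency, `genotype_frequencies(0.003)` implies that fewer than one person in 100,000 is homozygous. If it is read as the frequency of homozygous individuals, `allele_frequency_from_homozygotes(0.003)` gives an allele frequency of about 5.5%, and roughly one person in ten would be a heterozygous carrier. Either way, the clinically important genotype is far rarer than the allele itself.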

8. The labor of SNP detection

The technical platforms for detecting SNPs are rapidly changing. Currently, half a dozen different assay systems are commercially available to discriminate single base changes following amplification of a unique amplicon. The flanking region and the informative SNP are amplified by PCR technology, usually from genomic DNA. Unlike cDNA array studies, which capture the full complement of messenger RNA using a common oligo dT primer, SNP analysis requires a unique set of oligonucleotide primers for amplifying each individual SNP. In this regard, each assay has to be optimized, even before multiplexing, making large scale analysis more cumbersome in development and execution. A number of promising platforms (see Table 2) have been developed that can increase the throughput of SNP detection and, in some cases, analyze multiple SNPs in one reaction (but not at a scale comparable to the 50,000 messages analyzed in cDNA array studies). Price and effort have both been reduced, but the platforms are still not sufficiently economical to support large scale genome wide studies, which require 100,000 or more unique SNPs. Consequently, to analyze this number of SNPs in a population study, a million or more genotypes would be required. Therefore, it is not surprising that the candidate gene(s) approach has been favored until now and will be for the foreseeable future.

In response to the technical challenges of large scale SNP analysis, many groups have turned to an approach called DNA pooling. Instead of analyzing individual SNPs in each sample, pooled DNA (e.g., equal aliquots of DNA from a large number of subjects) is analyzed for allelic frequency [48-50]. Naturally, labor and cost are substantially reduced, which is particularly advantageous in a pilot study seeking to identify excellent candidate SNPs for confirmation in validation studies. At the same time, large scale screening can provide sufficient power to minimize the likelihood of identifying false positives. Pooling studies have the disadvantage of analyzing only the allelic frequency of SNPs, and only in pools defined by the initial design. Thus, it is difficult to analyze more complex questions of outcomes or subgroups unless they are chosen a priori in the design of the pooling studies. Still, this approach is extremely promising for the study of complex, multigenic common disorders [49,51]. In this regard, pooling is an effective approach for screening candidate genes that could influence susceptibility to an exposure or a disease (e.g., in large scale cancer cohort studies).
The strategy of interrogating pools of DNA has been successfully reported in identifying SNPs contributing to complex diseases and the identification of rare, Mendelian mutations [21,52]. Since we are still in the early phase of developing this approach, there is intense interest in optimization of the number of subjects pooled in conjunction with the platform used to analyze SNPs. In addition, aliquoting equal amounts of genomic DNA is a major technical challenge, which is further magnified by amplification-based technologies for SNP detection. Therefore, the margin for error is small in the execution of DNA pooling studies, especially when differences in the distribution of allelic frequencies might be relatively small, a common finding in genetic association studies.
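Two points from this section lend themselves to a small worked sketch: the number of assays saved by pooling, and the sensitivity of a pooled allele frequency to unequal aliquots. The model below is deliberately idealized. It assumes that the measured signal is exactly proportional to the amount of each allele in the pool and ignores amplification bias; the function names and the example values are hypothetical.

```python
def assays_required(n_snps, n_samples):
    """Genotyping reactions needed when every SNP is typed in every sample.

    A "sample" is either one subject or one DNA pool.
    """
    return n_snps * n_samples


def pooled_variant_frequency(variant_allele_counts, dna_amounts=None):
    """Expected variant allele frequency in a DNA pool (idealized model).

    variant_allele_counts: copies of the variant allele per subject (0, 1 or 2)
    dna_amounts: relative amount of DNA contributed by each subject;
                 equal aliquots are assumed when omitted
    """
    if not variant_allele_counts:
        raise ValueError("the pool must contain at least one subject")
    if dna_amounts is None:
        dna_amounts = [1.0] * len(variant_allele_counts)
    if len(dna_amounts) != len(variant_allele_counts):
        raise ValueError("one DNA amount is needed per subject")
    total_dna = sum(dna_amounts)
    if total_dna <= 0:
        raise ValueError("total DNA amount must be positive")
    # Each subject contributes DNA in proportion to the aliquot; half of
    # that DNA carries the variant allele for every copy the subject has.
    variant_dna = sum(
        amount * count / 2.0
        for amount, count in zip(dna_amounts, variant_allele_counts)
    )
    return variant_dna / total_dna
```

Typing 100,000 SNPs in as few as ten individual subjects already requires a million reactions, whereas a case pool and a control pool require 200,000 regardless of cohort size. In the same model, four subjects carrying 0, 1, 2 and 0 copies of the variant give a pool frequency of 0.375 with equal aliquots, but about 0.58 if the homozygous subject contributes three times as much DNA as the others. When the true difference between case and control frequencies is small, an aliquoting error of that kind can exceed the effect being sought.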

Table 2
Current technical platforms for SNP detection

Generic method                  High-throughput platform    Type of analysis
Direct sequence analysis (a)    Generally not               Qualitative only
  Single base extension         Promising if multiplexed
  Single strand sequence        Limited to single sites
Hybridization methods (a)       Variable                    Quantitative + Qualitative
  Target amplification (b)      Moderately efficient
  Signal amplification (c)      Highly efficient
Microarray (a)                  Highly efficient            Quantitative + Qualitative
Restriction enzyme analysis     Not efficient               Qualitative only
Conformational analysis*        Moderately efficient        Qualitative only

These platforms are also applicable to the detection of rare, highly penetrant mutations.
(a) Relies upon amplification technology (i.e., polymerase chain reaction, PCR) to generate amplicon for analysis.
(b) Example of platform is real-time polymerase chain reaction.
(c) Example of platform is the chip-based matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF).

9. The challenges in SNP analysis that lie ahead

The study of SNPs in human disease is a rich resource for dissecting the genetic contribution to complex-trait diseases and modifiers of monogenic disorders. The extraordinary spectrum of variation is also its weakest link, because each study has to be interpreted in the context of the population examined. The literature is filled with examples of informative SNPs that do not reproduce in different settings. This conundrum underscores the importance of combinations of SNPs; any population-based study of unrelated individuals assumes, for practical purposes, no contribution from the background of other SNPs. This problem raises difficult questions for identifying and validating gene-environment interactions as well as gene-gene interactions. Despite the fact that the effect of a SNP is measured over an extended period of time, it is difficult to dissect the temporal relationship between SNPs without a sound biological model. One of the major hurdles of the future is to develop suitable systems to analyze the complex interactions of SNPs and create a suitable (and reproducible) model that will be useful for clinical implementation of genetic risk factors.

The field is moving towards examining collections of SNPs, which are derived from biological pathways or families of genes. An extension of this approach, known as 'neighboring' SNPs, expands the genes under study to include those that interact up and downstream from the core set. By saturating a pathway, one can evaluate changes that might dampen or amplify a cascade (e.g., the complement cascade). We are early in the study of SNPs, which, for the purpose of initial studies, are viewed as individual units. However, SNPs form haplotypes, and it is the investigation of haplotypes that will probably be most informative, both as genetic markers and as tools to correlate genetic variation with functional outcomes (i.e., clinical states). How many haplotypes exist for a given cluster of SNPs, either in one gene or in a group of closely positioned genes, remains to be determined, but it is critical to pursue this approach [27,53]. Since haplotypes are defined by the blocks of genes which maintain variants already in place, such as SNPs, the catalog of SNPs will be an invaluable tool in defining haplotypes, especially ones that include variants that functionally alter the gene or gene product. If, indeed, large scale sequencing is possible in the near future, then the field will have to analyze haplotypes (and re-analyze current studies when possible). The technical platforms will need to be more flexible and span greater distances between variants to capture the informative components of haplotypes.

Lastly, the most difficult task will be to consider the implementation of SNPs in clinical decision making, particularly as it relates to providing recommendations for interventional or preventive measures based upon the concept of "risk". Together with ethicists, a dialogue must begin to address how and in what manner to use genetic information, especially when the consequences of the information have pleiotropic implications for personal security, insurance and health.
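The question of how many haplotypes exist for a cluster of SNPs can be made concrete with a small sketch. For n biallelic SNPs there are at most 2^n possible haplotypes, and the number actually observed in a sample may be smaller. The code assumes that phase is already known, that is, that each chromosome is given as a string of alleles in SNP order; inferring phase from unphased genotypes is a separate problem not attempted here. The function names and example data are hypothetical.

```python
from collections import Counter


def max_haplotypes(n_snps):
    """Upper bound on distinct haplotypes for n biallelic SNPs."""
    if n_snps < 0:
        raise ValueError("number of SNPs cannot be negative")
    return 2 ** n_snps


def haplotype_frequencies(phased_chromosomes):
    """Relative frequency of each haplotype among phased chromosomes.

    phased_chromosomes: iterable of strings, one per chromosome, giving
    the allele at each SNP in a fixed order (e.g. "ACG").
    """
    counts = Counter(phased_chromosomes)
    total = sum(counts.values())
    return {haplotype: n / total for haplotype, n in counts.items()}
```

For an invented sample of six chromosomes, `haplotype_frequencies(["ACG", "ACG", "GTG", "ACA", "ACG", "GTG"])` finds three distinct haplotypes out of the eight that three biallelic SNPs allow, with "ACG" accounting for half of the chromosomes.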

10. Conclusion

We are at the beginning of an era when we can investigate the functional implications of single nucleotide polymorphisms (SNPs) and other, rarer variants. The potential usefulness in medicine is unprecedented. Defining risk factors for disease and pharmacological outcomes based upon genetic profiles of SNPs could revolutionize medical care. It also comes at a dangerous cost: potential political and philosophical challenges, which must be addressed in parallel, or actually in advance, if we are to protect the rights and will of the individual. Still, these small differences cumulatively have a staggering effect, creating the individuality that we recognize in each person while at the same time reflecting the changes that have taken place over generations, many in response to environmental and pathogenic challenges. The opportunity to annotate the differences between individuals has provided an extraordinarily rich resource for investigating complex genetic events, particularly as they relate to disease susceptibility and population genetics.

References

[1] E.S. Lander, L.M. Linton and B. Birren et al., Initial sequencing and analysis of the human genome, Nature 409 (2001), 860-921.
[2] J.C. Venter, M.D. Adams and E.W. Myers et al., The sequence of the human genome, Science 291 (2001), 1304-1351.
[3] L. Kruglyak and D.A. Nickerson, Variation is the spice of life, Nat Genet 27 (2001), 234-236.
[4] R. Sachidanandam, D. Weissman and S.C. Schmidt et al., A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms, Nature 409 (2001), 928-933.
[5] M. Cargill, D. Altshuler and J. Ireland et al., Characterization of single-nucleotide polymorphisms in coding regions of human genes, Nat Genet 22 (1999), 231-238.
[6] M.K. Halushka, J.B. Fan and K. Bentley et al., Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis, Nat Genet 22 (1999), 239-247.
[7] P. Taillon-Miller, I. Bauer-Sardina and N.L. Saccone et al., Juxtaposed regions of extensive and minimal linkage disequilibrium in human Xq25 and Xq28, Nat Genet 25 (2000), 324-328.
[8] I.A. Eaves, T.R. Merriman and R.A. Barber et al., The genetically isolated populations of Finland and Sardinia may not be a panacea for linkage disequilibrium mapping of common disease genes, Nat Genet 25 (2000), 320-323.
[9] J.G. Hacia, J.B. Fan and O. Ryder et al., Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays, Nat Genet 22 (1999), 164-167.
[10] B.K. Duncan and J.H. Miller, Mutagenic deamination of cytosine residues in DNA, Nature 287 (1980), 560-561.
[11] R. Chakraborty, M. Kimmel, D.N. Stivers, L.J. Davison and R. Deka, Relative mutation rates at di-, tri-, and tetranucleotide microsatellite loci, Proc Natl Acad Sci USA 94 (1997), 1041-1046.
[12] E.S. Lander and N.J. Schork, Genetic dissection of complex traits, Science 265 (1994), 2037-2048.

[13] E.M. Smigielski, K. Sirotkin, M. Ward and S.T. Sherry, dbSNP: a database of single nucleotide polymorphisms, Nucleic Acids Res 28 (2000), 352-355.
[14] K.H. Buetow, M.N. Edmonson and A.B. Cassidy, Reliable identification of large numbers of candidate SNPs from public EST data, Nat Genet 21 (1999), 323-325.
[15] K. Irizarry, V. Kustanovich and C. Li et al., Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences, Nat Genet 26 (2000), 233-236.
[16] G. Marth, R. Yeh and M. Minton et al., Single-nucleotide polymorphisms in the public domain: how useful are they? Nat Genet 27 (2001), 371-372.
[17] D. Cox, C. Boillot and F. Canzian, Data mining: efficiency of using sequence databases for polymorphism discovery, Human Mutation 17 (2001), 141-150.
[18] C.B. Foster and S.J. Chanock, Mining variations in genes of innate and phagocytic immunity: current status and future prospects, Curr Opin Hematol 7 (2000), 9-15.
[19] M. Ashburner, C.A. Ball and J.A. Blake et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat Genet 25 (2000), 25-29.
[20] L. Kruglyak, Prospects for whole-genome linkage disequilibrium mapping of common disease genes, Nat Genet 22 (1999), 139-144.
[21] A. Collins, C. Lonjou and N.E. Morton, Genetic epidemiology of single-nucleotide polymorphisms, Proc Natl Acad Sci USA 96 (1999), 15173-15177.
[22] J. Ott, Predicting the range of linkage disequilibrium, Proc Natl Acad Sci USA 97 (2000), 2-3.
[23] M.A. Pericak-Vance, J.L. Bebout and P.C. Gaskell Jr. et al., Linkage studies in familial Alzheimer disease: evidence for chromosome 19 linkage, Am J Hum Genet 48 (1991), 1034-1050.
[24] W.J. Strittmatter, A.M. Saunders and D. Schmechel et al., Apolipoprotein E: high-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease, Proc Natl Acad Sci USA 90 (1993), 1977-1981.
[25] F.S. Collins, Shattuck lecture - medical and societal consequences of the Human Genome Project, N Engl J Med 341 (1999), 28-37.
[26] S. Wacholder, N. Rothman and N. Caporaso, Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias, J Natl Cancer Inst 92 (2000), 1151-1158.
[27] N.J. Risch, Searching for genetic determinants in the new millennium, Nature 405 (2000), 847-856.
[28] S.J. London, T.A. Lehman and J.A. Taylor, Myeloperoxidase genetic polymorphism and lung cancer risk, Cancer Res 57 (1997), 5001-5003.
[29] P.M. Marcus, R.B. Hayes and P. Vineis et al., Cigarette smoking, N-acetyltransferase 2 acetylation status, and bladder cancer risk: a case-series meta-analysis of a gene-environment interaction, Cancer Epidemiol Biomarkers Prev 9 (2000), 461-467.
[30] D. Altshuler, J.N. Hirschhorn and M. Klannemark et al., The common PPARgamma Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes, Nat Genet 26 (2000), 76-80.
[31] J.C. Knight, I. Udalova and A.V. Hill et al., A polymorphism that affects OCT-1 binding to the TNF promoter region is associated with severe malaria, Nat Genet 22 (1999), 145-150.
[32] L.E. Johns and R.S. Houlston, Glutathione S-transferase mu 1 (GSTM1) status and bladder cancer risk: a meta-analysis, Mutagenesis 15 (2000), 399-404.
[33] P.M. Marcus, P. Vineis and N. Rothman, NAT2 slow acetylation and bladder cancer risk: a meta-analysis of 22 case-control studies conducted in the general population, Pharmacogenetics 10 (2000), 115-122.
[34] C.B. Foster, T. Lehrnbecher and F. Mol et al., Host defense molecule polymorphisms influence the risk for immune-mediated complications in chronic granulomatous disease, J Clin Invest 102 (1998), 2146-2155.
[35] S.J. Chanock and C.B. Foster, SNPing away at innate immunity, J Clin Invest 104 (1999), 369-370.
[36] M. Dean, M. Carrington and C. Winkler et al., Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study, Science 273 (1996), 1856-1862.
[37] M.W. Smith, M. Dean and M. Carrington et al., Contrasting genetic influence of CCR2 and CCR5 variants on HIV-1 infection and disease progression. Hemophilia Growth and Development Study (HGDS), Multicenter AIDS Cohort Study (MACS), Multicenter Hemophilia Cohort Study (MHCS), San Francisco City Cohort (SFCC), ALIVE Study, Science 277 (1997), 959-965.
[38] D.H. McDermott, M.J. Beecroft and C.A. Kleeberger et al., Chemokine RANTES promoter polymorphism affects risk of both HIV infection and disease progression in the Multicenter AIDS Cohort Study, AIDS 14 (2000), 2671-2678.
[39] P.A. Zimmerman, A. Buckler-White and G. Alkhatib et al., Inherited resistance to HIV-1 conferred by an inactivating mutation in CC chemokine receptor 5: studies in populations with contrasting clinical phenotypes, defined racial background, and quantified risk, Mol Med 3 (1997), 23-36.
[40] M. Carrington, G.W. Nelson and M.P. Martin et al., HLA and HIV-1: heterozygote advantage and B*35-Cw*04 disadvantage, Science 283 (1999), 1748-1752.
[41] C.B. Foster, T. Lehrnbecher and S. Samuels et al., An IL6 promoter polymorphism is associated with a lifetime risk of development of Kaposi sarcoma in men infected with human immunodeficiency virus, Blood 96 (2000), 2562-2567.
[42] T.L. Lehrnbecher, C.B. Foster and S. Zhu et al., Variant genotypes of FcgammaRIIIA influence the development of Kaposi's sarcoma in HIV-infected men, Blood 95 (2000), 2386-2390.
[43] E. Israel, J.M. Drazen and S.B. Liggett et al., The effect of polymorphisms of the beta(2)-adrenergic receptor on the response to regular use of albuterol in asthma, Am J Respir Crit Care Med 162 (2000), 75-80.
[44] E. Israel, J.M. Drazen and S.B. Liggett et al., Effect of polymorphism of the beta(2)-adrenergic receptor on response to regular use of albuterol in asthma, Int Arch Allergy Immunol 124 (2001), 183-186.
[45] L. Lennard, J.A. Van Loon and R.M. Weinshilboum, Pharmacogenetics of acute azathioprine toxicity: relationship to thiopurine methyltransferase genetic polymorphism, ClinPharma-co
[46] E.Y. Krynetski, H.L. Tai and C.R. Yates et al., Genetic polymorphism of thiopurine S-methyltransferase: clinical importance and molecular mechanisms, Pharmacogenetics 6 (1996), 279-290.
[47] P. Garred, T. Pressler and H.O. Madsen et al., Association of mannose-binding lectin gene heterogeneity with severity of lung disease and survival in cystic fibrosis, J Clin Invest 104 (1999), 431-437.
[48] N. Arnheim, C. Strange and H. Erlich, Use of pooled DNA samples to detect linkage disequilibrium of polymorphic restriction fragments and human disease: studies of the HLA class II loci, Proc Natl Acad Sci USA 82 (1985), 6970-6974.
[49] S.H. Shaw, M.M. Carrasquillo, C. Kashuk, E.G. Puffenberger and A. Chakravarti, Allele frequency distributions in pooled DNA samples: applications to mapping complex disease genes, Genome Res 8 (1998), 111-123.
[50] G. Breen, D. Harold, S. Ralston, D. Shaw and D. St Clair, Determining SNP allele frequencies in DNA pools, BioTechniques 28 (2000), 464-466, 468, 470.
[51] S. Germer, M.J. Holland and R. Higuchi, High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR, Genome Res 10 (2000), 258-266.
[52] L.F. Barcellos, W. Klitz and L.L. Field et al., Association mapping of disease loci, by use of a pooled DNA genomic screen, Am J Hum Genet 61 (1997), 734-747.
[53] P.H. Joosten, M. Toepoel, E.C. Mariman and E.J. Van Zoelen, Promoter haplotype combinations of the platelet-derived growth factor alpha-receptor gene predispose to human neural tube defects, Nat Genet 27 (2001), 215-217.


A comparison of gene expression signatures from breast tumors and breast tissue derived cell lines

Douglas T. Ross^a and Charles M. Perou^b,*
^a Applied Genomics, Inc., Sunnyvale, CA, USA
^b Dept. of Genetics and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, NC, USA

Cell lines derived from human tumors have historically served as the primary experimental model system for exploration of tumor cell biology and pharmacology. Cell line studies, however, must be interpreted in the context of artifacts introduced by selection and establishment of cell lines in vitro. This complication has made it difficult to extrapolate biology observed in cell lines to tumor biology in vivo. Modern genomic analysis tools such as DNA microarrays and gene expression profiling now provide a platform for the systematic characterization and classification of both cell lines and tumor samples. Studies using clinical samples have begun to identify classes of tumors that appear both biologically and clinically unique, as inferred from their distinctive patterns of expressed genes. In this review, we compare patterns of gene expression in breast tumor derived cell lines with those from clinical tumor specimens. This analysis demonstrates that cell lines and tumor samples have distinctive gene expression patterns in common, and underscores the need for careful assessment of the appropriateness of any given cell line as a model for a given tumor subtype.

1. Introduction

Oncologists rely upon clinical information, a morphologic assessment, and, to a limited degree, immunohistochemical and molecular markers to classify malignancies into groups that have distinct clinical behavior. It is clear, however, that additional markers and/or technologies are needed for classifying tumors, as current methods sometimes fail to accurately predict patient clinical course. In breast cancer, for example, tens to hundreds of different genes/proteins have been shown to be of prognostic value; however, many of these markers co-vary and hence are not of independent prognostic value. In addition, progress in adopting these markers into clinical practice has been limited both by technical constraints on the number of markers that can be examined efficiently and by the difficulty in comparing and validating studies that use different reagents and clinical sample sets. In breast cancer, only three markers are typically scored in the clinical setting: the estrogen receptor (ER), the tyrosine kinase receptor ERBB2/HER2, and an assessment of tumor proliferation index (e.g. Ki-67 labeling fraction) [1].

The advent of modern genomic analysis tools, in particular DNA microarrays, has created a tool capable of collecting thousands of objective observations on clinical samples. These observations can be, and are being, used to characterize tumors and cell lines at a level of definition that was not possible even five years ago [2]. Many groups have begun to use microarrays to measure gene expression in hundreds of tumor samples, with the expectation that genomic scale measurement of gene expression will reveal a novel molecular based classification of malignant cells. In this review, we will first focus on the characterization of breast tissue and tumor derived cell lines using data obtained from cDNA microarrays. These data can be used to 1) identify which cell lines are the best models for different breast tumor subtypes, 2) define molecular signatures that distinguish the biology of different cell lines and tumor types, 3) identify new candidate markers for tumor diagnosis and classification, and 4) identify subtype specific targets for therapeutic intervention.

* Address for correspondence: Charles M. Perou, Lineberger Comprehensive Cancer Center, CB# 7295, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA. E-mail: [email protected].
Disease Markers 17 (2001) 99-109. ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved.


D.T. Ross and C.M. Perou / A comparison of gene expression signatures from breast tumors and breast tissue derived cell lines

Table 1

Cell line name     Array #    Previous description of cell line and (source)
184A1              svcc38     immortal derivative of 184Aa (M. Stampfer)
184Aa              svcc17     primary HMEC strain Aa (M. Stampfer)
184B5              svcc40     immortal derivative of 184Aa (M. Stampfer)
BT-474-ATCC        svcc128    ERBB2 and ER positive line (ATCC)
BT-474-Stanford    svjl07     ERBB2 and ER positive line (Stanford)
BT-549             svcc69     papillary/ductal carcinoma derived (NCI)
Fibroblast-UTSW    shav146    hTERT immortalized stromal cell line (J. Shay/UTSW)
HB2                svcc37     SV40 immortalized breast epithelial line (H.S. Wiley)
HCC-1937           shaj046    BRCA1 mutant carcinoma derived line (J. Shay/UTSW)
HME31              shat023    primary HMEC strain 31 (J. Shay)
HMEC+IFNa          svcc500    HMEC-C strain plus IFNa (Clonetics)
HMEC-C             svcc94     primary HMEC strain C (Clonetics)
HMEC-C CON         svcc47     HMEC-C strain 2 days at 100% confluence (Clonetics)
HMS32              shaj058    primary breast stromal/fibroblast cell strain (J. Shay)
Hs578T-ATCC        shac095    breast carcinosarcoma derived line (ATCC)
Hs578T-NCI         svcc110    breast carcinosarcoma derived line (NCI)
MCF-10A            svn008     non-tumorigenic breast epithelial cell line (ATCC)
MCF-12A            sham103    non-tumorigenic breast epithelial cell line (ATCC)
MCF7-NCI           svcc1299   ER positive line isolated from a pleural effusion (NCI)
MCF7-UCLA          shat022    ER positive line from a pleural effusion (UCLA)
MDA-MB-231-NCI     svcc73     ER negative line from a pleural effusion (NCI)
MDA-MB-231-UTSW    shaj054    ER negative line from a pleural effusion (J. Shay)
SK-BR-3            svcc15     ERBB2 positive line from a pleural effusion (ATCC)
T47D               svcc71     ER positive line from a pleural effusion (ATCC)

2. Classification of breast cell lines

In general, cell lines established from breast tissues have been characterized mainly with respect to their expression of cytokeratins, the estrogen receptor (ER), and ERBB2/HER2 protein [3-5]. For some lines, the histology of xenografts has been compared to the pathology of their tumor of origin in order to confirm that the cell line has conserved features of its parental tumor. As part of our efforts to characterize the phenotypic diversity of human breast tumors, we have measured gene expression using cDNA microarrays in a number of breast tissue derived cell lines [6-8]. In this review, we have re-analyzed previously published data from thirteen cell lines and report new data from three additional independent lines. Furthermore, we explored cell line stability by measuring gene expression in the same cell line obtained from different sources and therefore propagated independently. Lastly, we included some instructive data derived from a normal mammary epithelial cell line treated with interferon, data derived from a confluent normal cell line culture, and a variant of a normal cell line immortalized in vitro. Gene expression in these cell lines was measured using spotted microarrays in comparison to a common reference sample, in a manner that has been previously described and that allows all samples to be compared to one another [6,7].

One of the most striking findings from genomic studies of gene expression to date has been that tumors and/or cell lines have common characteristics of biological or clinical importance that can be identified in their patterns of expressed genes. Hierarchical clustering analysis has been used to identify systematic features in the patterns of variation of expression of genes across sample sets [9-12]. The functions of the known genes that are relatively over-expressed or under-expressed in comparison between samples can give clues as to the differences in biology that are reflected in gene expression patterns [6-8,13-16].

In this breast cell line data set, we selected for analysis the subset of genes that showed 1) a signal intensity of >70 arbitrary units in both the Cy3 and Cy5 channels, and 2) expression variation of at least 3-fold from the average for that gene across the sample set in two or more of the 24 total experiments. These criteria selected 1287 genes, out of the 8102 total original genes, that were both well measured and changed in expression significantly between cell lines. All primary microarray data for the experiments presented here can be obtained from the Stanford Microarray Database at http://genome-www4.stanford.edu/MicroArray/SMD/, and all figures can be seen at our Supplementary Information website at http://genome-www.stanford.edu/breast_cancer/cell_line_review2001/.

As can be seen in Fig. 1A, hierarchical clustering analysis divided the cell lines into three main dendrogram branches. The first branch on the far left (Red) contained all three of the primary Human Mammary Epithelial Cell (HMEC) lines, all HMEC immortal derivatives, and the non-tumorigenic MCF-12A [17] cell line ostensibly derived from breast epithelium. The center dendrogram branch (Orange) contained both normal fibroblast derived cell lines, a carcinosarcoma derived line (Hs578T [18]), a line ostensibly derived from breast epithelium (MCF-10A [19]), and two lines derived from breast carcinoma specimens (BT-549, Coutinho and Lasfargues 1978, and MDA-MB-231 [20]). The far right dendrogram branch (Blue) contained all cell lines that were thought to be derived from luminal epithelial cells, including two estrogen receptor expressing cell lines (MCF-7 [21] and T47D [22]), two ERBB2 over-expressing lines (SK-BR-3, Trempe and Old 1970, and BT-474 [23]), the SV40 transformed epithelial derived cell line HB2 [5], and the BRCA1 mutant cell line HCC-1937 that was originally isolated by A. Gazdar and colleagues [24].

The biological functions of the sets of genes that were differentially expressed across different "branches" of the cell line dendrogram (Fig. 1B-E) suggested that the gene expression patterns identified cell lines with features that could be related to different types of normal breast cells. The cell lines sorted into those that expressed HMEC/basal-cell characteristics, those that expressed stromal/mesenchymal-cell-like characteristics, and those that expressed luminal-cell characteristics. It should be emphasized that this is an interpretation of the gene expression patterns and that alternative interpretations are possible.
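The two gene selection criteria described above (signal intensity above 70 units in both channels, and at least 3-fold variation from the gene's average in two or more experiments) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual pipeline: the function name `select_variable_genes`, the per-measurement handling of the intensity threshold, and the choice to average in log2 space are all assumptions, and the toy values are hypothetical.

```python
from math import log2
from statistics import mean


def select_variable_genes(cy3, cy5, ratios, min_signal=70.0, min_fold=3.0, min_arrays=2):
    """Return indices of genes that are well measured and variable.

    cy3, cy5, ratios: one list per gene, each holding one value per array.
    Assumption: a measurement counts only if both channel intensities
    exceed min_signal. A gene is kept if at least min_arrays of its counted
    measurements differ from that gene's average (taken in log2 space)
    by min_fold or more.
    """
    threshold = log2(min_fold)
    selected = []
    for gene, (c3, c5, r) in enumerate(zip(cy3, cy5, ratios)):
        logs = [log2(x) for x, a, b in zip(r, c3, c5)
                if a > min_signal and b > min_signal and x > 0]
        if len(logs) < min_arrays:
            continue  # too few well-measured spots for this gene
        center = mean(logs)
        n_variable = sum(1 for v in logs if abs(v - center) >= threshold)
        if n_variable >= min_arrays:
            selected.append(gene)
    return selected


# Toy data: 4 genes x 4 arrays (hypothetical values).
cy3 = [[100] * 4, [100] * 4, [50] * 4, [100] * 4]
cy5 = [[100] * 4, [100] * 4, [50] * 4, [100] * 4]
ratios = [
    [1, 1, 1, 1],          # no variation
    [8, 8, 0.125, 0.125],  # varies strongly in all four arrays
    [8, 8, 0.125, 0.125],  # varies, but signal is below the threshold
    [9, 1, 1, 1],          # only one array is 3-fold from the average
]
selected = select_variable_genes(cy3, cy5, ratios)
print(selected)
```

In this toy example only the second gene passes both filters; the fourth gene fails because a single outlying array does not satisfy the "two or more experiments" requirement.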

3. Breast basal epithelial cell signature

The group containing all HMEC lines (Red dendrogram branch) was distinguished by the very high expression of a set of genes that contained many markers of normal breast basal-epithelial cells, including keratins 5 and 17 (Fig. 1C) [3,4,25,26]. This set also included many genes whose roles in cell physiology distinguish basal from luminal epithelial cells, including the production of basal lamina components and interactions with the extracellular matrix (e.g. gamma and alpha-laminin, collagen type-XVII, integrins alpha-3, alpha-6 and beta-4). The cultured basal-like cell lines expressed variable amounts of smooth-muscle-actin, but much less than the lines that expressed the "stromal cell" gene expression signature (Fig. 1D). Therefore, these cultured cells appeared to express some, but not all, of the features of so-called "myo-epithelial cells", which are mature smooth-muscle-actin expressing cells that have a functional contractile apparatus [3,5,27]. This "basal" pattern of gene expression was not restricted to HMEC: this set of genes was moderately expressed in three other lines (MCF-10A, BT-549 and HB2) that also expressed strong stromal-like gene expression signatures and therefore did not fall into this class by clustering analysis (see below). It should also be noted that even immortal HMEC derivatives, like 184B5 and 184A1, showed the dominant "basal" cell gene expression pattern and not other signature patterns; hence, this pattern was not dramatically influenced by immortalization or transformation (see Supplementary Information Fig. 3 - http://genome-www.stanford.edu/breast_cancer/cell_line_review2001/).

4. Luminal epithelial cell signature

Approximately 60-70% of sporadic breast tumors are estrogen receptor positive and are believed to be derived from breast luminal epithelial cells [1], which can be distinguished by their expression of cytokeratins 8 and 18 and by their location and function in lining breast secretory ducts. The in vitro study of this cell type has been complicated by the difficulty in maintaining primary cultures of normal estrogen-receptor-positive luminal cells for longer than a few population doublings [28]. Therefore, most in vitro studies on breast luminal epithelial cells have been performed on cell lines derived from primary breast tumors or pleural effusions from breast cancer patients. The luminal-like signature comprised a set of genes that were nearly exclusively expressed in all of the luminal-like lines, while very low to absent levels were seen in all of the other tested lines (Fig. 1E). Contained within this set were genes/proteins that have been previously used to distinguish luminal breast epithelial cells, including the estrogen receptor and keratins 8 and 18 [3]. However, two distinct subtypes of cell lines that expressed luminal characteristics could be distinguished: 1) those that expressed high levels of the estrogen receptor and essentially lacked either stromal or basal gene expression signatures (e.g. MCF7, BT-474 and T47D), and 2) those that expressed little or no estrogen receptor but also expressed genes characteristic of the basal signature (HCC-1937 and HB2). SK-BR-3 was unique in this set of cell lines in that it expressed low levels of the estrogen receptor but a strong luminal gene expression signature without the expression of basal cell characteristics.


Fig. 1. Cluster diagram depicting relative gene expression differences between cell lines. The red-green pseudocolor chart depicts gene expression data in comparison between different cell lines. Red blocks depict genes relatively over-expressed in comparison between the measured samples, whereas green blocks depict genes relatively under-expressed. The data table has been organized by hierarchical clustering, which groups the genes on the basis of their similarity in expression patterns across a set of experimental samples (e.g. cell lines), and groups the experimental samples together based upon their similarity in gene expression patterns across the set of chosen genes. The result of the analysis is a re-ordering of the data table such that genes with relatively similar patterns of expression across the sample set are adjacent to one another in the rows, and samples with similar patterns of expression in the set of chosen genes are adjacent to one another in the columns. The dendrogram above the color chart depicts the relative similarities of the cell lines to one another; terminal branches contain cell lines with relatively similar patterns of gene expression, whereas cell lines separated by longer branches have relatively less similar gene expression patterns [9]. A) Complete cluster diagram that depicts all 1287 transcripts across 18 independent cell lines, including 24 different hybridizations. B) "Common epithelial" cell gene set that was expressed in both basal and luminal cells but was not expressed in the cells that have strong fibroblast-like characteristics. C) Breast basal epithelial cell gene set that was strongly expressed in all HMEC derived cell lines. D) Stromal-like/fibroblast gene set that was expressed in some fibroblasts as well as some breast cancer derived cell lines that were ostensibly mis-classified as carcinoma derived. E) Luminal epithelial gene set that was expressed in estrogen-receptor-positive cell lines as well as a few other lines. The color scale at the bottom left depicts the gene expression measured in each cell line relative to the average expression for each gene as determined in the 24 different cell line samples.
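The re-ordering described in the legend can be illustrated with a small agglomerative clustering sketch. The legend does not state the similarity metric or linkage rule, so the choices below (one minus the Pearson correlation as the distance, and average linkage) are assumptions made for illustration; the function names `pearson` and `cluster` and the toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def cluster(profiles):
    """Average-linkage clustering; returns (leaf_order, merges).

    merges lists (cluster_a, cluster_b, distance) in the order clusters were
    joined; ids below len(profiles) are original samples, higher ids are
    merged clusters. Requires at least one profile.
    """
    n = len(profiles)
    dist = {(i, j): 1.0 - pearson(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = {i: [i] for i in range(n)}  # cluster id -> ordered leaves
    merges, next_id = [], n
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average linkage: mean distance over all cross-cluster leaf pairs.
                pairs = [(min(p, q), max(p, q))
                         for p in clusters[a] for q in clusters[b]]
                d = sum(dist[k] for k in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Joining two clusters places their leaves next to each other.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    (order,) = clusters.values()
    return order, merges


# Four hypothetical samples: 0 and 1 rise together, 2 and 3 fall together.
samples = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 1.0],
]
order, merges = cluster(samples)
print(order)
```

The returned leaf order keeps similar samples adjacent, which is the property the legend describes for the columns of the diagram; applying the same function to gene profiles would order the rows. The merge distances play the role of branch lengths in the dendrogram.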

5. Mesenchymal/stromal cell signature

We have previously shown that a small, but significant, number of cell lines ostensibly of epithelial origin showed patterns of gene expression that were more consistent with characteristics expected of stromal cells (see [6] and http://genome-www.stanford.edu/nci60/). In order to further investigate the gene expression properties of these cell lines, we compared them to two cell lines explicitly derived from breast stroma (HMS32 and Fibroblast-UTSW, both obtained from Jerry Shay/UTSW). The distinguishing gene expression signature for these strains/lines comprised high expression of a number of genes with roles in remodeling of extracellular matrix, including the genes encoding smooth muscle actin, vimentin, fibrillin, biglycan and collagen types I, III, V and VI (Fig. 1D), combined with low expression of genes characteristic of epithelial cells (Fig. 1B). The cell lines contained within this branch of the dendrogram (Orange) were further subdivided into a branch that contained three similar lines that expressed the highest levels of this "stromal" gene expression signature and others that showed incrementally less expression of this set of genes. Consistent with the interpretation that this signature reflected expression of stromal cell physiology, it was most strongly expressed in the carcinosarcoma derived line Hs578T and the two cell lines/strains established from fibroblasts, HMS32 (primary fibroblast strain with a finite lifespan) and Fibroblast-UTSW (telomerase immortalized breast derived fibroblast line). The remaining lines that clustered with these fibroblast-like lines showed decreased overall expression of this set of genes, with incrementally less expression in BT-549, less in MCF-10A, and finally, only a few stromal-signature genes expressed in the two independently propagated lines of MDA-MB-231 (Fig. 1D).

6. Common epithelial cell signature

In addition to the gene expression signatures that distinguished the three major branches of the cell line dendrogram, both the basal-epithelial-like cell lines and the luminal-epithelial-like cell lines expressed a set of genes that were absent in those lines expressing the stromal gene expression signature (Fig. 1B). This set comprised many genes that play roles in the cell-to-cell contacts that seal the lumen or extracellular space in epithelial tissues (e.g. E-cadherin, plakoglobin and junctional-adhesion protein). This cluster likely distinguished genes involved in functions common between subtypes of epithelial cells, and therefore comprised a molecular signature of epithelial cells (Fig. 1B).

7. Cell lines of ambiguous origins

Of particular interest were the breast derived cell lines that lacked expression of the common epithelial genes and expressed some characteristics of the "stromal" expression signature, including BT-549, MCF-10A and MDA-MB-231. BT-549 showed strong expression of most of the genes in the stromal cluster (Fig. 1D) and lacked expression of the other signature expression patterns, including the common epithelial cell pattern. This suggests that BT-549 may represent a myofibroblast-like line transformed in vitro, or a line like Hs578T that was originally derived from a stromal-like tumor in vivo [18]. The MCF-10A [19] line is a well-studied breast model system that expressed characteristics of the basal epithelial signature, including keratin expression, as well as genes that comprised the stromal signature, but significantly, lacked expression of the common epithelial cell pattern. Perhaps the most enigmatic cell line was MDA-MB-231, which did not show strong characteristics of any of the three signature patterns of expression (Fig. 1C-E) except for a fraction of the genes that comprised the stromal cluster. This cell line has been previously shown to be similar to renal carcinoma derived cell lines in a separate study that compared cell lines derived from a diverse set of tumor types, and therefore may represent a de-differentiated cell type that has lost expression of the signature of its tissue of origin [6]. Further gene expression studies on these cell lines, including studies of their responses to extracellular matrix stimuli, might distinguish their potential for differentiation into cells with a clearer relationship to their in vivo counterparts.

In a previous study, we showed that another cell line that has been used as a model of aggressive metastatic breast tumors, MDA-MB-435, had a pattern of gene expression that was very similar to the pattern seen in seven independent melanoma derived cell lines (see http://genome-www.stanford.edu/nci60/images/figure2c.html and [6]). This distinctive pattern included strong expression of many genes characteristic of melanocytes, including dopachrome tautomerase and S100-β, and therefore suggested that the tested cell line was derived from a melanoma and not from a breast carcinoma. A number of different samples of MDA-MB-435 derived from different sources showed a similar pattern, suggesting that most, if not all, examples of this cell line are similarly misclassified (data not shown). These findings suggest that this cell line is not an appropriate model system for the study of breast carcinoma.

8. Breast tumor gene expression patterns

One of the most useful aspects of microarray technology is its utility in the study of gene expression patterns in clinical tumor specimens [7,13,15,29-32]. We have previously published a study of gene expression profiles of forty breast cancer patients that included twenty pairs of samples taken from patients' tumors before and after a sixteen week course of doxorubicin chemotherapy [7]. In order to identify the best set of genes to use for tumor classification, we utilized a statistical approach to identify the subset of genes that showed significant variation in expression across different patients/tumors, but which varied little in expression within paired samples from the same tumor [7]. This set of genes, termed the "intrinsic" gene set, was enriched for those genes whose expression patterns were characteristic of each tumor, as opposed to those that varied as a function of sampling error. An example of a gene that showed this "intrinsic" property was ERBB2/HER2, which was expressed at high levels in some tumors and not others (forty-fold difference within this sample set), but which was consistent in expression between multiple samples taken from the same tumor.

Hierarchical clustering analysis using this "intrinsic" gene set of 476 cDNA clones (including 426 different genes) resulted in a novel molecular classification of the tumor samples on the basis of gene expression patterns (see Fig. 2 of [7] at http://genome-www.stanford.edu/molecularportraits/images/figure2.html). Importantly, the tumor sample dendrogram from this analysis showed that the repeat biopsies taken from the same patient (i.e. the 20 "before" and "after" doxorubicin sample pairs) were almost always more similar to each other than either was to any of the other tumors tested (17/20 "before" and "after" pairs and 2/2 tumor/lymph node metastasis pairs clustered together). This implied that every tumor is unique and has a distinctive gene expression "signature" or "portrait".

These gene expression patterns distinguished four discrete tumor subtypes that, similar to the cell line studies, could be related to features of normal breast cell types in vivo. The classes of tumor subtypes identified were 1) a "luminal epithelial/ER+" subtype that was distinguished by high expression of a set of approximately twenty genes that included the ER gene and other genes known to be regulated by estrogen, 2) a normal breast-like group of samples that contained the three normal breast samples, a single fibroadenoma and 5 tumor samples, 3) a group of tumors most of which expressed high levels of the ERBB2/HER2 gene, and 4) a group of tumors that had gene expression patterns reminiscent of breast basal epithelial cells.

Building upon these studies, we recently reported gene expression patterns of an expanded set of 78 different breast tumors [33]. Cluster analysis of data derived from this larger tumor set re-identified the same four tumor subtypes plus one additional subclass of ER+ tumors, such that five subtypes were now distinguished. To explore whether this classification of breast tumors was clinically significant, we assigned each of the 51 tumors that comprised the cohort of "before" and "after" doxorubicin patients [34] to a class based upon its location within the tumor sample dendrogram and performed Kaplan-Meier survival analysis (see http://genome-www.stanford.edu/mopo/clinical/ and [33]). We found that these subtypes, as defined by gene expression patterns, differed to a statistically significant degree in overall patient survival and relapse free survival. The luminal/ER+ patients were sub-divided into two classes, of which a novel subtype (Luminal B/C) had a significantly worse outcome when compared to the rest of the ER+ tumors (Luminal A), which showed the most favorable outcomes. Furthermore, the set of patients classified as "basal-like" had outcomes as poor as those whose tumors over-expressed ERBB2/HER2. The patterns of gene expression that were distinguished in these studies represent novel molecular signatures of breast tumors that can be used to 1) develop new clinical tests based upon gene expression patterns to score for these subtypes, 2) identify candidate markers for diagnosis, 3) identify genes important for understanding the biology that distinguishes basal and luminal epithelial cells, and 4) identify subtype specific targets for developing therapeutic interventions.
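The idea behind the "intrinsic" gene set (large variation between tumors, small variation between paired samples of the same tumor) can be illustrated with a simple ratio score. The statistic actually used is defined in [7] and is not reproduced in this review, so the score below (mean squared within-pair difference divided by the variance of the per-tumor means) is only an assumed stand-in; the function names, gene labels and values are hypothetical.

```python
from statistics import mean, pvariance


def intrinsic_score(pairs):
    """Within-pair variation divided by between-tumor variation for one gene.

    pairs: list of (before, after) log-ratios, one tuple per tumor.
    Lower scores mean the gene differs between tumors but is stable
    within repeated samples of the same tumor.
    """
    within = mean([(a - b) ** 2 for a, b in pairs])
    between = pvariance([(a + b) / 2 for a, b in pairs])
    return float("inf") if between == 0 else within / between


def rank_intrinsic(genes):
    """Return gene names ordered from most to least 'intrinsic'."""
    return sorted(genes, key=lambda name: intrinsic_score(genes[name]))


# Hypothetical log-ratios for three genes measured in four paired tumors.
genes = {
    "noisy_gene": [(1.0, -1.0), (-0.8, 1.0), (1.2, -1.0), (-1.0, 0.9)],
    "ERBB2_like": [(5.0, 5.2), (0.1, 0.0), (4.8, 5.0), (0.2, 0.1)],
    "flat_gene": [(0.0, 0.1), (0.1, 0.0), (0.0, 0.1), (0.1, 0.0)],
}
ranking = rank_intrinsic(genes)
print(ranking)
```

In this toy example the ERBB2-like gene ranks first because it separates tumors while remaining consistent within each pair, mirroring the behavior described for ERBB2/HER2 above. A gene that varies only between repeated samples, or not at all between tumors, ranks last.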

9. An integrated cell line and breast tumor analysis The set of genes that defined epithelial cell characteristics within the cell line panel was remarkably similar to the set of genes that distinguished tumor subtypes amongst the breast carcinomas. This suggested that certain cell lines may be very good models for specific subtypes of tumors. In order to more directly compare and contrast primary breast tumors and cell lines, we created a single hierarchical clustering diagram using the aforementioned "intrinsic" gene list and data from 16 cell lines discussed above and our previously published study on 40 breast tumors and three normal breast samples (Fig. 2, and see Supplementary Information Fig. 4 for the complete cluster diagram - http://genomewww.stanford.edu/breastxancer/cellJinejwiew2G01/)As expected, the results showed a similar grouping of the tumors samples into at least four subtypes including luminal/ER+ (dark blue), ERBB2/HER2+ (pink), normal breast-like (green), and a basal-like classes (dark red). The cell lines, regardless of their presumed celltype of origin, clustered together on a single large dendrogram branch separate from all of the tumors, however, they were also similarly subdivided into the basal, luminal, and stromal-like groups described above (Fig. 2). The luminal/ER-r- signature was most strongly expressed in a large group of ER expressing tumors and the subset of cell lines that expressed the luminal gene expression signature (blue dendrogram branch

105

and Fig. 2B). The pattern of expression of this set of genes showed both quantitative and qualitative differences between the luminal/ER+ tumors and the luminal cell lines. These tumor samples comprised at most 60-70% tumor cells, yet expressed stronger and more consistent levels of this gene set than the "pure" population of cells derived from a single cell line. Given the difficulty in establishing cultures from cells expressing luminal characteristics, the loss of expression of these genes may be related to the process of establishing or maintaining cell lines in vitro. Interestingly, the ERBB2+ tumor subtype expressed fewer of these genes than the luminal-like/ER+ tumors and, in most cases, at levels lower than the luminal cell lines. These patterns were consistent with the notion that the relative level of expression of this set of genes may reflect the degree of luminal differentiation of the tumor samples. Taken together, these results suggested that the best models for ER+ luminal epithelial cell derived tumors, choosing from the cell lines tested here, are MCF7, T47D and BT-474, with SK-BR-3 serving as a model of luminal-cell derived ER-negative/ERBB2-positive tumors.

We have previously shown that most of the genes highly expressed in a pattern similar to ERBB2 across large sets of breast tumors are part of the co-amplified chromosomal region that contains ERBB2 [7,35-38]. In this analysis, most of the tumors that expressed this gene expression signature as their primary distinguishing characteristic clustered together and formed a discrete subtype (pink dendrogram branch, and Fig. 2D). A few other tumors also expressed the ERBB2 signature; however, they expressed either the luminal or the normal breast signature as well and were clustered according to those distinguishing characteristics. Both BT-474 and SK-BR-3 showed high-level expression of the ERBB2 signature and therefore are likely appropriate cell line models for ER-positive and ER-negative ERBB2 over-expressing tumors, respectively.

The basal-like tumors were distinguished by strong expression of keratins 5 and 17 relative to the luminal cell-like tumors (Fig. 2 and [7]). The basal epithelial gene expression signature expressed by HMEC lines was in part expressed by both the basal-like tumors and normal breast tissue, which appeared to be composed predominantly of basal epithelial cells (dark red dendrogram branch and Fig. 2E). These results suggested that the basal-like cell lines/HMEC cultures and their derivatives are, in general, appropriate model systems


D.T. Ross and C.M. Perou / A comparison of gene expression signatures from breast tumors and breast tissue derived cell lines



Fig. 2. Integrated analysis of breast tumors and cell lines using the "intrinsic" gene set. Cluster diagram (see legend to Fig. 1) depicting the relationships between gene expression patterns in breast cancer derived cell lines and tumor specimens. A) Experimental sample associated dendrogram showing distinct tumor and cell line subtypes. B) Luminal/ER+ gene expression cluster. C) Common epithelial cell cluster containing E-cadherin. D) ERBB2+ amplicon cluster. E) Basal epithelial cell cluster. The color scale at the top right depicts the gene expression measured in each sample relative to the average expression for each gene as determined in the 87 different samples. The full microarray data files for all 87 experiments can be obtained from the Stanford Microarray Database at http://genome-www4.stanford.edu/MicroArray/SMD/, and the full cluster diagrams for Figs 1 and 2 can be seen at http://genome-www.stanford.edu/breast_cancer/cell_line_review2001/.

for the breast basal-like tumors. The absence of expression of any genes characteristic of the luminal signature suggests that HMEC cultures and their derivatives are NOT appropriate models of hormone-responsive breast tumors, which comprise the majority of sporadic breast tumors [1]. HB2 and HCC-1937, as noted above, expressed genes characteristic of both the basal and luminal signatures, and therefore may be similar to basal-like tumors that often co-express cytokeratins 17 and 18 in vivo (M. van de Rijn, personal communication).

The cell lines that expressed the stromal cell signature (orange dendrogram branch), with the exception of MCF-10A, showed very few gene expression characteristics in common with any of the breast tumors, even those that were highly metastatic when sampled for microarray analysis. The cellular origins of these cell lines are still enigmatic, and it is not clear which, if any, of them are appropriate models for breast carcinoma. It is interesting to note, however, that some of the cell lines within this cluster, in particular MDA-MB-231, are among the most tumorigenic and aggressive in nude mouse xenograft models [39]. Most interesting among these cell lines was MCF-10A, which expressed some genes characteristic of basal-like tumors, including keratins, but lacked expression of the common epithelial signature. Notably, this cell line is not tumorigenic in xenograft models.

10. Summary

The advent of DNA microarray technology has enabled researchers to measure genomic-scale gene expression in human cancers and cell lines. The exploration of these gene expression patterns is challenging oncologists and pathologists to re-assess traditional classifications of cancer and to incorporate molecular features into treatment regimens and drug development strategies. The great strength of cDNA microarray studies coupled to hierarchical clustering analysis is the ability to objectively identify sets of coordinately expressed genes and to display the data in a format that a biologist can use to form hypotheses.
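As a rough illustration of the clustering idea summarized above, the following sketch centers each gene on its mean across samples and then groups samples by average-linkage hierarchical clustering on a Pearson-correlation distance. The toy expression values, the sample names and the function names are hypothetical, and the choice of linkage and distance is an assumption made for illustration, not a description of the software used in this study.

```python
from itertools import combinations
from math import sqrt


def center_genes(matrix):
    """Subtract each gene's mean across samples (rows = genes, columns = samples)."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_samples(matrix, names):
    """Average-linkage agglomerative clustering of samples.

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    columns = list(zip(*matrix))  # one expression profile per sample
    dist = {
        frozenset((i, j)): 1.0 - pearson(columns[i], columns[j])
        for i, j in combinations(range(len(names)), 2)
    }

    def avg_dist(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((
            tuple(names[i] for i in a),
            tuple(names[i] for i in b),
            avg_dist(a, b),
        ))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return merges


# Hypothetical toy data: two "luminal" and two "basal" marker genes (rows)
# measured in four made-up samples (columns).
toy_names = ["luminal_tumor", "luminal_line", "basal_tumor", "basal_line"]
toy_matrix = [
    [3.0, 2.5, 0.2, 0.1],  # luminal marker A
    [2.8, 2.0, 0.3, 0.4],  # luminal marker B
    [0.1, 0.3, 2.9, 3.2],  # basal marker A
    [0.2, 0.5, 2.5, 3.0],  # basal marker B
]
merges = cluster_samples(center_genes(toy_matrix), toy_names)
for members_a, members_b, distance in merges:
    print(members_a, "+", members_b, f"distance={distance:.2f}")
```

In this toy example the two luminal samples and the two basal samples pair up first, and the two pairs join only at the final, most distant merge, which mirrors how a dendrogram separates subtypes.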

The data presented here, and in our previous studies, have shown that many different gene selection criteria, across different sets of breast tumors and cell lines, consistently identify similar sets of genes that can serve as markers for probing the biology of breast cancer [6,7,33]. The comparison of gene expression patterns between cell lines and tumors is dominated by differences related mostly to the proliferative index of the samples, with most cell lines growing at a much faster rate than in vivo tumor cells [6-8,13,15]. However, in the case of breast tumors, cell lines and tumors share many aspects of their gene expression patterns that can be related to the normal and pathological physiology that distinguishes breast cell types in vivo. These gene sets include 1) the basal epithelial cluster, 2) the luminal epithelial/ER+ cluster, 3) the ERBB2+ amplicon cluster, 4) the proliferation cluster, and 5) the interferon cluster [7,8]. Remarkably, the classes of tumors defined by gene expression are, in part, consistent with current markers used for breast cancer stratification and prognostication (e.g. ER status, ERBB2 status, proliferative index) [33].

In addition to re-identifying and elucidating traditional classes of tumors, gene expression patterns are revealing novel subtypes of tumors that appear both biologically and clinically distinct. In the study by Sørlie et al., the class of tumors distinguished by expression of the basal epithelial signature, including expression of cytokeratins 5 and 17, showed an outcome as poor as that of ERBB2 over-expressing tumors and was as numerous. Similarly, this study also identified a sub-group of patients with ER-positive tumors, traditionally classified as having a good prognosis, who had very poor outcomes [33].

The similarities and differences between cell lines and tumors should allow a much more informed choice to be made about the appropriateness of any given cell line model for a particular aspect of tumor biology to be studied in vitro. In addition to the dominant patterns of gene expression described in this review, there is tremendous additional variation in gene expression patterns between tumor subtypes (see Supplementary Information Figures at http://genome-www.stanford.edu/breast_cancer/cell_line_review2001/).



Gene expression studies of fifty to one hundred tumor specimens likely do not have the power to identify all of the inherent biologic diversity of tumors. It remains to be determined whether one, a few, or tens to hundreds of markers will be necessary to identify distinguishing characteristics of tumors in the clinical setting. It is likely that larger gene expression profiling studies, and/or large in situ or immunohistochemistry studies using candidate markers identified in microarray studies, will be necessary to distinguish all of the clinically relevant variation in biology that can be exploited to develop better patient management algorithms and targeted drug strategies.

Acknowledgements

We are grateful to David Botstein and Patrick O. Brown for guidance and for providing the resources that were used in this study. We also thank John C. Matese for his efforts in creating and maintaining the website that supports this paper, the following individuals for their cell lines or mRNA samples that were used in this study (Jerry Shay, H.S. Wiley, Fuyuhiko Tamanoi, Martha Stampfer and Paul Yaswen), and Robert Strausberg for critical reading of this manuscript.

References

[1] F.A. Tavassoli and S.J. Schnitt, Pathology of the breast, New York: Elsevier, 1992, xiii, pp. 669.
[2] P.O. Brown and D. Botstein, Exploring the new world of the genome with DNA microarrays, Nat Genet 21(1 Suppl) (1999), 33-37.
[3] L. Ronnov-Jessen, O.W. Petersen and M.J. Bissell, Cellular changes involved in conversion of normal to malignant breast: importance of the stromal reaction, Physiol Rev 76(1) (1996), 69-125.
[4] M.R. Stampfer and P. Yaswen, Culture systems for study of human mammary epithelial cell proliferation, differentiation and transformation, Cancer Surv 18 (1993), 7-34.
[5] J. Taylor-Papadimitriou et al., Keratin expression in human mammary epithelial cells cultured from normal and malignant tissue: relation to in vivo phenotypes and influence of medium, J Cell Sci 94(Pt 3) (1989), 403-413.
[6] D.T. Ross et al., Systematic variation in gene expression patterns in human cancer cell lines, Nat Genet 24(3) (2000), 227-235.
[7] C.M. Perou et al., Molecular portraits of human breast tumors, Nature 406 (2000), 747-752.
[8] C.M. Perou et al., Distinctive gene expression patterns in human mammary epithelial cells and breast cancers, Proc Natl Acad Sci USA 96(16) (1999), 9212-9217.
[9] M.B. Eisen et al., Cluster analysis and display of genome-wide expression patterns, Proc Natl Acad Sci USA 95(25) (1998), 14863-14868.
[10] V.R. Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science 283(5398) (1999), 83-87.
[11] P.T. Spellman et al., Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Mol Biol Cell 9(12) (1998), 3273-3297.
[12] J.N. Weinstein et al., An information-intensive approach to the molecular pharmacology of cancer, Science 275(5298) (1997), 343-349.
[13] A.A. Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403(6769) (2000), 503-511.
[14] A.P. Gasch et al., Genomic expression programs in the response of yeast cells to environmental changes, Mol Biol Cell 11(12) (2000), 4241-4257.
[15] C.M. Perou, P.O. Brown and D. Botstein, Tumor classification using gene expression patterns from DNA microarrays, New Technologies for Life Sciences: A Trends Guide, 2000, pp. 67-76.
[16] T.R. Hughes et al., Functional discovery via a compendium of expression profiles, Cell 102(1) (2000), 109-126.
[17] T.M. Paine et al., Characterization of epithelial phenotypes in mortal and immortal human breast cells, Int J Cancer 50(3) (1992), 463-473.
[18] A.J. Hackett et al., Two syngeneic cell lines from human breast tissue: the aneuploid mammary epithelial (Hs578T) and the diploid myoepithelial (Hs578Bst) cell lines, J Natl Cancer Inst 58(6) (1977), 1795-1806.
[19] H.D. Soule et al., Isolation and characterization of a spontaneously immortalized human breast epithelial cell line, MCF10, Cancer Res 50(18) (1990), 6075-6086.
[20] R. Cailleau, M. Olive and Q.V. Cruciger, Long-term human breast carcinoma cell lines of metastatic origin: preliminary characterization, In Vitro 14(11) (1978), 911-915.
[21] H.D. Soule et al., A human cell line from a pleural effusion derived from a breast carcinoma, J Natl Cancer Inst 51(5) (1973), 1409-1416.
[22] I. Keydar et al., Establishment and characterization of a cell line of human breast carcinoma origin, Eur J Cancer 15(5) (1979), 659-670.
[23] E.Y. Lasfargues, W.G. Coutinho and E.S. Redfield, Isolation of two human tumor epithelial cell lines from solid breast carcinomas, J Natl Cancer Inst 61(4) (1978), 967-978.
[24] G.E. Tomlinson et al., Characterization of a breast cancer cell line derived from a germ-line BRCA1 mutation carrier, Cancer Res 58(15) (1998), 3237-3242.
[25] M.R. Stampfer et al., Gradual phenotypic conversion associated with immortalization of cultured human mammary epithelial cells, Mol Biol Cell 8(12) (1997), 2391-2405.
[26] L. Ronnov-Jessen et al., The origin of the myofibroblasts in breast cancer. Recapitulation of tumor environment in culture unravels diversity and implicates converted fibroblasts and recruited smooth muscle cells, J Clin Invest 95(2) (1995), 859-873.
[27] M. Stampfer, The HMEC Homepage.
[28] C. Pechoux et al., Human mammary luminal epithelial cells contain progenitors to myoepithelial cells, Dev Biol 206(1) (1999), 88-99.
[29] U. Alon et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc Natl Acad Sci USA 96(12) (1999), 6745-6750.
[30] T.R. Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286(5439) (1999), 531-537.
[31] H. Okabe et al., Genome-wide analysis of gene expression in human hepatocellular carcinomas using cDNA microarray: identification of genes involved in viral carcinogenesis and tumor progression, Cancer Res 61(5) (2001), 2129-2137.
[32] J.B. Welsh et al., Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer, Proc Natl Acad Sci USA 98(3) (2001), 1176-1181.
[33] T. Sørlie et al., Gene expression patterns of breast carcinomas distinguish tumor subclasses with potential clinical implications, Submitted, 2001.
[34] S. Geisler et al., Influence of TP53 gene alterations and c-erbB2 expression on the response to treatment with doxorubicin in locally advanced breast cancer, Cancer Res 61(6) (2001), 2505-2512.
[35] C. Moog-Lutz et al., MLN64 exhibits homology with the steroidogenic acute regulatory protein (STAR) and is overexpressed in human breast carcinomas, Int J Cancer 71(2) (1997), 183-191.
[36] J.R. Pollack et al., Genome-wide analysis of DNA copy-number changes using cDNA microarrays, Nat Genet 23(1) (1999), 41-46.
[37] J.S. Ross and J.A. Fletcher, The HER-2/neu oncogene in breast cancer: prognostic factor, predictive factor, and target for therapy, Stem Cells 16(6) (1998), 413-428.
[38] D. Stein et al., The SH2 domain protein GRB-7 is co-amplified, overexpressed and in a tight complex with HER2 in breast cancer, Embo J 13(6) (1994), 1331-1340.
[39] H. Pulyaeva et al., MT1-MMP correlates with MMP-2 activation potential seen after epithelial to mesenchymal transition in human breast carcinoma cells, Clin Exp Metastasis 15(2) (1997), 111-120.



C282Y and H63D mutation frequencies in a population from central Spain

S. Alvarez (a), M.S. Mesa (b), F. Bandres (a) and E. Arroyo (a,*)

(a) Dep. Toxicologia y Legislacion Sanitaria, Facultad de Medicina, Universidad Complutense de Madrid, 28040-Madrid, Spain
(b) Seccion de Antropologia, Dep. Biologia Animal I, Facultad de Ciencias Biologicas, Universidad Complutense de Madrid, 28040-Madrid, Spain

Objectives: To determine the frequency of the hereditary hemochromatosis gene mutations C282Y and H63D in 125 autochthonous blood donors from a central region of Spain, and thereby to provide epidemiological data on the HFE gene in the Iberian Peninsula.
Methods: DNA extracted from blood samples was analyzed by PCR-RFLP. The restriction enzymes were SnaB I and Bcl I for C282Y and H63D, respectively. Results were visualized by ethidium bromide staining after gel electrophoresis.
Results and discussion: The C282Y frequency was 0.02 and that of H63D was 0.16. The result for the C282Y mutation falls within the range of variation of Mediterranean populations. The H63D frequency agrees with those reported for other European populations. In both cases the frequencies obtained are the lowest among the Spanish data compared.
Conclusions: This study is useful for comparing expected versus observed C282Y and H63D frequencies in Spanish populations and contributes to the knowledge of Spanish variability, which has rarely been analyzed for HFE gene mutations until now.

*Correspondence address: Eduardo Arroyo, Dep. Toxicologia y Legislacion Sanitaria, Facultad de Medicina, Universidad Complutense de Madrid, 28040-Madrid, Spain. Tel.: +34 91 3941576; Fax: +34 91 3941606; E-mail: [email protected]. Disease Markers 17 (2001) 111-114 ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved

1. Introduction

Hereditary hemochromatosis is an autosomal recessive disease affecting approximately 1 in 300 individuals of European origin [1]. The disorder causes overabsorption of iron from the intestine, leading to body iron overload. Iron accumulates progressively in different tissues; the liver, joints, skin, heart and other organs may be affected, resulting in endocrine abnormalities, cirrhosis, arthropathy, cardiomyopathy, etc., and the disease may be fatal if untreated. In 1976, Simon et al. reported an association of hemochromatosis with HLA loci [2]. Later, the hemochromatosis gene, named HFE, was identified on the short arm of chromosome 6, about 4.5 megabases telomeric to HLA-A [3]. The HLA-linked hemochromatosis gene has so far been described exclusively in people of European ancestry. It is important to note that there are other disorders of iron metabolism distinct from this hereditary hemochromatosis, such as juvenile hemochromatosis, African iron overload and hereditary hyperferritinaemia, all of which are determined by loci other than HFE [4].

Characterization of the HFE gene has shown that it carries a mutation, named C282Y, found in most patients suffering from hemochromatosis. This mutation is a G→A transition at nucleotide 845 of the open reading frame that results in a Cys→Tyr substitution at position 282 of the protein product. Homozygosity for the C282Y variant is closely associated with hemochromatosis and is found in 64-94.7% of European hemochromatosis patients [3,5], although in Australia it has been reported that 100% of hemochromatosis patients (individuals of European origin) are homozygous for the C282Y mutation [6]. A few other hemochromatosis-related mutations have been described in the HFE gene [7,8]. The most important of these, considering its frequency, is H63D, which corresponds to a C→G transversion at nucleotide 187 resulting in a His→Asp substitution at position 63 of the protein chain. The role of this second variant in causing genetic hemochromatosis is less clear than that of C282Y. However, hemochromatosis does occur in individuals who carry both H63D and C282Y [3,9].

The allelic association between HLA-A*03 and hemochromatosis is well recognized [2,10]. Edwards et al. [11] demonstrated that approximately 70% of hemochromatosis patients possess at least one HLA-A*03 allele. A possible origin of the mutation on a chromosome bearing an HLA-A*03 allele has been


S. Alvarez et al. / HFE polymorphism in Spain

suggested [12]. Recent findings support the hypothesis that the mutation appeared relatively recently (60-70 generations ago) on a *A3 *B7 haplotype [10]. In addition, the finding of HLA alleles other than *A3 associated with hemochromatosis may be explained by infrequent recombination events between these loci over many generations; another explanation could be the appearance of rare hemochromatosis-causing mutations other than the two mentioned above.

The aim of this study is to establish the population frequencies of the C282Y and H63D mutations in a sample from Central Spain, providing epidemiological data that may allow us to relate the incidence of hereditary hemochromatosis to the presence of these two major mutations. This is the first report of mutation frequencies for the Central Spanish population, and it adds to our knowledge of European populations and HFE markers.

2. Material and methods

2.1. Sample

DNA samples were obtained from 125 unrelated individuals. These individuals were blood donors and autochthonous (all four grandparents born in the studied region) from Central Spain (Vera-Jerte Valley, Caceres province). This population was chosen because of its genetic resemblance to the majority of Spanish populations, as shown in previous studies of STR variability and other markers [13,14].

2.2. Amplification protocol

PCR was carried out in a 25 µl reaction mix containing 20 pmol of each primer, 75 mM Tris-HCl (pH 9.0), 1.5 mM MgCl2, 50 mM KCl, 20 mM (NH4)2SO4, 200 µM each dNTP and 1 U Taq polymerase (Biotools). The PCR program was run in an Omn-E Thermal Cycler (Hybaid, UK) with 1 cycle of initial denaturation (95°C, 120 s), 35 cycles of amplification (95°C, 60 s; 55°C, 30 s; 72°C, 60 s) and 1 cycle of final extension (72°C, 60 s). The PCR primers for detection of the H63D and C282Y mutations were those described by Beutler et al. [15].

2.3. RFLP assay

10 µl of each final product was digested overnight with Bcl I for H63D and SnaB I for C282Y (New England Biolabs) at 50°C and 37°C, respectively, according to the manufacturer's protocol. Fragments were separated by electrophoresis in 2% agarose gels and visualized with ethidium bromide.

3. Results and conclusion

A total of 125 blood samples were analyzed and the C282Y and H63D variants were identified. For the C282Y mutation, five of the donors were heterozygous and the rest carried the normal allele. No C282Y homozygote was detected. These results correspond to a C282Y gene frequency of 0.02 ± 0.009. The search for the H63D mutation revealed that two individuals were homozygous and 36 heterozygous, while 87 exhibited normal alleles. The H63D gene frequency was 0.16 ± 0.023. For comparative purposes, Table 1 summarizes results from different European populations studied for HFE mutations.

Table 1
Frequencies (%) of H63D and C282Y mutations in European populations

Population                      N     H63D   C282Y  Reference
Iceland                         90    10.6   6.7    [4]
Norway                          94    11.2   6.4    [4]
USSR (Mekhelta)                 45    6.7    1.1    [4]
USSR (Udmurts)                  46    15.2   0      [4]
Finland                         38    11.8   0      [4]
The Netherlands                 39    29.5   2.6    [4]
Denmark                         200   12.8   6.8    [16]
Denmark                         37    12.2   9.5    [4]
Germany                         53    18.9   1.9    [4]
Germany                         153   22.0   5.2    [23]
Germany (Bavarians)             62    11.3   5.6    [4]
Austria                         271   13.0   3.7    [24]
United Kingdom                  368   12.1   6.0    [4]
United Kingdom (Wales)          101   15.8   5.9    [17]
United Kingdom (NE Scotland)    188   15.7   8.0    [18]
Ireland                         45    18.9   10.0   [4]
France (Finistere)              254   16.9   9.4    [22]
France (Rennes)                 139   16.5   2.9    [20]
France (Toulouse)               95    15.8   4.2    [21]
Hungary                         277   12.3   5.6    [25]
Ashkenazi Jews                  35    8.6    0      [4]
Italy                           50    10.0   1.0    [26]
Italy                           91    12.6   0.5    [4]
Greece                          139   11.9   1.4    [4]
Greece (Cypriots)               57    17.5   0      [4]
Turkey                          31    17.7   0      [4]
Turkey (Turks-Cypriots)         39    10.3   0      [4]
Spain (Basques)                 28    30.4   3.6    [4]
Spain (Catalans)                50    24.0   3.0    [4]
Spain (Cantabrians)             213   -      4.4    [19]
Spain (Central region)          125   16.0   2.0    Present study

The C282Y mutation has been found in all of these populations except Udmurts (USSR), Finland, Ashkenazi Jews, Cypriots and Turkey [4]; in these cases, however, the number of individuals typed is small, and given the low frequency of the mutation this may account for its apparent absence. The highest values were found in Northern European countries: Denmark (9.5% and 6.8%) [4,16], United Kingdom (5.9%-8%) [4,17,18], Iceland (6.7%) and Norway (6.4%) [4]. The Mediterranean area shows lower values, between 0.5% in Italy [4] and 4.4% in a region of Northern Spain (Cantabrians) [19]. Within Spain, only the north (Basques, Catalans and Cantabrians) [4,19] has so far been represented in HFE gene studies; the rest of the Iberian Peninsula is unknown. In addition, two of these samples comprise only a small number of typed individuals. Our result for Central Spain, a C282Y gene frequency of 2%, falls within the range of variation of Mediterranean populations and is the lowest among the Spanish data compared.

The H63D mutation is more widely distributed, and its relationship with hemochromatosis is less clear. It is present in African, Asian and native American populations, although at a frequency lower than 2% in most of them. It is not found in most Australasians [4]. European populations show higher H63D frequencies than those of other continents, and the mutation is present in all studied samples, ranging from 6.7% in a sample of Mekhelta from the USSR to 30.4% in the Spanish Basque sample [4]. The same authors pooled all studied individuals (1440) residing in European countries and obtained an overall H63D frequency of 13.6%. In contrast to the C282Y mutation, H63D frequencies do not show any gradation from north to south, and Mediterranean populations have a range of variation similar to that of Northern Europe. The highest values are found in the Netherlands (29.5%) and in the two series analyzed from Northern Spain: Basques (30.4%) and Catalans (24.0%) [4]. Our findings are consistent with this European variation for H63D; the frequency for Central Spain (16.0%) is lower than in the Northern Spanish samples but very similar to that of many other populations, such as France (15.8%-16.9%) [20-22], United Kingdom (12.1%-15.8%) [4,17,18] and Ireland (18.9%) [4].

In conclusion, this study presents frequency data for HFE mutations related to hemochromatosis that are useful for comparing expected versus observed frequencies of individuals with hereditary hemochromatosis in the Spanish population. HFE mutations also show a high level of variability among populations, and this study extends the data on their distribution in European countries.
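The gene frequencies reported in this section can be checked by simple allele counting. The sketch below is a minimal illustration: the function name is hypothetical, and the binomial standard error over 2N chromosomes is an assumption (the text does not state how the errors were calculated), although it does reproduce the reported values.

```python
from math import sqrt


def allele_frequency(n_homozygous, n_heterozygous, n_individuals):
    """Estimate a variant allele frequency and its binomial standard error.

    Each individual carries two alleles, so the variant count is
    2 * homozygotes + heterozygotes out of 2 * n_individuals chromosomes.
    """
    n_chromosomes = 2 * n_individuals
    p = (2 * n_homozygous + n_heterozygous) / n_chromosomes
    # Assumed error model: binomial sampling of chromosomes.
    se = sqrt(p * (1 - p) / n_chromosomes)
    return p, se


# Genotype counts reported in the text for the 125 donors.
c282y_freq, c282y_se = allele_frequency(0, 5, 125)
h63d_freq, h63d_se = allele_frequency(2, 36, 125)

print(f"C282Y: {c282y_freq:.2f} +/- {c282y_se:.3f}")  # C282Y: 0.02 +/- 0.009
print(f"H63D:  {h63d_freq:.2f} +/- {h63d_se:.3f}")    # H63D:  0.16 +/- 0.023
```

Counting alleles rather than carriers is what makes the two H63D homozygotes contribute four variant chromosomes, giving 40 of 250 chromosomes in total.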

Acknowledgements

This work has been partially funded by a multidisciplinary project from Complutense University (PR182/96 no. 6745) and by DIGCYT project PB92-0224.

References

[1] C.Q. Edwards, L.M. Griffen, D. Goldgar, C. Drummond, M.H. Skolnick and J.P. Kushner, Prevalence of hemochromatosis among 11,065 presumably healthy blood donors, New Engl. J. Med. 318 (1988), 1355-1362.
[2] M. Simon, M. Bourel, M. Fauchet and B. Genetet, Association of HLA-A3 and HLA-B14 antigens with idiopathic hemochromatosis, Gut 17 (1976), 332-334.
[3] J.N. Feder, A. Gnirke, W. Thomas, Z. Tsuchihashi, D.A. Ruddy, A. Basava, F. Dormishian et al., A novel MHC class I-like gene is mutated in patients with hereditary haemochromatosis, Nat. Genet. 13 (1996), 399-408.
[4] A.T. Merryweather-Clarke, J.J. Pointon, J.D. Shearman and K.J. Robson, Global prevalence of putative haemochromatosis mutations, J. Med. Genet. 34 (1997), 275-278.
[5] B.R. Bacon, L.W. Powell, P.C. Adams, T.F. Kresina and J.H. Hoofnagle, Molecular medicine and hemochromatosis: at the crossroads, Gastroenterology 116 (1999), 193-207.
[6] E.C. Jazwinska, W.R. Pyper, M.J. Burt, J.L. Francis, S. Goldwurm, S.I. Webb, S.C. Lee, J.W. Halliday and L.W. Powell, Haplotype analysis in Australian hemochromatosis patients: Evidence for a predominant ancestral haplotype exclusively associated with hemochromatosis, Am. J. Hum. Genet. 56 (1995), 428-433.
[7] D.F. Wallace, J.S. Dooley and A.P. Walker, A novel mutation of HFE explains the classical phenotype of genetic hemochromatosis in a C282Y heterozygote, Gastroenterology 116 (1999), 1409-1412.
[8] V. Douabin, R. Moirand, A.M. Jouanolle, P. Brissot, J.Y. Le Gall, Y. Deugnier and V. David, Polymorphisms in the HFE gene, Hum. Hered. 49 (1999), 21-26.
[9] G. Mercier, A. Burckel, C. Bathelier, E. Boillat and G. Lucotte, Mutation analysis of the HLA-H gene in French hemochromatosis patients, and genetic counseling in families, Genetic Counseling 9 (1998), 181-186.
[10] R.S. Ajioka, L.B. Jorde, J.R. Gruen, G. Yu, D. Dimitrova, J. Barrow, E. Radisky, C.Q. Edwards, L.M. Griffen and J.P. Kushner, Haplotype analysis of hemochromatosis: Evaluation of different linkage-disequilibrium approaches and evolution of disease chromosomes, Am. J. Hum. Genet. 60 (1997), 1439-1447.
[11] C.Q. Edwards, M.M. Dadone, M.H. Skolnick and J.P. Kushner, Hereditary hemochromatosis, Clin. Haematol. 11 (1982), 411-435.
[12] M. Simon, L. LeMignon, M. Fauchet, J. Yaouanq, V. David, G. Edan and M. Bourel, A study of 609 haplotypes marking for the hemochromatosis gene: (1) mapping of the gene near the HLA-A locus and characters required to define a heterozygous population and (2) hypothesis concerning the underlying cause of hemochromatosis-HLA association, Am. J. Hum. Genet. 41 (1987), 89-105.
[13] M.A. Ocana, E. Arroyo-Pardo, C. Martinez-Labarga, M. Arroyo, V. Fuster and M.S. Mesa, Analisis de polimorfismos de STR en una region del Centro de Espana (Vera-Jerte, Caceres), in: Investigaciones en Biodiversidad Humana, T.A. Varela, ed., Santiago de Compostela University, Santiago de Compostela, 2000, pp. 788-797.
[14] M.S. Mesa, J. Martin, V. Fuster and R. Fisac, Blood polymorphisms and geography in the Sierra de Gredos, Spain, Hum. Biol. 66 (1994), 1005-1019.
[15] J.G. Beutler, T. Gelbart, C. West, P. Lee, M. Adams, R. Blackstone, P. Pockros, M. Kosty, C.P. Venditti, P.D. Phatak, N.K. Seese, K.A. Chorney, A.E.T. Elshof, G.S. Gerhard and M. Chorney, Mutation analysis in hereditary hemochromatosis, Blood Cells, Mol. and Dis. 22 (1996), 187-194.
[16] R. Steffensen, K. Varming and C. Jersild, Determination of gene frequencies for two common haemochromatosis mutations in the Danish population by a novel polymerase chain reaction with sequence-specific primers, Tissue Antigens 52 (1998), 230-235.
[17] A.G. Roberts, S.D. Whatley, R.R. Morgan, M. Worwood and G.H. Elder, Increased frequency of hemochromatosis Cys282Tyr mutation in sporadic porphyria cutanea tarda, Lancet 349 (1997), 321-323.
[18] Z. Miedzybrodzka, S. Loughlin, D. Baty, A. Terron, K. Kelly, J. Dean, M. Greaves, M. Pippard and N. Haites, Haemochromatosis mutations in North-East Scotland, Br. J. Haematol. 106 (1999), 385-387.
[19] E. Fabrega, B. Castro, L. Sanchez-Castro, A. Benito, J.L. Fernandez-Luna and F. Pons-Romero, Prevalencia de la mutacion Cys282Tyr del gen de la hemocromatosis en Cantabria y en los pacientes diagnosticados de hemocromatosis hereditaria, Med. Clin. (Barc.) 112 (1999), 451-453.
[20] A.M. Jouanolle, G. Gandon, P. Jezequel, M. Blayau, M.L. Campion, J. Mosser, P. Fergelot, B. Chauvel, P. Bouric, G. Cam, N. Andrieux, I. Gicquel, J.Y. Le Gall and V. David, Haemochromatosis and HLA-H, Nat. Genet. 14 (1996), 251-252.
[21] N. Borot, M.P. Roth, L. Malfroy, C. Demangel, J.P. Vine, J.P. Pascal and H. Coppin, Mutations in the MHC class I-like candidate gene for hemochromatosis in French patients, Immunogenetics 45 (1997), 320-324.
[22] P. Jezequel, M. Bargain, F. Lellouche and F. Geffroy, Allele frequencies of hereditary hemochromatosis gene mutations in a local population of west Brittany, Hum. Genet. 102 (1998), 332-333.
[23] R. Gottschalk, C. Seidl, T. Loffler, E. Seifried, D. Hoelzer and J.P. Kaltwasser, HFE codon 63/282 (H63D/C282Y) dimorphism in German patients with genetic hemochromatosis, Tissue Antigens 51 (1998), 270-275.
[24] C. Datz, M.R.A. Lalloz, W. Vogel, I. Graziadei, F. Hackl, G. Vautier, D.M. Layton, T. Maier-Dobersberger, P. Ferenci, E. Penner, F. Sandhofer, A. Bomford and B. Paulweber, Predominance of the HLA-H Cys282Tyr mutation in Austrian patients with genetic haemochromatosis, J. Hepatology 27 (1997), 773-779.
[25] A. Tordai, H. Andrikovics, L. Kalmar, M. Rajczy, B. Sarkadi, I. Klein and A. Varadi, High frequency of the haemochromatosis C282Y mutation in Hungary could argue against a Celtic origin of the mutation, J. Med. Genet. 35 (1998), 878-879.
[26] M. Carella, L. D'Ambrosio, A. Totaro, A. Grifa, M.A. Valentino, A. Piperno, D. Girelli, A. Roetto, B. Franco, P. Gasparini and C. Camaschella, Mutation analysis of the HLA-H gene in Italian hemochromatosis patients, Am. J. Hum. Genet. 60 (1997), 828-832.


ANX7 as a bio-marker in prostate and breast cancer progression

Meera Srivastava (a,*), Lukas Bubendorf (b,e), Lisa Nolan (a), Mirta Glasman (a), Ximena Leighton (a), Georgina Miller (c), Wilfred Fehrle (d), Mark Raffeld (d), Ofer Eidelman (a), Olli P. Kallioniemi (b), Shiv Srivastava (f) and Harvey B. Pollard (a)

(a) Department of Anatomy, Physiology and Genetics, and Institute for Molecular Medicine, Uniformed Services University School of Medicine (USUHS), Bethesda, MD 20814, USA
(b) Section on Molecular Genetics, Cancer Genetics Branch, NHGRI, National Institutes of Health, Bethesda, MD 20892, USA
(c) Veterinary Resources Program, NCRR, National Institutes of Health, Bethesda, MD 20892, USA
(d) Laboratory of Pathology, Hematopathology section, NCI, NIH, Bethesda, MD 20892, USA
(e) Institute for Pathology, University of Basel, Switzerland
(f) Center for Prostate Disease Research, and Department of Surgery, Uniformed Services University School of Medicine (USUHS), Bethesda, MD 20814, USA

The ANX7 gene codes for a Ca2+-activated GTPase, which has been implicated in both exocytotic secretion in cells and control of growth. In this review, we summarize information regarding increased tumor frequency in the Anx7 knockout mice, ANX7 growth suppression of human cancer cell lines, and ANX7 expression in human tumor tissue micro-arrays. The loss of ANX7 is significant in metastatic and hormone refractory prostate cancer compared to benign prostatic hyperplasia. In addition, ANX7 expression has prognostic value for predicting survival of breast cancer patients.

*Address for correspondence: Meera Srivastava, Ph.D, Department of Anatomy and Cell Biology, USU School of Medicine, 4301 Jones Bridge Road, Bethesda, MD 20814, USA. Tel.: +1 301 295 3204; Fax: +1 301 295 2822; E-mail: [email protected].
Disease Markers 17 (2001) 115-120, ISSN 0278-0240 / $8.00 © 2001, IOS Press. All rights reserved

1. Introduction

Long-term survival in cancer currently rests on detection and appropriate therapy at the earliest possible stage. Molecular markers for cancer are therefore being strongly sought after, in hopes of achieving the very earliest detection. Such markers include new tumor suppressor genes (TSG), whose identities are presently only hypothesized on the basis of allelic loss. For example, multiple potential tumor suppressor genes have been hypothesized to exist around the 10q21 locus of chromosome 10. Interestingly, the human ANX7 gene is located on chromosome 10q21, and gradually, we became alert to the possibility that ANX7 might have tumor suppressor gene activity. We found that ANX7 codes for a membrane-associated, Ca2+-activated GTPase, and is involved in exocytotic secretion [8,12,13,34,38]. ANX7 GTPase activity is sensitive to such critical modulators of conventional G-proteins as Al2F6 and mastoparan [8,9]. In studies with cultured cells, ANX7 can be shown to bind and hydrolyze GTP [8]. ANX7 protein also forms Ca2+ channels in membranes [33], which can be stabilized in long open states by GTP (Pollard and Arispe, unpublished data). The subcellular distribution of ANX7 protein is predominantly in membranes and to a lesser extent in the nucleus [11,27].

2. Methods and results

2.1. ANX7 plays a role in growth throughout phylogeny

Early work on the annexin VII gene (anx7/synexin) has shown that it is expressed in small amounts in nearly every cell, and is found throughout phylogeny as a single copy gene in organisms as diverse as man [36], mouse [44,45], Xenopus [38], and Dictyostelium [15,19,21]. The first molecular hints as to the possible involvement of the anx7 gene in growth have come from studies on Dictyostelium. The first anx7


M. Srivastava et al. / ANX7 as a bio-marker in prostate and breast cancer progression

gene disruption mutants were noted to have growth defects [15], and more recent studies have shown that these anx7-knockout mutants lose many properties related to growth, differentiation, motility and chemotaxis, especially in Ca2+-limiting conditions [3,16]. It has been reported that the anx7 level is increased during the transition of D. discoideum Ax-2 cells from growth to differentiation. Bonfils et al. [3] have also shown that, compared with the differentiated form of Dictyostelium, the proliferating form possesses only 1/5th the amount of anx7 mRNA and only 1/60th the amount of ANX7 protein. Okafuji et al. [31] have discovered that the mechanism involves generation of a naturally occurring anx7 antisense mRNA which activates growth and proliferation in wildtype Dictyostelium. These latter data have been interpreted as indicating a possible role for anx7 in a signal transduction pathway for growth. In summary, the anx7 gene appears to control the Dictyostelium cell cycle such that a relative decrease in anx7 gene activity would enhance growth and proliferation at the expense of Ca2+-dependent differentiated functions.

2.2. ANX7 is phosphorylated by protein kinases and mitogen-stimulated protein kinases

Protein kinase C phosphorylates ANX7 with a 2:1 Pi/protein molar ratio, both in vitro and in vivo [10]. This result is of possible relevance to ANX7 function in the cell cycle, since many isoforms of PKC have been directly implicated in activating intracellular signalling [30], and in specifically activating mitosis [2,7,26,29] and tumorigenicity [22,28,32]. Quantitative phospho-ANX7 adducts have also been prepared in vitro with the epidermal growth factor (EGF) receptor and pp60src. In vivo, cells treated with tyrosine kinase activators such as EGF and platelet-derived growth factor (PDGF) also support phosphorylation of endogenous ANX7. These reactions are of as yet unknown biological significance.
However, the potential relevance of such reactivity to tumor suppressor gene activity is suggested by reports that splice variants of the breast and ovarian cancer susceptibility gene BRCA1 contain phosphotyrosine and play a role in cell cycle regulation [14,43,46].

2.3. Anx7 knockout mice have growth anomalies and increased incidence of tumors

We used targeted homologous recombination technology to prepare a knockout mouse for the Anx7 gene.

Fig. 1. Adenocarcinoma of salivary gland. An image is shown of the gross appearance of the tumor in the Anx7 (+/-) mouse.

The null Anx7 (-/-) genotype is lethal, while the heterozygous Anx7 (+/-) mouse has been shown to display defects in growth control, Ca2+ signal transduction, endocrine functions and tumor suppression [37,39]. The fact that Anx7 (-/-) mice do not survive beyond 10 days of gestation indicates an essential role in the development of the embryo once the maternal message is exhausted. The male Anx7 (+/-) mouse begins an extraordinary growth spurt relative to normal littermate controls after postpartum week 4. Growth continues uninterrupted for at least 12 months, leading to 40-60 gram mice. Many internal organs increase in weight, some out of proportion to the weight increment of the mouse itself. Insulin secretion from beta cells in the islets of Langerhans is inefficient as a function of external Ca2+, and IP3-mediated calcium transients are attenuated in cultured beta cells [37]. Enlargement of the prostate has also been systematically noted in male mutants. In more recent work, we have become aware of a profoundly increased frequency of tumors, including prostate carcinoma, in Anx7 (+/-) animals compared to Anx7 (+/+) normal littermate controls. In general, tumor frequency is in the range of 20-50% of animals, becoming more accentuated with advancing age. An instance of a salivary gland adenocarcinoma was observed clinically as a steadily increasing mass on the right side of the neck region. This is shown in Fig. 1. An example of a hepatocellular carcinoma is shown in Fig. 2(B). For comparison, a sample of normal liver from an Anx7 (+/+) mouse is shown in Fig. 2(A).



Table 1 Levels of ANX7 protein expression in a prostate cancer tissue microarray. BPH: benign prostatic hyperplasia; PIN: high-grade prostatic intraepithelial neoplasia; stage T2: clinically localized primary cancer; stage T3/T4: locally advanced primary cancer; Hr loc rec: hormone-refractory local recurrence

Tumor              ANX7 positive   ANX7 negative   Total samples   Percent positive
BPH                20              2               22              91%
PIN                16              1               17              94%
Stage 2            106             2               108             98%
Stage 3/4          16              2               18              89%
Dist. metastasis   21              14              35              60%
Hr loc rec         70              38              108             65%

Fig. 2. Hepatocellular carcinoma. Liver was taken from the Anx7 (+/-) mouse and fixed in buffered formalin. Sections were stained with hematoxylin and eosin (H and E). A section from a hepatocellular carcinoma, taken at 100× magnification, is shown in Fig. 2(B). For comparison, a sample of normal liver from an Anx7 (+/+) mouse is shown in Fig. 2(A).

2.4. ANX7 and other tumor suppressor genes suppress the growth of tumor cell lines

A parallel and highly quantitative method of testing for tumor suppressor gene activity has been to transfect the candidate gene into a tumor cell line, and to determine whether the gene in question suppresses proliferation (e.g. [20]). The basic strategy behind this experiment is to transduce the wildtype tumor suppressor gene and to show that the growth characteristics of the tumor cell are lost in transfected cells. In the case of the ANX7 gene, our data, complete with p53 positive controls, show that the human ANX7 gene also suppresses growth of prostate, breast and osteosarcoma cell lines by a mechanism involving G1 arrest. This is an important result for candidate tumor suppressor gene characterization because certain human prostate tumor cell lines can be suppressed when a mutated Rb gene is supplanted by a wildtype Rb gene [4,23]. Equivalent results have been reported for a human bladder carcinoma cell line [42]. Similar reports have also been made for the p53 gene (e.g. [17,18,24]). Specific examples include suppression of growth of human colorectal cancer cells [1] and human prostate cancer cell lines such as LNCaP and DU145 [41].

2.5. ANX7 expression is completely lost in a high proportion of metastatic and hormone-refractory prostate cancers

Human tumor tissue microarray technology allows one to query hundreds of tumors at a time for under- or

over-expression of the candidate gene product using immunohistochemical or other techniques [6,25,35]. To efficiently analyze the clinical significance of ANX7, we therefore used a prostate tissue microarray in which we were able to query ANX7 protein expression in hundreds of tumors. The tissue microarray we used was specifically constructed with 376 specimens [5]. These specimens were from across all stages of progression, including 25 normal controls, 25 PIN lesions, 150 untreated localized tumors, 135 hormone-refractory local recurrences, and 41 distant metastases. The levels of ANX7 were evaluated by immunohistochemistry using a monoclonal anti-ANX7 antibody. As shown in Table 1, we found that ANX7 expression is completely lost in a high proportion of metastases (57%) and in local recurrences of hormone refractory prostate cancer (63%). These highly significant data strongly suggest that the ANX7 gene has clinical relevance for prostate cancer in man. By contrast, ANX7 occurs at close to normal levels in benign prostate glands, high grade prostatic intraepithelial neoplasia (PIN), and stage T2 and T3/4 primary tumors (all in the range of 89-96%). Using Ki67 immuno-staining as an index of tumor cell proliferation, we find that a high Ki67 labeling index is positively correlated with lack of ANX7 expression [39]. Thus ANX7 expression is most profoundly reduced in the most prognostically challenging forms of prostate cancer.

2.6. Alterations in ANX7 expression as a function of breast cancer progression

We determined the frequency of ANX7 protein expression in a breast tissue microarray containing 525 tumor specimens from all stages of human breast tumor progression. The levels of ANX7 were evaluated by immunohistochemistry using a monoclonal anti-ANX7 antibody. We find that a significant reduction in ANX7



Fig. 3. ANX7 expression in a breast cancer tissue microarray as a function of cumulative survival at the five year time point. A. Survival is measured as a function of clinical stages of breast cancer (BRE). B. Survival is measured as a function of pathological stages of breast cancer (pT). The levels of ANX7 expression in the tissue array are color coded, as shown in the inset.

expression occurs in primary breast cancers. On the other hand, the percentage of ANX7-positive specimens increases progressively as the tumor progresses (data not shown; and [40]). In a second set of experiments, we used a prognostic breast cancer array containing 303 tumor specimens to detect ANX7 by immunohistochemistry using a monoclonal anti-ANX7 antibody. The ANX7 levels are classified as 0 (no staining), 1 (low staining), 2 (moderate staining), and 3 (highest staining intensity). Kaplan-Meier curves of disease-free survival in

patients with low (0) versus high (3) ANX7 staining show a significant separation within 5 years of follow-up (p = 0.0017). As depicted in the histograms, high cytoplasmic ANX7 expression predicts a patient survival of 50% at the clinical BRE grade 2 level (Fig. 3(A)) and 30% at pathological stage pT4 (Fig. 3(B)). Significantly, no change in nuclear ANX7 staining is observed in any of the cases. Thus these data strongly support the clinical relevance of the ANX7 gene as a prognostic marker for aggressive


treatment of breast cancer in women.

3. Conclusion

Prostate cancer is the most common cancer detected in American men, for whom metastatic and hormone refractory prostate cancer are the end-stage, lethal forms of this disease. Of profound relevance to prostate cancer in humans is our finding that both metastatic and hormone refractory prostate cancers are associated with a significant loss of ANX7 gene expression. The key role of ANX7 in regulating cell proliferation is reflected in the properties of the Anx7 knockout mouse which we generated in our lab. The phenotype of this mouse includes gigantism and a high incidence of spontaneous tumors, including prostate carcinoma. In addition, overexpression of wild type ANX7 shows potent inhibitory effects on the growth of two metastatic human prostate cancer cell lines. These results thus lend further support to our proposed role for ANX7 in cancer and cell proliferation, not only in the mouse knockout model created by us, but also in the breast and prostate cancer specimens that we tested. This is an important insight, because until now the ANX7 gene has not been thought to play such a role [34]. It is of particular importance that ANX7 is located on human chromosome 10q21, where hitherto unidentified potential tumor suppressor genes have been predicted to exist. It is possible that ANX7 is at least one of the tumor suppressor genes predicted to occur at this locus. The fact that ANX7 protein expression is significantly reduced in androgen-insensitive metastatic and locally recurrent hormone-insensitive prostate cancers suggests that the study of ANX7 gene action will have great potential importance for understanding human prostate cancer progression. Most importantly, in studies of human breast tumors, high cytoplasmic expression of ANX7 was found to be a strong predictor of reduced disease-free survival.
As these studies show, the relationship between levels of ANX7 and progression of these different cancers is at a nascent stage of understanding. Nonetheless, we anticipate that further work to elucidate the tumor-specific action of ANX7 will provide new and useful tools for diagnosis, prognosis and therapy for these different types of cancers.
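As a quick consistency check on Table 1, the total-samples and percent-positive columns can be recomputed from the ANX7-positive and ANX7-negative counts. The following is a minimal Python sketch of that arithmetic only. The counts are transcribed from Table 1, while the names `TABLE_1` and `summarize` are illustrative assumptions and not part of the original study.

```python
# Counts transcribed from Table 1: (ANX7-positive, ANX7-negative) specimens.
# TABLE_1 and summarize() are hypothetical names used only for illustration.
TABLE_1 = {
    "BPH": (20, 2),
    "PIN": (16, 1),
    "Stage 2": (106, 2),
    "Stage 3/4": (16, 2),
    "Dist. metastasis": (21, 14),
    "Hr loc rec": (70, 38),
}


def summarize(counts):
    """Return {category: (total, percent_positive)}, percent rounded to an integer."""
    summary = {}
    for category, (positive, negative) in counts.items():
        total = positive + negative
        summary[category] = (total, round(100 * positive / total))
    return summary


if __name__ == "__main__":
    # Print one row per tumor category: label, total samples, percent positive.
    for category, (total, percent) in summarize(TABLE_1).items():
        print(f"{category:<18}{total:>5}{percent:>5}%")
```

Running the script prints one row per tumor category, which can be compared against the "Total samples" and "Percent positive" columns of Table 1.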

References

[1] S.J. Baker, S. Markowitz, E.R. Fearon, J.K.V. Willson and B. Vogelstein, Suppression of human colorectal cancer cell growth by wildtype p53, Science 249 (1990), 912-915.
[2] E. Berra, M.T. Diaz-Meco, I. Dominguez, M.M. Municio, L. Sanz, J. Lozano, R.S. Chapkin and J. Moscat, Protein kinase C zeta isoform is critical for mitogenic signal transduction, Cell 74 (1993), 555-563.
[3] C. Bonfils, M. Greenwood and A. Tsang, Expression and characterization of a Dictyostelium discoideum annexin, Mol. Cell. Biochem. 139 (1994), 159-166.
[4] R. Bookstein, J.-Y. Shew, P.-L. Chen, P. Scully and W.-H. Lee, Suppression of tumorigenicity of human prostate carcinoma cells by replacing a mutated RB gene, Science 247 (1990), 712-715.
[5] C. Bowen, L. Bubendorf, J.H. Voeller, R. Slack, N. Willi, G. Sauter, T.C. Gasser, P. Koivisto, E.E. Lack, J. Kononen, O.P. Kallioniemi and E.P. Gelmann, Loss of NKX3.1 expression in human prostate cancer correlates with tumor progression, Cancer Res. 60 (2000), 6111-6115.
[6] L. Bubendorf, M. Kolmer, J. Kononen, P. Koivisto, S. Mousses, Y. Chen, E. Mahlamaki, P. Schraml, H. Moch, N. Willi, A.G. Elkahloun, T.G. Pretlow, T.C. Gasser, M.J. Mihatsch, G. Sauter and O.P. Kallioniemi, Molecular mechanisms of hormone therapy failure in human prostate cancer analyzed by a combination of cDNA and tissue microarrays, J. Natl. Cancer Inst. 91 (1999), 1758-1764.
[7] A. Cacace, S.N. Guadagno, R.S. Krauss, D. Fabbro and I.B. Weinstein, The epsilon isoform of protein kinase C is an oncogene when overexpressed in rat fibroblasts, Oncogene 8 (1993), 2094-2104.
[8] H. Caohuy, M. Srivastava and H.B. Pollard, GTP-activation of membrane fusion protein anx7 (Annexin VII) and detection of Ca2+-activated GTPase activity in vitro and in vivo, Proc. Natl. Acad. Sci. (USA) 93 (1996), 10797-10802.
[9] H. Caohuy, M. Srivastava and H.B. Pollard, Membrane fusion protein annexin VII: a Ca2+-activated GTPase target for mastoparan in secreting chromaffin cells, in: Secretory Systems and Toxins (Vol. 2), M. Linial, A. Grasso and P. Lazarovici, eds, 1998, pp. 439-449.
[10] H. Caohuy and H.B. Pollard, Activation of annexin 7 by protein kinase C in vitro and in vivo, J. Biol. Chem. (2000), in review.
[11] A.M. Cardenas, A.J. Kuijpers and H.B. Pollard, Effect of protein synthesis inhibitors on synexin levels and secretory response in bovine adrenal medullary chromaffin cells, Biochim. Biophys. Acta 1234 (1995), 255-260.
[12] C.E. Creutz, C.J. Pazoles and H.B. Pollard, Identification and Purification of an Adrenal Medullary Protein (synexin) That Causes Calcium Dependent Aggregation of Isolated Chromaffin Granules, J. Biol. Chem. 253 (1978), 2858-2866.
[13] C.E. Creutz, C.J. Pazoles and H.B. Pollard, Self-Association of synexin in the Presence of Calcium: Correlation with synexin-Induced Membrane Fusion and Examination of the Structure of Anx7 Aggregates, J. Biol. Chem. 254 (1979), 553-558.
[14] J.Q. Cui, H. Wang, E.S. Reddy and V.N. Rao, Differential transcriptional activation by the N-terminal region of BRCA1 splice variants BRCA1a and BRCA1b, Oncol. Rep. 5 (1998), 585-589.
[15] V. Doring, M. Schleicher and A.A. Noegel, Dictyostelium annexin VII (anx7), J. Biol. Chem. 266 (1991), 17509-17515.
[16] V. Doring, F. Veretout, R. Albrecht, B. Muhlbauer, C. Schlatterer, M. Schleicher and A.A. Noegel, The in vivo role of annexin VII (synexin): characterization of an annexin VII-deficient Dictyostelium mutant indicates an involvement in Ca(2+)-regulated processes, J. Cell Sci. 108 (1995), 2065-2076.
[17] D. Eliyahu, D. Michalovitz, S. Eliyahu, O. Pinhasi-Kimhi and M. Oren, Wildtype p53 can inhibit oncogene-mediated focus formation, Proc. Natl. Acad. Sci. (USA) 86 (1989), 8763-8767.
[18] C.A. Finlay, P.W. Hinds and A.J. Levine, The p53 proto-oncogene can act as a suppressor of transformation, Cell 57 (1989), 1083-1093.
[19] V. Gierke, Identification of a Homologue for Annexin VII (Anx7) in Dictyostelium discoideum, J. Biol. Chem. 266 (1991), 1697-1700.
[20] M.S. Greenblatt, W.P. Bennett, M. Hollstein and C.C. Harris, Mutations in the p53 tumor suppressor gene: Clues to cancer etiology and molecular pathogenesis, Cancer Res. 54 (1994), 4855-4878.
[21] M. Greenwood and A. Tsang, Sequence and expression of annexin VII of Dictyostelium discoideum, Biochim. Biophys. Acta 1088(3) (1991), 429-432.
[22] G.M. Housey, M.D. Johnson, W.L.-W. Hsiao, C.A. O'Brian, J.P. Murphy, P. Kirschmieir and I.B. Weinstein, Overproduction of protein kinase C causes disordered growth control in rat fibroblasts, Cell 52 (1988), 343-354.
[23] H.J.-S. Huang, J.-K. Yee, J.Y. Shew, R. Bookstein, T. Friedmann, E.Y.-H.P. Lee and W.-H. Lee, Suppression of the neoplastic phenotype by replacement of the RB gene in human cancer cells, Science 242 (1988), 1563-1566.
[24] W.B. Isaacs, B.S. Carter and C.M. Ewing, Wild-type p53 suppresses growth of human prostate cancer cells containing mutant p53 alleles, Cancer Res. 51 (1991), 4716-4720.
[25] J. Kononen, L. Bubendorf, A. Kallioniemi, M. Barlund, S. Leighton, J. Torhorst, M.J. Mihatsch, G. Sauter and O.P. Kallioniemi, Tissue microarrays for high-throughput molecular profiling of hundreds of tumor specimens, Nature Medicine 4 (1998), 844-847.
[26] W. Kolch, G. Heidecker, G. Kochs, R. Hummel, H. Vahidi, H. Mischak, G. Finkenzeller, D. Marme and U.R. Rapp, PKCα activates raf-1 by direct phosphorylation, Nature 364 (1993), 426-428.
[27] G.A.J. Kuijpers, G. Lee and H.B. Pollard, Immunolocalization of Anx7 (Annexin VII) in Adrenal Chromaffin Granules and Chromaffin Cells: Evidence for a Dynamic Role in the Secretory Process, Cell Tissue Res. 269 (1992), 323-330.
[28] H. Mischak, J. Goodnight, W. Kolch, G. Martiny-Baron, C. Schaechtle, M.G. Kazanietz, P.M. Blumberg, J.H. Pierce and J.F. Mushniski, Overexpression of protein kinase C-delta and epsilon in NIH 3T3 cells induces opposite effects on growth, morphology, anchorage dependence and tumorigenicity, J. Biol. Chem. 268 (1993), 6090-6096.
[29] D.K. Morrisson, D.R. Kaplan, U. Rapp and T.M. Roberts, Signal transduction from membrane to cytoplasm: growth factors and membrane-bound oncogene products increase Raf-1 phosphorylation and associated protein kinase activity, Proc. Natl. Acad. Sci. (USA) 85 (1988), 8855-8859.
[30] Y. Nishizuka, Intracellular signalling by hydrolysis of phospholipids and activation of protein kinase C, Science 258 (1992), 607-614.
[31] T. Okafuji, F. Abe and Y. Maeda, Antisense-mediated regulation of Annexin VII gene expression during the transition from growth to differentiation in Dictyostelium discoideum, Gene 189 (1997), 49-56.
[32] D.A. Persons, W.O. Wilkison, R.M. Bell and O.J. Finn, Altered growth regulation and enhanced tumorigenicity of NIH 3T3 fibroblasts transfected with protein kinase C-I cDNA, Cell 52 (1988), 447-458.
[33] H.B. Pollard and E. Rojas, Calcium Activated annexin Forms Highly Selective, Voltage-gated Channels in Phosphatidylserine Bilayer Membranes, Proc. Natl. Acad. Sci. (USA) 85 (1988), 2974-2978.
[34] P. Raynal and H.B. Pollard, Annexins: the problem of assessing the biological role for a gene family of multifunctional calcium- and phospholipid-binding proteins, BBA Biomembranes 1197 (1994), 63-93.
[35] P. Schraml, J. Kononen, L. Bubendorf, H. Moch, H. Bissig, A. Nocito, M.J. Mihatsch, O.P. Kallioniemi and G. Sauter, Tissue microarrays for gene amplification surveys in many different tumor types, Clin. Cancer Res. 5 (1999), 1966-1975.
[36] A. Shirvan, M. Srivastava, M.A. Wang, C. Cultraro, K. Magendzo, O.W. McBride, H.B. Pollard and A.L. Burns, Structure of the human synexin (Annexin VII) gene and assignment to chromosome 10, Biochemistry 33 (1994), 6888-6901.
[37] M. Srivastava, I. Atwater, M. Glasman, X. Leighton, G. Goping, G. Miller, D. Mears, E. Rojas and H.B. Pollard, Defects in IP3 Receptor Expression, Ca2+-Signaling and Insulin Secretion in the Anx7 (+/-) Knockout Mouse, Proc. Natl. Acad. Sci. (USA) 96 (1999), 13783-13788.
[38] M. Srivastava, G. Goping, H. Caohuy, P. McPhie and H.B. Pollard, Novel isoforms of anx7 in Xenopus laevis: Multiple tandem PGQM repeats distinguish mRNAs in specific adult tissues and embryonic stages, Biochem. J. 316 (1996), 729-736.
[39] M. Srivastava, L. Bubendorf, V. Srikantan, L. Fossom, L. Nolan, M. Glasman, X. Leighton, G. Miller, H. Caohuy, Y. Sei, W. Fehrle, S. Pittaluga, M. Raffeld, P. Koivisto, N. Willi, T. Gasser, J. Kononen, G. Sauter, O.P. Kallioniemi, S. Srivastava and H.B. Pollard, ANX7: A Novel Candidate Tumor-Suppressor Gene for Prostate Cancer, Proc. Natl. Acad. Sci. (USA) 98 (2001), 4575-4580.
[40] M. Srivastava, L. Bubendorf, L. Nolan, M. Glasman, X. Leighton, Y. Sei, W. Fehrle, M. Raffeld, P. Koivisto, N. Willi, T. Gasser, J. Kononen, G. Sauter, O. Eidelman, O.P. Kallioniemi, S. Srivastava and H.B. Pollard, ANX7 as a prognostic biomarker in breast cancer progression, in review (2001).
[41] S. Srivastava, D. Katayose, Y.A. Tong, C.R. Craig, D.G. McLeod, J.W. Moul, K.H. Cowan and P. Seth, Recombinant adenovirus vector expressing wildtype p53 is a potent inhibitor of prostate cancer cell proliferation, Urology 46 (1998), 843-848.
[42] R. Takahashi, T. Hashimoto, H.-J. Xu, S.-X. Hu, T. Matsui, T. Miki, H. Bigo-Marshall, S.A. Aaronson and W.F. Benedict, The retinoblastoma gene functions as a growth and tumor suppressor in human bladder carcinoma cells, Proc. Natl. Acad. Sci. (USA) 88 (1991), 5257-5261.
[43] H. Wang, N. Shao, Q.M. Ding, J. Cui, E.S. Reddy and V.N. Rao, BRCA1 proteins are transported to the nucleus in the absence of serum and splice variants BRCA1a, BRCA1b are tyrosine phosphoproteins that associate with E2F, cyclins and cyclin dependent kinases, Oncogene 15 (1997), 143-157.
[44] Z.-Y. Zhang-Keck, A.L. Burns and H.B. Pollard, Mouse anx7 (Annexin VII) polymorphisms and phylogenetic comparison with other anx7s, Biochem. J. 289 (1993), 735-741.
[45] Z.-Y. Zhang-Keck, M. Srivastava, C.A. Kozak, H. Caohuy, A. Shirvan, A.L. Burns and H.B. Pollard, Genomic organization and chromosomal localization of the mouse anx7 (Annexin VII) gene, Biochem. J. 301 (1994), 835-845.
[46] H.T. Zhang, X. Zhang, H.Z. Zhao, Y. Kajino, B.L. Weber, J.G. Davis, Q. Wang, D.M. O'Rourke, H.B. Zhang, K. Kajino and M.I. Greene, Relationship of p215BRCA1 to tyrosine kinase signaling pathways and the cell cycle in normal and transformed cells, Oncogene 14 (1997), 2863-2869.



Advertisement

A Century of Science Publishing Edited by: EH Fredriksson, 2001,320 pp., softcover, ISBN: 1 58603 148 1 Price: US$38/EUR40/£26 Leading publishers and observers of the science publishing scene comment in essay form on key developments over the past century. The scale of the glofcjal research effort and its industrial organisation have resulted in substantial increases in the published volume, as well as new techniques for its handling. The former languages of science communication, like Latin and German, have given way to English. The domination of European science before WWII has been followed by large efforts in North America and the Far East.The roots of the National Library of Medicine lie in the US Army medical library, the US War effort gave rise to hypertext, and the US defense reaction to the Soviet Sputnik resulted in the Internet. The European invention of the Web has also changed the science publishing scene in the past five years. Some characteristic publishing enterprises, commercial and society owned, are described in a series of articles. These are followed by analysis of recent developments and possible changes to come. Functions of publishers, librarians and agents are brought into context. The future of publishing is currently being debated on open channels, while the historical dimension and professional input are sometimes lacking. For more information, please use e-mail: [email protected], or visit our homepage at URL http://www.iospress.nl. Orders can be placed at [email protected] or at the addresses below IOS Press NieuweHemweg6B 1013 BG Amsterdam The Netherlands Tel:-1-3120 688 3355 Fax:+3120 620 3419

IOS Press/Lavis Marketing 73 Lime Walk Headington Oxford OX3 7AD, UK Fax: +44 1865 750079

IOS Press, Inc. 5795-G Burke Centre Parkway Burke VA 22015 USA Fax: +1 703 323 3668

IOS Press/LSL Gerichtsweg 28 D-04103 Leipzig Germany Fax: +49 341 995 4255

Ohmsha, Ltd. 3-1 Kanda Nishiki-cho Chiyoda-ku Tokyo 101-8460 Japan Fax:+8132936224

Advertisement

Molecular Pathology of Early Cancer Edited by: S. Srivastava 1999, 482 pp., hardcover ISBN: 90 5199 373 0 Price: NLG200/EUR90,76/£65/DM178/US$99 The field of molecular pathology as it applies to cancer is emerging as the most accessible and utilitarian application of the great conceptual breakthroughs in our understanding of the underlying mechanisms of cancer. Research over the last 100 years into both the nature of cancer and its causes has converged over the last 10-20 years on a unified and unifying understanding of the nature of the beast. Three critical aspects of our current view of cancer underlie the emerging molecular pathology described in this useful volume. First, cancer is likely to be in most, if not in all cases, a monoclonal disease wherein the clinically apparent tumor represents the descendant of one cell. Second, as a manifestation of Darwinian evolution based upon variation and selection, all cancers evolve over time by the accumulation of genetic changes and the selection of cells capable of preferential proliferation and/or accumulation. Third, the nature of each cancer, that is, its clinical behavior, is a reflection of the pattern of gene expression characteristic of that cancer. The very definition of cancer is therefore a molecular one. That cancer is a product of cellular evolution means that we can draw no strict boundary that defines the passage of a cell from normal to cancerous. Rather, the process of carcinogenesis involves the accumulation of genetic changes that may or may not have well-defined correlates in classic cellular pathology. For more information, please use e-mail: [email protected], or visit our homepage at URL http://www.iospress.nl. Orders can be placed at [email protected] or at the addresses below. IOS Press Nieuwe Hemweg 6B 1013 BG Amsterdam The Netherlands Tel: +31 20 688 3355 Fax: +31 20 620 3419

IOS Press/Lavis Marketing 73 Lime Walk Headington Oxford OX3 7AD, UK Fax: +44 1865 750079

IOS Press, Inc. 5795-G Burke Centre Parkway Burke VA 22015 USA Fax: +1 703 323 3668

IOS Press c/o Universitäts Buchhandlung Spandauerstrasse 2 D-10178 Berlin, Germany Fax: +49 30 242 3113

Ohmsha, Ltd. 3-1 Kanda Nishiki-cho Chiyoda-ku Tokyo 101 Japan Fax: +81 3 3233 2426

Disease Markers Subscription information Disease Markers (ISSN 0278-0240) is published in one volume of four issues a year. The subscription price for 2001 (Volume 17) is EUR 365 + EUR 23 p.h. = EUR 388 (US$ 372). The Euro price is definitive. The US dollar price is subject to exchange-rate fluctuations and is given only as a guide. 6% VAT is applicable for certain customers in the EC Countries. Subscriptions are accepted on a prepaid basis only, unless different terms have been previously agreed upon. Personal subscription rates and conditions, if applicable, are available upon request from the Publisher. Subscription orders can be entered only by calendar year (Jan.-Dec.) and should be sent to the Subscription Department of IOS Press, or to your usual subscription agent. Postage and handling charges include printed airmail delivery to countries outside Europe. Claims for missing issues must be made within six months of our publication (mailing) date, otherwise such claims cannot be honoured free of charge.

Publisher IOS Press Nieuwe Hemweg 6B 1013 BG Amsterdam The Netherlands Tel.: +31 20 688 33 55 Fax: +31 20 620 34 19

E-mail: Subscription Department: [email protected] Advertising Department: [email protected] Desk Editorial Department: [email protected] Internet: www.iospress.nl

© 2001 IOS Press. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher, IOS Press, Nieuwe Hemweg 6B, 1013 BG Amsterdam, The Netherlands. No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, instructions or ideas contained in the material herein. Although all advertising material is expected to conform to ethical standards, inclusion in this publication does not constitute a guarantee or endorsement of the quality or value of such product or of the claims made of it by its manufacturer. Special regulations for readers in the USA: This journal has been registered with the Copyright Clearance Center, Inc. Consent is given for copying of articles for personal or internal use, or for the personal use of specific clients. This consent is given on the condition that the copier pays through the Center the per-copy fee stated in the code on the first page of each article for copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Law. The appropriate fee should be forwarded with a copy of the first page of the article to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. If no code appears in an article, the author has not given broad consent to copy and permission to copy must be obtained directly from the author. This consent does not extend to other kinds of copying, such as for general distribution, resale, advertising and promotion purposes, or for creating new collective works. Special written permission must be obtained from the publisher for such copying.