FA April 1, 2006
15:41
WSPC/Book-329: Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
Contents
A Personal Introduction
Chapter 1
Small Molecules for Chemogenomics-based Drug Discovery
Edgar Jacoby, Ansgar Schuffenhauer, Kamal Azzaoui, Maxim Popov, Sigmar Dressler, Meir Glick, Jeremy Jenkins, John Davies and Silvio Roggo
Chapter 2
Mapping the Chemogenomic Space
Jordi Mestres
Chapter 3
Natural Product Scaffolds and Protein Structure Similarity Clustering (PSSC) as Inspiration Sources for Compound Library Design in Chemogenomics and Drug Development
Frank J. Dekker, Stefan Wetzel and Herbert Waldmann
Chapter 4
A Reductionist Approach to Chemogenomics in the Design of Drug Molecules and Focused Libraries
Roger Crossley and Martin Slater
Chapter 5
In silico Screening of the Protein Structure Repertoire and of Protein Families
Didier Rognan
Chapter 6
New Methods for Similarity-based Virtual Screening
Jérôme Hert, Peter Willett and David J. Wilton
Chapter 7
Structural Informatics: Chemogenomics In silico
Derek A. Debe, Kevin P. Hambly and Joseph F. Danzer
Chapter 8
Construction of a Homogeneous and Informative In vitro Profiling Database for Anticipating the Clinical Effects of Drugs
Nicolas Froloff, Valérie Hamon, Philippe Dupuis, Annie Otto-Bruc, Boryeu Mao, Sandra Merrick and Jacques Migeon
Index
A Personal Introduction
Following the sequencing of the human genome, a recognized key scientific challenge for the 21st century is the systematic identification of small molecules that interact specifically with the products of the genome and modulate their biological function. Progress on this challenge will contribute strongly to the fundamental understanding of the biological function of the individual gene products and ultimately provide a basis for the discovery of new and better therapies for human diseases. Chemogenomics addresses this scientific challenge and integrates advanced disciplines such as chemistry, genetics, chemo- and bioinformatics, structural biology, and biological screening in phenotypic and target-based assays. Complementary to previous publications on chemogenomics that focus on the individual component disciplines, this review book provides a general knowledge-centric overview of the different chemical, biological and informatics components. The book is unique in that it provides an integrated review of recent work by leaders in the various disciplines and sheds light on how these disciplines can interact efficiently for the rapid, simultaneous discovery of new targets and their effector molecules, leading toward the study of the biological pathways and circuits in which these targets are involved. All contributing chapter authors deliberately focus on knowledge-based approaches and show how previously generated knowledge on molecular recognition modes can be applied efficiently and systematically to new molecular discoveries. From the perspective of drug discovery, it should be acknowledged upfront that the primary role of chemogenomics is, in my opinion, to provide starting points for future drug optimization projects, which continue to rely on classical medicinal chemistry and in vivo pharmacology-based design and selection principles.
Examples of chemogenomics approaches pursued in academia, as well as in start-up biotech and pharmaceutical settings, are provided herein.
Chapter 1, contributed by the Novartis Molecular and Library Informatics group, focuses on small molecules for chemogenomics-based drug discovery and summarizes the main compound categories and selection methods relevant to the compilation of a comprehensive discovery screening collection. In Chapter 2, Prof. Jordi Mestres from the Chemogenomics Laboratory at the University Pompeu Fabra in Barcelona summarizes methodologies for mapping the chemogenomics space, an essential prerequisite for extracting knowledge from biochemical data. In Chapter 3, the group of Prof. Herbert Waldmann at the Max Planck Institute for Molecular Physiology provides a rationale for how natural products, which have historically played a predominant role in drug discovery, can be used efficiently in combination with protein structure similarity clustering to inspire directed compound library design and target identification. In Chapter 4, Dr. Roger Crossley and Dr. Martin Slater from Galapagos-Biofocus, Inc., a biotech company that pioneered library design, outline a reductionist approach to the design of drug molecules and focused libraries centering on the ion channel and GPCR target families. In Chapter 5, Prof. Didier Rognan from the University Louis Pasteur of Strasbourg elaborates, on the basis of the GPCR target family, a concept for the in silico screening of the protein structure repertoire and of protein families in general. In Chapter 6, the group of Prof. Peter Willett from the University of Sheffield summarizes new chemoinformatics methods for similarity-based virtual screening which, starting from known active compounds, are useful for identifying new ligands for targets related by conserved molecular recognition. In Chapter 7, Dr. Derek A. Debe from the chemogenomics knowledge-based company Eidogen-Sertanty, Inc. demonstrates the role of 3D structural informatics for in silico chemogenomics: systematic comparison of ligand binding sites enables the identification of new biological targets and potential side activities, as well as directed compound design strategies. Finally, in Chapter 8, Dr. Nicolas Froloff and his colleagues from Cerep SA discuss how profiling data, integrated in a homogeneous and informative in vitro profiling database, are becoming important for lead prioritization and for the design of safety pharmacology studies that anticipate the clinical effects of drugs and enable opportunistic drug discovery approaches.
All chapter authors are gratefully acknowledged for their excellent scientific contributions and their willingness to share their insights and strategic viewpoints on chemogenomics, which make this book especially interesting to read. I also thank Mrs. Sook Cheng Lim and the staff of World Scientific Publishing, Co. for the invitation to edit this review book and for their commitment in handling all aspects of the production work. I’m delighted with this book and hope that you, the reader, will find it both informative and enjoyable.
Edgar Jacoby
Basel, January 2006
1
Small Molecules for Chemogenomics-based Drug Discovery
Edgar Jacoby,*,a Ansgar Schuffenhauer,* Kamal Azzaoui,* Maxim Popov,* Sigmar Dressler,* Meir Glick,† Jeremy Jenkins,† John Davies† and Silvio Roggo*
1. Introduction
The compound collections used within chemogenomics target family-oriented screening and chemical biology screening of whole cell systems to detect particular phenotypes include a diversity of sources.1 Typically the major pharmaceutical companies have large sets of handcrafted medicinal chemistry compounds that were generated in large quantities as crystalline samples with good water solubility in lead-optimization projects and which include design input to address the ca. 500 molecular targets investigated to date in drug discovery.2 Since the beginning of the 1990s, these sources have been enriched by compound acquisition projects where the worldwide academic organic and medicinal chemistry pools (in particular in the former Soviet Union) have become an invaluable compound source; this resulted in a successful business activity for suppliers of screening compounds. Since the mid 1990s, when combinatorial chemistry became a dominant technology-driven approach, compounds from combinatorial and parallel synthesis projects have been included. The first libraries were purely chemistry and number driven. Initial reports claimed to include easily 1 000 000 or more compounds based on a few chemotypes, usually peptide-based. This approach, however, did not deliver on its promise and resulted in typically higher molecular weight and lipophilicity compounds.
∗ Novartis Institutes for BioMedical Research, Novartis Pharma AG, Lichtstrasse 35, Basel, CH-4056, Switzerland. † Novartis Institutes for BioMedical Research, Inc. 250 Massachusetts Avenue, Cambridge, MA 02139, USA. a Corresponding author. E-mail:
[email protected]. Tel.: +41 61 32 46186; Fax: +41 61 32 46261.
Subsequently, a quantity-to-quality paradigm shift took place.3−7 Today most combinatorial and parallel synthesis approaches are used to generate three types of libraries, viz., 1) Diverse/random libraries based on diverse scaffolds of lead and drug-like molecules for HTS (high-throughput screening); 2) Targeted libraries aiming at specific target families, or molecular interaction modes for HTS or MTS (medium throughput screening); and 3) Focused libraries for hit-to-lead and lead optimization projects where for a given chemical lead candidate, subtle modifications in the substituents are probed to optimize the pharmacodynamic and pharmacokinetic properties.1,8 The advent of combinatorial chemistry immediately triggered molecular design approaches such as chemoinformatics and computational library design in order to cope with the large number of compounds and to extract SAR (structure-activity relationship) information based on HTS data. The approaches are able to efficiently address the concepts of molecular diversity and, more recently, molecular complexity.9−13 This chapter summarizes the typical compound categories used in chemogenomics and chemical biology research together with current knowledge-based design and selection criteria aimed at systematically discovering small molecule ligands for interaction at binding sites on the target proteins of the genome.14−21 Emphasis is given to the systematic principles which allow many different types of targets of interest to be addressed in an efficient manner. The role of molecular information systems in integrating the chemical and biological knowledge spaces will be emphasized.
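As a rough illustration of how "molecular diversity" can be handled computationally in library design, the Python sketch below picks a diverse subset from a compound set with a MaxMin-style selection on Tanimoto similarity. This is one common chemoinformatics technique and is not claimed to be the method of the references cited above. The fingerprints are assumed to be sets of "on" bit positions, and the compound names and bit sets are invented for illustration only.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints stored as sets of 'on' bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def maxmin_pick(fps: Dict[str, FrozenSet[int]], n_pick: int, seed: str) -> List[str]:
    """Greedy MaxMin selection: repeatedly add the compound whose most
    similar already-picked neighbour is the least similar."""
    picked = [seed]
    # nearest[name] = highest similarity of that compound to any picked one
    nearest = {
        name: tanimoto(fp, fps[seed]) for name, fp in fps.items() if name != seed
    }
    while nearest and len(picked) < n_pick:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = min(sorted(nearest), key=nearest.get)
        picked.append(best)
        del nearest[best]
        for name in nearest:
            nearest[name] = max(nearest[name], tanimoto(fps[name], fps[best]))
    return picked


# Hypothetical toy fingerprints (not real compounds)
toy_fingerprints = {
    "cpd_a": frozenset({1, 2, 3, 4}),
    "cpd_b": frozenset({1, 2, 3, 5}),
    "cpd_c": frozenset({10, 11, 12}),
    "cpd_d": frozenset({1, 10, 20, 21}),
}

if __name__ == "__main__":
    print(maxmin_pick(toy_fingerprints, n_pick=3, seed="cpd_a"))
```

In this toy set, cpd_b is a close analogue of the seed and is therefore skipped in favour of the two less similar compounds, which is the behaviour a diverse/random library selection aims for.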
2. Compound Categories
2.1. Natural products and derivatives
For obvious reasons, natural principles play a predominant role in the history of drug discovery.22 Diverse classes of natural products, including carbohydrates, steroids, fatty acids, polyketides, peptides, terpenoids, flavonoids, alkaloids, and many other products were isolated initially from herbs and later from various micro- and higher organisms for structure and activity characterization.23,24 All compounds produced by living organisms are generally defined as natural products. In contrast to primary metabolites, which are responsible for homeostasis and energy balance of living organisms, secondary metabolites are not required per se by their producers for these basic life functions, but they confer evolutionary advantages to
ensure their survival and dominance over other species. Natural products are a major source of innovative tool compounds for the elucidation of signaling pathways and new medicines for most indications, such as lipid disorders, cancer, infectious diseases, and immunomodulation; the latter three applications are in many cases explored in parallel and are targeting conserved anti-cellular proliferation pathways with distinct cell type specificities. Between 1981 and 2002, 5% of the around 870 new chemical entities approved by the FDA (US Food and Drug Administration) were natural products, and another 23% were molecules derived from natural products.25 Many natural products inspired the development of important drug classes, as for instance illustrated by the low-molecular weight biogenic amines. These molecules derived from enzymatic decarboxylation and subsequent oxidation of aromatic proteinogenic amino acids provided the guiding principles for the development of around 50% of all GPCR drugs. Immunosuppressive natural products, like Cyclosporin A, FK-506 (Tacrolimus), Rapamycin (Sirolimus) and its innovative derivative RAD001 (Everolimus), 15-Desoxy-spergualin, Mizoribine, or mycophenolic acid (see Fig. 1) revolutionized transplantation medicine.26 A key challenge for natural product drug discovery is the elucidation of the targets and the molecular mode of action of phenotypically and physiologically active molecules. As the description of biodiversity is by no means complete, the chemical and structural knowledge of natural products space is only emerging and many exciting discoveries from new sources can be expected. The majority of marine organisms probably have not yet been described, and most of those already described have not been fully examined chemically. For flowering plants, about 250 000 species have been described, of which perhaps 10% were analyzed for their chemical content. 
The number of insect species described is about 1 000 000, and many more have never been described. Only a very small percentage of soil bacteria have ever been cultured, which has motivated approaches that extract DNA directly from the soil and introduce it into host organisms.27 Natural products offer a wealth of new structures far beyond the classical repertoire of synthetic compounds. The most comprehensive current summaries of the chemical and biological information on around 200 000 isolated natural products are provided by a few literature databases, viz., the Chapman & Hall DNP (Dictionary of Natural Products), the AntiBase database, and the CNPD (China Natural Products Database); for details see Table 1.
[Figure 1: chemical structures of Rapamycin, FK506, RAD-001, Cyclosporin A, Mycophenolic Acid, Mizoribine and 15-Desoxy-spergualin.]
Figure 1. Natural products which were breakthrough discoveries for transplantation medicine.
A number of studies have investigated the structural characteristics of natural products as compared to synthetic organic compounds. Natural products often contain a greater proportion of oxygen than nitrogen heteroatoms. Typically the natural products have a higher number of stereocenters; a higher density of functionalization and pharmacophore sites; a higher number of rings; and more skeletal diversity. Natural products exemplify macro- and polycyclic scaffolds beyond the imagination of the classical synthetic medicinal chemist.28−30 Conversely, examples also exist of very simple natural product structures with biological activity. The structural repertoire can be extended by genomics approaches to natural products. For instance, genetic pathway engineering of the epothilone biosynthesis gene cluster of S. cellulosum allowed access to new epothilones: The
Table 1. Commercially and Publicly Available Compound Databases and Molecular Informatics Resources (see Section 5) for Chemogenomics Research. The list is not exhaustive, but rather constitutes a representative compilation of selected examples in this field.
Internet Resource URL/Source
Specification of Data and Information Available
http://www.chemnavigator.com/
The iResearch Library is ChemNavigator’s compilation of commercially accessible screening compounds. The database tracks over 21.7 million samples from around 150 vendors based on 14 million unique structures, including both physically available and virtual compounds.
http://www.mdli.com/products/experiment/screening_compounds/index.jsp
MDL Screening Compounds Directory (formerly ACD-SC) contains over 3.5 million structures (including 3D models), comprising nearly 6 million products from 46 chemical suppliers of compounds for HTS.
http://www.warr.com/links.html#chemlib
List of screening compound vendors updated by Wendy Warr & Associates.
http://www.chemnetbase.com/
The DNP (Chapman & Hall/CRC Dictionary of Natural Products) is a comprehensive literature database of around 170 000 isolated natural products from various sources and provides names, chemical structures, CAS registry numbers, extensive source data, uses and applications.
http://www.neotrident.com/neotrident_def4_2.htm
CNPD (China Natural Products Database) provides, for around 10 000 natural products isolated in China, 2D and 3D chemical structures and CAS registry numbers, integrated with related therapeutic uses in TCM (Traditional Chinese Medicine).
http://www.wiley.com/
AntiBase 2005 is a comprehensive database of 31 022 natural compounds from micro-organisms and higher fungi based on curated literature reports. In addition to descriptive chemical data, biological data (e.g. pharmacological activity, toxicity) and information on origin and isolation are included.
http://www.genome.ad.jp/kegg/
The KEGG (Kyoto Encyclopedia of Genes and Genomes) LIGAND database provides chemical structures for around 12 000 chemical compounds and drugs with biological information; around 2000 compounds are annotated to enzymatic pathways.
http://www.brenda.uni-koeln.de/
The BRENDA database provides a comprehensive information system for enzymes. Beyond chemical information on substrates, co-substrates and co-factors, cross-links to structural, biological and disease information are annotated.
http://www.ebi.ac.uk/chebi/
ChEBI (Chemical Entities of Biological Interest) is a freely available dictionary of chemical compounds, with IUPAC and NC-IUBMB endorsed terminology. Currently three data sources have been incorporated into ChEBI, namely, KEGG Ligand, IntEnz, and Chemical Ontology.
http://chembank.med.harvard.edu/
http://pubchem.ncbi.nlm.nih.gov/
ChemBank at Harvard University and PubChem at the NCBI are chemoinformatics databases for small molecules and their biological activities. Both systems are supported by the NCI’s initiative for chemical genetics.
http://www.sunsetmolecular.com/
The WOMBAT database contains 117 007 entries (104 230 unique SMILES), totaling over 230 000 biological activities on 1021 unique targets based on literature data.
http://www.mdl.com/products/knowledge/medicinal_chem/index.jsp
The MDL CMC (Comprehensive Medicinal Chemistry) database provides 3D models and important biochemical properties, including drug class, LogP, and pKa values for over 8400 pharmaceutical compounds. The MDDR (MDL Drug Data Report) is a database covering the patent literature, journals, meetings and congresses. The database contains over 132 000 biologically relevant compounds and well-defined derivatives.
http://scientific.thomson.com/products/wdi/
The WDI (World Drug Index) contains chemical, biomedical and synonym data for over 58 000 marketed and development drugs worldwide.
http://www.aureus-pharma.com/
http://www.eidogen-sertanty.com/
http://www.evolvus.com/
http://www.gvkbio.com/
http://www.inpharmatica.com/
http://www.jubilantbiosys.com/
A growing number of chemogenomics knowledge-based companies, like Aureus-Pharma, Eidogen-Sertanty, Evolvus, GVKBio, Inpharmatica and Jubilant Biosys, are developing molecular information systems which, for specific target families, comprehensively integrate data from patents and selected literature together with chemical and biological search engines.
http://www.geneontology.org/
The GO (Gene Ontology) project provides a controlled vocabulary to describe gene and gene product attributes in any organism. Annotated are the biological process, the molecular function, and the cellular component of gene products.
http://www.genego.com/
MetaBase is a curated database of human protein-protein and protein-DNA interactions, transcriptional factors, signaling, metabolism and bioactive molecules. MetaCore provides intuitive tools for data visualization, mapping and exchange, multiple networking algorithms and data mining.
gene was cloned and sequenced and after destruction of the EpoF gene, epothilone C and D producing strains were generated.31 Natural products were excluded from Lipinski’s Rule-of-5 observation, which was established based on the analysis that synthetic drugs display characteristic distributions and limits for molecular weight, lipophilicity, and polarity as essential properties that enable biological membrane permeability and water solubility.32,33 Despite the fact that the distribution profiles of natural products are indeed broader compared to synthetic compounds, their fraction with two or more Rule-of-5 violations is equal to that of synthetic drugs. One interpretation of this finding might be that evolutionary optimization has coded, in addition to these essential properties, other biocharacteristics which still need to be deciphered. A number of marketed natural products based drugs are not orally available, but uniquely address a number of therapeutic applications. Compared to synthetic drug candidates, natural products — although they were not generated by nature in the perspective of therapeutic medicines — have an a priori biological advantage in that they have
been optimized in close co-evolution with interacting cellular structures and protein binding sites that determine their mode of action. Given the formidable progress organic synthetic chemistry made during the last fifty years, especially regarding asymmetric synthesis and catalysis, almost any of the natural products can by now be made by organic chemists either through total synthesis, or using semi-synthesis approaches at an industrial scale.24 Despite this, one key dilemma for natural products drug discovery is that although the primary HTS hit rates in the micromolar affinity range are 5–10 times higher than the hit rates for synthetic compounds, the take-up rate of the compounds by chemists for follow-up lead optimization is significantly lower. This finding is most probably due to the higher structural complexity and challenges related to the chemical structure elucidation and synthesis. A promising trend to broaden the scope of natural products is the preparation of small combinatorial libraries from natural products and natural product-like scaffolds.34,35 A systematic extension of such libraries based on protein structure similarity clustering (PSSC) was proposed by the Waldmann group and is described in further detail in Chapter 3.36,37 The principles of this approach consider the domain organization and conservation of proteins and the corresponding need for conservation of the architectures and interaction modes of their ligands. In view of these practically unlimited opportunities, natural products deserve a dedicated place in the drug discovery process. This position was challenged in the last decade by the progress of synthetic compounds which at first glance have the apparent advantage of being HTS friendly, and rapid and cheap for hit-to-lead identification and chemical development. 
However, given the not fully met expectations of the high throughput technology driven paradigms for many discovery programs and drug targets, including especially protein-protein interactions, natural products are expected to continue to be a significant source of drugs. Virtual screening approaches focusing on natural products were recently recognized as an additional route to explore these precious and costly compounds.38
2.2. Primary metabolites, co-substrates, co-factors, and marketed drugs
Primary metabolites, co-factors, and marketed drugs form additional sets of biologically relevant and validated compounds that constitute an essential component of a comprehensive screening collection.
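When compound sets such as natural products, marketed drugs and synthetic compounds are assembled into one screening collection, the Rule-of-5 comparison made in Section 2.1 (the fraction of compounds with two or more violations) is straightforward to compute. The Python sketch below is a minimal illustration, assuming the commonly quoted Rule-of-5 limits (molecular weight 500, logP 5, 5 hydrogen-bond donors, 10 hydrogen-bond acceptors) and precomputed properties. The `Properties` class and the four property records are hypothetical and do not describe real compounds.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Properties:
    """Precomputed properties of one compound (values assumed to be given)."""
    mol_weight: float   # g/mol
    logp: float         # calculated octanol/water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def ro5_violations(p: Properties) -> int:
    """Count how many of the four Rule-of-5 limits a compound exceeds."""
    checks = (
        p.mol_weight > 500,
        p.logp > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    )
    return sum(checks)


def fraction_failing(props: Sequence[Properties], min_violations: int = 2) -> float:
    """Fraction of a compound set with at least `min_violations` violations."""
    if not props:
        return 0.0
    failing = sum(1 for p in props if ro5_violations(p) >= min_violations)
    return failing / len(props)


# Hypothetical property records, for illustration only
toy_set = [
    Properties(320.4, 2.1, 2, 5),     # no violations
    Properties(610.7, 5.8, 3, 9),     # weight and logP exceeded
    Properties(1150.0, 3.0, 5, 12),   # weight and acceptors exceeded
    Properties(480.0, 5.5, 1, 6),     # logP exceeded only
]

if __name__ == "__main__":
    print([ro5_violations(p) for p in toy_set])
    print(fraction_failing(toy_set))
```

Running the same function over a natural product set and a synthetic drug set would give the two fractions compared in Section 2.1; the property values themselves would have to come from a descriptor calculation that is outside this sketch.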
Primary metabolites, which are key intermediates of cellular metabolism and which interact with key enzymes and cellular regulatory receptor systems, are systematically included in deorphanization libraries of orphan targets. Databases like KEGG, BRENDA, or ChEBI (see Table 1) have organized the relevant chemical and biological information. Hits from such libraries allow the elucidation of the functional relevance of a new potential target protein. The recent discovery via HTS that key intermediates of the Krebs cycle like succinate and α-ketoglutarate activate extracellular GPCR systems with key regulatory roles in energy homeostasis provides a relevant example.39 More classically, the structural elucidation of enzyme substrate, or enzyme co-substrate/co-factor complexes provides an invaluable insight into the understanding of the essential molecular interactions of a given binding site, as successfully demonstrated by the design of many enzyme inhibiting drugs like sialidase inhibitors, protease inhibitors, and others.40 Combinatorial libraries around these principles provide molecular tool boxes for the systematic exploration of the roles of the individual members of a target family conserving a given interaction mode, such as the ATP cosubstrate binding site of protein kinases.41−43 A pioneering example here is the discovery of selective protein kinase inhibitors developed on the basis of trisubstituted purines to target the ATP-binding site of the human CDK2 (cyclin-dependent kinase 2).41 By iterating chemical library synthesis and biological screening, potent inhibitors of the human CDK2-cyclin A kinase complex and of the Saccharomyces cerevisiae Cdc28p kinase were identified. Given the large number of purine-dependent cellular processes, purine libraries may serve as a rich source of inhibitors for many different protein targets. 
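To make the idea of a combinatorial library around a trisubstituted scaffold concrete, the Python sketch below enumerates all combinations of building blocks at three substitution sites. It is only a bookkeeping illustration: the site labels (R2, R6, R9) and the building-block names are assumptions introduced here, and no chemistry is modelled.

```python
from itertools import product
from typing import Dict, List

# Hypothetical building blocks per substitution site (labels are assumptions)
substituents: Dict[str, List[str]] = {
    "R2": ["amine_1", "amine_2", "amine_3"],
    "R6": ["benzylamine_1", "benzylamine_2"],
    "R9": ["methyl", "isopropyl"],
}


def enumerate_library(sites: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return every combination of one building block per site."""
    positions = sorted(sites)  # fixed site order for reproducible output
    return [
        dict(zip(positions, combo))
        for combo in product(*(sites[pos] for pos in positions))
    ]


library = enumerate_library(substituents)

if __name__ == "__main__":
    print(len(library))   # 3 * 2 * 2 combinations
    print(library[0])
```

The library size is the product of the building-block counts per site, which is why a few dozen reagents per position quickly yield thousands of virtual members; each iteration of synthesis and screening then narrows the lists before the next enumeration.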
The detailed comparative structural analysis of co-substrate and co-factor binding sites shows that co-substrate and co-factor analogues open a very wide target window. For instance, the common structural framework for adenine and AMP (adenosine mono-phosphate) binding is conserved in 12 unrelated protein families, including different folds, which demonstrates that ligand recognition principles have a stronger conservation than protein fold conservation, thereby providing the basis for efficient systems-based inhibitor design strategies.44 The work of Sem et al. on the oxidoreductase gene family provides a first detailed analysis which divides the global family into structural subfamilies termed pharmacofamilies, which share pharmacophore features in their cofactor binding sites.45 The presence of the conserved NAD(P) (nicotinamide adenine dinucleotide-(phosphate)) cofactor binding site (approximately 15% of all known enzyme functions
utilize NAD(P) for catalytic function) coupled with the modular nature of this gene family, has led to the development of a highly parallel approach to inhibitor design. In this chemical proteomic strategy, focused chemical libraries are tailored to subfamilies of large gene families to produce high affinity inhibitors for multiple members of the subfamily (Fig. 2). Last but not least, marketed drugs and derivative libraries are an important and invaluable compound source and provide the basis for the SOSA (Selective Optimization of Side Activities) approach.46,47 The SOSA approach consists of testing old drugs on new pharmacological targets. The aim is to subject to pharmacological screening, a limited number of drug molecules that are structurally and therapeutically very diverse and that
Figure 2. Pharmacofamilies of the NADH cofactor (structure shown in panel A) binding to oxidoreductases. Panel B shows an overlay of a subset of NAD(P)(H) geometries obtained from 288 crystal structures of oxidoreductases. The two largest pharmacofamilies are shown, corresponding to the two-domain Rossmann fold enzymes in pharmacofamilies 1 (anti) and 2 (syn). Panel C shows the corresponding pharmacophores with all protein heteroatoms indicated that are within hydrogen bonding distance of atoms in the cofactor. (Figure adapted with permission from original work of Sem et al.45).
have known safety and bioavailability in humans, thereby potentially shortening the time and cost needed for hit optimization. The SOSA approach proceeds in two steps. First, a limited set of a few thousand carefully chosen, structurally diverse drug molecules is screened. Since bioavailability and toxicity studies have already been performed for those drugs and since they have proven their usefulness in human therapy, all hits are by definition drug-like. In the second stage, the hits are optimized by means of traditional, parallel, or combinatorial chemistry in order to increase the affinity for the new target and decrease the affinity for the other targets. The objective is to prepare analogues of the hit molecule in order to transform the observed “side activity” into the main effect and to strongly reduce or abolish the initial pharmacological activity. Successful examples of the application of the SOSA principle include, for instance, optimization projects started from the antidepressant Minaprine. Minaprine has low affinity for the muscarinic M1 GPCR (Ki = 17 µM); the optimization of this side effect yielded high affinity nanomolar M1 partial agonists. Minaprine is also a very weak acetylcholinesterase (AChE) inhibitor (Ki = 600 µM); separate optimization of this side activity resulted in nanomolar AChE inhibitors.46 The SOSA approach can be enhanced by virtual screening methods which use reference compound sets and molecular descriptors together with advanced chemoinformatics methods to compare and rank the similarity of considered candidate molecules.
2.3. Peptides and peptido-mimetics
Peptide-protein molecular interactions constitute the most ubiquitous mode for controlling and modulating cellular function, intercellular communication, and signal transduction pathways. 
Hormones, neurotransmitters, antigens, cytokines and growth factors represent key classes of such peptide ligands.48 Physiologic peptides, such as insulin, oxytocin and calcitonin are used directly as drugs. In many cases, antagonists of the native ligands are searched for and here the endogenous peptide is per se not suitable. Peptides are key components of chemogenomics discovery libraries and are especially useful for the characterization of orphan targets. A number of successful deorphanizations, especially in the GPCR field, are based on peptides, resulting in new drug discovery projects. New peptides for such libraries are discovered using HPLC fractionations of tissue extracts together with random or designed peptide libraries based
on the bioinformatics analysis of putatively secreted peptides and protein hormones defined in the genome.49 Peptide synthesis was revolutionized after the major breakthrough in peptide chemistry in 1963, when Merrifield published a historic paper describing the principles and the applications of solid-phase peptide synthesis (SPPS).50 The limiting factors of peptide-based drugs stem from the number of amide bonds, which leads to a high PSA (polar surface area), low membrane permeability, and potentially high proteolytic degradation, resulting in quite poor ADME (absorption, distribution, metabolism, and excretion) properties. Mainly for these reasons, robust strategies for the design of peptide mimetics have been successfully developed, and peptide-based drug discovery approaches offer a truly systematic route for chemogenomics with peptide secondary structure mimetics.51−53 The use of β-amino acid building blocks was introduced by Seebach to overcome limitations of α-peptidic structures such as low bioavailability and easy proteolysis, and to leverage the tendency of small β-peptides to form stable secondary structures.54 Robust peptide-derived approaches aim to identify a small drug-like molecule that mimics the peptide interactions. The primary peptide molecule is considered in these approaches as a tool compound to demonstrate that small molecules can compete with a given interaction. A variety of chemical, 3D structural and molecular modeling approaches are used to validate the essential 3D pharmacophore model, which in turn is the basis for the design of the mimics. The chemical approaches include, in addition to N- and C-terminal truncations, a variety of positional scanning methods.
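As an illustration of positional scanning, an alanine scan produces one variant per position, each with a single residue replaced by alanine. The following is a minimal sketch, assuming peptides written in one-letter code; the function name `positional_scan` is hypothetical, and the example sequence is that of angiotensin II.

```python
from typing import List, Tuple


def positional_scan(sequence: str, replacement: str = "A") -> List[Tuple[int, str]]:
    """Return (position, variant) pairs with one residue replaced at a time.

    Positions are 1-based. Residues already equal to `replacement`
    are skipped, since substituting them yields the parent peptide.
    """
    variants = []
    for i, residue in enumerate(sequence):
        if residue == replacement:
            continue
        variants.append((i + 1, sequence[:i] + replacement + sequence[i + 1:]))
    return variants


if __name__ == "__main__":
    # Angiotensin II in one-letter code (Asp-Arg-Val-Tyr-Ile-His-Pro-Phe).
    for position, variant in positional_scan("DRVYIHPF"):
        print(position, variant)
```

Variants whose activity drops sharply relative to the parent peptide point to residues that contribute to the pharmacophore. The same loop covers a proline scan by passing `replacement="P"`.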
Using alanine scans one can identify the key pharmacophore points; D-amino acid or proline scans allow stabilization of β-turn structures; cyclic scans bias the peptide or portions of the peptide toward a particular conformation (α-helix, β-turn and so on); other scans, like N-methyl-amino-acid scans and amide-bond-replacement (depsi-peptide) scans, aim to improve the ADME properties.48 Peptide and protein mimetics libraries including β-turn/α-helix mimetics are recognized to be of central importance in chemogenomics.52,53 A number of important hormones, like angiotensin, bradykinin, CCK (cholecystokinin), MSF (melanocyte stimulating factor) and SST (somatostatin), make their key recognition via specific β-turn motifs. Others, like CRF (corticotrophin releasing factor), PTH/PTHrP (parathyroid hormone/parathyroid hormone related protein), NPY (neuropeptide Y), VIP (vasoactive intestinal peptide), or GHRF (growth hormone releasing factor)
interact via α-helix motifs.52 Whereas the design of organic drug-like α-helix mimetics is still in its infancy, the design of active β-turn mimetics based on organic drug-like scaffolds, or on cyclic α-peptides or β/γ-peptides, has advanced to a quite routine methodology.55 The work of Garland and Dean,56,57 defining a set of triangular distance constraints that the substitution points of a scaffold have to satisfy in order to mimic the specific Cα atoms of the peptide template, provided a generalized framework for the design of novel β-turn mimetic scaffolds and was, in combination with database searches, successfully applied to the design of CCK and SST antagonists.58 Targeted combinatorial libraries around such scaffolds are an essential component of a chemogenomics discovery library. Recent examples of successful peptide-ligand based discoveries of drug-like peptidomimetics include the discovery of SST antagonists,59 and the discovery of non-peptidic antagonists of the recently deorphanized urotensin II receptor at Sanofi-Aventis.60 As illustrated in Fig. 3, Flohr et al. used 3D models of the NMR solution structure of cyclic peptide derivatives of Urotensin II as a template for virtual 3D pharmacophore searches, which resulted in non-peptidic candidates for lead optimization.
Figure 3. Discovery of Urotensin II GPCR antagonists by peptide mimetic approaches. 3D models of the NMR solution structure of cyclic peptide derivatives were used as templates for virtual 3D pharmacophore searches and resulted in non-peptidic hits.60 (Figure reproduced with permission from the review of Klabunde and Hessler.61)
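The triangular distance constraints mentioned above for β-turn mimetic scaffolds can be illustrated with a small geometric check. This is only a sketch under stated assumptions, not the published Garland and Dean procedure: three scaffold substitution points are compared with three template Cα positions through their pairwise distances, and the tolerance value, the function names, and all coordinates are hypothetical.

```python
from itertools import permutations
from math import dist
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def triangle_sides(points: Sequence[Point]) -> Tuple[float, float, float]:
    """Side lengths (d01, d12, d02) of the triangle spanned by three points."""
    return (dist(points[0], points[1]),
            dist(points[1], points[2]),
            dist(points[0], points[2]))


def match_template(scaffold: Sequence[Point],
                   template: Sequence[Point],
                   tolerance: float = 0.5) -> Optional[Tuple[int, ...]]:
    """Return an ordering of the scaffold points whose triangle matches
    the template triangle within `tolerance` (same length unit), else None."""
    target = triangle_sides(template)
    for order in permutations(range(3)):
        sides = triangle_sides([scaffold[i] for i in order])
        if all(abs(s - t) <= tolerance for s, t in zip(sides, target)):
            return order
    return None


if __name__ == "__main__":
    # Hypothetical coordinates in angstroms, for illustration only.
    template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
    scaffold = [(3.8, 3.8, 0.0), (0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
    print(match_template(scaffold, template))
```

Trying all six orderings makes the check independent of how the scaffold substitution points happen to be numbered. A database search would apply such a test to every candidate scaffold.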
2.4. Diversity oriented synthesis molecules
DOS (Diversity Oriented Synthesis), as opposed to the traditional TOS (Target Oriented Synthesis) chemistry approach, was introduced by Schreiber for forward chemical genetic screening in order to mimic the structural complexity and the skeletal and stereochemical diversity of natural products.62,63 In contrast to a convergent synthesis strategy resulting from the logic of retrosynthetic analysis of the target molecules, DOS, in the ideal state, allows the application of a diverse set of reagents and structural transformations to each synthesis intermediate; this results in diverging synthesis pathways that create a broad diversity of target molecules with different scaffolds. The situation in a sense mimics the early history of the structural chemistry of natural products where, for instance, the exploration of terpenoids under different conditions, including pH, resulted in chemical rearrangements and the generation of new structures from the same starting materials. DOS compounds clearly share a number of characteristics with natural products, most notably scaffold diversity and stereochemical complexity. The question remains, however, whether these products of pure chemists' imagination capture the evolutionary advantages of natural products and natural product-based compounds. The DOS planning strategy allows, by enumeration over a larger number of steps, the genesis of truly novel structures, which itself is an innovative concept (see Panel A of Fig. 4). In practice, DOS combinatorial libraries focus on leveraging information about existing biologically active molecules in order to address the biologically relevant regions of chemical space.
Three types of DOS libraries are currently distinguished: 1) Libraries based on the core scaffold of an individual natural product (see above); 2) Libraries based on specific structural motifs that are found across a class of natural products; and 3) Libraries that emulate the characteristics of natural products in a more general sense, which are most directly related to the theoretical definition of DOS.64,65 DOS libraries are not directed towards a single biological target and aim to provide diverse discovery libraries. DOS has increased the need for exceptionally efficient, stereoselective and chemoselective reactions, including multicomponent reactions, that can be applied to a broad range of substrates. A number of recent success stories prove that DOS compounds provide invaluable tools for target validation (see Panel B of Fig. 4). The validation of the ADMET (absorption, distribution, metabolism, excretion, and toxicology) and in vivo properties of these compounds and their value as
[Figure 4. Panel A: reaction scheme with steps 1 to 3. Panel B: structures of Uretupamine A, Tubacin, and Histacin.]
Figure 4. DOS. Panel A: Genesis of novel chemotypes following the DOS planning strategy using multi-component and complexity generating reactions. The reactions are: 1) Ugi-4CC-IMDA complexity generating reaction; 2) Allylation, hydrolysis, and acylation; 3) Ruthenium catalyzed metathesis complexity generating reaction in the context of a skeletal diversity generating folding process. Depending on the stereochemistry of the σ-substitution point, different products are generated in the metathesis reaction. Panel B: Recent success stories of DOS compounds. Uretupamines,66 Tubacin,67 and Histacin68 were discovered by HTS of DOS libraries sharing characteristics of natural products. Uretupamines are function-selective suppressors of the yeast signaling protein Ure2p. Tubacin is a selective inhibitor of tubulin deacetylation. Histacins are selective HDAC (histone deacetylase) inhibitors.
therapeutics remain, however, to be proven. Comparable to natural products, as a result of the structural complexity, a key challenge is expected in the lead optimization phase and for the industrial chemical development of the final compounds.
3. Designing Comprehensive Chemogenomics Screening Compound Collections
The industrial and emerging academic screening centers have put significant investments into screening compound collection enrichment
and enhancement projects.69−71 The main purpose of large screening collections, typically ranging from 100 000 to 2 000 000 compounds, is to supply the discovery pipeline with hit-to-lead compounds for the current and future portfolio of drug discovery programs and to provide tool compounds for the chemogenomic investigation of novel biological pathways and circuits.72−74 A screening collection is a dynamic entity which aims to continuously integrate novel chemical structures. As such, it integrates design-focused and diversity-based compound sets from the synthetic and natural paradigms, generated via corporate medicinal chemistry, combinatorial compound synthesis, and external compound acquisition projects. The different compound categories mentioned earlier are included. The assessment of the likelihood of a molecule to bind to a molecular target is important. Both protein structure-based approaches like HTD (high-throughput docking), and ligand-based similarity and diversity approaches are applied in a rapid manner to the physically existing and virtually designed compound sets typically available for selection campaigns.5,10,75,76 Structural diversity is of particular importance to the general screening collection enhancement projects. Not only does exact duplication need to be avoided, but a general diversity in terms of chemical classes, lead-likeness, and drug-likeness needs to be achieved. Screening collection design processes are typically informatics, chemistry, and biology driven. The process currently applied within our group is outlined in Fig. 5 and consists of two steps.77 In the first step, the candidate compounds are filtered and grouped into three priority classes on the basis of their individual structural and computed physicochemical properties.
Substructure and computed physicochemical filters, similar to those published by others, are used both to eliminate and to penalize compound classes.78−80 The similarity of the remaining structures to selected reference ligands of proven druggable target families of interest is then computed, and the compounds similar to drugs and known actives are prioritized for the following diversity analysis. In the second step, the compounds are compared with the archive compounds and a diversity analysis is performed. This is done separately for the compounds prioritized as similar to known drugs and actives, the drug-like regular compounds, and the penalized compounds, with increasingly stringent dissimilarity criteria. The automated analysis is followed by manual review of the compounds to assess more complex structural properties like the chemical derivatization potential. One major role of chemoinformatics is thus recognized in the need to reduce the number of potential candidates to a humanly
tractable number.79
Figure 5. Schematic illustration of the NIBR compound selection process. The candidate compounds are first filtered using substructure and computed physicochemical filters. The remaining compounds are divided iteratively into three orthogonal categories. Penalized compounds are defined based on additional substructure filters identifying particularly abundant chemotypes. Using both Unity 2D fingerprints and Similog keys, the remaining compounds are compared with selected known reference drugs and actives of the main target families of interest to NIBR, and are separated into the category of compounds similar to known drugs and the category of drug-like diverse compounds. The candidate compound sets are then compared in an incremental manner to the existing collections and the previously made selections. This diversity selection process starts with the candidate set of compounds similar to known drugs and ends with the penalized compound set. The incremental diversity selections are done with decreasing similarity thresholds and compound densities. For further detail see Ref. 77.
The approach is thus comparable to the ones described by Oprea and others, emphasizing serendipity and the inclusion of compounds beyond strict drug-like/lead-like filters.72−74 Regarding the assessment of chemical diversity, recent advances in clustering techniques are noteworthy.81,82 They make possible the co-clustering of very large commercial compound collections and reference sets with the entire corporate collection and allow the application of constraints on the number of compounds to be selected per cluster. The ideal library size is currently a subject of scientific debate.83,84 Whereas theoretical rationales are emerging, pragmatic considerations prevail and focus on the diversity of chemotypes rather than on larger and larger numbers of individual compounds per scaffold;85 the latter should, however, be such as to enable the detection of SAR from the screening data.6 As an increasing number of
commercially available screening compounds are prepared by combinatorial or parallel synthesis, evaluation and selection based on the scaffolds is a reasonable alternative. This is especially valid if the compounds have not yet been prepared and one is given the opportunity to prioritize the synthesis proposals. Scaffold novelty within the corporate collection and compared to the patented chemistry space can be ensured by substructure searching. As the number of attractive scaffolds is limited, the selection of the most promising ones can be done manually, although computational methods for the evaluation of scaffold diversity are emerging.86−88 Reference repertoires of privileged structures are a pragmatic guide in this process.89−92 The critical in silico and in vitro evaluation of scaffold-based prototypes is also recommended to create the best added value.6 The implementation of efficient and up-to-date 2D and 3D structure databases is one major challenge in molecular diversity management. In addition to the compound design and selection criteria used for the compilation, the quality of the compound storage, manipulation, and logistics systems used for the management of the collection is a key factor in yielding reproducible results.93
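The incremental diversity selection outlined in this section, in which candidate sets are compared with the archive and with earlier selections under increasingly stringent dissimilarity criteria, can be sketched as follows. This is a minimal sketch and not the NIBR implementation: fingerprints are represented as plain sets of feature indices standing in for Unity 2D fingerprints or Similog keys, the Tanimoto coefficient is used as the similarity measure, and the thresholds and function names are assumptions.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Fingerprint = FrozenSet[int]
Candidate = Tuple[str, Fingerprint]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient of two feature sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def incremental_selection(
    archive: Iterable[Fingerprint],
    tiers: Sequence[Tuple[Sequence[Candidate], float]],
) -> List[str]:
    """Select candidates tier by tier.

    Each tier is (candidates, max_similarity). A candidate is accepted only
    if it is less similar than max_similarity to every archive compound and
    to every compound accepted so far. Tiers are processed in order, so
    passing lower thresholds for later tiers makes them more stringent.
    """
    pool = list(archive)
    selected = []
    for candidates, max_similarity in tiers:
        for name, fingerprint in candidates:
            if all(tanimoto(fingerprint, ref) < max_similarity for ref in pool):
                selected.append(name)
                pool.append(fingerprint)
    return selected


if __name__ == "__main__":
    archive = [frozenset({1, 2, 3, 4})]
    similar_to_drugs = [("a", frozenset({1, 2, 3, 5})), ("b", frozenset({10, 11, 12}))]
    penalized = [("c", frozenset({10, 11, 13})), ("d", frozenset({20, 21}))]
    print(incremental_selection(archive, [(similar_to_drugs, 0.85), (penalized, 0.4)]))
```

Because accepted compounds join the comparison pool immediately, the selection order matters, which is why the prioritized set is processed first in this sketch.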
4. Essential Properties and Selection Processes along the Discovery Pipeline
The pre-clinical drug discovery process is typically a sequential selection and optimization process focusing, as summarized in Table 2, on different essential properties at each step.94−99 Tool compounds required for early in vitro or in vivo target validation typically do not need to satisfy the same stringent in vivo efficacy and safety criteria as clinical candidates and drugs, which in turn depend on the targeted therapeutic indication.100 In order to reduce later attrition,101,102 the transitions between the stages include continuous iterations, returning to the previous stage when the criteria of the following stage are not reached for a given candidate compound and when appropriate redesign of the compounds is required. No further redesign of the compound is possible after transition into clinical phases. Since, in the target-based drug discovery paradigm, the objective is to find compounds selective for one particular target or for a spectrum of targets in a specific disease-relevant signaling pathway, ligand potency and efficacy are clearly among the most important properties.
Table 2. Essential Properties of Small Molecules at Different Stages of Pre-clinical Drug Discovery from Screening Compound to Investigational Drug Candidate. Adapted from Refs. 94–97. Information regarding the principles of the clinical selection and approval process can be found at the FDA Center for Drug Evaluation and Research (http://www.fda.gov/cder/). CYP450 (cytochrome P450), DMPK (drug metabolism and pharmacokinetics), DMSO (dimethylsulfoxide), GMP (good manufacturing practice), IP (intellectual property), PAMPA (parallel artificial membrane permeability assay).
Chemical properties
Screening Compound:
- Compounds from synthetic and natural paradigms including targeted and diversity-based design principles
- Chemically pure or defined mixtures
- Absence of undesirable functionalities impairing stability and chemical cross-reactivity
- Fast back supply possible
Hit-to-lead Compound:
- Compounds with confirmed chemical structure and purity
- Essential SAR established by substructure and similarity searching
- Potential for compound IP generation
- Amenable for parallel optimization
- Assessment of aggregation and chemical cross-reactivity
Lead Compound:
- Clear SAR
- IP protected
Drug Candidate:
- Chemical synthesis or natural products isolation process tractable for large scale industrial manufacturing according to GMP
- Chemically stable
Receptor pharmacology
Hit-to-lead Compound:
- Dose dependent activity in assays relevant for optimizations
- Adequate potency in biochemical and cell-based assay
- Adequate selectivity on key anti targets
- Assessment of binding kinetics on target
Lead Compound:
- Nanomolar potency on isolated target
- Submicromolar activity in functional assays
- Demonstrated activity on paralogue targets in species for animal testing
- Desired selectivity profile on key anti targets and safety pharmacology targets
Drug Candidate:
- Knowledge of possible cross targets and possible adverse reactions based on receptor pharmacology
ADMET/DMPK
Screening Compound:
- Good water and DMSO solubility
- Adequate permeability to reach site of action
Hit-to-lead Compound:
- Physicochemical characterization: LogP, LogD, pKa, solubility, aggregation
- Assessment of membrane permeability using: CACO-2, PAMPA
- Assessment of metabolic characteristics: CYP inhibition in major isoforms to assess drug-drug interaction liabilities and intrinsic clearance in rat and human liver microsomes
Lead Compound:
- Understanding of key membrane transport mechanisms
- Desired metabolic characteristics
- Appropriate clearance, volume of distribution and half life in rat
- Evaluation of genotox: AMES bacterial mutagenicity
- Evaluation of HERG interference
Drug Candidate:
- Identification of appropriate galenic form for testing in animals
- Metabolite profiling for each compound and assessment for reactive metabolite formation
- Mammalian cell mutagenicity data
- Understanding of in vivo ADMET properties, including tissue distribution and elimination properties
- Dose escalation experiments and maximum tolerable dose in appropriate species
- Decision for safe testing in human without impairing vital functions
Because primary assays, especially cellular assays, can result in hit rates as high as 1%, a primary HTS
hit list triaging process is typically applied in order to select compounds for dose-response validation. The objective is to select and enrich, from a given primary HTS data set, those compounds which have the best potential to become hit-to-lead compounds, and to explore as fully as possible the chemical diversity represented in a primary hit list and other sources of information relevant for the specific biology project. As compound selection and filtering is a subject of intense scientific debate,103,104 the computational analysis process applied in our group uses, in a first step, data pipelining tools to annotate each compound with the different possible decision criteria.105 Because of the legacy of the screening collection, the hit list may contain compounds violating the standard substructure filters used in the design of the newer screening sets, so these filters need to be applied. In addition, project-specific substructure, scaffold, and physicochemical filters are applied to the primary hit list in order to maximize the chemical attractiveness of the resulting hit list. Based on the chemist-dependent information on chemical attractiveness, Naïve Bayesian classifiers or other machine learning techniques can be applied to translate this information into predictive computational models. In a similar manner, empirical information about the promiscuity or cell toxicity of the hits can be integrated using reference lists compiled over comparable assays of the same format or same target family. Input from maximum common substructure clustering methods is used to quickly track chemical scaffolds that are over-represented in a hit list and to reduce large hit lists by cherry picking a representative set from each cluster, preserving the most active compounds.106 The summary of the different annotation criteria can then be used to qualify the chemical and biological hit attractiveness using simple additive point-based scoring schemes.
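The per-cluster cherry picking and the additive point-based scoring described above can be sketched as follows. This is an illustrative sketch only: the annotation fields, the point weights, and the function names are assumptions, and cluster labels are taken as given from a prior clustering step.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    compound_id: str
    cluster: int          # label assigned by a prior clustering step
    activity: float       # e.g. percent inhibition in the primary screen
    passes_filters: bool  # substructure and physicochemical filters
    promiscuous: bool     # flagged by reference lists from comparable assays
    attractive: bool      # chemist-derived or model-predicted attractiveness


def score(hit: Hit) -> int:
    """Additive point score; the weights are hypothetical."""
    points = 0
    if hit.passes_filters:
        points += 2
    if hit.attractive:
        points += 1
    if hit.promiscuous:
        points -= 2
    return points


def cherry_pick(hits: List[Hit], per_cluster: int = 2) -> List[Hit]:
    """Keep the most active members of each cluster."""
    by_cluster: Dict[int, List[Hit]] = {}
    for hit in hits:
        by_cluster.setdefault(hit.cluster, []).append(hit)
    picked: List[Hit] = []
    for members in by_cluster.values():
        ranked = sorted(members, key=lambda h: h.activity, reverse=True)
        picked.extend(ranked[:per_cluster])
    return picked


def rank(hits: List[Hit]) -> List[Hit]:
    """Order hits by score, breaking ties by activity."""
    return sorted(hits, key=lambda h: (score(h), h.activity), reverse=True)
```

In use, `rank(cherry_pick(hits))` would produce the reduced, ordered list that is then brought to the project team for discussion.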
The annotated and scored primary hit lists are then discussed within the project team for a final decision. Water solubility and compound self-aggregation turn out to be key properties at the hit-finding stage. Poorly soluble and self-aggregating compounds are a major cause of drop-out at further validation stages and can result in flat SAR of the derived compound series.107,108 Compounds with the potential for cross-reaction with nucleophilic amino acid side chains of proteins are also not desired. Both in silico prediction tools and high-throughput experimental methods for testing these properties are currently used to either exclude or penalize compounds in the compound selection processes.109−111 A number of criteria are essential for the decision to take a hit compound singleton or compound series forward to the hit-to-lead stage. The criteria, which need to be handled flexibly, are chemical and biological in nature. The chemical criteria include: the compound structure and purity are confirmed; true singletons are supported by small sets of analogues; undesired functionalities are absent; and the series is attractive for parallel optimization. The latter two criteria are rather subjective and depend on the expertise and experience of the medicinal chemist to be able to cope with latent hits.112 Ideally, at the hit-to-lead phase, structural biology is in place to include guidance by molecular modeling. The biological criteria include dose-dependent response validation in a secondary assay relevant to compound optimization, together with the characterization of the binding kinetics,113 and some further potency criteria. The potency criteria, however, need to be balanced against the chemical tractability of a series, since interesting low-potency hits can be transformed into highly potent compounds.114,115 The early hit-to-lead criteria also include early in vitro ADME characterization of the compounds. Key physicochemical properties guiding the optimization are determined, including solubility, LogP, LogD, pKa and passive membrane permeability. Plasma protein binding, CYP450 inhibition in the major isoforms, and intrinsic clearance in hepatocytes are included at this stage to address limitations in the pharmacokinetic properties and liabilities for drug-drug interactions.
In the later stages of the discovery chain, the in vivo pharmacodynamic and pharmacokinetic aspects become more and more important and form the essence of the art of medicinal chemistry, which is characterized by detail and dynamic complexity.96,116,117 Given the limitations in extending predictions to later stages (including cross-species translations), it is clear that a successful project will investigate at each stage a couple of different chemical series, including a couple of representative multi-objective optimized compounds.96,117 Noteworthy, in the context of designing screening collections, are the concepts of drug- and lead-likeness.118 Drug-likeness is a general description of the potential of a small molecule to become a drug. As we have attempted to summarize in Table 2, many chemical and biological characteristics have to be met to make a compound a drug. In a provocative statement, Lipinski estimated that currently only about 10 000 drug-like compounds exist which are sparsely, rather than uniformly, distributed through chemistry space.32,33 True (immediately useful) diversity does not, in this sense, exist in experimental chemistry screening libraries! Because of the evolutionary pressures on ADMET to deal with endobiotics and exobiotics, the ADMET property space is of low dimensionality, whereas biological receptor activity is higher dimensional in chemistry space.
Compounds highly potent against a drug target may not be efficacious because of pharmacokinetic problems; they may be toxic, or may unfavorably interact with other drugs. Various studies have analyzed the statistical distribution of molecular properties of drugs, and especially of orally available drugs, in order to derive predictive models of drug-likeness. Experimental and, as shown in Fig. 6, computed essential properties show distinct distribution profiles for the different compound sets of a screening collection. However, as ADMET is hard to predict for large data sets,115 because it typically involves multiple mechanisms and the predictions get worse as more data accumulate, these methods are mostly descriptive and focus mainly on the prediction of absorption or bioavailability. The level of permeability or solubility needed for oral absorption is related to the required potency.
Figure 6. Distributions of essential computed molecular properties defining drug-likeness for selected compound sets. Shown are the fraction of compounds vs. the properties. Orange: NIBR historical medicinal chemistry collection. Brown: Compilation of combinatorial chemistry libraries. Dark Green: Drugs (launched or Phase III listed in MDDR or CMC). Brown: Compilation from combinatorial libraries. Pink: Natural products of DNP. Light Green: HTS hits of NIBR 2004 screens. All properties were calculated with Pipeline Pilot software (www.scitegic.com).
Based on the analysis of 2245 compounds from the WDI which were investigated in Phase II and later clinical trials, Lipinski’s Rule-of-5 predicts
that poor absorption or permeation is more likely when the number of H-bond donors is > 5; the number of H-bond acceptors is > 10; the MW (molecular weight) is > 500; and the CLogP (calculated Log P) is > 5 (Refs. 32, 33). Veber et al. have used SmithKlineBeecham data on the oral rat bioavailability of 1100 drug candidates to analyze the importance of molecular properties related to drug-likeness.119 They found that the commonly applied MW cutoff at 500 does not itself significantly separate compounds with poor oral bioavailability from those with acceptable values in their data set. Their observations suggest that compounds which have ≤ 10 rotatable bonds and a PSA ≤ 140 Å² (or 12 or fewer H-bond donors and acceptors) will have a high probability of good oral bioavailability in the rat. These findings were, however, challenged by an analysis of 434 Pharmacia compounds, from which it was concluded that generalizations on complicated endpoints are difficult and dangerous for prospective selections.120 Vieth et al. analyzed the characteristic physical properties and structural fragments of oral drugs vs. drugs with other routes of administration and found that oral drugs tend to be lighter and have fewer H-bond donors, acceptors, and rotatable bonds.121 Martin, in responding to a demonstrated need to forecast in silico the permeability and bioavailability properties of compounds, has developed a score that assigns the probability that a compound will have a bioavailability > 10% in the rat.122 Neither the Rule-of-5, LogP, or LogD, nor the combination of the number of rotatable bonds and PSA, successfully categorized the compounds. Instead, different properties govern the bioavailability of compounds, depending on their predominant charge at biological pH. The fraction of anionic compounds with a bioavailability > 10% falls from 85% if the PSA is ≤ 75 Å², to 56% if 75 < PSA < 150 Å², to 11% if the PSA is ≥ 150 Å².
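The cutoffs quoted above translate directly into simple predicates. The sketch below, with hypothetical class and function names, counts Rule-of-5 violations, applies the Veber criteria, and looks up the fraction of anionic compounds with bioavailability > 10% in the three PSA bins reported above; it encodes only the thresholds stated in the text and is not a validated predictive model.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    mw: float             # molecular weight
    clogp: float          # calculated Log P
    hbd: int              # number of H-bond donors
    hba: int              # number of H-bond acceptors
    rotatable_bonds: int
    psa: float            # polar surface area in square angstroms


def rule_of_five_violations(p: Properties) -> int:
    """Count how many of the four Rule-of-5 cutoffs are exceeded."""
    return sum([p.hbd > 5, p.hba > 10, p.mw > 500, p.clogp > 5])


def passes_veber(p: Properties) -> bool:
    """Rotatable bonds <= 10 and (PSA <= 140 or HBD + HBA <= 12)."""
    return p.rotatable_bonds <= 10 and (p.psa <= 140 or p.hbd + p.hba <= 12)


def anionic_fraction_bioavailable(psa: float) -> float:
    """Fraction of anionic compounds with bioavailability > 10%,
    using the three PSA bins quoted in the text."""
    if psa <= 75:
        return 0.85
    if psa < 150:
        return 0.56
    return 0.11
```

Returning a violation count rather than a pass/fail flag leaves the decision of how many violations to tolerate to the user, in line with the caution about hard cutoffs expressed in the studies cited above.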
Conversely, whereas 55% of the neutral, zwitterionic, or cationic compounds that pass the Rule-of-5 have a bioavailability > 10%, only 17% of those that fail have a bioavailability > 10%. This same categorization distinguishes compounds that are poorly permeable from those that are permeable in Caco-2 cells. These sometimes controversial reports, which are based on different datasets, are evidence that accurate predictions of drug-likeness are quite difficult;123 nevertheless, these data, in combination with different statistical modeling and machine learning techniques, can be used with caution for the evaluation of vendor databases or virtual compound libraries, or for categorizing compound sets and conditions for experimental testing.124−126 With regard to filtering out potentially toxic compounds, structure-based methods are often employed that primarily draw from mutagenicity,
carcinogenicity, and acute toxicity databases.127 Expert systems such as DEREK, TOPKAT, MCASE, or corporate internal developments are used to evaluate virtual compounds for multiple toxicity endpoints.127−129 Again, it is a strategic decision at what stage of the discovery process these methods are used as hard filters, or whether they should more simply be used as awareness indicators directing further experimental investigation and optimization. As lead optimization is quite an artistic activity, it was pointed out that screening library design should focus on lead-likeness rather than drug-likeness. Teague et al. pointed out that there are differences between drugs and leads.4,130 Leads may be classified into three categories: 1) Low-affinity compounds with low MW and CLogP (e.g. endogenous compounds such as histamine); 2) High-affinity and high-MW compounds (e.g. peptides and natural products) that need improved pharmacokinetic profiles; and 3) Low-affinity compounds with drug-like MW (300–500) and CLogP (3–5). Most of the HTS hits belong to the third category, and optimization often adds hydrophobic groups to increase the potency of the compounds. The conclusion was then drawn that chemical screening libraries should focus on compounds with low MW and low lipophilicity. Low-complexity compounds seem to have, in addition to a higher attractiveness to the chemist, a higher probability of successful detection in screening.131,132 Follow-up studies by Oprea, which analyzed lead-drug pairs more systematically,133,134 recommended that lead-like libraries should have as characteristics: MW < 460; −4 < LogP < 4.2; water solubility LogS > −5; number of rotatable bonds < 10; number of rings < 4; number of H-bond donors < 5; and number of H-bond acceptors < 9.
These differences compared to drugs are thus subtle, and as concluded by Proudfoot,135 successful and timely drug discovery campaigns require high quality lead structures and these lead structures may need to be much more drug-like than is commonly accepted.
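The lead-likeness thresholds quoted above can be written as a simple property filter. The following Python sketch is only an illustration under stated assumptions: the `Descriptors` container and the function names are hypothetical, the property values are expected to be computed beforehand by whatever descriptor software is available, and the example values in the usage note are invented. Returning the list of failed criteria, rather than a bare yes/no, lets the same check act either as a hard filter or as an awareness indicator.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties of one compound (hypothetical container)."""
    mw: float             # molecular weight
    logp: float           # calculated LogP
    logs: float           # water solubility, LogS
    rotatable_bonds: int
    rings: int
    hbd: int              # number of H-bond donors
    hba: int              # number of H-bond acceptors


def leadlike_violations(d: Descriptors) -> List[str]:
    """Return the lead-likeness criteria (as quoted in the text) that fail."""
    checks = {
        "MW < 460": d.mw < 460,
        "-4 < LogP < 4.2": -4 < d.logp < 4.2,
        "LogS > -5": d.logs > -5,
        "rotatable bonds < 10": d.rotatable_bonds < 10,
        "rings < 4": d.rings < 4,
        "H-bond donors < 5": d.hbd < 5,
        "H-bond acceptors < 9": d.hba < 9,
    }
    # dict preserves insertion order, so violations come back in a fixed order
    return [name for name, passed in checks.items() if not passed]


def is_leadlike(d: Descriptors) -> bool:
    """Hard-filter view: True only if no criterion is violated."""
    return not leadlike_violations(d)
```

For an invented compound with `mw=520.0, logp=5.0, logs=-6.5` and otherwise small counts, `leadlike_violations` would report the MW, LogP and LogS criteria, which a project team could either treat as grounds for exclusion or simply record as a flag.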
5. Molecular Information Systems and Annotated Compound Libraries
The fast-growing amount of molecular information related to small molecule interactions with biological systems, generated in both academic and industrial screening centers, requires the design of molecular information systems which integrate bio- and chemoinformatics systems.136−139 During the past decade, annotation and classification efforts
in bioinformatics have focused mainly on genomic sequences and protein structures. Comprehensive gene ontologies like GO (Gene Ontology; see Table 1), which annotate the biological process, the molecular function and the cellular component of gene products, are the ultimate goal of this research. Compared to this, only limited effort has been put into annotation schemes for ligands. Ligand molecular information systems have evolved mainly from the need to track literature, patent and clinical status information. Databases like MDDR, WDI and CMC (see Table 1) are such systems, providing structural information on ligands together with molecular target or therapeutic class information. The systematic and comprehensive extension of these annotations with molecular target, signaling pathway, and disease and therapeutic information, in combination with synonym and classification schemes, results in true cross-linking of the chemical and biological knowledge spaces. As the mining and interpretation of the available information is an intellectually demanding activity, a growing number of chemogenomics knowledge-based companies, like Aureus-Pharma, Eidogen-Sertanty, Evolvus, GVKBio, Inpharmatica or Jubilant Biosys (see Table 1), have specialized in developing molecular information systems which integrate data from patents and selected literature, including 2D structures of the ligands, target sequence and classification, mechanism of action, structure-activity data, assay and bibliographic information, together with chemical and biological search engines.
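To make the kind of record described above more concrete, the following Python sketch shows one possible, deliberately minimal data model for an annotated ligand entry. It is not the schema of any of the commercial systems named in the text; the class names, field names and the index function are assumptions chosen for illustration, with fields taken from the items listed above (2D structure, target classification, mechanism of action, structure-activity data, assay and bibliographic information).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ActivityRecord:
    """One structure-activity data point (hypothetical layout)."""
    target_name: str     # name of the molecular target
    target_class: str    # position in a target classification scheme
    assay: str           # short assay description
    activity_type: str   # e.g. "IC50"
    value_nm: float      # activity value in nM
    reference: str       # patent or literature source


@dataclass
class LigandEntry:
    """An annotated ligand: structure plus biological annotations."""
    ligand_id: str
    smiles: str                                   # 2D structure as a SMILES string
    synonyms: List[str] = field(default_factory=list)
    mechanism_of_action: str = ""
    activities: List[ActivityRecord] = field(default_factory=list)


def index_by_target_class(entries: List[LigandEntry]) -> Dict[str, List[str]]:
    """Cross-link chemistry and biology: target class -> ligand ids."""
    index: Dict[str, List[str]] = {}
    for entry in entries:
        for activity in entry.activities:
            ids = index.setdefault(activity.target_class, [])
            if entry.ligand_id not in ids:   # list each ligand once per class
                ids.append(entry.ligand_id)
    return index
```

An index of this kind is the simplest form of a "biological search" over a ligand collection; a production system would add synonym resolution, pathway and disease annotations, and a proper chemical search engine.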
Noteworthy is the development of automated systems for information extraction from textual sources and images.140 As was shown recently by the Stockwell group, the automatic build-up of compound annotations based on Medline literature reports forms a useful knowledge basis for annotated compound libraries that guide experiments for pathway elucidation.141 Other molecular information systems, like the Cerep Bioprint Matrix142 or Iconix DrugMatrix,143 summarize validated IC50 profiling data of drug and development compounds on a panel of therapeutic targets together with ADMET data (the latter also contains gene expression data). As was discussed above and is further discussed in this book by Froloff et al., profiling data are becoming important for lead prioritization and the design of safety pharmacology studies, and enable opportunistic drug discovery approaches. Development of integrated chemical and biological ontologies turns out to be quite complex. A key challenge resides in the genome-wide extension of molecular annotations and classification schemes. Given the growing detail and complexity of our knowledge of biological systems, the design of data models will need to evolve further to enable integration and
mining of knowledge within broader dynamic systems biology and chemical genetic/protein-protein interaction network concept spaces.144,145 The GeneGo MetaBase and MetaCore expert systems (see Table 1) form one such pioneering systems biology knowledge platform, which allows one to network over 90% of the known human proteins via physical protein-protein, protein-DNA and protein-compound interactions. In combination with advanced data analysis techniques, such systems allow the compilation of relevant reference sets for chemoinformatics-based similarity searches and for the library design of target class or signaling pathway-focused collections.146−148 Conversely, the ligand similarity principle can be used to infer putative molecular targets of compounds of interest. Integrated molecular informatics systems are thus central to chemogenomics knowledge-based discovery strategies, which rest on the principle that similar ligands bind to similar targets or binding sites.149 As an example, homology-based similarity searching was developed at NIBR as a chemoinformatics similarity searching method able to identify not only ligands binding to the same target as the reference ligand(s), but also potential ligands of other homologous targets for which no ligands are yet known.150 More classically, similarity searching methods are used to identify candidate screening compounds for a target where reference compounds are already known, allowing competitors to find catch-up lead molecules. Related machine learning methods, like artificial neural networks, Kohonen self-organizing maps and SVMs (support vector machines), try to align the chemical and biological spaces based on mapping procedures.148 The goal here is to identify which parts (islands) of the chemical-property space correspond to specific target families or therapeutic activities, and vice versa.
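The use of the ligand similarity principle to infer putative targets can be sketched in a few lines of Python. This is an illustrative sketch, not the homology-based method cited above: it assumes that each compound has already been encoded as a fingerprint (here a set of "on" bit positions), it uses the Tanimoto coefficient as the similarity measure (a common choice that the text itself does not name), and the function names and the 0.7 default threshold are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # positions of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def infer_targets(
    query: Fingerprint,
    reference_sets: Dict[str, List[Fingerprint]],
    threshold: float = 0.7,   # assumed cut-off, to be tuned per fingerprint type
) -> List[Tuple[str, float]]:
    """Rank targets by the best similarity of the query to any known ligand."""
    scored = []
    for target, ligands in reference_sets.items():
        best = max((tanimoto(query, fp) for fp in ligands), default=0.0)
        if best >= threshold:
            scored.append((target, best))
    # most similar reference ligand first
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The same function serves both directions discussed in the text: given reference ligands for a target, it flags candidate screening compounds, and given a compound of interest, it proposes targets whose known ligands resemble it.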
Especially innovative are these applications in combination with scaffold hopping and hybridization methods.151−153 Hit rates in the range of 1–10% covering multiple chemotypes can be expected with library sizes of 500–2500 compounds, when the libraries are designed towards new members with expected conserved molecular recognition.6 Other direct applications include the prediction of the entire biological activity spectra of compounds of interest. For example, the PASS (prediction of activity spectra for substances) approach allows substructure-analysis-based probabilistic modeling of known biological effects (e.g. pharmacological main and side effects, mechanisms of action, mutagenicity, carcinogenicity, teratogenicity and embryotoxicity) based on a training dataset from the literature.154 Similarly at NIBR, based on the WOMBAT
[Figure 7 image: chemical structure of Calphostin C. Top targets predicted: 1. Protein Kinase C; 2. Tubulin; 3. Beta-Hexosaminidase.]
Figure 7. Multiclass Naïve Bayes modeling within Pipeline Pilot software (www.scitegic.com) based on the WOMBAT chemogenomics dataset. Probabilistic target predictions are possible for compounds given only their chemical structure. In the example shown, the WOMBAT targets were predicted for Calphostin C, a known protein kinase C inhibiting natural product. Tubulin and beta-hexosaminidase are predicted as additional possible targets.
reference data, multiple-class Naïve Bayes models are used for targeted compound selections and the in silico annotation of HTS hit lists (see Fig. 7). A further source of knowledge which is currently being integrated in the molecular information systems at NIBR is the tracking and annotation of the chemical and biological hit list triage decisions made on primary and validated HTS data. The derived data are recognized as being of capital interest for the transparency of the HTS hit identification process and as an invaluable source for further improving the quality of the screening collection. The analysis of such data allows, for instance, the automatic identification and extraction of privileged structures, or the build-up of predictive models for compound classification.155,156
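The idea behind multiple-class Naïve Bayes target prediction can be illustrated with a small, self-contained Python sketch. It is not the Pipeline Pilot implementation referred to above and does not use the WOMBAT data: the class name, the feature labels and the Laplace smoothing constant `alpha` are assumptions, compounds are represented simply as sets of structural feature labels, and, as a simplification, only the features present in the query contribute to the score.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


class FeatureNaiveBayes:
    """Multi-class naive Bayes over binary structural features (sketch)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha                                   # Laplace smoothing
        self.class_counts: Counter = Counter()               # compounds per target
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary: Set[str] = set()                    # features seen in training

    def fit(self, samples: Iterable[Tuple[FrozenSet[str], str]]) -> "FeatureNaiveBayes":
        """Count, per target, how many training compounds carry each feature."""
        for features, target in samples:
            self.class_counts[target] += 1
            for feature in features:
                self.feature_counts[target][feature] += 1
                self.vocabulary.add(feature)
        return self

    def rank(self, features: FrozenSet[str]) -> List[Tuple[str, float]]:
        """Return (target, log score) pairs, most probable target first."""
        total = sum(self.class_counts.values())
        scores = []
        for target, n in self.class_counts.items():
            log_score = math.log(n / total)                  # class prior
            for feature in features & self.vocabulary:       # ignore unseen features
                p = (self.feature_counts[target][feature] + self.alpha) / (n + 2 * self.alpha)
                log_score += math.log(p)
            scores.append((target, log_score))
        return sorted(scores, key=lambda item: item[1], reverse=True)
```

Ranking all targets rather than returning a single label mirrors the "top targets predicted" list of Fig. 7: the highest-scoring entries are the targets whose known ligands share the most characteristic features with the query compound.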
6. Conclusion
In this chapter we have outlined the main compound categories and compound design and selection principles which permit systematization of the discovery of small molecules interacting with the broad diversity of protein targets and molecular binding sites. Natural products, peptides and mimetics, co-factor/drug-based, and DOS libraries, used in combination, are expected to contribute to achieving the central chemogenomics challenge of
systematically identifying small molecules that interact with and modulate the function of each gene product. Systematization seems possible whenever there are underlying conserved principles of molecular interaction.157 Mainly for this reason, the target family-directed and knowledge-based approaches illustrated in this book are especially promising.42,43,52,146 Finally, we wish to emphasize that the primary role of small molecules in chemogenomics is to provide the foundation for future drug discovery and optimization projects, which continue to rely on the classical medicinal chemistry and in vivo pharmacology-based selection and design principles; it should also be acknowledged that drugs are, after all, discovered in the clinic.
Acknowledgements
Drs. P. Acklin, R. Amstutz, D. Bojanic, B. Faller, P. Fuerst, R. Lewis, A. Marzinzik, C. McCarthy, F. Petersen, J. Priestle, L. Urban, H.-J. Roth, U. Schopfer, J. Tallarico, R. Tommasi, J. Van Drie, J. Zimmermann (all NIBR associates) are acknowledged for various support and discussions.
References
1. Webb TR. (2005) Current directions in the evolution of compound libraries. Curr. Opin. Drug Discov. Devel. 8:303–308.
2. Hopkins AL, Groom CR. (2002) The druggable genome. Nat. Rev. Drug Discov. 1:727–730.
3. Deprez-Poulain R, Deprez B. (2004) Facts, figures and trends in lead generation. Curr. Top. Med. Chem. 4:569–580.
4. Teague SJ, Davis AM, Leeson PD, Oprea T. (1999) The design of leadlike combinatorial libraries. Angew. Chem. Int. Ed. Engl. 38:3743–3748.
5. Goodnow RA, Jr., Guba W, Haap W. (2003) Library design practices for success in lead generation with small molecule libraries. Comb. Chem. High Throughput Screen. 6:649–660.
6. Jacoby E, Schuffenhauer A, Popov M, et al. (2005) Key aspects of the Novartis compound collection enhancement project for the compilation of a comprehensive chemogenomics drug discovery screening collection. Curr. Top. Med. Chem. 5:397–411.
7. Spencer RW. (1998) High-throughput screening of historic collections: Observations on file size, biological targets and file diversity. Biotechnol. Bioeng. 61:61–67.
8. Cavallaro CL, Schnur DM, Tebben AJ. (2005) Molecular diversity in lead discovery: From quantity to quality. In: T Oprea (ed), Chemoinformatics in Drug Discovery, pp. 175–198. Wiley-VCH, Weinheim.
9. Brown RD, Hassan M, Waldman M. (2000) Combinatorial library design for diversity, cost efficiency, and drug-like character. J. Mol. Graph. Model. 18:427–437.
10. Bajorath J. (2002) Integration of virtual and high-throughput screening. Nat. Rev. Drug Discov. 1:882–894.
11. Bleicher KH, Böhm HJ, Müller K, Alanine AI. (2003) Hit and lead generation: Beyond high-throughput screening. Nat. Rev. Drug Discov. 5:369–378.
12. Blake JF. (2004) Integrating chemoinformatics analysis in combinatorial chemistry. Curr. Opin. Chem. Biol. 8:407–411.
13. Rose S, Stevens A. (2003) Computational design strategies for combinatorial libraries. Curr. Opin. Chem. Biol. 7:331–339.
14. Schreiber SL. (1998) Chemical genetics resulting from a passion for synthetic organic chemistry. Bioorg. Med. Chem. 6:1127–1152.
15. Stockwell BR. (2000) Chemical genetics: Ligand-based discovery of gene function. Nat. Rev. Genet. 1:116–125.
16. Caron PR, Mullican MD, Mashal RD, et al. (2001) Chemogenomic approaches to drug discovery. Curr. Opin. Chem. Biol. 5:464–470.
17. Murcko M, Caron P. (2002) Transforming the genome to drug discovery. Drug Discov. Today 2:583–584.
18. Waldmann H. (2003) At the crossroads of chemistry and biology. Bioorg. Med. Chem. 17:3045–3051.
19. Bredel M, Jacoby E. (2004) Chemogenomics: An emerging strategy for rapid target and drug discovery. Nat. Rev. Genet. 5:262–275.
20. Schreiber SL. (2005) Small molecules: The missing link in the central dogma. Nat. Chem. Biol. 1:64–66.
21. Dobson CM. (2004) Chemical space and biology. Nature 432:824–828.
22. Sneader W. (2005) Drug Discovery: A History. John Wiley & Sons, Chichester, UK.
23. Clardy J, Walsh C. (2004) Lessons from natural molecules. Nature 432:829–837.
24. Koehn FE, Carter GT. (2005) The evolving role of natural products in drug discovery. Nat. Rev. Drug Discov. 4:206–220.
25. Newman DJ, Cragg GM, Snader KM. (2003) Natural products as sources of new drugs over the period 1981–2002. J. Nat. Prod. 66:1022–1037.
26. Petersen F. (2005) Natural products research at Novartis Pharmaceuticals: A historical overview. In: Drug Discovery from Natural Products: 85 Years of Natural Products Research at Novartis, pp. 1–13. NIBR Global Communications, Cambridge, USA.
27. Rouhi AM. (2003) The case for natural products research. C&EN 81:80–81.
28. Henkel T, Brunne R, Müller H, Reichel F. (1999) Statistical investigation of structural complementarity of natural products and synthetic compounds. Angew. Chem. Int. Ed. Engl. 38:643–647.
29. Feher M, Schmidt JM. (2003) Property distributions: Differences between drugs, natural products, and molecules from combinatorial chemistry. J. Chem. Inf. Comput. Sci. 43:218–227.
30. Lee ML, Schneider G. (2001) Scaffold architecture and pharmacophoric properties of natural products and trade drugs: Applications in the design of natural products based combinatorial libraries. J. Comb. Chem. 3:284–289.
31. Molnar I, Schupp T, Ono M, et al. (2000) The biosynthetic gene cluster for the microtubule-stabilizing agents epothilones A and B from Sorangium cellulosum So ce90. Chem. Biol. 7:97–109.
32. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ. (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 23:3–25.
33. Lipinski CA. (2000) Drug-like properties and the cause of poor solubility and poor permeability. J. Pharmacol. Toxicol. Methods 44:235–249.
34. Boldi AM. (2004) Libraries from natural products-like scaffolds. Curr. Opin. Chem. Biol. 8:281–286.
35. Abel U, Koch C, Speitling M, Hansske FG. (2002) Modern methods to produce natural-product libraries. Curr. Opin. Chem. Biol. 6:453–458.
36. Breinbauer R, Vetter IR, Waldmann H. (2002) From protein domains to drug candidates: Natural products as guiding principles in the design and synthesis of compound libraries. Angew. Chem. Int. Ed. Engl. 41:2879–2890.
37. Koch MA, Breinbauer R, Waldmann H. (2003) Protein structure similarity as guiding principle for combinatorial library design. Biol. Chem. 384:1265–1272.
38. Chen X, Ung CY, Chen Y. (2003) Can an in silico drug-target search method be used to probe potential mechanisms of medicinal plant ingredients? Nat. Prod. Rep. 20:432–444.
39. He W, Miao FJ, Lin DC, et al. (2004) Citric acid cycle intermediates as ligands for orphan G-protein-coupled receptors. Nature 429:188–193.
40. Babine RE, Bender SL. (1997) Molecular recognition of protein-ligand complexes: Applications to drug design. Chem. Rev. 97:1359–1472.
41. Gray NS, Wodicka L, Thunnissen AM, et al. (1998) Exploiting chemical libraries, structure, and genomics in the search for kinase inhibitors. Science 281:533–538.
42. Vieth M, Higgs RE, Robertson DH, et al. (2004) Kinomics-structural biology and chemogenomics of kinase inhibitors and targets. Biochim. Biophys. Acta 1697:243–257.
43. ter Haar E, Walters WP, Pazhanisamy S, et al. (2004) Kinase chemogenomics: Targeting the human kinome for target validation and drug discovery. Mini Rev. Med. Chem. 4:235–253.
44. Denessiouk KA, Johnson MS. (2000) When fold is not important: A common structural framework for adenine and AMP binding in 12 unrelated protein families. Proteins 38:310–326.
45. Sem DS, Bertolaet B, Baker B, et al. (2004) Systems-based design of bi-ligand inhibitors of oxidoreductases: Filling the chemical proteomic toolbox. Chem. Biol. 11:185–194.
46. Wermuth CG. (2004) Selective optimization of side activities: Another way for drug discovery. J. Med. Chem. 47:1303–1314.
47. Kubinyi H. (2004) Drug discovery from side effects. In: H Kubinyi, G Müller (eds), Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective, pp. 43–68. Wiley-VCH, Weinheim.
48. Hruby VJ. (2002) Designing peptide receptor agonists and antagonists. Nat. Rev. Drug Discov. 1:847–858.
49. Wise A, Jupe SC, Rees S. (2004) The identification of ligands at orphan G-protein coupled receptors. Annu. Rev. Pharmacol. Toxicol. 44:43–66.
50. Merrifield RB. (1963) Solid phase peptide synthesis. I. The synthesis of a tetrapeptide. J. Am. Chem. Soc. 85:2149–2154.
51. Eguchi M, McMillan M, Nguyen C, et al. (2003) Chemogenomics with peptide secondary structure mimetics. Comb. Chem. High Throughput Screen. 6:611–621.
52. Tyndall JDA, Pfeiffer B, Abbenante G, Fairlie DP. (2005) Over one hundred peptide-activated G protein-coupled receptors recognize ligands with turn structure. Chem. Rev. 105:793–826.
53. Tyndall JDA, Nall T, Fairlie DP. (2005) Proteases universally recognize beta strands in their active sites. Chem. Rev. 105:973–999.
54. Lelais G, Seebach D. (2004) Beta2-amino acids: Syntheses, occurrence in natural products, and components of beta-peptides1,2. Biopolymers 76:206–243.
55. Yin H, Hamilton AD. (2005) Strategies for targeting protein-protein interactions with synthetic agents. Angew. Chem. Int. Ed. 44:4130–4163.
56. Garland SL, Dean PM. (1999) Design criteria for molecular mimics of fragments of the beta-turn. 1. C alpha atom analysis. J. Comput.-Aided Mol. Des. 13:469–483.
57. Garland SL, Dean PM. (1999) Design criteria for molecular mimics of fragments of the beta-turn. 2. C alpha-C beta bond vector analysis. J. Comput.-Aided Mol. Des. 13:485–498.
58. Webb TR. (2004) Some principles related to chemogenomics in compound library and template design for GPCRs. In: H Kubinyi, G Müller (eds), Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective, pp. 313–324. Wiley-VCH, Weinheim.
59. Weckbecker G, Lewis I, Albert R, et al. (2003) Opportunities in somatostatin research: Biological, chemical and therapeutic aspects. Nat. Rev. Drug Discov. 2:999–1017.
60. Flohr S, Kurz M, Kostenis E, et al. (2002) Identification of nonpeptidic urotensin II receptor antagonists by virtual screening based on a pharmacophore model derived from structure-activity relationships and nuclear magnetic resonance studies on urotensin II. J. Med. Chem. 45:1799–1805.
61. Klabunde T, Hessler G. (2002) Drug design strategies for targeting G-protein coupled receptors. ChemBioChem 3:928–944.
62. Schreiber SL. (2000) Target-oriented and diversity-oriented organic synthesis in drug discovery. Science 287:1964–1969.
63. Burke MD, Schreiber SL. (2004) A planning strategy for diversity-oriented synthesis. Angew. Chem. Int. Ed. Engl. 43:46–58.
64. Tan DS. (2005) Diversity-oriented synthesis: Exploring the intersections between chemistry and biology. Nat. Chem. Biol. 1:74–84.
65. Reayi A, Arya P. (2005) Natural product-like chemical space: Search for chemical dissectors of macromolecular interactions. Curr. Opin. Chem. Biol. 9:240–247.
66. Kuruvilla FG, Shamji AF, Sternson SM, et al. (2002) Dissecting glucose signalling with diversity-oriented synthesis and small-molecule microarrays. Nature 416:653–657.
67. Haggarty SJ, Koeller KM, Wong JC, et al. (2003) Domain-selective small-molecule inhibitor of histone deacetylase 6 (HDAC6)-mediated tubulin deacetylation. Proc. Natl. Acad. Sci. USA 100:4389–4394.
68. Wong JC, Hong R, Schreiber SL. (2003) Structural biasing elements for in-cell histone deacetylase paralog selectivity. J. Am. Chem. Soc. 125:5586–5587.
69. Milne GM, Jr. (2003) Pharmaceutical productivity: The imperatives for new paradigms. Ann. Rep. Med. Chem. 38:383–396.
70. Austin CP, Brady LS, Insel TR, Collins FS. (2004) NIH molecular libraries initiative. Science 306:1138–1139.
71. National Chemical Library, http://chimiotheque.ujf-grenoble.fr/induk.html.
72. Olah MM, Bologa CG, Oprea TI. (2004) Strategies for compound selection. Curr. Drug Discov. Technol. 1:211–220.
73. MacFadyen H, Walker G, Alvarez J. (2005) Enhancing hit quality and diversity within assay throughput constraints. In: T Oprea (ed), Chemoinformatics in Drug Discovery, pp. 143–174. Wiley-VCH, Weinheim.
74. Maggiora GM, Shanmugasundaram V, Lajiness MS, et al. (2005) A practical strategy for directed compound acquisition. In: T Oprea (ed), Chemoinformatics in Drug Discovery, pp. 317–332. Wiley-VCH, Weinheim.
75. Oprea TI, Matter H. (2004) Integrating virtual screening in lead discovery. Curr. Opin. Chem. Biol. 8:349–358.
76. Hert J, Willett P, Wilton D, et al. (2004) Topological descriptors for similarity-based virtual screening using multiple bioactive reference structures. Org. Biomol. Chem. 2:3256–3266.
77. Schuffenhauer A, Popov M, Schopfer U, et al. (2004) Molecular diversity management strategies for building diverse and focused lead discovery compound screening collections. Comb. Chem. High Throughput Screen. 7:771–781.
78. Rishton GM. (1997) Reactive compounds and in vitro false positives in HTS. Drug Discov. Today 2:382–384.
79. Charifson PS, Walters WP. (2002) Filtering databases and chemical libraries. J. Comput.-Aided Mol. Des. 16:311–323.
80. Rishton GM. (2002) Non-leadlikeness and leadlikeness in biochemical screening. Drug Discov. Today 8:86–96.
81. Martin YC. (2001) Diverse viewpoints on computational aspects of molecular diversity. J. Comb. Chem. 3:231–250.
82. Downs GM, Barnard JM. (2002) Clustering methods and their use in computational chemistry. Rev. Comput. Chem. 18:1–40.
83. Harper G, Pickett SD, Green DV. (2004) Design of a compound screening collection for use in high-throughput screening. Comb. Chem. High Throughput Screen. 7:63–71.
84. Wright T, Gillet VJ, Green DVS, Pickett SD. (2003) Optimizing the size and configuration of combinatorial libraries. J. Chem. Inf. Comput. Sci. 43:381–390.
85. Dolle RE. (2003) Comprehensive survey of combinatorial library synthesis 2002. J. Comb. Chem. 5:693–753.
86. Watson P, Willett P, Gillet V, Verdonk ML. (2001) Calculating the knowledge-based similarity of functional groups using crystallographic data. J. Comput.-Aided Mol. Des. 15:835–857.
87. Xu J. (2002) A new approach to finding natural chemical structure classes. J. Med. Chem. 45:5311–5320.
88. Sauer WHB, Schwarz MK. (2003) Molecular shape diversity of combinatorial libraries: A prerequisite for broad bioactivity. J. Chem. Inf. Comput. Sci. 43:987–1003.
89. Evans BE, Rittle KE, Bock MG, et al. (1988) Methods for drug discovery: Development of potent, selective, orally effective cholecystokinin antagonists. J. Med. Chem. 31:2235–2246.
90. Bemis GW, Murcko MA. (1996) The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39:2887–2893.
91. Patchett AA, Nargund RP. (2000) Privileged structures: An update. In: GL Trainor (ed), Annual Reports in Medicinal Chemistry Vol. 35, pp. 289–298. Academic Press, San Diego.
92. Müller G. (2004) Target family-directed masterkeys in chemogenomics. In: H Kubinyi, G Müller (eds), Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective, pp. 7–42. Wiley-VCH, Weinheim.
93. Schopfer U, Engeloch C, Stanek J, et al. (2005) The Novartis compound archive: From concept to reality. Comb. Chem. High Throughput Screen. 8:513–519.
94. Alanine A, Nettekoven M, Roberts E, Thomas AW. (2003) Lead generation: enhancing the success of drug discovery by investing in the hit to lead process. Comb. Chem. High Throughput Screen. 6:51–66.
95. Walters WP, Namchuk M. (2003) Designing screens: How to make your hits a hit. Nat. Rev. Drug Discov. 2:259–265.
96. Pritchard JF, Jurima-Romet M, Reimer ML, et al. (2003) Making better drugs: Decision gates in non-clinical drug development. Nat. Rev. Drug Discov. 2:542–553.
97. Davis AM, Keeling DJ, Steele J, et al. (2005) Components of successful lead generation. Curr. Top. Med. Chem. 5:421–439.
98. Fischer HP, Heyse S. (2005) From targets to leads: The importance of advanced data analysis for decision support in drug discovery. Curr. Opin. Drug Discov. Devel. 8:334–346.
99. Baringhaus KH, Matter H. (2005) Efficient strategies for lead optimization by simultaneously addressing affinity, selectivity and pharmacokinetic parameters. In: T Oprea (ed), Chemoinformatics in Drug Discovery, pp. 333–380. Wiley-VCH, Weinheim.
100. Lipinski C, Hopkins A. (2004) Navigating chemical space for biology and medicine. Nature 432:855–861.
101. Wess G. (2002) How to escape the bottleneck of medicinal chemistry. Drug Discov. Today 4:533–535.
102. Kubinyi H. (2003) Drug research: Myths, hype and reality. Nat. Rev. Drug Discov. 8:665–668.
103. Lajiness MS, Maggiora GM, Shanmugasundaram V. (2004) Assessment of the consistency of medicinal chemists in reviewing sets of compounds. J. Med. Chem. 47:4891–4896.
104. Guba W, Roche O. (2004) Computational filters in lead generation: Targeting drug-like chemotypes. In: H Kubinyi, G Müller (eds), Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective, pp. 325–339. Wiley-VCH, Weinheim.
105. Jacoby E, Schuffenhauer A, Popov M, et al. (2004) Molecular informatics as an enabling in silico technology platform for drug discovery. Chimia 58:577–584.
106. Wilkens SJ, Janes J, Su AL. (2005) HierS: Hierarchical scaffold clustering using topological chemical graphs. J. Med. Chem. 48:3182–3193.
107. Gribbon P, Sewing A. (2005) High-throughput drug discovery: What can we expect from HTS? Drug Discov. Today 10:17–22.
108. Seidler J, McGovern SL, Doman TN, Shoichet BK. (2003) Identification and prediction of promiscuous aggregating inhibitors among known drugs. J. Med. Chem. 46:4477–4486.
109. Roche O, Schneider P, Zuegge J, et al. (2002) Development of a virtual screening method for identification of “frequent hitters” in compound libraries. J. Med. Chem. 45:137–142.
110. Huth JR, Mendoza R, Olejniczak ET, et al. (2005) ALARM NMR: A rapid and robust experimental method to detect reactive false positives in biochemical screens. J. Am. Chem. Soc. 127:217–224.
111. Feng BY, Shelat A, Doman TN, et al. (2005) High-throughput assays of promiscuous inhibitors. Nat. Chem. Biol. 1:146–148.
112. Mestres J, Veeneman GH. (2003) Identification of “latent hits” in compound screening collections. J. Med. Chem. 46:3441–3444.
113. Swinney DC. (2004) Biochemical mechanisms of drug action: What does it take for success? Nat. Rev. Drug Discov. 3:801–808.
114. Di L, Kerns E. (2003) Profiling drug-like properties in discovery research. Curr. Opin. Chem. Biol. 7:402–408.
115. Van de Waterbeemd H, Gifford E. (2003) ADMET in silico modelling: Towards prediction paradise? Nat. Rev. Drug Discov. 2:192–541.
116. Roberts SA. (2001) High-throughput screening approaches for investigating drug metabolism and pharmacokinetics. Xenobiotica 31:557–589.
117. Kenakin T. (2003) Predicting therapeutic value in the lead optimization phase of drug discovery. Nat. Rev. Drug Discov. 2:429–438.
118. Muegge I. (2003) Selection criteria for drug-like compounds. Med. Res. Rev. 23:302–321.
119. Veber DF, Johnson SR, Cheng HY, et al. (2002) Molecular properties that influence the oral bioavailability of drug candidates. J. Med. Chem. 45:2615–2623.
120. Lu JJ, Crimin K, Goodwin JT, et al. (2004) Influence of molecular flexibility and polar surface area metrics on oral bioavailability in the rat. J. Med. Chem. 47:6104–6107.
121. Vieth M, Siegel MG, Higgs RE, et al. (2005) Characteristic physical properties and structural fragments of marketed oral drugs. J. Med. Chem. 47:224–232.
122. Martin YC, Kofron JL, Traphagen LM. (2002) Do structurally similar molecules have similar biological activity? J. Med. Chem. 45:4350–4358.
123. Lajiness MS, Vieth M, Erickson J. (2004) Molecular properties that influence oral drug-like behavior. Curr. Opin. Drug Discov. Devel. 7:470–477.
124. Baurin N, Baker R, Richardson C, et al. (2004) Drug-like annotation and duplicate analysis of a 23-supplier chemical database totalling 2.7 million compounds. J. Chem. Inf. Comput. Sci. 44:643–651.
125. Ghose AK, Viswanadhan VN, Wendoloski JJ. (1999) A knowledge-based approach in designing combinatorial or medicinal chemistry libraries for drug discovery. 1. A qualitative and quantitative characterization of known drug databases. J. Comb. Chem. 1:55–68.
126. Egan WJ, Merz KM, Jr., Baldwin JJ. (2000) Prediction of drug absorption using multivariate statistics. J. Med. Chem. 43:3867–3877.
127. Kalgutkar AS, Gardner I, Obach RS, et al. (2005) A comprehensive listing of bioactivation pathways of organic functional groups. Curr. Drug Metab. 6:161–225.
128. Snyder RD, Pearl GS, Mandakas G, et al. (2004) Assessment of the sensitivity of the computational programs DEREK, TOPKAT, and MCASE in the prediction of the genotoxicity of pharmaceutical molecules. Environ. Mol. Mutagen. 43:143–158.
129. Mühlbacher J, Ertl P, Selzer P, et al. (2004) Toxizitätsvorhersage im Intranet. Nachr. Chem. 52:162–164.
130. Wenlock MC, Austin RP, Barton P, et al. (2003) A comparison of physiochemical property profiles of development and marketed oral drugs. J. Med. Chem. 46:1250–1256.
131. Hann MM, Leach AR, Green DVS. (2005) Computational chemistry, molecular complexity and screening set design. In: T Oprea (ed), Chemoinformatics in Drug Discovery, pp. 43–58. Wiley-VCH, Weinheim.
132. Selzer P, Roth HJ, Ertl P, Schuffenhauer A. (2005) Complex molecules: Do they add value? Curr. Opin. Chem. Biol. 9:310–316.
133. Hann MM, Oprea TI. (2004) Pursuing the leadlikeness concept in pharmaceutical research. Curr. Opin. Chem. Biol. 8:255–263.
134. Oprea TI. (2005) Chemoinformatics in lead discovery. In: T Oprea (ed), Chemoinformatics in Drug Discovery, pp. 25–42. Wiley-VCH, Weinheim.
135. Proudfoot JR. (2002) Drugs, leads, and drug-likeness: An analysis of some recently launched drugs. Bioorg. Med. Chem. Lett. 12:1647–1650.
136. Schuffenhauer A, Zimmermann J, Stoop R, et al. (2002) An ontology for pharmaceutical ligands and its application for library design and in silico screening. J. Chem. Inf. Comput. Sci. 42:947–955.
137. Schuffenhauer A, Jacoby E. (2004) Annotating and mining the ligand-target chemogenomics knowledge space. BioSilico 2:190–200.
138. Searls DB. (2005) Data integration: Challenges for drug discovery. Nat. Rev. Drug Discov. 4:45–58.
139. Strausberg RL, Schreiber SL. (2003) From knowing to controlling: A path from genomics to drugs using small molecule probes. Science 300:294–295.
140. Zimmermann M, Fluck J, Thi le TB, et al. (2005) Information extraction in the life sciences: Perspectives for medicinal chemistry, pharmacology and toxicology. Curr. Top. Med. Chem. 5:785–796.
141. Root DE, Flaherty SP, Kelley BP, Stockwell BR. (2003) Biological mechanism profiling using an annotated compound library. Chem. Biol. 10:881–892.
ch01
FA April 1, 2006
38
15:40
WSPC/Book-329: Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
E. Jacoby et al.
142. Krejsa CM, Horvath D, Rogalski SL, et al. (2003) Predicting ADME properties and side effects: The BioPrint approach. Curr. Opin. Drug Discov. Devel. 6:470–480. 143. Roter AH. (2005) Large-scale integrated databases supporting drug discovery. Curr. Opin. Drug Discov. Devel. 8:309–315. 144. Sharom JR, Bellows DS, Tyers M. (2004) From large networks to small molecules. Curr. Opin. Chem. Biol. 8:81–90. 145. Nikolsky Y, Nikolskaya T, Bugrim A. (2005) Biological networks and analysis of experimental data in drug discovery. Drug Discov. Today 10:653–662. 146. Jacoby E, Schuffenhauer A, Acklin P. (2004) The contribution of molecular informatics to chemogenomics. Knowledge-based discovery of biological targets and chemical lead compounds. In: H Kubini, G Müller (eds), Chemogenomics in Drug Discovery — A Medicinal Chemistry Perspective, pp. 139–166. Wiley-VCH, Weinheim. 147. Schneider P, Schneider G. (2003) Collection of bioactive reference compounds for focused library design. Quant. Struct.-Act. Relat. 22:713–718. 148. Savchuck NP, Balakin KV, Tkachenko SE. (2004) Exploring the chemogenomic knowledge space with annotated chemical libraries. Curr. Opin. Chem. Biol. 8:412–417. 149. Mestres J. (2004) Computational chemogenomics approaches to systematic knowledge-based drug discovery. Curr. Opin. Drug Discov. Dev. 7:304–314. 150. Schuffenhauer A, Floersheim P, Acklin P, Jacoby E. (2003) Similarity metrics for ligands reflecting the similarity of the target proteins. J. Chem. Inf. Comput. Sci. 43:391–405. 151. Schneider G, Neidhart W, Giller T, Schmid G. (1999) “Scaffold-hopping” by topological pharmacophore search: A contribution to virtual screening. Angew. Chem. Int. Ed. Engl. 38:2894–2896. 152. Böhm HJ, Flohr A, Stahl M. (2004) Scaffold hopping. Drug Discov. Today: Technol. 1:217–224. 153. Pierce AC, Rao G, Bemis GW. (2004) BREED: Generating novel inhibitors through hybridization of known ligands. Application to CDK2, p38, and HIV protease. J. Med. Chem. 47:2768–2775. 
154. Stepanchikova AV, Lagunin AA, Filimonov DA, Poroikov VV. (2003) Prediction of biological activity spectra for substances: Evaluation on the diverse sets of drug-like structures. Curr. Med. Chem. 10:225–233. 155. Sheridan RP. (2003) Finding multiactivity substructures by mining databases of drug-like compounds. J. Chem. Inf. Comput. Sci. 43:1037–1050. 156. Cases M, Garcia-Serna R, Hettne K, et al. (2005) Chemical and biological profiling of an annotated compound library directed to the nuclear receptor family. Curr. Top. Med. Chem. 5:763–72. 157. Frye SV. (1999) Structure-activity relationship homology (SARAH): A conceptual framework for drug discovery in the genomic era. Chem. Biol. 6:R3–R7.
ch01
FA April 1, 2006
15:40
WSPC/Book-329: Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
Chapter 2
Mapping the Chemogenomic Space
Jordi Mestres∗
An essential prerequisite for extracting knowledge from biochemical data is the establishment and adoption of annotation and classification schemes for all chemical and biological entities. An overview of the main classification schemes currently in use and their application to the mapping of the chemogenomic space is presented.
1. The Chemogenomic Space
Chemogenomics, the annotation of all molecules to all targets, has recently emerged as a new paradigm in drug discovery in which efficiency in the compound design and optimization process is achieved through gaining and reusing targeted knowledge.1 Since targeted knowledge resides at the interface between chemistry and biology, computational tools aimed at integrating the different dimensions of the chemogenomic space play a central role in chemogenomics.2 The chemogenomic space is a complex space defined by multiple interconnected dimensions. This is schematically represented in Fig. 1, where molecules, proteins, genes, pathways, and diseases are some of the different dimensions represented within the chemogenomic space. Different types of experimental data allow for connecting pairs of dimensions in this representation. For example, pharmacological data on the binding affinity of chemicals in a particular assay allow the mapping of the molecule-protein space (in gray in Fig. 1), one of the two-dimensional subspaces defining this multidimensional chemogenomic space. Recent trends towards implementing chemogenomic strategies in drug discovery involve organizing research around target families in order to
∗ Chemogenomics Laboratory, Research Unit on Biomedical Informatics, Municipal Institute of Medical Research and University Pompeu Fabra, Dr. Aiguader 88, 08003 Barcelona, Catalonia, Spain. Email: [email protected].
Figure 1. The multidimensional chemogenomic space. (Schematic; dimensions shown: molecules, proteins, pathways, genes, and diseases.)
maximize the efficiency of chemistry and biology resources by gaining, accumulating, and reusing family-directed knowledge. Key to this systematization of drug discovery is the adoption of annotation and classification schemes for all biological and chemical entities. On the one hand, annotations assign unique identifiers to diseases,3 pathways,4 gene sequences,5 protein sequences6 and structures,7 as well as chemical structures8 generated and stored in databases. They represent an efficient means of providing fast, directed access to data and allow for establishing direct links between the entities of the different dimensions within the chemogenomic space. On the other hand, classification schemes allow for organizing entities within each dimension in a rational, systematic manner. They are essential for inferring cross-dimensional relationships and thus ultimately for extracting family-directed knowledge. The remainder of this chapter reviews the main annotation and classification schemes currently in use for proteins and molecules and their application to mapping this portion of the chemogenomic space.
2. Annotation and Classification Schemes for Proteins
The biological space of the human genome containing proteins capable of interacting with a molecule having similar properties to existing drugs can be referred to as the druggable genome.9 The size of the druggable genome can be estimated by considering that sequence similarities within gene families are often indicative of a more general conservation of their active site architecture and that, if one member of the gene family has
already been proven to interact with a known drug, other members may potentially be able to interact with molecules of similar characteristics. Using this reasoning, and on the basis of current knowledge, only 3,051 proteins from the approximately 30,000 human genes could be assigned to the druggable space of the genome. However, not all proteins delineating the druggable genome can be associated with particular diseases. The subset of therapeutically relevant targets within the druggable genome is estimated to be between 600 and 1,500 proteins. At present, the number of proteins being utilized as targets for currently marketed drugs is around 120 and, thus, the unexploited biological space with target potential may still be significant. Of particular relevance is the fact that 88% of these 120 targets are proteins belonging to the two main biochemical classes, viz., enzymes and receptors. Good knowledge of the composition and classification of these families is essential when attempts are made to maximally exploit their general characteristics for the design and screening of targeted chemical libraries. One of the main problems for the annotation and classification of the biological space is the lack of a standard scheme for all protein families. Even within families, different classification schemes coexist and are being used by different research communities. This greatly hampers any chemogenomic initiative aimed at integrating chemical and biological spaces with novel computational techniques.2 The following provides an overview of the classification schemes currently in use for the main therapeutically relevant protein families.
2.1. Enzymes
Enzymes constitute a large superfamily of proteins acting as biological catalysts capable of accelerating over a million-fold the rate of chemical reactions within cells.10 On the basis of the type of reaction catalyzed, enzymes are organized using a classification scheme that has prevailed for decades.11–13 Each enzyme is classified according to a 4-digit identifier usually referred to as the Enzyme Commission (EC) number.14 The first digit specifies the class of enzyme. There are six different classes, viz., oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases, which are assigned to EC numbers from 1 to 6 in that particular order. The second digit specifies the enzyme subclass according to a compound or group involved in the reaction being catalyzed. For example, the subclass in oxidoreductases indicates the
type of group being oxidized in the donor (e.g. 1.1 and 1.5 acting, respectively, on the CH-OH and CH-NH groups of donors), whereas the subclass in transferases specifies the type of transfer being produced (e.g. 2.1 and 2.3 transferring one-carbon and acyl groups, respectively). The third digit specifies the enzyme sub-subclass defining the type of reaction in a more concrete manner. For example, the sub-subclass in oxidoreductases defines the acceptor (e.g. 1.-.1 reflects that NAD+ or NADP+ is the acceptor and 1.-.2 assigns a cytochrome as the acceptor), whereas the sub-subclass in transferases provides more information on the group being transferred (e.g. 2.1.1 and 2.1.4 corresponding to methyltransferases and amidinotransferases, respectively). Finally, the fourth digit is a number specifying the individual enzyme within a sub-subclass (e.g. 1.1.2.3 identifies L-lactate dehydrogenase and 2.1.1.45 corresponds to thymidylate synthase). The ENZYME nomenclature database,12 a repository of information based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB), currently contains 4,435 enzyme entries (Release 38.0, September 2005). Of these, 545 entries are superseded, resulting in a final list of 3,890 enzymes, 224 sub-subclasses, and 65 subclasses.
2.2. Receptors
A receptor is defined as a molecular structure of polypeptidic nature that interacts specifically with a messenger, hormone, mediator, cytokine, or a particular intracellular contact.15 For the nomenclature and identification of receptors, the Nomenclature Committee of the International Union of Pharmacology (NC-IUPHAR) has proposed a general system for defining, characterizing, and classifying all known pharmacological receptors.16 Under the NC-IUPHAR classification scheme, each receptor receives a unique identifier referred to as the Receptor Code (RC). A complete RC is a 7-level alphanumeric identifier.
The first two levels are separated by a dot and refer to the structural class and subclass. Exceptionally, an additional dot-separated level is appended at this stage to further characterize the subclass. There are four different structural classes, viz., ion channel, G protein-coupled, enzyme-associated, and transcription factor receptors, which are assigned to RC numbers 1 to 4 in that order. The remaining levels are separated by colons: the third level identifies the family of the receptor; the fourth level is a digit characterizing the receptor number; the fifth level is a code that specifies the type of receptor; the sixth level defines the organism; and the seventh level is a two-digit number related to the receptor isoform.
Unfortunately, the continuous changes introduced in the form of the RC identifier have limited the use of this classification scheme and facilitated the establishment of alternative classifications in all receptor classes.
2.2.1. Channel receptors
Receptors of this type are activated by an ionic flux that modulates the opening of a channel and selectively regulates the entry of ions into the cell. At least two classification schemes for ion channels coexist today, viz., the one established by the NC-IUPHAR and the alternative scheme approved by the NC-IUBMB based on the Transport Classification Database.17 Under the Transport Classification (TC) system, each transport protein is identified with a 5-component code. The first component is a number specifying the transporter class; the second is a letter corresponding to the transporter subclass; the third is a number denoting the transporter family; the fourth is a number indicating the subfamily in which a transporter is found; and the fifth relates to the substrate or range of substrates transported. Within the TC system, ion channels are defined as transport class 1. As an example, using the NC-IUPHAR classification scheme, the human 5-HT3 serotonin receptor would be identified with an RC of 1.1:5HT:1:5HT3:HUMAN:00 and the human GluR3 receptor would have an RC of 1.2:GLU:3:GLUR3:HUMAN:00. In comparison, the respective codes under the TC system would be 1.A.9.2.1 and 1.A.10.1.4.
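Both identifier formats in the examples above split cleanly on their separators, which makes them straightforward to handle programmatically. The following minimal Python sketch parses an RC and a TC code into named levels. The class and function names (`ReceptorCode`, `TransportCode`, `parse_rc`, `parse_tc`) are illustrative assumptions and are not part of either nomenclature; the validation rules are only those stated in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Structural classes 1 to 4 of the NC-IUPHAR scheme, as listed in the text.
RC_CLASSES = {
    "1": "ion channel",
    "2": "G protein-coupled",
    "3": "enzyme-associated",
    "4": "transcription factor",
}


@dataclass(frozen=True)
class ReceptorCode:
    structural_class: str
    subclass: str
    extra_level: Optional[str]  # the exceptional additional level, if present
    family: str
    number: str
    receptor_type: str
    organism: str
    isoform: str


@dataclass(frozen=True)
class TransportCode:
    transporter_class: str
    subclass: str
    family: str
    subfamily: str
    substrate: str


def parse_rc(code: str) -> ReceptorCode:
    """Split an RC such as '1.1:5HT:1:5HT3:HUMAN:00' into its levels."""
    fields = code.split(":")
    if len(fields) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(fields)}")
    head = fields[0].split(".")
    if len(head) not in (2, 3) or head[0] not in RC_CLASSES:
        raise ValueError(f"unrecognized class/subclass part: {fields[0]!r}")
    extra = head[2] if len(head) == 3 else None
    return ReceptorCode(head[0], head[1], extra, *fields[1:])


def parse_tc(code: str) -> TransportCode:
    """Split a TC code such as '1.A.9.2.1' into its five components."""
    parts = code.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated components, got {len(parts)}")
    return TransportCode(*parts)


rc = parse_rc("1.1:5HT:1:5HT3:HUMAN:00")
tc = parse_tc("1.A.9.2.1")
print(RC_CLASSES[rc.structural_class], rc.family, rc.organism)
print(tc.transporter_class, tc.subclass, tc.family)
```

Keeping every level as a separate field means that two receptors can be compared level by level, which is the property that the hierarchical schemes are meant to provide.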
2.2.2. G Protein-coupled receptors
The superfamily of G protein-coupled receptors (GPCRs) is composed of proteins that play crucial roles in a variety of biological functions, including light, odor, hormone, and neurotransmitter detection, and thus they are highly regarded as therapeutically relevant targets for the pharmaceutical industry. Different classification schemes for GPCRs currently coexist. On the one hand, there is the standard proposal by the NC-IUPHAR described above and, on the other hand, there are at least two alternative schemes based on the phylogenetic relationships within the superfamily.18,19 On the basis of sequence identities, the vast majority of GPCRs can be classified into six main groups. The first four groups correspond to families 1 to 4 (ref. 18) or classes A to D (ref. 19), depending on the classification scheme used. Family 1, or class A, is the most populated group and contains
the rhodopsin-like receptors; family 2, or class B, collects all secretin-like receptors; family 3, or class C, corresponds to the metabotropic glutamate/pheromone receptors; and family 4, or class D, is assigned to the fungal pheromone receptors. The other two main groups are annotated differently depending on the scheme. Under the numerical system,18 the group of frizzled/smoothened receptors is assigned to family 5, and the cAMP receptors are directly annotated as family cAMP. In contrast, under the alphanumeric system,19 cAMP receptors form class E and the frizzled/smoothened receptors are directly referred to as the frizzled/smoothened family (not class). Beyond the differences found in the annotation of this first level of classification, neither of the two phylogeny-based classification systems provides a concrete annotation scheme to follow for the subsequent levels within the GPCR phylogenetic tree. In this respect, taking the alphanumeric system as the basic framework,19 a simple and straightforward strategy could be devised to assign a numeric identifier following the order arbitrarily established in the GPCR classification. Using this scheme, the rhodopsin-like (A), amine (A.1), serotonin (A.1.5) 5-HT1A receptor can be annotated as A.1.5.1. For comparison, following the NC-IUPHAR classification scheme, the RC for the corresponding rat receptor would be 2.1:5HT:1:5HT1A:RAT:00.
2.2.3. Nuclear receptors
Nuclear receptors form a family of ligand-activated transcription factors that regulate a variety of biological processes, including lipid and glucose homeostasis, detoxification, cellular differentiation, embryonic development and organ physiology. In addition, many nuclear receptors also play an important role in mediating the induction of hepatic cytochrome P450s, a class of enzymes involved in drug metabolism and in the toxification and detoxification of xenochemicals prevalent in the environment.
The combination of these two aspects makes nuclear receptors a family of utmost biological relevance. In line with the situation found previously for the other receptor families, several classification schemes coexist for nuclear receptors. In particular, beyond the NC-IUPHAR system described above, an alternative nomenclature system has been proposed and is currently widely accepted and used by the research community working in this family.20 This annotation scheme consists of a 3-character code. The first character is a number that designates
the subfamily. There are six main subfamilies, assigned to identifiers 1 to 6. All nuclear receptors in these subfamilies contain a highly conserved DNA-binding domain (DBD) and a less conserved ligand-binding domain (LBD). However, some unusual receptors contain only one of the two conserved domains and thus an additional subfamily, assigned to identifier 0, has been included to account for them. The second character is a capital letter specifying the group within the subfamily and the third character is a number identifying the individual nuclear receptor within a group. For the sake of comparison, the 3-character code for the human estrogen receptor subtype α is 3.A.1, whereas the NC-IUPHAR RC would be 4.1.3:EST:1:ERA:HUMAN:00.
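The 3-character code described above is simple enough to validate and compare in a few lines. The sketch below is illustrative only: the names `NRCode`, `parse_nr_code`, and `shared_levels` are assumptions, and the codes other than 3.A.1 are used purely as toy inputs to show how many leading levels two receptors share.

```python
from typing import NamedTuple


class NRCode(NamedTuple):
    subfamily: int  # 0 to 6, where 0 collects the unusual receptors
    group: str      # a single capital letter
    member: int     # individual receptor within the group


def parse_nr_code(code: str) -> NRCode:
    """Parse a nuclear receptor code such as '3.A.1'."""
    subfamily, group, member = code.split(".")
    if not (subfamily.isdigit() and 0 <= int(subfamily) <= 6):
        raise ValueError(f"subfamily must be 0 to 6, got {subfamily!r}")
    if not (len(group) == 1 and "A" <= group <= "Z"):
        raise ValueError(f"group must be one capital letter, got {group!r}")
    if not member.isdigit():
        raise ValueError(f"member must be a number, got {member!r}")
    return NRCode(int(subfamily), group, int(member))


def shared_levels(a: NRCode, b: NRCode) -> int:
    """Count how many leading levels (subfamily, group, member) coincide."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


er_alpha = parse_nr_code("3.A.1")   # human estrogen receptor alpha (from the text)
same_group = parse_nr_code("3.A.2")  # toy input: another member of group 3.A
other = parse_nr_code("1.B.1")       # toy input: a different subfamily
print(shared_levels(er_alpha, same_group), shared_levels(er_alpha, other))
```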
3. Structural Representativity of Protein Families
An important part of the knowledge generated within protein families comes from the availability of experimentally determined protein structures. Recent advances in high-throughput methods for protein expression and production, NMR spectroscopy, and X-ray crystallography have led to a significant rise in the number of protein structures solved.21 Many of these structures are ultimately deposited and made publicly accessible in the Protein Data Bank (PDB), which currently contains over 30,000 entries and continues to grow annually at an almost exponential rate.7 Unfortunately, primarily because of technical difficulties, not all the therapeutically relevant protein superfamilies reviewed above, viz., enzymes, channel receptors, G protein-coupled receptors, and nuclear receptors, are at present equally represented in the PDB. In fact, enzymes currently cover almost 50% of the entire contents of the PDB, compared to the few hundred structures available for nuclear receptors and the handful resolved for G protein-coupled receptors. The general adoption of classification schemes for proteins is essential for quantitatively assessing the structural representativity of target families in the PDB,22 an aspect that has important implications for the applicability of family-directed structure-based approaches to drug discovery. An analysis of the 15,151 enzyme entries found in the PDB as of July 2005 reveals that, despite the almost exponential growth of entries deposited in the PDB, progress in achieving complete occupancy at all levels of the enzyme nomenclature system appears to be relatively slow (Fig. 2). In particular, the number of new enzymes for which representative structures are deposited in the PDB has only increased linearly in the last five years
Figure 2. Comparison between (a) growth of enzyme entries and (b) growth of enzymes structurally represented in the PDB. (Panel a: No. Enzyme Entries versus Year, 1972 to 2004, vertical axis 0 to 16,000. Panel b: No. Enzymes Represented versus Year, 1972 to 2004, vertical axis 0 to 4,000, with a reference level labeled EC.)
with, on average, 97 new enzymes per year. With over 70% of the 3,890 enzymes (more than 2,700) still devoid of experimentally determined structures in the PDB, achieving full occupancy of all the enzymes with an EC number at this linear rate appears to be nearly 30 years off.
4. Annotation and Classification Schemes for Molecules
Contrary to the situation found for the annotation and classification of proteins, where multiple schemes coexist for the different families, it is remarkable to realize that to date, little attention has been paid to the
construction of hierarchical schemes for the structural annotation and classification of large compound collections. One of the earliest efforts in developing a hierarchical classification scheme for large sets of chemical compounds is exemplified by LeadScope.23 LeadScope classifies compounds on the basis of a structural feature hierarchy. Originally, a set of predefined structural features is stored in a template library containing approximately 27,000 features. The structural features chosen for analysis are motivated by substructures typically found in small molecule drug candidates, viz., aromatics, heterocycles, spacer groups, and simple substituents. Each structural feature is assigned a chemical name generally based on the systematic nomenclature developed by Chemical Abstracts, which makes the hierarchy easier for medicinal chemists to navigate. In the end, features are arranged in a hierarchy of 14 main structural classes comprising amino acids, bases/nucleosides, benzenes, carbocycles, carbohydrates, elements, functional groups, heterocycles, naphthalenes, natural products, peptidomimetics, pharmacophores, protective groups, and spacer groups. Each of these main structural classes is then further divided into different subcategories. More recently, researchers from the Novartis Research Foundation reported on their HierS algorithm.24 Initially, HierS identifies all possible ring-delimited substructures within a set of compounds. Molecules are then grouped by shared ring substructures so that common scaffolds obtain higher membership. Once all the scaffolds for a set of compounds are identified, the hierarchical structural relationships between the scaffold structures are established and utilized to navigate compounds in a structurally directed fashion.
Using this approach, a structural classification of natural products can be obtained, from which compound collections rich in natural product scaffold features can be designed.25 A hierarchical classification scheme for chemical structures was recently reported and used to profile a family-directed annotated compound library.26 Purely based on topological features,8 each molecule is uniquely identified with a 5-level Chemical Structure (CS) code. The first level is an integer specifying the number of rings in the core ring system (CRS) of the molecule; the second level is another integer reflecting the total number of ring systems in the molecule; the third level is a unique 7-character Chemical Graph Identifier (CGI) of the molecular framework; the fourth level is a unique CGI of the molecular scaffold; and the fifth level is a unique CGI of the complete molecular structure. As an illustrative example, Fig. 3 shows the CS code for tioconazole.
Level 1 (No. Rings in CRS): 1
Level 2 (No. Ring Systems): 3
Level 3 (Framework CGI): BD41UTG
Level 4 (Scaffold CGI): BDMKRWF
Level 5 (Molecule CGI): L0Q0GFB
CS code: 1.3.BD41UTG.BDMKRWF.L0Q0GFB
Figure 3. Chemical structure code for tioconazole.
Figure 4. Structural classification of drugs.
The establishment of hierarchical classification schemes for chemical structures allows large chemical libraries to be organized systematically into structurally related groups. All molecular scaffolds can now be associated into groups of structurally related molecular frameworks and,
correspondingly, entire series of compounds are naturally grouped in structurally related scaffolds. Using the CS codes described above, the structural classification of a set of 2,762 drugs is presented in Fig. 4. Browsing through this hierarchical classification reveals that of the 701 drugs containing 2 rings as the largest ring system, 161 of them have only this 2-ring system. These 161 drugs can be described with 11 frameworks, one of them (3NNLGVQ) containing 3 different scaffolds.
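The grouping of molecules into scaffolds and of scaffolds into frameworks follows directly from the levels of the CS code. The following Python sketch shows one way this could be done; it is not the published implementation. The names `CSCode`, `parse_cs`, and `group_by_framework` are assumptions, the sketch assumes that all three CGIs are 7 characters long and contain no dots (as in the tioconazole example), and every code other than the tioconazole one is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class CSCode(NamedTuple):
    rings_in_crs: int   # level 1: number of rings in the core ring system
    ring_systems: int   # level 2: total number of ring systems
    framework: str      # level 3: CGI of the molecular framework
    scaffold: str       # level 4: CGI of the molecular scaffold
    molecule: str       # level 5: CGI of the complete structure


def parse_cs(code: str) -> CSCode:
    """Split a 5-level CS code such as '1.3.BD41UTG.BDMKRWF.L0Q0GFB'."""
    rings, systems, framework, scaffold, molecule = code.split(".")
    for cgi in (framework, scaffold, molecule):
        if len(cgi) != 7:  # assumption: every CGI has 7 characters
            raise ValueError(f"expected a 7-character CGI, got {cgi!r}")
    return CSCode(int(rings), int(systems), framework, scaffold, molecule)


def group_by_framework(codes: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Build a framework -> scaffold -> [molecule CGIs] hierarchy."""
    tree: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for code in codes:
        cs = parse_cs(code)
        tree[cs.framework][cs.scaffold].append(cs.molecule)
    return {fw: dict(scaffolds) for fw, scaffolds in tree.items()}


codes = [
    "1.3.BD41UTG.BDMKRWF.L0Q0GFB",  # tioconazole (Fig. 3)
    "1.3.BD41UTG.BDMKRWF.AAAAAA1",  # made-up: same scaffold, different molecule
    "1.3.BD41UTG.CCCCCC2.DDDDDD3",  # made-up: same framework, different scaffold
    "2.1.EEEEEE4.FFFFFF5.GGGGGG6",  # made-up: unrelated framework
]
tree = group_by_framework(codes)
for framework, scaffolds in tree.items():
    print(framework, {s: len(m) for s, m in scaffolds.items()})
```

Counting the keys at each level of the resulting dictionary gives the kind of summary quoted above, for example the number of scaffolds found under a given framework.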
5. Mapping the Molecule-Protein Space
For over a century, pharmacologists have been producing quantitative data on the response of biological systems to the presence of chemical compounds. The ability to measure the affinity of compounds for a particular enzyme or receptor provides a direct link between molecules and proteins and has long been exploited by the pharmaceutical industry to identify active compounds within chemical libraries. These active compounds can then be used as chemical probes for target validation or as initial hits for lead generation and are thus of key relevance to both target and drug discovery research.1 The technological advances produced during the last decade in combinatorial chemistry and high-throughput screening have dramatically increased the number of compounds and the capacity for biochemical testing, respectively, leading to an explosion in both the collection and availability of pharmacological data.27 Unfortunately, only a small portion of the vast number of pharmacological data generated internally within pharmaceutical companies is made accessible to the public domain. Consequently, capturing pharmacological information in databases of chemical structures has been historically limited by the amount of data available from public sources (e.g. drug reports, patents, scientific journals and conferences). As a result, some of the early initiatives focused primarily on known drug molecules for which pharmacological data was available. Among those, the Comprehensive Medicinal Chemistry (CMC) database currently offers biochemical information for over 8,400 pharmaceutical compounds and the Derwent World Drug Index (WDI) contains data on activity and mechanism of action for over 58,000 marketed and development drugs worldwide.
In recent years, both the electronic access to public documentation sources and the increase in the number of data generated and reported have facilitated the construction of pharmacologically annotated chemical databases. For example, the MDL Drug Data Report (MDDR)
includes information on therapeutic action and biological activity for over 132,000 compounds gathered from patent literature, journals, and congresses and the WOMBAT database offers biological information for 104,230 molecules reported in medicinal chemistry journals over the last 30 years.28 The construction of annotated chemical libraries directed to protein families has recently emerged as a knowledge-based means for integrating chemical and biological data and thus for exploring the chemogenomic space.26 Ultimately, the direct biochemical connections established through annotated chemical libraries may contain clues to the existence of different proteins having affinity for similar ligands or the presence of some privileged structures responsible for the activity of compounds across entire target families. The ability to extract knowledge from annotated chemical libraries will be largely determined by the way chemical and biological data are stored. In this respect, the use of the classification schemes for both chemical and biological entities described above is envisaged as an essential aspect in the construction of annotated chemical libraries. In fact, the majority of the annotated chemical libraries currently available contain only the names of the proteins on which molecules have been found active. Although this storage system might be sufficient for retrieving and browsing data, it offers limited possibilities for extracting knowledge. Figure 5 provides an illustrative example of the added value gained by using hierarchical classification schemes instead of storing only protein names. Let us assume that a corporate chemical library contains molecules with pharmacological data for enzymes such as trypsin, thrombin, factor Xa, papain, cathepsin L, and cathepsin S. Unless the user possesses additional knowledge, it would
Trypsin (EC.3.4.21.4), Thrombin (EC.3.4.21.5), Factor Xa (EC.3.4.21.6): EC.3.4.21 Serine proteases
Papain (EC.3.4.22.2), Cathepsin L (EC.3.4.22.15), Cathepsin S (EC.3.4.22.27): EC.3.4.22 Cysteine proteases
EC.3.4.21 and EC.3.4.22: EC.3.4 Peptide hydrolases
Figure 5. Protein names versus hierarchical identifiers in data storage.
be difficult to be aware of any phylogenetic relationship between these enzymes directly from their names. Instead, if these enzymes are also stored in the database with their respective EC numbers, phylogenetic relationships can then be clearly established, allowing for the identification of structural commonalities in compounds not only at the enzyme level, but also at the sub-subclass and subclass levels. It is thus of utmost importance to implement the protein-family classification schemes described above into current storage systems of pharmacological data and ensure not only their proper dissemination among users, but also the users' understanding of these schemes. An example of the potential of using protein-family classification schemes when attempting to extract knowledge from annotated chemical libraries was recently reported.29 Based on the MDL Drug Data Report (MDDR) of pharmacologically active compounds, all ligands were annotated to proteins according to these schemes to derive a ligand classification scheme reflecting the phylogenetic relationships of target families. The ligand classification was then exploited to design target class focused compound libraries. The next step would be to incorporate the ability to derive a protein classification scheme reflecting the phylogenetic relationships of chemical families. To this end, one would need to incorporate a hierarchical classification scheme for chemical structures such as the one described above. Under this framework, Fig. 6 shows the basic pieces for constructing annotated chemical libraries. On the one hand, proteins should be stored using the appropriate annotation under their respective protein-family classification schemes (in this case, nuclear receptors). On the other hand, molecules should be stored using a unique hierarchical identifier. The link between the two entities (molecules and proteins) would be defined by pharmacological data (activity).
The use of a certain criteria would then allow to construct a binary annotation matrix, from which the mapping of the chemogenomic space is established.
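The construction of such a matrix can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the storage scheme of the libraries cited in this chapter: the record layout, the molecule and protein identifiers, the activity values and the cut-off `threshold` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical activity record: (molecule id, protein id, activity value).
# Identifiers and values are placeholders, not data from the chapter.
ActivityRecord = Tuple[str, str, float]


def build_annotation_matrix(
    records: List[ActivityRecord],
    threshold: float,
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (molecules, proteins, matrix) with matrix[i][j] in {0, 1}.

    A cell is 1 when at least one record for that molecule-protein pair
    meets the assumed criterion `value >= threshold`. A 0 means "not
    annotated": the pair was either inactive or never tested.
    """
    molecules = sorted({mol for mol, _, _ in records})
    proteins = sorted({prot for _, prot, _ in records})
    row = {mol: i for i, mol in enumerate(molecules)}
    col = {prot: j for j, prot in enumerate(proteins)}

    matrix = [[0] * len(proteins) for _ in molecules]
    for mol, prot, value in records:
        if value >= threshold:
            matrix[row[mol]][col[prot]] = 1
    return molecules, proteins, matrix


# Illustrative records only (made-up values on an arbitrary activity scale).
records = [
    ("mol_1", "NR1A1", 7.2),
    ("mol_1", "NR1B1", 4.1),
    ("mol_2", "NR1B1", 6.5),
    ("mol_3", "NR1C1", 8.0),
]
molecules, proteins, matrix = build_annotation_matrix(records, threshold=6.0)
```

Note that a zero in this sketch cannot distinguish an inactive pair from an untested one, which is exactly the ambiguity a binary scheme carries.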
6. Exploiting the Chemogenomic Space
The adoption of hierarchical classification schemes for both chemical and biological entities makes the storage and analysis of annotated chemical libraries computationally tractable, reducing it to the management of binary annotation matrices. These binary matrices are often illustrated as heatmaps (Fig. 7), where red indicates that the molecule is annotated to a target (i.e. shows biological activity under a given criterion), while green
Figure 6. Construction of an annotated chemical library. [Schematic: protein records (Nuc Rec: NR_id, NR-name, ...; classified by Group and SubFamily) and molecule records (Molecule: Mol_id, Mol_name, ...; classified by Framework and Scaffold) are linked through ACTIVITY records (Annotation, Act_value), giving a binary ligand-target annotation matrix.]
Figure 7. Heatmap of a binary annotation matrix.
means that the molecule is not annotated to the target (i.e. either lacks any biological activity or has not been tested). The use of a binary scheme to map molecules and proteins emphasizes one of the main limitations of annotated compound libraries, namely the existence of information gaps. Since the majority of data in this type of library is extracted directly from public sources of information, the problem lies basically in the way data are collected and reported in these sources. Owing to limited time and resources, molecules are usually not tested systematically against the entire panel of protein targets to obtain the maximum amount of information possible, but solely against the target of interest. Even when they are screened against multiple targets, usually only a limited amount of the data is made available. These important, yet often overlooked, aspects lead to a lack of data completeness, an issue that may have strong implications for the ultimate validity of the conclusions derived from analyzing the discontinuous data present in most annotated compound libraries. A strategy to identify potential annotation gaps in scaffolds on the basis of the annotations assigned to structurally related scaffolds has recently been proposed.26
With these limitations in mind, the use of binary annotation matrices with entities identified under hierarchical classification schemes provides the basic framework for extracting relevant family-directed knowledge directly from the data present in annotated chemical libraries. For example, clustering methods can be applied to group molecules having similar protein annotation profiles irrespective of their structural characteristics, as well as to group proteins having similar chemical annotation profiles irrespective of their phylogenetic relationships. More importantly, these molecule-protein annotations can then be transferred to the upper levels of the respective classification schemes.
Therefore, the information given by each molecule-protein annotation is directly inherited, on the one hand, along the chemical classification scheme by the corresponding scaffold and framework and, on the other hand, along the protein classification scheme by the corresponding subfamily and family. Accordingly, all protein annotations of molecules containing exactly the same scaffold are collapsed into the corresponding scaffold-protein annotations; in turn, all scaffold annotations of proteins belonging to the same subfamily are collapsed into scaffold-subfamily annotations, and so on. The property of annotation inheritance along chemical and protein classification schemes provides a computationally efficient means of
extracting knowledge from annotated chemical libraries. Analysis of the scaffold-protein binary annotation matrix gives information on scaffold promiscuity and complexity for different proteins, whereas analysis of the scaffold-family binary annotation matrix may provide clues to the existence of potential privileged scaffolds within families. As an illustrative example, Fig. 8 shows the scaffold profiling of three nuclear receptor groups covered in an annotated chemical library directed to the nuclear receptor family.26 The values along the x-axis indicate the number of rings in the core ring system of each scaffold, corresponding to the first level of its Chemical Structure code (see Fig. 3). The tick marks on this axis define the space covered by the scaffolds within each value. For every nuclear receptor group, values on the y-axis refer to the number of annotations from molecules containing the corresponding scaffold on the x-axis. As can be observed, the group of thyroid hormone receptors (1A) is represented by molecular scaffolds of low structural complexity, with at most a single ring in the core ring system. In this respect, the scaffold simplicity of the molecules annotated to this group contrasts with the wide diversity of scaffolds found in the other two nuclear receptor groups. For example,
Figure 8. Scaffold profiling for nuclear receptor groups 1A, 1B, and 1C. [Three stacked panels (1A, 1B, 1C); x-axis: number of rings in the core ring system of the scaffold (0 to 5); y-axis: number of annotations (0 to 25).]
molecules annotated to the group of retinoic acid receptors (1B) show an ample range of structural diversity, with scaffolds containing between one and five rings in the core ring system. Molecules annotated to the group of peroxisome proliferator-activated receptors (1C) also show wide scaffold diversity, although in this case 68% of the scaffolds have fewer than two rings in their core ring system. This analysis indicates that, in terms of the structural characteristics found in the scaffolds of active molecules, molecules annotated to 1A appear to be less complex than those annotated to 1C, and the latter less complex in general than molecules annotated to 1B.
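The annotation inheritance used in this section can be illustrated with a short Python sketch. It is only a schematic reading of the idea, assuming that each entity has a single parent at the next level of its classification scheme; the function name `inherit`, the identifiers and the parent mappings are hypothetical and do not come from the annotated library discussed above.

```python
from typing import Dict, Set, Tuple

# An annotation pairs a chemical entity with a protein entity.
Annotation = Tuple[str, str]


def inherit(
    annotations: Set[Annotation],
    chemical_parent: Dict[str, str],
    protein_parent: Dict[str, str],
) -> Set[Annotation]:
    """Lift annotations one level up along either classification scheme.

    Entities without an entry in a mapping stay at their current level.
    Because the result is a set, several child annotations that map to the
    same parent pair collapse into one binary annotation.
    """
    return {
        (chemical_parent.get(chem, chem), protein_parent.get(prot, prot))
        for chem, prot in annotations
    }


# Hypothetical hierarchy: molecule -> scaffold, protein -> receptor group.
molecule_to_scaffold = {"mol_1": "scaf_A", "mol_2": "scaf_A", "mol_3": "scaf_B"}
protein_to_group = {"NR1B1": "1B", "NR1B2": "1B", "NR1C1": "1C"}

molecule_protein = {("mol_1", "NR1B1"), ("mol_2", "NR1B2"), ("mol_3", "NR1C1")}

# Molecule-protein -> scaffold-protein (chemical axis only).
scaffold_protein = inherit(molecule_protein, molecule_to_scaffold, {})
# Scaffold-protein -> scaffold-group (protein axis only).
scaffold_group = inherit(scaffold_protein, {}, protein_to_group)
```

Using sets keeps every level binary, so the same function can be reapplied to move from scaffolds to frameworks or from groups to families.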
7. Conclusions
The explosion of experimental data in the chemical, biological, and medical fields may allow, in the near future, entities of the different two-dimensional subspaces defining the chemogenomic space to be linked. In particular, in this post-genomic era, the increasing need to integrate chemical and biological data is starting to put pressure on traditionally built chemical databases. In this respect, the use of hierarchical classification schemes for both chemical and biological entities emerges as a key relational means of storing data in annotated chemical libraries. Since annotations can now be inherited by all levels of both classification schemes, the ability to extract knowledge from annotated compound libraries is taken to another dimension. The path to designing novel biochemoinformatics tools for in silico pharmacology is now set.
References
1. Bredel M, Jacoby E. (2004) Chemogenomics: An emerging strategy for rapid target and drug discovery. Nat. Rev. Genet. 5: 262–275.
2. Mestres J. (2004) Computational chemogenomic approaches to systematic knowledge-based drug discovery. Curr. Top. Drug Discov. Dev. 7: 304–313.
3. National Center for Health Statistics (2005) The International classification of diseases, 9th revision, clinical modification. ICD-9-CM, 6th ed. http://www.cdc.gov/nchs/icd9.htm.
4. Kanehisa M, Goto S, Okuno Y, Hattori M. (2004) The KEGG resource for deciphering the genome. Nucl. Acids Res. 32: D277–D280.
5. Gene Ontology Consortium (2000) Gene Ontology: Tool for the unification of biology. Nat. Genet. 25: 25–29.
6. Bairoch A, Apweiler R, Wu C, et al. (2005) The universal protein resource (UniProt). Nucl. Acids Res. 33: D154–D159.
7. Berman HM, Westbrook J, Feng Z, et al. (2000) The protein data bank. Nucl. Acids Res. 28: 235–242.
8. Xu Y-J, Johnson M. (2002) Using molecular equivalence numbers to visually explore structural features that distinguish chemical libraries. J. Chem. Inf. Comput. Sci. 42: 912–926.
9. Hopkins AL, Groom CR. (2002) The druggable genome. Nat. Rev. Drug Discov. 1: 727–730.
10. Garcia-Viloca M, Gao J, Karplus M, Truhlar DG. (2004) How enzymes work: Analysis by modern rate theory and computer simulations. Science 303: 186–195.
11. Tipton K, Boyce S. (2000) History of the enzyme nomenclature system. Bioinformatics 16: 34–40.
12. Bairoch A. (2000) The ENZYME database in 2000. Nucl. Acids Res. 28: 304–305. http://www.expasy.org/enzyme/.
13. Schomburg I, Chang A, Ebeling C, et al. (2004) BRENDA, the enzyme database: Updates and major new developments. Nucl. Acids Res. 32: D431–D433. http://www.brenda.uni-koeln.de/.
14. Webb EC. (ed.) (1992) Enzyme Nomenclature, Academic Press, San Diego. http://www.chem.qmw.ac.uk/iubmb/enzyme/.
15. Kenakin TP, Bond RA, Bonner TI. (1992) Definition of pharmacological receptors. Pharmacol. Rev. 44: 351–362.
16. Humphrey PPA, Barnard EA, Bonner TI, et al. (2000) The IUPHAR Receptor Code. In: The IUPHAR Compendium of Receptor Characterization and Classification, 2nd ed. IUPHAR Media, London, pp. 9–23.
17. Busch W, Saier MH Jr. (2002) The transporter classification (TC) system, 2002. Crit. Rev. Biochem. Mol. Biol. 37: 287–337. http://www.tcdb.org/.
18. Bockaert J, Pin JP. (1999) Molecular tinkering of G protein-coupled receptors. EMBO J. 18: 1723–1729.
19. Horn F, Vriend G, Cohen FE. (2001) Collecting and harvesting biological data: The GPCRDB and NucleaRDB information systems. Nucl. Acids Res. 29: 346–349. http://www.gpcr.org/7m/ for GPCRDB and http://www.receptors.org/NR/ for NucleaRDB.
20. Nuclear Receptors Nomenclature Committee (1999) A unified nomenclature system for the nuclear receptor superfamily. Cell 97: 161–163.
21. Sali A, Glaeser R, Earnest T, Baumeister W. (2003) From words to literature in structural proteomics. Nature 422: 216–225.
22. Mestres J. (2005) Representativity of target families in the Protein Data Bank: Impact for family-directed structure-based drug discovery. Drug Discov. Today 10: 1629–1637. http://cgl.imim.es/pdbrff/.
23. Roberts G, Myatt GJ, Johnson WP, et al. (2000) LeadScope: Software for exploring large sets of screening data. J. Chem. Inf. Comput. Sci. 40: 1302–1314.
24. Wilkens SJ, Janes J, Su AI. (2005) HierS: Hierarchical scaffold clustering using topological chemical graphs. J. Med. Chem. 48: 3182–3193.
25. Koch MA, Schuffenhauer A, Scheck M, et al. (2005) Charting biologically relevant chemical space: A structural classification of natural products (SCONP). Proc. Natl. Acad. Sci. 102: 17272–17277.
26. Cases M, García-Serna R, Hettne K, et al. (2005) Chemical and biological profiling of an annotated compound library directed to the nuclear receptor family. Curr. Top. Med. Chem. 5: 763–772.
27. Walters WP, Namchuk M. (2003) Designing screens: How to make your hits a hit. Nat. Rev. Drug Discov. 2: 259–266.
28. Olah M, Mracec M, Ostopovici L, et al. (2004) WOMBAT: World of Molecular Bioactivity. In: Oprea TI (ed.) Chemoinformatics in Drug Discovery, Wiley-VCH, New York, pp. 223–239.
29. Schuffenhauer A, Zimmermann J, Stoop R, et al. (2002) An ontology for pharmaceutical ligands and its application for in silico screening and library design. J. Chem. Inf. Comput. Sci. 42: 947–955.
3
Natural Product Scaffolds and Protein Structure Similarity Clustering (PSSC) as Inspiration Sources for Compound Library Design in Chemogenomics and Drug Development
Frank J. Dekker,*,† Stefan Wetzel*,† and Herbert Waldmann*,†,a
* Department of Chemical Biology, Max-Planck Institute of Molecular Physiology, Otto-Hahn Str. 11, D-44227 Dortmund, Germany.
† Fachbereich 3, Organic Chemistry, University of Dortmund, Otto-Hahn Str. 6, D-44227 Dortmund, Germany.
a Corresponding author. E-mail: [email protected]; Tel.: +49-231-133-2400; Fax: +49-231-133-2499.
1. Introduction
Knowledge of the protein networks that regulate biological processes is increasing rapidly owing to modern approaches addressing DNA sequence (genomics), protein structure (structural biology) as well as protein expression and protein interactions (proteomics). As a result, many new proteins are being discovered, whereas the functions of these new proteins often remain to be elucidated. Protein function can be studied by genetic means in which, for example, one or more specific gene products, i.e. proteins, are eliminated by knocking out genes. An alternative approach is the use of small molecules to alter the function of genes or proteins.1,2 This approach has been termed “chemical genetics” or “chemogenomics.” Chemogenomics can be considered part of the drug discovery field, because small molecules that modulate gene or protein function might ultimately result in drug candidates. In chemogenomics and drug discovery, the most challenging task is to find small molecule modulators that specifically modulate the protein function of interest. Combinatorial chemistry has emerged as a powerful tool to address this problem by generation of large compound
libraries that can be evaluated for activity on specific proteins. However, initial expectations that the generation of large compound libraries would lead to the discovery of many new hit and lead structures were not met. The design of biologically relevant compound libraries proved to be one of the most crucial factors in obtaining high hit rates. As the universe of conceivable compounds is almost infinite, the decisive question remains: where in structural space can biologically validated starting points be found from which to build compound libraries? Answers to this question might be provided by nature itself. Natural products interact with proteins during their biosynthesis and in their mode of action (poisons, neurotransmitters, etc.). Therefore, natural products can be considered biologically validated starting points for compound library design,3 and the core structures of natural products can be considered “privileged structures”. Privileged structures are structural motifs that are often found in bioactive compounds and that confer on the compounds containing them the ability to bind to multiple proteins.4,5 Screening focussed libraries, whose design is based on natural products or other privileged structures, for activity on a cluster of protein targets provides an efficient strategy for achieving enhanced hit rates. Target clustering can be performed on the basis of similarities in the amino acid sequence, structure or function of proteins. We developed a new concept in which proteins are clustered according to structural similarity in their ligand-sensing cores into a protein structure similarity cluster (PSSC), and known ligands for one member of such a cluster are used as an inspiration source to design focussed compound libraries that are screened for activity on the other members of the cluster.3,6–11
2. Biological Relevance in Compound Library Design
2.1. Compound libraries as sources for small molecule modulators of protein function
Finding small molecules as tools for chemogenomics or as lead compounds in drug development remains one of the major challenges for research at the interface of chemistry and biology. The chances of finding bioactive small molecules in random combinatorial libraries proved to be small. Therefore, several approaches have been developed to increase the chances of discovering biologically active compounds in compound libraries. The design of a compound library proved to be one of the most decisive factors determining
the hit rates. Diversity,12,13 drug-likeness,14–16 and biological relevance3,6–11 have been recognized as important guidelines for compound library design. In the following sections, several methods to increase the biological relevance of compound libraries are discussed.
2.2. Annotated libraries
An interesting strategy to elucidate biological mechanisms using compound libraries has been described by Root et al.17 This group applied a so-called “annotated” compound library, which is a library of compounds with a diverse set of known biological mechanisms and activities. A compound library of 2036 compounds was screened for activity on cancer cell lines. Subsequent analysis of the screening results allowed the determination of previously unknown biological mechanisms and targets in these cancer cell lines. This example shows how known biologically active compounds can be used to study biological targets in new contexts.
2.3. Natural products as inspiration sources for library design
An efficient approach to designing biologically relevant compound libraries is to take natural products as inspiration sources.3,6–11,18,19 Natural products can be considered biologically validated starting points in structural space, since they interact with different proteins in the course of their biosynthesis and when they exert their biological function, e.g. in chemical defence (snake poisons) or communication (hormones, neurotransmitters). An example in which natural products were used effectively as biologically validated starting points for library design has been provided by Pelish et al.20 This group used the core structure of the natural product galanthamine as an inspiration source to synthesize a compound library, as presented in Scheme 1. A scaffold similar to galanthamine was synthesized on the solid phase and equipped with different functional groups.
The resulting compound library was screened for activity on a pathway that shuttles proteins from the endoplasmic reticulum (ER) to the plasma membrane via the Golgi apparatus (GA). This method enabled the identification of an active compound that can be used as a small molecular probe to study this pathway. The authors termed this active compound secramine. This result is remarkable because the natural product galanthamine itself showed no effect on this pathway. This finding supports the view that natural product core structures are biologically validated starting points in
Scheme 1. Natural product (galanthamine) inspired compound library that led to the identification of a small molecular probe (denoted as secramine) to study a pathway that shuttles proteins from the endoplasmic reticulum (ER) to the plasma membrane via the Golgi apparatus (GA). [Structures of galanthamine, secramine and the resin-bound scaffold are not reproduced. Diversity-generating reactions listed in the scheme: 1) Mitsunobu; 2) conjugate addition; 3) acylation, alkylation; 4) imine formation; 5) resin cleavage.]
structural space, which confer biological relevance to compound libraries that are designed around their core structures.
2.4. Library design based on privileged structures
Another strategy to obtain biologically relevant libraries is to take privileged structures as guiding principles. Structural motifs found in natural products can in many cases be considered privileged structures. Furthermore, structural motifs that occur frequently in existing drugs, for example the benzodiazepine scaffold4 or the indole scaffold,21,22 can be considered privileged structures. The concept of privileged structures enables the medicinal chemist to synthesize a focussed compound library based on one scaffold and to screen this library against a variety of targets, yielding more active compounds from one library. The utility of the privileged structure concept is illustrated by compound library synthesis based on the indole scaffold and the biologically active compounds resulting from it. The indole scaffold is found in natural products, for example the neurotransmitter serotonin (5-hydroxytryptamine, 5-HT), and in well-known drugs, for example the
Scheme 2. The indole core structure is present in natural products and drugs such as Indomethacin and can therefore be considered a privileged structure. A resin-capture-release strategy to synthesize an indole-based compound library has been developed by Waldmann and co-workers. [Structures are not reproduced. Labels recoverable from the scheme: capture; selective acylation; TFA/DCE; TFA/DCE, H2O traces.]
nonsteroidal anti-inflammatory drug (NSAID) Indomethacin (Scheme 2). It has been demonstrated that Indomethacin influences multiple biological processes such as the cell cycle, the Wnt signalling pathway, the transcriptional activity of the peroxisome proliferator-activated receptor δ (PPARδ) and angiogenesis, i.e. the formation of new blood vessels from pre-existing ones.23 However, the precise molecular targets in these processes have not been identified unambiguously. A solid-phase library synthesis method based on the Fischer indole synthesis was developed (Scheme 2).21,24 It uses a “resin-capture-release” approach to provide indole-based compounds without linker traces. The synthetic method is compatible with a variety of functional groups in
each building block, i.e. ketones, acid chlorides, and hydrazines. The overall yields ranged from 4% to quantitative and were highest when activating electron-donating substituents were present in the hydrazines. Biological investigations were performed to evaluate the ability of the indole-based compounds to inhibit angiogenesis-related receptor tyrosine kinases. Angiogenesis depends on endothelium-specific receptor tyrosine kinases, in particular vascular endothelial growth factor receptors 1–3 (VEGFR 1–3) and the Tie-2 receptor. The indole-based library of 134 compounds provided 6 inhibitors of angiogenesis-related kinases with IC50 values in the low-micromolar range, although Indomethacin itself shows no activity against these kinases.21 Another interesting feature of these Indomethacin analogs is their activity on multidrug resistance protein 1 (MDR-1) mediated multidrug resistance in cancer cell lines. Members of the indole-based library enhanced the toxicity of the chemotherapeutic agent Doxorubicin in a model system of human glioblastoma cell lines (T98G).22 This example illustrates that privileged structures are biologically validated starting points for compound library design.
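Since hit rates are the yardstick used throughout this chapter, the figure for the indole library can be made explicit. The short sketch below simply divides the number of hits by the number of compounds screened, using the counts quoted above (6 inhibitors from 134 compounds); the helper name `hit_rate` is an illustrative choice, not something defined in the cited work.

```python
def hit_rate(n_hits: int, n_screened: int) -> float:
    """Fraction of screened compounds that were identified as hits."""
    if n_screened <= 0:
        raise ValueError("n_screened must be positive")
    if not 0 <= n_hits <= n_screened:
        raise ValueError("n_hits must lie between 0 and n_screened")
    return n_hits / n_screened


# Indole-based library: 6 kinase inhibitors out of 134 compounds.
indole_rate = hit_rate(6, 134)
print(f"{indole_rate:.1%}")  # prints "4.5%"
```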
3. Natural Product Inspired Compound Library Synthesis
The synthesis of natural product inspired compound libraries using solution and solid-phase chemistry has received increasing attention in recent years, as described in several reviews.25–30 Combinatorial synthesis in solution can be applied effectively, as described in a recent review;27 however, it poses the difficulty of isolating and purifying the library members. This problem can be circumvented by solid-phase synthesis, which allows easy removal of excess reagents. Versatile and high-yielding reactions are most desirable for establishing a multi-step synthesis of a natural product inspired library on the solid phase. Two main strategies can be distinguished: either the natural product analogs are assembled completely on the solid phase from diverse simple building blocks, or natural product scaffold molecules are assembled in solution and diversified on the solid phase. The recently described solid-phase synthesis of 6,6-spiroketals provides an example in which natural product analogs were completely assembled on the solid phase.31 The authors described solid-phase procedures for synthesizing 6,6-spiroketals, which are structural motifs found in many
natural products. The solid-phase synthesis essentially involved 12 linear steps, including two asymmetric boron-mediated aldol reactions which proceed with high enantio- and diastereoselectivity (Scheme 3). Polymer-bound and soluble chiral boron enolates are used for asymmetric induction. Resin cleavage results in spontaneous formation of the spiroketal in an overall yield comparable to that of the corresponding synthesis in solution. Derivatization of functional groups in a natural product scaffold can also be performed effectively on the solid phase. An example of this is the synthesis of a small compound collection (27 compounds) based on the tetrahydroquinoline scaffold.32 A chiral tetrahydroquinoline scaffold was synthesized in solution from 5-hydroxy-2-nitrobenzaldehyde (Scheme 4). The synthesis involved a key asymmetric aminohydroxylation step. This building block was anchored to the solid support with a Wang linker, and diversity was introduced by selective deprotection and derivatization of the protected hydroxyl and amino substituents.
Scheme 3. Asymmetric solid-phase synthesis of compounds with the 6,6-spiroketal skeleton. [Structures are not reproduced. Labels recoverable from the scheme: Wang resin; B(Ipc)2; aldol reaction; (c-C6H11)2BCl; aldol reaction; TBS protection and resin cleavage.]
Scheme 4. Synthesis of a compound library based on the tetrahydroquinoline scaffold. [Structures are not reproduced. Steps labelled in the scheme: resin loading; O-Alloc removal and esterification; NH-Alloc removal and Fmoc-AAx-OH coupling; NH-Fmoc removal, amide formation and resin cleavage; product: 27-membered library.]
These examples show that methods are available for synthesizing complex natural product analogs on the solid phase.
4. Target Clustering as Strategy in Drug Discovery
4.1. Target clustering
Pharmaceutically relevant protein targets can be grouped into target clusters (also called target families) based on the amino acid sequence, structure and/or function of the proteins. Knowledge of ligands that bind to one member of the target cluster can be taken as an inspiration source to design focussed compound libraries that are screened for binding to all members of the target cluster. The underlying principle of such a strategy is the presumption that similar ligands bind to similar targets.33,34 It is obvious that such a presumption is true for highly similar ligands and
targets; however, more challenging results can be expected if the similarity is lower. This raises the questions of which level of similarity is required and how similarity should be defined for a meaningful clustering of protein targets. Proteins can be clustered according to similarities in amino acid sequence, structure and/or function, which can be considered evolutionary relationships between proteins, and a certain interdependence between these properties can be expected. In the following sections, the applicability of these properties to clustering proteins in order to obtain enhanced hit rates from focussed compound libraries is discussed.
4.2. Target clustering based on structural and functional similarity
The family of tyrosine kinases provides an example of proteins that have been classified based on their functional behavior, i.e. their ability to catalyze the phosphoryl transfer reaction from a phosphate donor (ATP) to a tyrosine. For instance, Waldmann and co-workers investigated a cluster of tyrosine kinases and screened a compound library that was designed around a natural product that binds to one of the clustered kinases.35–38 According to the established grouping of kinases, the FGF1R, IGF1R, VEGFR-2, Tie-2 and Her-2/Neu receptor tyrosine kinases were clustered based on structural and functional similarity (Fig. 1). Nakijiquinones are naturally occurring inhibitors of the Her-2/Neu receptor tyrosine kinase and thus provide a biologically validated starting point for compound library design. A compound library was synthesized based on the nakijiquinone scaffold as presented in Fig. 2.
This focussed compound library with 74 library members was screened and provided 7 hits that showed activity against one or more kinases.35 This result shows that natural products that bind to one member of a cluster of functionally and structurally related proteins (in this case tyrosine kinases) can serve as biologically validated starting points to design compound libraries that address other members of the cluster. Structural similarity in protein domains appears to be a useful guideline for target clustering. Moreover, recent results in bioinformatics and structural biology indicate that protein domain fold is much more conserved than amino acid sequence and functional behavior.39–43 We proposed a concept in which ligand-sensing cores with similar three-dimensional structures can be clustered into a protein structure similarity cluster. Knowledge about known natural product ligands for members of such a cluster can be employed to guide the design of compound libraries that address the other proteins of the cluster. This provides an alternative approach
Figure 1. Superimposition of the structures of tyrosine kinases, Tie-2 (yellow), VEGFR-3 (red) and IGF1R (blue).
for target clustering as presented in Fig. 3. In the following section, we explain why target clustering primarily based on structural similarity provides chances to improve library design and what the scope and limitations of this concept are.
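The grouping step behind target clustering can be illustrated with a small Python sketch. It assumes that a pairwise similarity score between targets is already available and groups targets by single linkage at a chosen cut-off. The single-linkage rule, the function name `cluster_targets`, the target names, the scores and the cut-off are all hypothetical choices made for illustration; they are not the procedure or the data of the authors.

```python
from typing import Dict, FrozenSet, List


def cluster_targets(
    names: List[str],
    similarity: Dict[FrozenSet[str], float],
    cutoff: float,
) -> List[List[str]]:
    """Group targets by single linkage: two targets share a cluster when
    a chain of pairwise similarities >= cutoff connects them. Pairs that
    are missing from `similarity` are treated as not similar."""
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for pair, score in similarity.items():
        first, second = sorted(pair)
        if score >= cutoff:
            parent[find(first)] = find(second)

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up similarity scores in [0, 1] for four placeholder targets.
names = ["target_A", "target_B", "target_C", "target_D"]
similarity = {
    frozenset(("target_A", "target_B")): 0.90,
    frozenset(("target_B", "target_C")): 0.75,
    frozenset(("target_A", "target_C")): 0.40,
    frozenset(("target_C", "target_D")): 0.20,
}
clusters = cluster_targets(names, similarity, cutoff=0.7)
```

In this toy case target_A and target_C end up in the same cluster through target_B even though their direct score is below the cut-off. A ligand known for any one member of a cluster would then be the inspiration source for a library screened against the remaining members.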
5. PSSC as Guiding Principle for Compound Library Design
5.1. Protein structure similarity clustering (PSSC)
Many proteins consist of different domains, which are distinct modules of proteins that are formed by folding of regions of a polypeptide chain into distinct, stable and compact secondary structures. Although the number of possible protein domains is huge, nature has restricted the structural complexity of protein domains to a limited number of folds.39–48
[Figure 2 structure drawings omitted. Nakijiquinone A: R = H; Nakijiquinone B: R = iPr; Nakijiquinone C: R = CH2OH; Nakijiquinone D: R = CH(CH3)OH. Activities listed for the library hits: VEGFR-2, IC50 = 8 µM; Tie-2, IC50 = 18 µM; Tie-2, IC50 = 18 µM; Tie-2, IC50 = 9 µM; VEGFR-3, IC50 = 3 µM; IGF1R, IC50 = 0.5 µM.]
Figure 2. Hits for different tyrosine kinases found in the nakijiquinone library.
Figure 3. Approaches to cluster protein targets for screening of small molecule libraries. Currently, protein targets are clustered based on functional or amino acid sequence similarity (left). We propose to cluster proteins purely based on structural similarity of protein cores and to apply this clustering to compound library design (right).
Many protein domains show similar folds either due to functional or biophysical constraints on the secondary structure or as a result of divergent evolution to a stage in which sequence similarity can no longer be recognized. Possibly many proteins have descended from relatively few ancestral domains, whose coding sequences were duplicated, diverged or recombined during evolution. The SCOP (Structural Classification of Proteins) database predicts 800 folds corresponding to 20,619 entries in the Protein Data Bank (PDB).49,50 Different fold comparison methods disagree on the total number of folds. Depending on the models and approximations used, the number of folds ranges from 1,000 to 10,000.40−43 Another feature of the conservatism in protein domain folding is that the distribution of folds is highly non-homogeneous, with some folds occurring abundantly and some rarely.40,41 It seems certain that a great majority of protein domains can be attributed to the roughly 1,000 most commonly observed folds. The ligand binding or catalytic sites are the most relevant parts of a protein domain for the development of small molecules as modulators of protein function. There is evidence that proteins with conserved folds often also have their functional sites at the same topological location.51,52 In some cases a remarkable conservatism in functional sites can be observed. This is true for the example described later in this review, which concerns the similarity of Cdc25A phosphatase, acetylcholinesterase (AChE) and 11β-hydroxysteroid dehydrogenases (11βHSD) (Fig. 9). Nevertheless, it should be stressed that the correlation patterns of amino acid sequence, protein fold and protein function remain a matter of debate.
Moreover, a vast number of specific functions can be carried out by the limited number of protein domains due to the high amino acid diversity of proteins with similar folds.51−54 These findings led to the development of a novel strategy to exploit nature’s structural conservatism in protein architecture for the identification of small molecule modulators of protein function (Fig. 4). We introduced a concept that uses Protein Structure Similarity Clustering (PSSC) as a guiding principle for the selection of biologically validated starting points for compound library synthesis.6−11 The structures of natural products or other bioactive molecules that bind to a member of a cluster are taken as guiding structures for compound library design. These libraries are screened against all members of one cluster. Proteins with a high structural similarity and a low sequence similarity are the most interesting cases for PSSC, because they represent more distant relationships. These cases are outside the scope of classical approaches. In a given cluster that contains proteins with diverse amino acid sequences, a significant diversity in
Figure 4. Nature’s conservatism and diversity found in a protein structure similarity cluster. The conservatism found in the cluster can be used for the selection of pre-validated guiding structures for compound library design. The chemical diversity of the compound library is needed to address the biological diversity found in the binding sites of the cluster members.
functional sites can be expected. Therefore compound libraries addressing a cluster should display sufficient chemical diversity in order to match the biological diversity of the protein cluster (Fig. 4). Potential member proteins of a cluster based on structural similarity can be found using fold comparison servers that are available online, namely DALI/FSSP55 and the Combinatorial Extension (CE) method.56,57 Recently, an interesting review on the evaluation of protein fold comparison servers has been published, which ranked DALI/FSSP and CE among the best fold comparison servers.58 These servers require a PDB code or 3D structural data in the PDB format as input and yield a list of proteins ranked by decreasing similarity. The pharmaceutically relevant proteins resulting from these searches that display low sequence identity (< 20%), yet a certain structural similarity (rmsd < 5 Å in the ligand sensing cores), are selected and analyzed in detail. Further analysis is necessary because the similarity calculated by DALI/FSSP or CE is not weighted by where the similar region lies relative to the ligand binding site. Therefore, a found similarity could in some cases be attributed to a domain fairly remote
from the binding site and thus will not influence the binding properties at all. Therefore, the ligand sensing cores from the chosen protein structures are extracted to ensure that the initially discovered structural similarity is indeed located close to the binding site. These ligand sensing cores are spherical cut-outs from the protein structures, each 20 to 30 Å in diameter and centered on the binding site. The core structures are then aligned against all other core structures of the cluster using the CE algorithm. The cores that show sufficient similarity in the 3D structures as well as in the topological location of their functional sites are assigned to a PSSC.7
5.2. PSSC based reanalysis of the development of leukotriene A4 hydrolase inhibitors
Analysis of literature examples provided support for the viability of the PSSC concept and has been described previously.8−11 One example is the development of leukotriene A4 hydrolase (LTA4H) inhibitors. LTA4H is a bifunctional zinc metalloenzyme that is responsible for the vinylogous hydrolysis of the leukotriene epoxide LTA4 into LTB4. LTB4 is a potent chemoattractant and immune modulator involved in inflammation, immune responses, host defence against infection and platelet activating factor (PAF)-induced shock. The critical role of LTA4H in LTB4 generation makes it an attractive drug target. Inhibitors of LTA4H may be used to alleviate inflammatory diseases such as rheumatoid arthritis and asthma, and even in cancer prevention and therapy.59 The presence of the zinc-binding motif (HEXXH-X18-E) in LTA4H prompted investigations on its relationship to zinc-binding metalloproteases like thermolysin. Indeed, the naturally occurring aminopeptidase inhibitor Bestatin inhibits LTB4 biosynthesis.60 Moreover, the angiotensin-converting enzyme (ACE) inhibitor Captopril also inhibits LTA4H.
These findings inspired combinatorial variation of these lead structures, which led to the synthesis of potent and selective LTA4H inhibitors (Fig. 5).61,62 Re-evaluation of these results applying the PSSC concept shows that LTA4H, ACE and thermolysin exhibit significant structural resemblance (Fig. 6) although the sequence identity is low (about 7%). Moreover, the catalytic sites occupy a similar topological location. Thus, these proteins would have been clustered into a protein structure similarity cluster (PSSC). A natural product like Bestatin or a drug like Captopril, both targeting members of the cluster, would have been considered as guiding structures in compound library design.
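The selection steps of Section 5.1 (spherical ligand sensing cores, followed by a filter on sequence identity and rmsd) can be illustrated with a minimal Python sketch. It is illustrative only and assumes that atom coordinates are already available as (x, y, z) tuples in Å and that the cores have already been aligned, for example with CE. The names `extract_core`, `Candidate` and `pssc_candidates` are hypothetical helpers and not part of DALI/FSSP or CE. The sequence identity of about 7% comes from the text; the rmsd values and the two extra entries are invented placeholders.

```python
import math
from dataclasses import dataclass


def extract_core(atoms, center, diameter=25.0):
    """Spherical cut-out: keep atoms within diameter/2 (Å) of the binding-site center."""
    radius = diameter / 2.0
    return [xyz for xyz in atoms if math.dist(xyz, center) <= radius]


@dataclass
class Candidate:
    name: str
    seq_identity: float  # fraction of identical residues, 0.07 means 7%
    core_rmsd: float     # rmsd in Å over the aligned ligand sensing cores


def pssc_candidates(hits, max_identity=0.20, max_rmsd=5.0):
    """Keep proteins with low sequence identity (< 20%) yet similar cores (rmsd < 5 Å)."""
    return [h for h in hits
            if h.seq_identity < max_identity and h.core_rmsd < max_rmsd]


# Toy coordinates (Å): only the first two atoms lie within 12.5 Å of the center.
atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
core = extract_core(atoms, center=(0.0, 0.0, 0.0), diameter=25.0)

# Hypothetical search results for an LTA4H query; rmsd values are placeholders.
hits = [
    Candidate("ACE", seq_identity=0.07, core_rmsd=3.9),
    Candidate("thermolysin", seq_identity=0.07, core_rmsd=4.2),
    Candidate("close homolog", seq_identity=0.85, core_rmsd=0.8),
    Candidate("unrelated fold", seq_identity=0.05, core_rmsd=9.5),
]
selected = [h.name for h in pssc_candidates(hits)]
print(len(core), selected)
```

With these placeholder values the filter keeps ACE and thermolysin, drops the close homolog (already covered by classical sequence-based clustering) and drops the structurally dissimilar entry. The comparison of the topological location of the functional sites, which the text also requires, is not modelled here.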
Compound data recovered from Figure 5 (structure drawings omitted):
A, Bestatin: Ki (peptidase activity) = 0.2 µM; IC50 (epoxide hydrolase activity) = 4 µM
B, Captopril: Ki (peptidase activity) = 0.1 µM; IC50 (epoxide hydrolase activity) = 14 µM
C: Ki (peptidase activity) = 0.046 µM
D: Ki (peptidase activity) = 0.018 µM; IC50 (epoxide hydrolase activity) = 0.2 µM
E: Ki (peptidase activity) = 0.002 µM; IC50 (epoxide hydrolase activity) = 0.15 µM
Figure 5. Bestatin and captopril derived inhibitors of LTA4 hydrolase.
5.3. PSSC based reanalysis of the development of nuclear hormone receptor ligands
Another example in which literature results were reanalyzed in view of the PSSC concept concerns the development of ligands for the farnesoid X receptor. The farnesoid X receptor is a transcriptional sensor for bile acids, the primary products of cholesterol metabolism, and plays an important role in lipid homeostasis. The farnesoid X receptor was, until recently, an orphan receptor, which means that no specific ligands existed for this receptor. Selective ligands for this receptor have been found in natural product libraries described by Nicolaou et al.63 The group of Nicolaou developed solid phase synthesis methods to make combinatorial libraries based on a benzopyran core structure.64 A 10,000-membered combinatorial library based on the benzopyran core structure was synthesized65 and screened for activity on the farnesoid X receptor. The first specific ligands for the
Figure 6. Superimposition of the crystal structures of the catalytic domains of LTA4 hydrolase (blue), angiotensin-converting enzyme (red) and thermolysin (yellow), each with bound zinc ion (coloured accordingly).
farnesoid X receptor were found in this library. These ligands were used in a chemical genetic analysis to unravel the farnesoid X receptor functions in lipid metabolism.66 The farnesoid X receptor is a member of the class of nuclear hormone receptors, which have key roles in development and homeostasis, as well as in many diseases like obesity, diabetes and cancer.67,68 The farnesoid X receptor shows structural similarity to the estrogen receptor β (ERβ), which mediates a broad spectrum of physiological functions such as regulation of reproduction, modulation of bone density, cholesterol transport and breast cancer.69 The farnesoid X receptor also shows similarity with the peroxisome proliferator-activated receptor γ (PPARγ), which is involved in fat metabolism, inflammatory and immune responses.70 The estrogen receptor β (ERβ), the peroxisome proliferator-activated receptor γ (PPARγ) and the farnesoid X receptor (FXR) can be clustered in a
protein structure similarity cluster. These receptors show a similar fold as presented in Fig. 7; however, the sequence similarities are less than 20%. In the PSSC concept, we envisioned that there is a high chance that proteins in a PSSC recognize derivatives sharing the same natural product core structure. The natural product genistein (Fig. 8) is active at both the ERβ and PPARγ receptors71 and the drug troglitazone (Fig. 8) is active at the PPARγ receptor.72 The core structures of these compounds show remarkable similarity to the benzopyran core structure (Fig. 8). Application of PSSC to find ligands for the nuclear hormone receptors would have indicated the use of the benzopyran core structure as a guiding principle for library synthesis. This example also provides support for the applicability
Figure 7. Superimposition of the crystal structures of the ligand-binding domains of ERβ, PPARγ, and FXR, each with bound ligand. ERβ with genistein (blue), PPARγ with rosiglitazone (red), and FXR with ligand E (Fig. 8) (yellow).
Compound data recovered from Figure 8 (structure drawings omitted):
A, genistein: ligand for ERβ and PPARγ
B, troglitazone: ligand for PPARγ
C: farnesoid X receptor ligand, EC50 = 5–10 µM
D: farnesoid X receptor ligand, EC50 = 0.188 µM
E: farnesoid X receptor ligand, EC50 = 0.025 µM
Figure 8. Natural, non-natural and synthetic ligands for ERβ, PPARγ, and FXR receptors.
of the PSSC concept for de novo development of inhibitors for proteins of a similarity cluster.
5.4. Application of PSSC for de novo ligand development for the protein cluster Cdc25A phosphatase-acetylcholinesterase-11β-hydroxysteroid dehydrogenase
The PSSC concept was applied for the first time in de novo compound library design for a cluster of the enzymes Cdc25A, acetylcholinesterase (AChE) and 11β-hydroxysteroid dehydrogenase types 1 and 2 (11βHSD1 and 11βHSD2). These enzymes were assigned to a PSSC because their ligand sensing cores show remarkable structural resemblance despite their low sequence similarity (5–8%), as shown in Fig. 9. Moreover, the central
catalytic residues of Cdc25A (Cys430) and AChE (Ser200) occupy similar spatial locations. Also, the catalytic residues of both 11βHSD isoenzymes occupy similar positions in space with respect to the catalytically important functional groups (the sulfur of Cys430 in Cdc25A and the phenolic hydroxyl group of Tyr183 in 11βHSD1 and 2). The clustered enzymes represent viable drug targets for the treatment of various diseases. Cdc25A regulates cell cycle progression at the G1→S checkpoint by dephosphorylating the Cdk2/cyclin complex. Thus, it may be an interesting target for antiproliferative drug design.73−75 AChE hydrolyzes the neurotransmitter acetylcholine and thereby terminates impulse transmission at cholinergic synapses. Inhibition of this enzyme improves the signal intensity in the synapse and therefore AChE is targeted in the treatment of myasthenia gravis, glaucoma and Alzheimer’s disease.76
Figure 9. Superimposition of the catalytic cores of Cdc25A (red), AChE (blue) and 11βHSD1 (green). The key catalytic amino acids Cys430 (Cdc25A), Ser200 (AChE) and Tyr183 (11βHSD1) occupy similar spatial locations.
The 11βHSD type 1 catalyzes the reduction of a keto function in cortisone to a hydroxyl function in cortisol. Cortisol can bind to the glucocorticoid receptor and thus initiate translocation of the ligand-receptor complex to the nucleus where it will stimulate transactivation. Consequently, 11βHSD type 1 represents a target to suppress glucocorticoid effects, which might provide therapeutic potential for the treatment of obesity, metabolic syndrome, type 2 diabetes and cognitive dysfunction.77−81 The 11βHSD type 2, however, catalyzes only the oxidation of cortisol and inhibition of this enzyme leads to sodium retention, which results in hypertension.82 Therefore, isoenzyme specificity is a major prerequisite for the clinical use of 11βHSD1 inhibitors. According to the proposed concept, a natural product that binds to one of the PSSC member proteins was selected as “leitmotiv” for the generation of a focused compound library. A naturally occurring inhibitor of Cdc25A is the sesterterpene dysidiolide (Fig. 10).83 This compound was selected as a starting point for library synthesis. Earlier investigations18,84 and literature reports on the phosphatase-inhibiting activity of related natural products73 suggested that the γ-hydroxybutenolide moiety would be the active part for inhibiting phosphatases. A 147-membered compound collection of γ-hydroxybutenolides and closely related α,β-unsaturated five-membered lactones was synthesized and screened for inhibition of the enzymes Cdc25A, AChE, 11βHSD1 and 11βHSD2.7 Compounds that displayed IC50 values of ≤ 10 µM were considered as hits (Fig. 10). According to these guidelines, 42 out of 147 compounds qualified as hits in the Cdc25A assay. The most potent compound had an IC50 value of 350 nM, which is significantly lower than the reported value for dysidiolide (9.4 µM).84
Moreover, the compound library also contained three AChE inhibitors with IC50 values of 1.3–4.5 µM; three 11βHSD1 inhibitors with IC50 values of 7.8–10 µM; and four 11βHSD2 inhibitors with IC50 values of 2.4–6.7 µM. Remarkably, a pronounced degree of selectivity was observed for individual enzymes and also for the isoenzymes 11βHSD1 and 11βHSD2, as presented in Fig. 10. These examples show that a hit rate of approximately 2–3% can be obtained for enzymes that were identified as similar by PSSC.
5.5. Position of the PSSC concept in drug development and chemogenomics
The PSSC concept provides a conceptually new principle to inspire compound library design for chemical biology and medicinal chemistry
Compound data recovered from Figure 10 (structure drawings omitted):
A, dysidiolide: Cdc25A IC50 = 9.4 µM
B: Cdc25A IC50 = 0.35 µM; AChE IC50 > 20 µM; 11βHSD1 IC50 14 µM; 11βHSD2 IC50 2.4 µM
C: Cdc25A IC50 = 45 µM; AChE IC50 > 20 µM; 11βHSD1 IC50 = 10 µM; 11βHSD2 IC50 = 95 µM
D: Cdc25A IC50 = 1.8 µM; AChE IC50 = 20 µM; 11βHSD1 IC50 = 19 µM; 11βHSD2 IC50 = 6.7 µM
Figure 10. Analogs of the naturally occurring Cdc25A inhibitor dysidiolide screened for binding to the PSSC member enzymes Cdc25A, AChE and 11βHSD1/2 (IC50 values are given).
research. Currently, protein targets are predominantly clustered based on amino acid sequence or functional similarities. Target clustering based on structural similarity enables one to explore more distant relationships between proteins compared to the classical clustering methods. The examples from literature and the first application of PSSC clustering in the discovery of inhibitors for proteins from the Cdc25A, AChE and 11βHSD1/2 cluster demonstrate that application of target clustering based on protein structure similarity in conjunction with natural
product inspired compound library synthesis provides increased hit rates at comparatively small library size.7 Alternatively, existing drugs or drug-like compounds containing “privileged structures” can also be used as a starting point for library design using the PSSC concept. Application of PSSC is not limited to existing crystal structures of proteins. It could be applied to structures derived from homology models as well. The homology models used for the 11βHSD type 1 and 2 enzymes performed well in the first application of PSSC for compound library design.7 Furthermore, it should be noted that the concept has not been developed to make predictions on structural complementarity or the appropriate orientation of functional groups in the binding site. The PSSC concept will be useful in the early stage of drug development as a first abstracting rationale to select natural products as biologically validated starting points for library design. After initial natural product selection, other library design methods, for example ligand docking in the binding site, may further improve the quality of the library. Further refinement of the identified hit structures in medicinal chemistry programs to optimize selectivity and potency remains necessary. The PSSC concept may open up new opportunities for research in the developing field of “chemical genomics.” Central to this field is the identification of small molecule lead-like compounds that bind to a gene family product (i.e. a protein). Such small molecules can subsequently be used to elucidate the function of other gene products/proteins of the same gene family. The PSSC concept may broaden this approach by considering more distantly related genes and proteins.
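The hit rates quoted in Section 5.4 follow directly from the reported hit counts and the library size. The short Python sketch below redoes that arithmetic and applies the IC50 ≤ 10 µM hit criterion to the values listed for compound D in Fig. 10. The function names `hit_rate` and `is_hit` are illustrative and not taken from the original work, and the dictionary keys are plain ASCII stand-ins for the enzyme names.

```python
def hit_rate(n_hits, library_size):
    """Hit rate in percent."""
    return 100.0 * n_hits / library_size


def is_hit(ic50_um, threshold_um=10.0):
    """Hit criterion used in the text: IC50 of at most 10 µM."""
    return ic50_um <= threshold_um


# Hit counts reported for the 147-membered library (Section 5.4).
library_size = 147
hit_counts = {"Cdc25A": 42, "AChE": 3, "11bHSD1": 3, "11bHSD2": 4}
rates = {target: round(hit_rate(n, library_size), 1)
         for target, n in hit_counts.items()}
print(rates)

# Nakijiquinone-inspired kinase library: 7 hits among 74 members.
kinase_rate = round(hit_rate(7, 74), 1)
print(kinase_rate)

# IC50 values (µM) listed for compound D in Fig. 10.
compound_d = {"Cdc25A": 1.8, "AChE": 20.0, "11bHSD1": 19.0, "11bHSD2": 6.7}
d_hits = [target for target, ic50 in compound_d.items() if is_hit(ic50)]
print(d_hits)
```

On these numbers the Cdc25A hit rate is about 28.6%, while AChE, 11βHSD1 and 11βHSD2 come out at 2.0%, 2.0% and 2.7%, which matches the 2–3% quoted in the text for the other cluster members. Compound D counts as a hit for Cdc25A and 11βHSD2 only, one instance of the selectivity noted above.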
6. Conclusions
The discovery of many new proteins presents the challenge to find small molecules that specifically modulate their functions. Compound libraries can provide sources to find such small molecule modulators. Design of biologically relevant libraries proves to be crucial in the search for such compounds. One of the methods to design biologically relevant compound libraries is to take natural products as a design inspiration source. Natural products are biologically pre-validated and evolutionarily selected starting points in chemical space, because they interact with different proteins during their life cycle. Many natural product cores can be considered as privileged structures, which can also be found in existing drugs or other
bioactive compounds. Privileged structures can also provide a validated basis to design biologically relevant compound libraries. Recently, many reviews have appeared concerning natural product inspired compound libraries, denoting a renaissance of natural products as a source of protein modulators. An increasing number of methods to synthesize natural product inspired libraries on the solid phase are being developed. Two main library synthesis strategies can be distinguished. Either natural product analogues are synthesized completely on the solid phase from diverse simple building blocks, or natural product scaffolds are synthesized in solution and afterwards diversified on the solid phase. Clustering of protein targets provides a strategy to reach enhanced hit rates for related targets. Classical guidelines to cluster protein targets are amino acid sequence and/or functional similarities. We proposed and used a novel concept to cluster protein targets based on structural similarity to overcome limitations of classical target clustering approaches. The structures of natural products or non-natural products that bind to a member of a cluster can be used as guiding structures for compound library design. Thus, we developed a guiding principle to select natural products as biologically validated starting points to design compound libraries that address clusters of targets. This concept integrates the use of natural products as pre-validated starting points and target oriented synthesis supported by protein structure similarity clustering (PSSC). Application of the PSSC concept for de novo ligand development performed well in its exploratory stage and has proved to be a powerful tool in the design of compound libraries yielding selective protein modulators.
Acknowledgments
This work was financially supported by the Max-Planck-Gesellschaft, the Deutsche Forschungsgemeinschaft and the Netherlands Organization for Scientific Research (NWO) (TALENT-stipendium for FJD).
References
1. Stockwell BR. (2000) Trends Biotechnol. 18:449–455.
2. Stockwell BR. (2004) Nature 432:846–854.
3. Breinbauer R, Vetter IR, Waldmann H. (2002) Angew. Chem. Int. Ed. 41:2878–2890.
4. Evans BE, Rittle KE, Bock MG, et al. (1988) J. Med. Chem. 31:2235–2246.
5. Mueller G. (2003) In: Drug Disc. Today 681–691.
6. Koch MA, Breinbauer R, Waldmann H. (2003) Biol. Chem. 384:1265–1272.
7. Koch MA, Wittenberg L, Basu S, et al. (2004) Proc. Natl. Acad. Sci. 101:16721–16726.
8. Koch MA, Waldmann H. (2004) In: H Kubinyi, G Müller (eds), Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective, Wiley-VCH. 377–403.
9. Koch MA, Waldmann H. (2005) Drug Disc. Today 10:471–483.
10. Balamurugan R, Dekker FJ, Waldmann H. (2005) Mol. BioSyst. 1:1–11.
11. Dekker FJ, Koch MA, Waldmann H. (2005) Curr. Opin. Chem. Biol. 9:232–239.
12. Burke MD, Schreiber SL. (2004) Angew. Chem. Int. Ed. 43:46–58.
13. Schreiber SL. (2000) Science 287:1964–1969.
14. Sadowski J, Kubinyi H. (1998) J. Med. Chem. 41:3325–3329.
15. Ajay, Walters WP, Murcko MA. (1998) J. Med. Chem. 41:3314–3324.
16. Walters WP, Ajay, Murcko MA. (1999) Curr. Opin. Chem. Biol. 3:384–387.
17. Root DE, Flaherty SP, Kelly BP, Stockwell BR. (2003) Chem. Biol. 10:881–892.
18. Brohm D, Metzger S, Bhargava A, et al. (2002) Angew. Chem. Int. Ed. 41:307–311.
19. Newman DJ, Cragg GM, Snader KM. (2003) J. Nat. Prod. 66:1022–1037.
20. Pelish HE, Westwood NJ, Feng Y, et al. (2001) J. Am. Chem. Soc. 123:6740–6741.
21. Rosenbaum C, Baumhof P, Matischek R, et al. (2004) Angew. Chem. Int. Ed. 43:224–228.
22. Rosenbaum C, Roehrs S, Mueller O, Waldmann H. (2005) J. Med. Chem. 48:1179–1187.
23. Jones MK, Wang H, Peskar BM, et al. (1999) Nature Med. 5:1418–1423.
24. Rosenbaum C, Katzka C, Marzinzik A, Waldmann H. (2003) Chem. Commun. 1822–1823.
25. Hall DG, Manku S, Wang F. (2001) J. Comb. Chem. 3:125–150.
26. Nicolaou KC, Pfefferkorn JA. (2001) Pept. Sci. 60:171–193.
27. Boger DL, Desharnais J, Capps K. (2003) Angew. Chem. Int. Ed. 42:4138–4176.
28. Ganesan A. (2004) Curr. Opin. Biotechn. 15:584–590.
29. Koehn FE, Carter GT. (2005) Nature 4:206–220.
30. Boldi AM. (2004) Curr. Opin. Chem. Biol. 8:281–286.
31. Barun O, Sommer S, Waldmann H. (2004) Angew. Chem. Int. Ed. 43:3195–3199.
32. Couve-Bonnaire S, Chou DTH, Gan Z, Arya P. (2004) J. Comb. Chem. 6:73–77.
33. Freye SV. (1999) Chem. Biol. 6:R3–R7.
34. Jacoby E, Schuffenhauer A, Floersheim P. (2003) Drug News Perspect. 16:93–102.
35. Kissau L, Stahl P, Mazitschek R, et al. (2003) J. Med. Chem. 46:2917–2931.
36. Stahl P, Waldmann H. (1999) Angew. Chem. Int. Ed. 38:3710–3713.
37. Stahl P, Kissau L, Mazitschek R, et al. (2001) J. Am. Chem. Soc. 123:11586–11593.
38. Stahl P, Kissau L, Mazitschek R, et al. (2002) Angew. Chem. Int. Ed. 41:1174–1178.
39. Grishin NV. (2001) J. Struct. Biol. 134:167–185.
40. Grant A, Lee D, Orengo C. (2004) Genome Biol. 5:107.
41. Koonin EV, Wolf YI, Karev GP. (2002) Nature 420:218–223.
42. Leonov H, Mitchell JSB, Arkin IT. (2003) Proteins 51:352–359.
43. Coulson AFW, Moult J. (2002) Proteins 46:61–71.
44. Ponting CP, Schultz J, Copley RR, et al. (2000) Adv. Protein Chem. 54:185–244.
45. Apic G, Gough J, Teichmann SA. (2001) Bioinformatics 17:S83–S89.
46. Chothia C, Gough J, Vogel C, Teichmann SA. (2003) Science 300:1701–1703.
47. Liu J, Rost B. (2003) Curr. Opin. Chem. Biol. 7:5–11.
48. Lee D, Grant A, Buchan D, Orengo C. (2003) Curr. Opin. Chem. Biol. 13:359–369.
49. Murzin AG, Brenner SE, Hubbard T, Chothia C. (1995) J. Mol. Biol. 247:536–540.
50. Andreeva A, Howorth D, Brenner SE, et al. (2004) Nucl. Acids Res. 32:D226–D229.
51. Stark A, Shkumatov A, Russell RB. (2004) Structure 12:1405–1412.
52. Russell RB, Sasieni PD, Sternberg MJE. (1998) J. Mol. Biol. 282:903–918.
53. Jones S, Thornton JM. (2004) Curr. Opin. Chem. Biol. 8:3–7.
54. Anantharaman V, Aravind L, Koonin EV. (2003) Curr. Opin. Chem. Biol. 7:12–20.
55. Holm L, Sander C. (1997) Nucl. Acids Res. 25:231–234.
56. Shindyalov IN, Bourne PE. (1998) Protein Eng. 11:739–747.
57. Shindyalov IN, Bourne PE. (2001) Nucl. Acids Res. 29:228–229.
58. Novotny M, Madsen D, Kleywegt GJ. (2004) Proteins: Struct. Funct. Bioinform. 54:260–270.
59. Chen X, Wang S, Wu N, Yang CS. (2004) Curr. Cancer Drug Targets 4:267–283.
60. Orning L, Krivi G, Fitzpatrick FA. (1991) J. Biol. Chem. 266:1375–1378.
61. Yuan W, Munoz B, Wong C, et al. (1993) J. Med. Chem. 36:211–220.
62. Hogg JH, Ollmann IR, Haeggström JZ, et al. (1995) Bioorg. Med. Chem. 3:1405–1415.
63. Nicolaou KC, Evans RM, Roecker AJ, et al. (2003) Org. Biomol. Chem. 1:908–920.
64. Nicolaou KC, Pfefferkorn JA, Roecker AJ, et al. (2000) J. Am. Chem. Soc. 122:9939–9953.
65. Nicolaou KC, Pfefferkorn JA, Mitchell HJ, et al. (2000) J. Am. Chem. Soc. 122:9954–9967.
66. Downes M, Verdecia MA, Roecker AJ, et al. (2003) Mol. Cell 11:1079–1092.
67. Robinson-Rechavi M, Escriva H, Laudet V. (2003) J. Cell Sci. 116:585–586.
68. Gronemeyer H, Gustafsson J, Laudet V. (2004) Nature 3:950–964.
69. Jordan VC. (2004) Cancer Cell 5:207–213.
70. de la Lastra CA, Sanchez-Fidalgo S, Villegas I, Motilva V. (2004) Curr. Pharm. Design 10:3505–3524.
71. Dang Z, Audinot V, Papapoulos SE, et al. (2003) J. Biol. Chem. 278:962–967.
72. Van Gaal L, Scheen AJ. (2002) Diabetes Metab. Res. Rev. 18:S1–S4.
73. Lyon MA, Ducruet AP, Wipf P, Lazo JS. (2002) Nat. Rev. Drug Discov. 1:961–976.
74. Fauman EB, Cogswell JP, Lovejoy B, et al. (1998) Cell 93:617–625.
75. Bialy L, Waldmann H. (2005) Angew. Chem. Int. Ed. In press.
76. Racchi M, Mazzucchelli M, Porrello E, et al. (2004) Pharmacol. Res. 50:441–451.
77. Walker BR, Seckl JR. (2003) Expert Opin. Ther. Targets 7:771–783.
78. Chrousos GP. (2004) Proc. Natl. Acad. Sci. 101:6329–6330.
79. Ross SA, Gulve EA, Wang M. (2004) Chem. Rev. 104:1255–1282.
80. Sandeep TC, Yau JLW, MacLullich AMJ, et al. (2004) Proc. Natl. Acad. Sci. 101:6734–6739.
81. Masuzaki H, Paterson J, Shinyama H, et al. (2001) Science 294:2166–2170.
82. New MI, Wilson RC. (1999) Proc. Natl. Acad. Sci. 96:12790–12797.
83. Gunasekera SP, McCarthy PJ, Kelly-Borges M. (1996) J. Am. Chem. Soc. 118:8759–8760.
84. Brohm D, Philippe N, Metzger S, et al. (2002) J. Am. Chem. Soc. 124:13171–13178.
4
A Reductionist Approach to Chemogenomics in the Design of Drug Molecules and Focused Libraries
Roger Crossley and Martin Slater∗
∗ Corresponding author: BioFocus (a Galapagos Company), Chesterford Research Park, Saffron Walden, Essex, CB10 1XL, UK.
1. Introduction
In the post-genomics era there is a huge amount of information available on gene sequences and hence on the amino acids that these sequences encode. There are many ways to use this information. In the approach used at BioFocus, the information contained in the sequence has been used to define binding sites, in the design of drug molecules and in the construction of focused screening libraries. In doing this, we have taken a fresh look at the processes of molecular recognition and combined some new and some old concepts in drug design.
2. Molecular Recognition and Vicinity Analysis™
Before the days of X-ray crystallography and molecular modeling, the processes by which receptors and ligands recognize each other were deduced from the effects on activity observed for different drug molecules. In this approach, binding sites are described in terms of the properties of microenvironments that reflect recognition processes, for example, lipophilic pockets, acid/base recognition features and so on. The presence of these is thus inferred from the structure-activity relationships (SAR) of the ligands which are active at the particular enzyme or receptor. One example of this approach is illustrated in Fig. 1, which represents the interaction at the sulfonylurea receptor associated with the KATP potassium channel and which is involved in the control of insulin release in the pancreas. This pharmacophore map was obtained by an analysis
[Figure 1: pharmacophore map. Ligand Motifs (amide NHCO or CONH; lipophilic group; aromatic ring; alkyl group; acidic SO2NH or CO2H) are surrounded by binding-site amino acid positions aa1 to aa9.]
Figure 1. Pharmacophore model for the sulfonylurea binding site (amended from Ref. 1).
of the observed SAR from the interaction of drugs at this receptor and provides sufficient information to design drugs that are likely to interact at this receptor.¹ The basic map makes no assumptions about the nature of the interactions which enable recognition of lipophilic groups, etc. However, it can be assumed that specific combinations of amino acids lining the binding site are responsible for producing the lipophilic environment which recognizes these lipophilic groups. Furthermore, without knowing exactly which specific amino acids are involved, it is likely that they (aa1, aa2 and aa3) are selected from the lipophilic amino acids (Val, Ile, etc.) or the aromatic amino acids (Phe, Tyr, etc.). More recent work has started to unravel the actual residues involved but such efforts are still at an early stage.² Because of the three-dimensional character of receptors, it can also be predicted that there are at least three amino acids involved in this lipophilic recognition and, furthermore, that they are arranged in space around a binding pocket, the size of which can be determined by examining the SAR of the ligands which bind to this receptor. The major contributions towards the binding energy in this lipophilic pocket then arise from the displacement of some ordered water molecules, together with a shape complementarity that is determined by the steric bulk of the relevant amino acids and their relative positions.
In the terminology used at BioFocus, the combinations of amino acids which create a binding pocket are called “Themes” and the various moieties arising from the SAR of the ligands (lipophilic in this case) are called “Motifs.” In a similar way, it is possible to examine in detail other parts of this SAR and propose other possible combinations of amino acids to recognize the Motifs observed in the SAR. For example, it is interesting to speculate how an acid recognition site (aa7, aa8, aa9) can be constructed. In this case,
one obvious possibility would be formation of a salt with a strong base, in which case Lys and Arg are potential candidates to bind a carboxylic acid. However, salt formation may not be the whole story here and the sulfonylurea may be stabilized by other means. Sulfonylureas are also likely to be stabilized in an environment created by a weaker base, such as histidine, in a hydrogen bonding environment provided by the amino acids serine and threonine. Then again, this latter environment is also likely to support the binding of amides; hence His, Ser and Thr are also candidates for aa4, aa5 and aa6, and so on.
One way in which such binding Themes can be estimated is from X-ray data of ligands interacting with receptors and enzymes. One technique used at BioFocus to dissect binding pockets is that of Vicinity Analysis™. In this experimentally based approach, a Motif is used to search the Relibase website (http://relibase.ebi.ac.uk/) to discover all the crystal structures where this Motif is part of the ligand. The crystal structures can then be overlaid so that the Motifs overlap. Residues outside a given radius are excluded, as are other parts of the ligand. The result is a collection of amino acids arranged in space around a molecular fragment and this collection is then analyzed to determine the various contributions to molecular recognition. One example is illustrated in Fig. 2, where the Motif used to explore the pocket is benzoic acid. Of the structures studied, a few had a strong base at a suitable distance to form a salt bridge by proton transfer. However, in the majority of cases, the carboxylic group was not directly proximate to a protein basic group but, in all cases, at least one basic residue was located
Figure 2. Observed aromatic stacking between benzoic acid groups (center of picture) and protein amino acids.
Figure 3. Rules for binding a benzoic acid moiety.
in the vicinity of the acid group but at a distance that would not imply a salt bridge. Often a basic group appears to stabilize the benzoic acid moiety indirectly through the presence of intermediate hydrogen bonding groups or water. Important stabilization also arises from interaction with aromatic rings. Figure 3 shows an illustration of the binding rules as observed in the “cavity” set. Following these experimental observations, therefore, it is possible to construct a Rule which says that, if the arrangement of amino acids is as in Fig. 3, then the receptor will recognize a benzoic acid fragment (or a bioisostere of a benzoic acid). It is also possible to translate the description in Fig. 3 into a more formalized Boolean logic, which could be used to interrogate a database. Discursively, such logic would read: “A drug fragment comprising a benzoic acid will be recognized by a pocket in a receptor IF there is an amino acid at 3.7–5.2 Å distance at 6 o’clock AND a hydrogen bonding amino acid (group) at 2.2–3.1 Å at 3 o’clock AND (an aromatic amino acid at 3.7–5.2 Å at 9 o’clock OR an aromatic or lipophilic amino acid at 3.7–5.2 Å at noon) AND a basic amino acid at 6.5 Å between 1 o’clock and 5 o’clock.” Note that this is only one way in which to describe the arrangement in Fig. 3, using as it does an origin based on the binding moiety. Other descriptions using the relative positions and identities of the groups and amino acids are also possible and are used in Thematic Analysis™ when applied to GPCRs and ion channels (vide infra). This information
is then sufficient for use in drug design or database searching and indeed such observations actually throw light on interactions that are not easily modeled using conventional techniques.
One very important form of recognition in drug design is the recognition of charged groups, e.g. amines or quaternary nitrogens. The nicotinic and muscarinic acetylcholine receptors are architecturally very different, one being an ion channel and the other a G-protein coupled receptor (GPCR), yet they recognize the same ligand in the same way, although the ligand is in a different conformation. In the former, the stabilization is attributed to Trp149 in combination with Asp180 and, analogously, in the latter, to both Trp377 and Asp104. In other words, cation-π stabilization is necessary for the recognition of positively charged nitrogen.ᵃ In some cases it is sufficient. In acetylcholinesterase, the quaternary nitrogen is recognized by an arrangement of aromatic amino acids which rely solely on π-charge stabilization to provide the binding environment. The hERG channel, a third and equally distinct protein found in cardiac tissue, possesses an analogous pattern of residues, where Tyr652 and Phe656 have been shown by site-directed mutagenesis to be involved in the recognition of amine-bearing ligands such as quinidine.³ A stacked Tyr-Phe arrangement is also shared by all biogenic amine GPCRs, which goes some way to explain the undesirable biological overlap of antagonists at these receptors with activity at the hERG channel. In a series of insightful papers,⁴⁻⁶ Dougherty has explored this effect further from a computational perspective which looks at the nature of the stabilization by different amino acids. This treatment provides a basis for understanding the differences in the recognition of electron-rich and electron-poor systems and indeed the relative electron richness of bioisosteric drug fragments.
Thus, tryptophan, by virtue of its electron richness, is the residue most predominant in the stabilization of a positive charge and is of critical importance in GPCRs.
At this point we have mainly considered recognition in terms of rules that apply across any target with defined geometry. Next, we consider how such rules can be derived within target families and how, under these conditions, it is possible to use the consistency within families to derive rules for the binding of fragments from experimental data even where structural definition is imprecise.
ᵃ Our own experience has shown that the pattern of recognition for positively charged amine moieties is relatively universal, at least in helical protein environments, and in most cases there is both a cation-π and an electrostatic component operating.
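As a minimal sketch, the Boolean rule quoted in this section for the benzoic acid Motif can be written as a small predicate. The data layout (a `Neighbour` record holding a one-letter residue code, a distance in Å and a clock position), the residue classes (borrowed from the one-letter groupings used for the Themes in Section 3), the ±0.5 hour clock tolerance and the 6.0–7.0 Å window around the quoted 6.5 Å are all assumptions made for illustration; none of them is taken from the BioFocus implementation.

```python
from dataclasses import dataclass

# Residue classes: assumed groupings, following the one-letter sets used for the Themes.
AROMATIC = set("FYW")
LIPOPHILIC = set("VIL")
HBOND = set("NQST")
BASIC = set("HKR")


@dataclass(frozen=True)
class Neighbour:
    residue: str     # one-letter amino acid code
    distance: float  # distance from the Motif, in angstroms
    clock: float     # angular position around the Motif (12 = noon)


def near_clock(n, hour, tol=0.5):
    """True if the neighbour lies within `tol` hours of `hour` (tolerance is assumed)."""
    diff = abs(n.clock - hour) % 12
    return min(diff, 12 - diff) <= tol


def in_shell(n, lo, hi):
    """True if the neighbour's distance falls inside the closed range [lo, hi]."""
    return lo <= n.distance <= hi


def recognizes_benzoic_acid(pocket):
    """Evaluate the quoted IF ... AND ... AND (... OR ...) AND ... rule on a pocket."""
    any_at_6 = any(in_shell(n, 3.7, 5.2) and near_clock(n, 6) for n in pocket)
    hbond_at_3 = any(n.residue in HBOND and in_shell(n, 2.2, 3.1)
                     and near_clock(n, 3) for n in pocket)
    aromatic_at_9 = any(n.residue in AROMATIC and in_shell(n, 3.7, 5.2)
                        and near_clock(n, 9) for n in pocket)
    aro_or_lipo_at_noon = any(n.residue in (AROMATIC | LIPOPHILIC)
                              and in_shell(n, 3.7, 5.2)
                              and near_clock(n, 12) for n in pocket)
    # "at 6.5 A between 1 o'clock and 5 o'clock": a 6.0-7.0 A window is assumed.
    basic_1_to_5 = any(n.residue in BASIC and in_shell(n, 6.0, 7.0)
                       and 1 <= n.clock <= 5 for n in pocket)
    return (any_at_6 and hbond_at_3
            and (aromatic_at_9 or aro_or_lipo_at_noon)
            and basic_1_to_5)


# Hypothetical pocket, invented purely to exercise the rule.
example_pocket = [
    Neighbour("L", 4.5, 6),
    Neighbour("S", 2.8, 3),
    Neighbour("F", 4.0, 9),
    Neighbour("K", 6.5, 2),
]

if __name__ == "__main__":
    print(recognizes_benzoic_acid(example_pocket))
```

Keeping each clause as a separately named Boolean mirrors the wording of the rule, so a clause can be relaxed or replaced (for example with a description based on relative residue positions) without touching the others.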
3. Thematic Analysis™
As we have seen, some drug interactions, such as cation-π interactions and indirect stabilization using water and hydrogen bonding groups, are not modeled easily. It is also difficult to quantify the importance of such interactions. Modeling based on molecular mechanics gives significant prominence to hydrogen bond and salt formation, but ignores the contribution of water both to the enthalpy gained by stabilizing otherwise high energy interactions and to the entropy gained by removing its ordering. It does not cope well with dynamic systems, where there is a large-scale protein movement, or where hard structural data is not available.
When it comes to GPCRs, many of the weaknesses of traditional modeling are present. We have only the single example of bovine rhodopsin on which to base homology models and this is in an inactive conformation. Many of the interactions between drug molecules and GPCRs are of the type where atom-centered modeling is the weakest. Alternative approaches based on fields⁷ or quantum mechanics cannot yet manage the size of these systems. Interactions in GPCRs are significantly dominated by the π-π and cation-π type and the importance of hydrogen bonding is diminished, a consequence of the absence of interactions with the protein backbones in all-helical environments.
It was in order to cope with the vagaries of GPCRs that the observations inherent in Vicinity Analysis™ were coupled with traditional modeling by deduction and were extended into Thematic Analysis™.⁸ The limitation of Vicinity Analysis™ is that it requires knowledge of the precise geometrical arrangement of amino acids. This is overcome in Thematic Analysis™ by assuming that there is a consensus binding site, which is essentially the same in all GPCRs, and that amino acid positions are conserved. In other words, it is assumed that in all Family A GPCRs, there is consistency of arrangement of the helices so that the amino acids, which decorate the helices, occupy equivalent positions and that a consistent subset of these positions surrounds a binding pocket into which most small drug molecules bind.ᵇ Furthermore, in Thematic Analysis™ the volume modeled represents a binding site that averages the conformational itinerary of GPCRs during activation. As absolute distances are not required in Thematic Analysis™, this is not a particular handicap.
ᵇ This is an assumption shared with homology modeling⁹,¹⁰ based on bovine rhodopsin, which uses as a starting point an inactive form of the receptor and subsequent manipulation is required to transform it into an active form.
A second assumption is that, whenever equivalent amino acids occupy the same positions around this consensus binding site, they cooperate to recognize the same type of bioisosteric Motif in the interacting drug molecule. For example, if a particular combination of amino acids in certain positions contributes to the recognition of an electron-rich ring (say) in one GPCR, then the same combination in the same positions will recognize electron-rich rings in another GPCR. The precise orientation of the electron-rich ring is then dictated by the relative positions of the cooperating amino acids.
The approach, therefore, consists of initially exploring the SAR of known drugs at specific and related receptors, defining the bioisosteric Motifs in these drugs according to certain classes (acid, ester, aromatic, lipophilic, electron rich, etc.), and examining the receptor sequences of the interacting GPCRs to determine the likely combinations of interacting amino acids. This was then verified by examining the effects of mutagenesis experiments reported in the literature, and the whole arrangements were organized in Excel spreadsheets and displayed in logical maps of the receptors. In this extensive survey of known interactions with GPCRs, it emerges that some 42 amino acid positions are primarily responsible for creating the consensus binding pocket for the vast majority of drug molecules which interact with Family A (rhodopsin-like) GPCRs. There are some further amino acid positions that may in particular cases also contribute to binding, and sets of rules have been created which allow for this involvement.
It became necessary to create a device with which these could be visualized. The approach chosen was that of logical mapping, whereby the three-dimensional arrangement is reduced, in the extreme case, to that of connectivity, and distances are de-emphasized or ignored.
GPCRs are dynamic systems and it is supposed that active and inactive conformations are produced by large movements of the helices promoted by rotation and translation. Drugs that interact at these receptors may be agonists, antagonists, partial agonists or inverse agonists according to how they stabilize the various parts of this conformational itinerary. This poses a significant problem in the modeling of these systems. Thematic Analysis™, in order to be general enough to cope with this variety of interactions, minimizes but does not ignore the influence of distance. It relies on the assumption, elaborated above, that the same arrangement of amino acids will provide the same microenvironment and hence recognize similar bioisosteres. This is the same problem faced by creators of subway and other transport maps. Here the importance of distance is minimized and
distances are relative. In such systems, it is the connectivity between stations that is important and not the absolute distances. Using this type of logical map, the arrangement of amino acids in GPCRs can be shown as in Fig. 4.
The consensus drug-binding site was determined by a systematic critical examination of established mutagenesis and modeling experiments for a wide range of Family A GPCRs. A parallel exercise¹¹ to define a consensus binding site has been conducted over a number of years at the University of Copenhagen and at 7TM Pharma and the results are formally similar.ᶜ This putative binding site was then compared with the known SAR for these receptors. Interacting drug classes were dissected into a series of binding Motifs (acid, amide, ester, base, lipophilic, electron rich, electron poor, etc.) and those amino acids that could be determined to cooperate to explain the SAR were collected together into Themes. The Themes and the microenvironments they describe are indicated in Fig. 4.
Some assumptions were made in the creation of this map. As the bottom half of the helices is not generally involved in drug binding, it has been removed, as have the extracellular loops. The latter do play a significant role in the interaction of some GPCRs with their natural ligands, such as peptides, but, apart from a few instances, they do not have a significant role in the creation of small drug molecule binding sites. A potential limitation of this approach is that the conformational information that would enable decisions to be taken about agonism and antagonism, etc. has been lost. It is, however, possible to establish rules which go a long way to overcome this limitation. For example, interaction with Themes in the bottom left quadrant of the logical map is likely to produce antagonists, whereas interaction with the Themes in the top left quadrant is preferred for producing agonists.
ᶜ In the BioFocus approach, the analysis was driven by the observed SAR of the systems and in the above case by some elegant experimental biology. More formally, Thematic Analysis™ is a superset of the above approach as it describes the recognition processes in all-helical environments and hence can be extended to Family B and C GPCRs and also to ion channels.
Examples of Themes
In Family A GPCRs we have defined 28 Themes so far. These represent most of the combinations found in drug molecules and these Themes populate the consensus binding site as illustrated in Fig. 4. The existence of a Theme, or not, is then defined by a search logic which surveys the amino acids at the particular positions. In the examples of the first 5 Themes below,
[Figure 4: panels (a) to (f). Labels recovered from the image: NH2 and CO2H termini; transmembrane helices TM 2 to TM 7.]
Figure 4. The consensus binding site for GPCRs derives initially from identification of important binding residues in the amino acid sequence; these are distributed as shown on the ribbon diagram. In the ribbon diagram (a, b) the green circles indicate the area of consensus binding. In (c) the predicted consensus binding area is shown overlaid on a model derived from the structure of bovine rhodopsin. The ribbon diagram is transformed into a logical map (d) of GPCRs by deletion of extracellular and intracellular loops and straightening of the helices. The Themes and the microenvironments they describe are superimposed onto the GPCR logical map and shown in two orientations (e, f).
the numbering relates to that in rhodopsin and the standard single-letter abbreviationsᵈ for the amino acids are used.
Theme 1 recognizes a positively charged functionality, usually a basic primary, secondary, tertiary or quaternary amine, as the Motif. Key residues are Asp117 on Tm3, which stabilizes the charge on the protonated amine, and the aromatic residues at positions 265 and 269 on Tm6, which further stabilize this charge by cation-π interaction. Usually Trp265 on Tm6 provides the majority of stabilization. Supporting residues are found on Tm6 at positions 268 and 269. Position 268 is predominantly Phe or Tyr and potentially provides a second cation-π aromatic interaction or stabilizes other participating residues via a π-π aromatic-aromatic interaction. On occasions, the system is inverted and Trp265 is replaced by Phe or Tyr and the residue at position 269 changes to Trp. One variant exists in Opioid receptors, where a small lipophilic group replaces the aromatic residue at 268 (Ile268). Three other variants exist in which secondary stabilization may be achieved via H-bonding, i.e. His269 in Opioid receptors; Gln269 or Asn269 in ORL1 and Muscarinic receptors, respectively; and Thr269 or Ser269 in some Histamine and Trace amine receptors. Thus, the search logic which can be applied for defining this Theme is as follows: D in position 117; W, Y, F in position 265; W, F, Y, I, V, L, C in position 268; and F, Y, I, V, L, H, N, Q, S, T in position 269. Examples are shown in Table 1.

Table 1 Examples of Theme 1
                   Tm3  Tm6
Receptor  Subtype  117  265  268  269
B2AR      1aa      D    W    F    F
ACM1      1ac      D    W    Y    N
OPRM      1bb      D    W    I    H
H3        1ad      D    W    Y    S

ᵈ Tm1-7 refers to the transmembrane helices. The following are the abbreviations for the amino acids: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp and Y, Tyr.
Theme 2 recognizes “carboxylic acids” or bioisosteres of them. There is, however, a gradation of properties of this Motif that range from charged “anionic carboxylate-like” to uncharged “ester-like” with an intermediate
Table 2 Theme 2 Subtypes
Subtype  Acid Moiety                Supporting Group Type                           Position
2a       ‘Acid-like’                Basic H, K, R                                   Tm3 113–115, 117–118; Tm5 203–205; Tm6 268, 271–272
2b       ‘Acid-like or amide-like’  Basic H, K, R and usually H-bonding N, Q, S, T  Tm3 113–115, 117–118; Tm5 203–205, 207–208; Tm6 268, 271–272
2c       ‘Ester-like’               H-bonding N, Q, S, T                            Tm3 113–115, 117–118; Tm5 203–205, 207–208
H-bonding “amide-like” Motif.ᵉ In contrast to Theme 1, the positions of the supporting residues are spread over helices Tm6, Tm3 and Tm5. The Theme 2 binding domain is somewhat higher and to the right of that of Theme 1, between Tm5/Tm3 and Tm6. As the “carboxylic acid” microenvironments range through “amide” and “ester,” so the surrounding residues range appropriately from being somewhat basic, to both H-bond donating and accepting, to H-bond donating in nature. The key residue is His, usually in position 269 on Tm6 (sometimes the adjacent positions 268 or 271 are involved). This stabilizes either the charge on the acid or, by H-bonding, the amide and ester bioisosteres. Residues at positions 265 and 268 have an indirect role in supporting Theme 2 by ensuring the correct orientation of the acidic moiety. Other supporting residues are in positions 113, 114, 115, 117, 118 on Tm3; 203, 204, 207 and 208 on Tm5; and 271 and 272 on Tm6. These residues can be basic (H, R, K), H-bond donor/acceptor (Q, N) or H-bond donor (S, T) in nature, depending on the subtype of Theme 2, as summarized in Table 2. The search logic which can be applied for defining this Theme is as follows: His in any of positions 268, 269 or 271; W, Y, F in position 265; and at least one H-bonding residue (N, Q, S, T) or at least one basic residue (H, K, R) in any of positions 113–115, 117–118, 203–205, 207–208, 268, 271–272. Examples are shown in Table 3.
ᵉ Acid-like includes carboxylic acid, tetrazole, sulfonamide (ionizable NH), etc. Amide-like includes (substituted) amide, tetrazole, sulfonamide, etc. Ester-like includes tertiary amide (substituted, no NH), ureas, urethane, sulfonamide, etc.
Theme 3 recognizes electron-rich aromatic rings and is located to the right of Theme 1 between Tm4 and Tm5. The prototypical Theme 3 Motif
Table 3 Examples of Theme 2 Subtypes
                   Tm3                      Tm5                      Tm6
Receptor  Subtype  113  114  115  117  118  203  204  205  207  208  265  268  269  271  272
ETIR      2a       F    P    F    Q    K    K    D    W    L    F    W    L    H    S    R
P2Y6      2a       V    R    F    F    Y    Y    G    M    L    T    F    F    H    T    K
AT1A      2b       A    S    A    V    S    L    G    L    K    N    W    H    Q    F    T
GHSR      2b       F    Q    F    S    E    V    M    V    V    S    W    F    H    G    R
MC4R      2c       F    D    S    I    C    Y    V    I    C    L    W    F    F    H    L

Table 4 Examples of Theme 3
           Tm4                 Tm5
Receptor   164  165  167  168  208  211  212
D2DR       S    T    I    S    S    S    F
ETBR-LP-2  S    M    L    A    F    Y    F
NTR2       S    L    L    A    V    S    F
would be the 5-hydroxyindole or catechol moieties such as are found in aminergic neurotransmitter receptors. A key feature of Theme 3 is a Ser- and/or Thr-rich microenvironment supported by aromatic residues Phe or Tyr. Ser or Thr residues are situated on Tm4 at positions 164 and 168, and on Tm5 at positions 208 and 211, and putatively stabilize the electron-rich aromatic ring via H-π aromatic interactions or by classical H-bonding to the appropriate pendant substituents. A supporting aromatic residue at position 212 on Tm5 stabilizes the Motif via a π-π interaction. A variant exists where a Ser or Thr is replaced by Tyr at position 211, e.g. in the endothelin receptors. The search logic which can be applied for defining this Theme is as follows: F at position 212; and no fewer than two S, T or Y in any of positions 164, 165, 167, 168, 208, 211. Examples are shown in Table 4.
Theme 4 is located in a similar position to Theme 3, to the right of Theme 1 between Tm4 and Tm5, but recognizes esters and heterocyclic ring systems. Theme 4 is comparatively rare and principally relates to muscarinic receptor ligands. Ser or Thr residues are situated on Tm4 at position 164 and on Tm5 at position 207, and these stabilize the ester moiety via classical H-bonding. Supporting aromatic residues Trp168 on Tm4 and Phe212
Table 5 Example of Theme 4
          Tm4       Tm5
Receptor  164  168  208  212
ACM1      S    W    T    F

Table 6 Examples of Theme 5
                   Tm7
Receptor  Subtype  288  289  292  293  295  296  299
SLC1      5a       Y    N    I    S    G    Y    S
GHSR      5a       N    L    F    V    F    Y    A
B1AR      5b       F    V    N    W    G    Y    S
V1BR      5b       F    T    M    L    G    N    S
TA1       5c       N    D    I    W    G    Y    S
P2YR      5c       Y    Q    R    G    A    S    S
on Tm5 stabilize the ester or heterocycle, presumably via a π-π p-orbital-aromatic or aromatic-aromatic interaction. The search logic which can be applied for defining this Theme is: F at position 212; W at position 168; and no fewer than two S or T in positions 164 and 208. Examples are shown in Table 5.
Theme 5 recognizes amphiphilic groups such as are found in the adrenoreceptor antagonists; the prototypical Motif is the naphthyl ether-ethylene glycol unit of propranolol. In this Theme, the key residue is Asn or Gln on Tm7, which stabilizes the Motif by H-bonding and is situated above a lipophilic pocket which lies down Tm7 and into which the naphthyl moiety of propranolol projects. Subtypes exist which are apparently displaced up or down Tm7.ᶠ Here the position of the Asn or Gln is apparently further up towards the extracellular surface or further down towards the intracellular surface relative to position 292 in the beta-aminergic receptors. The search logic which can be applied for defining this Theme is as follows: no fewer than one Asn or Gln at 288, 289, 292, 293 or 296, AND no more than one E, D, K, R at 288, 289, 292, 293 or 296. Examples are shown in Table 6.
ᶠ Alternatively, it may be that the algorithm used to determine the start and end of Tm7 has caused the alignment shift. This has little real impact on the analysis and is allowed for in the search logic.
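The search logic quoted for Themes 1 to 5 lends itself to a direct transcription. The sketch below is one possible reading, not the BioFocus implementation: representing a receptor as a dictionary from rhodopsin-numbered position to one-letter residue is an assumption, as are the function names. Each predicate encodes only the logic as stated in the text; subtype assignment and any precedence between Themes that share positions (for example Themes 3 and 4) are not modeled. The example rows are transcribed from Tables 1, 3, 4, 5 and 6.

```python
# Sketch of the Theme 1-5 search logic as plain predicates.
# A receptor is a dict {rhodopsin-numbered position: one-letter residue};
# this representation is an assumption made for illustration.

HBOND = set("NQST")  # H-bonding residues, as grouped in the text
BASIC = set("HKR")   # basic residues, as grouped in the text


def count_in(receptor, positions, allowed):
    """Count the listed positions whose residue is in `allowed`.
    Positions missing from the dict never match."""
    return sum(receptor.get(p) in allowed for p in positions)


def theme1(r):
    return (r.get(117) == "D"
            and r.get(265) in set("WYF")
            and r.get(268) in set("WFYIVLC")
            and r.get(269) in set("FYIVLHNQST"))


THEME2_SUPPORT = (113, 114, 115, 117, 118, 203, 204, 205, 207, 208, 268, 271, 272)


def theme2(r):
    return (count_in(r, (268, 269, 271), {"H"}) >= 1
            and r.get(265) in set("WYF")
            and count_in(r, THEME2_SUPPORT, HBOND | BASIC) >= 1)


def theme3(r):
    return (r.get(212) == "F"
            and count_in(r, (164, 165, 167, 168, 208, 211), set("STY")) >= 2)


def theme4(r):
    return (r.get(212) == "F" and r.get(168) == "W"
            and count_in(r, (164, 208), set("ST")) >= 2)


def theme5(r):
    positions = (288, 289, 292, 293, 296)
    return (count_in(r, positions, set("NQ")) >= 1
            and count_in(r, positions, set("EDKR")) <= 1)


# Example rows transcribed from the tables in this section.
B2AR_TABLE1 = {117: "D", 265: "W", 268: "F", 269: "F"}
AT1A_TABLE3 = {113: "A", 114: "S", 115: "A", 117: "V", 118: "S",
               203: "L", 204: "G", 205: "L", 207: "K", 208: "N",
               265: "W", 268: "H", 269: "Q", 271: "F", 272: "T"}
D2DR_TABLE4 = {164: "S", 165: "T", 167: "I", 168: "S",
               208: "S", 211: "S", 212: "F"}
ACM1_TABLE5 = {164: "S", 168: "W", 208: "T", 212: "F"}
P2YR_TABLE6 = {288: "Y", 289: "Q", 292: "R", 293: "G",
               295: "A", 296: "S", 299: "S"}

if __name__ == "__main__":
    print(theme1(B2AR_TABLE1), theme2(AT1A_TABLE3), theme3(D2DR_TABLE4),
          theme4(ACM1_TABLE5), theme5(P2YR_TABLE6))
```

Under this reading, every tabulated example satisfies the logic of the Theme it illustrates, which is a useful consistency check when transcribing the tables.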
4. Family B and C GPCRs
So far, most drugs to come from the GPCRs arise from Family A. Families B and C have generally proved to possess more difficult environments in which to find drugs. Family B, secretin-like, is characterized by having a large extracellular domain into which the natural peptide ligands bind and also by the effective lack of an extracellular loop between transmembrane helices 1 and 2. As there are no crystal structures of this family available, approaches using traditional molecular modeling have proved difficult. With Family C GPCRs, the natural ligands, glutamate and GABA, bind to what is essentially a separate extracellular binding domain and, although more is known about this domain, the arrangement of the transmembrane helices remains a problem, especially in view of the dimeric nature of the receptors. In addition, the scarcity of drugs that bind in the transmembrane regions means that there is a comparative lack of mutagenesis data.
However, Thematic Analysis™ does not require an accurate description of the binding domain, only that a consistent consensus binding domain exists. The assumption was made, therefore, that this does exist and is defined, as for Family A, by the upper portions of transmembrane helices 2–7. This is not unreasonable, as the transduction mechanisms responsible for the interactions with the G-proteins appear to involve the bottom parts of the helices of all GPCRs. It is presumed that the shape and size of the consensus binding domain differ between the families but are consistent within a family. It then remained to examine the SAR of the few drugs that interact with Family B GPCRs to see if Themes existed that correspond to the Motifs observed in the SAR.ᵍ By looking at the SAR and the amino acid side chains predicted to face inwards into a possible consensus binding pocket, it became possible to describe the Themes present in Family B receptors. The same process was applied to Family C receptors, where little is known about binding in the transmembrane regions. Nevertheless, some well understood arrangements of amino acids could be found by assuming that the helices should be more or less arranged as in Family A, again, because of the requirements of the transduction mechanisms.
From a comparison of the logical maps and Themes present across Families A, B and C, two major generalizations can be drawn. Whereas in Family A receptors, amine and electron-rich microenvironments are well represented, in Family B receptors, these are of minor importance and electron-deficient recognition appears to be more important. Also, in Family A the SAR across the family is dominated by the presence of central Themes such as Theme 1; however, such Themes seem to be absent in Families B and C.
ᵍ Thematic Analysis™ is essentially a description of the potential drug binding environments possible in all-helical receptor space. This space is also determined by the need to satisfy the requirements of water molecules in the absence of other ligands. These restrictions limit the number of arrangements of amino acids possible around consensus binding sites. It was recognized in the 1980s by workers at Parke Davis¹² in work on peptoids that amino acids fall into a few classes depending on whether they are aromatic, hydrogen bonding, lipophilic, etc. Hence, there are only a few types of microenvironment which can be described by small collections of amino acid side chains and they are the commonly observed ones used in Thematic Analysis™. The converse also applies and a particular microenvironment, as deduced from SAR, is only describable by a limited number of amino acid arrangements.
5. Classification of GPCRs
The ability to use Themes to recognize binding microenvironments in GPCRs has a number of consequences. One of these is the ability to classify the receptors in a particular family in an entirely new way. There have previously been attempts to treat classes of GPCR according to how the particular receptors recognize the SAR present in the drugs that interact with them. This can be a powerful tool in the design of drugs for the classes identified in this way. One such approach, which has been extensively used by the Neurocrine group, is the definition of a subset of GPCRs, the so-called GPCR-PA+ set, where the SAR of drugs at these receptors is dominated by the presence of a positively charged amine.¹³ In Thematic Analysis™ terms, many of these fall in the group of receptors defined by the presence of Theme 1.ʰ
ʰ There are positively charged amine recognition sites other than that defined by Theme 1.
It is therefore possible to describe the SAR at a particular receptor by the presence or absence of Themes. For example, typical aminergic receptors contain Themes 1 and 3 and a subset of these are recognized by antagonists that also bind to Theme 5. It is possible then to create a table, which indicates the presence or absence of a Theme, and use this to classify GPCRs according to the types of SAR that the Themes describe. Such a table is shown in Table 7. Using such a table, it is then possible to sort a whole family of GPCRs according to the type of SAR that the receptor displays. This can, in turn, be used to design focused libraries with which to screen particular subsets of receptors (see below), or to define search criteria with which to trawl
ch04
FA April 1, 2006
100
15:40
WSPC/Book-329: Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
R. Crossley and M. Slater Table 7 Binary Map of Selected GPCRs by Themes. A “1” indicates the presence of a Theme and a “0” the absence
Theme Receptor
1
2
3
4
5
A2AR D2DR ACM1 OPRD ETBR V1BR P2YR
1 1 1 1 0 0 0
0 0 0 1 1 0 1
1 1 0 0 1 1 0
0 0 1 0 0 0 0
1 0 0 0 1 1 1
large compound collections for subsets more likely to generate hits. It is now increasingly recognized that there are considerable savings of time and money to be made by only screening compounds likely to interact with a particular receptor. More fully describing the receptors in terms of a 28-Theme bit string has also enabled a graphical description of GPCR space as shown in Fig. 5, using the mathematical technique of multidimensional scaling.i Applying this technique to the thematic bit maps of Families A, B and C GPCRs produces a two-dimensional space where the receptors, as points, are separated according to how similar their bit strings are. This depiction then clusters together receptors sharing similar Themes and hence having SAR in common. It can be used to suggest which other receptors are most likely to present selectivity problems or, conversely, provide clues as to the type of molecules which are likely to provide starting points for drug design experiments by SAR hopping. The classification appears to be intuitively correct with known groups of receptors, e.g. most of the GPCR PA+ come together in space and receptors “known” to be difficult also tend to occupy similar parts of the space. Thus the classification clusters these “difficult” Family B and C receptors somewhat apart from the majority of Family A, whereas Family B receptors which have proved more tractable, e.g. GLP2, are to be found closer to the center of the constellation. i For
an excellent review of multidimensional scaling, see http://forrest.psych.unc.edu/teaching/p208a/ mds/mds.html.
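The use of a binary Theme map to sort and select receptors can be illustrated with a minimal Python sketch. It assumes that the values of Table 7 are read as one column per Theme (Themes 1 to 5, left to right); the function names are illustrative only and are not part of Thematic Analysis™.

```python
# Table 7 encoded as Theme bit strings (Themes 1-5, left to right).
# Assumption: each row of flattened values in the source is one Theme column.
THEME_MAP = {
    "A2AR": "10101",
    "D2DR": "10100",
    "ACM1": "10010",
    "OPRD": "11000",
    "ETBR": "01101",
    "V1BR": "00101",
    "P2YR": "01001",
}


def has_themes(bits, required):
    """True if every Theme number in `required` (1-based) is set in `bits`."""
    return all(bits[t - 1] == "1" for t in required)


def select(theme_map, required):
    """Receptors whose binary map contains all of the required Themes."""
    return sorted(name for name, bits in theme_map.items()
                  if has_themes(bits, required))


def group_by_pattern(theme_map):
    """Sort a receptor family into groups sharing an identical Theme pattern."""
    groups = {}
    for name, bits in theme_map.items():
        groups.setdefault(bits, []).append(name)
    return groups


if __name__ == "__main__":
    # Receptors displaying the Theme 1 + Theme 3 pattern described in the text.
    print(select(THEME_MAP, [1, 3]))
    # The subset that additionally presents Theme 5.
    print(select(THEME_MAP, [1, 3, 5]))
```

A selection of this kind is one way to define the subset of receptors against which a focused library would be screened.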
Figure 5. Multidimensional scaling plot of GPCR SAR space as defined by the presence or absence of Themes. Family A GPCRs are indicated by filled squares; Family B by triangles; and Family C by diamonds. [Plot not reproduced; labelled points include LEC2, MGR8, GLR, MGR6, MGR2, PACAP1, GBR2, RAIG, MGR7, CASR, MGR4, VIPR, PTR2, GIP, VIPS, MGR1, MGR5, GBR1, PTR, CGRR, CALR and GLP2.]
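How a plot of this kind can be derived from Theme bit strings may be sketched as follows. This is only an illustration under stated assumptions: the five-Theme strings of Table 7 (read as one column per Theme) stand in for the 28-Theme strings, which are not listed in the text; the distance measure (Euclidean distance between bit vectors) and the classical (Torgerson) form of multidimensional scaling are choices made for the sketch, since the chapter does not state which variant was used.

```python
import math

# Five-Theme bit strings from Table 7, standing in for the 28-Theme strings.
BITS = {
    "A2AR": "10101", "D2DR": "10100", "ACM1": "10010", "OPRD": "11000",
    "ETBR": "01101", "V1BR": "00101", "P2YR": "01001",
}


def distance(a, b):
    """Euclidean distance between bit strings (sqrt of the Hamming distance)."""
    return math.sqrt(sum(x != y for x, y in zip(a, b)))


def classical_mds(dist, dims=2, iters=1000):
    """Classical (Torgerson) MDS of a symmetric distance matrix, pure Python."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # Double-centred Gram matrix B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Fixed, non-constant start vector, normalised to unit length.
        v = [math.sin(i + k + 1.0) for i in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):  # power iteration for the leading eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # The Rayleigh quotient gives the eigenvalue of the converged vector.
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so that the next pass finds the next-largest eigenvalue.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


def embed(bits):
    """Map each receptor name to 2-D coordinates."""
    names = list(bits)
    dist = [[distance(bits[a], bits[b]) for b in names] for a in names]
    return dict(zip(names, map(tuple, classical_mds(dist))))


if __name__ == "__main__":
    for name, (x, y) in embed(BITS).items():
        print(f"{name}: {x:+.2f} {y:+.2f}")
```

Receptors with similar bit strings receive nearby coordinates, which is the property the text relies on when reading selectivity risks or SAR hopping opportunities from the plot.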
6. Pharmacophore Maps

Although logical maps combined with a set of rules are sufficient to use Thematic Analysis™ in compound design and optimization for particular GPCRs, a method was sought which would render them more intuitive when presented to medicinal chemists. The solution was to provide a tool that could describe the consensus binding site of a GPCR in terms of the microenvironments described by the Themes; in other words, to produce the kind of pharmacophore map which a medicinal chemist would produce to describe observed SAR such as that presented in Fig. 1. The result is illustrated in Fig. 6, where the pharmacophore map for the serotonin 5HT2a receptor is displayed, overlaid with a known 5HT2a antagonist. The map indicates the features of the receptor filled by appropriate groups and is at an approximate scale of 1 unit = 5 Å. Thus, the indole appears to map into the bottom right electron-rich pocket and the 2-phenyl group into the adjacent lipophilic pocket; the positively charged nitrogen into the amine pocket and the phenethyl group either into the left hand electron-rich pocket or the adjacent lipophilic pocket. The map indicates that there is room for further elaboration of this molecule with some hydrogen bonding accessible from the 2-phenyl group and the phenethyl could benefit from being somewhat more electron-rich.

Figure 6. Predicted pharmacophore map of the serotonin 5HT2a receptor overlaid with a 5HT2a antagonist. [Map not reproduced; labelled regions: Amine, Acid, H-bond, Ester, Hetero, E-rich and Lipo; axis ticks run from -2 to 2 and from -1.5 to 1.5 map units.]
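The quoted scale of approximately 1 unit = 5 Å allows separations on such a map to be converted into distances that can be compared with a candidate molecule. The following Python sketch shows the arithmetic only; the pocket coordinates are hypothetical placeholders and are not read from Fig. 6.

```python
import math

ANGSTROM_PER_UNIT = 5.0  # approximate map scale quoted in the text

# Hypothetical pocket centres in map units (NOT taken from Fig. 6).
POCKETS = {
    "amine": (0.0, 0.0),
    "e_rich": (0.6, -0.8),
    "lipo": (-1.2, -0.5),
}


def separation_angstrom(a, b, pockets=POCKETS):
    """Distance between two pocket centres, converted from map units to Å."""
    (x1, y1), (x2, y2) = pockets[a], pockets[b]
    return math.hypot(x2 - x1, y2 - y1) * ANGSTROM_PER_UNIT


def spans(a, b, ligand_distance, tolerance=1.0, pockets=POCKETS):
    """True if two ligand groups `ligand_distance` Å apart could reach both pockets."""
    return abs(separation_angstrom(a, b, pockets) - ligand_distance) <= tolerance


if __name__ == "__main__":
    print(round(separation_angstrom("amine", "e_rich"), 2))
    print(spans("amine", "lipo", 6.0))
```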
7. Library Design Using Thematic Analysis™

There are a number of approaches to the design of libraries of compounds focused on GPCRs and such libraries fall into a number of classes depending on how they have been designed. In the technique of post hoc design, a set of descriptors is built up by examination of a set of compounds active at a particular receptor family or sub-class. Normally, the set of drugs would be from a commercial database such as MDDR or the Merck Index, etc. and the descriptors would usually be substructural fragment or key based. One example would be the GPCR-PA+ sub-class referred to above, where BCUT descriptors have been used to aid the design of a focused library of around 2000 compounds based on 8 scaffolds.13 Libraries have also been constructed based on peptidomimetic principles14 as well as on the concepts of privileged structures.15

The above approaches rely upon prior knowledge of the SAR of related receptors and the assumption that a receptor with unknown SAR is related to the training set. There are, however, receptors that are notoriously difficult to find hits for. Is this because the receptor does not contain a suitable binding site or because the correct libraries have not been used in its screening? Experience shows the latter. Thematic Analysis™, because it uses sequence alone to determine binding microenvironments, is a form of chemogenomics and is part of a new approach to the creation of focused libraries. An earlier example of a chemogenomics approach by Jacoby16−18 produced focused libraries that were targeted by analysis of the physicochemical relationships of the binding site in aminergic GPCRs and has been further exemplified since.19 Thematic Analysis™ design of focused libraries takes a similar line but has been more generalized in that it has been applied across a much wider range of receptors.

Figure 7 illustrates the main approach to the design of focused libraries based on Thematic Analysis™. A subset of GPCRs or a single GPCR is chosen which displays a particular pattern of Themes. A scaffold is selected, which is either known post hoc to interact with one of these Themes or can be designed to display the correct geometry to interact with a pair of Themes. Vectors are chosen for elaboration (R1 and R2) which explore other Themes and the sets of groups to decorate these vectors are chosen according to the properties which appear in the pharmacophore map. In the above example, a central piperidine is chosen to interact with Theme 1 (amine) and a 3-phenyl group is positioned so as to interact with Theme 3. This phenyl can then be decorated with groups to increase its electronegativity. R1 is able to explore either a bottom left electron-rich pocket or a slightly higher lipophilic pocket. R2 can be chosen from groups which explore either a central hydrogen bonding region or a lipophilic pocket end extended into a hydrogen bonding region.

Figure 7. Use of Thematic Analysis™ in library design. [Diagram not reproduced: a scaffold with substitution vectors R1 and R2 drawn on a pharmacophore map with Amine, Acid, H-bond, Ester, E-rich and Lipo regions; scale bar ~5 Å.]

Having selected a range of potential side chains and confirmed that the available chemistry is suitable, a virtual library is constructed and the final library members synthesized which reflect good drug-like properties. Experience across a wide variety of receptors with the screening of libraries designed in this way demonstrates that these design principles are effective and hits have been found for receptors which otherwise were considered to be difficult. Hits at a particular receptor are usually found in clusters giving an early indication of the SAR, and the same design considerations, used to produce the library, can be used in the lead optimization process. The extension of Thematic Analysis™ into Families B and C opens up the possibility of obtaining tractable leads in these areas as well.
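The construction of a virtual library from a scaffold and two sets of side chains, followed by selection of drug-like members, can be sketched in Python as a simple enumeration and filter. Everything specific here is an assumption made for illustration: the building-block labels, their mass contributions and the 500 Da cut-off are placeholders, and the text does not state which drug-likeness criteria were applied.

```python
from itertools import product

# Hypothetical building blocks: (label, mass contribution in Da, Theme explored).
SCAFFOLD = ("scaffold", 160.0)
R1_GROUPS = [
    ("R1_a", 91.0, "electron-rich"),
    ("R1_b", 97.0, "lipophilic"),
]
R2_GROUPS = [
    ("R2_a", 58.0, "hydrogen bonding"),
    ("R2_b", 43.0, "lipophilic"),
    ("R2_c", 400.0, "lipophilic"),  # deliberately too heavy for the filter
]

MAX_MW = 500.0  # assumed drug-likeness cut-off, not taken from the text


def enumerate_library(scaffold, r1_groups, r2_groups):
    """Full virtual library: every R1/R2 combination on the scaffold."""
    name, mass = scaffold
    for (n1, m1, t1), (n2, m2, t2) in product(r1_groups, r2_groups):
        yield {
            "id": f"{name}-{n1}-{n2}",
            "mw": mass + m1 + m2,
            "themes": (t1, t2),
        }


def drug_like(member, max_mw=MAX_MW):
    """Placeholder drug-likeness test based on molecular weight only."""
    return member["mw"] <= max_mw


if __name__ == "__main__":
    virtual = list(enumerate_library(SCAFFOLD, R1_GROUPS, R2_GROUPS))
    selected = [m for m in virtual if drug_like(m)]
    print(len(virtual), "virtual members,", len(selected), "selected")
```

In practice the filter would combine several properties; a single weight threshold is used here only to keep the sketch short.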
8. Radar Maps

Radar maps are a method of visualizing multidimensional space and have been applied to receptors as described by Thematic Analysis™. In these maps, the importance of a descriptor is denoted by its distance along a radiating axis. When the points on these axes are joined together, the resulting shape is a description of the object being described. In the examples in Fig. 8, the axes represent the properties encoded by Thematic Analysis™ (amine, acid, hydrogen bonding, ester, heterocycle, electron rich, electron deficient, lipophilic) and for any particular receptor, the importance of such a microenvironment is reflected in the distance along the relevant axis. Joining the points together produces a shape which is a description of the overall properties of a binding site in a receptor and the shapes are similar in receptors with similar binding sites. In addition, as it is also possible to describe any particular focused library, produced as above, in terms of such a diagram, it is possible to choose relevant libraries for screening by visual inspection. In the examples given in Fig. 8, neuropeptide Y2 receptor has a similar binding site to the metastatin receptor, but this is significantly different from the metabotropic glutamate receptor 5. Accordingly, a library with an overall property profile shaped like a rightward facing fish would be expected to produce hits for the first two but not the latter.

Figure 8. GPCRs described using radar plots. From left to right: metastatin receptor (GPR54), neuropeptide Y2 receptor and metabotropic glutamate receptor (mGluR5). [Plots not reproduced; panel titles: Neuropeptide Y2 (NY2R), Metastatin (GPR54), Metabotropic glutamate receptor 5; axes: amine, acid, HB, est, Het, ER, ED, Lip; radial ticks at -1, 3 and 7.]
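The visual comparison of radar shapes can be complemented by a numerical one. The Python sketch below treats each radar map as a vector over the eight property axes and ranks receptors by cosine similarity to a library profile. The scores are hypothetical placeholders (they are not read from Fig. 8) and cosine similarity is an assumed choice of measure, not one named in the text.

```python
import math

AXES = ("amine", "acid", "HB", "est", "Het", "ER", "ED", "Lip")

# Hypothetical importance scores along each axis (NOT read from Fig. 8).
PROFILES = {
    "receptor_X": (6, 1, 3, 0, 2, 5, 1, 6),
    "receptor_Y": (5, 1, 4, 0, 2, 5, 0, 6),
    "receptor_Z": (0, 6, 5, 3, 1, 0, 4, 1),
}
LIBRARY = (6, 0, 3, 0, 2, 6, 1, 5)  # hypothetical focused-library profile


def cosine(p, q):
    """Cosine similarity between two radar profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def rank_receptors(library, profiles):
    """Receptor names ordered from most to least similar to the library."""
    return sorted(profiles, key=lambda name: cosine(library, profiles[name]),
                  reverse=True)


if __name__ == "__main__":
    for name in rank_receptors(LIBRARY, PROFILES):
        print(name, round(cosine(LIBRARY, PROFILES[name]), 3))
```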
9. Ion Channels

Although Thematic Analysis™ was extensively developed for its application to GPCR classification, library design and lead optimization, it is in reality capable of exploring the arrangements of amino acids that provide binding microenvironments within the helical domains of receptors. Such helical recognition elements (HREs) would therefore be expected to exist at the binding sites of other predominantly helical proteins such as ion channels. This was in fact found to be the case when a similar approach to that employed for GPCRs was applied to ion channels. For example, the voltage gated cation channels belong to the largest group of structurally related ion channel proteins and were therefore chosen as the most amenable to analysis. However, a dichotomy exists, relative to the GPCRs, in terms of the physical evidence surrounding these targets. On one hand, much SAR and mutagenesis information is available for the sodium and calcium channels, whereas structural evidence such as X-ray crystallographic data is lacking. On the other hand, X-ray information, from bacterial channels, is available for potassium channels20−23 for which ligand and mutagenesis data are rather sparse.

Figure 9. (a) Batrachotoxin (BTX) docked into the KcsA X-ray structure. Residue positions superimposed in green correspond to those implicated in the sodium channel Nav1.4; and (b) its display as projected on the generic logical model of ion channels and predicted interaction with an ion channel Theme. [Images not reproduced; labels include helices IS6, IIS5, IIS6, IIIS6, IVS5 and IVS6, residue positions 74 and 76, and the ion channel Theme.]

Careful analysis of the primary amino acid sequences, together with the detailed structural information from the X-ray data, revealed the commonalities present in the three cation channels. These include, for example, the architecture of the selectivity filter and the positions of hinge points in the S5 and S6 helices. The former controls which ions can pass through the channel; the latter are required for gating to occur. The similarity stems from the related function of the channels and the voltage dependent opening of an ion conducting pore, which is inherent in the structures and independent of the cation. These details enabled the construction of a multiple sequence alignment for all three channel types, which was critical to determining the relationship between SAR, site directed mutagenesis (SDM) and structural data. The residues indicated by the SDM data could then be superimposed onto the X-ray crystal structures to map the surface accessible to known ligands. Analysis of the predicted binding modes of ligands, within the context of the resulting generic ion channel logical models, ultimately provided the ion channel Themes. The current analysis has identified 37 of these Themes, which cover a similar range of Motif properties as those identified for GPCRs. The Theme property profiles for ion channels can similarly be depicted as radar plots and are directly comparable with those derived from GPCRs. The result is an extremely convenient and powerful tool for visualizing and comparing drug fragment recognition in the two protein classes.
For example, it can be used to understand the SAR overlap24 of the 5-HT2a antagonist and antipsychotic drug sertindole at hERG, sodium and calcium channels as shown in Fig. 10. For ion channels, the Thematic Analysis™ approach is important for its potential to predict the interactions of ligands that block or open the channels. In common with GPCRs, ion channels are complex, intrinsically dynamic macromolecular machines. Since both opener and blocker ligands were used to derive the models, the generic ion channel logical map represents a time-averaged view of the protein. As such, it would be anticipated that careful analysis might reveal specific interactions relating to blockers or openers by analogy with the differential binding modes of agonists and antagonists in the GPCR logical maps.
Figure 10. Radar plots of the serotonin 5HT2a receptor and ion channel domains likely to be affected by the 5HT2a antagonist sertindole (D4Nav1.2 and D4Cav1.2) and those of two dissimilar channel domains (D3Cav3.3 and D1Nav1.5). [Plots not reproduced; panel titles: Serotonin (5HT2a), hERG, D4Cav1.2, D1Nav1.5, D4Nav1.2 and D3Cav3.3; axes: Amine, Acid, ED, HB, ER, ester, Het, Lipo and unknown.]
BioFocus has designed and synthesized a number of ion channel libraries according to the Theme directed design principles described previously. Initial indications are that these libraries give at least an order of magnitude enhancement in hit rate compared with diverse libraries for channels where hit rates are particularly low; more importantly, good selectivity between similar channel subtypes has been regularly observed.
10. Summary

The relationship between a gene sequence and drugs that interact at receptors encoded by that sequence is a complex one. Nevertheless, in families of related receptors, by using the information present in the sequence, it is possible, to an extent, to define binding interactions by considering the properties of the amino acids encoded in the gene sequence. To do this, it is important to work within the appropriate receptor space and to use the tools (e.g., structure or site-directed mutagenesis together with existing SAR) most appropriate for the receptor family in question. In space defined by all-helical receptors, Thematic Analysis™ has proven to be effective in reclassifying the systems in a ligand-centric manner, in the design of compound libraries and in lead optimization. Although these concepts were developed and constructed around GPCRs, they represent a link between traditional ideas of pharmacophores and bioisosteres and more recent biostructural concepts of molecular recognition.
References

1. Crossley R, Opalko A. (1996) The structure-activity relationships of potassium channel Blockers, in: JM Evans, et al. (eds), Potassium Channels and their Modulators, Taylor and Francis.
2. Bryan J, Vila-Carriles WH, Zhao G, et al. (2004) Diabetes S104.
3. Sanchez-Chapula JA, Ferrer T, Navarro-Polanco RA, Sanguinetti MC. (2003) Mol. Pharmacol. 1051.
4. Gallivan JP, Dougherty D. (1999) Proc. Natl. Acad. Sci. USA 9459.
5. Zhong W, Gallivan JP, Zhang Y, et al. (1998) Proc. Natl. Acad. Sci. USA 12088.
6. Mecozzi S, West Jr AP, Dougherty DA. (1996) Proc. Natl. Acad. Sci. USA 10566.
7. Cheeseright T, Mackey M, Vinter A. (2004) Drug Discov. Today: BIOSILICO 57.
8. Crossley R, Rose VS, Stevens AP. (2003) PCT Int. Appl. WO 03/004147.
9. Chambers JJ, Nichols DE. (2002) J. Comp.-Aided Mol. Design 511.
10. Gouldson PR, Kidley NJ, Bywater RP, et al. (2004) Proteins: Struct. Funct. Bioinform. 67.
11. Elling CE, Thirstrup K, Holst B, Schwartz TW. (1999) Proc. Natl. Acad. Sci. USA 12322.
12. Horwell DC, Birchmore B, Boden PR, et al. (1990) Eur. J. Med. Chem. 53.
13. Lavrador K, Murphy B, Saunders J, et al. (2004) J. Med. Chem. 6864.
14. Chianelli D, Kim YC, Lvovskiy D, Webb TR. (2003) Bioorg. Med. Chem. 5059.
15. Mason JS, Morize I, Menard PR, et al. (1999) J. Med. Chem. 3251.
16. Jacoby E, Fauchère JL, Raimbaud E, et al. (1999) Quant. Struct.-Act. Relat. 561.
17. Jacoby E. (2001) Quant. Struct.-Act. Relat. 115.
18. Milligan G, Kellett E, Dacquet C, et al. (2001) Neuropharmacol. 334.
19. Nordling E, Homan E. (2004) J. Chem. Inf. Comput. Sci. 2207.
20. Doyle DA, Morais CJ, Pfuetzner RA, et al. (1998) Science 280:69–77.
21. Jiang Y, Lee A, Chen J, et al. (2002) Nature 417:515.
22. Jiang Y, Lee A, Chen J, et al. (2003) Nature 423:33.
23. Kuo A, Gulbis JM, Antcliff JF, et al. (2003) Science 300:1922.
24. Haverkamp W, Eckardt L, Matz J, Frederiksen K. (2002) Internat. J. Psychiatry Clin. Pract. S11.
Chapter 5

In silico Screening of the Protein Structure Repertoire and of Protein Families

Didier Rognan∗
1. Introduction

Virtual screening of compound libraries1 has recently gained considerable importance in early hit finding programs, notably when technological or economic hurdles disfavor experimental screening. Numerous successful applications of either ligand-based2 or structure-based3 in silico screening have been reported in the literature. Quite unexpectedly, the inverse paradigm still has not been deeply investigated. Given a set of ligands, is it possible to prioritize their most likely targets for experimental validation? Answering this question first requires the development of a relevant library covering the most reliable target space.4 By target library, we mean here a collection of macromolecules for which the amino acid sequence and/or three-dimensional (3-D) coordinates are available and can be browsed using simple queries. Then, an appropriate screening method has to be set up which is able to select a panel of targets fulfilling requirements imposed by either a ligand structure or a specific fingerprint5 or an evolutionary trace.6 Once a target library has been developed, several applications can be foreseen: 1) simply compare targets and whenever possible relevant ligand binding sites; 2) predict the most likely target(s) of a given ligand; 3) predict a selectivity profile for either a target or a ligand; 4) predict the druggability of a given target from a structural point of view. All these questions now require early answers in the evaluation of drug discovery programs. We will try to review each of these applications in the following sections.

∗ Bioinformatics of the Drug, CNRS UMR 7175-LC1, 74 route du Rhin, F-67400 Illkirch, France. E-mail: [email protected]
2. Setting-up Target Libraries

When developing a target library, first, a compromise between available information (notably at the structural level) and the therapeutical relevance of selected targets has to be made. Many proteins for which fine structural details are known (e.g. toxins, antibodies) are not druggable. Conversely, some important protein families for the pharmaceutical industry (e.g. G protein-coupled receptors) are poorly understood at the 3-D level. Next, a scope has to be assigned to the library. Which target space has to be covered? Lastly, which kind of data (amino acid sequences, 3-D atomic coordinates) are browsed for defining a target list?

2.1. sc-PDB: a collection of active sites from the Protein Data Bank

2.1.1. Setting-up the database

To establish the proof-of-concept that a protein library might be of screening interest, we have chosen the Protein Data Bank (PDB)7 as it is the major 3-D protein database for which experimentally-determined protein coordinates are available. Several protein-ligand databases derived from the PDB have recently been described.8−13 Relibase8 allows easy retrieval of protein-ligand complexes from a user-defined query focusing on specific molecular interactions. MSDsite9 is a database search and retrieval system for listing PDB entries fulfilling user-defined queries based on ligand information. The LPDB10 stores 195 high-resolution protein-ligand complexes and related physicochemical descriptors as well as binding constants. Its main purpose, like that of related protein-ligand datasets,11,12 is to provide reliable 3D information for calibrating docking algorithms and scoring functions. The ProLINT database13 contains about 20 000 interaction data for two protein families (kinases, proteases) with attached information about the ligand, the protein, experimental binding constants and published literature. It has been used to derive structure-activity relationships and predict binding constants.
LigBase14 is a database of ligand binding sites aligned with related protein structures and sequences containing 50 000 binding sites for heterogeneous ligands (ions, solvent, cofactors, inhibitors, etc.). However, none of the above-mentioned databases are directly usable to generate a collection of druggable protein active sites customized to accommodate small molecular-weight “drug-like” ligands. Generally, no differences between solvent, detergent, co-factors and ligands (in the pharmaceutical sense) are made in the above-mentioned databases. To fill this gap, we have recently developed a relational database (sc-PDB)15 specifically customized for screening purposes (Fig. 1). Starting from 27 000 PDB entries, a series of logical filters has been applied to constitute the database as follows:

- removal of undesirable entries: low resolution (>2.5 Å) X-ray structures, NMR structures;
- on-the-fly detection of the molecule to which each referenced PDB atom belongs (target, organic ligand, peptide ligand, cofactor, ion, solvent, detergent), thanks to knowledge-based rules and pre-existing lists of “HET” codes defined in the PDBsum database16;
- removal of undesirable ligands (molecular weight >1000 Da, solvent, detergents, ions and co-factors exhibiting atom types not recognized by classical docking algorithms);
- definition of putative ligands (organic or peptidic molecules, co-factor if present alone);
- detection of the binding site for each ligand (collection of amino acids for which any heavy atom is closer than 6.5 Å from any ligand atom);
- selection of a single ligand/active site for each PDB entry by calculating the buried surface area of the ligand and of the site, and selecting the ligand/site pair for which the percentage of burial is the highest;
- storage, for each selected PDB entry, of 3-D atomic coordinates in readable PDB format (target, active site) and SD/MOL2 formats (ligand, co-factors, ions).

Figure 1. Flowchart for developing the sc-PDB databank (http://bioinfo-pharma.u-strasbg.fr/scpdb/scpdb_form.html). [Flowchart not reproduced; boxes: PDB; Target; Organic Ligand; Peptide Ligand; Cofactors/Ions; solvent, detergent, etc.; undesirable entries; undesirable ligands; undesirable cofactors/ions; Potential Ligands; Active sites; Topological screen; 1 Ligand/Site pair; Target, Ligand, Site.]

2.1.2. Annotating the database

The current database (http://bioinfo-pharma.u-strasbg.fr/scpdb/scpdb_form.html) contains 5634 ligand-binding sites for 2464 small molecules. We assigned a unique UniProt17 accession number to each protein, thereby identifying 1407 different proteins in the database. Additional information on the source organism and the biological function of each protein was collected from several sources (UniProt, Gene Ontology). A functional classification of the database entries is shown in Fig. 2. Entries were separated into two superfamilies, viz. enzymatic (85% of all entries) and non-enzymatic proteins. The set of enzymatic proteins was organized into six families, according to EC (Enzyme Commission) number.18 The distribution of enzyme families displayed in Fig. 2 reveals that the most populated family is that of hydrolases (43% of the enzymes). This is correlated to the high number of proteases in the sc-PDB database. The diverse functions of the non-enzymatic proteins were clustered into families corresponding to the different cellular processes (Fig. 2). Four major classes (including replication/transcription/translation, molecular transport, immunity and signal transduction/cell adhesion) make up 75% of the non-enzymatic proteins set. Figure 2 gives an overview of the redundancy of current database entries. In most cases, fewer than 10 copies of an active site corresponding to a given protein are available in the database. The uneven distribution of protein entries, which reflects the intrinsic PDB redundancy, is of great interest for applications like inverse virtual screening. Indeed, conformational differences between several copies of an active site reflect the local protein flexibility.

Figure 2. sc-PDB content (release 3, March 2005). Left panel: distribution of enzymes and nonenzymes; bottom panel: observed redundancy. [Charts not reproduced. Distribution chart, 1,609 non redundant proteins: Hydrolases (30 %), Transferases (22 %), Oxidoreductases (20 %), Lyases (6.8 %), Isomerases (3.9 %), Ligases (2.4 %), Non-enzymes (14 %). Redundancy chart: number of copies per protein, with HIV-1 Protease labelled.]
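Three of the logical filters listed in Section 2.1.1 (entry quality, binding-site detection and selection of one ligand per entry) can be sketched in Python as follows. This is an illustration only and not the sc-PDB code: the data structures and function names are hypothetical, atoms are reduced to bare coordinates, and the percentage of burial is assumed to have been computed beforehand.

```python
import math
from dataclasses import dataclass

MAX_RESOLUTION = 2.5    # Å, X-ray structures above this are discarded
MAX_LIGAND_MW = 1000.0  # Da, heavier ligands are discarded
SITE_CUTOFF = 6.5       # Å, residue-to-ligand heavy-atom distance


@dataclass
class Ligand:
    name: str
    mol_weight: float
    atoms: list    # heavy-atom coordinates as (x, y, z) tuples
    burial: float  # percentage of buried surface, assumed precomputed


def keep_entry(method, resolution):
    """Keep only X-ray structures at 2.5 Å resolution or better."""
    return method == "X-ray" and resolution <= MAX_RESOLUTION


def binding_site(residues, ligand_atoms, cutoff=SITE_CUTOFF):
    """Residues with any heavy atom closer than `cutoff` to any ligand atom.

    `residues` maps a residue identifier to its heavy-atom coordinates.
    """
    return sorted(
        res for res, atoms in residues.items()
        if any(math.dist(a, l) < cutoff for a in atoms for l in ligand_atoms)
    )


def select_ligand(ligands):
    """Among ligands of acceptable size, pick the most buried one."""
    candidates = [lig for lig in ligands if lig.mol_weight <= MAX_LIGAND_MW]
    return max(candidates, key=lambda lig: lig.burial) if candidates else None


if __name__ == "__main__":
    residues = {"ASP25": [(0.0, 0.0, 0.0)], "GLY200": [(20.0, 0.0, 0.0)]}
    ligands = [
        Ligand("LIG_A", 450.0, [(3.0, 0.0, 0.0)], 82.0),
        Ligand("LIG_B", 300.0, [(18.0, 0.0, 0.0)], 35.0),
    ]
    if keep_entry("X-ray", 1.9):
        best = select_ligand(ligands)
        print(best.name, binding_site(residues, best.atoms))
```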
2.2. hGPCRDb: a collection of human non-olfactory GPCRs 2.2.1. Setting-up the database G protein-coupled receptors (GPCRs) constitute a superfamily of membrane receptors of utmost importance to pharmaceutical research.19 Hence, GPCRs are the macromolecular targets of ca. 30% of marketed drugs.20 The first draft of the human genome suggests that over 800 genes encode for a GPCR,21 out of which only a few (ca. 30) are currently addressed by marketed drugs. If one excludes the family of sensory receptors, about 400 receptors are potentially druggable with ca. 120 proteins being still considered as orphan targets.20 Traditionally, the first stage in the design of GPCR ligands has focused on the potency of the ligand for the selected receptor target. Selectivity towards the host receptor is usually considered once potency has already been reached. It would, however, be highly desirable to consider selectivity as soon as possible in the design process. Ideally, one would like to consider the universe of GPCRs for designing a ligand with the desired selectivity profile. As addressing this issue by highthroughput screening is currently impossible, “in silico” screening could provide a reasonable start. Indeed the recently-described 2.8 Å-resolution X-ray structure of bovine rhodopsin22 provides a possible 3-D template for modeling other GPCRs. Recent reports unambiguously demonstrated that rhodopsin-based GPCR homology models are accurate enough both
ch05
FA April 1, 2006
114
15:41
WSPC/Book-329: Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
D. Rognan
to propose reliable 3-D models of GPCRs very different from those of bovine rhodopsin23,24 and to identify new ligands by structure-based virtual screening.25−28 Using classical homology modeling to establish a 3-D target library including ca. 400 3-D models is not possible. We, therefore, designed a chemoinformatic tool (GPCRMod) specifically dedicated to high-throughput GPCR modeling.29 From the very beginning, several considerations were taken in the design of the code: 1) the target library cover all human nonolfactory GPCRs; 2) a reliable multiple alignment of all investigated GPCRs should be obtained at the seven-transmembrane (7-TM) domain only, acknowledging that high-throughput modeling of intra- and extracellular loops is currently not feasible; 3) the 7-TM binding cavity of every 3-D model should not be biased by the X-ray structure of bovine rhodopsin. In a first step, 372 human GPCR amino acid sequences were aligned at the 7-TM domain by browsing the target sequence for family-specific fingerprints and motifs29 (Fig. 3). Then, alignments were converted into a 3-D model using a comparative modeling tool that uses a set of ligand-biased GPCR models as main chain templates, and two rotamer libraries for side chain positioning (Fig. 4). A key point of the modeling procedure is that 7-TM cavities are derived from templates which (http://www.cbs.dtu.dk/services/TMHMM/)
Prediction of TMs (TMHMM)
rough TM locations
query sequence
search TM locations for familyspecific patterns and motifs
Family determined ?
no
Scoring Matrix
Motifs
(http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
no alignment
yes Choose template of the same family For each of the 7 TMs:
pattern found ? yes align TM via pattern location
Figure 3. Multiple alignment flowchart in GPCRMod.
no
motif found ?
no
yes align TM via motif location
Full TM alignment
ch05
FA April 1, 2006
15:41
115
WSPC/Book-329: Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
In silico Screening of the Protein Structure Repertoire and of Protein Families
Bond lengths and angles (AMBER) Cartesian coordinates
8 3-D Models
Dihedral angles in internal coordinates Knowledgebased rotamer library
AA sequence
1
Backboneindependant rotamer library ?
?
Template selection
GPCR Target Lib.
GPCR-Align
GPCR-Gen
Amino acid sequence alignment
Side-chains Positioning
2
3
Model Optimization 4
Topology Check 5
Figure 4. 3-D model generation flowchart in GPCRMod.
prove useful for differentiating known ligands from decoys. Resulting 3-D models are qualitatively quite similar to those obtained by ligand-based comparative modeling26,28 but are obtained at a throughput allowing the fast generation of hundreds of targets. 2.2.2. Annotation of the hGPCRdb Assuming that similar targets recognize similar ligands, an accurate annotation of all entries should consider similarities/differences at their binding cavity. As most small molecular-weight ligands probably bind to the 7-TM core, all GPCR entries have been annotated using a chemogenomic procedure taking into consideration a fingerprint characterizing their 7-TM binding cavity. Thirty positions lining the retinal binding site in bovine rhodopsin, were extracted from all entries and concatenated into ungapped sequences out of which a phylogenetic tree could be derived using the standard UPGMA clustering method30 (Fig. 5). Twenty-two clusters could be unambiguously detected from the present analysis of 30 amino acid positions (Fig. 5). These clusters were defined in order to encompass the maximum number of related entries within a branch characterized by the highest possible statistical bootstrap value. Thirty-four out of 372 entries could not be assigned to one of the existing 22 clusters and are considered as singletons. The herein presented tree
Figure 5. Protocol to generate a TM cavity-driven phylogenetic tree: 1) selection of 30 critical positions, 2) definition of ungapped sequences describing the 7-TM cavity, 3) TM cavity-derived phylogenetic tree for 372 human GPCRs. The consensus tree was derived from 1000 replicas. Numbers in parentheses indicate the number of entries in each cluster. Numbers in italic represent bootstrap values to assess the statistical significance of the tree. Receptors classified as singletons (see text) are not displayed here for the sake of clarity. Glutamate, Rhodopsin, Adhesion, Frizzled and Secretin subfamilies are colored in green, cyan, yellow, pink and orange, respectively.
[Positions shown in panel 1: 1.35, 1.39, 1.42, 1.46, 2.57, 2.58, 2.61, 2.65, 3.28, 3.29, 3.32, 3.33, 3.36, 3.40, 4.56, 4.60, 5.38, 5.39, 5.42, 5.43, 5.46, 6.44, 6.48, 6.51, 6.52, 6.55, 7.35, 7.39, 7.43, 7.45.]
[Clusters shown in panel 3, with number of entries: Prostanoids (8), Glycoproteins (8), Adhesion (33), SREBs (6), Glutamate (23), MAS (11), Secretin (15), Opsins (10), Adenosine (6), Amines (45), Melanocortin (5), Vasopeptides (7), Melatonin (7), Frizzled (11), Brain-gut peptides (10), Lipids (14), Peptides (26), Opiates (13), Purines (35), Acids (5), Chemokines (23), Chemoattractants (17).]
is very similar to the most complete phylogenetic tree (GRAFS classification) known to date,31 although the latter has been obtained from full TM sequences. In both classifications, GPCRs of the Frizzled, Glutamate, Secretin and Adhesion families cluster in well-separated groups, whereas the large Rhodopsin family can be classified into 18 different clusters. Remarkably, all known GPCR subfamilies (e.g. receptors for biogenic amines, purines, and chemokines) are reproduced with high bootstrap support. The five main families (Glutamate, Rhodopsin, Adhesion, Frizzled, Secretin) reported in the GRAFS classification are recovered with no overlaps between the corresponding clusters, with the single exception of Q9GZN0 (GPR88), a rhodopsin-like GPCR clustered with class III GPCRs. Interestingly, receptors for which the orthosteric binding site is not located in the TM domain (Adhesion, Secretin and Glutamate families) are, nevertheless, grouped into homogeneous clusters. Relating cluster members to precise molecular features is here greatly facilitated by the analysis of a small subset of amino acids. For each of the 22 clusters, there is often a clear relationship between known ligand chemotypes (e.g. amines, carboxylic acids, phosphates, peptides, eicosanoids, and lipids) and the cognate TM cavities. For example, receptors for bulky ligands (e.g. phospholipids, prostanoids) have a TM cavity significantly larger than that of receptors for smaller compounds (e.g. biogenic amines, nucleotides). Receptors for charged ligands (cationic amines, phosphates, mono- and dicarboxylic acids) always present, among the 30 critical residues, one or more conserved amino acids exhibiting the opposite charge (e.g. Asp3.32 for biogenic amines; Asp4.60/Glu7.39 for chemokines; Arg3.29/Lys6.55/Arg7.35 for nucleotides).
Our clustering approach implies two assumptions: 1) the overall fold of the 7-TM domain around the binding cavity has been conserved during evolution; 2) critical hotspots spread over the 7-TM domain repeatedly account for ligand binding. Although solid biostructural data for the three most important GPCR classes (class I, II, III) are missing, numerous experimental data do provide evidence in favor of strong similarities among many GPCRs: 1) residues known to affect small molecular-weight ligand binding to unrelated GPCRs are mostly spread among the herein selected 30 residues, suggesting a common architecture of the TM pocket; 2) many known ligands are promiscuous for even unrelated GPCRs and are usually anchored through so-called privileged structures to common subpockets of different GPCRs.32 Of course, we are aware that class II and class III GPCRs exhibit an additional orthosteric site located outside the 7-TM bundle. Therefore, conclusions drawn herein only apply to the 7-TM binding site.
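The cavity-driven clustering described above can be illustrated with a minimal sketch. It assumes that each receptor is represented by an ungapped string of cavity-lining residues and that the distance between two receptors is the fraction of mismatching positions; the toy fingerprints, the names `p_distance` and `upgma`, and the distance measure itself are illustrative assumptions, not the published protocol (which additionally uses bootstrap resampling).

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of mismatching positions between two ungapped fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(fingerprints):
    """Build a nested-tuple tree by average-linkage (UPGMA) clustering.

    fingerprints: mapping receptor name -> ungapped cavity fingerprint.
    """
    # Each cluster is keyed by its (sub)tree; the value lists its leaf names.
    clusters = {name: [name] for name in fingerprints}

    def avg_dist(c1, c2):
        # UPGMA distance: mean of all leaf-to-leaf distances between clusters.
        pairs = [(i, j) for i in clusters[c1] for j in clusters[c2]]
        total = sum(p_distance(fingerprints[i], fingerprints[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters (first pair wins on ties).
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))

# Hypothetical 4-residue fingerprints (real ones would span 30 positions).
toy = {"A": "DWFY", "B": "DWFF", "C": "RKHQ", "D": "RKHN"}
print(upgma(toy))
```

In this toy case the two receptors sharing an acidic anchor group together, as do the two sharing basic residues, which mirrors the observation that charged ligands select cavities with conserved residues of opposite charge.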
3. Screening Target Libraries

Provided that a target library has been set up, two screening methods are possible (Fig. 6). In a 1-D screening, a query enclosing amino acid sequence information (e.g. a fingerprint) is used to parse a database of family-specific alignments in order to retrieve interesting targets. In a 3-D screening, the 3-D structure of either a ligand or a known active site is used to sample 3-D structures or homology models. Both applications are detailed in the following sections.

3.1. 1-D screening

Simple 1-D screening is less precise than 3-D screening but also less sensitive to errors. When applied to entire target families (e.g. GPCRs, kinases), its accuracy depends only on the quality of the sequence alignment, which is generally much higher than that of 3-D structural models. Assuming that similar ligands should bind to similar cavities, browsing a database of sequence alignments can easily provide access to reliable information if specific fingerprints are already known. Three possible applications of 1-D screening of a GPCR target library are presented here.

3.1.1. Searching for orthologs/paralogs

The amino acid sequences of GPCRs are extremely variable in length (from 290 to 6300 residues for human GPCRs), notably at extra- and intracellular loops. Basing receptor comparisons on full sequence alignments may thus lead to unreliable results. Comparing the above-defined TM cavity-lining residues is much more appropriate. For any GPCR target of interest,
Figure 6. Target library screening flowchart.
[Flowchart labels: Query (sequence, fingerprint, active site, ligand); Search (alignments, structures); Target List; Output.]
these 30 residues can be identified quite unambiguously, at least for the rhodopsin-like GPCRs, as several class-specific TM fingerprints previously identified in this family of receptors can guide the sequence alignment.29 As an example, we have been looking for the human ortholog(s) of a gene product from C. elegans (Y22D7AR_13) in order to predict the functional role of this presumed GPCR. Blasting its full amino acid sequence against human GPCRs leads to ambiguous conclusions because the level of sequence identity with the closest human GPCRs is low (usually in the 15–30% range) and several candidates are possible (Table 1). Looking at local sequence identity within a set of 30 TM cavity-lining residues provides an answer that is easier to interpret because the sequence identities with the input query are much higher (above 70% for the first three 5-HT receptors; Table 1). Since 7 out of the top 10 ranked candidates were 5-HT receptors, the C. elegans gene product was predicted to be a receptor for serotonin, which was further experimentally validated.33 The proposed approach has the merit of being extremely fast (a few ms) but requires the a priori identification of the 7-TMs and a good sequence alignment of the latter domain. Therefore, the presence of TM fingerprints (usually present in nearly all entries)29,30 in the input query is a prerequisite.

Table 1 Searching for the 10 Closest Human Orthologs of the C. elegans Y22D7AR_13 Gene Product
        Full Sequence BLAST (a)                TM-Cavity Search (b)
Rank    Receptor    Sequence Identity, %       Receptor    Sequence Identity, %
1       5-HT1B      29.6                       5-HT1A      72.7
2       5-HT1D      29.0                       5-HT7       72.7
3       5-HT1A      26.7                       5-HT5A      72.7
4       D2          25.4                       α1A         69.7
5       5-HT1E      24.8                       5-HT1B      69.7
6       α2A         24.0                       5-HT1D      69.7
7       α2C         23.8                       5-HT4       66.7
8       D3          23.6                       α1B         63.6
9       α2B         21.1                       α1D         60.6
10      M2          16.8                       5-HT1E      60.6

(a) Sequence comparison achieved using standard settings of the BLASTP program (http://services.bioasp.nl/blast/cgi-bin/blast.cgi).
(b) Sequence comparison achieved using our in-house GPCR find program (http://bioinfo-pharma.u-strasbg.fr/gpcrdb/jss.html).
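The TM-cavity search reported in Table 1 amounts to ranking library entries by sequence identity over the aligned cavity-lining positions. The following is a minimal sketch of that idea, not the in-house program itself: the function names and the five-residue toy fingerprints (`REC_A` to `REC_C`) are hypothetical, and the fingerprints are assumed to be already aligned and ungapped.

```python
def percent_identity(query, target):
    """Percentage of identical residues between two aligned, ungapped fingerprints."""
    if len(query) != len(target):
        raise ValueError("fingerprints must have equal length")
    matches = sum(q == t for q, t in zip(query, target))
    return 100.0 * matches / len(query)

def rank_by_cavity_identity(query, library, top=10):
    """Return the `top` library entries ranked by decreasing identity to the query.

    library: mapping receptor name -> cavity fingerprint.
    Ties are broken alphabetically so that the ranking is reproducible.
    """
    scored = [(name, percent_identity(query, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]

# Hypothetical fingerprints; real queries would use the cavity-lining positions.
library = {"REC_A": "DWFYS", "REC_B": "DWFYA", "REC_C": "RKHQA"}
for name, identity in rank_by_cavity_identity("DWFYS", library):
    print(f"{name}\t{identity:.1f}")
```

Because only a few dozen positions are compared per entry, such a search stays fast even for a library of several hundred receptors.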
3.1.2. Computer-guided target deorphanization

A TM cavity-biased phylogenetic tree offers the opportunity to navigate in target space without the necessity to rely on questionable 3-D structures. Receptors close in target space can be expected to bind ligands close in chemical space. Known GPCR ligands are thus a good starting point for deorphanizing receptors predicted to be close enough to liganded receptors (Table 2). For example, focusing our cavity-based tree on two related orphan receptors (GPR19, GPR83) predicts a significant relationship to three tachykinin receptors (NK1R, NK2R, NK3R; Fig. 7). Likewise, GPR54 is predicted to be close to three galanin receptor subtypes (GALR1, GALR2, GALR3). Therefore, a rational start to find ligands for these three orphans would be first to test known ligands for neurokinin and galanin receptors, respectively. An experimental validation of this approach has recently been reported by scientists at 7TM-Pharma, who identified ligands for the CRTh2 (GPR44) receptor by evaluating angiotensin 2 receptor (AG2R, AG2S) ligands, the corresponding targets being close in terms of the 7-TM cavity.34

Table 2 Possible Ligand Source for some Orphan GPCRs
Orphan Receptor(s) (a)     Cluster (b)           Source
GPR88, GABABL              Glutamate             GABA-B allosteric ligands
GPRC6a, Q8TDU1             Glutamate             CaSR allosteric ligands
LRG4, LRG5, LRG6           Glycoproteins         LH/FSH nonpeptide ligands
GPR119                     Lipids                Cannabinoid receptor ligands
GPR19, GPR83               Peptides              Tachykinin receptor ligands
GPR54                      Peptides              Galanin receptor ligands
GPR154, PKR1, PKR2         Vasopeptides          Oxytocin/vasopressin receptor ligands
PNR                        Amines                Biogenic amine receptor ligands
GPR39                      Brain-gut peptides    Neuromedin U receptor ligands
CCRL2, RDC1                Chemokines            Chemokine receptor ligands
GPR7, GPR8                 Opiates               Somatostatin receptor ligands
GPR15, GPR25, GPR44        Chemoattractants      Angiotensin II receptor ligands
EBI2, GPR92, P2RY5         Purines               LPC/SPC receptor ligands
GPR171, GPR87              Purines               Purinergic nucleotide receptor ligands
GPR17, GPR34, FK79         Purines               Cysteinyl leukotriene receptor ligands

(a) Receptors are labeled according to their UniProt entry names.
(b) For cluster definition, see Ref. 30.
Figure 7. Close-up of the peptide receptors cluster.
[Tree leaves shown: NK1R, NK2R, NK3R with the orphans GPR19 and GPR83; GALR1, GALR2, GALR3 with the orphan GPR54. Orphan receptors are marked with "?".]
3.1.3. Matching target space with ligand space

GPCR ligands sharing a common privileged structure and exhibiting promiscuous binding to unrelated GPCRs are currently an important source for GPCR library design. Assuming that conserved moieties of the ligands are likely to bind to conserved subsites of the targets,32 matching privileged structures with TM hotspots can be achieved very easily without biasing the match by a manual or automated 3-D docking. As an example, biphenyltetrazoles and biphenylcarboxylic acids (Fig. 8) are known to bind to at least six GPCRs (AG22, AG2R, AG2S, GHSR, L4R1, L4R2).35−37 Fine details of the 3-D recognition of this privileged substructure by GPCR hotspots have recently been proposed through a thorough mutagenesis-guided manual docking of several GPCR ligands.32 We propose here a much simpler approach leading to the same outcome: looking at the 30 residues lining the TM cavity of the latter six GPCRs allows us to clearly identify putative TM residues able to interact with this substructure (Fig. 8). Conserved aromatic residues likely to interact with the biaryl moiety cluster between TMs 6 and 7 (Phe6.44, Trp6.48, Phe/Tyr/His6.51, Phe/Tyr7.43). A positively charged residue that probably interacts with the bioisosteric tetrazole and carboxylate groups should be located near the aromatic cluster. Hence, three basic residues (Lys5.42, Arg6.55, and Arg7.35) fulfil this
Figure 8. Matching privileged structures of known GPCR ligands to TM hotspots. An in-house GPCR ligand database is searched to retrieve privileged structures common to multiple GPCRs and to find conserved residues within the 7-TM cavity of selected entries. Browsing the in-house GPCR cavity database allows retrieval of new GPCR entries satisfying the query and likely to accommodate the privileged structure.
[Workflow shown: 1. database search (privileged structures: biphenyltetrazole, biphenylcarboxylic acid); 2. cavity alignment of the 6 known GPCR targets (AG2R, AG2S, AG22, GHSR, L4R1, L4R2); 3. extraction of TM hotspots; 4. cavity search yielding 17 putative new GPCR targets (APJ, C3AR, C5AR, C5L2, CML1, FML1, FMLR, G2A, GALR1, GALR2, GPR1, GPR15, GPR44, MTLR, NTR1, Q9GZQ4, SPR1).]
requirement. Last, a polar side chain at position 6.52 (His/Gln) is conserved in the six investigated GPCRs and might H-bond to the acidic moiety of the privileged structure. We have thus clearly identified the same important anchoring residues as Bondensgaard et al.32 by simply looking at sequence alignments of TM cavity-lining amino acids, without relying on uncertain docking data. Searching our TM cavity database for additional GPCRs fulfilling the above-described requirements (Phe6.44, Trp6.48, Phe/Tyr/His6.51, Phe/Tyr7.43 and Lys5.42 or Arg6.55 or Arg7.35 and His/Gln6.52) allows us to extract 17 new GPCRs that might accommodate biphenyltetrazoles and biphenylcarboxylic acids (Fig. 8). Among the putative targets are 10 chemoattractant receptors (APJ, C3AR, C5AR, C5L2, CML1, FML1, FMLR, GPR15, GPR44, and GPR1); 3 brain-gut peptide receptors (MTLR, NTR1, and Q9GZQ4); 2 cationic phospholipid receptors (G2A, SPR1); and 2 peptide receptors (GALR1, GALR2). This target list encompasses receptors recently identified by Bondensgaard et al. (e.g. APJ, NTR1).32 It also suggests totally new putative targets for the investigated privileged structure, which might serve as a common scaffold for small-sized combinatorial libraries targeting the new receptor list.
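The hotspot query stated above can be written down as a simple residue filter. The sketch below assumes each receptor cavity is stored as a mapping from position label (e.g. "6.48") to one-letter residue code; that storage format, the function name `matches_query`, and the toy cavities are assumptions for illustration, whereas the residue requirements are those listed in the text.

```python
# Aromatic cluster between TMs 6 and 7: every position must be satisfied.
AROMATIC_RULES = {
    "6.44": {"F"},
    "6.48": {"W"},
    "6.51": {"F", "Y", "H"},
    "7.43": {"F", "Y"},
}
# Basic anchor for the tetrazole/carboxylate: any one position suffices.
BASIC_ALTERNATIVES = [("5.42", "K"), ("6.55", "R"), ("7.35", "R")]
# Polar side chain proposed to H-bond to the acidic moiety.
POLAR_RULE = ("6.52", {"H", "Q"})

def matches_query(cavity):
    """Return True if a cavity (position -> residue) satisfies the hotspot query."""
    aromatic_ok = all(
        cavity.get(pos) in allowed for pos, allowed in AROMATIC_RULES.items()
    )
    basic_ok = any(cavity.get(pos) == res for pos, res in BASIC_ALTERNATIVES)
    polar_ok = cavity.get(POLAR_RULE[0]) in POLAR_RULE[1]
    return aromatic_ok and basic_ok and polar_ok

# Hypothetical cavities, restricted to the positions used by the query.
candidate = {"6.44": "F", "6.48": "W", "6.51": "H", "7.43": "Y",
             "5.42": "K", "6.55": "A", "7.35": "A", "6.52": "Q"}
no_basic = dict(candidate, **{"5.42": "A"})
print(matches_query(candidate), matches_query(no_basic))
```

Applying such a filter to every entry of a cavity database would return the list of receptors compatible with the privileged structure, without any docking step.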
3.2. 3-D screening

High-throughput docking of large chemical libraries37 has been established as a promising tool for identifying new hits from protein 3-D structures coming mostly from X-ray diffraction data3 but also from homology modeling.39 Computational chemistry is slowly but increasingly being employed to find out which ligands of a large library are likely to bind to a protein of interest.1 Surprisingly, the opposite question is still an issue: given a known ligand, is it possible to recover its most likely target(s)? Answering this question using the above-mentioned docking approach implies first the development of a collection of protein active sites (see Sec. 2), and then the use of an inverse docking tool able to dock a single ligand to multiple macromolecules. Although inverse screening uses the same paradigm as ligand screening (predicting the most likely ligand-target interactions from molecular docking), docking a single ligand to a target library is more difficult to set up than classical docking of a ligand library to a single target. One has to automate the generation of input files (3-D coordinates of the targets and/or of the cognate binding sites; docking configuration file) for a large array of heterogeneous targets, which is much more difficult than setting up a reliable set of coordinates for a ligand library. Notably, protein and binding site 3-D coordinates should be prepared automatically and should be rendered suitable for docking by removal of any additional molecule (solvent, ion, cofactor) not essential for ligand binding. As an inverse screening tool, we have chosen the GOLD docking software40 for two main reasons: 1) it is one of the most robust and accurate docking tools in our hands41; 2) it only requires a single configuration file (gold.conf) whose distribution over a target library is easy to automate.
3.2.1. 3-D screening of the PDB: proof of concept

The first validation of inverse screening was to recover, among 2150 entries of the sc-PDB (release 1, Feb. 2004), the true target(s) of either selective (e.g. biotin, 6-hydroxyl-1,6-dihydropurine ribonucleoside) or promiscuous ligands (e.g. 4-hydroxytamoxifen, methotrexate). Screening the sc-PDB database allowed unambiguous recovery of the true targets of all four ligands.15 When screening our database for potential targets of biotin, 7 out of the 10 streptavidin entries present in the sc-PDB were ranked within the top eight positions with very good averaged fitness scores (Fig. 9). Interestingly, the
Figure 9. Inverse screening of the sc-PDB database for finding the target of two small molecular weight ligands: top panel, biotin; bottom panel: 4-hydroxy tamoxifen. Filled stars indicate the different sc-PDB copies of the true target (top: streptavidin, bottom: estrogen receptor α). Filled triangles and squares indicate known secondary targets of 4-hydroxy tamoxifen (3-α hydroxysteroid dehydrogenase and NADP[H] quinone oxidoreductase, respectively). Targets are ranked by decreasing GOLD fitness scores averaged over 10 independent docking runs.
[Both panels plot average fitness versus rank. Legend, biotin panel: streptavidin, others. Legend, 4-hydroxy tamoxifen panel: estrogen receptor α, 3α-hydroxysteroid dehydrogenase, NADP[H] quinone oxidoreductase, others.]
three streptavidin copies with lower rankings (90th, 195th, 315th) correspond to either an active site in which a key amino acid (Asp128) has been mutated (1swt) or alternative binding sites (peptide binding sites for 1vwr and 1rsu). Altogether, the proposed inverse screening protocol is able to unambiguously rank streptavidin as the most likely target for biotin, with a percentage of coverage of 70% (7 out of 10) among the top 10 (0.5%) positions. Likewise, the two sc-PDB entries of the estrogen receptor α were ranked at the top 2 positions when screening for the target of 4-hydroxy tamoxifen (Fig. 9). Interestingly, two other targets (NADP[H] quinone oxidoreductase, 3α-hydroxysteroid dehydrogenase), each ranked at least twice among the top 25 scorers, are true minor targets of this ligand. Therefore, inverse screening of target databases could also be viewed as a computational filter to roughly predict the selectivity profile of a given ligand and thus to anticipate putative side effects. When compared to random screening, a significant enrichment in the true target is observed among the top scorers (Fig. 10). Analyzing both the enrichment factor and the percentage of coverage of known targets indicates that the best compromise can be reached by selecting a very small fraction (0.5%) of the sc-PDB database. Even for the rather difficult case of methotrexate, selecting the top 2.6% scorers would allow 40% of
Figure 10. Percentage of recovery of known targets as a function of the top-scoring fraction selected by inverse screening (green line) and by random picking (red line). The percentage of coverage of known targets is the ratio, expressed as a percentage, between the number of true target entries recovered by inverse screening at a defined top-scoring fraction and the total number of true target entries in the sc-PDB dataset.
all dihydrofolate reductase entries to be selected, with a 15-fold enrichment with respect to random screening.

3.2.2. 3-D screening of the PDB: test case

Having validated the inverse screening approach for four unrelated ligands, a prospective screening was carried out to identify the putative targets for representative compounds of a scaffold-focused combinatorial library (Fig. 11). Release 1 of the sc-PDB (2148 entries) was screened to prioritize targets likely to accommodate five representative compounds from the library (Table 3). In the sc-PDB, a target is defined either as an enzyme from the PDB with a unique EC number, or a non-enzymatic protein with a unique name
Figure 11. The 1,3,5-triazepane-2,6-dione scaffold with 5 diversity points (R1 to R5).
Table 3 Predicted Targets for 5 Compounds from a Triazepanedione Combinatorial Library

Target        E.C. number (a)    N (b)
Aconitase     4.2.1.3            7
DAAO (d)      1.4.3.3            2
EST (e)       2.8.2.4            2
GT (f)        2.4.1              2
HPRT (g)      2.4.2.8            6
MA (h)        3.4.11.18          5
PLA2 (i)      3.1.1.4            8
PNP (j)       2.4.2.1            6
TK (k)        2.7.1.21           5

Target rate, % (c), columns Cpd1 to Cpd5 (values listed in their order of appearance; the assignment of individual values to targets and compounds is not recoverable here): 43, 50, 50; 29, 50; 14; 100, 100, 33, 20, 13; 50, 50, 17, 100, 17; 80; 25, 83; 20.

(a) Enzyme commission number.
(b) Number of target copies in the sc-PDB (release 1, Feb. 2004).
(c) Target rate: percentage of targets ranked in the top 2% scoring entries.
(d) D-amino-acid oxidase.
(e) Estrogen sulfotransferase.
(f) Lipopolysaccharide 3-alpha-galactosyltransferase.
(g) Hypoxanthine-guanine phosphoribosyltransferase.
(h) Methionine aminopeptidase.
(i) Phospholipase A2.
(j) Purine nucleoside phosphorylase.
(k) Thymidine kinase.
according to our previous annotation of the database. Differences related to species, isoforms or mutations are thus not considered in our classification scheme. For each of the five investigated compounds, a target was selected if it fulfils any of the three following criteria: 1) 50% of target entries present in the sc-PDB were scored, according to the average GOLD fitness score, among the top 2% scoring entries; 2) the average fitness score for all entries
of the corresponding target was above 50; 3) two entries of the same target were scored in the top 2% scoring entries. Out of the nine targets (Table 3) fulfilling this selection procedure, five were finally selected for biological evaluation (EST, MA, PLA2, PNP, TK; Table 3). About 24 compounds, including the five representatives used for inverse screening, were tested for inhibition of the above-described five enzymes. Micromolar inhibitors from this small library could be found for three out of the five predicted targets (MA, PNP, PLA2). A detailed description of the structures and corresponding inhibition constants will be reported elsewhere.42

3.2.3. 3-D screening of the hGPCRdb library: proof of concept

Screening the collection of human GPCRs to identify the receptors of known ligands is quite a demanding task given the current limited accuracy of GPCR models. However, we have tried to recover, from the GPCR target database, either the known receptor of a selective purinergic P2Y1 antagonist (MRS-2179) or the receptors of a promiscuous antagonist (NAN-190, Fig. 12) known to bind to several monoamine receptors with nanomolar affinities (α1A, D2, D3, 5-HT1A, 5-HT1D, 5-HT1F, 5-HT2A, 5-HT2C, 5-HT7). When screening the protein library for putative receptors of MRS-2179, the P2Y1 receptor is indeed ranked among the top scorers (7th, Fig. 12A).
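The three target selection criteria used in the sc-PDB test case (Sec. 3.2.2) can be sketched as a small filter over averaged docking scores. The sketch assumes that criterion 1 means "at least 50%" and criterion 3 means "at least two entries"; the function name `select_targets`, the tuple layout and the toy data are hypothetical.

```python
def select_targets(entries, top_fraction=0.02, min_rate=50.0,
                   min_mean_fitness=50.0, min_top_hits=2):
    """Select targets fulfilling any of the three criteria.

    entries: list of (entry_id, target_name, average_fitness) tuples,
             one tuple per screened library entry.
    """
    ranked = sorted(entries, key=lambda e: e[2], reverse=True)
    n_top = max(1, round(top_fraction * len(ranked)))
    top_ids = {entry_id for entry_id, _, _ in ranked[:n_top]}

    # Group entries (copies) by the target they belong to.
    by_target = {}
    for entry_id, target, fitness in entries:
        by_target.setdefault(target, []).append((entry_id, fitness))

    selected = []
    for target, members in by_target.items():
        hits = sum(1 for entry_id, _ in members if entry_id in top_ids)
        rate = 100.0 * hits / len(members)            # criterion 1
        mean_fitness = sum(f for _, f in members) / len(members)  # criterion 2
        if rate >= min_rate or mean_fitness > min_mean_fitness or hits >= min_top_hits:
            selected.append(target)                   # criterion 3 is `hits`
    return sorted(selected)

# Hypothetical library of 100 entries: three targets of interest plus filler.
entries = [("t1_a", "T1", 70.0), ("t1_b", "T1", 30.0),
           ("t2_a", "T2", 69.0), ("t2_b", "T2", 20.0),
           ("t2_c", "T2", 20.0), ("t2_d", "T2", 20.0),
           ("t3_a", "T3", 55.0)]
entries += [(f"f{i}", f"F{i}", 10.0) for i in range(93)]
print(select_targets(entries))
```

Here T1 passes through the target rate (one of its two copies is in the top 2%), T3 through its mean fitness, and T2 fails all three criteria despite having one high-scoring copy.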
Figure 12. Ranking of the true receptor(s) of a selective ligand (top panel: MRS-2179, P2Y1 receptor antagonist) and of a promiscuous ligand (bottom panel: NAN-190, antagonist of the dopamine D2 and D3 receptors, serotonin 5-HT1A, 5-HT1D, 5-HT1F, 5-HT2A, 5-HT2C, 5-HT7 receptors, and adrenergic α1A receptor). Known receptor(s) are indicated by a dark ball. Targets are ranked by decreasing GOLD fitness scores averaged over 10 independent docking runs.
[Both panels plot average fitness versus rank (0 to 250); the chemical structures of MRS-2179 and NAN-190 are shown as insets.]
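The ranking used in Figs. 9, 12 and 13 (targets sorted by decreasing fitness averaged over independent docking runs) can be sketched in a few lines. Only the averaging and sorting step is shown; the function name `rank_targets`, the entry names and the scores are hypothetical, and how the individual fitness scores are parsed from the docking output is left out.

```python
from statistics import mean

def rank_targets(fitness_runs):
    """Rank target entries by decreasing mean fitness.

    fitness_runs: mapping entry name -> list of fitness scores,
                  one score per independent docking run.
    Returns a list of (entry, mean_fitness) tuples, best first.
    """
    averaged = {entry: mean(scores) for entry, scores in fitness_runs.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores for three library entries, two runs each.
runs = {
    "entry_a": [60.0, 62.0],
    "entry_b": [40.0, 44.0],
    "entry_c": [55.0, 55.0],
}
for rank, (entry, score) in enumerate(rank_targets(runs), start=1):
    print(rank, entry, score)
```

Averaging over several runs dampens the run-to-run variability of a stochastic docking search before entries are compared across the library.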
Five out of the nine known targets of NAN-190, the second ligand investigated herein, are ranked in the top 25 positions, and seven out of nine in the top 31 positions (Fig. 12B). The worst-ranked true receptor (5-HT1A) is ranked 68th. The fine selectivity profile for the whole 5-HT receptor family is unfortunately not fully addressed, as the 12 5-HT receptor subtypes currently present in our database are all clustered among the top 68 positions. For both ligands, ca. 80% of the GPCRs closely related to the true target(s) (P2Y receptors for MRS-2179; 5-HT receptors for NAN-190) are clustered in the top 20% scorers. Thus, the current inverse screening procedure is aimed more at identifying the likely receptor subfamily (dopamine, serotonin, adenosine, etc.) than at precisely mapping the individual preference for highly related GPCR subtypes. It could thus be used as a computational filter to study the most likely targets when addressing the selectivity profile of a given compound or when trying to identify the yet unknown receptor of a molecule showing promising in vivo biological effects. Although the hGPCR database contains ground-state models suitable for docking antagonists and inverse agonists,43 we checked whether the same protocol could be applied to identify the receptor of endogenous ligands. The hGPCR database was therefore screened to recover the receptor of succinic acid (Fig. 13), a recently identified ligand for the previously orphan GPR91 receptor.44 Although ground-state 3-D models were screened, the native receptor was surprisingly ranked among the top-scoring receptors
Figure 13. Ranking of the true receptor (GPR91, filled star) of an endogenous ligand (succinic acid) by inverse screening of a GPCR 3-D library. Targets are ranked by decreasing GOLD fitness scores averaged over 10 independent docking runs.
[Plot of averaged fitness (n=10) versus rank (0 to 100); GPR91 is labeled and the structure of succinic acid is shown as an inset.]
(11th) in our inverse screening. Again, the true receptor was not ranked first but high enough in a shortlist that could be experimentally evaluated.
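The two measures used to judge such shortlists, the percentage of coverage defined in the caption of Fig. 10 and the enrichment over random picking, can be computed as follows. The enrichment is taken here as the ratio between the observed coverage and the coverage expected from random picking of the same fraction, which is an assumed (though common) definition; with the methotrexate figures quoted earlier, 40% coverage at a 2.6% fraction gives 40/2.6 ≈ 15, in line with the reported 15-fold enrichment. Function and entry names are hypothetical.

```python
def coverage_and_enrichment(ranked_entries, true_entries, top_fraction):
    """Return (coverage in %, enrichment factor) for a top-scoring fraction.

    ranked_entries: entry names sorted by decreasing score.
    true_entries:   set of entry names that are copies of the true target.
    top_fraction:   fraction of the library selected, e.g. 0.005 for 0.5%.
    """
    n_top = max(1, round(top_fraction * len(ranked_entries)))
    hits = sum(1 for entry in ranked_entries[:n_top] if entry in true_entries)
    coverage = 100.0 * hits / len(true_entries)
    # Random picking of n_top entries recovers, on average, this percentage.
    random_coverage = 100.0 * n_top / len(ranked_entries)
    return coverage, coverage / random_coverage

# Hypothetical ranked library of 1000 entries with 5 copies of the true target.
ranked = [f"e{i}" for i in range(1000)]
true_copies = {"e0", "e5", "e500", "e900", "e999"}
print(coverage_and_enrichment(ranked, true_copies, 0.01))
```

Reporting both numbers together is useful because a very small fraction can show a large enrichment while still missing most copies of the true target.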
4. Conclusions

Virtual screening of target libraries offers new opportunities to prioritize a few targets for experimental evaluation by applying simple ligand-based or target-based queries. There is no reason why single-ligand docking to a wide array of targets should not be as useful as classical docking of ligand libraries to a single protein, assuming comparable accuracies of input data. The increasing coverage of target space by the Protein Data Bank, as well as the development of accurate comparative models describing entire protein families, is likely to favor target screening in the near future. Pharmacophore-based and protein-based computational filters are widely used virtual screening strategies.28−39 One could imagine very similar scenarios for target screening, where interesting cavities are first filtered by similarity measurements to a binding site of interest,45,46 and then selected by ligand docking. Furthermore, orthogonal clustering of target families and of their ligands should soon easily provide precise chemogenomic information for selecting the most interesting compounds/scaffolds according to a predefined selectivity profile. Addressing potency and selectivity simultaneously in hit evaluation will undoubtedly afford added-value molecules in early drug discovery processes.
Acknowledgments

I would like to thank several former and current collaborators of the Bioinformatics group (C. Bissantz, G. Bret, E. Kellenberger, A. Logean, P. Muller, N. Paul, and C. Schalon) for their invaluable work in the development of target libraries. Financial support of the French Ministry of Research and Technology, and of the Alsace-Lorraine Genopole is acknowledged as well as the allocation of computing resources at the Centre Informatique National de l’Enseignement supérieur (CINES, Montpellier, France).
References

1. Shoichet BK. (2004) Nature 862.
2. Bajorath J. (2002) Nat. Rev. Drug Discov. 882.
3. Kitchen DB, Decornez H, Furr JR, Bajorath J. (2004) Nat. Rev. Drug Discov. 935.
4. Lipinski C, Hopkins A. (2004) Nature 855.
5. Attwood TK, Bradley P, Flower DR, et al. (2003) Nucl. Acids Res. 400.
6. Lichtarge O, Bourne H, Cohen F. (1996) J. Mol. Biol. 342.
7. Berman HM, Westbrook J, Feng Z, et al. (2000) Nucl. Acids Res. 235.
8. Hendlich M, Bergner A, Gunther J, Klebe G. (2003) J. Mol. Biol. 607.
9. Golovin A, Dimitropoulos D, Oldfield T, et al. (2005) 190.
10. Roche O, Kiyama R, Brooks CL III. (2001) J. Med. Chem. 3592.
11. Kramer B, Rarey M, Lengauer T. (1999) Proteins 228.
12. Nissink WM, Murray C, Hartshorn M, et al. (2002) Proteins 457.
13. Kitajima K, Ahmad S, Selvaraj S. (2002) Genome Informatics 498.
14. Stuart AC, Ilyin VA, Sali A. (2002) Bioinformatics 18.
15. Paul N, Bret G, Kellenberger E, et al. (2004) Proteins 671.
16. Laskowski RA, Chistyakov VV, Thornton JM. (2005) Nucl. Acids Res. D26.
17. Apweiler R, Bairoch A, Wu CH, et al. (2004) Nucl. Acids Res. D115.
18. Bairoch A. (2000) Nucl. Acids Res. 304.
19. Schwalbe H, Wess G. (2002) ChemBioChem 915.
20. Wise A, Gearing K, Rees S. (2002) Drug Discov. Today 235.
21. Venter JC, et al. (2001) Science 291.
22. Palczewski K, Kumasaka T, Hori T. (2000) Science 289.
23. Petrel C, Kessler A, Maslah F. (2003) J. Biol. Chem. 49487.
24. Malherbe P, Kratochwil N, Knoflach F. (2003) J. Biol. Chem. 8340.
25. Varady J, Wu X, Fang X, et al. (2003) J. Med. Chem. 4377.
26. Evers A, Klebe G. (2004) J. Med. Chem. 5381.
27. Becker OM, Marantz Y, Shacham S, et al. (2004) Proc. Natl. Acad. Sci. USA 11304.
28. Evers A, Klabunde T. (2005) J. Med. Chem. 1088.
29. Bissantz C, Logean A, Rognan D. (2004) J. Chem. Info. Comput. Sci. 1162.
30. Surgand JS, Rodrigo J, Kellenberger E, Rognan D. Proteins (in press).
31. Fredriksson R, Lagerstrom MC, Lundin LG, Schioth HB. (2003) Mol. Pharmacol. 1256.
32. Bondensgaard K, Ankersen M, Thogersen H, et al. (2004) J. Med. Chem. 888.
33. Segalat L. (personal communication).
34. http://www.ismc2004.dk/index.php/Session_3D__A_physicogenetic_m/121/0/
35. Ji H, Leung M, Zhang Y. (1994) J. Biol. Chem. 16533.
36. Smith RG, Cheng K, Schoen WR, et al. (1993) Science 1640.
37. Reiter LA, Koch K, Piscopio AD, et al. (1998) Bioorg. Med. Chem. Lett. 1781.
38. Halperin I, Ma B, Wolfson H, Nussinov R. (2002) Proteins 409.
39. Evers A, Klebe G. (2004) Angew. Chem. Intl. Ed. Engl. 248.
40. Verdonk ML, Cole JC, Hartshorn MJ, et al. (2003) Proteins 609.
41. Kellenberger E, Rodrigo J, Muller P, Rognan D. (2004) Proteins 225.
42. Muller P, Lena G, Bezzine S, et al. J. Med. Chem. (submitted).
43. Bissantz C, Bernard P, Hibert M, Rognan D. (2003) Proteins 5.
44. He W, Miao FJ, Lin DC. (2004) Nature 188.
45. Jambon M, Imberty A, Deleage G, Geourjon C. (2003) Proteins 137.
46. Weber A, Casini A, Heine A, et al. (2004) J. Med. Chem. 550.
6
New Methods for Similarity-based Virtual Screening
Jérôme Hert, Peter Willett∗ and David J. Wilton
1. Introduction
Virtual screening involves the use of a computational scoring scheme to rank molecules in decreasing order of probability of activity in some bioassay of interest. The molecules are then assayed in the resulting rank order, thus ensuring that molecules with a high probability of activity are tested at as early a stage as possible in a lead-discovery programme. There are many ways in which virtual screening can be carried out [1,2]; here, we focus on the use of similarity searching for this purpose [3]. This involves taking a reference structure, i.e. a molecule with the required activity such as a hit from a high-throughput screening (HTS) programme, and then searching a database to find the molecules that are most similar to it.
The rationale for similarity-based virtual screening is the similar property principle [4,5] (analogous ideas also underlie structural approaches to molecular diversity and chemogenomics [6,7]). The principle states that molecules that have similar structures are likely to have similar properties, and the nearest neighbors (NNs) of a bioactive reference structure (i.e. the top-ranked molecules from the similarity search) are hence prime candidates for biological screening. There are, of course, many exceptions to the principle [8]; for example, even very minor structural variations can sometimes have a drastic effect on the levels of activity in a set of analogues. However, if the principle were not of general applicability, it would be difficult to develop systematic approaches for the identification of novel bioactive molecules, and there is now a substantial body of evidence to support its use in lead-discovery programs [5,9–12].
∗ Corresponding author: Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK.
A measure of inter-molecular structural similarity has two principal components: the structural representation used to characterize the molecules, and the similarity coefficient used to compute the degree of resemblance between pairs of such representations. While very many types of approach, in both 2D and 3D, have been suggested for the quantification of molecular similarity [13], most current operational systems for similarity searching are based on the use of molecular fingerprints (binary vectors encoding the presence or absence of 2D substructural fragments within a molecule) and of the Tanimoto coefficient to measure the degree of similarity between two such fingerprints [3]. Such systems are widely used, since they are both effective in operation and extremely efficient, and in this chapter we discuss ways of further increasing the effectiveness of similarity-based virtual screening using this well-established approach. We begin by describing how similarity searching can be carried out when not one but multiple reference structures are available for searching a database. The results of these studies provide insights that we have then used to increase the effectiveness of conventional similarity searching based on just a single reference structure. In this chapter, we discuss the principal findings from our investigations; full details are provided by Hert et al. [14–17].
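As an illustration only (this is not the software used in the studies reported here), fingerprint-based similarity searching can be sketched in a few lines of Python. The sketch assumes that each fingerprint is held as a set of the positions of its "on" bits; the function names `tanimoto` and `similarity_search` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(fp_a & fp_b)                # bits set in both fingerprints
    union = len(fp_a) + len(fp_b) - common   # bits set in at least one
    return common / union if union else 0.0  # assumption: two empty fingerprints score 0


def similarity_search(reference, database):
    """Rank database molecules (name -> fingerprint) by decreasing similarity."""
    scored = [(tanimoto(reference, fp), name) for name, fp in database.items()]
    # Highest similarity first; ties are broken by name to keep the order stable.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


ranking = similarity_search({1, 2, 3}, {"mol_a": {2, 3, 4}, "mol_b": {1, 2, 3}, "mol_c": {9}})
```

The nearest neighbors of the reference structure are simply the entries at the top of the returned list.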
2. Virtual Screening Using Multiple Reference Structures
The standard screening approach when several active molecules have been identified is pharmacophore mapping followed by 3D database searching [18]. This approach assumes that the active molecules have a common mode of action and that features that are common to all of the molecules describe the pharmacophoric pattern responsible for the observed bioactivity. This is a powerful technique but one that may not be applicable to the structurally heterogeneous hits that characterize typical HTS experiments or sets of competitor compounds drawn from the public literature. In such cases, it is appropriate to consider approaches based on 2D similarity searching and we present here a comparison of approaches for combining the structural information that can be gleaned from a small set of reference structures.
2.1. Virtual screening methods
2.1.1. Single fingerprint method
Our first, single fingerprint approach involves creating a single, combined fingerprint from the fingerprints of individual reference structures;
specifically, we have used an approach first described by Shemetulskis et al. in their work on Stigmata [19]. Given an input set of molecules, the method generates a modal fingerprint that seeks to capture the common chemical features present in the members of this training set. A bit j is set to "on" in the modal fingerprint if that bit is found in more than a user-defined threshold percentage of the training set molecules. The modal fingerprint is then used as a query and compared with the fingerprints for each of the compounds in the database using the Tanimoto coefficient. In what follows, we refer to searches carried out in this way as modal searches; a set of initial experiments [14] showed that optimal performance was obtained with a threshold of 40% using Unity 2D fingerprints [20]. We also carried out experiments using an alternative, weighted, approach that does not require the use of such a threshold value; however, this was found to be noticeably less effective in operation and will hence not be discussed further here.
2.1.2. Data fusion method
The second, data fusion, approach involves searching each reference structure separately and then combining the results to give a single output ranking of a database. Previous studies in chemoinformatics applications (where this combination approach is sometimes referred to as consensus scoring) have involved combining the results of several database searches using different descriptors or scoring functions [21–24]. Our application of data fusion is different: rather than carrying out searches for a single reference structure with multiple similarity measures, we carry out searches for multiple reference structures with a single similarity measure (i.e. one based on 2D fingerprints and the Tanimoto coefficient).
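The modal fingerprint of Sec. 2.1.1 can be illustrated with a minimal Python sketch. It assumes fingerprints stored as sets of on-bit positions and a threshold expressed as a fraction (0.40 for the 40% value quoted above); the function name `modal_fingerprint` is hypothetical and the code is not the Stigmata implementation.

```python
def modal_fingerprint(training_fps, threshold=0.40):
    """Return the bits set in more than `threshold` of the training-set fingerprints."""
    counts = {}
    for fp in training_fps:
        for bit in fp:
            counts[bit] = counts.get(bit, 0) + 1
    cutoff = threshold * len(training_fps)
    # "More than" the threshold percentage, hence the strict inequality.
    return {bit for bit, count in counts.items() if count > cutoff}


training = [{1, 2}, {1, 3}, {1, 2, 4}, {2, 5}, {1, 6}]
query = modal_fingerprint(training)
```

The resulting set is then used as an ordinary query fingerprint in a Tanimoto-based search of the database.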
Extensive preliminary experiments [14] showed that better results were obtained by combining the similarity scores associated with the molecules in a database than by combining the rankings associated with those similarity scores; this latter approach had been used in most previous applications of data fusion in chemoinformatics. The preliminary experiments also showed that the best rankings were obtained using the following fusion rule. Assume that some database molecule has a similarity score of $s_i$ with the i-th reference structure ($1 \le i \le n$, where n is the number of different reference structures); then the MAX fusion rule assigns the value
$$\max\{s_1, s_2, \ldots, s_i, \ldots, s_{n-1}, s_n\} \qquad (1)$$
to that database molecule, and the database is ranked in decreasing order of these maximum scores. We refer to this approach, first suggested by Schuffenhauer et al. [25], as group fusion (GF) [26].
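The MAX rule of Eq. (1) is straightforward to express in code. The following Python sketch is illustrative only; it assumes fingerprints held as sets of on-bit positions, and the names `tanimoto` and `group_fusion` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0


def group_fusion(references, database):
    """Score each database molecule by its maximum similarity to any reference (Eq. 1)."""
    fused = {
        name: max(tanimoto(ref, fp) for ref in references)
        for name, fp in database.items()
    }
    # Rank in decreasing order of the fused (maximum) score.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


references = [{1, 2, 3}, {7, 8}]
database = {"a": {1, 2, 3, 4}, "b": {7, 8}, "c": {9}}
ranking = group_fusion(references, database)
```

Each database molecule is thus ranked according to its closest reference structure, so a molecule need resemble only one of the actives to be ranked highly.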
2.1.3. Machine learning methods
Machine learning involves a training-set of known active and known inactive molecules that is used to train a decision function that can then be used to predict the (in)activity of test-set molecules, i.e. those that have not as yet undergone biological testing. The classical machine-learning approach in chemoinformatics is substructural analysis (or SSA). This was first described by Cramer et al. [27] and seeks to assign a weight to each bit (or substructure) in a fingerprint that describes that bit's differential occurrence in the active and inactive molecules in the training-set. The resulting weights are then used to rank the test database, with molecules at the top of this ranking being candidates for testing.
Many different substructural analysis weights have been described in the literature [28], but they are all based on some or all of the following pieces of information about the training-set: $A_j$ and $I_j$ are the numbers of active and inactive compounds with bit j set; $T_j$ is the total number of compounds with bit j set (so $T_j = A_j + I_j$); and $N_A$ and $N_I$ are the total numbers of active and inactive molecules, respectively, with $N_T$ being the total number of molecules (so $N_T = N_A + N_I$). In the present context, we do not have access to all of the necessary information as the training-set consists of just active molecules. However, if we restrict our attention to those weighting schemes that do not make explicit use of information about the inactives and also make the assumption that the overall characteristics of the training-set are mirrored by those of the entire database that is to be searched, then we can use the R1 weight [28]. This has the form
$$\log\left(\frac{A_j/N_A}{T_j/N_T}\right). \qquad (2)$$
In Eq. (2), $T_j$ is the total number of molecules in the database with bit j set and $N_T$ is the total number of molecules in the database (rather than the total numbers in the training-set, as in conventional substructural analysis).
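A minimal Python sketch of the R1 weight of Eq. (2) is given below for illustration. Two points are assumptions rather than statements from the text: a molecule is scored by summing the weights of the bits it sets, and bits never seen in the actives (for which the logarithm is undefined) contribute nothing. The names `r1_weights` and `ssa_score` are hypothetical.

```python
import math


def bit_counts(fps):
    """Count, for each bit, the number of fingerprints in which it is set."""
    counts = {}
    for fp in fps:
        for bit in fp:
            counts[bit] = counts.get(bit, 0) + 1
    return counts


def r1_weights(active_fps, database_fps):
    """R1 weight log((A_j / N_A) / (T_j / N_T)) for every bit set in some active."""
    n_a, n_t = len(active_fps), len(database_fps)
    a, t = bit_counts(active_fps), bit_counts(database_fps)
    return {
        bit: math.log((a_j / n_a) / (t[bit] / n_t))
        for bit, a_j in a.items()
        if t.get(bit, 0) > 0  # assumption: skip bits absent from the database
    }


def ssa_score(fp, weights):
    """Assumed scoring rule: sum of the weights of the bits set in the molecule."""
    return sum(weights.get(bit, 0.0) for bit in fp)


actives = [{1, 2}, {1, 3}]
database = [{1, 2}, {1, 3}, {2, 4}, {4}, {5}, {1}, {6}, {7}]
weights = r1_weights(actives, database)
```

Bits that are more frequent among the actives than in the database as a whole receive positive weights, so molecules rich in such bits rise to the top of the ranking.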
A recent comparison of several virtual screening approaches highlighted the general effectiveness of the binary kernel discrimination (BKD) method [29]. BKD was first applied to chemical applications by Harper et al. [30], and uses a kernel function that has been developed for handling binary data; specifically, given two molecules i and j represented by fingerprints containing N bits and differing in $d_{i,j}$ of those bits, Harper et al. define the following kernel function $K_\lambda(i, j)$:
$$K_\lambda(i, j) = \lambda^{N - d_{i,j}} (\lambda - 1)^{d_{i,j}}. \qquad (3)$$
In Eq. (3), λ is a smoothing parameter, the value of which must be defined or optimized. The fingerprint representing a database molecule, j, is matched against the fingerprints for each of the active and inactive molecules in the training-set and its score is then computed as
$$L(A|j) = \frac{\sum_{i \in \text{actives}} K_\lambda(i, j)}{\sum_{i \in \text{inactives}} K_\lambda(i, j)}. \qquad (4)$$
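The BKD score of Eq. (4) can be sketched as follows; this is an illustration under stated assumptions, not the implementation of Harper et al. Because (λ − 1) would be negative for a smoothing parameter below 1, the sketch assumes that the second factor of the kernel is meant as (1 − λ), the form in which this kernel is usually stated, with 0 < λ < 1. It also works with logarithms to avoid numerical underflow for long fingerprints and returns the logarithm of the ratio, which gives the same ranking. Fingerprints are sets of on-bit positions, and all function names are hypothetical.

```python
import math


def log_kernel(fp_i, fp_j, n_bits, lam):
    """log of lam**(N - d) * (1 - lam)**d, with d the number of differing bits."""
    d = len(fp_i ^ fp_j)  # symmetric difference = bits in which the two differ
    return (n_bits - d) * math.log(lam) + d * math.log(1.0 - lam)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_bkd_score(fp_j, actives, inactives, n_bits, lam):
    """log of L(A|j): summed kernel over actives divided by that over inactives."""
    numerator = log_sum_exp([log_kernel(a, fp_j, n_bits, lam) for a in actives])
    denominator = log_sum_exp([log_kernel(x, fp_j, n_bits, lam) for x in inactives])
    return numerator - denominator


score_like_active = log_bkd_score({1, 2, 3}, [{1, 2, 3}], [{5, 6, 7}], n_bits=8, lam=0.75)
score_like_inactive = log_bkd_score({5, 6, 7}, [{1, 2, 3}], [{5, 6, 7}], n_bits=8, lam=0.75)
```

A database molecule that lies close to the training-set actives and far from the inactives obtains a large score and is therefore ranked near the top.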
The use of BKD requires calculating these scores using a range of values of the smoothing parameter, λ, and then choosing that value which gives the best results in terms of ranking the actives towards the top of the training-set.
The success of the BKD approach [29,30] led us to consider how it might be used given just a set of active reference structures, rather than a full training-set. The approach we have taken is to make the assumption that the characteristics of the inactives are approximated with a high degree of accuracy by the characteristics of the entire database that is to be searched (since the overwhelming majority of the database structures will be inactive). If this assumption is accepted, then a training-set can be generated by taking the set of reference structures and adding to it molecules randomly selected from the database (subject, in our experiments, to the qualification that none of the resulting pairs of molecules had a similarity coefficient greater than 0.80 using Unity fingerprints and the Tanimoto coefficient), with the expectation that most, if not all, of these added molecules are inactive. This expectation is not unreasonable given that actives are inherently very rare. In a set of initial experiments, we obtained good results by selecting 100 candidate molecules from the database and using them as inactives.
2.2. Comparison of search methods
We have evaluated the various approaches described above by means of simulated virtual screening searches on the MDL Drug Data Report (MDDR) database [31]. After removal of duplicates and molecules that could not be processed using local software, a total of 102 535 molecules were available for searching. These molecules were represented by 988-bit Tripos Unity 2D fingerprints [20], and searched using the eleven sets of active compounds detailed in Table 1.
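The construction of a training-set from reference structures plus presumed inactives can be illustrated by the following Python sketch. It reads the 0.80 constraint as applying to every pair in the resulting training-set (references and added molecules alike), which is an interpretation; the greedy acceptance order, the seed parameter and the name `pick_pseudo_inactives` are likewise assumptions.

```python
import random


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0


def pick_pseudo_inactives(references, database_fps, n_pick=100, max_sim=0.80, seed=0):
    """Randomly pick database molecules to act as inactives in the training-set."""
    rng = random.Random(seed)
    candidates = list(database_fps)
    rng.shuffle(candidates)
    training = list(references)  # molecules already accepted into the training-set
    chosen = []
    for fp in candidates:
        # Accept only if no pair involving this molecule exceeds the similarity limit.
        if all(tanimoto(fp, other) <= max_sim for other in training):
            training.append(fp)
            chosen.append(fp)
            if len(chosen) == n_pick:
                break
    return chosen


refs = [{1, 2, 3, 4, 5}]
db = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {10, 11}, {10, 11, 12}, {20}]
inactives = pick_pseudo_inactives(refs, db, n_pick=10)
```

The selected molecules are then treated as the inactive class when the BKD scores are computed.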
Table 1. MDDR Activity Classes used in this Study. MPS is the mean pair-wise similarity, computed using the Tanimoto coefficient and Unity 2D fingerprints, averaged over all of the molecules in an activity class.
Activity Name                     Number of Active Compounds   MPS
5HT3 antagonists                  752                          0.35
5HT1A agonists                    827                          0.34
5HT reuptake inhibitor            359                          0.35
D2 antagonists                    395                          0.35
Renin inhibitors                  1130                         0.57
Angiotensin II AT1 antagonists    943                          0.40
Thrombin inhibitors               803                          0.42
Substance P antagonists           1246                         0.40
HIV protease inhibitors           750                          0.45
Cyclooxygenase inhibitors         636                          0.27
Protein kinase C inhibitors       453                          0.32
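The MPS values reported in Table 1 are means of Tanimoto similarities over all pairs of molecules in a class. A minimal Python sketch of such a calculation is shown below; it assumes fingerprints stored as sets of on-bit positions, and the function names are hypothetical.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0


def mean_pairwise_similarity(fps):
    """Mean Tanimoto similarity over all distinct pairs of fingerprints in a class."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are needed")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


mps = mean_pairwise_similarity([{1, 2}, {2, 3}, {1, 2, 3}])
```

A low value indicates a structurally diverse class and a high value a structurally homogeneous one.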
This set of biological classes was selected from MDDR such that: the mode of action is known; the activity is of current pharmaceutical interest; and there is a substantial number of MDDR molecules categorized as exhibiting that activity. The datasets chosen were quite disparate in nature, some of them being structurally homogeneous (e.g. renin and HIV-1 protease inhibitors) while others were structurally diverse (e.g. cyclooxygenase and protein kinase C inhibitors), where the diversity was estimated by the mean pair-wise similarity (hereafter MPS) across each set of active molecules. The MPS is calculated by matching each compound with every other in its activity class, calculating similarities using the Tanimoto coefficient and the Unity fingerprints, and computing the mean intra-set similarity. The resulting similarity scores are listed in Table 1.
For each of the 11 activity classes, 10 active compounds were selected for use as a training-set. The selections were made at random, subject to the constraint that no pair-wise similarity in a group exceeded 0.80. Each searching method (modal, GF, R1, and BKD) was repeated 10 times using a different training-set each time, and a note was made for each search of the percentage of active molecules (i.e. those in the same class as those in the training-set) that occurred in the top 5% of the ranking. The results of the different methods are shown in Table 2, where it will be seen that GF and BKD are the methods of choice, consistently outperforming the other approaches. Of the two, BKD is more effective but is also
noticeably less efficient, being at least an order of magnitude slower than GF [14]. With some minor exceptions, the performance of all of the methods tends to increase as the self-similarity of the active molecules increases. The correlation with intra-class similarity is not unexpected; what is of importance here is that good virtual screening performance is obtained even with quite diverse activity classes (such as the protein kinase C inhibitors and the D2 antagonists). The worst results are obtained with the most diverse set of actives, i.e. the cyclooxygenase inhibitors; even here, however, the GF runs, for example, represent a 5.3-fold enrichment over a random ranking of the dataset. We will return to the effect of diversity on search performance in Sec. 2.4 below.
Table 2. Comparison of the Average Recall at 5% achieved by the various Search Methods
Activity Class                    SSA    Modal  GF     BKD
5HT3 antagonists                  29.3   30.3   49.0   52.3
5HT1A agonists                    30.1   21.9   37.2   38.2
5HT reuptake inhibitor            33.1   39.6   49.7   45.8
D2 antagonists                    27.5   27.1   37.4   38.7
Renin inhibitor                   52.9   88.8   88.6   93.3
Angiotensin II AT1 antagonists    43.4   73.6   80.4   84.5
Thrombin inhibitor                35.6   49.4   58.6   63.1
Substance P antagonist            36.5   36.8   47.1   58.4
HIV protease inhibitor            34.1   53.5   61.6   68.5
Cyclooxygenase inhibitor          19.2   11.0   26.5   33.2
Protein kinase C inhibitor        35.6   35.6   48.0   49.4
Average over all classes          34.3   42.5   53.1   56.8
We sought to quantify the benefit that can be achieved using multiple reference structures, rather than single reference structures as in conventional similarity searching. This was done by using every single active molecule in each of the 11 chosen activity classes as the reference structure and recording the minimum, mean and maximum performance, as detailed in Table 3. The mean values correspond to the performance that might be expected using a single reference structure and are clearly much lower than the figures reported in Table 2 for the BKD and the GF methods (30.6% as against 56.8% and 53.1%, respectively). Thus the use of 10 actives, rather than just one, results in an increase of over two-thirds in the number of actives retrieved. Most interestingly, consider the figures in Table 3 for the best possible single similarity search, i.e. the figures listed under Maximum. These represent the best single similarity searches possible from the many
hundreds of individual bioactive molecules. If we consider the average over all activity classes, it will be seen that this upper-bound is only fractionally better than the GF result in Table 2 and is actually worse than the BKD figure. Thus, on average, picking any 10 active reference structures and combining them using GF or BKD will enable searches to be carried out that are comparable to even the best conventional similarity search using a single reference structure. This is a striking result, and one that strongly supports the use of multiple reference structures for virtual screening.
Table 3. Recall at 5% achieved by Conventional, Single-molecule Similarity Searching
Activity Class                    Mean   Maximum  Minimum
5HT3 antagonists                  21.2   41.0     1.9
5HT1A agonists                    18.4   39.3     2.5
5HT reuptake inhibitor            24.0   42.7     1.4
D2 antagonists                    17.4   35.6     0.3
Renin inhibitor                   80.5   93.2     3.0
Angiotensin II AT1 antagonists    48.0   81.7     3.6
Thrombin inhibitor                33.5   63.6     0.6
Substance P antagonist            26.9   57.7     0.6
HIV protease inhibitor            37.6   63.7     1.9
Cyclooxygenase inhibitor          9.4    21.1     0.3
Protein kinase C inhibitor        19.4   46.1     0.7
Average over all classes          30.6   53.2     1.5
2.3. Comparison of fingerprints
Having identified appropriate ways of combining multiple fingerprints, we then evaluated the many different types of fingerprints that are now available for similarity-based virtual screening using multiple reference structures. Specifically, we describe experiments using four classes of fingerprint descriptor: structural keys; hashed fingerprints; circular substructures; and pharmacophore descriptors. In all, we evaluated 10 different descriptors; of these, seven are commercially available, two are used in-house at Novartis, and one was implemented from the literature description [15].
2.3.1. Fingerprint types
Structural keys have been used in chemoinformatics for many years, and are usually encoded by a binary array, each element of which denotes the
presence or absence of a specific 2D fragment. A predefined fragment dictionary lists the various fragment substructures that are encoded in the fingerprint. This study used the 1052-bit Barnard Chemical Information (BCI) fingerprints, which encode the following types of fragment substructure: augmented atoms, atom sequences, atom pairs, ring composition and ring fusion substructural fragments [32].
Hashed fingerprints differ from structural keys in that they consider a predefined type of substructural pattern, rather than a predefined dictionary of fragment substructures, where a pattern describes, for example, a path of length n bonds, i.e. (atom-bond-atom)_n with the natures of the atoms and bonds defined. Each such pattern present in a molecule is hashed to generate a position (or positions) within the available length of the bit-string. The study used three different hashed fingerprints: 988-bit Tripos Unity fingerprints [20]; 2048-bit Daylight fingerprints [33]; and 2048-bit Avalon fingerprints. Daylight fingerprints encode each atom's type, all augmented atoms and all paths of length 2–7 atoms. Unity fingerprints encode paths of length 2–6 atoms, and also include 60 structural keys for common atoms and ring counts. Avalon fingerprints are used for similarity search in Novartis' corporate data warehouse and encode atoms, augmented atoms, atom triplets and connection paths.
A circular substructure is a fragment descriptor where each atom is represented by a string of extended connectivity values that are calculated using a modification of the Morgan algorithm [34]. The study evaluated two different circular substructure descriptors from Scitegic's Pipeline Pilot Software [33]: extended connectivity fingerprints (ECFPs) and functional connectivity fingerprints (FCFPs). The initial code assigned to an atom is based on the number of connections, the element type, the charge, and the mass for ECFPs, and on six generalized atom-types (viz. hydrogen-bond donor, hydrogen-bond acceptor, positively ionizable, negatively ionizable, aromatic and halogen) for FCFPs. This code, in combination with the bond information and with the codes of its immediate neighbor atoms, is hashed to produce the next order code, and the process iterates until the required level of description has been achieved. The experiments here used the ECFP_2, ECFP_4, FCFP_2 and FCFP_4 fingerprints, where the numeric code denotes the diameter in bonds up to which features are generated. The Scitegic software represents a molecule by a list of integers, each describing a molecular feature and each in the range $-2^{31}$ to $2^{31}$. These integers were hashed here to a bit-string of length 1024 bits.
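The last step, in which a list of integer feature identifiers is mapped onto a 1024-bit string, can be illustrated as follows. The hashing function actually used is not stated here, so the sketch simply assumes reduction modulo the fingerprint length; the name `fold_features` is hypothetical.

```python
def fold_features(feature_ids, n_bits=1024):
    """Map integer feature identifiers onto on-bit positions of an n_bits fingerprint."""
    # Assumption: folding by remainder. Python's % returns a non-negative result
    # for a positive modulus, so negative identifiers are handled as well.
    return {feature_id % n_bits for feature_id in feature_ids}


fingerprint = fold_features([5, 1029, -1, 2**31 - 1])
```

Distinct features may collide on the same bit position (5 and 1029 above), which is an inherent property of any fixed-length hashed fingerprint.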
Pharmacophore points are features (such as a heteroatom or the center of an aromatic ring) that are thought to be required for a molecule to show bioactivity. Pharmacophore fingerprinting involves generating all of the patterns of three or four pharmacophore points in a molecule, together with the corresponding inter-point distances, and then using the resulting 3D structural codes as descriptors for similarity searching or diversity analysis [9,10,36,37]. When used with 2D, rather than 3D, structural representations, the inter-atomic distances can be replaced by through-bond distances, and this approach forms the basis of the two pharmacophore fingerprints studied here: Similog keys [25] and the chemically advanced template search (CATS) descriptor [38], both of which are based on generalized atom-types describing potential pharmacophores.
The Similog keys use a "DABE" atom-typing scheme based on the following four properties: hydrogen-bond donor (D), hydrogen-bond acceptor (A), bulkiness (B) and electropositivity (E). The presence or absence of these properties for an atom is encoded in a 4-bit string, and each triplet of atoms is represented by the three DABE strings and by the associated topological distances: in all, 8031 different codes were identified in the MDDR database. The Similog keys store the occurrence of each distinct code, and not just their presence or absence as in a conventional bit-string. A binning scheme was hence used to bin the occurrence data into 8-bit strings to form a fingerprint of length 64 248 bits.
The CATS descriptor is based on counts of atom-pair topological distances, with the following generalized types of atom being considered in the generation of the descriptor: lipophilic, positive, negative, hydrogen-bond donor and hydrogen-bond acceptor. The occurrences of the 15 possible pairs of pharmacophores are determined for distances up to 10 bonds to give a 150-element (i.e. 15 × 10) vector. The vectors were generated using the description in Fechner et al. [39] and then converted to a binary fingerprint using the same binning scheme as for Similog to form a fingerprint of length 1200 bits.
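The layout of the 150-element count vector can be made concrete with a short Python sketch. It is not the CATS implementation: the ordering of the pair types, the treatment of distances as 1 to 10 bonds, and the input format (a list of already typed atom pairs with their topological distances) are all assumptions made for illustration, and the binning into a binary fingerprint is not shown.

```python
from itertools import combinations_with_replacement

TYPES = ("lipophilic", "positive", "negative", "donor", "acceptor")
PAIRS = list(combinations_with_replacement(TYPES, 2))  # the 15 unordered type pairs
PAIR_INDEX = {pair: index for index, pair in enumerate(PAIRS)}
MAX_DIST = 10  # assumption: distances of 1 to 10 bonds are counted


def cats_like_vector(typed_pairs):
    """Count (type_a, type_b, distance) occurrences into a 15 x 10 element vector."""
    vector = [0] * (len(PAIRS) * MAX_DIST)
    for type_a, type_b, dist in typed_pairs:
        if 1 <= dist <= MAX_DIST:
            # Order the two types canonically so that (a, b) and (b, a) coincide.
            key = tuple(sorted((type_a, type_b), key=TYPES.index))
            vector[PAIR_INDEX[key] * MAX_DIST + (dist - 1)] += 1
    return vector


vector = cats_like_vector([
    ("donor", "acceptor", 3),
    ("acceptor", "donor", 3),
    ("positive", "negative", 11),  # beyond the distance limit, hence ignored
])
```

Each element of the vector holds the number of times a given pair of generalized atom types occurs at a given through-bond distance.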
2.3.2. Comparison of effectiveness
Searches were carried out as described previously, and the results are shown in Table 4. Inspection of these results reveals the general superiority of the circular substructure descriptors (with the notable exception of FCFP_2), with the ECFP_4 fingerprints being the best for virtual screening of the sort advocated here.
Table 4. Comparison of the Recalls at 5% averaged over the 11 Datasets in Table 1, using Different Types of Fingerprint
Descriptor   GF     BKD
ECFP_2       64.1   64.8
ECFP_4       67.4   65.7
FCFP_2       53.6   58.8
FCFP_4       63.7   64.1
BCI          58.0   58.9
Daylight     54.9   56.8
Unity        53.1   56.7
Avalon       53.0   54.4
Similog      52.1   53.2
CATS         38.1   41.7
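The recall-at-5% figures reported in the tables, and the enrichment over random ranking quoted in Sec. 2.2, can be computed along the following lines. This Python sketch is illustrative: the rounding of the 5% cut-off and the function names `recall_at` and `enrichment` are assumptions.

```python
def recall_at(ranking, actives, fraction=0.05):
    """Percentage of the actives found in the top `fraction` of a ranked list of names."""
    cutoff = max(1, round(fraction * len(ranking)))  # assumption: round to nearest molecule
    found = sum(1 for name in ranking[:cutoff] if name in actives)
    return 100.0 * found / len(actives)


def enrichment(recall_percent, fraction=0.05):
    """Enrichment over a random ranking, which would recover `fraction` of the actives."""
    return recall_percent / (100.0 * fraction)


ranking = ["m%d" % i for i in range(100)]
actives = {"m0", "m3", "m50", "m99"}
recall = recall_at(ranking, actives)
```

For example, a recall of 26.5% in the top 5% of the ranking corresponds to a 5.3-fold enrichment over random selection.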
The dictionary-based descriptors, represented here by the BCI fingerprints, were ranked second overall, returning generally higher recalls than the hashed fingerprints, i.e. Unity, Daylight and Avalon. This finding is in agreement with the studies of cluster-based property prediction by Brown and Martin [9,10] (although they used different types of dictionary and hashed fingerprints from those studied here). Finally, the CATS and Similog pharmacophore descriptors yield consistently poor recall values. Previous studies of these descriptors, for chemogenomics and scaffold-hopping applications [25,38], have demonstrated that they can be highly effective in operation, but this was certainly not the case for the present application.
Circular substructures of various sorts have been widely used for applications such as structure and substructure searching, constitutional symmetry, structure elucidation and the probabilistic modeling of bioactivity inter alia [15]. The work reported here demonstrates that this type of fragment is also very well suited to virtual screening using multiple reference structures.
2.4. Scaffold hopping applications
We now discuss the suitability of BKD and GF for scaffold hopping applications [38,40–42]. Scaffold hopping (other names that have been used include leap frogging, lead hopping and scaffold searching) involves finding chemical structures that exhibit the same biological activity as the reference structure but that have significant topological differences.
The results in Sec. 2.2 show that search effectiveness is positively associated with the MPS of the sought actives; however, acceptable results were obtained even with the more heterogeneous activity classes (vide supra). Other studies on GF using datasets from both MDDR and the Dictionary of Natural Products [43] suggested that while the absolute level of performance might not be very high for such datasets, the increase in performance, relative to conventional similarity searching based on just a single reference structure, was greatest for the most diverse activity classes [26]. These experiments suggest that such techniques may be of greatest value in those cases where conventional similarity searching (and conventional data fusion) is least effective, i.e. that they might prove to be a useful tool for scaffold hopping.
We hence analyzed the MDDR to identify activity classes that could be used to test this hypothesis. Specifically, 261 classes were identified that were covered in a ligand ontology linking the activity classes to a hierarchical classification of protein targets [44], and that had more than two members. The MPS was then computed using the Tanimoto coefficient and all the 2D fingerprints discussed in Sec. 2.3. The final MPS for each class was then taken as the mean when averaged over the values for the individual fingerprint-types, and the 10 most diverse (lowest average MPS) classes were selected, subject to them containing at least 50 compounds. These activity classes are listed in Table 5(a); only one of them, the cyclooxygenase inhibitors, had figured in our previous experiments. For comparison with the MPS values in this table, the MPS for 10 000 compounds selected at random from MDDR was 0.200, demonstrating the highly disparate nature of the datasets listed in Table 5(a). In addition, 10 activity classes were similarly selected that had medium average MPS values and a further 10 that had the highest average MPS values (i.e. very homogeneous sets of compounds), as detailed in Tables 5(b) and 5(c), respectively.
Simulated virtual screening searches were carried out on the MDDR database using the procedures described previously. The results obtained with the high-diversity (low MPS score) datasets were much lower than those for the medium-MPS and high-MPS searches; however, the recall values relative to those for conventional similarity searching are much higher, more than doubling the numbers of retrieved actives for most of the fingerprint-types. This differential behavior is illustrated in Fig. 1. The plots show the recall at 5% obtained with group fusion (GF) and similarity searching (SS) using the ECFP_4 descriptors and the 30 MDDR activity classes, and illustrate the increase of relative performance as the sought structures become more heterogeneous. The upper part of the
ch06
FA April 1, 2006
15:41
145
WSPC/Book-329: Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
New Methods for Similarity-based Virtual Screening Table 5 Ten MDDR Activity Classes selected as having: (a) the Lowest Average MPS Scores, and hence as being the most diverse for virtual screening experiments; (b) Medium Average MPS Scores, and hence as being of medium diversity for virtual screening experiments; (c) the Highest Average MPS Scores, and hence as being the least diverse for virtual screening experiments.
(a)
Activity Class Muscarinic (M1) agonists NMDA receptor antagonists Nitric oxide synthase inhibitors Dopamine beta-hydroxylase inhibitors Aldose reductase inhibitors Reverse transcriptase inhibitors Aromatase inhibitors Cyclooxygenase inhibitors Phospholipase A2 inhibitors Lipoxygenase inhibitors
(b)
Activity Class CRF antagonists 5HT2B antagonists 5HT2C antagonists Dopamine (D1) antagonists Carbonic anhydrase inhibitors Thrombin inhibitors CCK A antagonists Oxytocin antagonists Protease inhibitors Phosphodiesterase V inhibitors
(c)
Activity Class Adenosine (A1) agonists Adenosine (A2) agonists Renin inhibitors CCK agonists Monocyclic beta-lactams Cephalosporins Carbacephems Carbapenems Tribactams Vitamin D analogs
Number of Compounds
Average MPS
848 1311 377 95 882 519 513 636 704 2555
0.21 0.20 0.19 0.23 0.23 0.22 0.23 0.22 0.22 0.22
Number of Compounds
Average MPS
254 90 174 167 255 803 161 209 574 164
0.32 0.31 0.31 0.32 0.31 0.32 0.32 0.31 0.31 0.32
Number of Compounds
Average MPS
88 71 1130 79 76 1312 73 896 74 279
0.52 0.54 0.46 0.45 0.55 0.50 0.49 0.46 0.55 0.57
ch06
FA April 1, 2006
146
15:41
WSPC/Book-329: Chemogenomics: An Emerging Strategy for Rapid Target and Drug Discovery
J. Hert , P. Willett and D. J. Wilton
Figure 1. Comparison of the average recall at 5% obtained with group fusion (GF) and similarity searching (SS) using the ECFP_4 descriptors and 30 MDDR activity classes chosen as being of low, medium and high diversity. The upper part of the figure shows the recall at 5% obtained with GF vs. recall at 5% obtained with SS, while the lower part shows the diversity (as measured by the mean pair-wise similarity) vs. the ratio of the recalls at 5% obtained with GF and with SS.
figure compares the recalls resulting from the use of the two methods, and it will be seen that many of the GF results lie well above the 45° diagonal that would be obtained if the two approaches gave comparable recalls; moreover, the distance above the line is the greatest for the most diverse and medium diverse datasets. This difference is further demonstrated in the lower part of the figure, which plots the ratio of the two recalls against the MPS, with the largest performance enhancements resulting from the most diverse datasets (i.e. the lowest MPS values). Similar behavior is obtained if BKD is used for the multiple-reference search [17].
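The two quantities plotted in the lower part of the figure can be illustrated with a short sketch. It assumes that the MPS of an activity class is the Tanimoto coefficient averaged over all distinct pairs of its molecules, and it represents each fingerprint as a Python set of "on" bit positions; the function names are illustrative and are not taken from the study.

```python
from itertools import combinations
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints held as sets of 'on' bits."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def mean_pairwise_similarity(fingerprints: List[Set[int]]) -> float:
    """Average Tanimoto similarity over all distinct pairs (assumed MPS definition)."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are needed")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def recall_ratio(recall_gf: float, recall_ss: float) -> float:
    """Ratio of the GF recall to the SS recall, the quantity plotted against the MPS."""
    return recall_gf / recall_ss
```

A low value from `mean_pairwise_similarity` corresponds to a structurally diverse activity class, which is where the text reports the largest ratios.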
3. Virtual Screening Using a Single Reference Structure

The previous section focused on virtual screening using multiple bioactive reference structures; in this section, we report a simple way of enhancing the effectiveness of conventional similarity-based virtual screening, i.e. when just a single reference structure is available. We refer to this approach as turbo similarity searching; a turbocharger increases the power of an engine by using the engine's exhaust gases, and here, we seek to increase the power of a search engine procedure by using the reference structure's NNs. Turbo similarity searching (TSS) is based on two observations: (1) the general applicability of the similar property principle; and (2) the work on the use of multiple reference structures for similarity searching described in the previous section.

3.1. Use of group fusion and nearest neighbor information

The similar property principle implies that the NNs of a bioactive reference structure are also likely to possess that activity. If we assume that the NNs are not just likely to be active but actually are active, then we can use GF to combine the results of similarity searches that use them as the reference structures, using the strategy summarized in Fig. 2. The extent to which the procedure shown in Fig. 2 can enhance retrieval effectiveness, i.e. yield a greater number of high-ranked actives, was studied by searching the MDDR using the datasets listed in Table 1, with SS and TSS searches carried out for all of the 8294 active molecules in these 11 classes. The principal results of the study are listed in Table 6, which lists the 5% mean recalls averaged over all of the individual active molecules in each activity class when a TSS search is carried out using the specified numbers of NNs (5, 10, 20, 50, 100). The table also contains the comparable
Input the reference structure R
Compute the similarity of R with every molecule in the database D
Sort D in decreasing order of the calculated similarity values to give a sorted database SD(0)
Identify the k NNs of R from the top of the list SD(0)
For each such nearest-neighbor, NN(i)
    Compute the similarity of NN(i) with every molecule in D
    Sort D in decreasing order of the calculated similarity values to give a sorted database SD(i)
Fuse the sorted lists SD(0)–SD(k) to give the final output from the turbo similarity search.
Figure 2. Use of nearest neighbors for turbo similarity searching.
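The procedure in Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in the study: fingerprints are held as sets of "on" bit positions standing in for ECFP_4 fingerprints, the Tanimoto coefficient is the similarity measure, and the lists are fused by taking, for each database molecule, the maximum of its similarity scores (a MAX fusion rule is assumed here). All function names are hypothetical.

```python
from typing import Dict, List, Set

Fingerprint = Set[int]  # hypothetical stand-in for an ECFP_4 bit set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient between two sets of 'on' bits."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def similarity_scores(ref: Fingerprint, db: Dict[str, Fingerprint]) -> Dict[str, float]:
    """Similarity of one reference structure to every molecule in the database."""
    return {mol_id: tanimoto(ref, fp) for mol_id, fp in db.items()}


def turbo_similarity_search(ref: Fingerprint, db: Dict[str, Fingerprint], k: int) -> List[str]:
    """Rank the database by the fused scores of the reference and its k NNs."""
    base = similarity_scores(ref, db)                        # scores behind SD(0)
    ranked = sorted(db, key=lambda m: base[m], reverse=True)
    neighbours = ranked[:k]                                  # the k NNs, assumed active
    score_lists = [base] + [similarity_scores(db[nn], db) for nn in neighbours]
    # Assumed fusion rule: keep each molecule's best score over all k+1 searches.
    fused = {m: max(scores[m] for scores in score_lists) for m in db}
    return sorted(db, key=lambda m: fused[m], reverse=True)
```

With `k = 0` the function reduces to a conventional similarity search, which gives a convenient baseline when comparing rankings.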
Table 6. Mean recall at 5% for Conventional Similarity Searching (SS) with just a Single Reference Structure and Turbo Similarity Searching (TSS) using different numbers of NNs

Activity Class                   SS       TSS-5    TSS-10   TSS-20   TSS-50   TSS-100
5HT3 antagonists                 31.7     34.8     36.8     38.6     42.1     44.0
5HT1A agonists                   26.3     28.1     29.6     31.8     34.5     36.2
5HT reuptake inhibitors          21.6     23.4     24.0     23.8     24.3     24.1
D2 antagonists                   25.1     25.8     26.9     27.5     29.1     30.3
Renin inhibitors                 90.4     91.2     92.1     93.1     94.3     94.7
Angiotensin II AT1 antagonists   77.4     80.8     83.5     86.7     90.2     92.0
Thrombin inhibitors              44.5     45.6     47.1     48.3     51.0     50.7
Substance P antagonists          28.6     30.5     31.7     32.2     33.3     34.1
HIV protease inhibitors          51.6     51.9     52.6     53.3     54.5     55.2
Cyclooxygenase inhibitors        13.7     14.6     15.0     15.3     15.1     14.4
Protein kinase C inhibitors      21.0     21.1     21.1     21.1     20.9     20.6
Average over all classes         39.2     40.7     41.9     42.9     44.5     45.1
results for a conventional similarity search (SS). Inspection of this table shows that TSS is nearly always superior to SS in its ability to identify active molecules, with the only exceptions being the protein kinase C TSS-50 and TSS-100 searches. With some of the other activity classes, the increases in performance are really quite marked, most notably the 5HT3 antagonists and 5HT1A agonists. It is perhaps surprising that the best results are generally obtained with the largest number of NNs since the more NNs that are used, the greater the number of inactive molecules that are likely to be included in the GF. However, the fact that the average recall does increase, even with 100 NNs, means that these molecules are providing useful structural information. At
some point, of course, the inclusion of further NNs will start to degrade the retrieval performance; for the activity classes studied here, we found that use of 100 NNs was significantly better than use of 200 NNs and we have hence not included results for the latter number of NNs in Table 6 [16].

For comparison with the results in Table 6, Table 7 lists the results obtained in sets of searches using two lower-bounds and two upper-bounds. The former were obtained by carrying out a TSS search using either just the inactives in the set of 100 NNs, or the top-ranked 100 inactive NNs; both values are comparable to the basic SS recall of 39.2% in Table 6. The effectiveness of these lower-bound searches may appear rather surprising; however, it simply means that even when inactive molecules are used in TSS, the molecules still contain sufficient relevant substructures in common with the reference structure to enable the identification of further active molecules. The two upper-bound searches demonstrate the performance available with full knowledge of the actives. When one uses a set of 100 active NNs then, hardly surprisingly, the search performance is very much greater than with TSS-100 (where one assumes that all of the top-100 NNs are active). Using just the true actives in the top-100 NNs, the performance is still noticeably above, albeit much closer to, TSS-100 (where one includes these actives and further molecules that are assumed to be active but that are actually inactive). Thus the true actives increase performance, when compared to similarity searching, whilst the true inactives have little effect on performance, yielding the overall enhancement evident in Table 6.

Table 7. Mean Recall at 5% of Two Types of Upper- and Lower-bounds for TSS

Activity Class                   Upper-bounds                        Lower-bounds
                                 Reference and     Reference and     Inactives among   Top-100
                                 Actives among     Top-100 Active    the 100 NNs       Inactive NNs
                                 the 100 NNs       NNs
5HT3 antagonists                 49.4              65.7              33.5              32.1
5HT1A agonists                   38.0              55.3              31.7              31.9
5HT reuptake inhibitors          27.8              62.8              21.7              21.7
D2 antagonists                   30.6              68.6              28.7              28.8
Renin inhibitors                 95.2              96.6              90.2              89.8
Angiotensin II AT1 antagonists   91.6              95.2              90.9              92.2
Thrombin inhibitors              58.6              71.6              37.4              33.9
Substance P antagonists          42.2              53.8              20.4              15.8
HIV protease inhibitors          59.8              76.1              50.8              49.0
Cyclooxygenase inhibitors        17.5              49.2              12.5              12.0
Protein kinase C inhibitors      23.2              58.1              18.4              18.3
Average over all classes         48.5              68.4              39.6              38.7

3.2. Use of machine learning and nearest neighbor information

Thus far, we have used group fusion (GF) for the second stage of turbo similarity search. However, there seems to be no reason in principle why alternative ranking procedures could not be used here, and we now consider the use of machine-learning methods for this purpose. The use of such methods for virtual screening requires the availability of a training-set containing known active and known inactive molecules. If we wish to use such methods when the only activity information available is that represented by a single reference structure, then means must be found to identify inactives and further actives that can be used as training data. Given the success of our TSS experiments, it seems possible that the NNs of the known reference structure could serve as the actives in the training-set, with inactives being obtained by noting that the characteristics of inactives are closely approximated by the characteristics of the entire database that is to be searched. Hence, the training-set inactives are obtained by selecting molecules at random from the database that is to be searched (subject, in our experiments, to none of these presumed inactives having a similarity coefficient greater than 0.40, using ECFP_4 fingerprints and the Tanimoto coefficient, with the reference structure and its NNs that comprise the training-set actives). Our new TSS procedure is hence as shown in Fig. 3. We have tested two machine learning procedures: SSA and BKD, the details of which have been presented earlier in this chapter. For SSA, a
Input the reference structure R
Compute the similarity of R with every molecule in the database D
Sort D in decreasing order of the calculated similarity values
Assume that the k NNs at the top of the ranking are active
Select k+1 molecules at random from D, subject to the constraint that none of them has similarity ≥ 0.40 (Tanimoto coefficient and ECFP_4 fingerprints) with the reference structure or with any of the top-k NNs
Use R, the k NNs and the k+1 randomly selected molecules as the training-set for a machine learning procedure.
Figure 3. Use of machine learning for turbo similarity searching.
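The training-set construction in Fig. 3 can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used in the study: fingerprints are sets of "on" bit positions standing in for ECFP_4 fingerprints, and the function name, the fixed random seed and the return format are hypothetical choices made for the example.

```python
import random
from typing import Dict, List, Set, Tuple

Fingerprint = Set[int]  # hypothetical stand-in for an ECFP_4 bit set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient between two sets of 'on' bits."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def build_training_set(
    ref: Fingerprint,
    db: Dict[str, Fingerprint],
    k: int,
    threshold: float = 0.40,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (presumed actives, presumed inactives) as database identifiers."""
    ranked = sorted(db, key=lambda m: tanimoto(ref, db[m]), reverse=True)
    presumed_actives = ranked[:k]                    # the k NNs, assumed active
    active_fps = [ref] + [db[m] for m in presumed_actives]
    # Candidates must be dissimilar to the reference and to every NN.
    candidates = [
        m for m in ranked[k:]
        if all(tanimoto(db[m], fp) < threshold for fp in active_fps)
    ]
    # random.sample raises ValueError if fewer than k+1 candidates remain.
    presumed_inactives = random.Random(seed).sample(candidates, k + 1)
    return presumed_actives, presumed_inactives
```

The reference structure itself is the remaining training-set active, so the two classes are balanced at k+1 molecules each, as in the figure.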
Table 8. Mean Recalls at 5% for Similarity Searching (SS), Conventional Turbo Similarity Searching using Group Fusion (TSS-GF) and Turbo Similarity Searching using Substructural Analysis (TSS-SSA) and Binary Kernel Discrimination (TSS-BKD)

Activity Class                   SS       TSS-GF   TSS-SSA  TSS-BKD
5HT3 antagonists                 31.7     44.0     39.9     36.3
5HT1A agonists                   26.3     36.2     34.3     30.9
5HT reuptake inhibitors          21.6     24.1     25.0     25.0
D2 antagonists                   25.1     30.3     30.4     30.6
Renin inhibitors                 90.4     94.7     94.4     94.4
Angiotensin II AT1 antagonists   77.4     92.0     90.3     92.1
Thrombin inhibitors              44.5     50.7     50.8     52.3
Substance P antagonists          28.6     34.1     31.2     33.0
HIV protease inhibitors          51.6     55.2     55.9     58.0
Cyclooxygenase inhibitors        13.7     14.4     16.6     14.3
Protein kinase C inhibitors      21.0     20.6     22.5     23.1
Average over all classes         39.2     45.1     44.7     44.5
set of initial experiments was carried out to identify the most appropriate weighting scheme [17]; the best results in the current context were obtained using R4, which has the form:

\[ \log \frac{A_j/(N_A - A_j)}{I_j/(N_I - I_j)} \qquad (5) \]

(using the notation in Sec. 2.1.3). Searches were carried out using ECFP_4 fingerprints on the 11 activity classes listed in Table 1 and the results are presented in Table 8. Here, as before, SS represents conventional similarity searching and TSS-GF, TSS-SSA and TSS-BKD represent turbo similarity searching based on the use of group fusion, substructural analysis and binary kernel discrimination, respectively, for the re-ranking of the original similarity-search output. In all of the TSS searches, 100 compounds were used in the second stage. Taking the average over all of the actives in all of the activity classes, it will be seen from Table 8 that TSS-SSA and TSS-BKD outperform conventional SS but that they are both marginally inferior to TSS-GF as originally described in the previous section. However, there are some cases where they are superior to TSS-GF, especially for the more diverse (low MPS) activity classes. Comparable searches were hence carried out using the 10 highly diverse activity classes listed in Table 5(a) and the results here are shown in Table 9.
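The R4 weight of Eq. (5) can be illustrated with a short sketch. It assumes that A_j and I_j are the numbers of training-set actives and inactives that contain fragment (bit) j, that N_A and N_I are the total numbers of actives and inactives, and that a molecule is scored by summing the weights of the bits it sets. The small constant `eps` is an addition made here only to avoid division by zero and logarithms of zero; it is not part of Eq. (5), and the function names are hypothetical.

```python
import math
from typing import Dict, Iterable, Set


def r4_weights(
    actives: Iterable[Set[int]],
    inactives: Iterable[Set[int]],
    eps: float = 0.5,
) -> Dict[int, float]:
    """R4 weight for every bit seen in the training-set fingerprints."""
    actives, inactives = list(actives), list(inactives)
    n_a, n_i = len(actives), len(inactives)
    weights: Dict[int, float] = {}
    for j in set().union(*actives, *inactives):
        a_j = sum(j in fp for fp in actives)      # actives containing bit j
        i_j = sum(j in fp for fp in inactives)    # inactives containing bit j
        # eps is an assumed smoothing term, not part of the published formula.
        odds_active = (a_j + eps) / (n_a - a_j + eps)
        odds_inactive = (i_j + eps) / (n_i - i_j + eps)
        weights[j] = math.log(odds_active / odds_inactive)
    return weights


def ssa_score(fp: Set[int], weights: Dict[int, float]) -> float:
    """Score a molecule as the sum of the weights of its bits (assumed scoring)."""
    return sum(weights.get(j, 0.0) for j in fp)
```

Bits that occur mainly in the presumed actives receive positive weights and bits that occur mainly in the presumed inactives receive negative weights, so sorting the database by `ssa_score` gives the re-ranking used in the second stage.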
Table 9. Mean Recalls at 5% for Similarity Searching (SS), Conventional Turbo Similarity Searching using Group Fusion (TSS-GF) and Turbo Similarity Searching using Substructural Analysis (TSS-SSA) and Binary Kernel Discrimination (TSS-BKD)

Activity Class                         SS       TSS-GF   TSS-SSA  TSS-BKD
Muscarinic (M1) agonists               27.4     31.0     46.6     42.2
NMDA receptor antagonists              15.7     17.1     20.7     18.7
Nitric oxide synthase inhibitors       18.1     16.9     21.0     18.7
Dopamine beta-hydroxylase inhibitors   37.5     37.1     51.7     42.9
Aldose reductase inhibitors            19.9     22.8     23.8     22.7
Reverse transcriptase inhibitors       15.5     14.6     18.0     16.8
Aromatase inhibitors                   29.0     30.3     33.5     32.5
Cyclooxygenase inhibitors              13.7     14.4     16.7     14.4
Phospholipase A2 inhibitors            19.9     20.8     21.2     21.2
Lipoxygenase inhibitors                13.0     15.3     15.2     13.5
Average over all classes               21.0     22.0     26.8     24.4
Inspection of the figures in this table shows very clearly that the machine-learning versions of TSS are to be preferred to the original TSS-GF when diverse sets of active structures are to be searched for in a chemical database. The increase in performance here is quite marked: the mean recalls at 5% for SS and TSS-SSA are 21.0 and 26.8, respectively, which represents an increase of over one-quarter relative to the number of active molecules identified in a conventional similarity search. It may appear surprising at first sight that TSS-SSA is superior to TSS-BKD, given that our previous experience of SSA and BKD had suggested that the latter was to be preferred (see Sec. 2.2 above). However, recent studies on the robustness of BKD and SSA have shown that the performance of BKD is more affected by the presence of noisy activity data than is that of SSA [45]. In the present context, the assumption that the NNs are all active inevitably means that there are many false positives in the training-set, and it is clear that this adversely affects the performance of BKD.
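The performance measure and the "over one-quarter" figure quoted above can be checked with a small sketch. It assumes that the recall at 5% is the percentage of a class's actives that appear in the top 5% of the ranked database; the identifiers and the toy ranking are hypothetical, and only the two mean recalls (21.0 and 26.8) are taken from Table 9.

```python
from typing import List, Set


def recall_at_fraction(ranking: List[str], actives: Set[str], fraction: float = 0.05) -> float:
    """Percentage of the actives found in the top `fraction` of a ranked list."""
    cutoff = max(1, round(len(ranking) * fraction))
    found = sum(1 for mol_id in ranking[:cutoff] if mol_id in actives)
    return 100.0 * found / len(actives)


def relative_gain(baseline: float, improved: float) -> float:
    """Fractional improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


ranking = [f"m{i}" for i in range(100)]       # hypothetical ranked database
actives = {"m0", "m3", "m4", "m50"}           # hypothetical set of actives
print(recall_at_fraction(ranking, actives))   # three of the four actives lie in the top five
print(round(relative_gain(21.0, 26.8), 3))    # mean recalls for SS and TSS-SSA from Table 9
```

The second value is about 0.28, which matches the increase of over one-quarter stated in the text.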
4. Conclusions

Similarity searching using 2D fingerprints is probably the simplest of the tools available for virtual screening, requiring just a single active molecule to rank a database in order of decreasing probability of activity. This ease of use means that it plays a key role in the early stages of lead-discovery programmes, when only limited SAR and structural data are available. In this chapter, we have summarized the principal results of a study to enhance
the retrieval effectiveness of the current systems for similarity searching, using simulated virtual screening searches of the MDDR database. Our work has had two foci.

In the first part of the chapter, we have considered ways in which existing techniques can be extended to the situation where not one but several reference structures are available. We have shown that two methods in particular, one based on combining similarity scores using group fusion and the other based on an approximate form of binary kernel discrimination, provide effective ways of exploiting the information contained in a small set of reference structures. Of these two methods, group fusion is slightly less effective in operation but significantly more efficient. We have also investigated the performance of different types of fingerprint for similarity-based virtual screening, and demonstrated the general effectiveness of the Scitegic Pipeline Pilot circular substructure descriptors.

In the second part of the chapter, we have discussed ways of increasing search performance by the use of a second-stage search based on the nearest neighbors resulting from an initial similarity-based ranking of the database. This two-stage approach, called turbo similarity searching, is notably more effective than the conventional, single-stage approach: overall, the best results are obtained from using group fusion to process the sets of nearest neighbors, but better results may be obtained when the sought molecules are structurally highly diverse using an approach based on substructural analysis.

In brief, we believe that the techniques described in this chapter provide highly cost-effective ways of enhancing the performance of current systems for similarity-based virtual screening.
Acknowledgments

We thank the following: Novartis Institutes for BioMedical Research for funding JH; MDL Information Systems Inc. for the provision of the MDDR database; Barnard Chemical Information Ltd., Daylight Chemical Information Systems Inc., the Royal Society, Tripos Inc. and the Wolfson Foundation for software and laboratory support.
References

1. Bohm H-J, Schneider G (eds). (2000) Virtual Screening for Bioactive Molecules. Wiley-VCH, Weinheim.
2. Klebe G (ed). (2000) Virtual Screening: An Alternative or Complement to High Throughput Screening. Kluwer, Dordrecht.
3. Willett P, Barnard JM, Downs GM. (1998) Chemical Similarity Searching. J. Chem. Inf. Comp. Sci. 38:983–996.
4. Johnson MA, Maggiora GM (eds). (1990) Concepts and Applications of Molecular Similarity. John Wiley, New York.
5. Martin YC, Kofron JL, Traphagen LM. (2002) Do Structurally Similar Molecules Have Similar Biological Activities? J. Med. Chem. 45:4350–4358.
6. Patterson DE, Cramer RD, Ferguson AM, Clark RD, Weinberger LE. (1996) Neighbourhood Behaviour: A Useful Concept for Validation of “Molecular Diversity” Descriptors. J. Med. Chem. 39:3049–3059.
7. Schuffenhauer A, Jacoby E. (2004) Annotating and Mining the Ligand-Target Chemogenomics Knowledge Space. BIOSILICO 2:190–200.
8. Kubinyi H. (1998) Similarity and Dissimilarity: A Medicinal Chemist’s View. Perspect. Drug Discov. Des. 9–11:225–252.
9. Brown RD, Martin YC. (1996) Use of Structure-Activity Data to Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection. J. Chem. Inf. Comp. Sci. 36:572–584.
10. Brown RD, Martin YC. (1997) The Information Content of 2D and 3D Structural Descriptors Relevant to Ligand-Receptor Binding. J. Chem. Inf. Comp. Sci. 37:1–9.
11. Sheridan RP, Feuston BP, Maiorov VN, Kearsley SK. (2004) Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR. J. Chem. Inf. Comp. Sci. 44:1912–1928.
12. Shanmugasundaram V, Maggiora GM, Lajiness MS. (2005) Hit-Directed Nearest-Neighbor Searching. J. Med. Chem. 48:240–248.
13. Sheridan RP, Kearsley SK. (2002) Why Do We Need So Many Chemical Similarity Search Methods? Drug Discov. Today 7:903–911.
14. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. (2004) Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. J. Chem. Inf. Comp. Sci. 44:1177–1185.
15. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. (2004) Topological Descriptors for Similarity-Based Virtual Screening Using Multiple Bioactive Reference Structures. Org. Biomol. Chem. 2:3256–3266.
16. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. (2005) Enhancing the Effectiveness of Similarity-Based Virtual Screening Using Nearest-Neighbour Information. J. Med. Chem. 48:7049–7054.
17. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. (2005) New Methods for Ligand-Based Virtual Screening: Use of Data-Fusion and Machine-Learning Techniques to Enhance the Effectiveness of Similarity Searching. J. Chem. Inf. Model. (in the press).
18. Güner O. (2000) Pharmacophore Perception, Development, and Use in Drug Design. International University Line, La Jolla, CA.
19. Shemetulskis NE, Weininger D, Blankley CJ, Yang JJ, Humblet C. (1996) Stigmata: An Algorithm to Determine Structural Commonalities in Diverse Datasets. J. Chem. Inf. Comp. Sci. 36:862–871.
20. Tripos Inc. is at http://www.tripos.com
21. Charifson PS, Corkery JJ, Murcko MA, Walters WP. (1999) Consensus Scoring: A Method for Obtaining Improved Hit Rates from Docking Databases of Three-Dimensional Structures into Proteins. J. Med. Chem. 42:5100–5109.
22. Ginn CMR, Willett P, Bradshaw J. (2000) Combination of Molecular Similarity Measures Using Data Fusion. Perspect. Drug Discov. Des. 20:1–16.
23. Wang RX, Wang SM. (2001) How does Consensus Scoring Work for Virtual Library Screening? An Idealized Computer Experiment. J. Chem. Inf. Comp. Sci. 41:1422–1426.
24. Salim N, Holliday J, Willett P. (2003) Combination of Fingerprint-Based Similarity Coefficients Using Data Fusion. J. Chem. Inf. Comp. Sci. 43:435–442.
25. Schuffenhauer A, Floersheim P, Acklin P, Jacoby E. (2003) Similarity Metrics for Ligands Reflecting the Similarity of the Target Proteins. J. Chem. Inf. Comp. Sci. 43:391–405.
26. Whittle M, Gillet VJ, Willett P, Alex A, Losel J. (2004) Enhancing the Effectiveness of Virtual Screening by Fusing Nearest-Neighbour Lists: A Comparison of Similarity Coefficients. J. Chem. Inf. Comp. Sci. 44:1840–1848.
27. Cramer RD, Redl G, Berkoff CE. (1974) Substructural Analysis: A Novel Approach to Problem of Drug Design. J. Med. Chem. 17:533–535.
28. Ormerod A, Willett P, Bawden D. (1989) Comparison of Fragment Weighting Schemes for Substructural Analysis. Quant. Struct.-Act. Rel. 8:115–129.
29. Wilton DJ, Willett P, Lawson K, Mullier G. (2003) Comparison of Ranking Methods for Virtual Screening in Lead-Discovery Programs. J. Chem. Inf. Comp. Sci. 43:469–474.
30. Harper G, Bradshaw J, Gittins JC, Green DVS, Leach AR. (2001) Prediction of Biological Activity for High-Throughput Screening Using Binary Kernel Discrimination. J. Chem. Inf. Comp. Sci. 41:1295–1300.
31. The MDL Drug Database Report database is available from MDL Information Systems Inc. at http://mdl.com
32. Barnard Chemical Information Ltd. is at http://bci.gb.com
33. Daylight Chemical Information Systems Inc. is at http://www.daylight.com
34. Morgan HL. (1965) The Generation of a Unique Machine-Description for Chemical Structures: A Technique developed at Chemical Abstracts Service. J. Chem. Doc. 5:107–113.
35. Scitegic Inc. is at http://www.Scitegic.com
36. Mason JS, Morize I, Menard PR, Cheney DL, Hulme C, Labaudiniere RF. (1999) New 4-Point Pharmacophore Method for Molecular Similarity and Diversity Applications: Overview of the Method and Applications, including a Novel Approach to the Design of Combinatorial Libraries Containing Privileged Substructures. J. Med. Chem. 42:3251–3264.
37. Matter H, Pötter T. (1999) Comparing 3D Pharmacophore Triplets and 2D Fingerprints for Selecting Diverse Compound Subset. J. Chem. Inf. Comp. Sci. 39:1211–1225.
38. Schneider G, Neidhart W, Giller T, Schmid G. (1999) “Scaffold-Hopping” by Topological Pharmacophore Search: A Contribution to Virtual Screening. Angew. Chem. Int. Ed. Engl. 38:2894–2896.
39. Fechner U, Lutz F, Renner S, Schneider P, Schneider G. (2003) Comparison of Correlation Vector Methods for Ligand-Based Similarity Searching. J. Comput.-Aided Mol. Des. 17:687–698.
40. Bohl M, Dunbar JB, Gifford EM, Heritage T, Wild DJ, Willett P, Wilton DJ. (2002) Scaffold Searching: Automated Identification of Similar Ring Systems for the Design of Combinatorial Libraries. Quant. Struct.-Act. Rel. 21:590–597.
41. Cramer RD, Jilek RJ, Guessregen S, Clark SJ, Wendt B, Clark RD. (2004) Lead Hopping. Validation of Topomer Similarity as a Superior Predictor of Similar Biological Activities. J. Med. Chem. 47:6777–6791.
42. Böhm HJ, Flohr A, Stahl M. (2004) Scaffold-Hopping. Drug Discov. Today: Technol. 1:217–224.
43. The Dictionary of Natural Products database is available from Chapman & Hall/CRC Press at http://www.chemnetbase.com
44. Schuffenhauer A, Zimmermann J, Stoop R, van der Vyver JJ, Lecchini S, Jacoby E. (2002) An Ontology for Pharmaceutical Ligands and its Application for in silico Screening and Library Design. J. Chem. Inf. Comp. Sci. 42:947–955.
45. Chen B, Harrison RF, Pasupa K, Wilton DJ, Willett P, Wood DJ, Delaney J, Lawson K, Mullier G. Evaluation of Binary Kernel Discrimination for Virtual Screening in Lead-Discovery Programmes. J. Chem. Inf. Model., submitted for publication.
Chapter 7
Structural Informatics: Chemogenomics In silico
Derek A. Debe,* Kevin P. Hambly and Joseph F. Danzer
* Corresponding author: Email: [email protected], 9381 Judicial Drive, Suite 200, San Diego, CA 92121, Eidogen-Sertanty, Inc.

1. Introduction

The recent completion of the human genome is a dramatic example of how modern experimental automation coupled with large scale collaboration can produce biological data that previous generations of scientists could scarcely dream of. Large scale biology efforts are now pressing forward, fueling two new fields that represent the next logical steps after the completion of the genome: chemogenomics and structural genomics.

The goal of chemogenomics is a systematic understanding of how various chemical compounds modulate the function or activity of each and every gene product (protein) in the human body. Before an approach qualifies as chemogenomic, it must provide knowledge about multiple targets or pathways in a holistic manner that represents a departure from drug discovery’s historic target-by-target approach. Hence, technologies such as cell-based screening and expression profiling are commonly labeled chemogenomic by virtue of their ability to furnish data that spans multiple targets and pathways. Naturally, as these chemogenomic technologies begin to yield new types of data, new chemogenomic informatics approaches must be developed to convert this data into knowledge relevant to drug discovery.

The goal of human structural genomics is a systematic determination of all of the protein structures in the human body. Current experimental efforts have been instrumental in establishing high-throughput structural genomics platforms that employ automated protein expression, crystallization, data acquisition, and model refinement technologies. These efforts, coupled with large scale collaboration, resulted in more than 5000 new publicly available protein structures (mostly non-human) in 2004, culminating in a total of 28 992 structures, more than 10 times the number of structures in 1993 [1]. The number of protein-ligand co-crystal structures available for drug targets has also increased substantially over the last decade (Fig. 1). Historically, structure-based drug design (SBDD) approaches have utilized co-crystal information to rationally optimize the activity and ADME properties of lead compounds. Since the first successful structure-based drugs were developed to target HIV protease and influenza neuraminidase in the early nineties, inhibitors for more than 40 distinct targets have been developed using SBDD approaches [2].

Figure 1. Since the mid-1990s, high throughput technologies have rapidly expanded the number of targets that have been expressed and successfully co-crystallized. This figure shows the substantial growth in the number of distinct kinase co-crystals in the public domain (data for plot gathered from the PDB).

As the amount of protein and protein-ligand complex structural data increases, structural coverage across many important gene families is becoming much more complete. That is, rather than having the structure for just one target within a gene family, structures are becoming available for many or all of the targets within a gene family. This increase in structural coverage offers the possibility of replacing the historic target-by-target utilization of structural data with a holistic, chemogenomic approach. Structural informatics is an important branch of chemogenomic informatics whose goal is to utilize the rapidly expanding structural database in new ways to enhance the discovery and optimization of small molecule protein modulators on a genomic scale.
2. Structural Informatics Defined

Since structural informatics is currently an emerging field, it is instructive to compare it with its well-established discovery informatics cousins, bioinformatics and cheminformatics. According to the fourth edition of the American Heritage® Dictionary of the English Language, informatics is defined as, “the collection, classification, storage, retrieval, and dissemination of recorded knowledge.” In bioinformatics, gene and protein sequences are classified according to their similarity to infer function for genes and proteins whose functions have not been verified experimentally. For example, many of the proteins that we refer to as kinases have never been assayed for kinase activity; we infer that they are kinases because their sequences are similar to those of verified kinases. Similarly, in cheminformatics, small molecules are classified according to their similarity (via a multitude of different approaches [3,4]), in the hope of inferring the activity of a molecule that has not been assayed from one that has.

This well-established process of inference via similarity is not without error; every once in a while the bioinformatics-based inference of protein function will be incorrect. Much more frequently, the cheminformatics-based inference of small molecule activity is in error, since slight changes in a molecule can dramatically affect its ability to bind. Hence, in cheminformatics, inference of function from similarity classification is less reliable than in bioinformatics. Because of this lack of reliability, inference in cheminformatics is thought of as an imperfect screening process, whose less than ideal performance is analyzed in terms of an enrichment factor (a measure of how much better the cheminformatic inference performs than random inference).

Structural informatics utilizes the same process of inference through classification as its well-established informatics cousins. In structural informatics, the data being compared and classified are protein structures, binding sites, and ligand binding modes. Correspondingly, the types of algorithms used for the purposes of classification are structure alignments, site alignments, and binding mode alignments. Figure 2 summarizes the relationship between bio-, structural, and cheminformatics.
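The enrichment factor mentioned above can be made concrete with a brief sketch. It uses one common formulation, assumed here rather than taken from the chapter: the hit rate among the top-ranked fraction of a screened collection divided by the hit rate expected from random selection. The function name and the toy data are hypothetical.

```python
from typing import List, Set


def enrichment_factor(ranking: List[str], actives: Set[str], fraction: float = 0.01) -> float:
    """Hit rate in the top `fraction` of the ranking relative to random selection."""
    n_selected = max(1, round(len(ranking) * fraction))
    hits = sum(1 for mol_id in ranking[:n_selected] if mol_id in actives)
    hit_rate_selected = hits / n_selected
    hit_rate_random = len(actives) / len(ranking)   # expected rate for random picks
    return hit_rate_selected / hit_rate_random


ranking = [f"m{i}" for i in range(100)]             # hypothetical ranked collection
actives = {"m0", "m1", "m50", "m60"}                # hypothetical known actives
print(enrichment_factor(ranking, actives, fraction=0.1))
```

An enrichment factor of 1 means the inference does no better than random selection; larger values indicate that actives are concentrated near the top of the ranking.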
3. Calculating the Structural Informatics Universe

While the number of compounds that can potentially modulate the activity of human proteins is infinite, the number of human proteins is finite (neglecting, of course, the small differences that make each of us distinct).
Field of Informatics   Primary Data         Similarity Relationships   Key Applications
Bio-                   Protein Sequences    Sequence Alignments        Function Inference
Structural             Protein Structures   Structure Alignments       Function Inference
                       Binding Sites        Site Alignments            Target Hopping, Cross Reactivity, Selectivity, Opportunity Mining
                       Sites+Ligands        Binding Mode Alignments    Enhanced Screening, Scaffold Hopping, Novel Scaffolds
Chem-                  Small Molecules      Molecular Similarities     Screening

Linking steps shown in the diagram: Structure Determination, Site Annotation, Small Molecule Docking.
Figure 2. The relationship between bio-, structural, and cheminformatics. Protein structure determination via experimental or computational methods links bioinformatics to structural informatics; and small molecule docking and protein-ligand co-crystal structure determination link cheminformatics to structural informatics. Just as bio- and cheminformatics utilize protein sequence and small molecule similarity algorithms to infer protein function and compound activity knowledge from existing experimental data, structural informatics utilizes algorithms for determining protein structures, binding sites, and ligand binding mode similarities to infer protein function and compound activity from experimental structure data. Through this process of inference, structural informatics becomes a chemogenomic technology, providing a structural context for small molecule binding and activity on a genome-wide scale.
Hence, it is theoretically feasible to construct bioinformatics and structural informatics databases that contain the sequences, structures, and binding sites for all human proteins. Since experimental structure determination plays a key role in expanding the amount of reliable structural and ligand binding mode data, the foundation of such a database is robust, updatable knowledge management of experimental protein structure information. At Eidogen-Sertanty, we have developed a structural informatics database that utilizes predictive algorithms to amplify the existing experimental protein
structure and binding site data and to classify the resulting structures, sites, and ligand binding modes according to their respective similarity relationships. We call this database the Target Informatics Platform (TIP). Figure 3 shows the data and algorithms in the TIP database, and how data can be extracted from the database to interface with selected compounds from the infinite space of potential molecules. Figure 4 shows a snapshot of the protein structure and binding site data in TIP for those targets recently assigned to the human druggable genome. The remainder of this chapter will focus on some of the applications of structural informatics and the TIP database. We will use the Key Applications column in Fig. 2 as an outline, describing the applications enabled by understanding: 1) protein structural relationships; 2) binding site relationships; and 3) ligand binding mode relationships.
4. Structural Relationships
Due to divergent or convergent evolution, structural homology can be conserved between proteins with undetectable sequence homology. In such instances, protein structure alignment algorithms, such as DALI8 and CE,9 can be used to find structural similarities and potential functional relationships that cannot be found using sequence alignment approaches. The well-known structural classification databases SCOP,10 CATH,11 and FSSP12 store the results of structure alignments for protein structures from the PDB, and the Gene3D database13 goes a step further by providing the CATH structural classification for gene and protein sequences from completed genomes. The TIP database goes an additional step beyond Gene3D, providing not only the structural classification for all protein structures in a genome, but also the explicit structural alignments between each of the structures. While protein structure alignment is certainly an important tool for functional genomics, the knowledge gained from structural classification is of limited value for chemogenomics applications. Inferring whether a compound is likely to bind to a target protein requires an understanding of the relationships at the level of the binding site.
5. Binding Site Relationships
While there are many resources available for obtaining protein structure relationships, there are comparatively few resources available for understanding binding site relationships. Sali and co-workers developed the first
[Figure 3 diagram, summarized. Human TIP Database, by field of informatics (Primary Data; Similarity Relationships):
Bio-: >51,000 Human IPI Sequences; >20 Million BLAST Sequence Alignments.
STRUCTFAST Structure Determination.
Structural: >30,000 Protein Structures; >28 Million StructSorter Structure Alignments.
SiteSeeker Site Annotation.
Structural: >90,000 Binding Sites; >39 Million SiteSorter Site Alignments.
Structural: >2300 Human PDB Co-Crystal Sites; >1 Million SLiC Binding Mode Alignments.
Data & Relationships From TIP Database, passed to EVE Analysis Software:
Structural: Sites+New Ligands; SLiC & cSLiC Binding Mode Alignments.
Chem-: Small Molecules (External Small Molecule DB); Molecular Similarities.]
Figure 3. The data and algorithms in Eidogen-Sertanty’s Target Informatics Platform human database. Since structure determination, structure alignment, and site alignment require significantly more computation time than sequence alignment, the database has been designed to store the results of these calculations and automatically initiate the appropriate new calculations when new experimentally derived sequence or structure data are uploaded. This approach enables us to easily update the database as new structures are deposited in the PDB, and allows users to modify the structural informatics data and classifications in real time. Currently, the sequence layer of the database contains more than 50,000 sequences from the International Protein Index (IPI) human protein sequence database.5 At this time, there is a publicly available crystal structure for just over 2300 of these human sequences. For those sequences without an experimental structure, we built a structural model using STRUCTFAST, our proprietary homology modeling approach. STRUCTFAST builds an accurate model for approximately 2/3 of the human sequences, resulting in more than 30,000 models. When applied to each of the structures in the database, our site annotation algorithm, SiteSeeker, yields more than 90,000 binding sites, approximately half of which are novel sites with no experimental precedent. The total number of similarity relationships stored in the database is approaching 100 million, with more than 20 million BLAST sequence alignments,6 28 million StructSorter structure alignments, and 39 million SiteSorter site alignments (the StructSorter and SiteSorter algorithms were developed internally at Eidogen-Sertanty and have yet to be published). For the ∼2300 human co-crystals in the database, more than one million binding mode alignments have been determined using our site-ligand contacts (SLiC) approach, which will be described later in this chapter. The completion of these calculations for the human proteome requires more than 3 months on a 128-node Linux cluster (3 GHz processors). Currently, we are calculating other drug discovery relevant proteomes, such as mouse and rat, and various pathogenic species. Storing the calculation results allows users to export, in real time, multiple proteins from the database based on sequence, structure, binding site and binding mode similarities. These database exports are utilized by a visualization and analysis package we have developed called EVE (the Eidogen visualization environment). Within EVE, users can examine all of the similarity relationships with a variety of 2D and 3D visualization techniques, and upload docked molecules for binding mode alignment.
resource of this kind, LigBase.14 The LigBase database was created by coupling site annotations from the co-crystal record in the PDB with the CE structure alignment algorithm, yielding multiple alignments for known binding sites. Recently, another site comparison database has become available, CavBase.15 A distinguishing feature of CavBase is that it contains additional similarities between binding sites from proteins that do not share any structural homology, since the binding sites are directly aligned using a clique detection algorithm,16 not a structure alignment algorithm. At Eidogen-Sertanty, we have developed a site alignment algorithm, SiteSorter, which uses a weighted-clique detection approach17 to directly overlay binding sites and avoid the requirement for structure homology. By integrating SiteSorter with fully automated homology modeling (STRUCTFAST) and site annotation (SiteSeeker), TIP goes an additional step beyond LigBase and CavBase, providing intra- and inter-family binding site relationships for the entire proteome, not just for those proteins whose structures have been resolved experimentally. Since closely related binding sites are more likely to bind to the same small molecules, binding site similarity analysis allows us to infer important cross-reactivity information. During lead discovery for a new target, finding a cross-reactivity to a target for which there are already leads enables the fast discovery of new leads via target hopping.18–21 With the potential of short-circuiting the lead discovery process on a genomic scale, target hopping is an important chemogenomic application of structural informatics. Figure 5
[Figure 4 chart, summarized. >2300 Druggable Human Targets: >4000 structures (52% PDBs, 48% models); >11,900 small molecule binding sites (38% co-crystal, 62% predicted).
Major Enzyme Targets: 96% structural coverage; >2100 total structures; >5400 total sites.
Major Membrane Protein Targets: 44% structural coverage; >500 total structures; >2000 total sites.
Nuclear Receptors: 98% structural coverage; >100 structures; >250 sites.
Other Targets: 81% structural coverage; >1100 structures; >4000 sites.
Key. Enzymes: Protein Kinases, Peptidases, Phosphatases, Cytochrome P450s, Metallopeptidases, Phosphodiesterases, Dehydrogenases, Carbohydrate/Lipid Kinases, Isomerases, Carbonic Anhydrases. Membrane Proteins: GPCRs, Transporters, Ion Channels, Integrins, Cell Surf. Receptors, Glycoproteins. Nuclear Receptors. Other.]
Figure 4. TIP’s structure and binding site coverage for the major drug target families that comprise the druggable human genome.7 TIP’s structural coverage across the major target families is very high, with the only exception being membrane-bound targets (due to the distinct lack of membrane-bound crystal structures). Also shown in the chart is the total number of sites annotated for each of the major target classes. SiteSeeker annotates multiple potential small molecule binding sites for a given target structure, spanning substrate, co-factor, protein-protein interaction, inhibitor, and allosteric sites.
shows an example of intra-family target hopping, while Fig. 6 shows an example of inter-family target hopping. While the potential for target hopping exists when two binding pockets are highly similar, a second set of applications emerges from a detailed understanding of the differences between two similar binding pockets. During lead optimization, where the goal is a highly selective binder, understanding the detailed mechanism of cross-reactivity between targets is critical for modifying existing leads to enhance their affinity for the desired target. Figure 7 shows an example of an undesirable inter-family cross-reactivity found in the TIP database, and proposes a mechanism for an optimized lead series to avoid the undesirable off-target. In addition to enabling the optimization of known leads, structural informatics offers the possibility of mining the proteome for interesting drug discovery opportunities that are likely to succeed because binding site similarity analysis reveals an opportunity to design a highly selective binder. Figure 8 shows an example of opportunity mining in the kinases, and Fig. 9 shows an example of opportunity mining in the area of anti-infectives. Once one or more projects have been mined, structural informatics can also be used to prioritize the projects by their expected feasibility. Figure 10 shows an example of a project whose feasibility has been adversely affected because the target’s binding site is very different in mice, the animal model of choice.
Figure 5. An example of intra-family target hopping within kinases. According to SiteSorter, Braf kinase, the primary target for the clinical compound BAY 43-9006, is one of the 10 most similar kinases to c-Kit, which has also been shown to bind BAY 43-9006 with sub-micromolar affinity22 (60% of the binding site residues are conserved and colored blue; non-conserved positions are colored yellow). This cross-reactivity cannot be predicted based on the sequence similarity of the Braf and c-Kit kinase domains, since approximately one-sixth of the human kinome is more similar in sequence to Braf than c-Kit is.
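The idea of aligning binding sites directly by clique detection, without any sequence or structure alignment, can be sketched in a few lines of Python. This is a simplified illustration and not the SiteSorter or CavBase algorithm: it assumes each site is reduced to a handful of typed points with hypothetical coordinates, uses an unweighted clique search (basic Bron-Kerbosch enumeration) rather than a weighted one, and the 1.0 Å distance tolerance is an arbitrary choice.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, Set, Tuple

# A site point: (chemical type, xyz coordinates in angstroms).
Point = Tuple[str, Tuple[float, float, float]]
Node = Tuple[int, int]  # (index in site A, index in site B)


def correspondence_graph(
    site_a: List[Point], site_b: List[Point], tol: float = 1.0
) -> Dict[Node, Set[Node]]:
    """Nodes pair points of the same type; edges join pairs whose
    intra-site distances agree within `tol`."""
    nodes = [
        (i, j)
        for i, (type_a, _) in enumerate(site_a)
        for j, (type_b, _) in enumerate(site_b)
        if type_a == type_b
    ]
    adj: Dict[Node, Set[Node]] = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # a point may be matched only once
        d_a = dist(site_a[i][1], site_a[k][1])
        d_b = dist(site_b[j][1], site_b[l][1])
        if abs(d_a - d_b) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def max_clique(adj: Dict[Node, Set[Node]]) -> List[Node]:
    """Largest clique by Bron-Kerbosch enumeration (no pivoting)."""
    best: List[Node] = []

    def expand(r: List[Node], p: Set[Node], x: Set[Node]) -> None:
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return sorted(best)


# Hypothetical sites: site_b is site_a translated, plus one unrelated donor.
site_a = [("donor", (0, 0, 0)), ("acceptor", (3, 0, 0)), ("hydrophobic", (0, 4, 0))]
site_b = [("donor", (10, 0, 0)), ("acceptor", (13, 0, 0)),
          ("hydrophobic", (10, 4, 0)), ("donor", (20, 20, 20))]
print(max_clique(correspondence_graph(site_a, site_b)))
```

The largest clique is the largest set of point pairings that are mutually consistent in geometry and chemistry, which is why the method needs no fold-level homology between the two proteins. The exhaustive search shown here scales poorly and is only suitable for small toy inputs.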
Figure 6. An example of inter-family target hopping between human and viral aspartyl proteases. The aspartyl protease active site is located at a homodimer interface in HIV and within a single domain in Cathepsin D, so sequence and structure alignments between these proteins cannot be constructed. By using an approach independent of sequence or structure homology to directly align the sites, SiteSorter finds that the HIV protease and Cathepsin D substrate sites are highly similar (identical chemical groups within 1 Å are colored dark blue). It has been verified experimentally that Cathepsin D is susceptible to inhibition by HIV-protease inhibitors.23
6. Ligand Binding Mode Relationships
While it has long been a common practice in structure-based drug design to examine the binding modes of co-crystallized ligands to gain insight into the important principles for binding, methods for the fully automated analysis of ligand binding modes have only recently emerged in the literature.28–30 These methods play a crucial role in structural informatics by enabling similarity-based classification of the rapidly expanding database of co-crystal structural data. In the TIP database, a binding mode similarity score is determined for each of the co-crystal binding site overlays using an approach called SLiC (site-ligand contacts), which is similar to the SIFt (structural interaction fingerprint) methodology developed by Singh and co-workers at Biogen.29 In the SIFt and SLiC approaches, the types of contacts that a ligand makes with each of the residues of the binding pocket are coded into a bit string. Aligning the binding pocket residues also aligns the bit strings, enabling a Tanimoto similarity to be calculated (Fig. 11).
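The bit-string encoding and Tanimoto comparison just described can be sketched as follows. This is only an illustration of the general idea: the three contact types follow the description of Fig. 11, but the bit layout, the residue labels, and the contact assignments are hypothetical, and neither the published SIFt encoding nor the SLiC implementation is reproduced here.

```python
from typing import Dict, List, Set

# Assumed contact vocabulary: one bit per contact type per aligned residue.
CONTACT_TYPES = ("hbond", "polar", "hydrophobic")


def encode(contacts: Dict[str, Set[str]], residues: List[str]) -> List[int]:
    """Turn per-residue contact sets into a bit string over aligned residues."""
    bits: List[int] = []
    for residue in residues:
        made = contacts.get(residue, set())
        bits.extend(1 if contact in made else 0 for contact in CONTACT_TYPES)
    return bits


def tanimoto(a: List[int], b: List[int]) -> float:
    """Shared on-bits divided by on-bits present in either string."""
    if len(a) != len(b):
        raise ValueError("bit strings must be aligned to the same residues")
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


# Hypothetical aligned pocket positions and ligand contacts.
residues = ["R1", "R2", "R3", "R4"]
ligand_a = {"R1": {"hbond"}, "R2": {"hydrophobic"}, "R3": {"hbond", "polar"}}
ligand_b = {"R1": {"hbond"}, "R2": {"hydrophobic"}, "R4": {"hydrophobic"}}

bits_a = encode(ligand_a, residues)
bits_b = encode(ligand_b, residues)
print(bits_a, bits_b, tanimoto(bits_a, bits_b))
```

Because the comparison depends only on which pocket residues are contacted and how, two ligands with unrelated chemical scaffolds can still score as highly similar.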
Figure 7. Binding site similarity analysis can reveal unwanted off-target cross-reactivities. Searching the TIP database with the geldanamycin site on Hsp90 retrieves a SiteSeeker predicted site on ATP citrate lyase in the top 0.08% of all human binding sites, and a SiteSeeker predicted site on ADE2 in the top 0.2%. Both of these targets have been shown to bind Hsp90 inhibitors.24,25 This figure shows geldanamycin (red) positioned in the ADE2 pocket according to the SiteSorter overlay between the Hsp90 and ADE2 pockets. Three important hydrogen bonds to the geldanamycin are preserved in the ADE2 pocket, even though ADE2 does not share any sequence or structure homology with Hsp90. Interestingly, positioning 17AAG (yellow), a geldanamycin analog and clinical candidate, into the ADE2 pocket reveals steric hindrance (inside the green circle) that may account for 17AAG’s reduced toxicity relative to geldanamycin.
By converting the interactions important for binding into one-dimensional bit strings, the SIFt and SLiC approaches can be coupled with small molecule docking approaches to find new molecules that are capable of making the same interactions. In this manner, automated binding mode analysis can be used to significantly enhance docking-based approaches for inferring small molecule activity (Fig. 12). Recently, scientists at Vertex have published the BREED approach for determining new compounds with a high likelihood of activity based on three-dimensional binding mode similarity (Fig. 14).31 When coupled with the automated SiteSorter site alignment calculations in TIP, BREED becomes an important chemogenomic approach for quick determination of large slices of active chemical space for the important drug target families in the human genome.
Figure 8. Structural informatics can be used to mine the best opportunities for selectivity within important drug target families such as kinases. Recently, Cohen et al. developed highly selective, irreversible p90 ribosomal S6 Kinase (RSK) inhibitors by exploiting an exposed and highly non-conserved cysteine in the ATP binding pocket.26 This figure shows an EVE comparative binding site analysis of the four RSKs along with the most highly similar ATP binding sites in the human kinome. None of the most similar kinases share the cysteine found in the RSKs.
Figure 9. Structural informatics can be used to mine anti-infective opportunities that cannot be discovered by comparative genomics. Because lanosterol demethylase (CYP51), the target for Diflucan and other azole drugs, is a highly conserved enzyme that is not at all unique to the fungal genome, it would be completely disregarded as a potential anti-fungal target by comparative genomics methods. In contrast, TIP identifies CYP51 as a target with significant anti-infective potential by revealing important differences between the human and fungal binding pockets. This figure shows an EVE overlay of the azole-binding sites in the STRUCTFAST models for the CYP51 enzyme from the pathogenic fungus Candida albicans (red) and from human (cyan). A fluconazole molecule is shown in its predicted binding mode. The bulky Met487 and Arg133 residues in the human enzyme close off a portion of the binding pocket that is accessible in the fungal enzyme, which has serine and histidine residues in the corresponding positions. The shallower human binding pocket does not accommodate the binding of fluconazole or other multiply substituted azole compounds.
7. Summary
Structural informatics is an emerging field that promises to provide a significant amount of chemogenomic knowledge as the amount of experimental structural data continues to increase. A unique aspect of this rational approach to obtaining chemogenomic information is its potential to answer the question “What is the mechanism of action?” as soon as it answers the question “Is this molecule active?” Hence, regardless of whether structural informatics is used to generate original activity or cross-reactivity data prior to other emerging high-throughput experimental methodologies, its place as an important technology for resolving the detailed mechanisms behind the primary chemogenomic observations is certainly assured.
Figure 10. Binding site analysis of different species can uncover potential problems with animal models for a given target. The Cathepsin S inhibitor JNJ 10329670 (exact molecule not shown) has an activity of 34 nM in humans, but shows sub-micromolar activity in dog, monkey, and cattle, and only micromolar activity in mice.27 These activity differences can be explained by the fact that in the dog, monkey and bovine Cathepsin S pockets, only two of the residues are non-conserved, while four of the residues are non-conserved in mice.
Figure 11. The SLiC similarities for a set of CDK2 co-crystals as presented in EVE. EVE employs a 4-color coloring scheme to provide an easy to understand visual representation of the various interaction bit strings used in the SLiC scoring. Residues participating in a hydrogen bond with the ligand are colored blue, residues participating in a polar interaction are colored red, while residues participating in a hydrophobic interaction that do not have either h-bond or polar interaction are colored yellow. Residues that participate in both an h-bond and a polar interaction with the ligand are colored purple. The top line, highlighted in blue and labeled “Composite1,” is a composite SLiC (cSLiC) that represents the average of the ligand interactions made by the various co-crystals. In EVE, users can build one or more cSLiCs and use them as an alternative or supplement to energy-based affinity scoring functions for the purposes of docking pose selection and affinity ranking.
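A composite profile of the kind described in the Fig. 11 caption, and its use for selecting among docking poses, might look like the following Python sketch. The averaging step follows the caption; everything else is an assumption made for illustration: the bit strings are hypothetical, and the continuous form of the Tanimoto coefficient is one common way to compare a binary string with an averaged, real-valued profile, not necessarily the scoring used in EVE.

```python
from typing import Dict, List, Sequence


def composite(bit_strings: Sequence[Sequence[int]]) -> List[float]:
    """Average aligned interaction bit strings position by position."""
    if not bit_strings:
        raise ValueError("at least one bit string is required")
    length = len(bit_strings[0])
    if any(len(bits) != length for bits in bit_strings):
        raise ValueError("bit strings must have equal length")
    return [sum(column) / len(bit_strings) for column in zip(*bit_strings)]


def continuous_tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Tanimoto coefficient generalized to real-valued vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def best_pose(poses: Dict[str, Sequence[int]], profile: Sequence[float]) -> str:
    """Name of the docking pose whose contacts best match the profile."""
    return max(poses, key=lambda name: continuous_tanimoto(poses[name], profile))


# Hypothetical interaction bit strings for three co-crystals and two poses.
cocrystals = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 0, 0]]
profile = composite(cocrystals)
poses = {"pose_1": [1, 0, 1, 0], "pose_2": [0, 1, 0, 1]}
print(profile, best_pose(poses, profile))
```

Positions contacted by every co-crystal ligand receive a weight of 1 in the profile, so poses that miss those conserved interactions are penalized most strongly.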
[Figure 12 plot. x-axis: % of Database Screened (0 to 10); y-axis: % of Actives Recovered (0 to 50); curves: Ideal, Co-Crystal cSLiC+Docking, Co-Crystal cSLiC, Docking, Random.]
Figure 12. Docking-based virtual ligand screening has emerged as an important workflow in computational lead discovery. This figure demonstrates the enrichment enhancements obtained when the cSLiC shown in Fig. 11 is used to re-rank a set of docking results for CDK2. To generate these results, 91 compounds with <1 micromolar CDK2 activity were buried in a set of 1752 background molecules randomly sampled from the MDDR. Given the number of actives and inactives in this experiment, random enrichment is represented by the grey line, and an optimal enrichment that ranks all the actives before all of the background molecules is represented by the black line. In this experiment, a commonly used docking methodology in conjunction with a widely used consensus docking score yielded an enrichment of ∼3 at 1% of the database. Re-ranking the best poses from the docking according to the co-crystal cSLiC results in an enrichment of ∼15 at 1% of the database. Using the average of the docking and cSLiC scores allows more than 20% of the actives to be found before a false positive, significantly enhancing the quality of small molecule activity inference.
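To make the enrichment values in the Fig. 12 caption concrete, the following Python sketch computes an enrichment factor for a ranked list. It assumes the conventional definition (hit rate in the top fraction divided by the hit rate in the whole database), which the chapter does not spell out, and the rankings are synthetic: only the counts of 91 actives and 1752 background molecules come from the caption.

```python
from typing import Sequence


def enrichment_factor(ranked_is_active: Sequence[bool], fraction: float) -> float:
    """Hit rate in the top `fraction` of the ranking relative to the overall hit rate."""
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n_actives == 0:
        raise ValueError("the ranking contains no actives")
    n_top = max(1, int(n_total * fraction))
    hits = sum(ranked_is_active[:n_top])
    return (hits / n_top) / (n_actives / n_total)


N_ACTIVES, N_BACKGROUND = 91, 1752  # counts quoted in the Fig. 12 caption

# Ideal ranking: every active ahead of every background molecule.
ideal = [True] * N_ACTIVES + [False] * N_BACKGROUND

# Synthetic ranking with 13 actives among the first 18 compounds (top 1%).
synthetic = ([True] * 13 + [False] * 5
             + [True] * (N_ACTIVES - 13) + [False] * (N_BACKGROUND - 5))

print(round(enrichment_factor(ideal, 0.01), 2))
print(round(enrichment_factor(synthetic, 0.01), 2))
```

Under this definition, random selection gives an enrichment of 1, and with these database sizes the largest possible value at 1% of the database is 1843/91, or about 20. An enrichment of about 15 would therefore correspond to roughly 13 actives among the 18 top-ranked compounds.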
Figure 13. The SLiC binding mode analysis approach is capable of finding strong similarities between compounds with different scaffolds, providing a new approach to address the scaffold hopping problem in chemogenomics. This figure shows two different CDK2 co-crystal sites along with their aligned SLiC bit strings. While these compounds are distinct molecular scaffolds, their SLiC Tanimoto score is very high (0.78) due to their highly similar contact patterns.
[Figure 14 graphic: chemical structure drawings of the starting co-crystallized ligands (labels include DTQ, CK5, 1PU, I1P, 107, N5B, LIG, ALH, and LS1) and of the BREED products derived from them, which are labeled by parentage, e.g. (DTQ_CK5_2)_1PU_2 and ((DTQ_I1P_2)_107_1)_LS1_1.]
Figure 14. Recently, scientists at Vertex have published the BREED approach for generating novel ligand scaffolds from the overlay of co-crystal binding sites and subsequent recombinant hybridization of the superimposed bonds of the associated ligands. This figure illustrates a sample product of several iterative rounds of BREED-based fragment recombination. Starting with the 10 co-crystallized ligands on the left, 25 high potential kinase inhibitors are created after 3 generations of BREED. Since the BREED products are composites of known actives, they are highly likely to be active as well.
References
1. Berman HM, Westbrook J, Feng Z, et al. (2002) Nucl. Acids Res. 28:235–242.
2. Kuhn P, Wilson K, Patch MG, Stevens RC. (2002) Curr. Opin. Chem. Biol. 6:704–710.
3. Willett P. (1998) J. Chem. Inf. Comput. Sci. 38:983–996.
4. Schuffenhauer A, Floersheim P, Acklin P, Jacoby E. (2003) J. Chem. Inf. Comput. Sci. 43:391–405.
5. Kersey PJ, Duarte J, Williams A, et al. (2004) Proteomics 4:1985–1988.
6. Altschul SF, Madden TL, Schaffer AA, et al. (1997) Nucl. Acids Res. 25:3389–3402.
7. Hopkins AL, Groom CR. (2002) Nat. Rev. Drug Discov. 1:727–730.
8. Holm L, Sander C. (1997) Nucl. Acids Res. 25:231–234.
9. Shindyalov IN, Bourne PE. (1998) Prot. Eng. 11:739–747.
10. Conte LL, Brenner SE, Hubbard TJP, et al. (2002) Nucl. Acids Res. 30:264–267.
11. Orengo CA, Michie AD, Jones S, et al. (1997) Structure 5:1093–1108.
12. Holm L, Sander C. (1994) Nucl. Acids Res. 22:3600–3609.
13. Buchan DW, Rison SC, Bray JE, et al. (2003) Nucl. Acids Res. 31:469–473.
14. Stuart AC, Ilyin VA, Sali A. (2002) Bioinformatics 18:200–201.
15. Information available at http://www.ccdc.cam.ac.uk.
16. Schmitt S, Kuhn D, Klebe G. (2002) J. Mol. Biol. 323:387–406.
17. Bomze IM, Budinich M, Pardalos PM, Pelillo M. (1999) Handbook of Combinatorial Optimization.
18. Kauvar LM, Villar HO. (1998) Curr. Opin. Biotech. 9:390–394.
19. Schuffenhauer A, Zimmermann J, Stoop R, et al. (2002) J. Chem. Inf. Comput. Sci. 42:947–955.
20. Armstrong JI, Portley AR, Chang YT, et al. (2000) Angew. Chem. Int. Ed. 39:1303–1306.
21. Breinbauer R, Vetter IR, Waldmann H. (2002) Angew. Chem. Int. Ed. 41:2878–2890.
22. Fabian A, Biggs WH, Treiber DK. (2005) Nat. Biotech. 23:329–336.
23. Hoegl L, Korting HC, Klebe G. (1999) Pharmazie 54:319–329.
24. Ki SW, Ishigami K, Kitahara T, et al. (2000) J. Biol. Chem. 275:39231–39236.
25. Dishman R. (2002) Pharmacogenomics Sept/Oct, 58–62.
26. Cohen MS, Zhang C, Shokat KM, Taunton J. (2005) Science 308:1318–1321.
27. Thurmond RL, Sun S, Sehon CA, et al. (2004) J. Pharmacol. Exp. Ther. 308:269–276.
28. Kelly MD, Mancera RL. (2004) J. Chem. Inf. Comput. Sci. 44:1942–1951.
29. Deng Z, Chuaqui C, Singh J. (2004) J. Med. Chem. 47:337–344.
30. Chuaqui C, Deng Z, Singh J. (2005) J. Med. Chem. 48:121–133.
31. Pierce AC, Rao G, Bemis GW. (2004) J. Med. Chem. 47:2768–2775.
8
Construction of a Homogeneous and Informative In vitro Profiling Database for Anticipating the Clinical Effects of Drugs
Nicolas Froloff,∗ Valérie Hamon, Philippe Dupuis, Annie Otto-Bruc, Boryeu Mao, Sandra Merrick and Jacques Migeon
1. Introduction
The BioPrint database is a detailed map of the biological properties of over 2400 small molecule drugs and relevant drug-like reference compounds. The database is used as a context with which to analyze discovery and development compounds and to gain an understanding of their likely clinical properties prior to testing in animals or humans. The database contains three types of data: (i) chemical structures and molecular descriptors of the active ingredients of the compounds; (ii) high quality in vitro data generated through testing each compound in more than 180 different biochemical pharmacology and ADME (Absorption, Distribution, Metabolism, Excretion) assays; (iii) human pharmacokinetic and adverse effect data for these small molecule drugs. This chapter gives an overview of the database and demonstrates its usefulness in supporting molecular modeling, defining relationships between in vitro assays, and identifying neighbors (as defined by in vitro biological space) of discovery or development compounds and how these neighbors can be used to estimate in vivo effects. The content and quality of the in vitro pharmacological and ADME data, as well as assay selection are described. Compound handling, testing, and data management are discussed as well. Finally, future directions of the projects are outlined.
∗ Corresponding author. E-mail: [email protected]
Cerep, 19 avenue du Québec, 91951 Courtaboeuf cedex, France. Cerep, Le Bois l’Evêque, 86600 Celle l’Evescault, France. Cerep, 15318 NE 95th Street, Redmond, WA 98052, USA.
In vitro profiling data production, analysis, and interpretation have been the core business of Cerep for over 15 years. Drug profiling is a strategy aimed at substantially improving the drug discovery process.1 The BioPrint project began in 1997 as a focused effort to build a high quality dataset linking the knowledge domains of computational chemistry, in vitro biology, and clinical effects. The data set and tools have been successful in bringing fundamentally new insights into each of these knowledge domains. This project is part of current concerted efforts in academic and industrial spheres of drug discovery and development to understand the effects of small molecules in living organisms, and in particular in man. The common first step of all these efforts is the construction of large databases where the knowledge on small molecules is linked to knowledge on their individual effects on genes, biological targets, metabolic pathways, as well as on cells, tissues and ultimately whole living systems. Another critical aspect of this approach, at the interface of chemistry and biology, generally referred to as chemogenomics,2−6 is the development of new computational tools to integrate cheminformatics and bioinformatics, and to analyze and mine the data efficiently for better decision-making. BioPrint contains the data for active ingredients from over 2400 marketed drugs, failed drugs, and reference compounds tested in a panel of over 180 diverse, well-characterized in vitro pharmacological and ADME assays. It contains over one million data records of homogeneously executed pharmacological and ADME assays. This in vitro dataset is complemented by a carefully culled collection of human clinical data covering pharmacokinetics and adverse drug reactions. 
BioPrint is currently made of three subsets of data:
• Chemical properties (2D structures and 3D descriptors)
• In vitro profiles
• Clinical effects

BioPrint is a unique resource for knowledge building and decision making at multiple stages of the drug discovery process (Fig. 1). For early drug discovery and lead optimization, BioPrint can serve as a cheminformatics tool to evaluate the relationships between chemical properties and in vitro profiles. For later stage drug development, BioPrint can serve as a bioinformatics tool to evaluate the relationships between in vitro profiles and clinical effects (Fig. 1). In this chapter, we describe how the database is built, including compound selection, assay selection, and testing. In Sec. 2, we detail the various
Figure 1. Conceptual view of BioPrint. [Diagram linking three data subsets: chemical properties, in vitro profiles, and clinical effects. Labels: Chemo-IT (library design, general molecular modeling); Bio-IT (late lead optimization, drug development).]
families of compounds selected for building the database, as well as the quality assurance procedures for compound purification and compound management for in vitro testing. In Secs. 3 and 4, we describe the strategy for selecting an array of in vitro pharmacological and ADME assays to ensure rich information content. Also described are the quality controls for data generation and validation, which are crucial to the value of the database. In Sec. 6, we summarize the use of BioPrint as a cheminformatics tool. This has been described in detail previously by Krejsa et al.7 The two main applications of BioPrint in this area are:
• Single property prediction through QSAR modeling (Quantitative Structure-Activity Relationship).
• Profile prediction through GNB modeling (General Neighborhood Behavior).

In Secs. 5, 7 and 8 we focus on the application of BioPrint to bioinformatics. Section 5 deals with the process of extracting and classifying in vivo data on compounds from literature in the public domain. Section 7 describes how proprietary knowledge is generated from BioPrint by systematically looking for statistically significant associations between assays and adverse drug reactions (ADRs). More than 5000 significant assay-ADR associations have been identified, in which compounds active at a given pharmacological target show a significantly higher frequency of a specific ADR than the baseline frequency. In Sec. 8, we show examples of how BioPrint can be used as a tool to analyze the profiles of new drug candidates. Discovery and development compounds can be placed in the context of nearly all of the compounds that have passed the hurdles of modern drug development. This context
provides knowledge of likely clinical properties (i.e. likely ADRs, bioavailability, pharmacokinetics) prior to testing in animals or humans. Finally, Sec. 9 gives an outlook on the future developments of the BioPrint project.
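The assay-ADR association test outlined above for Sec. 7 can be sketched as a 2 × 2 contingency analysis. The chapter names a chi-squared analysis (Sec. 5.1) but does not give its details, so the table layout, the absence of a continuity correction, and the example counts below are all assumptions made for illustration only.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom).

    Hypothetical table layout (not taken from the chapter):
        a: active at the target, ADR reported
        b: active at the target, ADR not reported
        c: inactive at the target, ADR reported
        d: inactive at the target, ADR not reported
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("degenerate table: a row or column sums to zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the survival function reduces to erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def adr_enriched(a, b, c, d):
    """True when the ADR frequency among actives exceeds the baseline frequency."""
    return a / (a + b) > (a + c) / (a + b + c + d)


# Made-up counts: 30 of 100 actives show the ADR versus 100 of 1000 inactives.
chi2, p = chi2_2x2(30, 70, 100, 900)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, enriched = {adr_enriched(30, 70, 100, 900)}")
```

A one-sided reading ("higher than baseline") needs the direction check in addition to the test statistic, which is why the two functions are kept separate.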
2. Compounds

Currently, there are more than 2400 compounds in the BioPrint dataset. Priority in the selection of compounds for the BioPrint program has been given to information-rich compounds, for which human clinical data are available. Compound selection follows this principle: to include currently marketed human non-protein therapeutics, discontinued and withdrawn drugs, veterinary drugs, purified bioactive compounds from herbal remedies, compounds that failed in drug development, and pharmacology reference compounds. Approximately 60% of the dataset comprises marketed drugs; 5% are compounds tested in the clinic but not marketed; 1% are compounds currently in clinical trials; 2% are drugs withdrawn from the market; and 20% are standard pharmacology reference compounds. The remaining compounds are veterinary drugs, drug metabolites, development compounds, synthesis intermediates, drug impurities, fungicides, herbals, nutraceuticals, herbicides, insecticides, food additives and preservatives, pharmaceutical aids, and diagnostic and research reagents.

All compounds are tested as purified active ingredients. Most compounds are obtained from commercial sources, some are purified at Cerep from a formulated product, and some have been synthesized at Cerep. The chromatographic purity is measured in the aqueous solubility assay, which uses LC/MS with UV or evaporative light scattering detection to analyze the sample. More than 99% of BioPrint compounds are >95% pure. The remaining compounds are >80% pure and are primarily compounds purified from natural extracts.

The value of the BioPrint dataset derives from the combination of high quality in vitro data generated for each compound and in vivo data extracted from public medical literature (see below). Relating both types of information supports the bioinformatics applications of the database.
Also of value is the diversity of compounds, both chemical and biological, which are indicated for a large array of therapeutic areas. This diversity provides a good training set to develop and test various QSAR methods, and supports the cheminformatics applications of the database (Fig. 1).
3. In vitro Pharmacological Data: Content and Validation

3.1. Assay selection

The BioPrint pharmacological profile is designed for optimal assessment of “biological” diversity. Assay selection is based on several considerations including:
• Diversity of targets based on the concept of the “druggable” proteome and on phylogenetic analysis.
• Important therapeutic target classes.
• Cerep’s experience and skills in drug profiling.
• Assay number and quality.

3.1.1. “Druggable” proteome

The BioPrint profile represents a subset of the “druggable” genome, more specifically a subset of the “druggable” proteome. This is illustrated by clustering the target proteins by sequence homology (Fig. 2) and by the different molecular functions of the assay targets.8–10 Receptors and enzymes are the major classes represented. Of the 168 targets in the current BioPrint in vitro pharmacological profile, 97 are receptors and 43 are enzymes (Table 1). To ensure continued diversity of the BioPrint profile, additions of new targets to the panel of assays are analyzed in relation to recent and detailed phylogenetic analysis of human G protein-coupled receptors (GPCRs) and kinases.11,12

3.1.2. Important therapeutic target classes

Receptors (mainly GPCRs) and enzymes represent the majority of therapeutic targets in current drug discovery.9,13,14 Ion channels, transporters and nuclear receptors are also important targets. The BioPrint profile reflects this emphasis: 58% are receptors which include nuclear receptors; 26%, enzymes; 13%, ion channels; and 4%, transporters (Table 1). Among the receptors, G protein-coupled receptors are the most diverse and well-represented class of the targets (91 representatives). Monoamines, and more generally neurotransmitter receptors, are also highly represented in the BioPrint profile (Table 1). Many enzymes included in the profile are proteases and kinases, i.e. 13 and 8 targets, respectively (Table 1).
These enzymes are involved in diseases such as cancer and inflammation, and are targets of interest for drug
Figure 2. A cross-section of the druggable proteome taken from Ref. 10. Proteins in close proximity in this dendrogram are members of the same gene family and share sequence similarity and structure similarity in regulatory and ligand-binding domains. 92 proteins that are part of the BioPrint in vitro pharmacological profile are shown in pink and cover a representative portion of the druggable proteome.
discovery. Recently, kinases have emerged from drug discovery projects as new targets because of both the large number of family representatives (not all of which have yet been characterized), and the recent proof-of-concept studies on kinase inhibitors as a new class of anticancer drugs.15 We also include older targets that are still of interest and for which low levels of off-target effects are expected. Examples of this are enzymes involved in arachidonic acid metabolism (cyclo-oxygenases) and phosphodiesterases.

3.1.3. Experience in pharmacological profiling

The BioPrint profile is enriched by 15 years of knowledge and experience in pharmacological profiling.16 This experience includes an in-depth
Table 1. Classification and description of the different targets included in the in vitro pharmacological BioPrint profile. The number of assays in each family is detailed. Among the receptors there are 91 G protein-coupled receptors.

RADIOLIGAND BINDING ASSAYS

Class                         Family                              Number
Non-peptide receptors         Adenosine                                5
                              Adrenergic                              10
                              Cannabinoid                              2
                              Dopamine                                 5
                              GABA                                     2
                              Histamine                                7
                              Imidazoline                              1
                              Leukotriene                              2
                              Melatonin                                1
                              Muscarinic                               5
                              Platelet Activating Factor               1
                              Serotonin                               12
                              Sigma                                    1
                              Total                                   54
Peptide receptors             Angiotensin                              2
                              Bombesin                                 1
                              Bradykinin                               1
                              Calcitonin gene-related peptide          1
                              Chemokines                               3
                              Cholecystokinin                          2
                              Complement 5a                            1
                              Cytokines                                1
                              Endothelin                               2
                              Galanin                                  1
                              Glucagon                                 2
                              Growth hormone secretagogue              1
                              Melanin-concentrating hormone            1
                              Melanocortin                             3
                              Motilin                                  1
                              Neurokinin                               1
                              Neuropeptide Y                           1
                              Neurotensin                              1
                              Opioid & opioid-like                     4
                              Somatostatin                             2
                              Thyroid hormone                          1
                              Urotensin                                2
                              Vasoactive intestinal peptide            2
                              Vasopressin                              3
                              Total                                   40
Nuclear receptors (steroids)  Glucocorticoid                           1
                              Estrogen                                 1
                              Androgen                                 1
                              Total                                    3
Channels                      Ca2+ channels                            5
                              K+ channels                              3
                              Na+ channel                              1
                              Cl- channel                              1
                              Sub-total ion channels                  10
                              GABA                                     2
                              Glutamate                                5
                              Glycine                                  1
                              Nicotinic                                2
                              Purinergic                               1
                              Serotonin                                1
                              Sub-total ligand-gated channels         12
                              Total                                   22
Amine transporters            Norepinephrine                           1
                              Dopamine                                 1
                              GABA                                     1
                              Choline                                  2
                              Serotonin                                1
                              Total                                    6
TOTAL RADIOLIGAND BINDING ASSAYS                                     125

ENZYME ASSAYS

Family                                                            Number
Monoamine metabolism                                                   5
Arachidonic acid metabolism                                            2
NO synthases                                                           1
Phosphodiesterases                                                     7
Proteases                                                             13
Guanylyl cyclase                                                       1
Phosphatase                                                            1
Kinases                                                                8
Free radicals                                                          1
ATPase                                                                 1
Miscellaneous enzymes                                                  3
TOTAL ENZYME ASSAYS                                                   43

TOTAL PHARMACOLOGICAL ASSAYS                                         168
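The class percentages quoted in Sec. 3.1.2 follow directly from the Table 1 totals. The short check below recomputes them; the dictionary name and the grouping of the three receptor sub-totals into a single "receptors" entry are choices made here for illustration, not part of the original table.

```python
# Class totals read from Table 1 (receptors = non-peptide + peptide + nuclear).
TABLE1_TOTALS = {
    "receptors": 54 + 40 + 3,
    "enzymes": 43,
    "ion channels": 22,
    "transporters": 6,
}


def class_shares(totals):
    """Return each class as a rounded percentage of all assays in the profile."""
    grand_total = sum(totals.values())
    return {name: round(100 * count / grand_total) for name, count in totals.items()}


print(sum(TABLE1_TOTALS.values()))   # 168 assays in total
print(class_shares(TABLE1_TOTALS))   # receptors 58, enzymes 26, ion channels 13, transporters 4
```

The rounded shares add up to 101% because each value is rounded independently.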
understanding of the use of biological data to describe relationships between compound structures, in vitro data, and in vivo data. Targets with a clear association between a specific receptor or enzyme and an adverse effect are favored. For example, muscarinic
acetylcholine receptors and adrenergic receptors are clearly implicated in cardiovascular effects in vivo. On the other hand, less understood targets are also of interest because compounds with well-characterized in vitro effects might help elucidate the in vivo roles of these targets. Such targets include the sigma receptor, the melatonin receptor, various kinases, and metalloproteinases. Finally, to increase diversity, and based on our experience, the profile includes targets with low (e.g. AT2, B2, ETA), medium (e.g. AT1, CCKB, CB2) and high hit rates (e.g. Alpha2C, D3, 5-HT2C). This provides positive and negative data to effectively develop and validate predictive models with sufficient accuracy. Indeed, in the design of the database, both negative and positive results are important, and the extent of data validation is the same for both. Moreover, targets that correlate poorly with other targets can provide valuable information.

3.1.4. Assay number and quality

In addition to activity on one or more specific therapeutic targets, most compounds have off-target activities. Human adverse effects are mainly due to these off-target activities. Therefore the entire activity profile, not just individual assay activities, is the preferred basis for correlating in vitro activities with the frequency of adverse effects. The number of assays in the profile must be large enough to include sufficient information, both negative and positive, on a diverse set of biologically active chemicals. Indeed, negative results are as important as positive ones in building a knowledge database. Although it is not possible to include the main therapeutic target for every compound, this absence does not prevent data exploitation to build interesting correlations between compounds in the same biological activity class.10 Among the biological assays, preference is given to assays that measure a direct interaction between a compound and a target, making data interpretation simpler.
Therefore, radioligand binding assays and isolated enzyme assays are preferred to functional or cell-based assays. For high quality data, assays that are homogeneous and robust are selected. Also, whenever possible, preference is given to assays based on human sources. Finally, miniaturized assays are preferred for their fast turnaround time and low cost.
3.2. Compound management and testing

3.2.1. Compound management

The quality of the compound collection is one key to high quality biological results. DMSO is a widely used solvent, and many studies of the long-term storage of compounds in this solvent have been conducted. The important parameters for maintaining stability and concentration are freeze/thaw cycles, storage temperature and duration, container material, humidity, and oxygen.17 Compounds frozen in DMSO are generally stable. However, the number of freeze/thaw cycles must be kept to a minimum to avoid compound loss. Compound loss is mainly due to precipitation, whereas compound loss due to degradation is negligible (no additional peaks are detected in the HPLC analysis). It has been reported that $10^{-2}$ M stock solutions can undergo more than 10 freeze/thaw cycles with no significant effect on compound stability.18 Long-term storage of compounds at room temperature in DMSO, over a three-month period, is also possible with only minor loss of compound. Kozikowski et al.19 observed a less than 20% loss after three months in 92% of the cases. As for container material, polypropylene containers are as satisfactory as glass ones for compound stability and recovery. Because water is more harmful than oxygen to compound stability and the hygroscopic nature of DMSO favors water uptake,17 humidity control is a key parameter for ensuring compound storage stability.

To ensure the integrity of the BioPrint compound collection, stringent rules for compound management are applied. The whole collection is stored as dry powders. Before each annual campaign of testing, new stock solutions are prepared at $10^{-2}$ M concentration in 100% DMSO in inert polypropylene containers. In cases of insufficient solubility, compounds are prepared at $10^{-2}$ M in 50% DMSO/50% H2O or in 100% H2O. Various database flags are generated to indicate compounds at the solubility limit for experimental testing.
Multiple copies of 96-well microtiter plates containing the stock solutions are created to further avoid numerous freeze/thaw cycles. For screening, one copy is made for each assay of the panel and additional copies are dedicated for the selection of compounds for further screening. In this process, the number of freeze/thaw cycles for any compound does not exceed two.
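The stock concentration above fixes how much DMSO ends up in the assay medium. The sketch below works through that arithmetic under one simplifying assumption that the chapter does not state: the stock is diluted directly into aqueous medium, with no intermediate dilution in DMSO. The function names are hypothetical; the 1% tolerance used in the usage lines is the limit quoted in Sec. 3.2.2.

```python
STOCK_MOLAR = 1e-2  # stock concentration in 100% DMSO, as stated in the text


def dmso_percent(final_molar, stock_molar=STOCK_MOLAR, stock_dmso_percent=100.0):
    """Final DMSO content (% v/v) after direct dilution of the stock (assumed scheme)."""
    if not 0 < final_molar <= stock_molar:
        raise ValueError("final concentration must be positive and not exceed the stock")
    return stock_dmso_percent * final_molar / stock_molar


def max_final_molar(dmso_limit_percent, stock_molar=STOCK_MOLAR, stock_dmso_percent=100.0):
    """Highest test concentration that keeps DMSO at or below the given limit."""
    return stock_molar * dmso_limit_percent / stock_dmso_percent


print(dmso_percent(1e-5))    # screening concentration: about 0.1% DMSO
print(dmso_percent(1e-4))    # top IC50 concentration: about 1% DMSO
print(max_final_molar(1.0))  # about 1e-4 M under a 1% DMSO tolerance
```

Under this assumption, a 1% DMSO tolerance and a $10^{-2}$ M stock cap the highest test concentration at $10^{-4}$ M, which matches the limit given in Sec. 3.2.2.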
3.2.2. Compound testing

Compound testing is performed in two steps. The first round of testing is performed at $10^{-5}$ M to measure percent inhibition in the binding assay. Selected compounds are further tested at several concentrations to determine an IC50 value. The binding threshold for further testing should balance two objectives: minimizing false negatives, in which active compounds are missed, and minimizing the number of false positives that require follow-up testing. An activity threshold of 30% inhibition at $10^{-5}$ M for follow-up IC50 determination was selected based on the following considerations:
• Slope break point of percent inhibition data.
• Success rate of IC50 determination.
• Compound solubility.
• Theoretical shape of a concentration-response curve.
Plotting the number of compounds as a function of percent inhibition at $10^{-5}$ M results in a distribution curve with a “slope break point” that corresponds to a discontinuity in the first derivative (“DFD”). A change in the distribution density suggests that there are different populations within the percent inhibition values. The DFD point is numerically calculated from the distribution curve and gives an assessment of the suitable threshold for each assay. Most often it is located in the vicinity of 30% inhibition at $10^{-5}$ M (unpublished work).

The rate of confirmation between the inhibitory effect at $10^{-5}$ M in the screening and the percent inhibition obtained at the same concentration in the subsequent IC50 determination defines a success rate for each assay. This success rate is deemed satisfactory when the inhibitory effect in the primary screening step is above 50% inhibition; reasonable between 30% and 50% inhibition; but weak below 30% inhibition.

A solubility assay with a detection range from $10^{-6}$ to $2 \times 10^{-4}$ M is performed as a part of the BioPrint profile and shows that about one third of the BioPrint compounds are insoluble at $10^{-4}$ M. Thus, aqueous solubility limits the high-end concentration for IC50 determination to $10^{-4}$ M. The final DMSO concentration tolerated in the assay reaction medium must not exceed 1%. This also limits the high-end concentration to $10^{-4}$ M.

Moreover, a standard competition binding curve with a Hill number nH = 1 goes from 10% to 90% inhibition within 2 log units of concentration.
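The curve-shape argument can be checked numerically with the usual Hill form of a concentration-response curve. The functions below are an illustrative sketch, not the fitting procedure used for BioPrint; the function names and the single-point back-calculation of an IC50 are assumptions introduced here.

```python
import math


def percent_inhibition(conc, ic50, hill=1.0):
    """Percent inhibition at concentration `conc` for a Hill-type curve."""
    return 100.0 / (1.0 + (ic50 / conc) ** hill)


def concentration_for(pct, ic50, hill=1.0):
    """Concentration that produces `pct` percent inhibition on the same curve."""
    frac = pct / 100.0
    return ic50 * (frac / (1.0 - frac)) ** (1.0 / hill)


def ic50_from_point(conc, pct, hill=1.0):
    """IC50 implied by one (concentration, percent inhibition) point; idealized."""
    frac = pct / 100.0
    return conc * ((1.0 - frac) / frac) ** (1.0 / hill)


# Width of the 10%-90% window for nH = 1, in log10 units of concentration.
span = math.log10(concentration_for(90, 1.0) / concentration_for(10, 1.0))
print(round(span, 2))  # log10(81), a little under 2 log units

# A compound giving 30% inhibition at 1e-5 M, extrapolated to 1e-4 M.
ic50 = ic50_from_point(1e-5, 30.0)
print(round(percent_inhibition(1e-4, ic50), 1))
```

With nH = 1 the 10% and 90% points sit at IC50/9 and 9 × IC50, so the window is log10(81), or about 1.9 log units. Real curves can be shallower or steeper than this idealized case, so the extrapolated value is only a guide.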
Since the highest testing concentration is $10^{-4}$ M, the screening value must be at least 25–30% inhibition at $10^{-5}$ M in order to obtain at least 50–70% inhibition at $10^{-4}$ M.

The retained approach for compound testing is therefore a first screening at $10^{-5}$ M, in duplicate. Every compound that displays more than 30% inhibition is further tested at 8 concentrations ranging from $10^{-4}$ M to $10^{-11}$ M, in duplicate, for the IC50 determination. Based on several years of experience and analyses, some exceptions have been introduced. The Na+/K+ ATPase assay is screened at $3 \times 10^{-5}$ M because of a low hit rate. The cutoff for IC50 follow-up has been raised to 40% inhibition for CCKB, GABAA, GABAB(1b), Kainate, Glycine (strychnine-insensitive), N (neuronal alpha bungarotoxin insensitive) and PAF assays.

3.3. Assay and data validation

Each in vitro assay is fully validated using several criteria. The validation step for each assay and each experiment follows a specific, well-detailed and documented process. Strict quality standard guidelines ensure the construction of a high quality database of reliable and reproducible data.

3.3.1. Assay validation

Each assay development includes establishing a concentration-response curve for the receptor or enzyme studied, determining the reaction kinetics, Kd and Vmax values for binding assays, or a substrate Km for enzyme assays, and testing known compounds as references. Results for the reference compounds are compared with those reported in the literature to assess the accuracy of the assay protocol. The robustness of an assay is evaluated by a high signal-to-noise ratio, reproducible IC50 or EC50 values of reference compounds, Hill number (nH) close to 1, and a high Z value. All these parameters are addressed before assay validation is completed. To finalize the assay development, all the parameters are documented in a Standard Operating Procedure to be followed during production.
Assay reproducibility and consistency are assured by tracking the results of the reference compounds. Throughout their lifetimes all assays are subject to improvement. In case of an animal species change or a major protocol change, the assay is fully revalidated and all BioPrint compounds are retested.
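The "Z value" mentioned among the robustness criteria is not defined in the chapter. Assuming it refers to the Z-factor statistic widely used in screening (Zhang et al., 1999), it can be computed from control wells as sketched below. The function name and the control readings are made up for illustration.

```python
from statistics import mean, stdev


def z_factor(high_controls, low_controls):
    """Z-factor: 1 - 3 * (sd_high + sd_low) / |mean_high - mean_low|."""
    window = abs(mean(high_controls) - mean(low_controls))
    if window == 0:
        raise ValueError("controls have identical means; no assay window")
    return 1.0 - 3.0 * (stdev(high_controls) + stdev(low_controls)) / window


# Hypothetical plate controls (arbitrary signal units).
total_binding = [1000, 1020, 980, 1010, 990]
nonspecific_binding = [100, 110, 90, 105, 95]
print(round(z_factor(total_binding, nonspecific_binding), 3))
```

A value close to 1 indicates a wide, low-noise assay window; by common convention outside this chapter, values above roughly 0.5 are regarded as robust.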
3.3.2. Data validation

Figure 3 shows the multiple levels of scientific and quality controls of the data. Validation of each experiment requires an acceptable signal value (background, non-specific signal, total signal) and an acceptable IC50 value of the reference compound. For both parameters, acceptance limits are well defined and documented. For a reference compound, the obtained IC50 value (at 8 concentrations in duplicate) should fall within one-half log unit of the historical mean value. Each compound is assayed in duplicate at each concentration and homogeneity of replicates is required with a standard deviation <20%.
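The two numeric acceptance criteria above can be expressed as simple checks. This is a sketch under stated assumptions: the comparison with the historical mean is done on log10 values, and the 20% limit is read as the sample standard deviation of the two replicate percent values. The chapter does not specify either detail, and the function names are hypothetical.

```python
import math
from statistics import stdev


def reference_ic50_ok(ic50_molar, historical_mean_molar, max_log_shift=0.5):
    """True if the reference IC50 lies within half a log unit of the historical mean."""
    shift = abs(math.log10(ic50_molar) - math.log10(historical_mean_molar))
    return shift <= max_log_shift


def duplicates_ok(replicate_1, replicate_2, max_sd=20.0):
    """True if the two replicate percent values are homogeneous (SD below the limit)."""
    return stdev([replicate_1, replicate_2]) < max_sd


# Made-up values: reference IC50 of 2e-8 M against a historical mean of 1e-8 M.
print(reference_ic50_ok(2e-8, 1e-8))  # shift of about 0.3 log units, accepted
print(duplicates_ok(45.0, 55.0))      # SD of about 7, accepted
print(duplicates_ok(20.0, 60.0))      # SD of about 28, rejected
```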
Figure 3. Data validation process. [Flowchart; recoverable content: experiments are run according to the Standard Operating Procedures, then pass through scientific controls and quality controls, and scientifically and quality-validated data are sent to the Cerep database. Controls named in the figure:
• Signal-to-noise ratio (check of background, non-specific, basal and stimulated controls).
• Reference compound data (IC50 value within historical mean +/− half-log, curve shape, Hill number nH, plateau values).
• Homogeneity of duplicates: if the standard deviation is greater than 20% for a given test concentration, the duplicate is rejected.
• Reproducibility (for example, match of first and second screen data).
• Solvent-only controls (% of effect, comparison with historical values).
• Verification of the integrity of the data: matches between original and calculated data; presence of appropriate plate map and output data for each experiment.
• Check that any deviation from the protocol is reported by the operator.
• Check that any rejected data is justified.
• Follow-up of the historical IC50 value for each assay.
• Addition of flags when needed (for example, INTER if the test compound interferes with the detection method).
Note in figure: "These five controls are representative of both assay validity and reproducibility. If these standards are not correct, the assay is performed again."]
Finally, the reproducibility from one experiment to another is also checked. The effect of the compound obtained at $10^{-5}$ M during the IC50 determination is systematically compared to the primary screening result. If the two results are inconsistent, the following rules are applied: When the inhibitory effect of the compound is between 30% and 50% for the primary screening, the IC50 value is accepted regardless of the discrepancy with respect to the screening value. If the inhibition value of the compound is above 50% for the primary screening and the result is not confirmed with the IC50 value determination, a retest is performed for confirmation and then validated.

Some particular situations can occur. These include negative percent inhibition values and compound interference. Negative percentages are real and usually confirmed when retested. They are generally attributable to non-specific effects of the compounds. Such compounds are considered to be inactive. Some compounds interfere with the detection method of the assay (fluorescence, fluorescence polarization or colorimetric method). When detected, an INTER flag is added to the data record and the result is cautiously considered. No IC50 follow-up is performed for negative results or for compounds with an “INTER” flag.

3.3.3. Data management

All information regarding compounds and raw data are handled through a LIMS (Laboratory Information Management System), and data condensation is performed using internal IT tools. Validated data are automatically uploaded from the LIMS into an Oracle database along with all accompanying information (flags describing solubility, interferences, curve shape, etc.). Several routine cross-checks are run on the database. These look in particular for missing data and inconsistency with values reported in the literature. In the latter case, the literature information is thoroughly analyzed, a retest may be performed, and the data corrected.
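The follow-up and reconciliation rules of Secs. 3.2.2 and 3.3.2 amount to a small decision procedure, sketched below. The function names, the returned labels, and the boolean `ic50_confirms` argument are hypothetical; the chapter does not say how "confirmed" is judged, so that decision is left to the caller.

```python
DEFAULT_CUTOFF = 30.0  # % inhibition at the screening concentration
RAISED_CUTOFF = 40.0   # applies to the assays listed in Sec. 3.2.2
RAISED_CUTOFF_ASSAYS = {
    "CCKB", "GABAA", "GABAB(1b)", "Kainate",
    "Glycine (strychnine-insensitive)",
    "N (neuronal alpha bungarotoxin insensitive)", "PAF",
}


def needs_ic50(assay, screen_pct, inter_flag=False):
    """Decide whether a screening result goes on to IC50 determination."""
    if inter_flag or screen_pct < 0:
        return False  # no follow-up for interfering compounds or negative results
    cutoff = RAISED_CUTOFF if assay in RAISED_CUTOFF_ASSAYS else DEFAULT_CUTOFF
    return screen_pct > cutoff


def reconcile(screen_pct, ic50_confirms):
    """Apply the stated rules when screening and IC50 results are compared."""
    if ic50_confirms or screen_pct <= 50.0:
        return "accept IC50"
    return "retest"


print(needs_ic50("5-HT2C", 35.0))  # True: above the default 30% cutoff
print(needs_ic50("PAF", 35.0))     # False: this assay uses the 40% cutoff
print(reconcile(70.0, False))      # retest: strong screening hit not confirmed
```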
4. In vitro ADME Data

4.1. In vitro ADME profile: assay selection and profile

The ADME portion of the BioPrint profile is made up of a panel of in vitro assays chosen for their potential to predict in vivo pharmacokinetics (Table 2). Some of the in vitro assays measure properties that contribute to the in vivo bioavailability of the new drug candidate. These include aqueous solubility, log D (octanol), and log D (cyclohexane), physico-chemical
Table 2. Classification and description of the different assays included in the ADME portion of the BioPrint profile. The number of assays in each class is indicated.

ADME BioPrint Profile

Class                               Assay type                    Number
Physico-chemical characterization   Analytical                         3
Permeability                        Analytical                         5
Metabolism                          Analytical                         1
Cytochrome P450s                    Fluorescent and analytical        11
Toxicology                          Spectrophotometric                 1
Total assays                                                          21
properties known to play a role in drug absorption. Four Caco-2 permeability assays, differing in the direction in which the compound flux is measured and in the presence or absence of a pH differential between the apical and the basolateral sides, are well-known models of intestinal drug absorption. The human liver microsome (HLM) metabolic stability assay allows one to make qualitative observations as to the likely metabolic fate of the test compound. The P-glycoprotein inhibition assay (inhibition of the B-to-A flux of the P-gp transporter substrate digoxin) and the cytochrome P450 inhibition assays provide insight into the drug-drug interaction component of drug bioavailability. Together these assays are useful both as individual results and as a collective ADME profile, which can be used to identify known drugs, with well understood in vivo PK properties, that share similar ADME profiles.

For ADME assays, the screening test concentrations and rules for IC50 determination have been fine-tuned for each assay. Screening concentrations are 1 to 200 µM, n = 2, depending on the assay. A 40% inhibition cutoff at 30 µM is applied for the IC50 determination of the cell viability assay. A 50% inhibition cutoff at 50 µM is applied for the IC50 determination of the P-glycoprotein assay, while no IC50 determination is needed for permeability, metabolic stability, log D, and solubility assays.

4.2. Compound management, testing process, and assay and data validation

With respect to compound management, the testing process, and assay and data validation, the BioPrint ADME assays are similar to the pharmacological assays. The test compounds are handled similarly with aliquots of prepared compounds set aside for the ADME screening. The only deviation
comes in the case of the fluorometric cytochrome P450 assays where the test compounds are aliquoted and dried down before being resuspended in the assay buffer. For assay and data validation, the process is the same as with the pharmacological assays. Appropriate reference compounds or sets of reference compounds are chosen and assayed each time the assay is run. The results for these compounds must fall within well-defined ranges in order for the assay to be validated. Homogeneity of replicates is also closely monitored.
5. In vivo Data

Diverse types of in vivo data are collected for the compounds from documents in the public domain. While the in vitro datasets are all generated in-house, all the in vivo data is gathered from drug labeling materials, monographs, publicly available databases, and primary literature. Data is gathered and culled by trained scientists. Some of the data types gathered include adverse drug reaction information (ADRs), pharmacokinetic data, and information on therapeutic area, route of administration, and mechanism of action. The sources of the information are documented in the database.
5.1. ADR dataset

The BioPrint ADR dataset was developed from the manufacturer’s product information included on package inserts and from prescribing information supplied by the drug company. The natural language of the drug label, non-standardized descriptions, was used to enter and collate the data in the database. These ADRs are referred to as the “specific ADR terms.” COSTART (Coding Symbols for Thesaurus of Adverse Reaction Terms, 5th edition) terminology was used to standardize the variations of the reported drug responses. The code, developed and maintained by the FDA’s Center for Drug Evaluation and Research, applies standard terminology to side effects. These standardized terms were used to reduce the variations found in drug label information and to standardize terms in the database. For example, the terms “feeling of enlarged abdomen,” “abdominal enlargement,” and “GI fullness” are variations that signify a similar clinical finding stated in three different drug labels. These terms were all described in the database by the COSTART term “ABDO ENLARGE.” In total, 8970
different ADR terms from the drug labels were reduced to 940 COSTART terms. Each COSTART term is assigned to a specific body system. These body systems are: Allergic, Cardiovascular, Digestive, Endocrine, Hematological, Hepatobiliary, Laboratory Abnormalities, Local, Musculoskeletal, Nervous, Psychiatric, Reproductive, Respiratory, Skin and Appendages, Special Senses, and Urogenital. These combinations of COSTART terms with related body systems were associated with BioPrint IC50 inhibition data, and statistically significant associations were identified using a chi-squared analysis.

Most drug labels report incidence for ADRs in categories such as “common,” “frequent,” “infrequent,” “rare,” or “not known.” These clinical incidence data were entered into the BioPrint database in the text-based groups defined below. Although some drug labels report ADRs with continuous clinical incidence data from 0% to 100%, this data was also entered according to the text-based bins:
• Common: occurring in >3% of treated subjects.
• Frequent: occurring in >1% of treated subjects.
• Infrequent: occurring in 0.1 to 1% of treated subjects.
• Rare: occurring in <0.1% of treated subjects.
• Not known: frequency not listed in drug label.
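The two standardization steps described in this section, mapping label wording to a COSTART term and binning numeric incidence, can be sketched as follows. Only the three label phrases and the "ABDO ENLARGE" term come from the text; the lookup structure and function names are hypothetical. Because the "Common" and "Frequent" definitions overlap above 3%, the sketch assumes "Common" takes precedence, and it assigns exactly 1% to "Infrequent".

```python
# Lower-cased label wording mapped to a COSTART term (example from the text).
COSTART_MAP = {
    "feeling of enlarged abdomen": "ABDO ENLARGE",
    "abdominal enlargement": "ABDO ENLARGE",
    "gi fullness": "ABDO ENLARGE",
}


def to_costart(label_term):
    """Return the COSTART term for a drug-label phrase, or None if unmapped."""
    return COSTART_MAP.get(label_term.strip().lower())


def frequency_bin(incidence_pct):
    """Map a numeric incidence (% of treated subjects) to the text-based bins."""
    if incidence_pct is None:
        return "Not known"
    if not 0 <= incidence_pct <= 100:
        raise ValueError("incidence must be a percentage between 0 and 100")
    if incidence_pct > 3:
        return "Common"      # assumed to take precedence over "Frequent"
    if incidence_pct > 1:
        return "Frequent"
    if incidence_pct >= 0.1:
        return "Infrequent"
    return "Rare"


print(to_costart("GI fullness"))  # ABDO ENLARGE
print(frequency_bin(2.0))         # Frequent
print(frequency_bin(None))        # Not known
```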
5.2. Pharmacokinetic data

Human pharmacokinetic data is gathered for as many compounds as possible and from a variety of public sources of which the most common are drug labeling, primary literature, and monographs. Diverse pharmacokinetic parameters are covered and include bioavailability, oral absorption, AUC, serum half-life, Cmax, Tmax, clearance, volume of distribution, excretion, excretion unmetabolized, and serum protein binding. These pharmacokinetic data are loaded into the database along with extensive supporting information describing each datum. Examples of such information include route of administration, units, parameter type (e.g. total clearance versus hepatic clearance, renal excretion versus fecal excretion), and value type (range versus value with a standard deviation). In most cases, the database contains multiple values for each combination of drug and PK parameter allowing the user to get a better idea of the consistency of the available values.
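One way to picture such records, and to gauge the consistency of several literature values for the same drug and parameter, is sketched below. The record fields are loosely modeled on the supporting information listed above; the class name, field names, and the placeholder values are hypothetical and do not reflect the actual BioPrint schema or any real drug.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class PKRecord:
    drug: str
    parameter: str  # e.g. "serum half-life"
    value: float
    units: str
    route: str      # route of administration
    source: str     # where the value was reported


def summarize(records):
    """Group values by (drug, parameter, units) and report their spread."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec.drug, rec.parameter, rec.units)].append(rec.value)
    return {
        key: {"n": len(vals), "min": min(vals), "median": median(vals), "max": max(vals)}
        for key, vals in grouped.items()
    }


# Placeholder records for an imaginary compound.
records = [
    PKRecord("drug_A", "serum half-life", 2.0, "h", "oral", "label"),
    PKRecord("drug_A", "serum half-life", 3.0, "h", "oral", "monograph"),
    PKRecord("drug_A", "serum half-life", 4.0, "h", "oral", "primary literature"),
]
print(summarize(records))
```

Keeping units in the grouping key avoids pooling values that were reported on different scales.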
6. Structure-In vitro Relationships
Original methods for developing predictive models for structure-in vitro relationships using BioPrint data have been a strong research focus at Cerep for many years. These are detailed in previous publications (see Ref. 7 and references therein) and can be partitioned into two categories:
• Methods aimed at predicting a single in vitro property through original QSAR modeling (Quantitative Structure-Activity Relationship).
• Methods aimed at predicting the full in vitro BioPrint profile through GNB modeling (General Neighborhood Behavior).
6.1. QSAR model building and validation
In-house QSAR models are systematically built whenever a sufficient number of active compounds exist for a given assay.20,21 Original approaches and tools have been developed for QSAR model building and validation:
• A set of in-house pharmacophoric descriptors.22
• A QSAR builder that is efficient at deriving SAR from various in vitro pharmacological activity or ADME data.
• A user-friendly interface that allows chemists to access and use the tools and models.
• A growing list of predictive QSAR models derived from BioPrint data: e.g. solubility, log D (pH = 7.4, pH = 6.5), CYP2D6 inhibition, permeability (A → B, B → A, passive), and more than 30 target-related models.
The strength of this approach lies in:
• High quality and completeness of the BioPrint data used to build the QSAR models.
• Original descriptors that capture essential pharmacophoric features of the modeled ligands.
• A synergy approach7 that combines linear/neural net and neighborhood behavior models, which are independent ways of identifying correlations between molecular description and experimental activity.
6.2. BioPrint full profile prediction through GNB modeling
GNB modeling is an interactive BioPrint tool that estimates the entire activity profile of a compound (the “target”) using the experimental profiles of its nearest neighbors.7 Upon input of the structure of the target, the
system quickly removes obviously dissimilar reference set members, based on pharmacophoric descriptor fingerprints. The remaining candidates are then automatically overlaid on the target, according to the ComPharm algorithm,23 which seeks an optimal superimposition of pharmacophorically similar functional groups. The resulting overlays are presented to the user to refine the neighbor subset by selection and/or removal of candidates. This interactive step allows the chemist to gain insight into the identity of the neighbors and their structural similarity to the target molecule. The average activity profile is then calculated from user-approved neighbors, with the variability for each property in the profile represented by an error bar.7
6.3. Validation exercise of the GNB tool for quick profile estimation with 300 randomly chosen structures from BioPrint
We performed a validation study to assess the ability of the BioPrint GNB tool to yield meaningful predicted activity profiles for test compounds as follows. Three hundred BioPrint compounds with measured experimental profiles were selected on the basis of diversity in the activity profile space. In the selection process, the experimental profiles, consisting of pIC50 values, of all BioPrint compounds were submitted to a principal component analysis followed by a diversity selection procedure in the Cerius2 modeling software (Accelrys Inc., San Diego). For each of the 300 compounds, a GNB search for the 10 nearest neighbors was performed. The profile of the target compound was then predicted as the average of the profiles of neighbors below a ComPharm dissimilarity threshold of 0.1, 0.2, 0.3 or 0.4. For example, at a dissimilarity threshold of 0.1, of the maximal pool of 10 closest molecules identified by the ComPharm overlay, only those neighbors that are dissimilar to the target compound by less than 0.1 were used to predict the target profile. If there were no neighbors below the threshold, the target was labeled as “non-predictable”; otherwise, the predicted target profile was compared with the experimental profile, and the RMS error of the activity values in the profile, the correlation coefficient R², and the maximal prediction error were reported. The results of the validation study are summarized as follows:
• When one uses a dissimilarity threshold of 0.4 for the neighbors (i.e. the least stringent threshold), the profile can be predicted for 90% of the 300 diverse target compounds; only 10% were labeled as “non-predictable”.
• For 14% of the “predictable” compounds (dissimilarity threshold of 0.4), very good predictions were obtained, with an RMS error on the pIC50 values smaller than 0.2 log units.
• Reasonable predictions (RMS error < 0.8) were obtained for more than 85% of the predictable compounds (also at the dissimilarity threshold of 0.4).
Activity profiles predicted by a robust method such as the GNB can be used, for example, to make in vitro/in vivo correlations, and thus anticipate adverse drug effects of lead candidates in drug development.
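The validation procedure of Sec. 6.3 can be sketched as follows. This is an illustration under stated assumptions rather than the GNB tool itself: the ComPharm dissimilarities are taken as given inputs, the function names are hypothetical, and R² is assumed to be the squared Pearson correlation between predicted and experimental pIC50 values.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Neighbor = Tuple[float, Sequence[float]]  # (dissimilarity to target, pIC50 profile)


def predict_profile(
    neighbors: Sequence[Neighbor], threshold: float, max_neighbors: int = 10
) -> Optional[List[float]]:
    """Average the profiles of the nearest neighbors below the threshold.

    Returns None when no neighbor qualifies ("non-predictable").
    """
    nearest = sorted(neighbors, key=lambda n: n[0])[:max_neighbors]
    kept = [profile for dissim, profile in nearest if dissim < threshold]
    if not kept:
        return None
    n_assays = len(kept[0])
    return [sum(p[i] for p in kept) / len(kept) for i in range(n_assays)]


def compare_profiles(
    predicted: Sequence[float], experimental: Sequence[float]
) -> Dict[str, float]:
    """RMS error, squared Pearson correlation, and maximal absolute error."""
    n = len(predicted)
    errors = [p - e for p, e in zip(predicted, experimental)]
    rms = math.sqrt(sum(err * err for err in errors) / n)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    r2 = cov * cov / (var_p * var_e) if var_p > 0 and var_e > 0 else float("nan")
    return {"rms": rms, "r2": r2, "max_error": max(abs(err) for err in errors)}
```

Running `predict_profile` at thresholds of 0.1, 0.2, 0.3 and 0.4 for each test compound, and counting the `None` results, would give the fraction of “non-predictable” compounds at each stringency.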
7. Statistical Associations between In vitro Assay Data and ADRs
7.1. ADR binary data
A total of 1118 drugs with ADR information were included in the analysis. Incidence data (“common,” “frequent,” “infrequent,” “rare,” “not known”) associated with each COSTART term for each compound were transformed to binary data. Each ADR for a specific drug was scored as “1” if the COSTART term was listed in its label at any clinical frequency including “not known,” and as “0” if the term was not listed. Thus, each occurrence of the ADR in the label information is counted as a positive response.
7.2. Assay-ADR associations
A data-mining engine was used to associate the binary ADR variables with each assay. ADRs listed on fewer than 0.5% of the drug labels were excluded. The associations were developed using chi-squared statistics. For this analysis three thresholds for IC50 values were chosen: 100 nM, 500 nM, and 1000 nM. These thresholds overlap: all hits at the 100 nM level are also included at the 500 nM level. For an ADR-assay association to be significant, the ADR must have a statistically significant increase in the percentage of compounds that have both the assay IC50 below threshold and the ADR listed, relative to the baseline percentage. The baseline percentage is the percent of all the compounds evaluated that list the ADR, regardless of IC50 value. For a two-class analysis, a chi-squared value greater than 3.83 is considered significant. A significant chi-squared value indicates a significant increase in the in vivo response among the compounds that “hit”
a particular in vitro assay. Statistically significant assay-ADR associations met the following criteria:
• More than 0.5% of the compounds list the ADR in the drug label.
• Significant elevation: chi-squared greater than 3.83 in the ADR listing among “hits” at two of the three threshold levels, 100 nM, 500 nM and 1000 nM.
• Risk ratio greater than 1.0 among “hits” at all hit levels. Risk ratio equals the percent of “hits” listing the ADR divided by the baseline percentage.
7.3. ADR risk by IC50 value
The associations that met the above criteria were further developed using a series of binned IC50 data to quantify the ADR risks according to IC50 value for each assay. The IC50 bins are as follows:
• <100 nM
• >= 100 nM and <500 nM
• >= 500 nM and <1000 nM
• >= 1000 nM and <5000 nM
• >= 5000 nM and <10 000 nM
• >= 10 000 nM and <50 000 nM
• >= 50 000 nM
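The significance criteria of Sec. 7.2 can be sketched as follows. The chapter does not spell out how the chi-squared statistic is computed, so this sketch assumes a Pearson chi-squared on the 2×2 table of hit/non-hit versus ADR listed/not listed, without continuity correction; the function names are hypothetical, and “two of the three threshold levels” is read as at least two.

```python
from typing import Sequence


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared for the table [[a, b], [c, d]] (1 degree of freedom).

    a: hits listing the ADR, b: hits not listing it,
    c: non-hits listing the ADR, d: non-hits not listing it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def risk_ratio(hits_with_adr: int, n_hits: int, adr_total: int, n_total: int) -> float:
    """Percent of hits listing the ADR divided by the baseline percentage."""
    return (hits_with_adr / n_hits) / (adr_total / n_total)


def is_significant(
    ic50_nM: Sequence[float],
    has_adr: Sequence[bool],
    thresholds: Sequence[float] = (100.0, 500.0, 1000.0),
    critical: float = 3.83,        # cut-off quoted in the text
    min_prevalence: float = 0.005,  # ADR must be on more than 0.5% of labels
) -> bool:
    n = len(ic50_nM)
    adr_total = sum(1 for x in has_adr if x)
    if adr_total / n <= min_prevalence:
        return False
    n_elevated = 0
    for t in thresholds:
        hits = [v < t for v in ic50_nM]
        a = sum(1 for h, x in zip(hits, has_adr) if h and x)
        b = sum(hits) - a
        c = adr_total - a
        d = n - a - b - c
        if a + b == 0:  # no hits at this level: risk ratio undefined
            return False
        if risk_ratio(a, a + b, adr_total, n) <= 1.0:
            return False
        if chi_squared_2x2(a, b, c, d) > critical:
            n_elevated += 1
    return n_elevated >= 2
```

Because the three thresholds are nested, a compound that is a hit at 100 nM contributes to the hit count at all three levels.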
For each IC50 bin, the percentage of compounds that list the ADR was calculated as percent ADR positive. For each group of seven IC50 bins, a Spearman rank correlation coefficient was calculated for the percent ADR positive relative to the bin rank. A strong Spearman correlation indicates a dose response in the ADR association.
7.4. Biological relevance
This statistical analysis was designed to compare target-by-target the in vitro activities and all ADRs listed in the drug labels. Due to the size of the BioPrint in vitro pharmacological dataset, some statistical relationships with uncertain biological relevance have emerged:
• Highly correlated targets may be statistically associated with the same ADR. For example, tachycardia is mediated by blockade of the M2 muscarinic receptor, and is also found significantly associated with M2 data in BioPrint according to our statistical analysis (Fig. 4). However, since M1
Figure 4. Significant statistical association between in vitro binding to M2 and in vivo tachycardia. Seven IC50 activity bins are found on the y-axis ranging from the 0–99 nM bin to the greater-than-50 000 nM bin, which is the default bin for the non-hits. The number of hits in each bin is indicated. The light grey and dark grey bars shown with each bin represent the percentage of compounds in that bin that are positive or negative, respectively, for the ADR in question. The baseline percentage for that ADR is indicated in the legend above. The result of a Spearman rank correlation of the various bins is also indicated.
and M3 data are highly correlated with M2 data, M1, M2, and M3 are all found to be associated with tachycardia.
• Surrogate receptors may be associated with an ADR, when the mechanistically relevant receptor is not included in the BioPrint pharmacological profile.
• Class effects of drugs may be statistically associated both with the off-target activities and the desired therapeutic activities.
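The per-bin risk calculation of Sec. 7.3 can be sketched as follows. The bin edges are those listed in the text; everything else is an assumption made for illustration: the function names are hypothetical, empty bins are skipped, and bins are ranked by potency (the <100 nM bin ranks highest) so that a risk rising with potency gives a positive Spearman coefficient.

```python
import math
from bisect import bisect_right
from typing import List, Optional, Sequence

EDGES_NM = [100, 500, 1000, 5000, 10_000, 50_000]  # lower edges of bins 1..6


def bin_index(ic50_nM: float) -> int:
    """0 for <100 nM, ..., 6 for >= 50 000 nM."""
    return bisect_right(EDGES_NM, ic50_nM)


def percent_adr_positive(
    ic50_nM: Sequence[float], has_adr: Sequence[bool]
) -> List[Optional[float]]:
    """Percent of compounds listing the ADR in each bin (None if bin is empty)."""
    counts = [0] * 7
    positives = [0] * 7
    for value, listed in zip(ic50_nM, has_adr):
        i = bin_index(value)
        counts[i] += 1
        positives[i] += 1 if listed else 0
    return [100.0 * p / c if c else None for p, c in zip(positives, counts)]


def _ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient computed as the Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else float("nan")


def dose_response(percentages: Sequence[Optional[float]]) -> float:
    """Spearman correlation of percent ADR positive against potency rank."""
    pairs = [(6 - i, p) for i, p in enumerate(percentages) if p is not None]
    return spearman([r for r, _ in pairs], [p for _, p in pairs])
```

A coefficient near 1 under this convention would correspond to the kind of trend reported for M2 binding and tachycardia in Fig. 4.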
7.5. ADR tools
The ADR tools in BioPrint assist the user in accessing and implementing the ADR data included in BioPrint. One can quickly access the ADRs listed for a selected compound, the ADRs related to a selected body system, and the ADRs statistically associated with in vitro assay data (ADR associations). The ADR associations can be searched both by assay and by ADR. When searching by assay, ADRs associated with the selected assays and IC50 values are returned along with the risk assessment. The risk information includes the number of compounds that list the ADR, and the number of compounds
Figure 5. Heat map showing some of the identified assay-ADR associations. 87 pharmacological assays are clustered on the X-axis and 48 ADRs are clustered on the Y-axis. Red spots in the heat map indicate significant associations. The criteria for significant associations are those described in paragraph 7.2.
falling within the specific IC50 bin that list the ADR. This information is presented in ratio format as well as percentages, and can be viewed in graphical format (Fig. 4). In this example, we see the ADR association between the in vitro assay data for the M2 muscarinic receptor and the ADR tachycardia. For each of the previously discussed activity bins, the ADR positive (light grey) and ADR negative (dark grey) percentages are shown. The baseline ADR percentage (27%), the number of compounds in each bin (28 for the 0–99 nM bin), and the Spearman rank correlation coefficient value (0.96) are indicated. The BioPrint collection of ADR associations can also be viewed using clustering. In Fig. 5, the heat map shows a subset of ADRs clustered on the vertical axis versus the assays on the horizontal axis. At the bottom right of this figure, the seemingly disparate ADRs of “prolactin increase” and “torsades de pointes” cluster closely together. Further research in literature shows that in fact both of these are well known ADRs of atypical antipsychotics.24,25
8. Analyzing Drug Candidates in the Context of BioPrint
One of BioPrint’s greatest strengths is that the user is able to examine a new drug in the context of the compounds in BioPrint. Placing the new
drug in the context of BioPrint can occur in many different ways but it is most effective when one starts with the BioPrint profiling of the compound of interest. Subsequent analysis of the experimental profile can occur in a variety of ways. Individual hits, subsets of the profile, or the entire profile can all be investigated. At the level of individual hits, the database can be queried to retrieve marketed BioPrint drugs that have the same activity, or the ADR associations discussed in the previous section can be queried to identify potential ADRs and their relative risks. At the profile level, compounds with similar profiles can be identified using standard statistical methods such as similarity metrics and hierarchical clustering. This similarity can be assessed using the whole panel of assays or by using selected subsets of those assays as determined by the user. Once compounds with similar profiles have been identified, in vivo data for the similar compounds can be accessed and examined for information that may permit the user to anticipate in vivo effects. Using pharmacological profiles one may be able to predict therapeutic area or ADRs. Using the ADME profile the user may be able to predict the intestinal permeability, metabolic stability, and drug/drug interactions.
8.1. Identification of compounds with similar in vitro pharmacological profiles using clustering
Diverse methods can be used to identify compounds with similar pharmacological profiles, and application of these methods is subjective. There are many ways to work with the profile data and the results can differ greatly depending on which methods are used. We will summarize some methods that in our experience have yielded the most satisfying results from a biological point of view. Two datasets are available for this analysis, primary screening data (% inhibition at the concentration of 10 µM) or IC50 data. Both datasets have their limitations.
Primary screening data has greater variability and is truncated at 100% inhibition. The IC50 dataset has less intrinsic variability but has a very wide range (many logs) and there are many non-hits. We have found it preferable to work with the IC50 data set, by converting the IC50s to pIC50s and assigning default values to the non-hits. Using the converted pIC50s, hierarchical clustering is performed using Pearson correlation and the complete linkage method. Compounds are clustered on the vertical axis and assays are clustered on the horizontal axis.
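The conversion and clustering steps just described can be sketched as follows. The default pIC50 values for non-hits (3.5 below 20% inhibition, 4.0 otherwise) are those given in the caption of Fig. 6; the rest is an illustrative assumption: the function names are hypothetical, the distance is taken as 1 minus the Pearson correlation, and the naive agglomerative loop stands in for whatever clustering software was actually used.

```python
import math
from typing import Dict, List, Optional, Sequence


def pic50(ic50_nM: Optional[float], pct_inhibition: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); non-hits (no IC50) get a default value."""
    if ic50_nM is None:
        return 3.5 if pct_inhibition < 20.0 else 4.0
    return 9.0 - math.log10(ic50_nM)


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """1 - r; a flat profile is treated as uncorrelated (distance 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 1.0
    return 1.0 - cov / math.sqrt(vx * vy)


def complete_linkage(
    profiles: Dict[str, Sequence[float]], n_clusters: int
) -> List[List[str]]:
    """Merge the two clusters whose farthest members are closest, repeatedly."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(
                    pearson_distance(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Applying the same routine to the transposed data matrix would cluster the assays rather than the compounds, which gives the second axis of a heat map.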
The combination of the dataset and the clustering tools proves to be very efficient, allowing clustering of compounds with multiple activities, therapeutic and off-target. Figure 6 shows a heat map of a small cluster of related antihistaminic drugs. Figure 7 shows a group of antifungal drugs or metabolites of antifungal drugs that clusters together despite the fact that
Figure 6. Heat map of a small cluster of related antihistaminic drugs. 149 pharmacological assays are clustered on the X-axis and 10 compounds are clustered on the Y-axis. Clustering is performed using Pearson correlation and complete linkage using a pIC50 data set. For non-hits with primary screening % inhibition values of less than 20%, a default pIC50 value of 3.5 was used. Non-hits at or above 20% inhibition received a default pIC50 value of 4.0. The pIC50 values range from the default value of 3.5 (blue-green) to 9 (red). Black spaces indicate data that is missing due to compound interference with the detection method.
Figure 7. Heat map of a group of compounds containing antifungal drugs or metabolites of antifungal drugs. 140 pharmacological assays are clustered on the X-axis and 28 compounds are clustered on the Y-axis, using the same methodology and the same color-coding as in Fig. 6.
their therapeutic target is not represented in the BioPrint panel of assays. Interspersed among the antifungal compounds are several drugs whose primary therapeutic uses are not as antifungals. Some of these drugs include the antidepressant paroxetine; the anti-estrogens tamoxifen, toremifene, and clomiphene; and the anti-arrhythmic amiodarone. The mechanism of the imidazole antifungals is the inhibition of the biosynthesis of ergosterol, the main sterol in membranes of fungi. Recent papers have reported that paroxetine,26 tamoxifen,27 and amiodarone28 also have antifungal activity or inhibit ergosterol biosynthesis. Thus, clustering a new compound of interest with the BioPrint compounds will allow identification of BioPrint compounds with similar pharmacological profiles. ADR profiles are predicted based on the ADRs associated with the BioPrint compounds that are near neighbors based on the pharmacological profile. For example, by analyzing the ADR profiles of near neighbors and the ADR incidence data, plus examining predictions based on the in vitro–in vivo associations (see Sec. 7), the majority of important ADRs of benperidol, such as hypotension, tachycardia, and somnolence, are predicted.7
8.2. Identification of compounds with similar ADME profiles using clustering
Neighbor-based ADME interpretation can also be performed using the in vitro ADME subset of the BioPrint profile. Just as the pharmacological profile of a new compound is used to look for compounds with similar pharmacological profiles, the ADME profile can be used to identify compounds with similar ADME profiles. Eight in vitro ADME assays are used to search for similar profiles. These assays are: Solubility, log D (octanol), log D (cyclohexane), Permeability pH 6.5/pH 7.4 A to B, Permeability pH 6.5/pH 7.4 B to A, Permeability pH 7.4/pH 7.4 A to B, Permeability pH 7.4/pH 7.4 B to A, and HLM Metabolic Stability (% remaining).
Because the data differ greatly in units, the data are normalized to a similar range (0–2) by taking the log of the value (except for log D values) and offsetting the data to remove any negatives. Here too, Pearson clustering with complete linkage can be applied to identify compounds with similar ADME profiles. Having identified BioPrint drugs with similar ADME profiles, the BioPrint pharmacokinetics database (which contains literature pharmacokinetic data on over 1000 drugs) is queried and predictions for the test compound are made based on the pharmacokinetic profile of the ADME nearest neighbors.
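The normalization described above can be sketched as follows. The text does not give the exact recipe, so this is one plausible reading: a base-10 logarithm is applied to every assay except the log D assays, and each assay is then shifted upward only if it contains negative values. The function name and the `floor` parameter (used to avoid taking the logarithm of zero) are hypothetical.

```python
import math
from typing import List, Sequence


def normalize_assay(
    values: Sequence[float], is_log_d: bool = False, floor: float = 0.01
) -> List[float]:
    """Log-transform one assay column (unless already log D) and remove negatives.

    `floor` is an assumed lower bound for raw values, so that a measured
    zero (e.g. 0% remaining) does not break the logarithm.
    """
    if is_log_d:
        out = list(values)
    else:
        out = [math.log10(max(v, floor)) for v in values]
    lowest = min(out)
    if lowest < 0:
        out = [v - lowest for v in out]  # offset so the minimum becomes 0
    return out
```

With raw values spanning about two orders of magnitude, such as a percentage from 1 to 100, the transformed values fall in the 0–2 range quoted in the text.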
Figure 8. Heat map of a cluster of BioPrint compounds with similar in vitro ADME profiles. Eight ADME assays are clustered on the X-axis and 8 compounds are clustered on the Y-axis. Normalized data range from 0 (blue-green) to 2 (red). Clustering is performed using Pearson correlation and complete linkage using normalized dataset. Where available in literature, human in vivo oral absorption (oa) and oral bioavailability (ba) values (%) are presented after the compound name.
Figure 8 shows a cluster of BioPrint compounds with similar ADME profiles. All the compounds in this cluster share the pharmacokinetic properties of having near total oral absorption and oral bioavailability. Thus, we have identified an in vitro ADME profile that appears to be highly predictive of good absorption and bioavailability properties. Figure 9 shows a cluster of BioPrint compounds that share the pharmacokinetic properties of moderate to good oral absorption and poor oral bioavailability. A new compound that clusters with these compounds might reasonably be predicted to share these properties.
9. Future
In this chapter we have described the construction of a high-quality, homogeneous and informative in vitro profiling database and its applications both to drug discovery and development. The applications of the database to the exploration of in vitro–in vivo relationships (referred to as bioinformatics applications in Fig. 1) have been the focus of the last sections of this chapter. Applications of BioPrint include predicting biological and pharmaceutical properties of existing
Figure 9. Heat map of another cluster of BioPrint compounds with similar in vitro ADME profiles. Eight ADME assays are clustered on the X-axis and 9 compounds are clustered on the Y-axis, using the same methodology and the same color-coding as in Fig. 8.
compounds to support:
• Late lead optimization (selection of analogs).
• Selection of drug-candidates for clinical development.
• Re-orientation of drugs in clinics.
• Design of clinical trials.
Significant statistical relationships between patterns of in vitro activity and in vivo endpoints have been identified and are available in the database to aid in the interpretation of profiling data from new compounds. Some fundamental uses of BioPrint are:
• Arrays of biological data can form the basis for uniquely informative molecular descriptors. By defining the relationships between compounds using biological descriptors (in vitro profiles) in addition to chemical structures, medicinal chemists are given new perspectives to support lead optimization.
• By defining protein targets (assays) using the chemical structures that interact with them, in addition to phylogenetic or functional information, biologists are given new perspectives to support the understanding of compound specificity and off-target effects.
• The statistical relationships between patterns of in vitro activities and clinical endpoints lead to a new understanding of both the therapeutic benefit and adverse effect predictions.
One of the most powerful uses of BioPrint is to provide a basis for understanding and learning from past clinical attrition. This is accomplished by systematically profiling historical compounds in the BioPrint assays and using the BioPrint data set as context in which to identify patterns of in vitro activity associated with chemical series, with therapeutic target activity, and with clinical adverse effects. BioPrint allows data from proprietary compounds to be put in context with analogous data on nearly every compound that has succeeded (or not) in getting market approval. Since BioPrint is primarily a contextual tool, and is most useful when several neighbors of a drug candidate are identified, expanding the dataset itself will logically improve the quality of the tool. Expanding the database can be accomplished by increasing either the number of compounds, the number of biological data per compound, or both. The current development plan for BioPrint calls for adding information-rich compounds, i.e. compounds for which human clinical data are available. In 2005, 50 new compounds were added to the database, including 41 recently marketed drugs. Additional information-rich compounds are compounds with available animal data. Expansion of the database through the increase in biological data per compound can be classified into five approaches:
• Increase the number of targets in the BioPrint pharmacological profile.
• Add functional data corresponding to the binding data.
• Increase the number of in vitro ADME assays.
• Add animal data.
• Add and update human clinical data.
Another direction of development of the data set is to strengthen the in vitro–in vivo correlations and develop multivariate models to predict in vivo endpoints, such as therapeutic effects and adverse events. In this respect, it will be interesting to examine which data (among in silico descriptors, in vitro primary and secondary data, in vitro functional data, etc.) are most appropriate to derive robust and predictive models. Finally, developing rules to allow us to recognize profiles that are associated with specific therapeutic or adverse effects will be useful in supporting
lead optimization, as well as providing a better understanding of the success or failure in clinical development.
Acknowledgments
We thank Thierry Jean, Mark Crawford and Frédéric Revah for helpful comments in the preparation of this chapter. We acknowledge the efforts of all individuals at Cerep involved in the BioPrint® project and in data acquisition. We also thank Emmanuelle Guillaume for table preparation.
References
1. Jean T, Chapelain B. (1999) Method of identification of leads or active compounds, CEREP, 1999, International publication number WO-09915894.
2. Caron PR, Mullican MD, Mashal RD, et al. (2001) Chemogenomic approaches to drug discovery. Curr. Opin. Chem. Biol. 5:464–470.
3. Jacoby E, Schuffenhauer A, Floersheim P. (2003) Chemogenomics knowledge-based strategies in drug discovery. Drug News Perspect. 16:93–102.
4. Schuffenhauer A, Jacoby E. (2004) Annotating and mining the ligand-target chemogenomics knowledge space. DDT: BIOSILICO 2:190–200.
5. Mestres J. (2004) Computational chemogenomics approaches to systematic knowledge-based drug discovery. Curr. Opin. Drug Discov. Devel. 7:304–313.
6. Lipinski C, Hopkins A. (2004) Navigating chemical space for biology and medicine. Nature 432:855–861.
7. Krejsa CM, Horvath D, Rogalski SL, et al. (2003) Predicting ADME properties and side effects: The BioPrint approach. Curr. Opin. Drug Discov. Devel. 6:470–480.
8. Venter JC, et al. (2001) The sequence of the human genome. Science 291:1304–1351.
9. Hopkins AL, Groom CR. (2002) The druggable genome. Nat. Rev. Drug Discov. 1:727–730.
10. Fliri AF, Loging WT, Thadeio PF, Volkmann RA. (2005) Biological spectra analysis: Linking biological activity profiles to molecular structure. PNAS 102:261–266.
11. Joost P, Methner A. (2002) Phylogenic analysis of 277 human G-protein-coupled receptors as a tool for the prediction of orphan receptor ligands. Genome Biol. 3:0063.1–0063.16. (http://genomebiology.com/2002/3/11/research/0063).
12. Manning G, Whyte DB, Martinez T, et al. (2002) The protein kinase complement of the human genome. Science 298:1912–1934.
13. Drews J. (2000) Drug discovery: A historical perspective. Science 287:1960–1964.
14. Bleicher KH, Bohm HJ, Muller K, Alanine AI. (2003) Hit and lead generation: beyond high-throughput screening. Nat. Rev. Drug Discov. 2:369–378.
15. Cohen P. (2002) Protein kinases — the major drug targets of the twenty-first century? Nat. Rev. Drug Discov. 1:309–315.
16. Hamon V, Crawford M, Jean T. Pharmacological and pharmaceutical profiling: New trends. In: O’Donnell, Smith (eds), The Process of New Drug Discovery and Development, 2nd ed, Taylor and Francis/CRC Press, 2006.
17. Cheng X, Hochlowski J, Tang H, et al. (2003) Studies on repository compound stability in DMSO under various conditions. J. Biomol. Screen. 8:294–304.
18. Kozikowski BA, Burt TM, Tirey DA, et al. (2003a) The effect of freeze/thaw cycles on the stability of compounds in DMSO. J. Biomol. Screen. 8:210–215.
19. Kozikowski BA, Burt TM, Tirey DA, et al. (2003b) The effect of room-temperature storage on the stability of compounds in DMSO. J. Biomol. Screen. 8:205–209.
20. Gozalbes R, Barbosa F, Froloff N, Horvath D. (2006) The BioPrint® approach for the evaluation of ADME-T properties: Application to the prediction of cytochrome P450 2D6 inhibition. In: Physicochemical and Computational Strategies, pp. 395–415, VHCA, Zürich, Wiley-VCH, Weinheim.
21. Gozalbes R, Rolland C, Nicolaï E, et al. (2005) QSAR strategy and experimental validation for the development of a GPCR focused library. QSAR Comb. Sci. 24:508–516.
22. Horvath D. (2001a) High throughput conformational sampling and fuzzy similarity metrics: A novel approach to similarity searching and focused combinatorial library design and its role in the drug discovery laboratory. In: AK Ghosh, VN Viswanadhan (eds), Combinatorial Library Design and Evaluation, pp. 429–472, Marcel Dekker Inc., New York.
23. Horvath D. (2001b) ComPharm — Automated comparative analysis of pharmacophoric patterns and derived QSAR approaches, novel tools in high throughput drug discovery. A proof-of-concept study applied to farnesyl protein transferase inhibitor design. In: M Diudea (ed), QSPR/QSAR Studies by Molecular Descriptors, pp. 395–439, Nova Science Publishers, New York, USA.
24. Halbreich U, Kahn LS. (2003) Hyperprolactinemia and schizophrenia: Mechanisms and clinical aspects. J. Psychiatr. Pract. 9:344–353.
25. Titier K, Girodet PO, Verdoux H, et al. (2005) Atypical antipsychotics: From potassium channels to torsade de pointes and sudden death. Drug Saf. 28:35–51.
26. Young TJ, et al. (2003) Antifungal activity of selective serotonin reuptake inhibitors attributed to non-specific cytotoxicity. J. Antimicrob. Chemother. 51:1045–1047.
27. Paul R, et al. (1998) Both the immunosuppressant SR31747 and the antiestrogen tamoxifen bind to an emopamil-insensitive site of mammalian Delta8-Delta7 sterol isomerase. J. Pharmacol. Exp. Ther. 285:1296–1302.
28. Courchesne WE. (2002) Characterization of a novel, broad-based fungicidal activity for the antiarrhythmic drug amiodarone. J. Pharmacol. Exp. Ther. 300:195–199.
Index
π-π 90, 94, 96, 97
α-helix 12, 13
ADME(T) 14, 20, 22, 23, 26
adverse drug reactions (ADRs) 177
adverse events 203
annotated chemical libraries 50, 51, 53–55
aromatic-aromatic interaction 94, 97
assay-ADR associations 177, 194, 195, 197
ATP 9
β-turn 12, 13
binary kernel discrimination (BKD) 136, 150–153
binding pocket 86, 87, 90, 91, 98
binding site 85, 86, 90, 92, 103–105
bioavailability 11, 12, 23, 24, 178, 188, 189, 191, 201
bioinformatics 12, 26
bioisostere(s) 88, 91, 94, 95, 108
bioisosteric Motif 91
biological targets 176
BRENDA 6, 9
Calphostin C 28
cation-π 89, 90, 94
CDK2 9
channel receptors 43, 45
ChEBI 6, 9
chemical Graph Identifier 47
chemical structure code 47, 48, 54
chemogenomics 39, 176
chemoinformatics 2, 6, 11, 16, 25, 27
classification schemes 40, 41, 43–46, 48, 50, 51, 53, 55
clinical effects 176
clustering 115, 117, 129
CMC 6, 23, 26
CNPD 3, 5
co-factors 6, 8
combinatorial chemistry 1, 2, 11
complexity 2, 8, 14, 15, 22, 26
compound library 59–64, 66–72, 76, 78, 80, 81
computational chemistry 176
consensus binding site 90–93, 98, 101
consensus scoring 135
cyclosporin A 3
15-desoxy-spergualin 3
data fusion 135, 144
deorphanization 120
DMPK 19, 20
DNP 3, 5, 23
docking 110, 112, 121–124, 127–129
DOS 14, 15, 28
drug candidates 177, 188, 197, 202, 203
drug discovery 176, 179, 180, 201
drug-likeness 16, 22–25
druggable genome 40, 41
druggable proteome 179, 180
electron deficient recognition 99
electron-rich aromatic ring(s) 95, 96
enrichment 159, 171
enzymes 41, 42, 44–46, 50, 51, 179, 180
epothilone 4
essential properties 7, 18, 19, 23
Family A 90–92, 98–101
Family B 92, 98–101
Family C 92, 98, 101
family-directed knowledge 40, 53
FDA 3, 19
fingerprint 134–138, 140–144, 150–153
FK-506 3
focused libraries 85, 99, 102, 103, 105
G protein-coupled receptors 43, 45
genome 2
GPCR 3, 9, 11, 88–93, 98–106, 108, 113–122, 127, 128, 179
GPCR-PA+ 99, 102
group fusion (GF) 135, 146, 150–153
H-π interaction(s) 96
H-bond 95
helical recognition elements 105
histacin 15
HTD 16
HTS 2, 5, 8, 9, 15, 18, 21, 23, 25, 28
in silico pharmacology 55
in vitro 18, 22
in vitro ADME profiling 188
in vitro–in vivo relationships 201
in vitro pharmacological profiling 180
in vivo 14, 18, 20, 22, 29
in vivo effects 175, 198
ion channel(s) 88, 89, 92, 105–107, 179
KEGG 5, 6, 9
kinase 9, 28
latent hit 22
lead-likeness 16, 22, 25
lead optimization 176, 202, 204
library design 59, 61, 62, 68, 80, 102, 103, 105
ligand-gated channels 182
Lipinski 7, 22, 23
lipophilic environment 86
lipophilic pockets 85, 86, 97, 101, 102, 104
lipophilic recognition 86
logical map(s) 91–93, 99, 101, 106
machine learning 136, 150
marketed drugs 176, 178, 203
MDDR 23, 26
MDL Drug Data Report (MDDR) database 137
medicinal chemistry 1, 6, 16, 22, 23, 29
metabolites 2, 8, 9
microenvironments 85, 91–93, 95, 96, 98, 99, 101, 103–105
minaprine 11
mizoribine 3
molecular descriptors 175, 202
molecular informatics 5, 27
molecular recognition 85, 87, 108
molecular similarity 134
motif(s) 86, 87, 92, 94–98, 106
multicomponent reaction 14
multidimensional scaling 100, 101
mycophenolic acid 3
NADH 10
Naïve Bayes 28
natural product 2–5, 7, 8, 14, 15, 19, 23, 25, 28, 59–67, 70, 72, 73, 75, 78, 80, 81
neighborhood behavior 177, 192
NMR 13
nuclear receptors 44, 45, 51, 179
oxidoreductase 10
PDB 110–113, 123, 125
peptides 2, 11, 12, 25, 28
peptidomimetics 13
pharmacofamily 9, 10
pharmacokinetics 176, 178, 188, 190, 191, 200, 201
pharmacophore map 85, 101–103
privileged structure 60, 62–64, 80, 81
protein families 41, 45, 50
protein structure 45, 157, 159–161
protein Structure Similarity Clustering 59, 60, 67, 68, 70–72, 75, 81
PSA 12, 24
QSAR 177, 178, 192
RAD-001 3
radar map(s) 104
radar plots 104, 106, 107
rapamycin 3
recognition features 85
recognition processes 85, 92
reference compounds 175, 176, 178, 186, 187, 190
reference structure 133–135, 137, 139, 140, 143, 144, 147–150, 153
Rule-of-5 7, 23, 24
SAR 2, 17, 19, 21, 85, 86, 91, 92, 98–102, 104, 106, 107
scaffold hopping 171
side effects 190
similarity searching 133, 134, 139–142, 144, 146–148, 150–153
small molecule modulators of protein function 59, 60, 70
solubility 1, 7, 20, 22, 23
SOSA 10, 11
structural informatics 158–166, 168, 169
structure-activity relationship 85, 177, 192
target hopping 163–165
target library 109, 110, 114, 118, 123, 129
thematic analysis 88, 90–92, 98, 99, 101–107
Theme(s) 86, 87, 92, 93, 98–101, 103, 106
therapeutic effects 203
TOS 14
toxicity 5, 25
transporters 179
tubacin 15
turbo similarity searching 147, 148, 150–153
uretupamine 15
urotensin II GPCR 13
vicinity analysis 85, 87, 90
virtual screening 109, 113, 114, 129, 133, 134, 136, 137, 139, 140, 142–145, 147, 150, 152, 153, 171
WOMBAT 6, 27, 28