Johann Gasteiger (Ed.)
Handbook of Chemoinformatics
Related Titles from WILEY-VCH and WILEY Interscience
Johann Gasteiger, Thomas Engel (Eds.)
Chemoinformatics A Textbook 2003
ISBN 3-527-30681-1
Jure Zupan, Johann Gasteiger
Neural Networks in Chemistry and Drug Design
Hans-Dieter Höltje, Wolfgang Sippl, Didier Rognan, Gerd Folkers
Molecular Modeling 2003
ISBN 3-527-30589-0
Peter Comba, Trevor W. Hambley
Molecular Modeling of Inorganic Compounds
An Introduction
Second, Completely Revised and Enlarged Edition
Second Edition
2001 ISBN 3-527-29915-7
ISBN 3-527-29778-2 (HC) ISBN 3-527-29779-0 (SC)
Thomas Lengauer (Ed.)
Bioinformatics From Genomes to Drugs 2001
ISBN 3-527-29988-2
Paul von Ragué Schleyer, William L. Jorgensen, Henry F. Schaefer III, Peter R. Schreiner, Walter Thiel, Robert C. Glen (Eds.)
Encyclopedia of Computational Chemistry 2002
Online Reference Work available at http://www.interscience.wiley.com
Johann Gasteiger (Ed.)
Handbook of Chemoinformatics From Data to Knowledge in 4 Volumes
WILEY-VCH
Editor
Prof. Dr. Johann Gasteiger, Computer-Chemie-Centrum and Institute of Organic Chemistry, University of Erlangen-Nürnberg, Nägelsbachstraße 25, 91052 Erlangen, Germany
This book was carefully produced. Nevertheless, editors, authors and publisher do not warrant the information contained therein to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.: applied for
A catalogue record for this book is available from the British Library.
Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet at http://dnb.ddb.de
3 SMILES - A Language for Molecules and Reactions

‘>’ (“greater-than”) characters, which delimit reactant, agent and product molecules.

reaction: reactant-molecules ‘>’ agent-molecules ‘>’ product-molecules

SMILES specifying reactions always have two ‘>’ characters. None, one, or more than one molecule may occupy each role. In practice, at least one reactant and one product molecule are usually specified. Reactions are usually written in a stoichiometric fashion, e.g., the dissociation of water:

O >> [H+].[OH-]

or

O.O >> [OH3+].[OH-]

However, stoichiometric balance is not a requirement of reaction SMILES.

3.8.1
Advanced Issues
Reaction SMILES is defined minimally, with no requirements for, e.g., mass balance or the mapping of reactant atoms to product atoms. The reason is that we often do not know such information for a given observed reaction. Because one of the goals of SMILES is to represent exactly what we observe, we also have to be able to leave unspecified what we do not know. This is explained in the comprehensive SMILES tutorial on the web [1].

3.8.2
Examples
Examples of reaction SMILES are given in Table 3-7.
Tab. 3-7 Examples of reaction specification in SMILES

SMILES | Description
O >> [H+].[OH-] | Dissociation of water to hydroxide and free proton.
O.O >> [OH3+].[OH-] | Dissociation of water to hydroxide and hydronium ion (two steps).
C=CC=C.C=CC=O >> C1C=CCCC1C=O | Diels-Alder reaction.
CC=NC.[N+]#[C-].C(=O)O >> NC(=O)C(C)N(C=O)C | Ugi reaction.
CCO.CC(=O)O > [H+] > CC(=O)OCC.O | Acid-catalyzed esterification of ethanol and acetic acid to ethyl acetate. The [H+] catalyst is specified as an agent.
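The three-role layout of a reaction SMILES can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function name `split_reaction` is hypothetical, whitespace is treated as insignificant, and every ‘.’ is taken to separate whole molecules (ring closures written across a ‘.’ are not handled).

```python
def split_reaction(reaction: str):
    """Split 'reactants > agents > products' into three lists of molecule SMILES.

    Assumption: each '.' separates complete molecules, which is not true
    when a ring closure is written across a dot.
    """
    compact = "".join(reaction.split())   # drop spaces used only for readability
    roles = compact.split(">")
    if len(roles) != 3:                    # exactly two '>' characters are required
        raise ValueError("a reaction SMILES must contain exactly two '>' characters")
    # An empty role (e.g. no agents in 'O>>[H+].[OH-]') becomes an empty list.
    return tuple([mol for mol in role.split(".") if mol] for role in roles)


reactants, agents, products = split_reaction("CCO.CC(=O)O > [H+] > CC(=O)OCC.O")
print(reactants, agents, products)
```

Nothing in the function checks mass balance, in keeping with the minimal definition of reaction SMILES.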
3.9 Isomerism Beyond the Valence Model
SMILES is a chemical nomenclature which represents valence models of molecules and reactions. The SMILES language described in Sections 3.1 to 3.7 provides a self-consistent method of describing virtually all chemical valence models. There are only six rules:

* Atoms are represented by atomic symbols.
* Double bonds are represented by ‘=’, triple bonds by ‘#’.
* Branches are indicated by matching parentheses.
* Ring closures are indicated by matching pairs of digits.
* Disconnections are indicated by a ‘.’ (period).
* Reactions are written as reactants > agents > products.

Because the pure valence model is incredibly useful in chemistry, its simple SMILES representation is very useful in chemical information. The valence model is self-consistent and comprehensive within its domain. It is only a model at a fairly coarse level of abstraction, however. The valence model does not enable complete representation of molecular entities. For instance, molecules exist in three dimensions, but the valence model does not represent this three-dimensional nature. The following sections describe how SMILES represents certain molecular attributes which are outside the formal valence model, and the basis for some important applications of SMILES in chemical information.

3.9.1
Isomerism
SMILES provides for four types of specification which are so important to the molecular model that they are included even though they are outside the valence model. They are: isotopism, orientation about double bonds, stereo specification, and (for reactions) reactant-product atom mapping. SMILES carrying these specifications are collectively known as “isomeric SMILES”.

3.9.1.1
Isotopic Specification
Isotopic specifications in SMILES are indicated by prefixing the atomic symbol with a number equal to the desired integral atomic mass. An atomic mass can only be specified inside brackets (Table 3-8).

Tab. 3-8 Examples of isotopic specification in SMILES

SMILES | Name | Remark
[C] | Elemental carbon | Carbon’s mass is not specified.
C | Methane | The masses of carbon and hydrogens are not specified.
[12C] | Elemental carbon-12 | There is nothing special about carbon-12 just because it is the most common isotope.
[13C] | Elemental carbon-13 | Note that any mass can be specified, not just reasonable ones.
[13CH4] | C-13 methane | Connected hydrogens must be specified inside brackets.
[2H]O[2H] | Deuterium oxide (heavy water) | Hydrogen isotopes are specified as separate atoms.

3.9.1.2
Orientation About Double Bonds
Configuration around double bonds is specified in SMILES by the characters ‘/’ and ‘\’ (“forward slash” and “back slash”), which are “directional bonds” and can be thought of as kinds of single bond. These symbols indicate relative directionality between the connected atoms and have meaning only when they occur on both atoms which are double bonded (Table 3-9). An important difference between SMILES stereo conventions and others such as CIP is that SMILES represents local chirality (as opposed to absolute stereo orientation), which allows partial stereo specification.

Tab. 3-9 Examples of double-bond orientation in SMILES

SMILES | Name | Remark
F/C=C/F or F\C=C\F | trans-Difluoroethene | The F’s are on opposite sides of the double bond.
F/C=C\F or F\C=C/F | cis-Difluoroethene | The F’s are on the same side of the double bond.
F/C=C/C=C/C | trans,trans-1-Fluoropenta-1,3-diene | Both double bond orientations are specified.
F/C=C/C=CC | trans,unspec-1-Fluoropenta-1,3-diene | Double bond orientation is specified for one double bond but not the other.
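The pattern in Table 3-9 (equal directional symbols on either side of the double bond mean trans, unequal symbols mean cis) can be checked with a short Python sketch. It is deliberately limited and not a SMILES parser: it assumes that the double bond is written literally as `C=C` with the directional bonds immediately before and after it, and the function name `double_bond_geometry` is hypothetical.

```python
import re

# Lookahead so that a directional bond shared by two neighbouring double
# bonds (as in F/C=C/C=C/C) can take part in both matches.
_DIRECTIONAL = re.compile(r"(?=([/\\])C=C([/\\]))")


def double_bond_geometry(smiles: str):
    """Return 'trans' or 'cis' for each fully specified C=C, in string order.

    Assumption: only the literal pattern <dir>C=C<dir> is recognised;
    branches, ring closures and bracket atoms at the double bond are not.
    """
    return ["trans" if left == right else "cis"
            for left, right in _DIRECTIONAL.findall(smiles)]


print(double_bond_geometry("F/C=C/F"))       # one specified double bond
print(double_bond_geometry(r"F/C=C\F"))
print(double_bond_geometry("F/C=C/C=CC"))    # second double bond is unspecified
```

A double bond without directional bonds on both sides simply produces no entry, which mirrors the partial stereo specification allowed by SMILES.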
Fig. 3-4 SMILES: N[C@](C)(F)C(=O)O
3.9.1.3
Tetrahedral Stereocenters
SMILES uses a very general type of stereo specification based on local stereo orientation and symmetry point groups. Instead of using a rule-based numbering scheme to order atoms neighboring a stereo center, orientations are based on the order in which atoms occur in the SMILES string. As with all other aspects of SMILES, any valid order is acceptable.

The simplest and most common kind of stereo orientation is tetrahedral: four “neighbor” atoms are evenly arranged about a central atom, known as the “stereocenter”. If all four neighbors differ from each other in any way, mirror images of the structure will not be identical. The two mirror images are known as “enantiomers” and are the only two forms that a single tetrahedral center can produce. If two (or more) of the neighbors are identical, the central atom will not be chiral (the mirror images can be superimposed in space).

In SMILES, tetrahedral centers may be indicated by a simplified stereo specification (@ or @@) written as an atomic property following the atomic symbol of the chiral atom. If a chiral specification is not present for a chiral atom, the chirality of that atom is implicitly not specified. Looking at the chiral center from the direction of the “from” atom (as per atom order in SMILES), @ means “the other three atoms are listed anticlockwise”; @@ means clockwise. If all atoms are explicitly specified in SMILES, e.g., N[C@](C)(F)C(=O)O, the order of the atoms should be clear, i.e., N is the “from” atom, and the other atoms are anticlockwise in SMILES order (methyl, fluoro, carboxy) (Figure 3-4 and Table 3-10).

If the chiral atom is the very first atom in the SMILES, e.g. [C@](F)(N)(C)CC, the first-appearing neighbor is taken to be the “from” atom. If the chiral atom has a non-explicit hydrogen (it can have at most one and still be chiral), it will be listed inside the chiral atom’s brackets, e.g. F[C@H](N)C. The order of the non-explicit hydrogen is exactly as written in SMILES, i.e. in this case the first of the three following atoms (H, N, C). Similarly, if a chiral atom has a ring closure, e.g., N1CCCO[C@H]1CC, the O is the “from” atom, and the three following atoms are in the order in which they are connected to the chiral center as written in SMILES, i.e. H (immediately following the symbol), then N (the ring closure is next), then the ethyl carbon. To reiterate: the implied chiral order is always exactly as written in SMILES.

Tab. 3-10 Tetrahedral stereo specification examples

SMILES | Name | Remark
N[C@@H](C)C(=O)O | L-Alanine | From N: (H, methyl, carboxy) appear clockwise.
N[C@H](C)C(=O)O | D-Alanine | From N: (H, methyl, carboxy) appear anticlockwise.
O[C@H]1CCCC[C@H]1O | cis-1,2-Cyclohexanediol | On the first stereo carbon, from the hydroxy O: (H, CO, C) are anticlockwise.
C1C[C@H]2CCCC[C@H]2CC1 | cis-Decalin | “cis/trans” ring fusions are specified as tetrahedral stereo centers.

3.9.1.4
General Stereo Specification
There are many kinds of stereo orientations other than tetrahedral. The use of the @ symbol, as described above, is actually a special case of the general SMILES stereo specification syntax, in which the ‘@’ is followed by a chiral class and a chiral order; i.e., the chiral value is composed of the chiral class and the chiral order. The chiral class is a two-letter code named after the base point group, e.g., tetrahedral (TH), allenyl (AL), square-planar (SP), trigonal-bipyramidal (TB), octahedral (OH), etc. The chiral order is a positive integer indicating which chiral configuration of the given class is present (including reduced and degenerate chiralities where the number of distinct enantiomers is reduced by symmetry).

To simplify input of common chiralities, one chiral class is designated the default chiral class for a given degree (connectivity). For instance, the default chiral class for degree 4 is TH. Notations in the form “@@” are interpreted as @2 (analogous to “++” meaning +2). Whenever possible, the chiral order 1 (“@1” or just “@”) corresponds to “anticlockwise about the axis represented by SMILES order”. (“@” is supposed to be a visual mnemonic, in that the symbol looks like an anticlockwise spiral about a central spot.) Table 3-11 provides examples of a few common non-tetrahedral chiralities. More detailed information is available elsewhere.
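How the shorthand forms expand into a chiral class and a chiral order can be sketched as follows. The function name `parse_chiral` and the table name `DEFAULT_CLASS` are hypothetical; the default class for degree 4 is taken from the text above, and those for degrees 5 and 6 from the remarks in Table 3-11. No check is made that the order lies in the range allowed for its class.

```python
import re

# Default chiral class per degree (connectivity), as stated in the text
# (degree 4) and in the remarks of Table 3-11 (degrees 5 and 6).
DEFAULT_CLASS = {4: "TH", 5: "TB", 6: "OH"}

# '@', '@@', '@<order>' or '@<CLASS><order>', e.g. '@', '@@', '@2', '@SP1'.
_CHIRAL = re.compile(r"@(?:(@)|([A-Z]{2})?([1-9][0-9]*))?")


def parse_chiral(spec: str, degree: int):
    """Expand a chiral specification into (chiral class, chiral order)."""
    match = _CHIRAL.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a chiral specification: {spec!r}")
    double_at, chiral_class, order = match.groups()
    if double_at:                      # '@@' is interpreted as order 2
        chiral_order = 2
    elif order:                        # explicit order, e.g. '@SP3' or '@1'
        chiral_order = int(order)
    else:                              # bare '@' is order 1
        chiral_order = 1
    chiral_class = chiral_class or DEFAULT_CLASS.get(degree)
    if chiral_class is None:
        raise ValueError(f"no default chiral class for degree {degree}")
    return chiral_class, chiral_order


print(parse_chiral("@@", 4))    # shorthand on a four-connected atom
print(parse_chiral("@SP1", 4))  # explicit class overrides the default
```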
Tab. 3-11 Examples of a few common non-tetrahedral chiralities

Chiral class | Chiral orders | SMILES | Remark
Octahedral | @OH1 to @OH30 | O=C[Co@](F)(Cl)(Br)(I)S | OH is the default class for degree 6.
Trigonal-bipyramidal | @TB1 to @TB20 | O=C[As@](F)(Cl)(Br)S | TB is the default class for degree 5.
Square-planar | @SP1 to @SP3 | F[Po@SP1](Cl)(Br)I | Not a default class. @SP1 makes a “U”. @SP2 makes a “4”. @SP3 makes a “Z”.
Allenyl | | OC(Cl)=[C@AL1]=C(C)F |
Fig. 3-5 SMILES: CCO.CC(=O)O > [H+] > CC(=O)OCC.O

Fig. 3-6 SMILES: CC[OH:1].CC(=[O:2])[O:3] > [H+] > CC(=[O:2])[O:1]CC.[O:3]

3.9.1.5
Reactant-Product Atom Mapping
One normally performs a reaction by mixing reactants together, allowing the reaction to occur, then looking for the product. From such experiments, one does not normally find out which reactant atoms become which atoms in the products (“atom mapping”). The minimal “non-isomeric” SMILES described above is the correct way to describe such observed reactions. The [H+] acid-catalyzed esterification of CCO ethanol and CC(=O)O acetic acid to CC(=O)OCC ethyl acetate and O water is shown in Figure 3-5. We start with ethanol and acid (e.g. fresh wine tastes acidic) and observe that ethyl acetate is formed (e.g. aged wine tastes fragrant). We cannot observe which of the three kinds of reactant oxygens become the product acetate oxygens and which becomes the product water. If we performed separate experiments using O-18 labeled ethanol and acetic acid, we could observe that the alcoholic oxygen becomes the ester oxygen and the acidic oxygen becomes water. This is written in SMILES as shown in Figure 3-6.

Within a bracketed atom specification, the “atom map class” is represented by a colon followed by an integral number. Mapped reactant and product atoms must have the same value (the numerical value of the map class is not significant). None, one, some, or all atoms in a reaction may be mapped - SMILES itself makes no requirement for completeness. As with other aspects of SMILES, the important thing is to write down what you know, and no more.

More than one reactant and product atom may have the same class. For instance, imagine we had done just the esterification experiment using O-18 labeled ethanol and found that no O-18 labeled water was produced. Strictly speaking, we would not know which of the two carboxy oxygens became the water (or both, e.g., if the carboxy is tautomeric). We could express exactly what we know in the SMILES (Figure 3-7). Note that there are four atoms in map class 2 (two on each side of the reaction).
Fig. 3-7 SMILES: CC[OH:1].CC(=[O:2])[O:2] > [H+] > CC(=[O:2])[O:1]CC.[O:2]
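The statement that Figure 3-7 contains four atoms in map class 2 can be verified mechanically. The following Python sketch counts atom map classes on each side of a reaction SMILES; the function names are hypothetical, and the regular expression assumes that a map class is always written as `:<number>` directly before the closing bracket of an atom.

```python
import re
from collections import Counter

# A map class is ':<digits>' immediately before the ']' of a bracket atom.
_MAP_CLASS = re.compile(r"\[[^\]]*?:(\d+)\]")


def map_classes(side: str) -> Counter:
    """Count how many atoms carry each atom map class in one reaction role."""
    return Counter(int(number) for number in _MAP_CLASS.findall(side))


def mapped_atoms(reaction: str):
    """Return (reactant map classes, product map classes) for a reaction SMILES."""
    reactants, _agents, products = reaction.replace(" ", "").split(">")
    return map_classes(reactants), map_classes(products)


fig_3_7 = "CC[OH:1].CC(=[O:2])[O:2] > [H+] > CC(=[O:2])[O:1]CC.[O:2]"
reactant_classes, product_classes = mapped_atoms(fig_3_7)
print(reactant_classes[2] + product_classes[2])   # atoms in map class 2
```

Unmapped atoms and the agent are ignored, consistent with the rule that SMILES makes no requirement for completeness of the mapping.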
It is not uncommon to find reactions with the same reactants and products but with different atom maps, i.e., more than one mechanism exists for a generic reaction. It is also common that the same generic reaction is catalyzed by different agents, e.g. different catalytic enzymes for the same biochemical step. The idea of “multiple kinds of the same reaction” is closely analogous to the idea of “isomers of a molecule”, and this is how it is implemented in SMILES. Reactant-product atom map information is part of the “isomeric SMILES” for reactions.

3.9.2
Conventions
A few SMILES conventions are useful.

3.9.2.1
Hydrogen Specification
Hydrogen atoms do not normally need to be specified when writing SMILES for most organic structures. The presence of hydrogens may be specified in three ways:

* implicitly - for atoms specified without brackets, from normal valence assumptions;
* explicitly by count - inside brackets, by the hydrogen count supplied; zero if unspecified; or
* as explicit atoms - i.e., as explicit [H] atoms.

There is no distinction between “organic” and “inorganic” SMILES nomenclature. One may specify the number of attached hydrogens for any atom in any SMILES. For example, ethane can be written as CC, [CH3][CH3], or [H]C([H])([H])C([H])([H])[H].

There are four situations where explicit hydrogen specification is required:

* charged hydrogen, i.e. [H+] proton;
* hydrogens connected to other hydrogens, e.g. [H][H], molecular hydrogen;
* hydrogens connected to other than one other atom, e.g. bridging hydrogens; and
* isotopic hydrogen specifications, e.g. [2H]O[2H] heavy water.
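The implicit case, in which hydrogens follow from “normal valence assumptions”, can be sketched in Python as below. The valence table is an assumption: it is not given in this section and is taken here from the commonly documented SMILES organic subset (B 3; C 4; N 3, 5; O 2; P 3, 5; S 2, 4, 6; halogens 1). The names `NORMAL_VALENCES` and `implicit_hydrogens` are hypothetical.

```python
# Assumed "normal" valences of the organic-subset atoms (not stated in this
# section); an unbracketed atom is filled with hydrogens up to the lowest
# normal valence that accommodates its explicitly written bonds.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol: str, bond_order_sum: int) -> int:
    """Number of implicit hydrogens on an unbracketed atom.

    bond_order_sum is the sum of the orders of the bonds written explicitly
    in the SMILES (single = 1, double = 2, triple = 3).
    """
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0   # bonds already exceed every normal valence


# Each carbon of ethane, written CC, has one explicit single bond.
print(implicit_hydrogens("C", 1))
```

Atoms written inside brackets do not use this rule; their hydrogen count is exactly the count supplied, or zero.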
3.9.2.2
Aromaticity
Some confusion in SMILES arises from the SMILES definition of aromaticity. This is a shame, because in virtually all cases, one can simply (and safely) ignore aromaticity.

When should I specify a structure as aromatic?
You never need to do so. If you find yourself typing in SMILES, it is a bit easier to type c1ccccc1 for benzene than C1=CC=CC=C1 cyclohexatriene, but it is just a matter of convenience, because they mean exactly the same thing.

What does “aromatic” mean, anyway?
“Aromatic” means “it smells nice”. No kidding, that is the only defensible definition. There is no single rigorous definition of aromaticity in chemistry. To a synthetic chemist, aromaticity implies something about reactivity; to a thermodynamicist, about heat of formation; to a spectroscopist, about NMR ring current; to a molecular modeler, about geometrical planarity; to a cosmetic chemist, it probably means “smells nice”. The SMILES definition of aromaticity has nothing to do with the other definitions, except that we would all agree that benzene is “aromatic”.

Why does SMILES provide an “aromatic” concept at all?
The SMILES language was specifically designed to be “canonicalizable”, i.e. not only to provide an unambiguous chemical nomenclature but also to be able to express a single, unique SMILES for every structure in the same language. This implies a fundamental requirement to express the symmetry of a molecule correctly. Consider the problem of generating a unique SMILES for Oc1ccccc1F ortho-fluorophenol, but without aromatic bonds. There are two ways to write it, OC1=CC=CC=C1F (with the substituted carbons joined by a single bond) and OC1=C(F)C=CC=C1 (with the substituted carbons joined by a double bond). These are two different molecular graphs: the SMILES for these will always differ. For purposes of unique nomenclature, it is not acceptable to have two different “unique SMILES” for the same molecule. The SMILES language provides an “aromatic” concept to avoid this conundrum.
How does SMILES determine “aromaticity”?
Unfortunately it is not as trivial as “alternating single and double bonds”, but it is not rocket science, either. The SMILES algorithm uses an extended version of Hückel’s rule to identify aromatic molecules and ions. To qualify as aromatic, all atoms in a ring must be sp2 hybridized and the number of available “shared” π electrons must satisfy Hückel’s 4N + 2 criterion. For example, an sp2 carbon shares one π electron, so benzene (or cyclohexatriene) is aromatic (6 = 4(1) + 2). Conversely, C1=CC=C1 cyclobutadiene and C1=CC=CC=CC=C1 cyclooctatetraene are (correctly) not aromatic, with 4 and 8 shared electrons, respectively. Note that these are anti-aromatic compounds, i.e. FC1=CC=CC=CC=C1O and FC1=C(O)C=CC=CC=C1 are not the same structure.

The rules become slightly more complicated for heterocycles. Oxygen and sulfur can share a pair of π electrons. Nitrogen can also share a pair, if three-connected as in [nH]1cccc1 pyrrole; otherwise sp2 nitrogen shares just one electron (as in n1ccccc1 pyridine). An exocyclic double bond to an electronegative atom “consumes” one shared π electron, as in O=c1[nH]cccc1 2-pyridone or O=c1oc2ccccc2cc1 coumarin. But that is about it. Add up the electrons in rings (and ring systems, such as azulene); if they meet the 4N + 2 criterion, it is “aromatic” (Table 3-12).

Tab. 3-12 Examples of aromatic compounds and their SMILES

SMILES | Name | Remark
C1=CC=CC=C1 same as c1ccccc1 | Cyclohexatriene same as benzene | 6 = 4N + 2 shared π electrons.
FC1=CC=CC=C1O, FC1=C(O)C=CC=C1, Fc1ccccc1O | ortho-Fluorophenol | All three SMILES represent the same molecule.
n1ccccc1 same as N1=CC=CC=C1 | Pyridine | Normal aromatic nitrogen (n) is pyridyl-N.
[nH]1cccc1 same as N1C=CC=C1 | 1H-Pyrrole | Pyrrolyl-N is written [nH] and shares two π electrons.
O=n1ccccc1 same as O=N1=CC=CC=C1 | Pyridine-N-oxide, neutral representation | Exocyclic =O consumes one π electron from a N that would otherwise share 2 π electrons.
[O-][n+]1ccccc1 same as [O-][N+]1=CC=CC=C1 | Pyridine-N-oxide, charge-separated representation | One electron is missing (+) from a N that would otherwise share 2 π electrons.
o1cccc1 same as O1C=CC=C1 | Furan | Oxygen shares a pair of π electrons, so furan is aromatic.
s1cccc1 same as S1C=CC=C1 | Thiophene | Sulfur shares a pair of π electrons, so thiophene is aromatic.
[cH-]1cccc1 same as [CH-]1C=CC=C1 | Cyclopentadienyl anion | The - charge is an extra electron, making 6.
c1cc2cccccc2c1 same as C1=CC2=CC=CC=CC2=C1 | Azulene | 3 + 2 + 5 = 10 = 4N + 2, so azulene is aromatic.

3.9.2.3
Tautomers
Tautomeric structures are explicitly specified in SMILES. There are no “tautomeric bond”, “mobile hydrogen”, or “mobile charge” specifications. Selection of one or all tautomeric structures is left to the user and strongly depends on the application.
Tab. 3-13 Examples of tautomers

SMILES | Name
Oc1ncccc1 | 2-Pyridinol
O=c1[nH]cccc1 | 2-Pyridone
Given one tautomeric form, most chemical information systems will report data for all known tautomers as needed. The role of SMILES is to specify exactly which tautomeric form is requested, and for which there are data. A simple example, with two possible tautomeric forms, is shown in Table 3-13.

3.9.3
Tricks
There are two “tricks” which are fundamental to efficient chemical information processing using SMILES. These are presented below, with exercises.

3.9.3.1
Canonical SMILES
One of the main goals of chemoinformatics is to store chemical information precisely and then to retrieve it reliably and efficiently. The ability to generate canonical SMILES provides the ultimate mechanism to achieve these goals when storing information keyed to molecular structures.

SMILES canonicalization is simple in principle. Given any valid input SMILES for a given molecule, produce a single “canonical” SMILES. For instance, any valid SMILES for ethanol such as OCC, C(C)O, C(O)C, O1.C1C, etc. would all produce the canonical SMILES CCO. Although simple in principle, this is difficult to do comprehensively, for both theoretical and practical reasons. The graph theory needed to do this in non-exponential time is a relatively recent invention. The algorithm is beyond the scope of this section, but suffice it to say that it is a great job for a computer rather than human labor.

The availability of canonical SMILES allows “order 1” retrieval of structure-oriented information. This means that the time it takes to retrieve information for a given structure is completely independent of the number of structures which exist. The order of an algorithm is the most important factor in determining how well it “scales up”. For instance, a linear search is “order N”; it takes a million times longer to look up a structure in a list of 10 million structures than it takes in a list of 10 structures. If one can order the items (as in a phone book), binary search can be used, which is “order lg(N)” (lg is log base 2). Because 2^20 is approximately a million, binary search needs only about 20 more comparisons to find something in a list of 10 million than in a list of 10. The ultimate algorithm would have an “order 1”, i.e. it would take the exact same amount of time to find information in lists of any length. Remarkably, this is easily achieved using canonical SMILES and a technique called “hashing”.

The basic idea of hashing is to create a unique key (canonical SMILES) and to derive from that key the address at which its data are stored (memory address or disk position). For instance, to store the observation that the boiling point of OCC, ethanol, is 78 °C, we canonicalize it in SMILES to CCO, use a hash function to convert CCO to a computer address, and store the information in that location. If we are later asked to find the boiling point of C(C)O, we again canonicalize the SMILES to CCO, convert it to an address, and look at that location for the answer. If the data exists, that is where it will be stored. If nothing is found at that location, that information does not exist in the database; we do not have to look anywhere else.

If we can use this trick once to provide constant-time retrieval of information for chemical structures, we can use it twice. If we are given other identifiers for a structure (e.g. the name “ethanol” and the CAS No. “64-17-5”), these can also be hashed along with the canonical SMILES. When we are asked for the boiling point of “ethanol” or “64-17-5”, we first hash on that identifier to find the canonical SMILES, then hash on that SMILES to find the data. Twice constant time is still constant time! Once such a system is in place, retrieval speed is independent of the number of structures, the number of kinds of identifiers, and the number of identifiers. This is a pleasant circumstance for such an information-rich field as chemistry.

There are other interesting applications of canonical SMILES. For instance, if one uses a deterministic method (e.g. a computer program) to produce a structure diagram from a SMILES, using canonical SMILES as input produces a canonical diagram.

Advanced exercise. The algorithm which produces canonical SMILES labels atoms with properties such as atomic number, charge, etc. Such properties are known as “graph theoretical invariants” because they do not change with the order of input. It is clear that, e.g., the atomic number of an atom is not affected by input order. Isomeric SMILES can also be canonicalized (canonical isomeric SMILES is known as “absolute SMILES”); this includes properties like stereo orientation which appear to be order-dependent. Can you think of a way to make the specification of stereo orientation into a graph theoretical invariant?

3.9.3.2
Fast SMILES Parsers
Another computer-science trick is to parse (interpret) languages very quickly using an algorithm called a “finite state machine” (FSM). The idea behind FSMs is that there is a finite number of states which an interpreter can be in, and a finite number of inputs possible in any state (e.g. if the input is printable ASCII there are only 95 possible characters). For some languages, one can convert the syntax and grammar into a table which says what state to jump to for each input in each state. Such tables can get pretty complicated but, given formally defined syntax and grammar, FSMs can be automatically generated for languages such as SMILES by programs such as yacc (“yet another compiler-compiler”).

For example, the FSM might receive the character C in the start-state. The C might be the carbon symbol or the first character of chlorine; the FSM does not know yet. The action for C in the start-state is to go to a “might be C or Cl” state. If the second character is C, it knows that a carbon was previously specified and deals with that, then leaves itself in the “might be C or Cl” state. If the second character is l, it deals with a chlorine and puts itself back in the start-state.

The remarkable property of FSMs is that their running time is proportional only to the number of characters of input. The running time is not at all dependent on the complexity of the language. For each input character it has to do a compare and a jump. In practice approximately five machine operations are required for each input character. The following exercise illustrates the implications for modern chemistry on modern computers.

Exercise: Approximately 20 million chemical structures and reactions are currently known. The average uncompressed SMILES length for comprehensive databases is approximately 32 characters. Current laptop computers have processors running at approximately 1 billion operations per second and can address 1 GB of fast memory (memory street price is ~$300). Assume you have an FSM which can process SMILES input at 5 operations/character. Can you fit all of chemistry into the memory of a laptop? If so, how long would it take to interpret all structures, one after the other? How much does it cost to store each structure in memory?

Advanced exercise: Write an FSM which computes the molecular formula for a molecule given its SMILES. How fast does it run?

Really advanced exercise: Write an FSM which determines if a SMILES is syntactically valid or not. How fast does it run? The size of the Internet (“surface web”) is estimated to be about 20 terabytes. How much running time would your FSM take to find all SMILES on the Internet?

References
1 http://www.daylight.com/smiles/, SMILES Home page. A collection of SMILES-related hyperlinks and information, maintained by Daylight CIS, Inc.
2 D. WEININGER, J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.
3 D. WEININGER, A. WEININGER, J. L. WEININGER, J. Chem. Inf. Comput. Sci. 1989, 29, 97-101.
4 D. WEININGER, J. Chem. Inf. Comput. Sci. 1990, 30, 237-243.
Handbook of Chemoinformatics. Johann Gasteiger. Copyright © 2003 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
Ovidiu Ivanciuc studied organic chemistry at the Polytechnic University of Timisoara (Romania), where he started to investigate topological indices, structural descriptors, and QSAR with Professor Dan Ciubotariu. He continued investigations in these directions during his PhD studies with Professor Alexandru T. Balaban. After the Romanian Revolution against communism (1989) he was appointed Assistant Professor and then Associate Professor of organic chemistry at the Polytechnic University of Bucharest. Between 1996 and 1999 he was visiting Professor at the University of Nice (France), collaborating with Professor Daniel Cabrol-Bass in various chemometrics projects. As a Welch Foundation fellow at Texas A&M University (2000-2001) he performed quantum computations for very large systems with Professor Douglas J. Klein. Since 2001 he has investigated the structural determinants of protein allergenicity (http://fermi.utmb.edu/SDAP) in the department of computational biology and bioinformatics at the University of Texas Medical Branch, on Galveston Island. He is chief editor of the Internet Electronic Journal of Molecular Design (http://www.biochempress.com). A list of his publications and reprints can be downloaded from http://www.ivanciuc.org.
4
Graph Theory in Chemistry
Ovidiu Ivanciuc

4.1 Introduction
The wide applications of graph theory in physics, electronics, chemistry, biology, medicinal chemistry, economics, or the information sciences are mainly the effect of the seminal book of Harary [1]. A graph is usually represented in a graphical form as vertices interconnected by edges [1, 2]. Each graph vertex represents an object whereas the edge between two vertices represents the relationship between the two objects. In a chemical graph the objects can represent orbitals, atoms, bonds, groups of atoms, molecules, or collections of molecules. The edges of a chemical graph symbolize the interactions between chemical objects and are used to define chemical bonds, reactions, reaction mechanisms, kinetic models, or other relationships and transformations of the chemical objects. The rich literature on
chemical graphs and their applications [3-8] is an important guide in the exploration of the major applications of chemical graphs: topological indices and other structural indices for QSPR (quantitative structure-property relationships) and QSAR (quantitative structure-activity relationships) [9-16]; molecular orbital theory of conjugated compounds [17, 18]; structure of benzenoid hydrocarbons [19-22]; enumeration of isomers, constitutional symmetry perception and coding of chemical compounds [23-25]; kinetic and reaction graphs [26]; and computer-assisted synthesis design [27].

The main graph theory application in chemistry is the representation of chemical compounds as molecular graphs in which atoms are symbolized as vertices and the chemical bonds are represented as edges. Molecular graphs are used to represent chemical compounds and reaction mechanisms in a graphical form, and to encode and search chemicals in databases, but their main application is in computing structural descriptors for drug design, virtual screening of chemical libraries, or for QSPR and QSAR models. Therefore, after a brief introduction to graph theory, we will present the main tools used in computing graph descriptors of molecular structure: parameters for molecules containing heteroatoms and multiple bonds, molecular matrices, polynomials, spectra, and spectral moments. Other important applications of chemical graphs will be reviewed, such as enumeration of Kekulé structures, the topological resonance energy, and isomer enumeration. This contribution is a modified and updated version of a chapter in the Encyclopedia of Computational Chemistry [54].
4.2 Elements of Graph Theory
A graph G = G(V, E) is defined as an ordered pair consisting of two sets V = V(G) and E = E(G), where the elements of the set E define a binary relationship between the elements of the set V. In graphical form, the elements of the set V are represented as vertices, whereas the elements of the set E are represented as edges connecting the vertices. The number of elements in V(G), N = |V(G)|, is the number of vertices in the graph G, and the number of elements in E(G), M = |E(G)|, is the number of edges. The graph vertices are labeled from 1 to N, V(G) = {v_1, v_2, ..., v_N}, and the edge connecting vertices v_i and v_j is denoted by e_ij. Two vertices v_i and v_j of a graph G are adjacent if there is an edge e_ij joining them; the vertices v_i and v_j are incident to the edge e_ij. Two distinct edges of G are adjacent if they have at least one vertex in common. As an example, consider graph 1, which has five vertices and four edges, the vertex set V(1) = {v_1, v_2, v_3, v_4, v_5}, and the edge set E(1) = {e_1,2, e_2,3, e_3,4, e_2,5}. Vertices v_2 and v_3 of graph 1 are adjacent, and edges e_1,2, e_2,3, and e_2,5 are adjacent. The vertices are represented in graphical form either as black circles or by the edge endpoints or connections with other edges. Usually, we will represent vertices as edge endpoints, as in graph 1; however, in diagrams where the angle between edges is 180° the vertices must be represented as black circles, as in graph 2, which is another graphical form of graph 1.
[Figure: graphs 1, 2, 3, and 4]
A subgraph H of a graph G is a graph whose vertices and edges are contained in G. If V(H) is a subset of V(G), V(H) ⊆ V(G), and E(H) is a subset of E(G), E(H) ⊆ E(G), then H = (V(H), E(H)) is a subgraph of the graph G = (V(G), E(G)). The graph 4 is a subgraph of 3, because all vertices and edges of 4 are contained in 3. The subgraph G - v_i is obtained by deleting from a graph G the vertex v_i and all its incident edges. The graph 6 is a subgraph of 5, obtained by deleting the vertex v_7 together with its incident edges e_6,7 and e_7,8. The subgraph G - e_ij is obtained by deleting from graph G the edge e_ij. The graph 7 is a subgraph of 5, obtained by deleting the edge e_6,7, while the graph 8 is a subgraph of 5 obtained by deleting the edge e_2,3.
[Figure: graphs 5, 6, 7, and 8]
The degree of a vertex v_i, deg_i, is equal to the number of vertices adjacent to vertex v_i. The degree vector Deg = Deg(G), in which the ith element represents the degree of the vertex v_i, collects the degrees of all vertices in a graph. The degree vector of graph 1 is Deg(1) = {1, 3, 2, 1, 1} and the degree vector of graph 5 is Deg(5) = {1, 2, 3, 2, 2, 2, 2, 2}. A multigraph contains pairs of vertices connected by more than one edge. A multiedge of multiplicity m is a set of m edges incident with the same pair of distinct vertices. The multigraph 9 has two vertices, v_2 and v_3, connected by two edges. A loop is an edge joining a vertex with itself. The graph 10 contains a loop at vertex v_3. A general graph may contain both multiple edges and loops, while a simple graph contains no loops or multiple edges. A digraph or directed graph D = D(V, A) is defined by a set of nodes V = V(D) and a set of arcs or directed edges A = A(D), where the elements of the set A define the binary relation between the elements of the set V. A directed edge starting at node v_i and ending at node v_j is called an arc from v_i to v_j and is denoted by a_ij. Two arcs a_ij and a_ji are different. The digraph 11 has the node set V(11) = {v_1, v_2, v_3, v_4, v_5} and the arc set A(11) = {a_1,4, a_2,1, a_2,3, a_3,4, a_4,5, a_5,4}.
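The definitions of vertex set, edge set, adjacency, and degree vector can be checked on graph 1 with a few lines of code. The following is a minimal sketch in Python; the function names `degree_vector` and `are_adjacent_edges` are illustrative choices, not part of the original text.

```python
# Graph 1 from the text: five vertices and four edges
# E(1) = {e_1,2, e_2,3, e_3,4, e_2,5}.
vertices_1 = [1, 2, 3, 4, 5]
edges_1 = [(1, 2), (2, 3), (3, 4), (2, 5)]


def degree_vector(vertices, edges):
    """Return the vertex degrees in the order given by `vertices`."""
    degree = {v: 0 for v in vertices}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [degree[v] for v in vertices]


def are_adjacent_edges(e1, e2):
    """Two distinct edges are adjacent if they share at least one vertex."""
    return e1 != e2 and bool(set(e1) & set(e2))


deg_1 = degree_vector(vertices_1, edges_1)
print(deg_1)
```

In a simple graph each edge contributes to the degree of exactly two vertices, so the sum of the degree vector equals 2M, which gives a quick consistency check on any hand-written edge list.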
[Figure: multigraph 9, graph 10 containing a loop, and digraph 11]
A walk w in a graph G is defined as a sequence of vertices and edges w = {v_a, e_ab, v_b, e_bc, v_c, ..., v_i, e_ij, v_j, ..., v_m, e_mn, v_n} beginning and ending with vertices, in which two consecutive vertices are adjacent, and each edge e_ij is incident with the two vertices v_i and v_j preceding and following it, respectively; v_a is called the initial vertex of the walk and v_n is called the terminal vertex of the walk. Alternatively, a walk can also be defined as a sequence of edges w = {e_ab, e_bc, ..., e_mn}, in which two consecutive edges e_ij and e_jk are adjacent, or as a sequence of vertices w = {v_a, v_b, ..., v_n}, in which two consecutive vertices are adjacent. In a walk any edge of the graph can appear more than once. The length of a walk is the total number of edges that are contained in it, each edge being counted as many times as it is traversed. The following walks in the graph 12 are equivalent: w_1(12) = {v_1, e_1,2, v_2, e_2,3, v_3, e_3,4, v_4, e_4,3, v_3}; w_2(12) = {e_1,2, e_2,3, e_3,4, e_4,3}; w_3(12) = {v_1, v_2, v_3, v_4, v_3}. In these three identical walks of length four, w_1(12) = w_2(12) = w_3(12), the initial vertex is v_1 and the terminal vertex is v_3.
[Figure: graphs 12, 13, 14, and 15]
A closed walk or self-returning walk is a walk in which the initial and the terminal vertices coincide. A walk in which the initial and the terminal vertices are different is called an open walk or self-avoiding walk. In the graph 13 the walk w_4(13) = {v_1, v_2, v_4, v_3, v_4, v_3} is open, while the walk w_5(13) = {v_4, v_2, v_3, v_4} is closed. A path is a walk in which all vertices are distinct. The length of a path in a graph is equal to the number of edges along the path. A path of length 2 in the graph 14 is w_6(14) = {v_1, v_4, v_3}, while a path of length 3 in the same graph is w_7(14) = {v_2, v_3, v_1, v_4}. The graph 14 has no paths with length greater than 3. Because edges may be traversed repeatedly, the length of a walk in a connected graph with two or more vertices has no upper limit. In a connected graph G every pair of vertices is joined by a path. In a disconnected graph G there is at least one pair of vertices v_i, v_j ∈ V(G) with no path between them; these two vertices belong to different components of the graph G. The graph 8 is disconnected.

The graph distance d_ij between a pair of vertices v_i and v_j from a connected graph G is defined as the length (number of edges) of the shortest path connecting the two vertices. The graph distance has the following properties: d_ii = 0 for all v_i ∈ V(G); d_ij > 0 for all v_i, v_j ∈ V(G) with i ≠ j; d_ij = d_ji for all v_i, v_j ∈ V(G); d_ik + d_kj ≥ d_ij for all v_i, v_j, v_k ∈ V(G). In the graph 15 there are three paths between vertices v_1 and v_6: w_8(15) = {v_1, v_6} of length 1, w_9(15) = {v_1, v_2, v_5, v_6} of length 3, and w_10(15) = {v_1, v_2, v_3, v_4, v_5, v_6} of length 5. Because the length of the shortest path connecting vertices v_1 and v_6 is 1, the distance between the two vertices is 1, d_1,6 = d_6,1 = 1. The distance between two adjacent vertices is 1. The eccentricity ecc(v_i) of a vertex v_i is defined as the maximum distance from the vertex v_i to any other vertex v_j in the graph G, max(d_ij) for all v_j ∈ V(G). In the graph 15 the eccentricity of the vertex v_1 is 3, while the eccentricity of the vertex v_2 is 2. The diameter diam(G) of a graph G is the maximum eccentricity. For the graphs 12-15, diam(12) = 4, diam(13) = 2, diam(14) = 2, and diam(15) = 3.

A graph circuit or cycle is defined as a closed walk in which only the initial and terminal vertices coincide while all other vertices are distinct. In the graph 15 the walk w_11(15) = {v_3, v_4, v_5, v_6, v_1, v_2, v_3} is a cycle of length 6. The subgraph G - C_i is obtained from the graph G by deleting all the vertices of the cycle C_i and their incident edges. The cyclomatic number μ of a connected graph is defined as the number of independent cycles in the graph, μ = M - N + 1. For the graph 13 μ(13) = 1, for the graph 14 μ(14) = 2, for the graph 15 μ(15) = 2, while for the graph 12 μ(12) = 0. An acyclic graph, or a tree T, has μ = 0. A k-tree is a tree with the maximum degree k. The graph 16 is a 4-tree, the graph 17 is a 2-tree, and the graph 18 is a 3-tree. A rooted tree is a tree in which one vertex (the root vertex) is distinct from the others.
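The distance, eccentricity, diameter, and cyclomatic number quoted for graph 15 can be reproduced with a breadth-first search. The sketch below is illustrative only: the edge list assumes that graph 15 is a six-membered ring 1-2-3-4-5-6 with an additional edge between vertices 2 and 5, a labeling inferred from the three paths listed in the text, and `bfs_distances` is a hypothetical helper name.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Topological distances (edge counts) from source to all reachable vertices."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adjacency[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


# Assumed labeling of graph 15: ring 1-2-3-4-5-6 plus the edge 2-5.
n_vertices = 6
edges_15 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (2, 5)]

adjacency = {v: [] for v in range(1, n_vertices + 1)}
for a, b in edges_15:
    adjacency[a].append(b)
    adjacency[b].append(a)

distance = {v: bfs_distances(adjacency, v) for v in adjacency}
eccentricity = {v: max(distance[v].values()) for v in adjacency}
diameter = max(eccentricity.values())
cyclomatic_number = len(edges_15) - n_vertices + 1  # mu = M - N + 1

print(distance[1][6], eccentricity[1], eccentricity[2], diameter, cyclomatic_number)
```

Breadth-first search is sufficient here because every edge of a simple graph has unit length; for weighted graphs a shortest-path algorithm such as Dijkstra's would be needed instead.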
[Figure: graphs 16, 17, 18, and 19]
A chain or linear graph L_N contains N - 2 vertices with degree 2 and two vertices with degree 1. The graph 17 is the linear graph L_4. A star S_N is a tree with N vertices, N - 1 of them having degree 1. The graph 18 is the star graph S_4. A spanning subgraph of a graph G is obtained from G by deleting one or more edges. A subtree of a graph G is a subgraph of G which is a tree. A spanning tree of a graph G is a subtree of G that contains all vertices of G. The graph 13 is a spanning subgraph of the graph 14, the graph 7 is a spanning tree of the graph 5, while the graph 17 is a spanning tree of the graph 19.
[Figure: graphs 20, 21, 22, and 23]
A graph in which every vertex has the degree k is a k-regular graph. The graph 20 is 1-regular, the graphs 4, 19, and 21 are 2-regular, and the graphs 22 and 23 are 3-regular. A complete graph K_N containing N vertices is a regular graph of degree N - 1 with N(N - 1)/2 edges in which each vertex is adjacent to the remaining N - 1 vertices. The graph 20 is the complete graph K_2, the graph 21 is K_3, and the graph 22 is the complete graph K_4. A connected graph with N vertices that all have the degree 2 defines a ring R_N. The graph 4 is the ring R_5, the graph 19 is the ring R_4, and the graph 21 represents R_3.
[Figure: graphs 24, 25, and 26]
A subgraph of a graph G consisting of k independent, mutually non-incident, edges represents a k-matching of G. The number of selections of k independent edges in G is denoted by m(G, k); by definition, m(G, 0) = 1. The graph 24 is a 2-matching of the ring R_4, and graphs 25 and 26 are two different 2-matchings of the graph 1; the missing bonds are indicated with dashed lines. In the line graph Li(G) of a graph G each vertex from Li(G) corresponds to an edge from G, and two vertices from Li(G) are adjacent if the corresponding edges from G are incident to a common vertex. The line graph of 1 is the graph 13, 13 = Li(1), while the line graph of 13 is the graph 14, 14 = Li(13). The line graph of a ring R_N is the same ring, R_N = Li(R_N), while the line graph of a linear graph L_N is the linear graph with N - 1 vertices, L_N-1 = Li(L_N).
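Both the line graph and the matching numbers m(G, k) follow directly from the edge list. The following Python sketch is a brute-force illustration for small graphs, not an efficient algorithm; the function names `line_graph` and `count_matchings` are hypothetical, and the vertices of Li(G) are numbered in the order in which the edges of G are listed.

```python
from itertools import combinations


def line_graph(edges):
    """Edge list of Li(G): vertex k of Li(G) stands for the k-th edge of G."""
    return [(i + 1, j + 1)
            for (i, e), (j, f) in combinations(enumerate(edges), 2)
            if set(e) & set(f)]  # adjacent in Li(G) if the G edges share a vertex


def count_matchings(edges, k):
    """m(G, k): number of selections of k mutually non-incident edges."""
    count = 0
    for subset in combinations(edges, k):
        touched = [v for e in subset for v in e]
        if len(touched) == len(set(touched)):  # no vertex is used twice
            count += 1
    return count


edges_1 = [(1, 2), (2, 3), (3, 4), (2, 5)]          # graph 1
ring_5 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]   # ring R_5

li_1 = line_graph(edges_1)
print(li_1)
print([count_matchings(edges_1, k) for k in range(4)])
```

For graph 1 the sketch finds a line graph with four vertices and four edges (a triangle with one pendant vertex) and exactly two 2-matchings, in agreement with the two 2-matchings of graph 1 mentioned in the text.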
4.3 Molecular Graphs
Chemical compounds are usually represented as molecular graphs, i.e. nondirected, connected graphs in which vertices correspond to atoms and edges represent covalent bonds between atoms. The molecular graph model of the chemical structure emphasizes the chemical bonding pattern of atoms, whereas molecular geometry is neglected. Obviously, the molecular graph model is appropriate for prediction of physical, chemical, or biological properties that depend mainly on the bonding relationships between atoms. On the other hand, the molecular graph representation reflects mainly the connectivity of the atoms and is less suitable for the modeling of those properties that are determined mostly by the molecular geometry, conformation, or stereochemistry. The graph representation of chemical compounds must retain the features of the molecular structure that are relevant for the investigated physical, chemical, or biological property.
[Figure: representations 27, 28, and 29 of cyclopropane]
Cyclopropane can be represented, in different conventions, by graphs 27, 28, and 29. In the graph 28 hydrogen and carbon atoms are not differentiated, making this representation of little practical use. Also, the vertices representing the hydrogen atoms are redundant, because the structure of cyclopropane 27 can be reconstructed from its carbon skeleton 29. Therefore, the usual graph representation of an organic chemical compound is the hydrogen-depleted (or hydrogen-suppressed) molecular graph in which vertices correspond to non-hydrogen atoms and edges represent covalent bonds between non-hydrogen atoms. For hydrocarbons, the vertices in the molecular graph represent carbon atoms. Using this convention, alkanes are represented as 4-trees. The graph 1 is the hydrogen-suppressed molecular graph of 2-methylbutane, the graph 3 represents ethylcyclopentane, the graph 4 represents cyclopentane, the graph 12 represents 2,4-dimethylpentane, the graph 21 represents cyclopropane, the graph 22 represents tetrahedrane, and the graph 23 represents cubane. Another important class of molecular graphs is the Hückel graph, in which each vertex represents an sp2-hybridized carbon atom and each edge represents a conjugated carbon-carbon bond. Graph spectral theory has a close connection to the Hückel molecular orbital (HMO) theory, and the HMO study of conjugated compounds benefited from the application of graph theory [2, 4, 7, 17, 18]. Cyclobutadiene has the Hückel graph 30, benzene has the Hückel graph 31, and naphthalene has the Hückel graph 32.
[Figure: Hückel graphs 30, 31, and 32]

A Kekulé structure of a conjugated hydrocarbon can be represented as a Kekulé graph. A Kekulé structure of a conjugated hydrocarbon is a structural formula, which may or may not include hydrogens, in which every carbon atom is sp2-hybridized and incident to exactly one double bond. A Kekulé graph is a disconnected graph consisting of K_2 components (isolated edges) representing the double bonds in the corresponding Kekulé structure. Naphthalene has three Kekulé structures, namely 33, 34, and 35, represented by the Kekulé graphs 36, 37, and 38. A 1-factor of a graph G is a spanning subgraph of G consisting of K_2 components. A Kekulé graph represents a 1-factor generated from the corresponding Kekulé structure.
[Figure: Kekulé structures 33, 34, and 35 of naphthalene and the corresponding Kekulé graphs 36, 37, and 38]
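Since a Kekulé graph is a 1-factor of the Hückel graph, counting Kekulé structures amounts to counting the perfect matchings of that graph. The recursive sketch below illustrates the idea for small molecules; the vertex numbering of the three Hückel graphs is an assumption made for this example (for naphthalene, a ten-membered perimeter 1 to 10 with the central bond between vertices 1 and 6), and `count_kekule_structures` is a hypothetical name.

```python
def count_kekule_structures(n, edges):
    """Count the 1-factors (perfect matchings) of a graph with vertices 1..n."""
    adjacency = {v: set() for v in range(1, n + 1)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    def extend(unmatched):
        if not unmatched:
            return 1  # every vertex is covered by exactly one double bond
        v = min(unmatched)  # the lowest unmatched vertex must be matched now
        return sum(extend(unmatched - {v, u})
                   for u in adjacency[v] if u in unmatched)

    return extend(frozenset(adjacency))


def ring(n):
    """Edge list of the ring R_n with vertices 1..n."""
    return [(i, i % n + 1) for i in range(1, n + 1)]


cyclobutadiene = ring(4)
benzene = ring(6)
naphthalene = ring(10) + [(1, 6)]  # assumed numbering, central bond 1-6

print(count_kekule_structures(4, cyclobutadiene),
      count_kekule_structures(6, benzene),
      count_kekule_structures(10, naphthalene))
```

The count of three obtained for naphthalene corresponds to the three Kekulé structures 33, 34, and 35. The exhaustive recursion is exponential in the worst case, so it is suitable only as an illustration on small conjugated systems.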
The above examples of molecular graphs show clearly that the convention used to represent chemical structures as molecular graphs is very important and has to be specified in each study that uses molecular graphs. Depending on the convention used to translate atoms and bonds into vertices and edges, different chemical compounds can be represented by the same graph. When graph vertices represent sp3-hybridized carbon atoms and edges represent covalent carbon-carbon single bonds, graph 17 represents butane, graph 30 represents cyclobutane, graph 31 represents cyclohexane, and graph 32 represents decalin. On the other hand, when chemical compounds are represented as Hückel graphs, with each sp2-hybridized carbon atom represented as a graph vertex and each conjugated carbon-carbon bond represented as a graph edge, graph 17 represents butadiene, graph 30 represents cyclobutadiene, graph 31 represents benzene, and graph 32 represents naphthalene. The convention of representing molecules as graphs is also important for computing structural descriptors from molecular graphs. Although topological indices are usually computed from hydrogen-suppressed molecular graphs, some structural descriptors are derived from the whole, hydrogen-containing molecular graph. These examples demonstrate the importance of indicating, in any model that uses chemical graphs, the rules used to translate a chemical structure into a molecular graph.
4.4 Vertex- and Edge-Weighted Molecular Graphs
Although more than one thousand topological indices are used in QSPR, QSAR and design of chemical libraries, most are defined only for simple graphs that represent alkanes and cycloalkanes. The main reason for this limitation is that the mathematical theory of weighted graphs is not developed to a level high enough to ensure significant application to molecular graphs. QSAR studies, however, require the computation of topological indices for molecules containing heteroatoms and multiple bonds, which can be conveniently represented as vertex- and edge-weighted (VEW) molecular graphs. The computation of molecular matrixes and topological indices for VEW molecular graphs is usually performed with various sets of vertex and edge parameters [28, 29]. A VEW molecular graph G = G(V, E, Sy, Bo, Vw, Ew, w) consists of a vertex set V = V(G), an edge set E = E(G), a set of chemical symbols for vertices Sy = Sy(G), a set of topological bond orders for edges Bo = Bo(G), a vertex weight set Vw(w) = Vw(w, G), and an edge weight set Ew(w) = Ew(w, G). The elements of the vertex and edge weight sets are computed with the weighting scheme w. Usually, hydrogen atoms are not considered in the molecular graph, and in a VEW graph the weight of a vertex corresponding to a carbon atom is 0 and the weight of an edge corresponding to a carbon-carbon single bond is 1. Also, the topological bond order Bo_ij of an edge e_ij takes the value 1 for single bonds, 2 for double bonds, 3 for triple bonds and 1.5 for aromatic bonds.

In a VEW graph G the length of a path p_ij between vertices v_i and v_j, l(p_ij, w, G), is equal to the sum of the edge parameters Ew(w)_ij for all edges along the path. The topological length of a path p_ij, t(p_ij, G), in a VEW graph G is equal to the number of edges along the path. In a VEW graph, the distance d(w)_ij between a pair of vertices v_i and v_j is equal to the length of the shortest path connecting the two vertices, d(w)_ij = min(l(p_ij, w)). The topological distance td_ij between vertices v_i and v_j from a VEW graph G is equal to the minimum topological length of the paths connecting the two vertices, td_ij = min(t(p_ij)), i.e. the minimum number of bonds between vertices v_i and v_j. In simple (non-weighted) graphs, the distance d_ij and topological distance td_ij are equal, while in weighted graphs this is usually not true.

In a weighting scheme w the vertex Vw and edge Ew parameters are computed from a property p_i associated with every vertex v_i from G, v_i ∈ V(G), and the topological bond order Bo of all edges from the molecular graph. The vertex parameter Vw(w)_i for the vertex v_i is [29]:

\[ \mathrm{Vw}(w)_i = 1 - \frac{p_C}{p_i} \tag{1} \]

and the edge parameter Ew(w)_ij for the edge between vertices v_i and v_j is:

\[ \mathrm{Ew}(w)_{ij} = \frac{p_C \, p_C}{\mathrm{Bo}_{ij} \, p_i \, p_j} \tag{2} \]

where p_i is the atomic property of vertex v_i, p_j is the atomic property of vertex v_j, and p_C is the atomic property for the carbon atom. Several weighting schemes for molecular graphs were defined by applying Eqs (1) and (2) to different atomic properties: Z, when p is the atomic number [28]; A, when p is the atomic mass [29]; P, when p is the atomic polarizability [29]; E, when p is the atomic electronegativity [29]; R, when p is the atomic radius [29]. In Table 4-1 we present a selected set of atomic properties used with different weighting schemes, while in Table 4-2 we give vertex parameters Vw computed with the Z, A, P, R, and E weighting schemes. The edge parameters Ew computed with the A and E weighting schemes are presented in Tables 4-3 and 4-4, respectively. The AH weighting scheme uses the following equation to define the vertex parameter Vw(AH)_i for the non-hydrogen atom i [29]:

\[ \mathrm{Vw}(AH)_i = 1 - \frac{A_C}{A_i + \mathrm{NoH}_i A_H} = 1 - \frac{12.011}{A_i + 1.0079\,\mathrm{NoH}_i} \tag{3} \]
The edge parameter Ew(AH)_ij for the bond between atoms i and j is defined by the equation:

\[ \mathrm{Ew}(AH)_{ij} = \frac{A_C A_C}{\mathrm{Bo}_{ij} (A_i + \mathrm{NoH}_i A_H)(A_j + \mathrm{NoH}_j A_H)} = \frac{12.011 \times 12.011}{\mathrm{Bo}_{ij} (A_i + 1.0079\,\mathrm{NoH}_i)(A_j + 1.0079\,\mathrm{NoH}_j)} \tag{4} \]

where A_C = 12.011 is the atomic mass for carbon, A_H = 1.0079 is the atomic mass for hydrogen, NoH_i is the number of hydrogen atoms bonded to the non-hydrogen atom i, and NoH_j is the number of hydrogen atoms bonded to the non-hydrogen atom j.

Tab. 4-1 Selected set of atomic properties used with different weighting schemes: atomic number Z, atomic mass A, polarizability α (Å^3), atomic radius r (Å), and electronegativity χ [29]

Atom   Z    A         α       r      χ
B      5    10.811    3.03    1.45   2.02
C      6    12.011    1.76    1.21   2.55
N      7    14.007    1.10    1.03   3.12
O      8    15.999    0.802   0.93   3.62
F      9    18.998    0.557   0.82   4.23
Si     14   28.086    5.38    1.75   1.87
P      15   30.974    3.63    1.54   2.22
S      16   32.066    2.90    1.43   2.49
Cl     17   35.453    2.18    1.30   2.82
As     33   74.922    4.31    1.63   2.11
Se     34   78.960    3.77    1.56   2.31
Br     35   79.904    3.05    1.45   2.56
Te     52   127.60    5.5     1.77   2.08
I      53   126.90    4.7     1.68   2.27

Tab. 4-2 Vertex parameters Vw computed with the Z, A, P, R, and E weighting schemes

Atom   Vw(Z)    Vw(A)    Vw(P)    Vw(R)    Vw(E)
B      -0.200   -0.111    0.419    0.166   -0.262
C       0.000    0.000    0.000    0.000    0.000
N       0.143    0.143   -0.600   -0.175    0.183
O       0.250    0.249   -1.195   -0.301    0.296
F       0.333    0.368   -2.160   -0.476    0.397
Si      0.571    0.572    0.673    0.309   -0.364
P       0.600    0.612    0.515    0.214   -0.149
S       0.625    0.625    0.393    0.154   -0.024
Cl      0.647    0.661    0.193    0.069    0.096
As      0.818    0.840    0.592    0.258   -0.209
Se      0.824    0.848    0.533    0.224   -0.104
Br      0.829    0.850    0.423    0.166    0.004
Te      0.885    0.906    0.680    0.316   -0.226
I       0.887    0.905    0.626    0.280   -0.123
Tab. 4-3 Edge parameters Ew(A) computed with the atomic mass weighting scheme A (a dash marks a bond type for which no value is listed)

Atom i   Atom j   Single   Double   Triple   Aromatic
C        C        1.000    0.500    0.333    0.667
C        N        0.857    0.429    0.286    0.572
C        O        0.751    0.375    -        -
C        F        0.632    -        -        -
C        Si       0.428    -        -        -
C        P        0.388    -        -        -
C        S        0.375    0.187    -        0.250
C        Cl       0.339    -        -        -
C        Se       0.152    -        -        -
C        Br       0.150    -        -        -
C        Te       0.094    -        -        -
C        I        0.095    -        -        -
N        N        0.735    0.368    -        0.490
N        O        0.644    0.322    -        0.429
O        S        0.281    0.141    -        -
Tab. 4-4 Edge parameters Ew(E) computed with the atomic electronegativity weighting scheme E (a dash marks a bond type for which no value is listed)

Atom i   Atom j   Single   Double   Triple   Aromatic
C        C        1.000    0.500    0.333    0.667
C        N        0.817    0.409    0.272    0.545
C        O        0.704    0.352    -        -
C        F        0.603    -        -        -
C        Si       1.364    -        -        -
C        P        1.149    -        -        -
C        S        1.024    0.512    -        0.683
C        Cl       0.904    -        -        -
C        Se       1.104    -        -        -
C        Br       0.996    -        -        -
C        Te       1.226    -        -        -
C        I        1.123    -        -        -
N        N        0.668    0.334    -        0.445
N        O        0.576    0.288    -        0.384
O        S        0.721    0.361    -        -
4.5 Molecular Graph Matrixes
Molecular graphs are widely used in chemistry textbooks and research papers to represent the chemical structure of covalent compounds in graphical form. The molecular graph is, however, a non-numerical representation of the chemical structure, and the computation of topological indices for QSAR requires a numerical description of graphs. Graphs can be represented in algebraic form as matrixes [7, 15, 29]. This numerical description of the structure of chemical compounds is essential for computer manipulation of molecules and for calculation of various topological indices.

4.5.1 The Adjacency Matrix
The adjacency matrix A = A(G) of a graph G with N vertices is the square N × N symmetric matrix whose elements [A]_ij are defined as:

\[ [A]_{ij} = \begin{cases} 1 & \text{if } i \neq j \text{ and } e_{ij} \in E(G) \\ 0 & \text{otherwise} \end{cases} \tag{5} \]

where E(G) represents the set of edges of G. In the adjacency matrix A(G) the row i and column i correspond to vertex v_i from G. As an example, the molecular graph and the adjacency matrix of 1-ethyl-2-methylcyclopropane 39 are given. The graph vertices are labeled from 1 to 6, and each vertex v_i is represented in the matrix A(39) by the row i and column i, respectively.
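[Figure: molecular graph 39 of 1-ethyl-2-methylcyclopropane, with vertices labeled 1 to 6]

A(39)
     1   2   3   4   5   6
1    0   1   0   0   0   0
2    1   0   1   1   0   0
3    0   1   0   1   0   0
4    0   1   1   0   1   0
5    0   0   0   1   0   1
6    0   0   0   0   1   0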
The adjacency matrix A(w, G) of a vertex- and edge-weighted molecular graph G with N vertices is the square N × N real symmetric matrix whose element [A(w)]_ij is [29]:

\[ [A(w)]_{ij} = \begin{cases} \mathrm{Vw}(w)_i & \text{if } i = j \\ \mathrm{Ew}(w)_{ij} & \text{if } e_{ij} \in E(G) \\ 0 & \text{if } e_{ij} \notin E(G) \end{cases} \tag{6} \]

where Vw(w)_i is the weight of the vertex v_i, Ew(w)_ij is the weight of the edge e_ij, and w is the weighting scheme used to compute Vw and Ew. The first example of a VEW adjacency matrix is for butadiene 40; in the corresponding matrix A(40) the weight for the double bonds is 0.5.
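[Figure: molecular graph 40 of butadiene, with vertices labeled 1 to 4 along the chain]

A(40)
     1     2     3     4
1    0.0   0.5   0.0   0.0
2    0.5   0.0   1.0   0.0
3    0.0   1.0   0.0   0.5
4    0.0   0.0   0.5   0.0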
[Figure: molecular graph 41 of 2-isoxazoline, with vertices labeled 1 to 5]

The adjacency matrix of 2-isoxazoline 41 computed with the electronegativity weighting scheme E is:

A(E, 41)
     1       2       3       4       5
1    0.296   0.576   0.000   0.000   0.704
2    0.576   0.183   0.409   0.000   0.000
3    0.000   0.409   0.000   1.000   0.000
4    0.000   0.000   1.000   0.000   1.000
5    0.704   0.000   0.000   1.000   0.000
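The entries of the 2-isoxazoline matrix can be recomputed from the electronegativity values of Table 4-1. The sketch below assumes that the vertex and edge parameters have the form Vw = 1 - p_C/p_i and Ew = p_C p_C/(Bo p_i p_j), which reproduces the tabulated values, and that the atoms of 41 are numbered O1, N2, C3, C4, C5 with a C=N double bond between vertices 2 and 3, as the matrix entries indicate. The function names are illustrative.

```python
# Electronegativity values taken from Table 4-1.
ELECTRONEGATIVITY = {"C": 2.55, "N": 3.12, "O": 3.62}


def vertex_weight(p_atom, p_carbon):
    """Vw = 1 - p_C / p_i (assumed form of the vertex parameter)."""
    return 1.0 - p_carbon / p_atom


def edge_weight(p_i, p_j, bond_order, p_carbon):
    """Ew = p_C * p_C / (Bo * p_i * p_j) (assumed form of the edge parameter)."""
    return p_carbon * p_carbon / (bond_order * p_i * p_j)


def weighted_adjacency(atoms, bonds, prop):
    """Vertex weights on the diagonal, edge weights for bonded pairs, 0 elsewhere."""
    n = len(atoms)
    p_c = prop["C"]
    matrix = [[0.0] * n for _ in range(n)]
    for i, symbol in enumerate(atoms):
        matrix[i][i] = vertex_weight(prop[symbol], p_c)
    for (i, j), bond_order in bonds.items():
        w = edge_weight(prop[atoms[i - 1]], prop[atoms[j - 1]], bond_order, p_c)
        matrix[i - 1][j - 1] = matrix[j - 1][i - 1] = w
    return matrix


# Assumed numbering of 2-isoxazoline 41: O1-N2=C3-C4-C5-O1.
atoms_41 = ["O", "N", "C", "C", "C"]
bonds_41 = {(1, 2): 1, (2, 3): 2, (3, 4): 1, (4, 5): 1, (1, 5): 1}

a_41 = weighted_adjacency(atoms_41, bonds_41, ELECTRONEGATIVITY)
for row in a_41:
    print(["%.3f" % x for x in row])
```

Replacing the property dictionary with atomic masses, polarizabilities, or radii from Table 4-1 gives the corresponding A, P, or R weighted matrixes under the same assumptions.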
4.5.2 The Burden Matrix
The molecular matrix originally proposed by Burden for computation of graph spectra (matrix eigenvalues) is a modified adjacency matrix obtained from the hydrogen-suppressed molecular graph of an organic compound [30]. The BCUT descriptors [31], derived from the graph spectra of the Burden matrix B, are extensively used in combinatorial chemistry, virtual screening, diversity measurement, and QSAR. An extension of the Burden matrix was defined by inserting on the main diagonal of the matrix a vertex structural descriptor VSD, representing a vector of experimental or computed atomic properties [32]. The rules defining the Burden matrix B(VSD) = B(VSD, G) of a graph G with N vertices are:

1. The diagonal elements of B, [B]_ii, are computed with the formula:

\[ [B(VSD)]_{ii} = VSD_i \tag{7} \]

where VSD_i is a vertex structural descriptor of the vertex v_i that reflects the local structure of the corresponding atom i.
2. The non-diagonal element [B]_ij, representing an edge e_ij connecting vertices v_i and v_j, has the value 0.1 for a single bond, 0.2 for a double bond, 0.3 for a triple bond, and 0.15 for an aromatic delocalized bond.
3. The value of a non-diagonal element [B]_ij representing an edge e_ij connecting vertices v_i and v_j is augmented by 0.01 if either vertex v_i or vertex v_j has the degree 1.
4. All other non-diagonal elements [B]_ij are set equal to 0.001 (these elements are set to 0 in the adjacency matrix A and correspond to pairs of nonbonded vertices in the molecular graph G).

4.5.3 The Laplacian Matrix
For a graph G we define the diagonal matrix DEG = DEG(G), whose iith entry is equal to the degree of the vertex v_i, deg_i, and all other elements are equal to zero. The Laplacian matrix of a simple graph G, L = L(G), is a symmetric matrix defined with the equation [33, 34]:

\[ L(G) = DEG(G) - A(G) \tag{8} \]

where A(G) is the adjacency matrix of the graph G. The elements of the Laplacian matrix are:

\[ [L]_{ij} = \begin{cases} \deg_i & \text{if } i = j \\ -1 & \text{if } e_{ij} \in E(G) \\ 0 & \text{if } e_{ij} \notin E(G) \end{cases} \tag{9} \]

The Laplacian matrix of 1-ethyl-2-methylcyclopropane 39 is:

L(39)
      1    2    3    4    5    6
1     1   -1    0    0    0    0
2    -1    3   -1   -1    0    0
3     0   -1    2   -1    0    0
4     0   -1   -1    3   -1    0
5     0    0    0   -1    2   -1
6     0    0    0    0   -1    1
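The relation L = DEG - A is easy to verify for graph 39. The following Python sketch uses the edge list read from the matrices of 39 given in this section (bonds 1-2, 2-3, 2-4, 3-4, 4-5, and 5-6); the helper names are illustrative.

```python
# Edges of graph 39 (1-ethyl-2-methylcyclopropane), vertices labeled 1 to 6.
EDGES_39 = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]


def adjacency_matrix(n, edges):
    """Adjacency matrix of a simple graph with vertices 1..n."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i - 1][j - 1] = a[j - 1][i - 1] = 1
    return a


def laplacian_matrix(a):
    """L = DEG - A: vertex degrees on the diagonal, -1 for each edge."""
    n = len(a)
    return [[sum(a[i]) if i == j else -a[i][j] for j in range(n)]
            for i in range(n)]


a_39 = adjacency_matrix(6, EDGES_39)
l_39 = laplacian_matrix(a_39)
for row in l_39:
    print(row)
```

Every row of a Laplacian matrix sums to zero, because the degree on the diagonal is cancelled by one entry of -1 for each incident edge; this is a convenient check on any Laplacian written by hand.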
The above formula for L can be applied only to simple molecular graphs. The definition of the Laplacian matrix for VEW graphs uses the vertex valence instead of deg. The valence of the vertex v_i, val(w)_i = val(w, G)_i, is defined as the sum of the weights Ew(w)_ij of all edges e_ij incident with vertex v_i [34]:

\[ \mathrm{val}(w)_i = \sum_{e_{ij} \in E(G)} \mathrm{Ew}(w)_{ij} \tag{10} \]

where w is the weighting scheme used to compute the Ew parameters. Alternatively, the valence of the vertex v_i may be computed as the sum of the non-diagonal elements in the row i, or column i, from the adjacency matrix A(w) = A(w, G) of a molecular graph G with N vertices [34]:

\[ \mathrm{val}(w)_i = \sum_{j=1,\; j \neq i}^{N} [A(w)]_{ij} \tag{11} \]

The set of valence values for all vertices in a graph forms the vector Val(w) = Val(w, G) whose ith element represents the valence of the vertex v_i. In alkanes and cycloalkanes the degree of a vertex v_i, deg_i, is identical with the valence of that vertex, val_i, while for molecules containing heteroatoms and multiple bonds, represented as vertex- and edge-weighted molecular graphs, this equality does not hold. The Laplacian matrix L(w) = L(w, G) of a vertex- and edge-weighted molecular graph G with N vertices is the square N × N real symmetric matrix whose element [L(w)]_ij is defined as [34]:

\[ [L(w)]_{ij} = \begin{cases} \mathrm{val}(w)_i & \text{if } i = j \\ -\mathrm{Ew}(w)_{ij} & \text{if } e_{ij} \in E(G) \\ 0 & \text{if } e_{ij} \notin E(G) \end{cases} \tag{12} \]

where Ew(w)_ij is the weight of the edge e_ij, and w is the weighting scheme used to compute the parameters Ew.

4.5.4 The Distance Matrix
The distance matrix D = D(G) of a graph G with N vertices is the square N × N symmetric matrix whose elements are defined as [7, 15]:

\[ [D]_{ij} = \begin{cases} d_{ij} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases} \tag{13} \]

where d_ij is the graph distance (number of edges on the shortest path) between vertices v_i and v_j. As an example, the distance matrix of the graph 39, D(39), is shown:

D(39)
     1   2   3   4   5   6
1    0   1   2   2   3   4
2    1   0   1   1   2   3
3    2   1   0   1   2   3
4    2   1   1   0   1   2
5    3   2   2   1   0   1
6    4   3   3   2   1   0
The distance matrix D(w) = D(w, G) of a vertex- and edge-weighted molecular graph G with N vertices is the symmetric square N × N matrix with real elements defined with the formula [35]:

\[ [D(w)]_{ij} = \begin{cases} d(w)_{ij} & \text{if } i \neq j \\ \mathrm{Vw}(w)_i & \text{if } i = j \end{cases} \tag{14} \]

where d(w)_ij is the distance between vertices v_i and v_j, Vw(w)_i is the weight of the vertex v_i, and w is the weighting scheme used to compute the parameters Vw and Ew.

4.5.5 The Reciprocal Distance Matrix
The reciprocal distance matrix RD(w) = RD(w, G) of a vertex- and edge-weighted molecular graph G with N vertices is the square N × N symmetric matrix with real elements defined with the equation [29, 36]:

\[ [RD(w)]_{ij} = \begin{cases} 1/[D(w)]_{ij} & \text{if } i \neq j \\ [D(w)]_{ii} & \text{if } i = j \end{cases} \tag{15} \]

where [D(w)]_ij is the graph distance between vertices v_i and v_j, [D(w)]_ii is the diagonal element corresponding to vertex v_i, and w is the weighting scheme used to compute the parameters Vw and Ew. The reciprocal distance matrix of the graph 39 corresponding to 1-ethyl-2-methylcyclopropane is:

RD(39)
     1     2     3     4     5     6
1    0     1     1/2   1/2   1/3   1/4
2    1     0     1     1     1/2   1/3
3    1/2   1     0     1     1/2   1/3
4    1/2   1     1     0     1     1/2
5    1/3   1/2   1/2   1     0     1
6    1/4   1/3   1/3   1/2   1     0
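For a simple graph, the distance matrix can be generated by a breadth-first search from every vertex, and the reciprocal distance matrix then follows by inverting each non-diagonal entry. The sketch below does this for graph 39 with exact fractions; it covers only the non-weighted case (all edges of length 1 and a zero diagonal), and the function names are illustrative.

```python
from collections import deque
from fractions import Fraction

# Edges of graph 39 (1-ethyl-2-methylcyclopropane), vertices labeled 1 to 6.
EDGES_39 = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]


def distance_matrix(n, edges):
    """Topological distance matrix of a connected simple graph, vertices 1..n."""
    adjacency = {v: [] for v in range(1, n + 1)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    d = [[0] * n for _ in range(n)]
    for source in adjacency:
        seen = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            for u in adjacency[v]:
                if u not in seen:
                    seen[u] = seen[v] + 1
                    queue.append(u)
        for target, length in seen.items():
            d[source - 1][target - 1] = length
    return d


def reciprocal_distance_matrix(d):
    """RD for a simple graph: 1/d_ij off the diagonal, 0 on the diagonal."""
    n = len(d)
    return [[Fraction(1, d[i][j]) if i != j else Fraction(0)
             for j in range(n)] for i in range(n)]


d_39 = distance_matrix(6, EDGES_39)
rd_39 = reciprocal_distance_matrix(d_39)
for row in rd_39:
    print([str(x) for x in row])
```

For a weighted graph the breadth-first search would have to be replaced by a shortest-path algorithm that sums the edge parameters Ew, and the diagonal would carry the vertex weights Vw.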
4.5.6 The Detour Matrix
The detour matrix Δ (or the maximum path matrix MP) of a graph G with N vertices is the square N × N symmetric matrix whose element [Δ]_ij is defined as [37]:

\[ [\Delta]_{ij} = \begin{cases} \max(l(p_{ij})) & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases} \tag{16} \]

where l(p_ij) is the length of the path p_ij, and max(l(p_ij)) is the length of the longest path connecting the vertices v_i and v_j. For trees the detour matrix is identical with the distance matrix, because each pair of vertices is joined by a single path. While there are many efficient algorithms for the computation of the distance matrix, the detour matrix is difficult to determine, especially for polycyclic graphs.

4.5.7 The Detour-Distance Matrix

The detour-distance matrix Δ-D (originally called the maximum/minimum path matrix MmP) of a graph G with N vertices is the square N × N non-symmetric matrix that collects in its upper triangle the elements of the detour matrix, while the lower triangle elements are identical to those in the distance matrix [37]:

\[ [\Delta\text{-}D]_{ij} = \begin{cases} [\Delta]_{ij} & \text{if } i < j \\ 0 & \text{if } i = j \\ [D]_{ij} & \text{if } i > j \end{cases} \tag{17} \]
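For small molecular graphs, the detour and distance matrixes can both be obtained by enumerating all paths between every pair of vertices and recording the longest and the shortest length. The following sketch takes this brute-force route, which is an illustrative choice rather than an efficient algorithm (the number of paths grows very quickly in polycyclic graphs); the function names are hypothetical.

```python
def path_length_extremes(n, edges):
    """Return (distance, detour) matrices by enumerating every path."""
    adjacency = {v: [] for v in range(1, n + 1)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    shortest = [[0] * n for _ in range(n)]
    longest = [[0] * n for _ in range(n)]

    def extend(start, current, visited, length):
        for nxt in adjacency[current]:
            if nxt in visited:
                continue  # a path visits every vertex at most once
            i, j, new_length = start - 1, nxt - 1, length + 1
            longest[i][j] = max(longest[i][j], new_length)
            if shortest[i][j] == 0 or new_length < shortest[i][j]:
                shortest[i][j] = new_length
            extend(start, nxt, visited | {nxt}, new_length)

    for v in adjacency:
        extend(v, v, {v}, 0)
    return shortest, longest


def detour_distance_matrix(shortest, longest):
    """Upper triangle from the detour matrix, lower triangle from the distance matrix."""
    n = len(shortest)
    return [[longest[i][j] if i < j else shortest[i][j] for j in range(n)]
            for i in range(n)]


EDGES_39 = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]  # graph 39
EDGES_1 = [(1, 2), (2, 3), (3, 4), (2, 5)]                   # graph 1, a tree

d_39, detour_39 = path_length_extremes(6, EDGES_39)
dd_39 = detour_distance_matrix(d_39, detour_39)
for row in dd_39:
    print(row)
```

In graph 39 the two ring vertices 2 and 3 are adjacent, so their distance is 1, but the longest path between them runs through vertex 4 and has length 2; for the tree 1 the two matrixes coincide.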
4.5.8 The Distance-Detour Quotient Matrix
Another combination of the detour and distance matrixes into a single matrix results in the distance-detour quotient matrix D/Δ, whose elements are defined with the formula [38]:

\[ [D/\Delta]_{ij} = \begin{cases} [D]_{ij}/[\Delta]_{ij} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases} \tag{18} \]
4.5.9 The Distance-Valence Matrix
The distance-valence matrix of a vertex- and edge-weighted graph G with N vertices, Dval(p, q, r, w) = Dval(p, q, r, w, G), is a square N × N matrix whose entries [Dval(p, q, r, w)]_ij are equal to [39]:

\[ [\mathrm{Dval}(p, q, r, w)]_{ij} = \begin{cases} (d(w)_{ij})^{p} \, (\mathrm{val}(w)_i)^{q} \, (\mathrm{val}(w)_j)^{r} & \text{if } i \neq j \\ \mathrm{Vw}(w)_i & \text{if } i = j \end{cases} \tag{19} \]

where Vw(w)_i is the weight of vertex v_i, d(w)_ij is the graph distance between vertices v_i and v_j, and val(w)_i is the valence of vertex v_i, all computed with the weighting scheme w. For vertex- and edge-weighted molecular graphs the definition of the Dval matrix was formulated in analogy with that for the reciprocal distance matrix RD, in such a way that Dval(-1, 0, 0, w, G) = RD(w, G). In the particular case when p = 1 and q = r = 0 the Dval(p, q, r, w) matrix is identical with the distance matrix D(w). From the definition of the Dval matrix one can see that non-symmetric matrixes are obtained if q ≠ r.

4.5.10 The Resistance Distance Matrix
Applying results from electrical network theory, Klein and Randić introduced a new distance function on graphs called the resistance distance [40]. This graph distance was used to define the resistance-distance matrix Ω, proposed as an alternative to the distance matrix D. For the computation of the molecular matrix Ω, Klein and Randić superposed onto the molecular graph G an electrical network of resistors, in such a way that carbon atoms become nodes in the network and carbon-carbon single bonds are represented as 1 ohm resistors; the matrix element Ω_ij is equal to the effective electrical resistance between the vertices v_i and v_j. From the theory of electrical networks it is easy to determine that in the case of acyclic hydrocarbons (i.e. alkanes, alkenes, alkynes, etc.) the resistance-distance matrix Ω is identical with the distance matrix D, while in the case of cyclic compounds the two matrixes are different.
[Figure: molecular graph 42 of ethylcyclobutane, with vertices labeled 1 to 6]

The computation of the resistance distance matrix Ω of a simple (non-weighted) graph is presented for ethylcyclobutane 42, giving:

Ω(42)
     1      2      3      4      5      6
1    0      3/4    1      3/4    1      2
2    3/4    0      3/4    1      7/4    11/4
3    1      3/4    0      3/4    2      3
4    3/4    1      3/4    0      7/4    11/4
5    1      7/4    2      7/4    0      1
6    2      11/4   3      11/4   1      0
We present here a more general definition and procedure for the computation of the resistance-distance matrix 52 of weighted molecular graphs, corresponding to
4.5 Molecular Graph Matrixes
organic compounds with heteroatoms and multiple bonds [41].Consider an electrical network of resistors in which a node (vertex) ui corresponds to a vertex (with the same label) in the molecular graph G, while each chemical bond ey from the molecular graph is represented as a resistor between nodes ui and u,. Each resistor has a value Ewy(w) (in ohm) depending on the chemical nature of the atoms represented by vertices ui and u,, and on the type of the chemical bond between them; the Ewg(w) parameter is computed using the weighting schemes w proposed in the literature. The resistance distance matrix Q(w) = Q(w,G ) of a graph G with N vertexes is the square N x N symmetric matrix whose non-diagonal element fig is equal to the electrical resistance r(w)gbetween vertexes ui and u,:
where $r(w)_{ij}$ is measured in ohm. The algorithm for computing the resistance-distance matrix of a vertex- and edge-weighted molecular graph comprises the following steps:

1. Set up the edge-weighted adjacency matrix of the molecular graph $G$ that contains heteroatoms and multiple bonds:

$$[A(w, G)]_{ij} = \begin{cases} 0 & \text{if } i = j \\ Ew(w)_{ij} & \text{if } e_{ij} \in E(G) \\ 0 & \text{if } e_{ij} \notin E(G) \end{cases}$$

2. Compute the reverse adjacency matrix $A^{-1}(w, G)$:

$$[A^{-1}(w, G)]_{ij} = \begin{cases} 0 & \text{if } i = j \\ 1/[A(w, G)]_{ij} & \text{if } e_{ij} \in E(G) \\ 0 & \text{if } e_{ij} \notin E(G) \end{cases}$$

3. Obtain the Laplacian matrix of the reverse adjacency matrix:

$$[L(w, G)]_{ij} = \begin{cases} \sum_{k=1}^{N} [A^{-1}(w, G)]_{ik} & \text{if } i = j \\ -[A^{-1}(w, G)]_{ij} & \text{if } e_{ij} \in E(G) \\ 0 & \text{if } e_{ij} \notin E(G) \end{cases}$$

4. Calculate the eigenvalues and eigenvectors of the weighted Laplacian matrix $L(w, G)$ of the molecular graph $G$ with $N$ vertices:

$$L = U \Lambda U^{T} \qquad (24)$$

where $U$ is the $N \times N$ matrix whose columns are the eigenvectors of the weighted Laplacian matrix $L$, $U^{T}$ is its transpose, and $\Lambda$ is an $N \times N$ diagonal matrix containing on the main diagonal the eigenvalues of $L$; the eigenvalue $[\Lambda]_{ii}$ corresponds to the eigenvector from the $i$th column of matrix $U$. For any connected molecular graph the Laplacian matrix $L$ has all eigenvalues positive except for one which is 0.

5. The $N \times N$ diagonal matrix $V$ is computed from the eigenvalues of $L$:

$$[V]_{ii} = \begin{cases} 0 & \text{if } [\Lambda]_{ii} = 0 \\ 1/[\Lambda]_{ii} & \text{if } [\Lambda]_{ii} \neq 0 \end{cases}$$

6. The generalized inverse of $L$ is the matrix $\Gamma$, which is 0 on the null eigenspace of $L$ and the "true" inverse on the subspace orthogonal to this null space:

$$\Gamma = U V U^{T}$$

7. The resistance-distance matrix is obtained from $\Gamma$:

$$[\Omega(w, G)]_{ij} = \begin{cases} \Gamma_{ii}(w, G) - 2\Gamma_{ij}(w, G) + \Gamma_{jj}(w, G) & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}$$
The computation of the resistance-distance matrix for cyclobutylmethylamine 43, using edge parameters $Ew(P)$ computed with the molecular polarizability weighting scheme $P$, with a parameter for carbon-nitrogen bonds $Ew(\text{C-N}, P) = 1.60$, gives:

Ω(43)     1      2      3      4      5      6
  1       0     0.75   1.00   0.75   1.60   3.20
  2      0.75    0     0.75   1.00   2.35   3.95
  3      1.00   0.75    0     0.75   2.60   4.20
  4      0.75   1.00   0.75    0     2.35   3.95
  5      1.60   2.35   2.60   2.35    0     1.60
  6      3.20   3.95   4.20   3.95   1.60    0
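The resistance-distance computation can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure of [41]: instead of the eigendecomposition of steps 4 to 6 it uses the equivalent identity $(L + J/N)^{-1} = \Gamma + J/N$ (with $J$ the all-ones matrix), whose $J/N$ contribution cancels in $\Gamma_{ii} - 2\Gamma_{ij} + \Gamma_{jj}$, and it works in exact fractions. The function names, the atom numbering (ring atoms 1 to 4, side chain 1-5-6), and the assignment of the 1.60 ohm value to both exocyclic bonds of 43 are assumptions chosen so that the printed matrixes are reproduced.

```python
from fractions import Fraction

def invert(matrix):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # A connected graph guarantees that a non-zero pivot exists.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def resistance_distance(n, resistors):
    """resistors maps a bond (i, j), 1-based, to its resistance in ohm."""
    lap = [[Fraction(0)] * n for _ in range(n)]
    for (i, j), r in resistors.items():
        g = 1 / Fraction(r)  # conductance: reciprocal of the edge weight
        lap[i - 1][i - 1] += g
        lap[j - 1][j - 1] += g
        lap[i - 1][j - 1] -= g
        lap[j - 1][i - 1] -= g
    # X = (L + J/N)^-1 = Gamma + J/N; the J/N part cancels in the sum below.
    x = invert([[lap[i][j] + Fraction(1, n) for j in range(n)]
                for i in range(n)])
    return [[x[i][i] + x[j][j] - 2 * x[i][j] for j in range(n)]
            for i in range(n)]

ring = {(1, 2): 1, (2, 3): 1, (3, 4): 1, (1, 4): 1}
omega_42 = resistance_distance(6, {**ring, (1, 5): 1, (5, 6): 1})
omega_43 = resistance_distance(
    6, {**ring, (1, 5): Fraction(8, 5), (5, 6): Fraction(8, 5)})
print(omega_42[1][5], float(omega_43[1][5]))
```

Exact rational arithmetic keeps entries such as 3/4 and 11/4 free of rounding noise; for large graphs a floating-point linear algebra library would be the more practical choice.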
4.5.11 The Electrical Conductance Matrix
The reciprocal of the resistance-distance matrix $\Omega$, with non-diagonal elements equal to $1/[\Omega]_{ij}$, defines the electrical conductance matrix $EC(w) = EC(w, G)$ [41]:

$$[EC(w, G)]_{ij} = 1/[\Omega(w, G)]_{ij}, \quad i \neq j$$

where $w$ is the weighting scheme used to compute the parameters $Ew(w)_{ij}$ representing the value of the resistor between vertices $v_i$ and $v_j$. The electrical conductance matrix EC of ethylcyclobutane 42 is:

EC(42)    1      2      3      4      5      6
  1       0     4/3     1     4/3     1     1/2
  2      4/3     0     4/3     1     4/7   4/11
  3       1     4/3     0     4/3    1/2    1/3
  4      4/3     1     4/3     0     4/7   4/11
  5       1     4/7    1/2    4/7     0      1
  6      1/2   4/11    1/3   4/11     1      0
4.5.12 The Distance-Path Matrix
Consider the vertex- and edge-weighted graph $G$ with $N$ vertices and its distance matrix $D(w) = D(w, G)$ computed with the weighting scheme $w$. The distance-path matrix of the weighted graph $G$, $D_p(w) = D_p(w, G)$, is the square $N \times N$ symmetric matrix whose element $[D_p(w)]_{ij}$ is defined by the formula given in [42].
4.5.13 The Reciprocal Distance-Path Matrix
The reciprocal distance-path matrix of a VEW graph $G$ with $N$ vertices, $RD_p(w) = RD_p(w, G)$, is the square $N \times N$ symmetric matrix whose element $[RD_p(w)]_{ij}$ is equal to the reciprocal of the corresponding distance-path matrix element, $1/[D_p(w)]_{ij}$, for non-diagonal elements, and is equal to $[D_p(w)]_{ii}$ for the diagonal elements [42]:

$$[RD_p(w, G)]_{ij} = \begin{cases} 1/[D_p(w, G)]_{ij} & \text{if } i \neq j \\ [D_p(w, G)]_{ii} & \text{if } i = j \end{cases}$$

4.5.14 The Distance Complement Matrix
The distance complement matrix $DC(w, G)$ of a VEW graph $G$ with $N$ vertices is the square $N \times N$ symmetric matrix whose elements are defined as [43]:

$$[DC(w, G)]_{ij} = \begin{cases} N - [D(w, G)]_{ij} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}$$

The distance complement matrix DC of 1,2-dimethylcyclobutane 44 is:

DC(44)    1    2    3    4    5    6
  1       0    5    4    5    5    4
  2       5    0    5    4    4    5
  3       4    5    0    5    3    4
  4       5    4    5    0    4    3
  5       5    4    3    4    0    3
  6       4    5    4    3    3    0
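The distance complement matrix of 44 can be checked with a short Python sketch. It assumes the usual form $N - d_{ij}$ for the off-diagonal elements, and the atom numbering used here (ring atoms 1 to 4, methyl carbon 5 on atom 1, methyl carbon 6 on atom 2) is inferred from the printed matrix; the function names are illustrative.

```python
from collections import deque

def distance_matrix(n, bonds):
    """Topological distances by breadth-first search from every vertex."""
    adj = {v: [] for v in range(1, n + 1)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    dist = []
    for start in range(1, n + 1):
        d = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in d:
                    d[w] = d[u] + 1
                    queue.append(w)
        dist.append([d[v] for v in range(1, n + 1)])
    return dist

def distance_complement(dist):
    """Off-diagonal elements N - d_ij, zero diagonal."""
    n = len(dist)
    return [[n - dist[i][j] if i != j else 0 for j in range(n)]
            for i in range(n)]

bonds_44 = [(1, 2), (2, 3), (3, 4), (1, 4), (1, 5), (2, 6)]
dc_44 = distance_complement(distance_matrix(6, bonds_44))
for row in dc_44:
    print(row)
```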
4.5.15 The Reciprocal Distance Complement Matrix
Using a procedure similar to that employed in the computation of reciprocal matrixes one can transform the distance complement matrix into the reciprocal distance complement matrix. The reciprocal distance complement matrix $RDC(w) = RDC(w, G)$ of a vertex- and edge-weighted molecular graph $G$ with $N$ vertices is the square $N \times N$ symmetric matrix with real elements defined with the equation [35, 44]:

$$[RDC(w, G)]_{ij} = \begin{cases} 1/[DC(w, G)]_{ij} & \text{if } i \neq j \\ [DC(w, G)]_{ii} & \text{if } i = j \end{cases} \qquad (32)$$

where $[DC(w)]_{ij}$ is the graph distance complement between vertices $v_i$ and $v_j$, $[DC(w)]_{ii}$ is the diagonal element corresponding to vertex $v_i$, and $w$ is the weighting scheme used to compute the vertex and edge parameters $Vw$ and $Ew$.

4.5.16 The Complementary Distance Matrix
The complementary distance matrix CD is another matrix in which the value of the matrix elements corresponding to pairs of vertices decreases when the distance between the vertices increases. The complementary distance matrix $CD(w) = CD(w, G)$ of a vertex- and edge-weighted molecular graph $G$ with $N$ vertices is the square $N \times N$ symmetric matrix whose elements are defined as [35, 45]:

$$[CD(w, G)]_{ij} = \begin{cases} d(w)_{max} + d(w)_{min} - [D(w, G)]_{ij} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}$$

where $[D(w)]_{ij}$ is the $ij$th element of the distance matrix $D(w)$, which is equal to the graph distance between vertices $v_i$ and $v_j$, $d(w)_{max}$ is the maximum distance between two distinct graph vertices (the graph diameter), and $d(w)_{min}$ is the minimum distance between two distinct graph vertices (equal to 1 for alkanes and cycloalkanes):

$$d(w)_{max} = \max_{i \neq j} [D(w)]_{ij}, \qquad d(w)_{min} = \min_{i \neq j} [D(w)]_{ij}$$
4.5.17 The Reciprocal Complementary Distance Matrix

The reciprocal complementary distance matrix $RCD(w, G)$ of a vertex- and edge-weighted molecular graph $G$ with $N$ vertices is the square $N \times N$ symmetric matrix defined with the equation [35, 45]:

$$[RCD(w, G)]_{ij} = \begin{cases} 1/[CD(w, G)]_{ij} & \text{if } i \neq j \\ [CD(w, G)]_{ii} & \text{if } i = j \end{cases}$$
4.5.18 The Reverse Wiener Matrix
The reverse Wiener matrix $RW(w, G)$ of a vertex- and edge-weighted molecular graph $G$ with $N$ vertices is the square $N \times N$ symmetric matrix whose elements are obtained by subtracting from $d_{max}$ each $d_{ij}$ value in the distance matrix [46]:

$$[RW(w, G)]_{ij} = \begin{cases} d_{max} - [D]_{ij} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}$$

where $[D]_{ij}$ is the $ij$th element of the distance matrix $D$, which is equal to the graph distance between vertices $v_i$ and $v_j$.

4.5.19 The Reciprocal Reverse Wiener Matrix

The reciprocal reverse Wiener matrix $RRW(w, G)$ of a vertex- and edge-weighted molecular graph $G$ with $N$ vertices is the square $N \times N$ symmetric matrix defined with the equation [35, 44]:

$$[RRW(w, G)]_{ij} = \begin{cases} 1/[RW(w, G)]_{ij} & \text{if } i \neq j \\ [RW(w, G)]_{ii} & \text{if } i = j \end{cases}$$
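A minimal sketch of the complementary distance and reverse Wiener matrixes follows. It assumes the forms $d_{max} + d_{min} - d_{ij}$ and $d_{max} - d_{ij}$ for the off-diagonal elements and a zero diagonal; the Floyd-Warshall routine is used so that weighted edges are handled as well. The function names and the numbering of 1,2-dimethylcyclobutane (ring atoms 1 to 4, methyl carbons 5 and 6 on atoms 1 and 2) are assumptions made for the illustration.

```python
def weighted_distances(n, edges):
    """Floyd-Warshall; edges maps a bond (i, j), 1-based, to its weight."""
    inf = float("inf")
    d = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for (i, j), w in edges.items():
        d[i - 1][j - 1] = d[j - 1][i - 1] = w
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def complementary_and_reverse_wiener(d):
    """Return (CD, RW) built from the distance matrix d."""
    n = len(d)
    off_diagonal = [d[i][j] for i in range(n) for j in range(n) if i != j]
    d_max, d_min = max(off_diagonal), min(off_diagonal)
    cd = [[d_max + d_min - d[i][j] if i != j else 0 for j in range(n)]
          for i in range(n)]
    rw = [[d_max - d[i][j] if i != j else 0 for j in range(n)]
          for i in range(n)]
    return cd, rw

edges_44 = {(1, 2): 1, (2, 3): 1, (3, 4): 1, (1, 4): 1, (1, 5): 1, (2, 6): 1}
cd_44, rw_44 = complementary_and_reverse_wiener(weighted_distances(6, edges_44))
print(cd_44[0], rw_44[0])
```

For this graph $d_{max} = 3$ and $d_{min} = 1$, so the two matrixes differ by one in every off-diagonal position; the reciprocal matrixes RCD and RRW are undefined wherever the corresponding element is zero, which is the case for pairs at distance $d_{max}$ in RW.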
4.5.20 The Szeged Matrix
Consider a graph $G$ with $N$ vertices labeled from 1 to $N$; for any pair of vertices $v_i$ and $v_j$ from $G$, $v_i, v_j \in V(G)$, $n_{ij}$ represents the number of vertices $v_k$ of the molecular graph $G$ having the property $td_{ki} < td_{kj}$, and $n_{ji}$ represents the number of vertices $v_k$ with the property $td_{kj} < td_{ki}$ [47]:

$$n_{ij} = |\{v_k \in V(G) : td_{ki} < td_{kj}\}|$$

$$n_{ji} = |\{v_k \in V(G) : td_{kj} < td_{ki}\}|$$

From the above two definitions it is clear that vertex $v_i$ is counted in $n_{ij}$ while vertex $v_j$ is counted in $n_{ji}$; if a vertex $v_k$ is situated at the same topological distance from vertices $v_i$ and $v_j$, i.e. $td_{ki} = td_{kj}$, the vertex is counted neither in $n_{ij}$ nor in $n_{ji}$. For any pair of vertices $v_i$ and $v_j$ from $G$, $n_{ij}$ gives the number of vertices closer to vertex $v_i$ and $n_{ji}$ gives the number of vertices closer to vertex $v_j$. Diudea used the $n_{ij}$ and $n_{ji}$ numbers to define three new molecular matrixes [47]. For a graph $G$ with $N$ vertices, the Szeged matrix $Sz_u = Sz_u(G)$ is a square $N \times N$ non-symmetric matrix defined by [47]:

$$[Sz_u]_{ij} = \begin{cases} n_{ij} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}$$

The non-symmetric Szeged matrix $Sz_u(44)$ of 1,2-dimethylcyclobutane 44 is:

Sz_u(44)  1    2    3    4    5    6
  1       0    3    2    4    5    3
  2       3    0    4    2    3    5
  3       1    2    0    3    4    2
  4       2    1    3    0    2    4
  5       1    1    2    1    0    3
  6       1    1    1    2    3    0
The path Szeged matrix $Sz_p = Sz_p(G)$ is a square $N \times N$ symmetric matrix obtained from $Sz_u$ through a symmetrization operation [47]:

$$[Sz_p]_{ij} = [Sz_u]_{ij} [Sz_u]_{ji}$$

The symmetric path Szeged matrix $Sz_p(44)$ of 1,2-dimethylcyclobutane is:

Sz_p(44)  1    2    3    4    5    6
  1       0    9    2    8    5    3
  2       9    0    8    2    3    5
  3       2    8    0    9    8    2
  4       8    2    9    0    2    8
  5       5    3    8    2    0    9
  6       3    5    2    8    9    0
The edge Szeged matrix $Sz_e = Sz_e(G)$ is a square $N \times N$ symmetric matrix obtained from $Sz_u$ through a similar symmetrization operation [47]:

$$[Sz_e]_{ij} = \begin{cases} n_{ij} n_{ji} & \text{if } i \neq j \text{ and } e_{ij} \in E(G) \\ 0 & \text{if } i = j \text{ or } e_{ij} \notin E(G) \end{cases} \qquad (43)$$

where $E(G)$ represents the edge set of $G$. An alternative definition of the $Sz_e$ matrix is:

$$[Sz_e]_{ij} = [Sz_p]_{ij} [A]_{ij}$$

where $[A]_{ij}$ is the element of the adjacency matrix $A = A(G)$. The edge Szeged matrix of 1,2-dimethylcyclobutane, $Sz_e(44)$, is:

Sz_e(44)  1    2    3    4    5    6
  1       0    9    0    8    5    0
  2       9    0    8    0    0    5
  3       0    8    0    9    0    0
  4       8    0    9    0    0    0
  5       5    0    0    0    0    0
  6       0    5    0    0    0    0
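The three Szeged matrixes can be generated directly from the vertex counts $n_{ij}$. The sketch below is illustrative only: the function names are hypothetical, the non-symmetric element in row $i$, column $j$ is taken as $n_{ij}$, and 1,2-dimethylcyclobutane is numbered with ring atoms 1 to 4 and methyl carbons 5 and 6 on atoms 1 and 2, a numbering inferred from the printed matrixes.

```python
def topological_distances(n, bonds):
    """All-pairs shortest path lengths (unit edge weights)."""
    inf = float("inf")
    d = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for i, j in bonds:
        d[i - 1][j - 1] = d[j - 1][i - 1] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

def szeged_matrices(n, bonds):
    """Return (Sz_u, Sz_p, Sz_e) for a simple graph with vertices 1..n."""
    d = topological_distances(n, bonds)
    edge_set = {frozenset(b) for b in bonds}
    # n_ij: number of vertices strictly closer to v_i than to v_j
    sz_u = [[sum(d[k][i] < d[k][j] for k in range(n)) if i != j else 0
             for j in range(n)] for i in range(n)]
    # Symmetrization: product of the two opposite counts
    sz_p = [[sz_u[i][j] * sz_u[j][i] for j in range(n)] for i in range(n)]
    # Keep the product only for bonded pairs
    sz_e = [[sz_p[i][j] if frozenset((i + 1, j + 1)) in edge_set else 0
             for j in range(n)] for i in range(n)]
    return sz_u, sz_p, sz_e

bonds_44 = [(1, 2), (2, 3), (3, 4), (1, 4), (1, 5), (2, 6)]
sz_u, sz_p, sz_e = szeged_matrices(6, bonds_44)
print(sz_p[0], sz_e[0])
```

The first rows of the path and edge Szeged matrixes come out as 0 9 2 8 5 3 and 0 9 0 8 5 0, in agreement with the values printed for 44.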
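The edge Szeged matrix $Sz_e(w) = Sz_e(w, G)$ of a VEW molecular graph $G$ with $N$ vertices is the square $N \times N$ real symmetric matrix whose element $[Sz_e(w)]_{ij}$ is [35]:

$$[Sz_e(w, G)]_{ij} = \begin{cases} Vw(w)_i & \text{if } i = j \\ n_{ij} n_{ji} Ew(w)_{ij} & \text{if } e_{ij} \in E(G) \\ 0 & \text{if } e_{ij} \notin E(G) \end{cases} \qquad (45)$$

where $Vw(w)_i$ is the weight of the vertex $v_i$, $Ew(w)_{ij}$ is the weight of the edge $e_{ij}$, and $w$ is the weighting scheme used to compute the parameters $Vw$ and $Ew$.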
The edge Szeged matrix $Sz_e$ with edge weights obtained from the atomic polarizability weighting scheme $P$ is given for 1-chloro-2-fluorocyclobutane 45:

Sz_e(P, 45)   1        2        3        4        5        6
  1           0      9.000      0      8.000    4.037      0
  2         9.000      0      8.000      0        0     15.799
  3           0      8.000      0      9.000      0        0
  4         8.000      0      9.000      0        0        0
  5         4.037      0        0        0      0.193      0
  6           0     15.799      0        0        0     -2.160
4.6 Molecular Graph Polynomials
Polynomials and spectra derived from molecular graphs and matrixes have important applications in chemistry in connection with the molecular orbital theory of unsaturated compounds [4, 7, 17-19, 48]. The characteristic and acyclic polynomials of conjugated molecules were used to define several aromaticity indices. Various topological indices and other QSAR descriptors were derived from graph polynomials, spectra, and spectral moments. We present here the molecular graph polynomials that are used in the definition of structural descriptors and topological indices.

4.6.1 The Characteristic Polynomial
The characteristic polynomial $Ch(G, x)$ of a molecular graph $G$ is the characteristic polynomial of its adjacency matrix $A(G)$ [4, 7, 48]:

$$Ch(G, x) = \det(xI - A) = \sum_{n=0}^{N} c_n x^{N-n}$$

where $I$ is the unit matrix of order $N$ and $c_n$ is the $n$th coefficient of the characteristic polynomial. A graph eigenvalue $x_i$ is a zero of the characteristic polynomial:

$$Ch(G, x_i) = 0 \qquad (47)$$

for $i = 1$ to $N$. The complete set of graph eigenvalues $x_1, x_2, \ldots, x_N$ forms the spectrum of the graph $G$, $Sp(A, G) = \{x_i, i = 1, 2, \ldots, N\}$. The characteristic polynomials and spectra for several linear graphs $L_N$ and cyclic graphs $R_N$ with $N$ vertices are presented below:

$Ch(L_0) = 1$
$Ch(L_1) = x$
$Ch(L_2) = x^2 - 1$
$Ch(L_3) = x^3 - 2x$
$Ch(L_4) = x^4 - 3x^2 + 1$
$Ch(L_5) = x^5 - 4x^3 + 3x$
$Ch(L_6) = x^6 - 5x^4 + 6x^2 - 1$
$Ch(R_3) = x^3 - 3x - 2$
$Ch(R_4) = x^4 - 4x^2$
$Ch(R_5) = x^5 - 5x^3 + 5x - 2$
$Ch(R_6) = x^6 - 6x^4 + 9x^2 - 4$

$Sp(A, L_2) = \{1, -1\}$
$Sp(A, L_3) = \{1.41421, 0, -1.41421\}$
$Sp(A, L_4) = \{1.61803, 0.61803, -0.61803, -1.61803\}$
$Sp(A, L_5) = \{1.73205, 1, 0, -1, -1.73205\}$
$Sp(A, L_6) = \{1.80194, 1.24698, 0.44504, -0.44504, -1.24698, -1.80194\}$
$Sp(A, R_3) = \{2, -1, -1\}$
$Sp(A, R_4) = \{2, 0, 0, -2\}$
$Sp(A, R_5) = \{2, 0.61803, 0.61803, -1.61803, -1.61803\}$
$Sp(A, R_6) = \{2, 1, 1, -1, -1, -2\}$
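The coefficients listed above can be reproduced with a short sketch. The Faddeev-LeVerrier recurrence used here is one convenient way to expand $\det(xI - A)$ exactly; it is chosen for the illustration and is not prescribed by the text, and the helper names `path` and `ring` are hypothetical.

```python
from fractions import Fraction

def adjacency(n, bonds):
    a = [[0] * n for _ in range(n)]
    for i, j in bonds:
        a[i - 1][j - 1] = a[j - 1][i - 1] = 1
    return a

def char_poly(a):
    """Coefficients c_0..c_N of det(xI - A) = sum_n c_n x^(N-n)."""
    n = len(a)
    coeffs = [Fraction(1)]
    m = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]  # M_1 = I
    for k in range(1, n + 1):
        am = [[sum(a[i][t] * m[t][j] for t in range(n)) for j in range(n)]
              for i in range(n)]
        c = -sum(am[i][i] for i in range(n)) / k      # c_k = -Tr(A M_k) / k
        coeffs.append(c)
        m = [[am[i][j] + (c if i == j else 0) for j in range(n)]
             for i in range(n)]                       # M_(k+1) = A M_k + c_k I
    return [int(c) for c in coeffs]

def path(n):
    """Bonds of the linear graph L_n."""
    return [(i, i + 1) for i in range(1, n)]

def ring(n):
    """Bonds of the cyclic graph R_n."""
    return path(n) + [(1, n)]

ch_l4 = char_poly(adjacency(4, path(4)))
ch_r6 = char_poly(adjacency(6, ring(6)))
print(ch_l4, ch_r6)
```

The coefficient lists read from the highest power downwards, so `[1, 0, -3, 0, 1]` stands for $x^4 - 3x^2 + 1$.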
The spectral moment of order $k$, $SM_k$, is defined as:

$$SM_k = \sum_{i=1}^{N} x_i^k = \mathrm{Tr}\, A^k = \sum_{i=1}^{N} [A^k]_{ii} \qquad (48)$$

An important result is the geometric interpretation of Eq. (48). The number of self-returning walks of length $k$ may be computed from the diagonal elements of the $k$th power of the adjacency matrix $A$, because each diagonal element $[A^k]_{ii}$ of the matrix $A^k$ is equal to the number of self-returning walks of length $k$ from/to vertex $v_i$. It was initially conjectured that the characteristic polynomial and its spectrum might be used as unique descriptors of graphs; however, non-isomorphic graphs with identical spectrum, spectral moments and characteristic polynomial were found, and they were called isospectral or cospectral graphs.
The pair of isospectral 4-trees representing 2,3-dimethylheptane 46 and 3-ethyl-5-methylhexane 47 has the same characteristic polynomial:

$$Ch(46) = Ch(47) = x^9 - 8x^7 + 19x^5 - 14x^3 + 2x$$
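The spectral moments of the two trees can be compared numerically as traces of powers of the adjacency matrix. In the sketch below the carbon numbering of both trees is an assumption (main chain numbered first, branch carbons last), and the function names are illustrative.

```python
def adjacency(n, bonds):
    a = [[0] * n for _ in range(n)]
    for i, j in bonds:
        a[i - 1][j - 1] = a[j - 1][i - 1] = 1
    return a

def spectral_moments(a, k_max):
    """SM_k = Tr(A^k) for k = 1..k_max: counts of self-returning walks."""
    n = len(a)
    power = [row[:] for row in a]          # holds A^k, starting with k = 1
    moments = []
    for _ in range(k_max):
        moments.append(sum(power[i][i] for i in range(n)))
        power = [[sum(power[i][t] * a[t][j] for t in range(n))
                  for j in range(n)] for i in range(n)]
    return moments

# 46: heptane chain 1-7 with methyl carbons 8 and 9 on atoms 2 and 3
tree_46 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (2, 8), (3, 9)]
# 47: hexane chain 1-6, ethyl group 7-8 on atom 3, methyl carbon 9 on atom 5
tree_47 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (3, 7), (7, 8), (5, 9)]

moments_46 = spectral_moments(adjacency(9, tree_46), 9)
moments_47 = spectral_moments(adjacency(9, tree_47), 9)
print(moments_46)
print(moments_46 == moments_47)
```

The odd moments vanish because trees contain no odd cycles, and $SM_2$ equals twice the number of bonds; equal moments up to order $N$ are what one expects for two graphs that share a characteristic polynomial.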
4.6.2 The Acyclic (Matching) Polynomial
The acyclic (matching) polynomial of a graph $G$ is defined by the equation [48]:

$$Ac(G, x) = \sum_{k=0}^{N/2} (-1)^k m(G, k)\, x^{N-2k} = \sum_{n=0}^{N} a_n x^{N-n} \qquad (49)$$

where $m(G, k)$ denotes the number of $k$-matchings of $G$. This polynomial plays an important role in statistical physics (theory of monomer-dimer systems), in developing topological indices for quantitative structure-property relationships, and in quantum organic chemistry (topological resonance energy).

4.6.3 The Characteristic Polynomial of a Molecular Matrix
In the previous section we introduced the most important molecular graph matrixes, such as the adjacency, distance, resistance distance, or Szeged matrixes. Any symmetric molecular matrix can be used to derive a corresponding characteristic polynomial and matrix spectrum, which are the basis of computing various topological indices for QSAR. The characteristic polynomial of the molecular matrix M of the graph G is:
$$Ch(M, G) = \det(xI - M) \qquad (50)$$

The spectrum of the molecular matrix $M$, $Sp(M, G) = \{x_i, i = 1, 2, \ldots, N\}$, is formed by the complete set of eigenvalues of $M$.
4.7 Enumeration of Kekulé Structures

The number of Kekulé structures (Kekulé structure count, KSC) of a conjugated hydrocarbon gives information on the thermodynamic stability and chemical reactivity and is used for the computation of various structural indices for benzenoid hydrocarbons [4, 7, 19-22]: the Pauling bond order, PBO; the stability index SI, $SI = KSC^{2/N}$; the π-electron energy. Kekulé structures are considered in valence-bond resonance-theoretical models such as the Pauling-Wheland model, the Simpson-Herndon model, and the conjugated-circuits model. The calculation of the KSC can be performed with recurrence relationships, matrix methods, and explicit combinatorial expressions derived for a large number of classes of conjugated hydrocarbons: various classes of cata-condensed benzenoid hydrocarbons, honeycomb lattice strips, polymers. Some general expressions and recurrence relationships which allow the computation of KSC for benzenoid or general conjugated molecular graphs are presented. The determinant of the adjacency matrix $A$ of the Hückel graph $G$ of a benzenoid hydrocarbon is related to the number of Kekulé structures in $G$ [2]:

$$\det A(G) = (-1)^{N/2} KSC(G)^2 \qquad (51)$$
If $G$ is the Hückel graph of a conjugated hydrocarbon then [4, 7, 19]:

$$KSC(G) = KSC(G - e_{ij}) + KSC(G - v_i - v_j) \qquad (52)$$

where $e_{ij}$ is the edge between the adjacent vertices $v_i$ and $v_j$. When the vertex $v_i$ is of degree one, Eq. (52) becomes [4, 7, 19]:

$$KSC(G) = KSC(G - v_i - v_j)$$

If the conjugated system $G$ is an essentially disconnected benzenoid composed of two non-interacting fragments $G_1$ and $G_2$, then [4, 7, 19]:

$$KSC(G) = KSC(G_1)\, KSC(G_2) \qquad (54)$$

If $G$ is the Hückel graph of a benzenoid hydrocarbon with $N$ vertices, then the free term of the characteristic polynomial, i.e. the coefficient of $x^0$, $Ch(G, 0)$, is related to $KSC(G)$ [4, 7, 19]:

$$Ch(G, 0) = (-1)^{N/2} KSC(G)^2 \qquad (55)$$

If $G$ is the Hückel graph of a conjugated hydrocarbon with $N$ vertices, then the free term of the acyclic polynomial, i.e. the coefficient of $x^0$, $Ac(G, 0)$, is connected with $KSC(G)$ [4, 7, 19]:

$$Ac(G, 0) = (-1)^{N/2} KSC(G) \qquad (56)$$

If $G$ is the Hückel graph of a benzenoid hydrocarbon, then:

$$KSC(G - e_{ij})\, KSC(G - v_i - v_j) = \sum_{k=1}^{r} KSC(G - C_k)^2 \qquad (57)$$

where $C_k$ is a cycle of $G$ and the summation goes over all $r$ cycles in $G$ containing the edge $e_{ij}$. If $G$ is the Hückel graph of a benzenoid hydrocarbon then $KSC(G)$ can be expressed as a function of the number of Kekulé structures of subgraphs of $G$ associated with the edge $e_{ij}$:
$$KSC(G)^2 = KSC(G - e_{ij})^2 + KSC(G - v_i - v_j)^2 + 2 \sum_{k=1}^{r} KSC(G - C_k)^2 \qquad (58)$$

where the summation goes over all $r$ cycles in $G$ which contain the edge $e_{ij}$. If $G$ is the Hückel graph of a benzenoid hydrocarbon then $KSC(G)$ is related to the number of Kekulé structures of the subgraphs of $G$ obtained after the removal of a vertex $v_i$ [49]:

$$KSC(G)^2 = \sum_{j=1}^{deg_i} KSC(G - v_i - v_j)^2 + 2 \sum_{k=1}^{r} KSC(G - C_k)^2$$

where the first summation goes over all $deg_i$ vertices adjacent to vertex $v_i$, and the second summation goes over all $r$ cycles $C_k$ which contain vertex $v_i$. If $G$ is the Hückel graph of a conjugated hydrocarbon then $KSC(G)$ can be computed from the number of Kekulé structures of the subgraphs of $G$ corresponding to the decomposition at its vertex $v_i$ [49]:

$$KSC(G) = \sum_{j=1}^{deg_i} KSC(G - v_i - v_j)$$

where the summation goes over all $deg_i$ vertices adjacent to vertex $v_i$. The Pauling bond order $PBO_{ij}$ between two adjacent vertices $v_i$ and $v_j$ from the Hückel graph of a benzenoid hydrocarbon can be computed with the equation [19, 20]:

$$PBO_{ij} = KSC(G - v_i - v_j)/KSC(G) = 1 - KSC(G - e_{ij})/KSC(G) \qquad (61)$$
The Pauling bond order was successfully correlated with experimentally determined bond lengths of various benzenoid hydrocarbons.
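The vertex decomposition of KSC and the Pauling bond order of Eq. (61) translate into a short recursive sketch. The function names and the atom numbering of naphthalene (perimeter 1-2-3-4-10-5-6-7-8-9 with the central bond 9-10) are assumptions made for this illustration; the recursion is exponential in the worst case and is meant only for small molecules.

```python
from fractions import Fraction

def ksc(vertices, bonds):
    """Kekulé structure count via KSC(G) = sum over neighbours v_j of
    KSC(G - v_i - v_j), with v_i the lowest-numbered remaining vertex."""
    if not vertices:
        return 1                      # the empty graph has one (empty) matching
    vi = min(vertices)
    total = 0
    for a, b in bonds:
        if vi in (a, b):
            vj = b if a == vi else a
            rest = vertices - {vi, vj}
            sub = [e for e in bonds if e[0] in rest and e[1] in rest]
            total += ksc(rest, sub)
    return total                      # stays 0 if v_i has no neighbour left

def pauling_bond_order(vertices, bonds, bond):
    """PBO_ij = KSC(G - v_i - v_j) / KSC(G)."""
    rest = vertices - set(bond)
    sub = [e for e in bonds if e[0] in rest and e[1] in rest]
    return Fraction(ksc(rest, sub), ksc(vertices, bonds))

benzene_atoms = frozenset(range(1, 7))
benzene_bonds = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6)]

naphthalene_atoms = frozenset(range(1, 11))
naphthalene_bonds = [(1, 2), (2, 3), (3, 4), (4, 10), (10, 5), (5, 6),
                     (6, 7), (7, 8), (8, 9), (9, 1), (9, 10)]

ksc_benzene = ksc(benzene_atoms, benzene_bonds)
ksc_naphthalene = ksc(naphthalene_atoms, naphthalene_bonds)
pbo_12 = pauling_bond_order(naphthalene_atoms, naphthalene_bonds, (1, 2))
pbo_central = pauling_bond_order(naphthalene_atoms, naphthalene_bonds, (9, 10))
print(ksc_benzene, ksc_naphthalene, pbo_12, pbo_central)
```

With this numbering the 1-2 bond of naphthalene is double in two of the three Kekulé structures and the central bond in one of them.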
4.8 Molecular Graphs and Hückel Molecular Orbital Theory
The Hückel theory, the first molecular orbital theory used in organic chemistry, provided qualitative information on the structure and reactivity of conjugated compounds. There is a close connection between Hückel theory and the graph spectral theory of Hückel molecular graphs: the energies (in β units) of the Hückel molecular orbitals of a conjugated molecule are identical with the spectrum of the adjacency matrix of the Hückel molecular graph of the respective molecule. Various quantities obtained from the Hückel molecular orbitals and energies (total π-electron energy, charge density, bond order, free valence, absolute hardness, etc.) can be derived from the spectral analysis of the Hückel molecular graph or from empirical equations relating them to selected graph invariants [4, 7, 17, 18].
4.9 The Topological Resonance Energy
The definition of the resonance energy is:

$$RE = E_{\pi,mol} - E_{\pi,ref}$$

where $E_{\pi,mol}$ and $E_{\pi,ref}$ are the π-electron energies of a conjugated molecule and the corresponding reference structure, respectively. Both $E_{\pi}$ terms can be computed at different levels of theory, from Hückel to ab initio methods. There are many proposals on the definition of the reference structure, which is a hypothetical structure containing the same number of π-electrons as the corresponding conjugated molecule but without conjugation. The topological resonance energy (TRE) defined by Aihara [50] and by Gutman, Milun, and Trinajstić [51] is computed at the Hückel level:

$$TRE = \sum_{i=1}^{N} g_i (x_i - y_i)$$

where $x_i$ is the $i$th value from the characteristic polynomial spectrum, $y_i$ is the $i$th value from the acyclic polynomial spectrum, and $g_i$ is the electron occupancy of the $i$th energy level. The TRE can be normalized by dividing its value by the total number of π-electrons, $N_{\pi}$, giving the TRE per π-electron (TREPE) value:

$$TREPE = TRE/N_{\pi} \qquad (64)$$

TREPE represents the conjugation stabilization or destabilization that one π-electron contributes to the molecular π system. TRE values were computed for a large variety of conjugated compounds, including conjugated ions, radicals, and ion radicals, coumarins, thiocoumarins, and bridged heteroannulenes [7]. Selected TRE and TREPE values for conjugated hydrocarbons are presented in Table 4-5.
Tab. 4-5 TRE and TREPE values for conjugated hydrocarbons (in β units)

molecule                 TRE       TREPE
benzene                  0.276     0.046
naphthalene              0.390     0.039
anthracene               0.476     0.034
phenanthrene             0.546     0.039
cyclobutadiene          -1.228    -0.307
benzocyclobutadiene     -0.392    -0.049
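The TRE definition can be followed step by step for benzene. The sketch below is an illustration under stated assumptions: the characteristic spectrum of the six-membered ring is taken as {2, 1, 1, -1, -1, -2}, the acyclic polynomial of the ring is written from its matching numbers (1, 6, 9, 2) as $x^6 - 6x^4 + 9x^2 - 2$, the three highest levels of each spectrum are doubly occupied, and the roots are located with a simple grid-and-bisection routine that only finds simple real roots. All function names are hypothetical.

```python
def poly_value(coeffs, x):
    """Horner evaluation; coeffs run from the highest power downwards."""
    value = 0.0
    for c in coeffs:
        value = value * x + c
    return value

def simple_real_roots(coeffs, lo=-3.0, hi=3.0, steps=6000):
    """Bracket sign changes on a grid, then bisect (simple roots only)."""
    roots = []
    h = (hi - lo) / steps
    for s in range(steps):
        a, b = lo + s * h, lo + (s + 1) * h
        fa, fb = poly_value(coeffs, a), poly_value(coeffs, b)
        if fa == 0.0:
            roots.append(a)
        elif fa * fb < 0:
            for _ in range(60):
                mid = (a + b) / 2
                fm = poly_value(coeffs, mid)
                if fa * fm <= 0:
                    b = mid
                else:
                    a, fa = mid, fm
            roots.append((a + b) / 2)
    return sorted(roots, reverse=True)

x_levels = [2, 1, 1, -1, -1, -2]                 # characteristic spectrum of the 6-ring
ac_benzene = [1, 0, -6, 0, 9, 0, -2]             # assumed acyclic polynomial coefficients
y_levels = simple_real_roots(ac_benzene)         # acyclic polynomial spectrum
occupancy = [2, 2, 2, 0, 0, 0]                   # six pi electrons

tre = sum(g * (x - y) for g, x, y in zip(occupancy, x_levels, y_levels))
trepe = tre / sum(occupancy)
print(round(tre, 4), round(trepe, 4))
```

Under these assumptions the sketch gives a TRE of about 0.273 and a TREPE of about 0.045 (in β units), close to but not identical with the benzene entries of Table 4-5.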
4.10 Isomer Enumeration
Isomers are stable chemical compounds with an identical molecular formula but different structure, conformation or configuration which display different physicochemical properties. Isomer enumeration represents an important graph theory application in chemistry; the most important isomer enumeration algorithms are presented in a number of reviews [52, 53]. Cayley developed generating functions for the enumeration of alkanes and alkyl radicals and produced extensive numerical data on the number of isomers with various molecular formulas. Pólya introduced the most powerful isomer enumeration algorithm by using the molecular symmetry, weighting factors, and generating functions [52]. The mathematical background of the Pólya enumeration method is presented here, together with an example. An arrangement of a set of objects is an ordering of these objects. A permutation π is an operation that changes one arrangement into another arrangement and can be represented in a two-row notation by the expression:

$$\pi = \begin{pmatrix} 1 & 2 & 3 & \cdots & i & \cdots & N \\ p_1 & p_2 & p_3 & \cdots & p_i & \cdots & p_N \end{pmatrix}$$
with the meaning that object 1 is permuted to object $p_1$, object 2 is permuted to object $p_2$, object 3 is permuted to object $p_3$, object $i$ is permuted to object $p_i$, and object $N$ is permuted to object $p_N$. A permutation of a set of $n$ objects is called a cyclic permutation (or cycle) of length $m$ if it moves the object in position $p_1$ to position $p_2$, the object in position $p_2$ to position $p_3$, ..., the object in position $p_{m-1}$ to position $p_m$, the object in position $p_m$ to position $p_1$, and all other objects are left fixed. A cyclic permutation is denoted by $((p_1, p_2, p_3, \ldots, p_{m-1}, p_m))$. A transposition is a cycle of length 2.

Consider four distinct objects $a_1$, $a_2$, $a_3$, and $a_4$ and four arrangements of the four objects: $A_1 = (a_1\ a_2\ a_3\ a_4)$; $A_2 = (a_1\ a_3\ a_2\ a_4)$; $A_3 = (a_2\ a_3\ a_1\ a_4)$; $A_4 = (a_2\ a_3\ a_4\ a_1)$. The permutation which changes $A_1$ to $A_2$ is the transposition $\pi_1 = ((2\ 3))$; the permutation which changes $A_1$ to $A_3$ is a cycle of length 3 denoted $\pi_2 = ((1\ 2\ 3))$; the permutation which changes $A_1$ to $A_4$ is a cycle of length 4 denoted $\pi_3 = ((1\ 2\ 3\ 4))$. If we indicate also the objects that remain fixed in a permutation, the first two permutations are denoted: $\pi_1 = ((1)(2\ 3)(4))$; $\pi_2 = ((1\ 2\ 3)(4))$.

The set of all possible permutations that can be applied to a set of $n$ objects, together with the composition operation, forms a permutation group. If one considers a set containing three objects then the six permutations of its group, denoted $S_3$, are: $p_1 = ((1)(2)(3))$, $p_2 = ((1\ 2)(3))$, $p_3 = ((1\ 3)(2))$, $p_4 = ((1)(2\ 3))$, $p_5 = ((1\ 2\ 3))$, $p_6 = ((1\ 3\ 2))$. For a group of $n$ objects the set of all possible permutations corresponds to the group $S_n$, the symmetric group of degree $n$, containing $n!$ permutations. Two permutations are disjoint if they act on mutually exclusive sets of objects in an arrangement.

To any permutation π of $n$ objects one assigns a monomial $s(\pi)$ in the variables $s_k$, where each factor $s_k$ corresponds to a cyclic permutation of length $k$ in the unique product of disjoint cycles of π. A fixed object corresponds to a factor $s_1$, $m$ fixed objects correspond to $s_1^m$, and a transposition corresponds to $s_2$. The factors associated with the above permutations $\pi_1$, $\pi_2$, and $\pi_3$ are $s_1^2 s_2$, $s_1 s_3$, and $s_4$. Let Γ be a permutation group of degree $n$. For each permutation $\pi \in \Gamma$, let $j_k(\pi)$ be the number of cycles of length $k$ in the disjoint cycle decomposition of π. The cycle index $Z(\Gamma)$ is the polynomial in the $n$ variables $s_1, s_2, \ldots, s_n$ given by the formula:

$$Z(\Gamma) = \frac{1}{|\Gamma|} \sum_{\pi \in \Gamma} \prod_{k=1}^{n} s_k^{j_k(\pi)}$$

Fig. 4-1 The symmetry axes of benzene
4.10.1 Pólya's Theorem

If $C(x)$ is the counting series for a collection of weighted objects, and Γ is a permutation group acting on $n$ positions such that Γ defines an equivalence relation on arrangements consisting of $n$ objects, then the substituted cycle index $Z(\Gamma, C(x))$ is the counting series for nonequivalent arrangements consisting of $n$ objects, where the weight of an arrangement is the sum of the weights of the $n$ objects of which it is composed. The benzene cycle index derivation considers the symmetry elements indicated in Figure 4-1, corresponding to the permutations and cycle index terms presented in Table 4-6. The cycle index for benzene is:

$$Z = \frac{1}{12}\left(s_1^6 + 2 s_6 + 2 s_3^2 + 4 s_2^3 + 3 s_1^2 s_2^2\right) \qquad (67)$$

The number of benzene substitution isomers with formula $C_6H_{6-k}X_k$ is obtained by substituting $s_k = 1 + x^k$ in the cycle index for benzene, Eq. (67). This substitution gives the counting polynomial:
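$$Z = \frac{1}{12}\left[(1 + x)^6 + 2(1 + x^6) + 2(1 + x^3)^2 + 4(1 + x^2)^3 + 3(1 + x)^2 (1 + x^2)^2\right] = 1 + x + 3x^2 + 3x^3 + 3x^4 + x^5 + x^6 \qquad (68)$$

Tab. 4-6 Generation of the cycle index of benzene (each permutation is the bottom row of the two-row notation with top row 123456)

Symmetry operation                               Permutation   Cycle index term
identity                                         123456        s_1^6
rotation by 60° (six-fold axis)                  234561        s_6
rotation by 300° (six-fold axis)                 612345        s_6
rotation by 120°                                 345612        s_3^2
rotation by 240°                                 561234        s_3^2
rotation by 180°                                 456123        s_2^3
two-fold axis through opposite bond midpoints    216543        s_2^3
two-fold axis through opposite bond midpoints    654321        s_2^3
two-fold axis through opposite bond midpoints    432165        s_2^3
two-fold axis through opposite atoms             165432        s_1^2 s_2^2
two-fold axis through opposite atoms             321654        s_1^2 s_2^2
two-fold axis through opposite atoms             543216        s_1^2 s_2^2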
showing that benzene has one unsubstituted isomer, one monosubstituted isomer, three disubstituted isomers, three trisubstituted isomers, etc.
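The counting polynomial can be regenerated from the twelve permutations of Table 4-6 alone. The sketch below decomposes each permutation into disjoint cycles, substitutes $s_k = 1 + x^k$ for every cycle of length $k$, and averages over the group; the function names are illustrative and polynomials are stored as coefficient lists from $x^0$ upwards.

```python
def cycle_type(perm):
    """Sorted cycle lengths of a permutation given as the bottom row
    of the two-row notation (1-based images)."""
    seen, lengths = set(), []
    for start in range(1, len(perm) + 1):
        if start in seen:
            continue
        length, pos = 0, start
        while pos not in seen:
            seen.add(pos)
            pos = perm[pos - 1]
            length += 1
        lengths.append(length)
    return sorted(lengths)

def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def counting_polynomial(perms):
    """Substitute s_k = 1 + x^k into the cycle index of the group."""
    n = len(perms[0])
    total = [0] * (n + 1)
    for perm in perms:
        term = [1]
        for k in cycle_type(perm):
            term = poly_mul(term, [1] + [0] * (k - 1) + [1])   # 1 + x^k
        total = [t + c for t, c in zip(total, term)]
    return [t // len(perms) for t in total]

benzene_group = ["123456", "234561", "612345", "345612", "561234", "456123",
                 "216543", "654321", "432165", "165432", "321654", "543216"]
perms = [[int(ch) for ch in p] for p in benzene_group]
isomers = counting_polynomial(perms)
print(isomers)
```

The coefficient of $x^k$ in the result is the number of isomers $C_6H_{6-k}X_k$; replacing the list of permutations with the symmetry group of another skeleton gives the corresponding substitution isomer counts.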
4.11 Conclusions
The graph representation of chemical objects, their relationships, and their transformations has a profound impact on our understanding of the relationships between chemical structure and the physical, chemical, and biological properties of chemicals. Computational methods based on molecular graphs are used to encode chemicals and reactions in databases, to enumerate all isomers with a given molecular formula, or to generate combinatorial libraries. The adjacency matrix is familiar to chemists because it was first applied in Hückel molecular orbital calculations. In addition to the adjacency matrix, various other types of matrix can be associated with molecular graphs; these molecular matrixes represent important sources of topological indices. The rich literature on topological indices and their applications in QSPR, QSAR, and virtual screening of combinatorial libraries shows that the most active application of graphs in chemistry is the development of new structural descriptors derived from molecular graphs. Significant developments were made by defining parameters for heteroatoms and multiple bonds, by introducing new molecular matrixes, and by computing topological indices with graph operators. For the future, major progress is expected in this direction, with benefits for the drug design process and for structure-property models.

References
1 F. HARARY, Graph Theory, Addison-Wesley, Reading, MA, 1971.
2 D. M. CVETKOVIĆ, M. DOOB, H. SACHS, Spectra of Graphs. Theory and Applications, 3rd edition, Johann Ambrosius Barth, Heidelberg, 1995.
3 A. T. BALABAN (Ed.), Chemical Applications of Graph Theory, Academic Press, London, 1976.
4 I. GUTMAN, O. E. POLANSKY, Mathematical Concepts in Organic Chemistry, Springer, Berlin, 1986.
5 D. H. ROUVRAY (Ed.), Computational Chemical Graph Theory, Nova Science, New York, 1990.
6 D. BONCHEV, D. H. ROUVRAY (Eds.), Chemical Graph Theory. Introduction and Fundamentals, Abacus Press/Gordon and Breach Science Publishers, New York, 1991.
7 N. TRINAJSTIĆ, Chemical Graph Theory, 2nd ed., CRC Press, Boca Raton, 1992.
8 A. T. BALABAN (Ed.), From Chemical Topology to Three-Dimensional Geometry, Plenum, New York, 1997.
9 L. B. KIER, L. H. HALL, Molecular Connectivity in Chemistry and Drug Research, Academic Press, New York, 1976.
10 A. T. BALABAN, A. CHIRIAC, I. MOTOC, Z. SIMON, Steric Fit in Quantitative Structure-Activity Relations, Lect. Notes Chem. Vol. 15, Springer, Berlin, 1980.
11 D. BONCHEV, Information Theoretic Indices for Characterization of Chemical Structure, Research Studies Press/Wiley, Chichester, UK, 1983.
12 L. B. KIER, L. H. HALL, Molecular Connectivity in Structure-Activity Analysis, Research Studies Press, Letchworth, 1986.
13 N. VOICULETZ, A. T. BALABAN, I. NICULESCU-DUVAZ, Z. SIMON, Modeling of Cancer Genesis and Prevention, CRC Press, Boca Raton, 1990.
14 L. B. KIER, L. H. HALL, Molecular Structure Description: The Electrotopological State, Academic Press, San Diego, 1999.
15 J. DEVILLERS, A. T. BALABAN (Eds.), Topological Indices and Related Descriptors in QSAR and QSPR, Gordon and Breach Science Publishers, The Netherlands, 1999.
16 M. V. DIUDEA (Ed.), QSPR/QSAR Studies by Molecular Descriptors, Nova Science, Huntington, N.Y., 2001.
17 A. GRAOVAC, I. GUTMAN, N. TRINAJSTIĆ, Topological Approach to the Chemistry of Conjugated Molecules, Springer, Berlin, 1977.
18 J. R. DIAS, Molecular Orbital Calculations Using Chemical Graph Theory, Springer, Berlin, 1993.
19 S. J. CYVIN, I. GUTMAN, Kekulé Structures in Benzenoid Hydrocarbons, Lect. Notes Chem. Vol. 46, Springer, Berlin, 1988.
20 I. GUTMAN, S. J. CYVIN, Introduction to the Theory of Benzenoid Hydrocarbons, Springer, Berlin, 1989.
21 I. GUTMAN, S. J. CYVIN, Advances in the Theory of Benzenoid Hydrocarbons, Topics Curr. Chem. Vol. 153, Springer, Berlin, 1990.
22 S. J. CYVIN, J. BRUNVOLL, B. N. CYVIN, Theory of Coronoid Hydrocarbons, Lect. Notes Chem. Vol. 54, Springer, Berlin, 1991.
23 G. PÓLYA, R. C. READ, Combinatorial Enumeration of Groups, Graphs, and Chemical Compounds, Springer, Berlin, 1987.
24 S. FUJITA, Symmetry and Combinatorial Enumeration in Chemistry, Springer, Berlin, 1991.
25 O. IVANCIUC, Canonical Numbering and Constitutional Symmetry, in: The Encyclopedia of Computational Chemistry, P. v. R. SCHLEYER, N. L. ALLINGER, T. CLARK, J. GASTEIGER, P. A. KOLLMAN, H. F. SCHAEFER III, P. R. SCHREINER (Eds.), John Wiley & Sons, Chichester, 1998, pp. 167-183.
26 O. N. TEMKIN, A. V. ZEIGARNIK, D. BONCHEV, Chemical Reaction Networks. A Graph-Theoretical Approach, CRC Press, Boca Raton, 1996.
27 J. KOČA, M. KRATOCHVÍL, V. KVASNIČKA, L. MATYSKA, J. POSPÍCHAL, Synthon Model of Organic Chemistry and Synthesis Design, Lect. Notes Chem. Vol. 51, Springer, Berlin, 1989.
28 M. BARYSZ, G. JASHARI, R. S. LALL, V. K. SRIVASTAVA, N. TRINAJSTIĆ, On the Distance Matrix of Molecules Containing Heteroatoms, in: Chemical Applications of Topology and Graph Theory, R. B. KING (Ed.), Elsevier, Amsterdam, 1983, pp. 222-227.
29 O. IVANCIUC, Rev. Roum. Chim. 2000, 45, 289-301.
30 F. R. BURDEN, J. Chem. Inf. Comput. Sci. 1989, 29, 225-227.
31 R. S. PEARLMAN, K. M. SMITH, J. Chem. Inf. Comput. Sci. 1999, 39, 28-35.
32 O. IVANCIUC, Rev. Roum. Chim. 2001, 46, 1047-1066.
33 N. TRINAJSTIĆ, D. BABIĆ, S. NIKOLIĆ, D. PLAVŠIĆ, D. AMIĆ, Z. MIHALIĆ, J. Chem. Inf. Comput. Sci. 1994, 34, 368-376.
34 O. IVANCIUC, Rev. Roum. Chim. 2001, 46, 1331-1347.
35 O. IVANCIUC, J. Chem. Inf. Comput. Sci. 2000, 40, 1412-1422.
36 O. IVANCIUC, T.-S. BALABAN, A. T. BALABAN, J. Math. Chem. 1993, 12, 309-318.
37 O. IVANCIUC, A. T. BALABAN, MATCH (Commun. Math. Chem.) 1994, 30, 141-152.
38 M. RANDIĆ, J. Chem. Inf. Comput. Sci. 1997, 37, 1063-1071.
39 O. IVANCIUC, Rev. Roum. Chim. 2000, 45, 587-596.
40 D. J. KLEIN, M. RANDIĆ, J. Math. Chem. 1993, 12, 81-95.
41 O. IVANCIUC, ACH - Models Chem. 2000, 137, 607-631.
42 M. V. DIUDEA, J. Chem. Inf. Comput. Sci. 1996, 36, 535-540.
43 M. RANDIĆ, New J. Chem. 1997, 21, 945-951.
44 O. IVANCIUC, T. IVANCIUC, A. T. BALABAN, Internet Electron. J. Mol. Des. 2002, 1, 467-487, http://www.biochempress.com.
45 O. IVANCIUC, T. IVANCIUC, A. T. BALABAN, ACH - Models Chem. 2000, 137, 57-82.
46 A. T. BALABAN, D. MILLS, O. IVANCIUC, S. C. BASAK, Croat. Chem. Acta 2000, 73, 923-941.
47 M. V. DIUDEA, J. Chem. Inf. Comput. Sci. 1997, 37, 292-299.
48 O. IVANCIUC, T. IVANCIUC, M. V. DIUDEA, Roum. Chem. Quart. Rev. 1999, 7, 41-67.
49 O. IVANCIUC, A. T. BALABAN, J. Math. Chem. 1992, 11, 169-177.
50 J. AIHARA, J. Am. Chem. Soc. 1976, 98, 2750-2758.
51 I. GUTMAN, M. MILUN, N. TRINAJSTIĆ, J. Am. Chem. Soc. 1977, 99, 1692-1704.
52 G. PÓLYA, R. C. READ, Combinatorial Enumeration of Groups, Graphs, and Chemical Compounds, Springer, Berlin, 1987.
53 A. T. BALABAN, J. W. KENNEDY, L. V. QUINTAS, J. Chem. Educ. 1988, 65, 304-313.
54 O. IVANCIUC, A. T. BALABAN, Graph Theory in Chemistry, in: The Encyclopedia of Computational Chemistry, P. v. R. SCHLEYER, N. L. ALLINGER, T. CLARK, J. GASTEIGER, P. A. KOLLMAN, H. F. SCHAEFER III, P. R. SCHREINER (Eds.), John Wiley & Sons, Chichester, 1998, pp. 1169-1190.
Handbook of Chemoinformatics. Johann Gasteiger. Copyright © 2003 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

5 Processing Constitutional Information

Ovidiu Ivanciuc's biographical notes are given at the beginning of Chapter 11, Section 4.

5.1 Canonical Numbering and Constitutional Symmetry

Ovidiu Ivanciuc

5.1.1 Introduction
Chemoinformatics systems use a wide variety of algorithms for indexing and retrieving chemical compounds in databases, generating all isomers with a given constitutional formula, or for computer-assisted organic synthesis. All these tasks involve three classes of algorithms for chemical graphs:

1. the canonical coding problem, for generating a unique representation of a chemical compound;
2. the automorphism partitioning problem (also known as constitutional symmetry perception, graph symmetry, or topological symmetry), for detection of equivalent atoms and bonds in a molecule; and
3. the graph isomorphism problem, for determining if two connection tables represent the same chemical compound.

The three problems are related and their considerable practical and theoretical importance has encouraged many mathematicians and chemists to investigate them. The main chemical applications of canonical coding and constitutional symmetry perception are briefly presented below:

1. In a chemical documentation system each substance must be characterized by a unique code, which is used for storage, retrieval and comparison of chemical compounds. The nomenclature systems used by chemists to communicate chemical information are not suitable for computer manipulation, and special chemical structure representations are developed for chemical database management and searching.
2. The computer generation of chemical compounds consistent with given structural constraints is used both in synthesis design and in structure elucidation. The chemical structure set generated must be exhaustive (to contain all compounds consistent with the generation rules) and non-redundant (to contain each structure only once). The main problem of most structure-generation algorithms lies in generating a great number of redundant structures, which have to be found and eliminated. Canonical coding of structures is used to eliminate redundant compounds, while constitutional symmetry information is necessary to generate all possible structures.
3. Artificial intelligence systems for synthesis design use canonical codes and constitutional symmetry information to generate and evaluate reaction paths.
4. The synthesis by computer-aided molecular design of new compounds that conform to various physical property requirements can reduce the time and effort required by traditional empirical approaches. This process generates chemical structures compatible with experimental and structural restrictions. Canonical coding and symmetry perception are used to uniquely generate all possible isomers that adhere to the design constraints.
5. Constitutional symmetry is important for interpretation of NMR and ESR spectra, and computer-assisted structure elucidation systems use this information.
6. Quantitative structure-activity relationships are used to model the biological effect of a set of compounds and to propose new structures with optimized biological activity. Such a system makes extensive use of structure generation, substructure search and symmetry perception algorithms.
7. Modern drug-design procedures make extensive use of large combinatorial libraries and in silico screening of chemicals, both using algorithms based on canonical coding, graph isomorphism, and constitutional symmetry perception.

The problems of canonical coding, graph isomorphism, and graph automorphism have both mathematical and chemical significance. Mathematical formulation of these problems is covered briefly below, and some connections with their chemical counterparts are presented.
In subsequent sections the main algorithms used in chemistry for canonical coding of molecular graphs and constitutional symmetry perception are presented and compared. We will use definitions introduced in the graph theory section (graphs, molecular graphs, molecular matrixes, characteristic polynomial and matrix spectra) (Chapter II, Section 4) and in the topological indices section (vertex and molecular invariants) (Chapter VIII, Section 1). This review is a modified and updated version of a chapter published in the Encyclopedia of Computational Chemistry [1].
5.1.2
Graph Labeling
All algorithms for canonical coding and graph symmetry determination generate, compare, and manipulate labeled graphs. The process of generating all labelings of a graph is presented below. A molecular graph G = G(V, E) consists of a nonempty set V of vertices representing atoms and a set E of edges representing chemical bonds; the number of vertices is N = |V(G)|. If the edge e_ij ∈ E, then the two vertices v_i and v_j are adjacent, and e_ij is incident with v_i and v_j. The degree of a vertex v_i, deg_i, is the number of edges incident with v_i. A labeling Lb of the graph G composed of N vertices consists of a one-to-one mapping Lb: V(G) → {1, 2, ..., N}. The integer Lb(v) ∈ {1, 2, ..., N} assigned to a vertex v ∈ V(G) is called the label of the vertex v. A graph G together with the mapping Lb is called a labeled graph and is denoted by G(Lb).

For a graph with N vertices there are N! permutation labelings. The generation of all N! labelings of a graph can be represented as a rooted tree where the root node is the unlabeled graph, at each node a new label is added to an unlabeled vertex, and each terminal node is a completely labeled graph corresponding to one labeling. The tree representing the generation of the 3! = 6 permutation labelings of cyclopropane is presented in Figure 5.1-1. Two basic approaches are used to construct the tree of permutation labelings: breadth-first or depth-first. Each one has advantages and disadvantages, and most coding, isomorphism, and automorphism algorithms use a mixture of these two approaches. The order of exploring the labeling tree for the depth-first method is presented in Figure 5.1-2; the breadth-first labeling process is presented in Figure 5.1-3. The problem of generating N! labelings in canonical coding algorithms can be reduced by using rules which allow cutting of some branches of the labeling tree.

Fig. 5.1-1 The tree representation of the generation of the 3! = 6 permutation labelings of cyclopropane

Fig. 5.1-2 The tree representing the depth-first permutation labeling of a graph with four vertices
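The labeling tree described above can be sketched in a few lines. The following is a minimal illustration, not an implementation taken from the chapter: the function name `depth_first_labelings` and the vertex names `a`, `b`, `c` are hypothetical. Each recursive call corresponds to one node of the tree, and each completed dictionary corresponds to a terminal node.

```python
def depth_first_labelings(vertices):
    """Yield every labeling of `vertices` as a dict {vertex: label}.

    The recursion explores the labeling tree depth-first: the root is the
    unlabeled graph, and every level assigns the next label to one of the
    vertices that is still unlabeled.
    """

    def extend(partial, unlabeled):
        if not unlabeled:
            # Terminal node: a completely labeled graph.
            yield dict(partial)
            return
        next_label = len(partial) + 1
        for v in unlabeled:
            partial[v] = next_label
            yield from extend(partial, [u for u in unlabeled if u != v])
            del partial[v]  # backtrack before trying the next branch

    yield from extend({}, list(vertices))


# The three ring atoms of cyclopropane (names are placeholders).
cyclopropane = ["a", "b", "c"]
labelings = list(depth_first_labelings(cyclopropane))
for lb in labelings:
    print(lb)
print(len(labelings), "labelings")
```

Cutting a branch of the tree, as mentioned in the text, would amount to returning early from `extend` when a partial labeling can no longer lead to the labeling being sought.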
Fig. 5.1-3 The tree representing the breadth-first permutation labeling of a graph with four vertices
5.1.3
Constitutional Symmetry of Graphs
The problem of graph symmetry investigates the equivalence relationships between the elements of the molecular graphs (atoms, bonds, pairs of atoms, etc.). The geometrical information is neglected and only bonding relationships are considered. Consider two graphs G = G(V, E) and G' = G'(V', E') with N = |V| = |V'| and a mapping m: V → V' which assigns to each vertex v ∈ V a vertex v' ∈ V' in such a way that if v_i ≠ v_j then m(v_i) ≠ m(v_j). The two graphs G and G' are called isomorphic if there exists a mapping m between V and V' which preserves the adjacency of vertices, i.e. if e_ij ∈ E, v_k = m(v_i), v_l = m(v_j) then e_kl ∈ E'. The problem of recognizing if two graphs G and G' are isomorphic or not is called the graph isomorphism problem [2]. The chemical counterpart of this problem is to determine if two molecular graphs represent the same chemical compound. An isomorphism of a graph with itself is called an automorphism. An automorphism can be represented by a permutation (mapping) that transforms a graph labeling into another labeling and preserves the adjacency of the vertices. A permutation P represented in a two-row notation has the following form (Eq. (1)):

\[
P = \begin{pmatrix} 1 & 2 & 3 & \cdots & i & \cdots & N \\ p_1 & p_2 & p_3 & \cdots & p_i & \cdots & p_N \end{pmatrix} \tag{1}
\]
with the meaning that atom 1 is permuted to atom p_1, atom 2 is permuted to atom p_2, atom 3 is permuted to atom p_3, atom i is permuted to atom p_i, and atom N is permuted to atom p_N. An orbit is the set of all atoms that are transformed from one into another by the actions of all automorphisms of a molecular graph. The set of all different orbits of a molecular graph G forms a partition of G. If P represents an automorphism then atom i is symmetric (topologically equivalent) with its image p_i, and atoms i and p_i belong to the same orbit. This symmetry relationship in the molecular graph may or may not be true for the three-dimensional molecular structure. Let Aut(G) = {A, B, C, ...} be the set of automorphisms of the atoms in a graph G and let ∘ symbolize a binary operation on Aut(G). Aut(G) is called an automorphism group if the following conditions are satisfied [3]:
1. For any two permutations A, B ∈ Aut(G) there exists a unique element C = A ∘ B, C ∈ Aut(G).
2. The operations respect the associative law: A ∘ B ∘ C = A ∘ (B ∘ C) = (A ∘ B) ∘ C for all permutations A, B, C ∈ Aut(G).
3. For every permutation A ∈ Aut(G) there exists an inverse permutation A⁻¹ ∈ Aut(G) such that A ∘ A⁻¹ = A⁻¹ ∘ A = E.
4. The set Aut(G) contains a unique permutation E such that A ∘ E = E ∘ A = A for all A ∈ Aut(G). The permutation E is called the identity permutation of the group Aut(G).
The automorphism group describes all symmetry properties of a graph. The determination of the automorphism group of molecular graphs is important in enumeration and generation of isomers and interpretation and simulation of spectra. The combination of two automorphisms with the operation ∘, A ∘ B = C, considers in the first step the second permutation, B, which transforms an atom i into its image B(i). The second step considers permutation A and correlates an atom B(i) from the first row with its image A(B(i)) from the second row. The third step generates the permutation C which correlates an original atom i with its image A(B(i)).
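[Structures 1-6: the labeled graph of 2,2-dimethylpentane (1) and its five permutation labelings (2-6)]

Consider the labeled graph of 2,2-dimethylpentane 1 and its five permutation labelings 2, 3, 4, 5, and 6. The five orbits of 2,2-dimethylpentane are: X1 = {1}, X2 = {2}, X3 = {3}, X4 = {4}, X5 = {5, 6, 7}. The identity permutation E of 1 and the five automorphisms P1, P2, P3, P4, and P5, which transform 1 into 2, 3, 4, 5, and 6, respectively, are presented below:

\[
\begin{aligned}
E &= \begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&5&6&7 \end{pmatrix} &
P_1 &= \begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&5&7&6 \end{pmatrix} \\
P_2 &= \begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&7&6&5 \end{pmatrix} &
P_3 &= \begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&6&5&7 \end{pmatrix} \\
P_4 &= \begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&7&5&6 \end{pmatrix} &
P_5 &= \begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&6&7&5 \end{pmatrix}
\end{aligned}
\]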
The combination of two automorphisms of 2,2-dimethylpentane, namely P1 ∘ P2 = P5, is presented below:

\[
P_1 \circ P_2 =
\begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&5&7&6 \end{pmatrix} \circ
\begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&7&6&5 \end{pmatrix} =
\begin{pmatrix} 1&2&3&4&5&6&7 \\ 1&2&3&4&6&7&5 \end{pmatrix}
\]

Because the result of P1 ∘ P2 is another automorphism of 1, the above equation is an application of property (1) of automorphism groups.
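The combination rule can be checked numerically. The sketch below stores each permutation as the second row of its two-row notation; the helper name `compose` is an illustrative choice, not something defined in the chapter. It applies the right-hand permutation first, as described for the operation ∘.

```python
def compose(a, b):
    """Return a∘b: atom i is first sent to b(i), then to a(b(i)).

    A permutation is stored as a tuple holding the second row of the
    two-row notation, so a[i - 1] is the image of atom i.
    """
    return tuple(a[b[i] - 1] for i in range(len(b)))


# Second rows of the automorphisms of 2,2-dimethylpentane.
E = (1, 2, 3, 4, 5, 6, 7)
P1 = (1, 2, 3, 4, 5, 7, 6)
P2 = (1, 2, 3, 4, 7, 6, 5)
P3 = (1, 2, 3, 4, 6, 5, 7)
P4 = (1, 2, 3, 4, 7, 5, 6)
P5 = (1, 2, 3, 4, 6, 7, 5)

print(compose(P1, P2) == P5)  # the product quoted in the text
print(compose(P2, P1) == P4)  # the order of the factors matters

# Closure: every product of two listed permutations is again in the list.
group = {E, P1, P2, P3, P4, P5}
closed = all(compose(a, b) in group for a in group for b in group)
print(closed)
```

The second line of output illustrates that the operation is not commutative in general, which is why the text specifies which permutation is applied first.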
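[Structure 7: the labeled graph of isopropylcyclopropane]

Consider the molecular graph of isopropylcyclopropane 7, its orbits X1 = {1, 2}, X2 = {3}, X3 = {4}, X4 = {5, 6}, and its automorphisms E, A, B, and C. E is the identity permutation, A and B each exchange the two atoms of a single orbit (second rows 1 2 3 4 6 5 and 2 1 3 4 5 6 in the two-row notation), and C exchanges both pairs:

\[
E = \begin{pmatrix} 1&2&3&4&5&6 \\ 1&2&3&4&5&6 \end{pmatrix} \qquad
C = \begin{pmatrix} 1&2&3&4&5&6 \\ 2&1&3&4&6&5 \end{pmatrix}
\]

The combination of these permutations gives the following multiplication table:

\[
\begin{array}{c|cccc}
\circ & E & A & B & C \\ \hline
E & E & A & B & C \\
A & A & E & C & B \\
B & B & C & E & A \\
C & C & B & A & E
\end{array}
\]

Each permutation has an inverse element, which can be read from the positions of E in the multiplication table; here every permutation is its own inverse:

\[
\begin{array}{c|cccc}
\text{Permutation} & E & A & B & C \\ \hline
\text{Inverse} & E & A & B & C
\end{array}
\]

From the above properties of the permutations E, A, B, and C, one may conclude that they form the automorphism group of the graph of isopropylcyclopropane 7. This automorphism group describes all constitutional symmetry relationships of isopropylcyclopropane.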
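The group properties claimed for isopropylcyclopropane can be verified mechanically. In the sketch below the labels `A` and `B` are assigned to the two single-exchange permutations by assumption (the multiplication table is the same for either assignment), and the helper `compose` is a hypothetical name.

```python
from itertools import product


def compose(a, b):
    """Return a∘b (apply b first, then a) for permutations stored as second rows."""
    return tuple(a[b[i] - 1] for i in range(len(b)))


group = {
    "E": (1, 2, 3, 4, 5, 6),
    "A": (1, 2, 3, 4, 6, 5),  # assumed label: exchanges atoms 5 and 6
    "B": (2, 1, 3, 4, 5, 6),  # assumed label: exchanges atoms 1 and 2
    "C": (2, 1, 3, 4, 6, 5),  # exchanges both pairs
}
name_of = {perm: name for name, perm in group.items()}

# Multiplication table; a KeyError here would mean the set is not closed.
table = {
    (x, y): name_of[compose(group[x], group[y])]
    for x, y in product(group, repeat=2)
}

# The inverse of x is the element y with x∘y = E.
inverse = {x: next(y for y in group if table[(x, y)] == "E") for x in group}

print("  " + " ".join(group))
for x in group:
    print(x, " ".join(table[(x, y)] for y in group))
print(inverse)
```

Closure, the identity, and the inverses are checked directly; associativity holds automatically because composition of mappings is associative.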
A permutation of the vertices of a graph G can be described by a permutation matrix P whose element [P]_ij = 1 if the vertex v_i is permuted to v_j, and [P]_ij = 0 otherwise. If a permutation is in the automorphism group then the following equation is valid (Eq. (2)):

\[
P^{T} A P = A \tag{2}
\]

where A is the adjacency matrix of G and P^T is the transpose of P. The matrix representation of the permutation C of 7 is:

\[
C = \begin{pmatrix}
0&1&0&0&0&0 \\
1&0&0&0&0&0 \\
0&0&1&0&0&0 \\
0&0&0&1&0&0 \\
0&0&0&0&0&1 \\
0&0&0&0&1&0
\end{pmatrix}
\]

To verify with Eq. (2) that the permutation C is in the automorphism group of 7 we have to show that C^T A C = A, i.e.:

\[
\begin{pmatrix}
0&1&0&0&0&0 \\
1&0&0&0&0&0 \\
0&0&1&0&0&0 \\
0&0&0&1&0&0 \\
0&0&0&0&0&1 \\
0&0&0&0&1&0
\end{pmatrix}
\begin{pmatrix}
0&0&1&0&0&0 \\
0&0&1&0&0&0 \\
1&1&0&1&0&0 \\
0&0&1&0&1&1 \\
0&0&0&1&0&1 \\
0&0&0&1&1&0
\end{pmatrix}
\begin{pmatrix}
0&1&0&0&0&0 \\
1&0&0&0&0&0 \\
0&0&1&0&0&0 \\
0&0&0&1&0&0 \\
0&0&0&0&0&1 \\
0&0&0&0&1&0
\end{pmatrix}
=
\begin{pmatrix}
0&0&1&0&0&0 \\
0&0&1&0&0&0 \\
1&1&0&1&0&0 \\
0&0&1&0&1&1 \\
0&0&0&1&0&1 \\
0&0&0&1&1&0
\end{pmatrix}
\]
For a graph with N vertices there are N! permutation matrixes, and a brute force generation of Aut(G) requires N! tests to verify if Eq. (2) is satisfied. Algorithms that greatly reduce the number of tests in the determination of the automorphism group use information on constitutional symmetry (orbits) and avoid the generation of non-automorphic permutations [4, 5]. Balasubramanian developed algorithms to determine the automorphism groups of edge-weighted graphs [6] and to generate nuclear equivalence classes based on the three-dimensional molecular structure [7].
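The brute force procedure can be written down directly. The sketch below is an illustration only: it tests all 6! = 720 permutations of the adjacency matrix of isopropylcyclopropane 7, using the entrywise condition A[i][j] = A[p(i)][p(j)], which is equivalent to Eq. (2). The function name `is_automorphism` and the 0-based vertex indices are choices made for this example.

```python
from itertools import permutations

# Adjacency matrix of isopropylcyclopropane 7 (atoms 1..6 are rows 0..5).
A = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]


def is_automorphism(perm, adj):
    """True if the permutation preserves adjacency (entrywise form of Eq. (2))."""
    n = len(adj)
    return all(
        adj[i][j] == adj[perm[i]][perm[j]] for i in range(n) for j in range(n)
    )


# N! tests, one per permutation of the vertex indices.
automorphisms = [p for p in permutations(range(len(A))) if is_automorphism(p, A)]

# Orbit of atom i: all images of i under the automorphisms found.
orbits = {frozenset(p[i] + 1 for p in automorphisms) for i in range(len(A))}

print(len(automorphisms))
print(sorted(sorted(o) for o in orbits))
```

For this graph only 4 of the 720 tests succeed, which shows why the orbit-guided algorithms cited above are preferred for larger molecules.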
5.1.4 Canonical Coding of Graphs
A code Cd(G, Lb) of the labeled graph G(Lb) is a string obtained from G by a set of rules. The code is not a structural invariant, because different labelings of G usually give different codes. A code is a complete representation of G(Lb) because the labeled graph can be reconstructed from Cd(G, Lb). The code of a chemical compound is a numerical representation of the chemical structure suitable for computer manipulation. An important property of codes is that the lexicographical relation (or numerical relation, in the case of numerical codes) between two strings induces an order of the codes. Two labelings Lb1 and Lb2 of graph G are called equivalent, Lb1 ≡ Lb2, if the corresponding codes are identical, Cd(G, Lb1) = Cd(G, Lb2). Because a code depends on the graph labeling, if Cd(G, Lb1) ≥ Cd(G, Lb2) then Lb1 ≥ Lb2 and G(Lb1) ≥ G(Lb2). In this way it is possible to order two labeled graphs. The set of all possible labelings Lb of graph G is denoted LbS(G). A maximal canonical labeling Lb_Mcan ∈ LbS(G) of graph G generates a maximal canonical code Cd_Mcan with the property that ∀Lb ∈ LbS(G): Cd(G, Lb) ≤ Cd_Mcan(G, Lb_Mcan). A minimal canonical labeling Lb_mcan ∈ LbS(G) of graph G generates a minimal canonical code Cd_mcan with the property that ∀Lb ∈ LbS(G): Cd(G, Lb) ≥ Cd_mcan(G, Lb_mcan). Both minimal and maximal canonical codes are used in chemistry, depending on the definition selected to represent the chemical structures. From the above definition it is clear that for a given molecular graph the canonical code is unique. This property is used in graph isomorphism testing and in storage, retrieval and comparison of chemical compounds, because two molecular graphs with identical canonical codes represent the same chemical compound.
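A small sketch may make the definition of a minimal canonical code concrete. The code used here, a sorted list of labeled bonds, is only one possible choice and is an assumption of this example, as are the function names and the three test graphs (two differently numbered connection tables of 2-methylbutane and one of n-pentane).

```python
from itertools import permutations


def code(edges, labeling):
    """Code of a labeled graph: the sorted list of its bonds, written with labels.

    `labeling[v]` is the label given to vertex v (vertices are 0..N-1).
    The graph can be rebuilt from this code, so it is a complete representation.
    """
    return tuple(sorted(tuple(sorted((labeling[u], labeling[v]))) for u, v in edges))


def canonical_code(n_vertices, edges):
    """Minimal code over all N! labelings (brute force)."""
    labels = range(1, n_vertices + 1)
    return min(code(edges, lb) for lb in permutations(labels))


# Two connection tables of 2-methylbutane with different atom numbering.
table_1 = [(0, 1), (1, 2), (2, 3), (1, 4)]
table_2 = [(3, 2), (2, 0), (0, 1), (2, 4)]
# n-Pentane, a constitutional isomer.
table_3 = [(0, 1), (1, 2), (2, 3), (3, 4)]

print(canonical_code(5, table_1))
print(canonical_code(5, table_1) == canonical_code(5, table_2))  # same compound
print(canonical_code(5, table_1) == canonical_code(5, table_3))  # different compound
```

Comparing canonical codes therefore answers the graph isomorphism question, at the cost of N! code evaluations in this naive form.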
[Structures 8-10: isopropylcyclobutane (8) and two of its labelings (9 and 10)]
For two permutation labelings giving the same matrix or code, two vertices with the same label are automorphic. In the case of isopropylcyclobutane 8 the two labelings 9 and 10 have identical adjacency matrixes, showing that two pairs of vertices are automorphic, namely vertices ( a c) and (e g). A rigorous method to derive the canonical code and automorphism partitioning is to construct all N! permutations, to generate their codes and compare them to extract the one-to-one correspondence as presented in the above example. The permutation labelings corresponding to the canonical code are identified by a lexicographical comparison of the N! codes, followed by the selection of the maximal (or minimal) code. The process of generation of the canonical code by investigating automorphism permutations is called canonical code generation by automorphism permutation (CCAP).
However, all coding algorithms use a heuristic approach to reduce the number of permutation labelings that have to be investigated in order to detect the canonical labelings. Any vertex invariant can be used to obtain a preliminary partition of the vertices from a molecular graph. Consider a vertex invariant of graph G with N vertices, In = (In_1, In_2, ..., In_N), which assigns a value In_i to vertex v_i. A vertex graph invariant is any vertex property, computed on the basis of the graph structure, whose value does not depend on the graph labeling. Examples of vertex invariants are the degree and distance sum. A partitioning of the vertex set V is induced by the invariant In by including vertices v_i and v_j in the same atom invariant class (AIC) if In_i = In_j; the number of vertices in the class i is denoted by n_i. Two vertices from different AIC cannot be automorphic, while two vertices from the same AIC are not necessarily automorphic. Despite numerous efforts, no vertex graph invariant is known which is sufficient to establish the automorphism partitioning, because for certain graphs non-automorphic vertices are partitioned in the same class.

The process of atom partitioning in AIC induced by a certain atomic invariant is called graph invariant atom partitioning (GIAP), and represents an important step in the generation of the canonical code. To determine the canonical code and automorphism partitioning, each AIC resulting from the GIAP procedure is investigated to detect non-automorphic vertices by using the definition of automorphism. The partitioning of vertices in k classes, with n_1, n_2, ..., n_k vertices in each class, reduces the complexity of canonical coding of a graph with N vertices from N! to the construction of n_1! n_2! ... n_k! = ∏_i n_i! permutation labelings. The classes of vertices are ordered with some rules, then vertices in the first class receive labels 1, 2, ..., n_1, vertices in the second class are labeled n_1 + 1, n_1 + 2, ..., n_1 + n_2, and so on.
The efficiency of the GIAP depends on the discriminating power of the atomic invariant. Consider the molecular graph of 3-methylhexane 11 and its vertex partitioning into three classes induced by atom degrees: three atoms with degree 1, three atoms with degree 2, and one atom with degree 3. The graph 11 is an identity tree because all vertices are topologically distinct. The brute force determination of the canonical code and constitutional symmetry of 3-methylhexane 11 requires the comparison of the codes generated by 7! = 5040 permutation labelings. The atom degree partitioning reduces the number of labelings needed to determine the constitutional symmetry to 3! 3! 1! = 36.
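The count for 3-methylhexane can be reproduced as follows. The atom numbering in the bond list (chain atoms 1-6, methyl carbon 7 attached to atom 3) is an assumption made for this sketch and need not match the numbering of structure 11.

```python
from collections import Counter
from math import factorial, prod

# Carbon skeleton of 3-methylhexane (assumed numbering, see above).
bonds = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (3, 7)]

# Vertex invariant: the degree of each atom.
degree = Counter()
for u, v in bonds:
    degree[u] += 1
    degree[v] += 1

# Atom invariant classes: how many atoms share each degree value.
class_sizes = Counter(degree.values())

n_atoms = len(degree)
brute_force = factorial(n_atoms)                               # N!
after_giap = prod(factorial(n) for n in class_sizes.values())  # n1! n2! ... nk!

print(dict(class_sizes))
print(brute_force, after_giap)
```

Labels are then permuted only within each class, which is where the product of factorials comes from.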
[Structures 11 and 12: 3-methylhexane (11) and its partitioning into distance sum classes (12)]

Fig. 5.1-4 The cooperative labeling of 1-isopropyl-2-methylcyclohexane
This number can be further reduced by the use of more powerful vertex invariants, such as the distance sum. The distance sum, DS_i, of a vertex v_i is the sum of the topological distances from vertex v_i to all other vertices in the molecular graph. The partitioning in DS equivalence classes of the atoms from 3-methylhexane is presented in 12, showing that all atoms have distinct DS values. Using this partitioning, the constitutional symmetry of 3-methylhexane can be determined with one labeling. The use of vertex graph invariants in determining the constitutional symmetry is hampered by the degeneracy (i.e., two or more topologically distinct vertices have identical numerical values for a certain vertex graph invariant) of almost all known graph invariants.

The cooperative labeling used by Morgan in his canonical coding algorithm [8] restricts the enormous number of labelings investigated in order to find the canonical code of graph G. In this procedure an arbitrary vertex of G is selected and labeled by 1. The r vertices adjacent with vertex 1 are labeled by 2, 3, ..., r + 1 in an arbitrary combination. In the next step the vertex with index 2 is considered and its s adjacent vertices (not labeled yet) are labeled by r + 2, r + 3, ..., r + s + 1. The labeling process continues until all vertices are labeled. The process of cooperative labeling is illustrated in Figure 5.1-4 for 1-isopropyl-2-methylcyclohexane. The use of the cooperative indexing can further reduce the number of labelings generated in the process of canonical coding by stopping the exploration of a branch as soon as it becomes clear that the corresponding code is not the canonical one. Suppose that the minimal code of a chemical structure is searched for; the code is generated during the labeling process and at each step it is compared with the current minimal code stored.
If the investigated labeling gives a partial code that is greater than the minimal one, the branch is not investigated further and another branch is explored. Many coding algorithms used in chemistry use a combination of cooperative labeling and atom partitioning with a discriminant graph invariant. An algorithm for canonical coding can be separated into two steps:
1. GIAP: compute a discriminant atom invariant and establish with it an initial atom partitioning.
2. CCAP: using the atom partitioning established in the first step, identify the canonical code by investigating all automorphism permutations.
The use of the GIAP step is based on the property that two atoms with distinct values for the same invariant cannot be automorphic. On the other hand, the assumption that atoms in the same GIAP class are automorphic is not correct. Carhart pointed out that a rigorous canonical coding and constitutional symmetry perception algorithm must contain both GIAP and CCAP steps [9]. The same idea was formulated for the problem of graph isomorphism by Read and Corneil [2]. Even after these warnings, the illusion of obtaining a "better" and "faster" canonical coding algorithm by eliminating the step CCAP continued and one can still find in the literature papers that propose algorithms for the computation of vertex invariants, with the wrong assumption that vertices with identical values of the invariant are automorphic and belong to the same orbit [10-13]. Even when such AIC algorithms give vertex partitionings that coincide with the automorphism partitioning for a certain set of graphs, this fact is not a demonstration that the algorithm can generate the automorphism partitioning for any graph. This type of "canonical coding" algorithm is incomplete, and its use in a chemical database, in synthesis design or structure elucidation systems is unreliable.
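The distance sum invariant used for 3-methylhexane in this section can be computed with a breadth-first search from every atom. As before, the atom numbering (chain atoms 1-6, methyl carbon 7 on atom 3) and the function name `distance_sums` are assumptions of this sketch.

```python
from collections import deque

# Carbon skeleton of 3-methylhexane (assumed numbering).
bonds = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (3, 7)]

neighbors = {}
for u, v in bonds:
    neighbors.setdefault(u, []).append(v)
    neighbors.setdefault(v, []).append(u)


def distance_sums(neighbors):
    """Return {atom: sum of topological distances to all other atoms}."""
    sums = {}
    for start in neighbors:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            atom = queue.popleft()
            for nb in neighbors[atom]:
                if nb not in dist:
                    dist[nb] = dist[atom] + 1
                    queue.append(nb)
        sums[start] = sum(dist.values())
    return sums


ds = distance_sums(neighbors)
print(ds)
print(len(set(ds.values())) == len(ds))  # all DS values distinct?
```

Distinct values place every atom in its own class, which is consistent with a single labeling being sufficient here. As the section stresses, equal values would not by themselves prove that two atoms are automorphic.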
5.1.5 The MORGAN Algorithm
The extended connectivity algorithm (EC) defined by Morgan is the first efficient algorithm for the partitioning of atoms in equivalence classes [8]. In a modified form, it is used at the CAS in the chemical registry system in order to generate a unique, canonical code of a chemical compound. The EC algorithm is the basis of an important branch of methods for graph partitioning used in various coding methods. The EC algorithm attempts to make a graph partitioning by considering the connectivity of adjacent atoms:

1. Set the level 1 EC of each atom to the value of its degree.
2. Determine the number of different EC^1 values, NECV^1.
3. The level n + 1 EC of each atom is equal to the sum of EC^n values of the adjacent atoms.
4. Determine NECV^(n+1).
5. If NECV^(n+1) > NECV^n go to step (3).
6. The EC^n values are the final ones.

While the EC algorithm presented by Morgan does not always allow the complete classification of atoms in equivalence classes, it gives a good starting point for the generation of the canonical code. The calculation of the extended connectivity values is illustrated in Figure 5.1-5 for the molecular graph of ethylcyclopentane. Level 1 to 4 EC are presented in Figure 5.1-5b-e.

Fig. 5.1-5 Computation of the extended connectivity values with the Morgan algorithm for ethylcyclopentane (a). The level 1-4 EC values are presented in diagrams b-e

The final partition induced by the EC values is used for the cooperative labeling of the molecular graph:

1. The atom with the highest EC value is labeled with 1 and is considered the current atom.
2. If the current atom has any unlabeled adjacent atom then assign the next label to it. If the current atom has more than one unlabeled adjacent atom, select the one with the highest EC value.
3. Repeat step (2) until all atoms adjacent to the current atom are labeled.
4. Stop if the graph is completely labeled, else increase the label of the current atom and go to step (2).

Using the above rules and the EC values from Figure 5.1-5e one obtains two cooperative labelings of ethylcyclopentane, namely 13 and 14. The Morgan EC procedure described above is able to reduce the search for a canonical labeling of ethylcyclopentane from 7! = 5040 to only two labelings.
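The six EC steps listed above translate almost line by line into code. The sketch below is a minimal illustration; the function name `morgan_ec` and the atom letters are assumptions (a = CH3, b = CH2, c = substituted ring atom, d and e = its ring neighbors, f and g = the remaining ring atoms), chosen to agree with the orbits quoted for ethylcyclopentane.

```python
def morgan_ec(neighbors):
    """Iterate extended connectivity until the number of distinct values stops growing.

    Returns the EC^n values of the last level that still increased the
    number of classes (step 6 of the procedure).
    """
    ec = {v: len(nbs) for v, nbs in neighbors.items()}  # step 1: degrees
    n_classes = len(set(ec.values()))                   # step 2
    while True:
        # step 3: sum the current EC values of the adjacent atoms
        new_ec = {v: sum(ec[u] for u in neighbors[v]) for v in neighbors}
        new_classes = len(set(new_ec.values()))          # step 4
        if new_classes <= n_classes:                     # step 5 fails
            return ec                                    # step 6
        ec, n_classes = new_ec, new_classes


# Ethylcyclopentane (assumed atom letters, see above).
ethylcyclopentane = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d", "e"],
    "d": ["c", "f"],
    "e": ["c", "g"],
    "f": ["d", "g"],
    "g": ["e", "f"],
}

ec = morgan_ec(ethylcyclopentane)
print(ec)
print(len(set(ec.values())), "classes")
```

For this graph the classes found happen to coincide with the orbits, but as the chapter emphasizes, that coincidence is not guaranteed for other graphs.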
[Structures 13 and 14: the two cooperative labelings of ethylcyclopentane]
In the final stage of the Morgan algorithm the molecular structure is transformed into a linear code formed by several lists. The FROM list contains for every atom the label of the atom from which it was labeled. This list describes a spanning tree of the molecular graph. The RING-CLOSURE list defines the ring closure bonds by the labels of the atoms connected by the bonds. The atom types are specified in the ATOM-TYPE list following the order of their labels. The BOND-TYPE list gives the bond types in the order in which the bonds were defined in the FROM and RING-CLOSURE lists. Morgan defined as canonical code the minimal code obtained from the above four lists. The canonical code is determined with a CCAP procedure, with a rigorous examination of all cooperative labelings that can be canonical. The two cooperative labelings 13 and 14 give two identical Morgan codes, representing the canonical code of ethylcyclopentane:

FROM: 1 1 1 2 3 4
RING-CLOSURE: 5 6
ATOM-TYPE: C C C C C C C
BOND-TYPE: 1 1 1 1 1 1 1

Because both labelings 13 and 14 give identical codes, two vertices with the same label are automorphic. This property allows one to generate the orbits of ethylcyclopentane, using the notation from Figure 5.1-5a: X1 = {a}, X2 = {b}, X3 = {c}, X4 = {d, e}, X5 = {f, g}.
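The FROM and RING-CLOSURE lists quoted above can be regenerated with a short sketch of the cooperative labeling. Everything named here is illustrative: the function `cooperative_labeling`, the `tie_order` argument used to resolve equal EC values, and the EC dictionary, whose numbers are those obtained by iterating the EC rule on ethylcyclopentane (an assumption about the values shown in the figure). The sketch assumes a connected graph.

```python
def cooperative_labeling(neighbors, ec, tie_order):
    """Label atoms cooperatively, preferring high EC values.

    Returns (label, from_list, ring_closures). `tie_order` is a string that
    fixes the choice between atoms with equal EC values.
    """
    start = max(neighbors, key=lambda v: ec[v])
    label = {start: 1}
    order = [start]  # order[k - 1] is the atom carrying label k
    from_list = []
    tree_bonds = set()
    current = 0
    while len(order) < len(neighbors):
        atom = order[current]
        unlabeled = [u for u in neighbors[atom] if u not in label]
        unlabeled.sort(key=lambda u: (-ec[u], tie_order.index(u)))
        for u in unlabeled:
            order.append(u)
            label[u] = len(order)
            from_list.append(label[atom])
            tree_bonds.add(frozenset((atom, u)))
        current += 1
    # Bonds that are not part of the spanning tree close rings.
    ring_closures = sorted(
        tuple(sorted((label[u], label[v])))
        for u in neighbors
        for v in neighbors[u]
        if u < v and frozenset((u, v)) not in tree_bonds
    )
    return label, from_list, ring_closures


ethylcyclopentane = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"],
    "d": ["c", "f"], "e": ["c", "g"], "f": ["d", "g"], "g": ["e", "f"],
}
ec = {"a": 4, "b": 8, "c": 14, "d": 10, "e": 10, "f": 9, "g": 9}

label_1, from_1, rings_1 = cooperative_labeling(ethylcyclopentane, ec, "abcdefg")
label_2, from_2, rings_2 = cooperative_labeling(ethylcyclopentane, ec, "gfedcba")

# Atoms that carry the same label in labelings with identical codes share an orbit.
orbits = {
    frozenset(a for lab in (label_1, label_2) for a, k in lab.items() if k == n)
    for n in range(1, 8)
}

print(from_1, rings_1)
print(from_1 == from_2 and rings_1 == rings_2)
print(sorted(sorted(o) for o in orbits))
```

The two values of `tie_order` play the role of the two cooperative labelings; a complete CCAP step would examine every way of resolving the ties, not just two.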
5.1.6
The Augmented Connectivity Molecular Formula
CAS uses a chemical registry system, a computer system for the unique representation of the molecular structure in a connection table. The chemical registry system generates the canonical connection table with the algorithm defined by Morgan. The CAS registry system also uses a compact and easily calculated numerical representation of a chemical structure called the augmented connectivity molecular formula (ACMF) [14]. The ACMF molecular representation is a quick and simple way to determine if a compound is absent from the database: if the ACMF of a compound is not found in the database, the compound is new. On the other hand, because the ACMF is not a canonical representation of the molecular structure, two different molecules can have identical ACMFs. The ACMF algorithm consists of the following steps:

1. Assign to each atom in the molecular graph a level 1 value that depends on its chemical nature. The elements are considered in alphabetic order, each element having a value Av between 32 and 240: actinium, 32; carbon, 60; hydrogen, 106; lutetium, 130; lawrencium, 132; oxygen, 158; zirconium, 240.
2. Assign to each bond in the molecular graph a value that depends on its type. The bonds are numerically characterized by bond values Bv which are prime numbers: cyclic single bond, 3; cyclic double bond, 5; cyclic tautomer bond, 7; cyclic delocalized bond, 11; cyclic alternating bond, 13; cyclic triple bond, 17; acyclic single bond, 19; acyclic double bond, 23; acyclic tautomer bond, 29; acyclic delocalized bond, 31; acyclic triple bond, 37.
3. Assign to each atom in the molecular graph a level 2 augmented connectivity value equal to (Eq. (3)):

\[
AC_i^{2} = \sum_{j} Av(v_j)\, Bv(e_{ij}) \tag{3}
\]

where the summation goes over all atoms j adjacent with atom i, Av(v_j) represents the Av parameter of atom j and Bv(e_ij) represents the Bv parameter of the bond between atoms i and j.
4. The AC^(n+1) values are obtained from the set of AC^n values (Eq. (4)):

\[
AC_i^{n+1} = \sum_{j} AC_j^{n} \tag{4}
\]

where the summation goes over all atoms j adjacent with atom i.
5. If n < 4, increment n by 1 and go to step (4).
6. Denote by NAC^n the number of distinct AC^n values; if NAC^n < NAC^(n+1), increment n by 1 and go to step (4).
7. Assign as final atomic value the AC^n value.

In the following steps the ACMF linear representation of the molecular structure is obtained on the basis of AC^n values by concatenating them and generating a 64-bit hash code that represents the final ACMF description of the chemical structure. The calculation of the augmented connectivity values is illustrated in Figure 5.1-6 for the molecular graph of 2-cyclopenten-1-ylacetic acid. The level 1 values (Av parameters) associated with each non-hydrogen atom are presented in Figure 5.1-6a, the Bv parameters are presented in Figure 5.1-6b, while Figure 5.1-6c presents the AC^2 values computed by summing the products of level 1 atom values and the corresponding bond value for all adjacent atoms. Level 3, 4, and 5 AC values are computed by summing the AC values of the previous level for all adjacent atoms; the corresponding diagrams are presented in Figure 5.1-6d-f. The number of distinct values at level 1, NAC^1, is 2 and their number increases to 8 for NAC^4 and NAC^5. Because NAC^4 = NAC^5 the algorithm converged to a stable partition and the AC values from level 4 are used to generate the ACMF linear representation of 2-cyclopenten-1-ylacetic acid. The ACMF algorithm is not able to distinguish all non-equivalent atoms, as computed for the two oxygen atoms in Figure 5.1-6d-f, although at level 2 they are characterized by different AC^2 values.
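Steps 1-4 of the ACMF procedure can be sketched for 2-cyclopenten-1-ylacetic acid as follows. This is an illustration under stated assumptions, not the CAS implementation: the atom names (C1-C5 ring, C6 = CH2, C7 = carboxyl carbon, O8 = carbonyl oxygen, O9 = hydroxyl oxygen) are invented for the example, the carboxyl bonds are treated as ordinary acyclic double and single bonds, and the hashing of the final values is omitted.

```python
# Level 1 atom values (Av) and bond values (Bv) quoted in the text.
AV = {"C": 60, "O": 158}
BV = {
    "cyclic single": 3,
    "cyclic double": 5,
    "acyclic single": 19,
    "acyclic double": 23,
}

# Non-hydrogen skeleton of 2-cyclopenten-1-ylacetic acid (assumed atom names).
elements = {
    "C1": "C", "C2": "C", "C3": "C", "C4": "C", "C5": "C",
    "C6": "C", "C7": "C", "O8": "O", "O9": "O",
}
bonds = [
    ("C1", "C2", "cyclic single"),
    ("C2", "C3", "cyclic double"),
    ("C3", "C4", "cyclic single"),
    ("C4", "C5", "cyclic single"),
    ("C5", "C1", "cyclic single"),
    ("C1", "C6", "acyclic single"),
    ("C6", "C7", "acyclic single"),
    ("C7", "O8", "acyclic double"),
    ("C7", "O9", "acyclic single"),
]


def augmented_connectivity(elements, bonds, max_level):
    """Return {level: {atom: AC value}} for levels 1..max_level."""
    neighbors = {a: [] for a in elements}
    for a, b, kind in bonds:
        neighbors[a].append((b, BV[kind]))
        neighbors[b].append((a, BV[kind]))
    levels = {1: {a: AV[el] for a, el in elements.items()}}
    # Eq. (3): level 2 weights each neighbor's Av by the bond value.
    levels[2] = {
        a: sum(levels[1][b] * bv for b, bv in neighbors[a]) for a in elements
    }
    # Eq. (4): higher levels are plain sums over the adjacent atoms.
    for n in range(2, max_level):
        levels[n + 1] = {
            a: sum(levels[n][b] for b, _ in neighbors[a]) for a in elements
        }
    return levels


ac = augmented_connectivity(elements, bonds, max_level=5)
for level in sorted(ac):
    print(level, ac[level])
```

At level 2 the two oxygen atoms differ because their bonds to C7 differ; from level 3 onward both inherit the single value of their only neighbor, which is the degeneracy noted at the end of the paragraph above.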
[Structures 15-17: three (CH)8 isomers]

Fig. 5.1-6 Computation of the augmented connectivity molecular formula for 2-cyclopenten-1-ylacetic acid
Because the ACMF algorithm considers only structural information of adjacent atoms, all atoms in a cubic molecular graph will have identical AC^k values, for all k levels. Therefore, the (CH)2k saturated valence isomers of annulenes, represented by cubic molecular graphs, will have identical augmented connectivity for all atoms, giving an identical ACMF representation for the whole set of (CH)2k isomers with a given k. The three (CH)8 isomers 15, 16, and 17 are characterized by identical ACMFs, showing an important limitation of this procedure. Another important class of chemical compounds, fullerenes, are not discriminated by the ACMF algorithm. In a fullerene all carbon atoms are characterized by identical AC values, and as a consequence all fullerene isomers with a given number of carbon atoms have identical ACMFs. This problem is common to all GIAP algorithms derived from the EC procedure that use only the connectivity information of adjacent atoms. Although the ACMF is not a unique molecular representation, it has a good discriminating power: in the CAS database, which contains several million chemical compounds, less than 0.04% of them represent distinct substances with identical ACMFs. ACMF is a molecular graph invariant, whose value is independent of a particular labeling. In the CAS Chemical Registry System the ACMF is used to determine if a chemical compound is new, i.e., if its ACMF is not found in the database. This algorithm represents a practical tool for the selection from a chemical database of the compounds that have identical ACMFs with a potential new compound, and the comparison of the unique canonical code is performed only for this small subset of molecules from the entire database.
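The degeneracy for cubic graphs is easy to demonstrate. The sketch below compares the cube graph (the skeleton of cubane) with a second cubic graph on eight vertices, the Wagner graph, which is used here purely as a convenient non-isomorphic example and is not claimed to be one of the structures 15-17. All atoms are given the carbon value 60 and all bonds the cyclic single bond value 3; the function names are hypothetical.

```python
from collections import deque


def ac_values(n_atoms, edges, max_level, av=60, bv=3):
    """Sorted augmented connectivity values per level for an all-carbon graph."""
    nbs = {v: [] for v in range(n_atoms)}
    for u, v in edges:
        nbs[u].append(v)
        nbs[v].append(u)
    level = {v: sum(av * bv for _ in nbs[v]) for v in nbs}  # level 2, Eq. (3)
    result = {2: sorted(level.values())}
    for n in range(3, max_level + 1):                       # Eq. (4)
        level = {v: sum(level[u] for u in nbs[v]) for v in nbs}
        result[n] = sorted(level.values())
    return result


def is_bipartite(n_atoms, edges):
    """Two-color the graph by breadth-first search (used only to tell the graphs apart)."""
    nbs = {v: [] for v in range(n_atoms)}
    for u, v in edges:
        nbs[u].append(v)
        nbs[v].append(u)
    color = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in nbs[u]:
            if v not in color:
                color[v] = 1 - color[u]
                queue.append(v)
            elif color[v] == color[u]:
                return False
    return True


# Cube graph: vertices are 3-bit numbers, bonds join numbers differing in one bit.
cube = [(i, i ^ (1 << b)) for i in range(8) for b in range(3) if i < i ^ (1 << b)]
# Wagner graph: an 8-membered ring plus bonds between opposite ring atoms.
wagner = [(i, (i + 1) % 8) for i in range(8)] + [(i, i + 4) for i in range(4)]

print(ac_values(8, cube, 5) == ac_values(8, wagner, 5))   # identical AC values
print(is_bipartite(8, cube), is_bipartite(8, wagner))     # yet different graphs
```

Every atom receives 540 at level 2 and three times the previous value at each further level, so no number of iterations can separate the two graphs.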
5.1.7
Modifications of the Extended Connectivity Algorithm
The EC method is by far the most investigated GIAP algorithm, and many chemical information systems use this atom partitioning method in various implementations that improve its discriminant power. Wipke and Dyott modified the EC algorithm and used it in their Simulation and Evaluation of Chemical Synthesis (SECS) program [15]. In this implementation the EC values are computed as in the Morgan definition, except for primary atoms, whose EC value is 1. The coding system proposed by Wipke and Dyott, called the stereochemical extension of the Morgan algorithm (SEMA), has two strings coding the stereochemical information: the double bond configuration list and the atom configuration list. The canonical code is formed by appending all lists linearly, and it gives distinct names for stereoisomers.
The GIAP method defined by Morgan sometimes fails to distinguish between topologically non-equivalent atoms. Some causes of the degeneracy of EC descriptors were identified and some improved algorithms were defined. Consider the EC values of the atoms b and f in ethylcyclopentane (Figure 5.1-5a). Although the two atoms are topologically distinct, they are characterized by identical EC values in all iterations. By inspecting the EC diagrams in Figure 5.1-5b-e it is clear that the degenerate EC values are obtained by summing different terms, e.g., EC2b = 1 + 3 = 4 and EC2f = 2 + 2 = 4. The canonical coding and constitutional symmetry algorithm developed by Shelley and Munk [16, 17] and used in the CASE (computer-assisted structure elucidation) system replaces the sum of the EC values of the adjacent atoms with an ordered list of those EC values. Their algorithm is described below:
1. Assign a class identifier (CI) to each non-hydrogen atom. The CI value is a two-digit integer, the first digit being equal to the degree of the atom and the second designating the atom type, i.e., C = 2, N = 3, O = 4.
2. Count the number of different CI values, NCI, and assign new CI values between one and NCI to each atom. The atoms with the smallest CI value receive label one and the atoms with the largest CI value receive label NCI.
3. If NCI is equal to the number of non-hydrogen atoms then go to (7).
4. Assign a trial class identifier (TCI) to each atom. The TCI is a string of five two-digit integers with the leftmost field containing the CI of the respective atom. The remaining four fields contain a list of the CIs of the adjacent atoms in descending order from right to left. If there are fewer than four adjacent atoms then the list is zero filled.
5. Count the number of different TCI values (NTCI) and assign new TCI values between one and NTCI to each atom: the atoms with the smallest TCI value receive label one and the atoms with the largest TCI value receive label NTCI.
6. If NTCI is not greater than NCI then go to (7), else set the CI of each atom to its TCI value and set NCI to NTCI.
7. The GIAP is finished.
A similar algorithm, HOC (hierarchically ordered extended connectivities), was proposed by Balaban, Mekenyan and Bonchev [18]. The HOC algorithm also considers the stereochemical information [19], and is followed by a CCAP algorithm that provides the canonical code. Moreau defined a discriminant EC algorithm, but disregarded the CCAP procedure [10]. Another implementation of the Morgan algorithm was proposed by Figueras [20]. A modified Morgan algorithm was used for the determination of topological equivalence classes of atoms and bonds in C20–C60 fullerenes [21].
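The class-identifier refinement of Shelley and Munk listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `shelley_munk_classes`, the 0-based atom numbering, and the use of tuples instead of strings of two-digit fields are choices made here, and the sketch assumes that step 6 loops back to step 3 when the partition has been refined. The neighbor CIs are sorted in descending order left to right, which yields the same partition as any other fixed ordering.

```python
def shelley_munk_classes(elements, bonds):
    """Partition non-hydrogen atoms into classes by iterated class identifiers."""
    type_code = {"C": 2, "N": 3, "O": 4}           # atom-type digit from step 1
    n = len(elements)
    nbrs = [[] for _ in range(n)]
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def rank(values):
        # Replace raw identifiers by labels 1..k (smallest value gets label 1).
        order = {v: k + 1 for k, v in enumerate(sorted(set(values)))}
        return [order[v] for v in values], len(order)

    # Steps 1-2: two-digit CI = degree digit followed by atom-type digit.
    ci, nci = rank([10 * len(nbrs[i]) + type_code[elements[i]] for i in range(n)])
    while nci < n:                                  # step 3
        # Step 4: own CI followed by the neighbour CIs, sorted and zero filled.
        tci = []
        for i in range(n):
            adj = sorted((ci[j] for j in nbrs[i]), reverse=True)
            adj += [0] * (4 - len(adj))
            tci.append((ci[i], *adj))
        tci, ntci = rank(tci)                       # step 5
        if ntci <= nci:                             # step 6: partition is stable
            break
        ci, nci = tci, ntci                         # assumed loop back to step 3
    return ci                                       # step 7


# Ethylcyclopentane, atoms numbered CH3=0, CH2=1, ring CH=2, ring CH2=3..6.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 2)]
print(shelley_munk_classes(["C"] * 7, bonds))       # [1, 3, 5, 4, 2, 2, 4]
```

In this numbering the side-chain CH2 (atom 1) and the ring atoms farthest from the substituent (atoms 4 and 5) end up in different classes, because their ordered neighbor lists differ even where the sums of the neighbor values coincide.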
5.1.8 Other Symmetry Perception Algorithms
Many other GIAP and canonical coding algorithms have been proposed in both the chemical and the mathematical literature. A succinct enumeration of them is presented in this section.
For the computer manipulation of organic reactions and reaction mechanisms, Fujita defined the concept of the imaginary transition structure (ITS), which is a structural formula obtained by the superposition of the molecules of the reagents and products [22, 23]. Each ITS can be numerically represented as a connection table and transformed into a canonical code which can be manipulated by a computer and stored in databases. A complete algorithm for partitioning the atoms in a molecule into equivalence classes was introduced by Jochum and Gasteiger using the NOON (number of outermost occupied neighbor) sphere of an atom, defined as the minimal number of neighbor spheres necessary to accommodate all atoms of a molecule starting at that atom [24]. A related algorithm, the atom environment matrix method, was developed by Bersohn for partitioning the atoms into equivalence classes [25].
A group of efficient canonical coding algorithms was developed starting from the adjacency matrix. The adjacency matrix of a molecular graph is not a graph invariant, because it depends on the particular labeling of the chemical structure. Randić defined the UAC (upper adjacency code) notation of the labeled graph G as a string composed of the N(N - 1)/2 digits of the upper triangle of the adjacency matrix A: $\mathrm{UAC} = (a_{12}\, a_{13}\, a_{14} \ldots a_{1N}\, a_{23} \ldots a_{N-1,N})$ [26, 27]. From the UAC code the adjacency matrix can be unambiguously reconstructed. Different labelings of the molecular graph generate different UAC strings. SUAC (smallest upper adjacency code) is a unique representation of the adjacency matrix, which was proposed by Randić as an efficient method for canonical labeling and perception of constitutional symmetry.
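To make the UAC and SUAC definitions concrete, the following minimal sketch builds the upper adjacency code for a given labeling and finds the smallest one by brute force over all N! labelings. The function names `uac` and `suac` are hypothetical, and the exhaustive search is only an illustration of the definition; it is not the search strategy of the published SUAC algorithm.

```python
from itertools import permutations


def uac(adj, order):
    """Upper adjacency code for the labeling in which new label k is old atom order[k]."""
    n = len(order)
    return "".join(str(adj[order[i]][order[j]])
                   for i in range(n) for j in range(i + 1, n))


def suac(adj):
    """Smallest UAC over all labelings (exhaustive, so only for tiny graphs)."""
    return min(uac(adj, p) for p in permutations(range(len(adj))))


# Three-atom chain (e.g. the carbon skeleton of propane), central atom = 1.
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
print(uac(chain, (0, 1, 2)))   # 101
print(suac(chain))             # 011
```

Because all codes have the same length, comparing the digit strings lexicographically is equivalent to comparing them as binary numbers.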
The system Chemics, developed for the automated structure elucidation of organic compounds, uses the connectivity stack to generate all possible chemical structures that are consistent with given structural information [28]. The connectivity stack uses a notation for each substructure, e.g., C3 for a methyl and LC for chlorine, and with a set of rules the molecule is linearly coded into a list of substructures and a list of connections. The canonical code is defined as the labeling which offers the maximal linear representation of the upper triangle of the adjacency matrix. The Syngen synthesis design program developed by Hendrickson uses another molecular code generated from UAC, namely the greatest upper adjacency code (GUAC), which is a canonical labeling of the molecular graph giving the maximal UAC representation of the upper triangle of the adjacency matrix [29]. GUAC is used to establish a catalog of available starting material molecules and to give a unique representation to the molecules generated by the program. In practical use, there are some advantages in using the GUAC over the SUAC code because the GUAC labeling is cooperative and there are fewer atoms with a maximal valence than atoms with a minimal valence. Therefore GUAC makes fewer trials than SUAC, requires less time to code the graph, and uses less memory.
Kvasnicka and Pospichal introduced an algorithm of canonical indexing for the non-redundant and exhaustive constructive enumeration of graphs [30, 31]. Their algorithm uses the LAC (lower adjacency code) notation of the labeled graph G, which is a string composed of the N(N + 1)/2 digits of the lower triangle of the adjacency matrix A: $\mathrm{LAC} = (a_{11}\, a_{21}\, a_{22} \ldots a_{N1} \ldots a_{NN})$. Unlike UAC, the LAC notation contains the diagonal of the adjacency matrix. The greatest lower adjacency code (GLAC) is generated by the canonical labelings of G that give a maximal LAC.
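The maximal codes GUAC and GLAC can be illustrated in the same brute-force manner. The sketch below is an assumption-laden toy: the names `upper_code`, `lower_code`, `guac` and `glac` are invented here, and the exhaustive maximization over all labelings stands in for the much more economical cooperative labeling used by the actual programs.

```python
from itertools import permutations


def upper_code(adj, order):
    """UAC: digits a_ij with i < j, row by row, for the labeling `order`."""
    n = len(order)
    return "".join(str(adj[order[i]][order[j]])
                   for i in range(n) for j in range(i + 1, n))


def lower_code(adj, order):
    """LAC: digits a_ij with j <= i (diagonal included), row by row."""
    n = len(order)
    return "".join(str(adj[order[i]][order[j]])
                   for i in range(n) for j in range(i + 1))


def guac(adj):
    return max(upper_code(adj, p) for p in permutations(range(len(adj))))


def glac(adj):
    return max(lower_code(adj, p) for p in permutations(range(len(adj))))


chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
print(guac(chain))                    # 110
print(lower_code(chain, (0, 1, 2)))   # 010010
print(glac(chain))                    # 010100
```

In both maximal codes the atom of highest degree receives label 1, which matches the remark that a maximal code tends to start from the few atoms of maximal valence.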
To obtain canonical codes of fullerenes, Liu and Klein introduced a vertex invariant defined on the basis of matrix spectral theory [32]. Denote by λ an eigenvalue of the adjacency matrix A and by |sλ⟩ the corresponding set of orthonormal eigenvectors, where s is a degeneracy index with values from 1 to the degeneracy g(λ) of the eigenvalue λ. Then (Eq. (5)):
and the projection operator for the ith eigenspace is (Eq. (6)):
where ⟨sλ| is the row vector corresponding to the column vector |sλ⟩, and ⟨a|b⟩ represents the inner product between two vectors |a⟩ and |b⟩. The invariant T, with row i representing the vertex i and column λ representing the eigenvalue λ, is (Eq. (7)):
where |i⟩ is a vector with the ith element equal to 1 and all others equal to 0. The rows of the matrix T are ordered lexicographically to obtain a vertex invariant.
A hash code is a fixed-length representation of a data structure used as an index or key to a direct access file. The input structure cannot be restored from a hash code, and due to the limited range of values two different data structures may be represented by the same hash code. Ihlenfeldt and Gasteiger proposed to represent chemical structures with hash codes by using a hierarchical algorithm: atom hash codes are computed first and merged into molecule hash codes, and the molecule hash codes are combined to give a molecular ensemble hash code [33]. The input of the algorithm is a connection table, and a prime number table is used to select initial values for atoms. The atom hash codes contain information on the number of hydrogen and non-hydrogen neighbor atoms, the atomic number Z, the number of atoms in the molecule modulo 257, stereochemical descriptors, isotope information, π-system size, and charge. In the next step the hash codes of the atoms are combined with the hash codes of the neighbors by rotation and exclusive-or operations applied to the 64-bit representation. Using simple bit operations, atom hash codes are combined to yield molecule and ensemble hash codes. The hash codes were used to classify chemical catalogs, in the Wodca synthesis planning system [34], in the Eros 6 reaction prediction program [35, 36], and in the Massimo mass spectra prediction system [37].
Uchino used the matrix multiplication method for obtaining the canonical code and automorphisms of a molecular graph [38]. He considered the adjacency, distance, and open walks matrices in a series of efficient algorithms that offer the automorphism partition of graphs. The incidence matrix of molecular hypergraphs was used for the canonical coding of non-classical molecular structures with polycentric delocalized bonds [39, 40].
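The hierarchical hashing idea described above (atom codes seeded from a prime table, refined with the codes of the neighbors by rotation and exclusive-or on 64-bit values, then merged into a molecule code) can be illustrated with a toy sketch. Everything specific in it is an assumption made for illustration and not the published algorithm: the seed formula, the multiplier constant, the rotation amounts, the number of rounds, and the decision to fold the codes in sorted order so that the result does not depend on atom numbering. Stereochemistry, isotopes, charge and hydrogen counts are ignored.

```python
MASK64 = (1 << 64) - 1
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61]


def rotl64(x, r):
    """Rotate a 64-bit value left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def atom_seed(z, degree, n_atoms):
    # Hypothetical seed: a prime chosen by atomic number, mixed with the
    # degree and with the atom count modulo 257.
    prime = PRIMES[z % len(PRIMES)]
    return (prime * 0x9E3779B97F4A7C15 + 1000003 * degree + n_atoms % 257) & MASK64


def molecule_hash(atomic_numbers, bonds, rounds=3):
    n = len(atomic_numbers)
    nbrs = [[] for _ in range(n)]
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)
    codes = [atom_seed(z, len(nbrs[i]), n) for i, z in enumerate(atomic_numbers)]
    for _ in range(rounds):
        new_codes = []
        for i in range(n):
            acc = rotl64(codes[i], 7)
            # Fold in neighbour codes in sorted order (labeling independent).
            for c in sorted(codes[j] for j in nbrs[i]):
                acc = rotl64(acc, 13) ^ c
            new_codes.append(acc)
        codes = new_codes
    total = 0
    for c in sorted(codes):            # merge atom codes into a molecule code
        total = rotl64(total, 5) ^ c
    return total


# Ethanol heavy atoms, numbered C-C-O and O-C-C: same molecule, same code.
print(molecule_hash([6, 6, 8], [(0, 1), (1, 2)]) ==
      molecule_hash([8, 6, 6], [(0, 1), (1, 2)]))   # True
```

As the text notes, such a code is not invertible and distinct structures may collide, so in practice a hash match is a filter that precedes a full structure comparison.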
The canonical adjacency matrix was used in the MOLGEN program for the generation of constitutional and configurational isomers [41]. Projection operators of graph spectra were proposed for canonical coding [42]. The canonical coding algorithm of Schubert and Ugi [43] was implemented in synthesis design systems [44, 45]. Graph-matching algorithms were found useful for automorphism, isomorphism, and maximal common substructure determination [46, 47]. The algorithms developed for determining constitutional symmetry were extended to obtain the three-dimensional symmetry [48]. New results were obtained for the coding of configurational isomers [49]. Faulon developed polynomial-time algorithms for the problems of isomorphism, automorphism partitioning, and canonical labeling of molecular graphs, which represent a particular class of graphs for which these problems can be efficiently solved [50]. Satoh et al. introduced a canonical coding method for representing three-dimensional structures, CAST (canonical representation of stereochemistry), which can differentiate between conformers, enantiomers, or diastereomers [51, 52].
The illusory enterprise of obtaining a canonical coding algorithm that contains only the GIAP step continues, and Ouyang et al. have proposed a modified Morgan EC algorithm for "topological symmetry perception and unique numbering" of atoms [53]. However, their algorithm is a simple graph invariant atom partitioning method, without the CCAP step, which makes it unfit for the "canonical coding algorithm" label. Even though such algorithms are able to correctly identify the canonical labeling of a large number of graphs, they are not complete and foolproof. Also, Ouyang et al. continue the series of false assertions stating that the Morgan algorithm is not suitable for canonical labeling of molecular graphs [53]. In fact, the original Morgan algorithm contains the two steps, GIAP and CCAP, required for a canonical coding algorithm [8]: the GIAP step consists of the EC algorithm, while in the CCAP step the final EC atom partition is used to generate all possible connection tables, and the labeling that gives the minimal code represents the canonical labeling. Although the EC algorithm is not very efficient, overall the Morgan algorithm is demonstrated to always give the canonical code.
Contreras et al. developed a stereoisomer generation system, Camgec2, using a linear coding system that also encodes the stereo information [54]. A problem related to graph isomorphism is the detection of the maximal common substructure between two molecules. Raymond et al. proposed a molecular similarity measure based on the maximal common edge subgraph [55].
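The two-step structure attributed above to the Morgan algorithm, a GIAP step followed by a CCAP step, can be sketched as follows. This is a simplified illustration, not Morgan's original procedure: the EC iteration stops when the number of distinct values no longer increases, atoms are then grouped into classes of decreasing EC value, and the CCAP step tries every labeling that permutes atoms only within their class, keeping the minimal code. As an assumption of this sketch, the code compared is the upper triangle of the adjacency matrix rather than a Morgan connection table, and the function names are hypothetical.

```python
from itertools import permutations, product


def extended_connectivity(nbrs):
    """GIAP step: iterate EC sums while the number of classes keeps growing."""
    ec = [len(a) for a in nbrs]
    k = len(set(ec))
    while True:
        new = [sum(ec[j] for j in nbrs[i]) for i in range(len(nbrs))]
        k_new = len(set(new))
        if k_new <= k:
            return ec
        ec, k = new, k_new


def canonical_code(n, bonds):
    """CCAP step: minimal code over all labelings compatible with the EC partition."""
    nbrs = [set() for _ in range(n)]
    for i, j in bonds:
        nbrs[i].add(j)
        nbrs[j].add(i)
    ec = extended_connectivity(nbrs)
    classes = {}
    for atom, value in enumerate(ec):
        classes.setdefault(value, []).append(atom)
    blocks = [classes[v] for v in sorted(classes, reverse=True)]
    best = None
    for choice in product(*(permutations(b) for b in blocks)):
        order = [atom for block in choice for atom in block]
        code = tuple(int(order[j] in nbrs[order[i]])
                     for i in range(n) for j in range(i + 1, n))
        if best is None or code < best:
            best = code
    return best


# Methylcyclopropane under two different atom numberings.
print(canonical_code(4, [(0, 1), (1, 2), (1, 3), (2, 3)]))  # (1, 1, 1, 1, 0, 0)
print(canonical_code(4, [(0, 1), (0, 2), (1, 2), (2, 3)]))  # (1, 1, 1, 1, 0, 0)
```

The number of labelings examined is the product of the factorials of the class sizes, which is why a GIAP step that splits the atoms into many small classes makes the CCAP step cheap, and why a GIAP step alone cannot guarantee a canonical labeling when classes remain larger than one atom.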
5.1.9 Conclusions
The determination of the constitutional symmetry of chemical compounds is a highly investigated research topic, and the various automorphism partitioning algorithms presented in this section show the evolution from simple techniques to methods with a strong mathematical background. The partitioning of atoms in orbits represents the common starting point for three structure representation and manipulation techniques: graph isomorphism, determination of the automorphism group, and canonical coding. The algorithms for automorphism partitioning of atoms consist of two steps: (1) an atom invariant is used to establish an initial atom partitioning (GIAP); (2) the GIAP result is used to establish the automorphism partition by investigating all automorphism permutations. The Morgan extended connectivity algorithm [8] inspired a large number of variants, but the method fails to discriminate atoms in fullerenes, (CH)2k saturated valence isomers of annulenes, and other regular graphs. More powerful algorithms, many inspired by vertex topological indices, are presently used to obtain a better atom partitioning based on graph invariants. A balance must be maintained between the efficiency and the computational complexity of the GIAP algorithm: methods that solve a system of linear equations, use matrix multiplication, or compute matrix eigenvectors can be a good choice, but path enumeration algorithms are not efficient. The selection of the canonical code definition is equally important: the use of cooperative labeling can greatly reduce the number of labelings investigated in the CCAP step. A canonical code is generated and manipulated by a computer, and its definition and representation must consider the data structures offered by the programming languages.
References
1 O. Ivanciuc, Canonical Numbering and Constitutional Symmetry, in: The Encyclopedia of Computational Chemistry, P. v. R. Schleyer, N. L. Allinger, T. Clark, J. Gasteiger, P. A. Kollman, H. F. Schaefer III, P. R. Schreiner (Eds.), John Wiley and Sons, Chichester, 1998, pp. 167-183.
2 R. C. Read, D. G. Corneil, J. Graph Theor. 1977, 1, 339-363.
3 I. Gutman, O. E. Polansky, Mathematical Concepts in Organic Chemistry; Springer, Berlin, 1986, Chapter 9, 108-116.
4 M. Razinger, K. Balasubramanian, M. E. Munk, J. Chem. Inf. Comput. Sci. 1993, 33, 197-201.
5 S. Bohanec, M. Perdih, J. Chem. Inf. Comput. Sci. 1993, 33, 719-726.
6 K. Balasubramanian, J. Chem. Inf. Comput. Sci. 1994, 34, 1146-1150.
7 K. Balasubramanian, J. Chem. Inf. Comput. Sci. 1995, 35, 761-770.
8 H. L. Morgan, J. Chem. Doc. 1965, 5, 107-113.
9 R. E. Carhart, J. Chem. Inf. Comput. Sci. 1978, 18, 108-110.
10 G. Moreau, Nouv. J. Chim. 1980, 4, 17-22.
11 C. Y. Hu, L. Xu, Anal. Chim. Acta 1994, 295, 127-134.
12 C.-Y. Hu, L. Xu, J. Chem. Inf. Comput. Sci. 1994, 34, 840-844.
13 H. Hong, X. Xin, J. Chem. Inf. Comput. Sci. 1994, 34, 730-734.
14 R. G. Freeland, S. A. Funk, L. J. O'Korn, G. A. Wilson, J. Chem. Inf. Comput. Sci. 1979, 19, 94-98.
15 W. T. Wipke, T. M. Dyott, J. Am. Chem. Soc. 1974, 96, 4834-4842.
16 C. A. Shelley, M. J. Munk, J. Chem. Inf. Comput. Sci. 1977, 17, 110-113.
17 C. A. Shelley, M. J. Munk, J. Chem. Inf. Comput. Sci. 1979, 19, 247-250.
18 A. T. Balaban, O. Mekenyan, D. Bonchev, J. Comput. Chem. 1985, 6, 538-551.
19 A. T. Balaban, O. Mekenyan, D. Bonchev, J. Comput. Chem. 1985, 6, 562-569.
20 J. Figueras, J. Chem. Inf. Comput. Sci. 1992, 32, 153-157.
21 T. Laidboeur, D. Cabrol-Bass, O. Ivanciuc, J. Chem. Inf. Comput. Sci. 1996, 36, 811-821.
22 S. Fujita, J. Chem. Inf. Comput. Sci. 1986, 26, 205-212.
23 S. Fujita, J. Chem. Inf. Comput. Sci. 1988, 28, 128-137.
24 C. Jochum, J. Gasteiger, J. Chem. Inf. Comput. Sci. 1979, 19, 49-50.
25 M. Bersohn, Comput. Chem. 1987, 11, 67-72.
26 M. Randić, J. Chem. Phys. 1974, 60, 3920-3928.
27 M. Randić, G. M. Brissey, C. L. Wilkins, J. Chem. Inf. Comput. Sci. 1981, 21, 52-59.
28 H. Abe, Y. Kudo, T. Yamasaki, K. Tanaka, M. Sasaki, S. Sasaki, J. Chem. Inf. Comput. Sci. 1984, 24, 212-216.
29 J. B. Hendrickson, A. G. Toczko, J. Chem. Inf. Comput. Sci. 1983, 23, 171-177.
30 V. Kvasnicka, J. Pospichal, J. Chem. Inf. Comput. Sci. 1990, 30, 99-105.
31 V. Kvasnicka, J. Pospichal, J. Math. Chem. 1992, 9, 181-196.
32 X. Liu, D. J. Klein, J. Comput. Chem. 1991, 12, 1243-1251.
33 W. D. Ihlenfeldt, J. Gasteiger, J. Comput. Chem. 1994, 15, 793-813.
34 J. Gasteiger, W. D. Ihlenfeldt, R. Fick, J. R. Rose, J. Chem. Inf. Comput. Sci. 1992, 32, 700-712.
35 J. Gasteiger, W. D. Ihlenfeldt, P. Rose, R. Wanke, Anal. Chim. Acta 1990, 235, 65-75.
36 P. Rose, J. Gasteiger, Anal. Chim. Acta 1990, 235, 163-168.
37 J. Gasteiger, W. Hanebeck, K.-P. Schulz, J. Chem. Inf. Comput. Sci. 1992, 32, 264-271.
38 M. Uchino, J. Chem. Inf. Comput. Sci. 1982, 22, 201-206.
39 E. V. Konstantinova, V. A. Skorobogatov, J. Chem. Inf. Comput. Sci. 1995, 35, 472-478.
40 E. V. Konstantinova, V. A. Skorobogatov, Discr. Math. 2001, 235, 365-383.
41 T. Wieland, A. Kerber, R. Laue, J. Chem. Inf. Comput. Sci. 1996, 36, 413-419.
42 I. V. Stankevich, E. G. Gal'pern, A. L. Chistyakov, I. I. Baskin, M. I. Skvortsova, N. S. Zefirov, O. B. Tomilin, J. Chem. Inf. Comput. Sci. 1994, 34, 1105-1108.
43 W. Schubert, I. Ugi, J. Am. Chem. Soc. 1978, 100, 37-41.
44 I. Ugi, J. Bauer, J. Brandt, J. Friedrich, J. Gasteiger, C. Jochum, W. Schubert, Angew. Chem. Int. Ed. Engl. 1979, 18, 111-123.
45 I. Ugi, J. Bauer, C. Blomberger, J. Brandt, A. Dietz, E. Fontain, B. Gruber, A. von Scholley-Pfab, A. Senff, N. Stein, J. Chem. Inf. Comput. Sci. 1994, 34, 3-16.
46 H. Scsibrany, K. Varmuza, ToSiM: PC-Software for the Investigation of Topological Similarities in Molecules, in: C. Jochum (Ed.), Software-Development in Chemistry 8; Gesellschaft Deutscher Chemiker: Frankfurt am Main, 1994, 235-249.
47 J. Xu, J. Chem. Inf. Comput. Sci. 1996, 36, 25-34.
48 T. Laidboeur, D. Cabrol-Bass, O. Ivanciuc, J. Chem. Inf. Comput. Sci. 1997, 37, 87-91.
49 M. Razinger, M. Perdih, J. Chem. Inf. Comput. Sci. 1994, 34, 290-296.
50 J.-L. Faulon, J. Chem. Inf. Comput. Sci. 1998, 38, 432-444.
51 H. Satoh, H. Koshino, K. Funatsu, T. Nakata, J. Chem. Inf. Comput. Sci. 2000, 40, 622-630.
52 H. Satoh, H. Koshino, K. Funatsu, T. Nakata, J. Chem. Inf. Comput. Sci. 2001, 41, 1106-1112.
53 Z. Ouyang, S. Yuan, J. Brandt, C. Zheng, J. Chem. Inf. Comput. Sci. 1999, 39, 299-303.
54 M. L. Contreras, J. Alvarez, M. Riveros, G. Arias, R. Rozas, J. Chem. Inf. Comput. Sci. 2001, 41, 964-977.
55 J. W. Raymond, E. J. Gardiner, P. Willett, J. Chem. Inf. Comput. Sci. 2002, 42, 305-316.
Handbook of Chemoinformatics, Johann Gasteiger (Ed.). Copyright © 2003 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
Geoff Downs has worked with Barnard Chemical Information Ltd. since 1988 and has specialist knowledge in fragment descriptor generation, ring perception, similarity and cluster analysis, and Markush structure handling. Between 1984 and 1988 he studied for his PhD in ring perception under Professor Mike Lynch on the Generic Structures Project in the Department of Information Studies, University of Sheffield, and then remained in the department for four years as part-time Research Manager for Professor Peter Willett’s Chemoinformatics Research Group.
5.2
Ring Perception
Geoffrey M. Downs
5.2.1
Introduction
Ring systems are a major structural feature of many chemical compounds, knowledge of which gives insight into the chemical nature of the compounds, their likely behavior, and possible routes to their synthesis. Ring perception, to distinguish the number and type of rings in a topological representation of a chemical structure, is essential for tasks such as:
* compound naming, indexing and classification
* fragment screen generation for structure retrieval and spectral analysis
* aromaticity perception
* structure display optimization
* transform description for reaction indexing and computer-assisted synthesis
This section gives a summary of the considerations for ring set choice, the basic perception techniques available, and the algorithms employed to perceive particular ring sets. The most extensive review of ring perception algorithms for chemical graphs to date was published in 1989 by the present author [1], with a discussion of twenty-four algorithms and references to many more related algorithms. In addition, discussions related to the theoretical basis for various ring sets were published in 1989 [2] and 1993 [3]. A further seven algorithms [4-10] for chemical graphs published since 1989 were included in a previous version of this article published in 1998 [11]. This update includes a further four algorithms [12-15].
5.2.2 Terminology and Definitions
Ring perception is used in many different disciplines, particularly in electrical engineering. One consequence is that different terminology is used for the same concepts. This section outlines the minimum amount of terminology necessary for the rest of the article, and is an amalgam of terms used in graph theory and combinatorics (see Chapter 11, Sections 4 and 5.4). A chemical structure, typically given by a structure diagram, is represented as an undirected graph in which the atoms are the vertices and the bonds are the edges. Vertices and edges can be colored, i.e. labeled, with associated information such as atom type or bond type. The valence or degree of each atom is the vertex connectivity; the number of ring bonds attached to an atom is its ring-connectivity. Starting from any vertex of the graph, a tree can be grown to yield a spanning tree, which contains each vertex of the graph once only, and each edge except the chords. The chords are the minimum number of edges required to turn the tree from acyclic to cyclic. If the graph is disconnected then a spanning tree can be grown for each component of the graph; a ring component contains edges that are all cyclic. The number of chords for a graph is the nullity, μ (alternatively known as the cyclomatic number or first Betti number), and is calculated by the Cauchy (or Frèrejacque) formula:
μ = number of edges - number of vertices + number of components
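The relation between spanning trees, chords and the nullity can be checked with a short sketch. The function names `spanning_forest` and `nullity` and the 0-based vertex numbering are choices made here for illustration; the forest is grown breadth-first, one tree per component, and every edge not used by a tree is reported as a chord.

```python
from collections import deque


def spanning_forest(n, edges):
    """Grow one spanning tree per component; return (components, chords)."""
    nbrs = [[] for _ in range(n)]
    for k, (a, b) in enumerate(edges):
        nbrs[a].append((b, k))
        nbrs[b].append((a, k))
    seen = [False] * n
    tree_edges = set()
    components = 0
    for root in range(n):
        if seen[root]:
            continue
        components += 1
        seen[root] = True
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for w, k in nbrs[v]:
                if not seen[w]:
                    seen[w] = True
                    tree_edges.add(k)
                    queue.append(w)
    chords = [e for k, e in enumerate(edges) if k not in tree_edges]
    return components, chords


def nullity(n, edges):
    components, _ = spanning_forest(n, edges)
    return len(edges) - n + components


# Cubane: vertices 0..7, bonds join numbers differing in exactly one bit.
CUBANE = [(0, 1), (0, 2), (0, 4), (1, 3), (1, 5), (2, 3),
          (2, 6), (3, 7), (4, 5), (4, 6), (5, 7), (6, 7)]
# Norbornane: bridgeheads 0 and 3, one-atom bridge through atom 6.
NORBORNANE = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 3)]

print(nullity(8, CUBANE), len(spanning_forest(8, CUBANE)[1]))   # 5 5
print(nullity(7, NORBORNANE))                                   # 2
```

The number of chords returned always equals the value of the formula, whichever vertex each tree is grown from.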
The nullity is also the cardinality (size) of a fundamental basis set of rings, which is a set of rings from which all others, in a non-trivial ring system, can be derived by combining subsets of them. Fan et al. [6, 16] have recently published and proved an alternative formula, based on the numbers of vertices of each degree. A walk is an alternating sequence of connected vertices and edges; if the start and end vertex are the same, and all other vertices and edges occur once only, then the walk is a cycle. If a cycle is found from each chord then a fundamental basis set of cycles has been found. If in a cycle there are no pairs of vertices joined by an edge that is not in the cycle then it is a simple cycle; all other cycles are complex cycles. If the sum of the sizes of the cycles in a fundamental basis is a minimum then the set is referred to as a minimal cycle basis. Most chemical graphs can be drawn in two dimensions such that no edges cross, in which case the graph is said to be embeddable on a plane. Planar embedment defines regions (which have no edges or vertices crossing them) and non-regions. The region defined by the outer boundary of a graph is the infinite region; all other regions are finite regions. Finite and infinite regions are always interchangeable by redrawing the graph. A region that is a simple cycle is a simple face. A non-region that is a simple cycle is a cut face. If one simple face is larger than all the other simple faces then it is the maximal simple face; if a simple face is the smallest simple face associated with each of its edges then it is a minimal simple face; all other simple faces are intermediate simple faces. If a cut face is the smallest simple cycle associated with at least one of its edges then it is a primary cut face; otherwise, if at least one of its edges is associated with a simple cycle of the same size (and none smaller) then it is a secondary cut face; otherwise it is a tertiary cut face. The relationships between the various cycles are summarized in Figure 5.2-1. The terms can be understood more readily by reference to example graphs. In cubane, Figure 5.2-2, the six 4-edged regions are all minimal simple faces; there is no maximal simple face. The eight 6-edged simple cycles are all tertiary cut faces.
Fig. 5.2-1 Relationships between different classes of simple faces and cut faces. Legend: regions + non-regions = all cycles; A + B + C + D + E + F = all simple cycles; all cycles - all simple cycles = all complex cycles; A = maximal simple face; B = intermediate simple faces; C = minimal simple faces; D = primary cut faces; E = secondary cut faces; F = tertiary cut faces.
Fig. 5.2-2 Cubane
Fig. 5.2-3 Norbornane
Fig. 5.2-4 Example of alternative embedment simple cycles
Fig. 5.2-5 Example of a primary cut face (in bold)
In norbornane, Figure 5.2-3, the two 5-edged regions are minimal simple faces, and the 6-edged infinite region is the maximal simple face. In Figure 5.2-4, there are two distinct embedments: (i) defines an 8-edged maximal simple face, and a 6-edged and two 5-edged minimal simple faces; (ii) defines two 7-edged intermediate simple faces and two 5-edged minimal simple faces. In Figure 5.2-5, the simple cycle shown in bold is a primary cut face. In Figure 5.2-6, the 4-edged simple cycle, in bold, is a secondary cut face. In Figure 5.2-7, the 8-edged simple cycle, in bold, is a tertiary cut face.
Fig. 5.2-6 Example of a secondary cut face (in bold)
Fig. 5.2-7 Example of a tertiary cut face (in bold) displacing intermediate simple faces
5.2.3 Perception Methods
Underlying most ring perception algorithms there are just a few basic perception methods and operations, most of which were originally developed for electrical engineering applications. This section summarizes the two basic perception methods, those using graph theory and those using linear algebra, and then discusses some of the pre-processing methods used to make the perception methods more efficient. Details of all the methods mentioned can be found in the introductions to the papers cited in this article and the papers cited in the 1989 review [1].
5.2.3.1 Graph Theoretic Methods
Graph theoretic methods treat a chemical structure connection table as a graph containing vertices and edges. The two main methods for finding cycles in the graph are depth-first search and breadth-first search. Depth-first search has lower storage requirements and is generally better for finding all cycles. Breadth-first search is generally faster at finding the smallest cycles. The searching can be used to find the cycles directly, or to find all paths between two vertices with subsequent processing joining pairs of paths together to find the cycles. The searching can also be performed on an incidence or adjacency matrix derived from the graph. Since the paths are stored during search, the correct vertex-edge sequence of a cycle is known once the cycle has been found.
5.2.3.2 Linear Algebraic Methods
Linear algebraic methods use matrix manipulation, row reordering etc. on incidence or adjacency matrices derived from the graph. If a fundamental basis set of rings can be found then linear algebraic methods can generate every other cycle present in the graph. Fundamental bases are not generally unique but they always contain μ cycles. The usual way to find a fundamental basis is to generate a spanning tree. Other cycles can then be generated either by using a depth-first or breadth-first search from the ends of the chords of the spanning tree, or by taking the exclusive or (XOR) of the edges of each combination of from 1 to μ of the basis cycles. XOR operations are generally very fast, but since binary sets are used for the combinations, the correct vertex-edge sequence of each cycle needs to be looked up afterwards. A feature common to many of the perception methods using linear algebra is the use of linear independence to determine either a particular fundamental basis or an extension of it [8, 17].
5.2.3.3 Pre-processing Methods
Perception methods can generally be made more efficient by pre-processing the graph to prune acyclic vertices and edges, identify each separate component, and/or reduce the graph to a simpler graph.
5.2.3.3.1 Pruning and Processing Components
Pruning, to remove acyclic vertices and edges, can be accomplished simply by successively removing each vertex with a connectivity of one (and its associated edge). Alternatively, a spanning tree can be grown to trace and label all cyclic edges and vertices; subsequent perception can then ignore unlabeled vertices and edges. Individually processing each ring component of a graph is particularly effective for matrix manipulation methods. Row reordering can identify separate ring components, and each ring component can then be processed by including only the relevant rows. Depth-first and breadth-first searches can use cyclic labeling to avoid crossing from one ring component to another.
5.2.3.3.2 Graph Reduction
In electrical engineering, reduction methods are widely used for solving combinatorial problems on graphs, and many of the more recent algorithms for chemical graphs now employ graph reduction to simplify the ring perception. The simplest reduction involves reducing vertices with a ring-connectivity of two to a single edge, to produce a basic graph. If this is done before ring perception proceeds, the graph has fewer edges and vertices to search; Balaban et al. [17] use this technique to produce a basic graph referred to as a homeomorphically reduced graph, and Hanser et al. [10] use it to reduce a path graph derived from a graph. If reduction is done during processing, by reducing such vertices in a ring that has just been found, then it can reveal further "embedded" rings [5]. This contrasts with the non-reduction method of using "used" count labels on vertices and edges to direct processing towards the embedded rings [4, 18, 19]. A more complex form of graph reduction is shown by the cut-vertex graphs of Downs et al. [2].
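The two simplest pre-processing steps described in this section, pruning of acyclic vertices and reduction of vertices with a connectivity of two, can be sketched as follows. The function names `prune` and `reduce_degree_two` are hypothetical, and the sketch makes one assumption that the text does not state: parallel edges and self-loops created by the reduction are kept, so that the nullity of the graph is unchanged.

```python
from collections import Counter


def prune(edges):
    """Repeatedly delete vertices of connectivity one together with their edge."""
    edges = list(edges)
    while True:
        degree = Counter(v for e in edges for v in e)
        leaves = {v for v, d in degree.items() if d == 1}
        if not leaves:
            return edges
        edges = [e for e in edges if e[0] not in leaves and e[1] not in leaves]


def reduce_degree_two(edges):
    """Replace each vertex of connectivity two by a single edge between its neighbours."""
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        degree = Counter(v for e in edges for v in e)
        for v, d in degree.items():
            incident = [e for e in edges if v in e]
            if d != 2 or len(incident) != 2:    # skip self-loops
                continue
            (a1, b1), (a2, b2) = incident
            left = b1 if a1 == v else a1
            right = b2 if a2 == v else a2
            edges.remove(incident[0])
            edges.remove(incident[1])
            edges.append((left, right))
            changed = True
            break
    return edges


ETHYLCYCLOPENTANE = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 2)]
NORBORNANE = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 3)]

print(len(prune(ETHYLCYCLOPENTANE)))                          # 5
print(sorted(tuple(sorted(e)) for e in reduce_degree_two(NORBORNANE)))
# [(0, 3), (0, 3), (0, 3)]
```

Norbornane reduces to its two bridgeheads joined by three parallel edges, one per bridge; a real implementation would also record the path that each reduced edge stands for, so that the perceived rings can be expanded back to atom sequences.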
5.2.4
Ring Sets and Algorithms
For complex ring systems, such as multiply-bridged systems and cage structures, a wide range of cycles can be perceived, starting from the minimum number necessary to include all vertices and edges (a fundamental basis set containing μ cycles) to the maximum number of cycles in the ring system. This range varies in terms of the number, size and atom/bond composition of the cycles. Different applications have different requirements, and so a variety of ring sets (the particular sets chosen from within the wide range available) have been defined. The problem then becomes one of choosing a ring set that is in some way "optimum" for the particular application. The main factor is usually that the ring set should be unique for a given structure and invariant, i.e. processing or ordering the graph in a different way should not produce a different set, or a choice between several sets. Given the large number of rings that could be included in a set, a general aim is to include the minimum number of rings necessary to describe the ring system and also to include sufficient rings to describe the ring system adequately for a given application. The problem is deciding what is necessary and sufficient. For instance, in cubane, Figure 5.2-2, there are 28 cycles, 14 simple cycles, 6 simple faces, the nullity is 5, and yet all edges and vertices can be included by using just four of the simple faces. If only the 4-edged simple faces should be included in the ring set for cubane, then how many should be included? Similarly, is norbornane, Figure 5.2-3, two fused 5-edged rings or a bridged 6-edged ring? The nullity is 2, but the 6-edged ring is essential in the Diels-Alder reaction for the synthesis of norbornane. For two different applications, such as structure storage/retrieval and synthesis design, the answers to such questions may well be different, and need to be considered when choosing a ring set.
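The total cycle count quoted for cubane can be reproduced with a small depth-first enumeration of the kind outlined in Section 5.2.3.1. The sketch below is an illustration only; the function name `all_cycles` and the vertex numbering are assumptions made here. Each cycle is started from its lowest-numbered vertex and accepted in only one of its two directions, so that it is reported exactly once.

```python
from collections import Counter


def all_cycles(n, edges):
    """Enumerate every cycle of an undirected graph once, by depth-first search."""
    nbrs = [set() for _ in range(n)]
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    cycles = []

    def extend(path, visited):
        start, last = path[0], path[-1]
        for nxt in sorted(nbrs[last]):
            if nxt == start:
                # Close the ring; keep one direction only.
                if len(path) >= 3 and path[1] < path[-1]:
                    cycles.append(tuple(path))
            elif nxt > start and nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for start in range(n):
        extend([start], {start})
    return cycles


CUBANE = [(0, 1), (0, 2), (0, 4), (1, 3), (1, 5), (2, 3),
          (2, 6), (3, 7), (4, 5), (4, 6), (5, 7), (6, 7)]
NORBORNANE = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 3)]

cubane_cycles = all_cycles(8, CUBANE)
print(len(cubane_cycles))                                     # 28
print(sorted(Counter(len(c) for c in cubane_cycles).items()))
# [(4, 6), (6, 16), (8, 6)]
print(sorted(len(c) for c in all_cycles(7, NORBORNANE)))      # [5, 5, 6]
```

Even for an eight-atom cage the full cycle set is more than five times the nullity, which illustrates why most applications work with a selected ring set instead of all cycles.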
This section covers nine of the main ring sets used in applications processing chemical structure graphs. The most commonly used ring set is the minimal cycle basis, usually referred to as the smallest set of smallest rings (SSSR). In addition to the ring sets mentioned here, most other published ring sets are heuristic supplements to an SSSR; further details of these can be found in the 1989 review [1]. A summary of the contents of each of the ring sets in this section is given in Table 5.2-1, using the terms given in Figure 5.2-1 and Section 5.2.2. Each of these ring sets is discussed more fully below, along with brief summaries of the algorithms to find them (the original papers should be consulted for full details). If more than one published algorithm for a particular ring set is available then details are given of the one the present author feels is best (in terms of efficiency and clarity of the algorithm).
Tab. 5.2-1 Summary of the contents of the main ring sets used for chemical graph applications
Ring set                     Contents
All cycles/simple cycles     All cycles/simple cycles
Beta-ring                    3- and 4-edged simple cycles + linear independent from 3 or more smaller beta-rings
ESER                         Heuristic selection of smallest simple cycles
Essential cycles             Intersection of all SSSR
ESSR                         All simple faces and primary/secondary cut faces
K-rings/relevant cycles      Union of all SSSR
SER                          K-rings + simple cycles that are fusions of pairs of them
SSCE                         Selection of simple faces and primary/secondary cut faces
SSSR                         μ smallest simple cycles (minimal cycle basis)
5.2.4.1 All Cycles/Simple Cycles
The set of all cycles, and its related subset of all simple cycles, is unique for a given structure, but they generally contain far more rings than are necessary to describe the ring system. For complex ring systems, the processing needed to find the cycles can grow exponentially with the number of vertices. Algorithms exist [1] that use each of the methods outlined in Section 5.2.3. The most recent algorithm is by Hanser et al. [10], which uses graph reduction to make the processing fast and easy to implement. The graph is converted to a path graph which initially has the same edges and vertices as the original graph, but each edge is labeled by the path between the incident vertices. The path graph is reduced by iterative removal of each vertex until none remain. At each iteration the relevant paths are updated. A vertex that ends up connected to itself (self-loop) corresponds to a cycle in the original graph, with the self-loop edge describing the cycle. Efficiency is improved by removing the lowest connectivity vertices from the path graph first. In summary, the algorithm is:
1. Convert the graph to a path graph
2. While there are still vertices left in the path graph:
   Remove the lowest connectivity vertex
   Create a new edge from each pair of edges from the removed vertex
   Update the path for each new edge and replace the original edges
   Check for a new edge being a self-loop; if it is then store the path as a cycle
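The reduction steps summarized above can be sketched as follows. This is a minimal illustration of the path-graph idea, not the published implementation of Hanser et al.: the function name `all_cycles`, the list-of-vertices path representation, and the overlap test used to reject merged paths that would revisit a vertex are assumptions made for the sketch.

```python
from itertools import combinations

def all_cycles(edges):
    """Return every cycle as a closed vertex path [v0, ..., v0]."""
    paths = [[a, b] for a, b in edges]       # path graph: one path per edge
    vertices = {v for e in edges for v in e}
    cycles = []
    while vertices:
        def degree(v):
            return sum(1 for p in paths if v in (p[0], p[-1]))
        v = min(vertices, key=degree)        # lowest connectivity first
        incident = [p for p in paths if v in (p[0], p[-1])]
        paths = [p for p in paths if v not in (p[0], p[-1])]
        for p, q in combinations(incident, 2):
            left = p if p[-1] == v else p[::-1]    # oriented to end at v
            right = q if q[0] == v else q[::-1]    # oriented to start at v
            closes = left[0] == right[-1]          # merged path is a self-loop
            allowed = {v, left[0]} if closes else {v}
            if set(left) & set(right) != allowed:
                continue                     # paths overlap elsewhere: discard
            merged = left + right[1:]
            if closes:
                cycles.append(merged)        # self-loop: store as a cycle
            else:
                paths.append(merged)         # new edge of the path graph
        vertices.remove(v)
    return cycles

# Illustrative skeletons (numbering is arbitrary, not that of the figures).
cubane = [(a, a ^ (1 << k)) for a in range(8) for k in range(3) if a < a ^ (1 << k)]
norbornane = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7), (7, 4)]

print(len(all_cycles(cubane)))
print(sorted(len(c) - 1 for c in all_cycles(norbornane)))
```

The number of stored paths can grow very quickly for cage structures, which is the exponential worst case the text refers to.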
Due to the large number of cycles to be found in complex ring systems, in the worst case processing will be slow. However, for certain applications, such as structure display optimization and the automatic generation of chemical names, the set of all cycles or all simple cycles is required to ensure complete description of a ring system.

5.2.4.2
Beta-ring
The set of Beta-rings [20] is one of the earliest attempts to extend the SSSR to include the maximal simple face of norbornane but without including much larger envelope rings (i.e. maximal simple faces). The set of Beta-rings comprises:
   All 3- and 4-edged simple cycles
   All simple cycles that are linearly independent of 3 or more smaller Beta-rings
In the published paper, the set of simple cycles is generated first, all 3- and 4-edged members are selected and then linear combinations are taken to select those that conform to the second criterion. No algorithm is given in the published paper, but processing to generate all simple cycles and then take linear combinations of them is going to be exponential rather than polynomial. A more efficient way would be to use a breadth-first SSSR algorithm and then take linear combinations of these (since all cycles can be generated from linear combinations of a fundamental basis), using heuristics to limit the number of linear combinations. The set of Beta-rings includes some tertiary cut faces, but not others. For instance, in Figure 5.2-7 the 8-edged tertiary cut face is not a Beta-ring while one of the 9-edged simple faces is. However, in Figure 5.2-8 the 7-edged tertiary cut face (in bold) is a Beta-ring while the 7-edged maximal simple face is not. In general, the set of Beta-rings contains more rings than an SSSR but, for large or complex ring systems, does not include maximal simple faces.

5.2.4.3
ESER (Essential Set of Essential Rings)
The ESER was defined by Fujita [21] specifically for use with the Imaginary Transition Structure construct for representing reaction-site changes during organic reactions. ESER rings are simple cycles selected using ideas based on synthetic importance. All rings in the sets of simple cycles and complex cycles are categorized, according to their vertex coloring, into three classes: carbon (all vertices are carbon), hetero (at least one vertex is N, O, S, P; all others are carbon) and abnormal (at least one vertex is some other atom). The ESER contains all simple cycles that are not “dependent”. A simple cycle is dependent if all its edges occur in a subset, C, of the complex cycles and:
1. The rings in C are all smaller than, or the same size as, the simple cycle
2. The simple cycle contains less than half the edges in C
3. All rings in C are of the same class as the simple cycle
4. All rings in C have the same or a smaller number of hetero and/or abnormal atoms as the simple cycle

Fig. 5.2-8 Example of a tertiary cut face (in bold) displacing maximal simple face

In summary, Fujita’s algorithm is:
1. Find all cycles
2. Sort into increasing size
3. Label as carbon, hetero or abnormal
4. Label as simple cycle or complex cycle
5. For each simple cycle check to see if it is dependent on complex cycles of the same class using the above conditions
6. Put all simple cycles not labeled as dependent into the ESER
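Step 3 of the summary above, the labeling of a ring as carbon, hetero or abnormal, is simple enough to sketch directly. The function name `ring_class` and the use of element symbols as input are assumptions for illustration; the dependency test of step 5 is not attempted here.

```python
HETERO = {"N", "O", "S", "P"}   # heteroatoms named in the class definition

def ring_class(symbols):
    """Classify a ring from the element symbols of its vertices."""
    if any(s != "C" and s not in HETERO for s in symbols):
        return "abnormal"       # at least one atom outside C, N, O, S, P
    if any(s in HETERO for s in symbols):
        return "hetero"         # at least one N, O, S or P; all others carbon
    return "carbon"             # all vertices are carbon

print(ring_class(["C"] * 6))
print(ring_class(["C", "C", "N", "C", "C"]))
print(ring_class(["C", "Si", "O", "C"]))
```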
These complex heuristics, and in particular the reliance on the presence/absence of various atoms, can give rise to some unusual classifications. For instance, in Figure 5.2-9 the maximal simple face of (i) is not in the ESER, but in (ii) it is (due to the presence of the heteroatom). In general, the ESER is always a superset of an SSSR, but which additional rings are included can be difficult to visualize.

5.2.4.4
Essential Cycles
In contrast to the set of relevant cycles, which is the union of all possible SSSRs (see Section 5.2.4.6), Gleiss and Stadler [12] have defined the set of essential cycles as the intersection of all possible SSSRs. For ring systems with a single embedment the SSSR and set of essential cycles are identical. The nullity is thus the upper bound to the set cardinality, with the lower bound being zero. In the context of biopolymers this enables use of a common denominator set of cycles for complex ring systems. The set of essential cycles can be generated by finding each minimal cycle basis (e.g. by using Vismara’s algorithm [13]) and retaining the intersection.

Fig. 5.2-9 Example of the heteroatom dependence of ESER rings, structures (i) and (ii)

5.2.4.5
Extended Set of Smallest Rings (ESSR)
The ESSR was defined by Downs et al. [2] for use with Markush structures used in chemical patents, where a unique ring set is essential. The ESSR contains all simple faces, all primary cut faces and all secondary cut faces (i.e. all simple faces and any adjacent cut faces that are the same size or smaller). The algorithm to find the ESSR [19] is an extension of Zamora’s three-phase algorithm [18]. Phase 1 path traces to find the smallest rings from vertices not in the ring set until each vertex is in the ring set. Phase 2 path traces to find the smallest rings from edges not in the ring set until each edge is in the ring set. Phase 3 path traces to include all faces in the ring set. Phases 1 and 2 remain the same as the original algorithm, except that if several rings the same size are found then all are put in the ring set rather than having to choose one of them. Phase 3 is split into two sub-phases to strip off outer rings and path trace the “embedded” rings of complex ring systems. In summary, the algorithm is:
Phase 1
   Start from the highest ring-connectivity vertex and find the smallest ring(s) from it
   Add the ring(s) to the ESSR and increment the “used” counts for the vertices and edges
   Continue until all vertices are “used”
Phase 2
   Start from an “unused” edge and find the smallest ring(s) from it
   Add the ring(s) to the ESSR and increment the “used” counts for the edges
   Continue until all edges are “used”
   Set the vertex counts to zero
Phase 3-1
   Start from a vertex with a ring-connectivity >2; increment its “used” count
   Find any rings from it that have a bond used limit of 1 (i.e. at most one edge with a “used” count >1)
   Add the ring(s) to the ESSR and increment the “used” counts for the vertices
   Continue until such vertices are “used” and then increment the “used” counts for all edges in rings found in Phase 3-1
   Set the vertex counts to zero
Phase 3-2
   Repeat Phase 3-1 using a modified bond used limit of increments >1
For other applications, a breadth-first trace would be more efficient than the depth-first trace used in the published algorithm. Although originally developed for use with Markush structures, the nature of the ring set makes it general purpose, with particular use for synthesis design and activity relationship applications. However, the inclusion of the maximal simple face does not always appeal to chemists, particularly when it is much larger than any of the other simple faces.

5.2.4.6
K-rings (Relevant Cycles)
The set of K-rings was defined by Plotkin [22] and is the set containing all possible SSSR rings. In summary, Plotkin’s algorithm is:
1. Find an SSSR and add all SSSR rings to the set of K-rings
2. While there are still rings in the SSSR:
   Find the longest unforked path P (it may be a single edge) in the largest ring R currently in the SSSR
   Generate all cycles the same size as R that also contain P and add them to the set of K-rings
   Remove R from the SSSR
In 1997, Vismara [13] rediscovered the set of K-rings, defining it as the union of all possible SSSRs and naming it the set of “relevant cycles”. The algorithm to find the set of relevant cycles is based on Horton’s algorithm for finding a minimal cycle basis [23], in which an initial set of cycles (prototypes) is analyzed to extract a minimum cycle basis. The set of K-rings/relevant cycles avoids arbitrary exclusions of rings when there is more than one SSSR for a structure. For instance, the set of K-rings includes all six 4-edged simple faces of cubane, Figure 5.2-2. However, if a simple face is not in any SSSR then it is not in the set of K-rings. For instance, the set of K-rings does not include the 6-edged maximal simple face of norbornane, Figure 5.2-3, or the 7-edged intermediate simple faces of the alternative embedment of Figure 5.2-4(ii), or the 9-edged intermediate simple faces of Figure 5.2-7.

5.2.4.7
SER (Set of Elementary Rings)
The SER was defined by Takahashi [7] and consists of the set of K-rings plus any fusions of pairs of K-rings that have two or more edges in common. Appropriate pairs for fusion are found by attempting to generate a θ-graph for each pair of K-rings. A θ-graph is a subset of the ring-connectivities of the original pair so that there are two non-adjacent vertices of ring-connectivity 3 with all other vertices of ring-connectivity 2. This enables the bridge to be deleted to produce the fused “envelope” from the original pair. In summary, the algorithm is:
1. Generate an SSSR
2. Add the SSSR rings to the SER
3. Take each pair of SSSR rings and fuse them
4. If the fusion forms another simple cycle not already in the SER then add it to the SER

Fig. 5.2-10 Adamantane, with one of the tertiary cut faces in bold
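Steps 3 and 4 above amount to taking the ring sum (symmetric difference of the edge sets) of two rings and keeping the result only if it is a single closed loop. The sketch below illustrates this for the two 5-edged rings of norbornane. The helper names, the atom numbering, and the simple degree-and-connectivity test standing in for the θ-graph check are assumptions, not Takahashi's published procedure.

```python
def ring_edges(vertices):
    """Edge set of a ring given as an ordered list of its vertices."""
    return frozenset(frozenset(p) for p in zip(vertices, vertices[1:] + vertices[:1]))

def is_single_cycle(edges):
    """True if the edges form one closed loop (all degrees 2, connected)."""
    adj = {}
    for e in edges:
        a, b = tuple(e)
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    if not adj or any(len(n) != 2 for n in adj.values()):
        return False
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

def fuse(ring_a, ring_b, min_shared=2):
    """Ring sum of two rings sharing at least min_shared edges, or None."""
    if len(ring_a & ring_b) < min_shared:
        return None
    fused = ring_a ^ ring_b              # delete the shared bridge
    return fused if is_single_cycle(fused) else None

# Norbornane: ring 1..6 with bridging atom 7 between atoms 1 and 4.
five_a = ring_edges([1, 2, 3, 4, 7])
five_b = ring_edges([1, 6, 5, 4, 7])
envelope = fuse(five_a, five_b)
print(len(envelope))
```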
Fusion of SSSR rings enables the maximal simple face of norbornane, Figure 5.2-3, to be included in the SER. Unfortunately it also enables the three 8-edged tertiary cut faces of adamantane, Figure 5.2-10 (one of which is in bold), to join the SER. As with the set of Beta-rings, in Figure 5.2-8 the 7-edged tertiary cut face (in bold) is included in the SER, but the 7-edged maximal simple face is not.

5.2.4.8
SSCE (Set of Smallest Cycles at Edges)
The SSCE, published by Dury et al. [14], is mid-way between the set of K-rings (Section 5.2.4.6) and the ESSR (Section 5.2.4.5). It was designed to avoid the problems of non-uniqueness associated with an SSSR but without going as far as including the maximal simple face included in the ESSR. The algorithm to find the SSCE concentrates on edges rather than vertices, and proceeds as follows:
1. Find the smallest cycle associated with each edge; add those that are simple cycles to the current SCE (Smallest Cycles at Edges)
2. Add the current SCE to the SSCE
3. Remove all edges in common between two or more of the current SCE rings
4. Empty the current SCE and repeat steps 1-3 until no more cycles can be found
5. If there is only one largest ring in the resultant SSCE then remove it on the basis that it is the maximal simple face
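Step 1 of the procedure above, finding the smallest cycle associated with an edge, can be sketched as a breadth-first search for the shortest path between the two ends of the edge once the edge itself is ignored. The function name and graph representation are illustrative assumptions; the remaining steps (edge removal and iteration) are not shown.

```python
from collections import deque

def smallest_cycle_at_edge(edges, edge):
    """Vertices of a smallest cycle through `edge`, or None if it is acyclic."""
    a, b = edge
    adj = {}
    for u, w in edges:
        if {u, w} == {a, b}:
            continue                    # ignore the edge being examined
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    parent = {a: None}
    queue = deque([a])
    while queue:                        # breadth-first search from a towards b
        u = queue.popleft()
        if u == b:
            break
        for w in adj.get(u, []):
            if w not in parent:
                parent[w] = u
                queue.append(w)
    if b not in parent:
        return None
    ring = [b]
    while parent[ring[-1]] is not None: # walk back from b to a
        ring.append(parent[ring[-1]])
    return ring

norbornane = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7), (7, 4)]
print([len(smallest_cycle_at_edge(norbornane, e)) for e in norbornane])
```

For norbornane every edge lies in a 5-edged ring, so a search of this kind never reaches the 6-edged ring on its first pass.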
This algorithm follows a familiar pattern of peeling away rings to reveal rings missed by a previous iteration, and the resultant ring set has many strengths. However, the removal of the maximal simple face in all circumstances means, for instance, that the synthetically important 6-edged maximal simple face of norbornane (Figure 5.2-3) is rejected.

5.2.4.9
SSSR (Smallest Set of Smallest Rings)
The SSSR is a fundamental basis containing the μ smallest simple cycles in a structure, i.e. a minimal cycle basis. The majority of non-chemical and many chemical applications use the SSSR as the ring set since μ can be calculated easily and the mathematical foundations of a fundamental basis are well understood. The SSSR is still the most important and widely used of all ring sets. As with the set of all cycles, algorithms exist that use each of the methods outlined in Section 5.2.3. Since the 1989 review [1], at least five improved SSSR perception algorithms for chemical structure applications have been published [4-6, 8, 9]. Qian et al. [4] have improved Zamora’s algorithm [18] by using a breadth-first search instead of the original depth-first search, by enhancing the preference rules for choosing rings where there is more than one SSSR, and by replacing Phase 3 (see Section 5.2.4.5) by a new algorithm based on linear combinations. Baumer et al. [5] use paths traced in a basic graph to find the SSSR, and group rings into internal (having more than two vertices with a ring-connectivity more than 2) and external (having two or fewer vertices with a ring-connectivity more than 2) which can be processed separately. This follows the trend of many other algorithms to use graph reduction and to strip off outer rings to enable processing of “embedded” inner rings. Fan et al. [6] use a depth-first trace, but start from the smallest ring-connectivity vertices and then strip off vertices from found rings that have ring-connectivities of 2. Balducci and Pearlman [8] use a breadth-first trace, storing the paths traced (as originally done by Corey and Petersson [24] using Paton’s algorithm [25]) so that each ring can be produced from the ring sum of the two paths, and then apply an efficient method of linear combination to check the rings before inclusion in the SSSR. This algorithm is greatly improved by Figueras [9] who combines it with aspects taken from Fan et al.
In summary, the algorithm is:
1. Select a vertex of ring-connectivity 2 (if none available then temporarily delete edge(s) from the lowest ring-connectivity vertex to reduce it to 2)
2. Perform a breadth-first search to find the smallest ring for each vertex of ring-connectivity 2 (use a queue to improve efficiency)
3. Add each ring to stored ring set if it is linearly independent
4. Break each sequence of ring-connectivity 2 vertices in graph (or original vertex if ring-connectivity more than 2)
5. Repeat steps 1-4 while there are still vertices left in the graph
Raymond [26] has subsequently pointed out that the published algorithm is incorrect. Indiscriminately breaking all sequences of ring-connectivity 2 fails in circumstances such as that shown in Figure 5.2-11; in this case the chains containing vertices (4) and (1,2) are broken so that the ring (3,4,5,6,11,12) cannot be found. To correct this, step 4 above should be:
4. For each ring found in the current iteration, find the longest sequence of ring-connectivity 2 vertices and break it

Fig. 5.2-11 Example of structure requiring correction to Figueras’s algorithm
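The combination of breadth-first tracing and a linear-independence check that these SSSR algorithms share can be illustrated with a compact sketch. Note that this is not Figueras's algorithm: it follows the simpler candidate-and-filter scheme of Horton [23] (breadth-first tree from every vertex, one candidate ring per non-tree edge, greedy selection of the smallest independent rings over GF(2)). All names and the bitmask encoding of rings are assumptions for illustration.

```python
from collections import deque

def independent(vec, basis):
    """Add vec to a GF(2) basis keyed by leading bit; False if dependent."""
    while vec:
        top = vec.bit_length() - 1
        if top not in basis:
            basis[top] = vec
            return True
        vec ^= basis[top]               # eliminate the leading bit
    return False

def minimal_cycle_basis(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    bit = {frozenset(e): i for i, e in enumerate(edges)}
    candidates = {}                     # edge bitmask -> ring as vertex list
    for root in adj:
        parent = {root: None}           # breadth-first tree from root
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in parent:
                    parent[w] = u
                    queue.append(w)

        def path_to_root(x):
            out = [x]
            while parent[out[-1]] is not None:
                out.append(parent[out[-1]])
            return out

        for a, b in edges:
            if a not in parent or b not in parent:
                continue
            pa, pb = path_to_root(a), path_to_root(b)
            if set(pa) & set(pb) != {root}:
                continue                # paths must meet only at the root
            ring = pa + pb[:-1][::-1]   # a ... root ... b, closed by edge (b, a)
            if len(ring) < 3:
                continue                # (a, b) was a tree edge at the root
            mask = 0
            for u, w in zip(ring, ring[1:] + ring[:1]):
                mask |= 1 << bit[frozenset((u, w))]
            candidates.setdefault(mask, ring)
    basis, rings = {}, []
    for mask in sorted(candidates, key=lambda m: (bin(m).count("1"), m)):
        if independent(mask, basis):    # smallest rings first
            rings.append(candidates[mask])
    return rings

cubane = [(a, a ^ (1 << k)) for a in range(8) for k in range(3) if a < a ^ (1 << k)]
norbornane = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7), (7, 4)]
print([len(r) for r in minimal_cycle_basis(norbornane)])
print([len(r) for r in minimal_cycle_basis(cubane)])
```

Which five of the six cubane faces are returned depends on the tie-breaking order, which is exactly the non-uniqueness problem of the SSSR.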
For chemical applications the main problem with the SSSR is that it is not unique, presenting the implementer with three options when there is more than one SSSR:
1. Arbitrarily select the smallest rings until μ is reached
2. Select the μ smallest rings on the basis of a preferred ordering (e.g. atom, bond and/or ring-connectivity composition)
3. Include more rings to find a superset of the SSSR
For instance, in an analog to cubane, Figure 5.2-2, if vertices 1 and 3 are heteroatoms and the simple cycle 1,2,3,4 is arbitrarily excluded to give the required five rings in the SSSR from the six available, then potentially important information has been omitted. A similar structure may be analyzed to include this ring and exclude another, leading to a mismatch between the two ring sets. For the SSSR, ring size is the only initial selection criterion; no consideration is given to the distinction between regions, non-regions or alternative embedments. The regions, which correspond to faces of a three-dimensional solid, are most likely to be related to the observed activity of a compound. In Figure 5.2-7 neither of the 9-edged simple faces is included in the SSSR since there is an 8-edged tertiary cut face. Other examples can be shown where the rings in the SSSR, although all simple faces, do not all occur in the same embedment. For the inclusion of “synthetically important rings”, norbornane, Figure 5.2-3, is one of the simplest examples of why many ring sets are an extension of a basic SSSR to include rings such as norbornane’s 6-edged simple face (without requiring explicit chemical knowledge in the perception algorithms). With the exception of the set of essential cycles, most ring sets outlined here take this option, i.e. they extend the SSSR in some way to include other rings. In contrast, a method published by Weiser et al. [15] aims to generate an SSSR, but by only finding the smallest ring from a starting atom sometimes finds fewer rings. By considering vertices only, and not edges, the method will fail to find the four-edged ring in Figure 5.2-11a, and by not choosing the highest valence vertices first could fail to find the three-edged ring in Figure 5.2-11b (e.g. by starting from vertices 1 and 2). These failures highlight the necessity of the second phase of Zamora’s algorithm [18] and the ESSR algorithm (Section 5.2.4.5).
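The non-uniqueness of the SSSR for cubane, and its consequences for the sets of relevant and essential cycles, can be made concrete with a brute-force sketch. It assumes that the six 4-edged faces of a cube are the only candidates for a minimal basis of five rings, encodes each face as a bitmask over the twelve edges, and tests every choice of five faces for linear independence over GF(2). The names and the face numbering are hypothetical.

```python
from itertools import combinations

def rank_gf2(vectors):
    """Rank of a list of bitmask vectors over GF(2)."""
    basis = {}
    for vec in vectors:
        while vec:
            top = vec.bit_length() - 1
            if top not in basis:
                basis[top] = vec
                break
            vec ^= basis[top]
    return len(basis)

def cube_faces():
    """The six 4-edged faces of a cube as bitmasks over its 12 edges."""
    edge_bit, faces = {}, []
    for axis in range(3):
        for side in (0, 1):
            corners = [v for v in range(8) if (v >> axis) & 1 == side]
            mask = 0
            for a, b in combinations(corners, 2):
                if bin(a ^ b).count("1") == 1:      # adjacent corners
                    mask |= 1 << edge_bit.setdefault((a, b), len(edge_bit))
            faces.append(mask)
    return faces

faces = cube_faces()
# Every choice of five faces that is linearly independent is one SSSR.
bases = [c for c in combinations(range(6), 5)
         if rank_gf2([faces[i] for i in c]) == 5]
relevant = set().union(*bases)                      # union of all SSSRs
essential = set(range(6)).intersection(*bases)      # intersection of all SSSRs
print(len(bases), sorted(relevant), sorted(essential))
```

Under these assumptions every one of the six possible choices is a valid SSSR, so all six faces are relevant cycles (K-rings) while none is an essential cycle.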
5.2.5 Conclusions
Ring perception is important for the analysis and description of chemical structure ring systems. There is a wide variety of ring sets and perception methods to choose from, many of which are specifically aimed at particular applications. Certain generalizations can be made, however, to give an indication of the best general-purpose sets and methods. Perception can be made more efficient by pre-processing the chemical graph to eliminate acyclic vertices and edges from the processing and to localize processing to individual ring system components. Further efficiencies can be gained by using one of several available graph reductions to reduce the size and/or complexity of the graph. The most efficient perception routines use a mix of breadth-first trace, linear algebraic combination of the resultant ring vectors, and graph reduction during processing. These routines perform ring perception in polynomial time rather than exponential. The most commonly used ring set is the SSSR, but if a unique ring set is desirable then the set of K-rings (relevant cycles) and the ESSR are the most generally suitable, and have a similarly strong mathematical background. A unique ring set is particularly important for applications requiring an unambiguous description of entire ring systems (rather than just a summary basis set), or even partial ring systems (as found in the Markush formulations of chemical patents and combinatorial libraries). The use of chemical rather than mathematical concepts in some ring sets and perception routines can lead to unexpected problems and inconsistencies. In the future, we can expect a strengthening of the mathematical concepts, the refinement of current unique ring sets, the implementation of more efficient algorithms to find them (to bring them up to the level of the most efficient polynomial SSSR algorithms), and the continued development of further ring sets.

Acknowledgment
Grateful thanks go to John Raymond for reporting the error and correction to Figueras’s algorithm during his time at the Dept. of Information Studies, University of Sheffield, UK.
References
1 G. M. DOWNS, V. J. GILLET, J. D. HOLLIDAY, M. F. LYNCH, J. Chem. Inf. Comput. Sci. 1989, 29, 172-187.
2 G. M. DOWNS, V. J. GILLET, J. D. HOLLIDAY, M. F. LYNCH, J. Chem. Inf. Comput. Sci. 1989, 29, 187-206.
3 G. M. DOWNS, Rings - the importance of being perceived, in W. A. WARR (Ed.), Chemical Structures 2; Springer, Berlin, 1993.
4 C. QIAN, W. FISANICK, D. E. HARTZLER, S. W. CHAPMAN, J. Chem. Inf. Comput. Sci. 1990, 30, 105-110.
5 L. BAUMER, G. SALA, G. SELLO, Comput. Chem. 1991, 25, 293-299.
6 B. T. FAN, A. PANAYE, J.-P. DOUCET, A. BARBU, J. Chem. Inf. Comput. Sci. 1993, 33, 657-662.
7 Y. TAKAHASHI, J. Chem. Inf. Comput. Sci. 1994, 34, 167-170.
8 R. BALDUCCI, R. S. PEARLMAN, J. Chem. Inf. Comput. Sci. 1994, 34, 822-831.
9 J. FIGUERAS, J. Chem. Inf. Comput. Sci. 1996, 36, 986-991.
10 HANSER, P. JAUFFRET, G. KAUFMANN, J. Chem. Inf. Comput. Sci. 1996, 36, 1146-1152.
11 G. M. DOWNS, Ring Perception, in The Encyclopedia of Computational Chemistry, Vol. 4, P. v. R. SCHLEYER, N. L. ALLINGER, T. CLARK, J. GASTEIGER, P. A. KOLLMAN, H. F. SCHAEFER III, P. R. SCHREINER (Eds), John Wiley and Sons, Chichester, 1998.
12 P. M. GLEISS, P. F. STADLER, Relevant cycles in biopolymers and random graphs, Fourth Slovene Int. Conf. Graph Theory, Bled, June 28-July 8, 1999.
13 P. VISMARA, Electronic J. Combinatorics 1997, 4, #R9 (15 pages).
14 L. DURY, T. LATOUR, L. LEHERTE, F. BARBERIS, D. P. VERCAUTEREN, J. Chem. Inf. Comput. Sci. 2001, 41, 1437-1445.
15 J. WEISER, M. C. HOLTHAUSEN, J. J. FITTER, Comput. Chem. 1997, 18, 1264-1281.
16 M. PETITJEAN, B. T. FAN, A. PANAYE, J.-P. DOUCET, J. Chem. Inf. Comput. Sci. 2000, 40, 1015-1017.
17 A. T. BALABAN, P. FILIP, T. S. BALABAN, J. Comput. Chem. 1985, 6, 316-329.
18 A. ZAMORA, J. Chem. Inf. Comput. Sci. 1976, 16, 40-43.
19 G. M. DOWNS, V. J. GILLET, J. D. HOLLIDAY, M. F. LYNCH, J. Chem. Inf. Comput. Sci. 1989, 29, 207-214.
20 H. NICKELSEN, Nachr. Dok. 1971, 3, 121-123 (and associated microfiche).
21 S. FUJITA, J. Chem. Inf. Comput. Sci. 1988, 28, 78-82.
22 M. PLOTKIN, J. Chem. Doc. 1971, 11, 60-63.
23 J. D. HORTON, SIAM J. Comput. 1987, 16, 359-366.
24 E. J. COREY, G. A. PETERSSON, J. Am. Chem. Soc. 1972, 94:2, 460-465.
25 K. PATON, Commun. Ass. Computing Mach. 1969, 22, 514-518.
26 J. RAYMOND, January 2001, personal communication.
Handbook of Chemoinformatics, Johann Gasteiger, Copyright © 2003 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
5.3 Topological Structure Generators
For many years Dr Ivan Bangov worked as Senior Research Fellow in the Institute of Organic Chemistry at the Bulgarian Academy of Sciences. His main research interests were in the fields of mathematical chemistry, computer chemistry and chemoinformatics (development of methods for computer-aided structure elucidation from 13C NMR spectra and related approaches to structure representation, structure generation, and structure perception), artificial intelligence in chemistry, and chemical databases. He has worked briefly at the Sadtler Division of Bio-Rad, USA, and currently works for Molecular Networks GmbH, Germany.
5.3
Topological Structure Generators
Ivan P. Bangov

5.3.1
Introduction
Enumeration and generation of chemical graphs (molecular structures) are areas tightly related to the development of graph theory in mathematics. Although graph theory originates from Euler's solution of the famous problem of the Königsberg bridges in 1736, its subsequent development by mathematicians such as Sylvester [1], Cayley [2], Polya [3], and others dealt with the practical problem of enumerating classes of chemical structures. It is worth mentioning that the term graph was first used by Sylvester in relation to the solution of such chemical problems. Hereafter we distinguish between the notions of enumeration (obtaining the exact number of chemical structures) and generation (obtaining the structures themselves in a given mathematical representation). Chemical structure enumeration consists of the application of mathematical formulas whose solution leads to the exact number of all (p, q) graphs, where p is the number of atoms and q the number of bonds. For the solution of a series of real-world chemical problems, however, not only the number but also the constitution of the chemical structures, obtained in a proper mathematical representation, is required. This enables them to be further manipulated, graphically represented, and their properties further studied. This is the so-called structure-generation problem, and programs dealing with this problem are named structure generators. Hereafter we shall discuss mostly the problem of generation of chemical graphs. In the mid sixties structure generation became a focus with the rise of applications of artificial intelligence techniques to chemical problems. One of the first projects in this field was the heuristic Dendral system developed at Stanford University, USA [4-6]. Then, the structure generators of Munk et al. [7, 8] (Arizona State University, USA), Chemics of Kudo, Sasaki and Funatsu [9-11] (Toyohashi University, Japan), Molodzov et al. (Siberian Branch of the Academy of Sciences of the USSR) and others followed. Since then, structure generators have become the heart of most program systems for computer-aided structure elucidation. A comprehensive overview of topological structure generators has been given by K. Funatsu [11].
5.3.2 Structure Generation Fundamentals
A graph is usually defined as a mathematical object G, consisting of two sets: V, the set of graph vertices, and E, the set of graph edges (see Chapter 11, Section 4.1 for more details). Formally, it is given by the expression (Eq. (1)):

$$G = (V, E) \tag{1}$$
The graph vertices can be considered as the objects and the edges as their mutual connections. Hence, graph theory can be applied to areas where we deal with structures. It was in the mid fifties when it was realized that graph theory is suitable for the representation of the classical idea of chemical structure. Hence, Eq. (1) can be rewritten as (Eq. (2)):

$$S = (A, B) \tag{2}$$
where S is the chemical structure and A and B are the sets of atoms and bonds, respectively. One can see that there is a one-to-one correspondence between the notion of chemical structure and a special class of graphs called molecular graphs. From now on, when we talk of a graph we will imply a molecular graph or chemical structure, and when we talk of vertices and edges we will imply atoms and bonds, respectively. A very essential feature of graphs is that they do not present the actual geometry (bond lengths, valence, and torsion angles), i.e. graphs represent only the connectivity. Accordingly, the three objects in Figure 5.3-1 represent one and the same graph. Hence, we speak of the topological nature of graph representation. If a direction is defined for the edges we speak of directed graphs (the edges are called arcs in this case) (Figure 5.3-1c). Structure generation is a complex mathematical and algorithmic problem. Each gross formula, mathematically represented by a set of atoms, corresponds to an immense variety of constitutional formulas (isomeric structures) which can be represented by the corresponding molecular graphs. They can be obtained by creating different combinations of connections between the atoms. Hence, the set of edges $E_k$ of each such graph can formally be represented as a subset of the following Cartesian product (Eq. (3)):

$$E_k \subset V \times V \tag{3}$$
Fig. 5.3-1 Three graphical representations of one and the same graph
Consequently, the basic difficulties arise from the combinatorial nature of this problem. The number of generated structures increases exponentially, hence for most chemically relevant cases of compounds with more than 10 heavy (non-hydrogen) atoms their number becomes immense. Mathematicians speak of exponential or NP complexity. In practice, this problem is referred to as “combinatorial explosion”. It should be realized though that the variety in nature is mainly due to this complexity. The process of generation of chemical graphs can be represented by a relationship (Eq. (4)) in which an operator $\Gamma$, acting under a set of restrictions $C$, is applied to the Cartesian product $V \times V$ of the set of vertices $V$.

Here, the operator $\Gamma$ is in no case a simple function but a complex algorithmic procedure, which extracts the sets $E_k \subset V \times V$, thus forming the graphs $G \in \mathfrak{M}$; $C$ is a set of different restrictions imposed on the process of generation. The procedure $\Gamma$ can be applied only to numbered graphs (graphs where the vertices are numbered), as it processes their matrix representations. Hence, the set of isomeric structures $\mathfrak{M}$ is formed of numbered instead of abstract graphs (having non-labeled vertices). The constraints $c_i \in C$ can be chemical (only structures obeying chemical rules are generated), physical (only structures having pre-defined physical properties are generated), and mathematical. One of the most serious mathematical constraints is to avoid the generation of isomorphic graphs. The nature of the isomorphism problem will be discussed in the next section. Here we want to emphasize that the generation of isomorphic structures leads to duplications, which enormously increase the total number of generated structures. This problem was called the isomorphism disease by Read [12].
The idea of first generating isomorphic structures and then subsequently pruning them is counterproductive. The solution of this problem is one of the most important and challenging tasks to be faced in the development of graph generation procedures. Accordingly, the development of a good structure-generation algorithm requires that, instead of diagnostics (recognition of isomorphic structures after being generated) and subsequent surgical intervention (their pruning from the hit-list), a prophylaxis is carried out at all stages of the generation process (i.e. their generation is avoided as early as possible). The following basic rules are usually applied to the generation of molecular graphs:
c1: each molecular graph is to be compatible with the input gross (empirical) formula, i.e. it must be formed by the types and numbers of the atoms specified in this formula;
c2: the vertices $v_i \in V$ of each graph have to reflect some basic chemical properties such as valence and hybridization state of the atoms $a_i \in A$ in the chemical structure;
c3: the formation of the edges $e_{ij} \in E$ has to follow the rules of formation of bonds in organic chemistry, i.e. only graphs consistent with chemical structures should be generated;
c4: some additional structural information, such as available structural fragments and chemical groups, has to be easily incorporated into the structure generation scheme; the fragments have to be treated in a uniform way just like separate atoms;
c5: the molecular graph has to correspond to some basic structural properties such as topological symmetry, bond multiplicity, and the number and the size of rings as required by the input information;
c6: the set $\mathfrak{M}$ of the generated structures has to be exhaustive, which means that all possible structures complying with the input information must be generated;
c7: the set $\mathfrak{M}$ has to be non-redundant, that is, only structures complying with the input information are to be generated, without duplications.

Accordingly, the main task is to avoid the generation of both mathematically redundant (isomorphic) graphs and redundant graphs with respect to the input chemical, physical, etc., information. Thus, by means of the constraints $c_i \in C$, which can be considered as rules from a knowledge base, the process of generation can be directed towards either one or other types of graphs in concert with the information available. First, the generation of chemical graphs (structures) must comply with the chemical consistency constraints, i.e. they must correspond to the chemical objects (molecules) which they represent.
The redundancy related to isomorphism and chemical consistency constraints can be exactly defined by the number of generated structures. In contrast, some other redundancies related to chemical, physical, biological, spectral, and other properties might not be so sharply defined due to the fuzzy nature of the employed information.
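The two simplest constraints, c1 and c2, can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function name `satisfies_c1_c2`, the data layout (a list of element symbols plus a dictionary of bond orders) and the small valence table are hypothetical choices made here, not part of any generator described in this chapter.

```python
from collections import Counter

# Assumed standard valences for a few elements (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def satisfies_c1_c2(atoms, bonds, formula):
    """Check constraints c1 (gross formula) and c2 (valence) for one graph.

    atoms   -- list of element symbols, indexed by vertex number (0-based)
    bonds   -- dict mapping (i, j) with i < j to an integer bond order
    formula -- dict mapping element symbol to the required number of atoms
    """
    # c1: the atoms of the graph must be exactly those of the gross formula.
    if Counter(atoms) != Counter(formula):
        return False
    # c2: the bond orders at every vertex must add up to the atom's valence.
    used = [0] * len(atoms)
    for (i, j), order in bonds.items():
        used[i] += order
        used[j] += order
    return all(used[v] == VALENCE[atoms[v]] for v in range(len(atoms)))

# Methanol, CH4O, with explicit hydrogens.
methanol_atoms = ["C", "O", "H", "H", "H", "H"]
methanol_bonds = {(0, 1): 1, (0, 2): 1, (0, 3): 1, (0, 4): 1, (1, 5): 1}
print(satisfies_c1_c2(methanol_atoms, methanol_bonds, {"C": 1, "H": 4, "O": 1}))
```

A real generator would apply such checks during the construction of each extension rather than on finished graphs, in line with the "prophylaxis" principle stated above.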
5.3 Topological Structure Generators
As mentioned above, the nature of structure generation is combinatorial. It is usually a procedure which produces different connections between atoms. We may consider the creation of a bond either as adding one atom to another or as converting a zero entry in the adjacency matrix into 1. This is carried out in a combinatorial manner, leading at each level (adding an atom or fragment) to the creation of substructures, which hereafter we call extensions. It should be mentioned that understanding the combinatorial algorithms is a serious challenge to both the developer and the reader.
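The notion of an extension can be made concrete with a small Python sketch. It assumes simple graphs with single bonds only and a uniform maximum degree; the function name `extend` and the parameter `max_degree` are hypothetical and serve only as an illustration.

```python
def extend(adj, max_degree=4):
    """Yield every extension obtained by attaching one new vertex to adj.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists.
    """
    n = len(adj)
    for v in range(n):
        if sum(adj[v]) >= max_degree:
            continue  # vertex v has no free valence left
        # Copy the matrix and enlarge it by one column and one row of zeros.
        new = [row + [0] for row in adj] + [[0] * (n + 1)]
        # Creating the bond: the zero entries (v, n) and (n, v) become 1.
        new[v][n] = new[n][v] = 1
        yield new

# Two-atom extension -> all three-atom extensions.
for matrix in extend([[0, 1], [1, 0]]):
    print(matrix)
```

Both three-atom matrices printed here describe the same chain of three atoms, numbered differently. This is exactly the redundancy that a blind combinatorial procedure produces.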
5.3.3 The Isomorphism Problem
Chemistry teaches us that chemical structures are formed of different types of atoms (C, N, O, etc.). If atoms of the same type (say, only carbon atoms) are free (not bonded), they are indistinguishable, i.e. they are equivalent. All of their properties are considered equal (charges, number of electrons in the different shells, etc.). This is not the case for bonded atoms of the same type. Depending on their environment, their electronic structure is perturbed. Thus, a carbon atom next to a CH3 or an OH group in a given structure is different from a carbon atom surrounded by CH2 groups. If one calculates their charge densities, different values will be obtained. Hence, the atoms in a molecular structure are distinct and must be labeled distinctly. Most frequently, the labeling is performed by numbering the separate atoms. Graphs formed of numbered vertices are called numbered graphs, in contrast to non-labeled or abstract graphs (three different ways of labeling are presented in Figure 5.3-2). Generally speaking, to represent a graph mathematically we need it to be numbered. On the basis of a numbered graph we can construct its mathematical representation. It can be the adjacency matrix, the connection table, a linear notation (such as the SMILES code), or any other representation (the author has suggested the so-called two-row matrix representation, which will be discussed below). Inasmuch as the structure-generation process is a "blind" combinatorial procedure, consisting of successively adding atoms to other atoms and thus generating extensions of the chemical structure, some equivalent structures with different numberings will always appear, e.g. the two structures b. and c. in Figure 5.3-2. Their adjacency matrices are also shown in Figure 5.3-2. One can see that equivalent (isomorphic) structures have different representations (adjacency matrices).
Hence, the computer considers them as different structures. Generally speaking, there are N! distinct numberings for a structure of N atoms, which leads to the generation of up to N! isomorphic structures. They obey the similarity transformation (Eq. (5)):
$$A'' = P^{-1} A' P \qquad (5)$$
Fig. 5.3-2 Two differently labeled isomorphic graphs: a. labeled by letters, and b. and c. labeled by numbers. The adjacency matrix representations (A' and A'') of the numbered graphs and their mutual transformation (π) are also provided:
$$\pi = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ 3 & 4 & 1 & 2 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \end{pmatrix}$$
Here A' and A'' are the original and the transformed matrices, respectively. P is a permutation matrix whose elements are defined as p_ij = 1 if the permutation maps vertex i onto vertex j, and p_ij = 0 otherwise. One can see from Figure 5.3-2 that this transformation consists of a permutation between the rows (1, 3) and (2, 4) and between the respective columns (1, 3) and (2, 4).
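The effect of a renumbering on the adjacency matrix can be checked with a few lines of Python. This is a minimal sketch in pure Python, assuming 0-based vertex numbers and using the fact that the inverse of a permutation matrix is its transpose. All function names are hypothetical, and the four-atom chain is an arbitrary small example, not the twelve-atom structure of Figure 5.3-2.

```python
from itertools import permutations

def permutation_matrix(perm):
    """P[i][j] = 1 if the permutation maps vertex i onto vertex j."""
    n = len(perm)
    return [[1 if perm[i] == j else 0 for j in range(n)] for i in range(n)]

def matmul(X, Y):
    """Plain matrix product of two square lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def renumber(adj, perm):
    """Similarity transformation P^-1 A P, with P^-1 equal to P transposed."""
    P = permutation_matrix(perm)
    return matmul(matmul(transpose(P), adj), P)

def distinct_matrices(adj):
    """Set of all adjacency matrices reachable by renumbering the vertices."""
    n = len(adj)
    return {tuple(map(tuple, renumber(adj, p))) for p in permutations(range(n))}

# Chain of four atoms; exchange vertices (1, 3) and (2, 4), written 0-based.
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(renumber(chain, [2, 3, 0, 1]))
print(len(distinct_matrices(chain)))
```

For the four-atom chain the 4! = 24 numberings give only 12 distinct matrices, because the chain is mapped onto itself when it is reversed. N! is therefore an upper bound that is reached only by structures without topological symmetry.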
5.3.4 Structure Generation Approaches
Several approaches to the solution of the graph isomorphism problem in the process of graph generation exist, mainly developed during the last 30 years. These approaches can formally be classified into three basic groups:
(i) approaches based on the selection of one and only one numbering out of all isomorphic numberings to uniquely represent the graph; this numbering is called canonical numbering;
(ii) approaches based on superatoms;
(iii) approaches based on the relationship between isomorphism and the topological vertex symmetry of the generated extensions.
The methods of group (i) are based on the fact that for each abstract (non-numbered) graph only one out of all numbered graphs can be selected, i.e. one unique numbering, called canonical, may be found. Usually this is carried out by defining a canonicity criterion K. Hence, all the methods of this group consist of tracing all the extensions G' formed after the addition of a new vertex v to the different vertices of the previously generated extension G. The new extensions are tested against the canonicity criterion, and only the extensions complying with this criterion are selected for the next level of the generation process (adding a new vertex). There are three types of criteria known from the literature: the maximal (minimal) adjacency matrix [13-15], the maximal connectivity stack [9-11], and the lexicographically maximal N-tuple code
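[16]. The criterion for the maximal (minimal) adjacency matrix is formed by developing, in a row-by-row manner, the upper (or lower) triangle of the adjacency matrix A into a characteristic vector (Eq. (6), written here for the upper triangle):
$$[A] = (a_{12}, a_{13}, \ldots, a_{1N}, a_{23}, \ldots, a_{N-1,N}) \qquad (6)$$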
Hence the criterion K can be defined as (Eq. (7)):
$$K: [A] = \min[A] \quad \text{or} \quad K: [A] = \max[A] \qquad (7)$$
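As a rough illustration of criterion K in its "maximal" form, the following Python sketch tests a numbering by brute force over all permutations. It assumes that the characteristic vector is the upper triangle of the adjacency matrix read row by row without the diagonal; the function names are hypothetical and the code is not taken from any of the cited programs.

```python
from itertools import permutations

def characteristic_vector(adj):
    """Upper triangle of adj read row by row (diagonal excluded)."""
    n = len(adj)
    return tuple(adj[i][j] for i in range(n) for j in range(i + 1, n))

def relabel(adj, perm):
    """Renumber the vertices: old vertex i becomes vertex perm[i]."""
    n = len(adj)
    new = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            new[perm[i]][perm[j]] = adj[i][j]
    return new

def is_canonical(adj):
    """True if no renumbering gives a larger characteristic vector."""
    current = characteristic_vector(adj)
    return all(characteristic_vector(relabel(adj, p)) <= current
               for p in permutations(range(len(adj))))

# Two numberings of the same three-atom chain.
centre_first = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # vector (1, 1, 0)
centre_second = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # vector (1, 0, 1)
print(is_canonical(centre_first), is_canonical(centre_second))
```

Python compares tuples lexicographically, which is the ordering needed here. The loop over all N! permutations makes the cost of this naive test grow factorially with the number of atoms.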
Accordingly, the characteristic vector of each new extension is compared with the maximal (minimal) vector and only the extension corresponding to the latter is left for the next step of the generation process. In Figure 5.3-3, two isomorphic structures with their vectors are presented. One can see that the vector corresponding to the first adjacency matrix is larger than that of the second matrix.
Fig. 5.3-3 Adjacency matrices A and A' of two isomorphic structures:
$$A = \begin{pmatrix} 0&1&0&0&0&0&0 \\ 1&0&1&1&1&0&0 \\ 0&1&0&0&0&0&0 \\ 0&1&0&0&0&0&0 \\ 0&1&0&0&0&1&1 \\ 0&0&0&0&1&0&0 \\ 0&0&0&0&1&0&0 \end{pmatrix} \qquad A' = \begin{pmatrix} 0&0&1&0&0&0&0 \\ 0&0&1&0&0&0&0 \\ 1&1&0&1&0&1&0 \\ 0&0&1&0&0&0&0 \\ 0&0&0&0&0&1&0 \\ 0&0&1&0&1&0&1 \\ 0&0&0&0&0&1&0 \end{pmatrix}$$
A procedure usually applied is the following: for each new extension, all the atom number permutations (numberings) together with their characteristic vectors [A] are generated and compared with the current numbering of this extension. If the criterion K (Eq. (7)) is satisfied, the current numbering is considered canonical and the current extension is a canonical form. If the criterion K is not satisfied, the current extension is rejected. In this way, the structure investigates itself during the process of its formation. A substantial drawback of this approach is the necessity of generating a great number of permutations for each extension. Kvasnicka et al. [15] introduced a slightly different approach. It is based on the semi-canonical numbering method of Faradjiev [14]. Instead of using permutations they compute a unique semi-canonical numbering, which they further use to construct the characteristic vector. Kudo and Sasaki [9, 10] suggested another method based on the connectivity stack. The connectivity stack of a matrix is a sequence of the adjacency matrix values s_k = a_ij (Eq. (8)):
$$s_1, s_2, \ldots, s_k, \ldots, s_{N(N-1)/2}, \qquad s_k = a_{ij} \qquad (8)$$
The position of the kth element within the stack is given by the following relationship between the indices i and j of the adjacency matrix elements a_ij and the element number k (Eq. (9)):
$$k = i + \frac{(j-1)(j-2)}{2} \qquad (i < j) \qquad (9)$$
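Eq. (9) can be tried out directly. The short Python sketch below builds the connectivity stack of a numbered graph and applies it to the three possible numberings of a three-atom chain. The helper `path_matrix` and both function names are hypothetical and are introduced only for this illustration.

```python
def connectivity_stack(adj):
    """Stack s_1 ... s_{N(N-1)/2} with s_k = a_ij and k = i + (j-1)(j-2)/2."""
    n = len(adj)
    stack = [0] * (n * (n - 1) // 2)
    for j in range(2, n + 1):        # 1-based column index
        for i in range(1, j):        # 1-based row index, i < j
            k = i + (j - 1) * (j - 2) // 2
            stack[k - 1] = adj[i - 1][j - 1]
    return stack

def path_matrix(order):
    """Adjacency matrix of a chain whose atoms carry the numbers in order."""
    n = len(order)
    adj = [[0] * n for _ in range(n)]
    for a, b in zip(order, order[1:]):
        adj[a - 1][b - 1] = adj[b - 1][a - 1] = 1
    return adj

for numbering in ([1, 2, 3], [1, 3, 2], [2, 1, 3]):
    print(numbering, connectivity_stack(path_matrix(numbering)))
```

The three numberings give the stacks 101, 011 and 110. The last one is the largest, so under the maximal-stack criterion the numbering in which the central atom carries the number 1 would be the canonical one.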
Three three-atom extensions and their connectivity stacks are presented in Figure 5.3-4. All the stacks of a given extension are formed by permuting the numbering of its atoms, and they are compared with the stack of the extension currently generated. The extension is considered canonical if its numbering corresponds to the maximal stack. Otherwise the extension is rejected.
Fig. 5.3-4 Connectivity stacks of three three-atom isomorphic extensions: s1 = (101), s2 = (011), s3 = (110). They consist of the 0 and 1 values from the corresponding adjacency matrices; their positions within the stack are determined according to Eq. (9).
Knop, Trinajstic et al. [16] introduced a method based on the lexicographical ordering of N-tuples. An N-tuple is a rooted tree and its code is presented by the
non-negative numbers which are less than N. N is the degree of the current vertex (valence of the atom). The creation of the maximal N-tuple is illustrated in Figure 5.3-5. The structure is examined by successively stripping off the nodes, starting with the node of highest degree (valence). Each tree may have many N-tuple codes, but the lexicographically maximal code is considered canonical. The first vertex is considered the root; hence, the rooted tree can formally be represented as T_r = (V, E, r), where r is the root vertex. Thus, with each extension, an N-tuple code is generated and only the extensions corresponding to the canonical (lexicographically maximal) codes are left for the next level of the generation process. This approach was further developed for the cases of poly-condensed cyclic structures. A serious problem of the latter method is the generation of structures containing heteroatoms, multiple bonds, arbitrary cycles, etc. For example, the valence of an oxygen atom is the same as the valence of a CH2 group; hence, the generated N-tuples will be the same. A general shortcoming of all the methods discussed up to now is that it is difficult to incorporate structural features such as multiple bonds, chemical groups, and fragments into the structure generation process. The use of fragments as basic elements of the combinatorial process sharply alleviates the combinatorial explosion problem: the larger the fragments, the more the combinatorial problem is alleviated. To solve these problems Contreras et al. [17, 18] suggested the following approach. First, the skeletons of the structures are generated by using the lexicographically maximal N-tuple approach; then a further procedure is applied for coloring the vertices (assigning different heteroatoms) and the bonds (assigning the bond types).
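The N-tuple idea can be sketched in Python under an explicit assumption, since the text does not spell out the construction: the code is taken here to be the list of child counts of the vertices, visited root first, with the subtrees of every vertex arranged so that the resulting code is lexicographically maximal. The function names and the nine-vertex example tree are hypothetical; the tree was chosen only because its code equals the one quoted with Figure 5.3-5.

```python
def rooted_code(tree, root, parent=None):
    """Child counts in preorder, with subtrees sorted in descending order."""
    children = [v for v in tree[root] if v != parent]
    subcodes = sorted((rooted_code(tree, c, root) for c in children), reverse=True)
    code = [len(children)]
    for sub in subcodes:
        code.extend(sub)
    return code

def n_tuple_code(tree):
    """Lexicographically maximal code over all possible choices of the root."""
    return max(rooted_code(tree, r) for r in tree)

# Hypothetical tree: vertex 0 has four neighbours, one of which carries a
# three-atom chain and a single-atom branch.
tree = {
    0: [1, 6, 7, 8],
    1: [0, 2, 5],
    2: [1, 3],
    3: [2, 4],
    4: [3],
    5: [1],
    6: [0],
    7: [0],
    8: [0],
}
print(n_tuple_code(tree))
```

Because the code records only how many neighbours each vertex has, an oxygen atom and a CH2 group in the same position give identical codes, which is the limitation noted above.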
In Chemics, the structures are first generated by using a set of atoms and small groups derived from both the empirical formula and the available spectral information. Then, each structure is checked by a substructure search procedure to determine whether it incorporates the input fragments. Both substructure search procedures (see Chapter VI, Section 3) and the procedures for coloring the vertices are, however, combinatorial, i.e. they are of NP complexity. This further burdens the process of structure generation. In a series of papers, Zefirov et al. [19] developed a sophisticated method, based on the approach of Faradjiev, for directly incorporating molecular fragments in the process of structure generation.
Fig. 5.3-5 Construction of an N-tuple from a chemical structure. N-tuple code: 4 2 1 1 0 0 0 0 0
The methods working with superatoms are represented mainly by the Congen and Genoa programs within the Dendral project [4-6]. A superatom is considered as a cyclic substructure having any number of free valences. A catalog of different superatoms was created (the catalog of the Genoa program contains 300 superatoms). Hence, the molecular structures are considered to be formed of two parts: cyclic and acyclic. A simplified scheme of the different stages of generation of cyclic structures is presented in Figure 5.3-6. First, the various distributions of the superatoms within the gross formula are generated. From each such distribution the vertex graphs are generated. A vertex graph is a cyclic skeleton whose vertices of
[Fig. 5.3-6, upper part of the scheme: C10N2U2 (molecular formula, degree of unsaturation), followed by its partitions C10N2U2, C9N2U2/C, ...]