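considered as an operator, then in this context the expectation value of any observable, expressed as the discrete n-dimensional image vector $\mathbf{u}$, say, may be obtained as
$$\langle\Omega\rangle \approx \mathbf{u}^{T}\mathbf{w} + a\,\mathbf{u}^{T}\mathbf{T}\mathbf{w} \qquad (95)$$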
so molecular properties can be determined in relation to the TM. As the discrete observable vectors may be decomposed as a linear combination of the density vector coordinates $\mathbf{w}$ and an orthogonal complement vector $\mathbf{v}$, say
$$\mathbf{u} = \alpha\mathbf{w} + \beta\mathbf{v} \qquad (96)$$
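The decomposition of Eq. 96 can be illustrated with a minimal sketch. The vectors below are hypothetical numbers chosen only for illustration, and the helper names `dot` and `decompose` are assumptions of this example, not part of the original formalism. The sketch projects u onto w to obtain the coefficient of w and keeps the remainder as the orthogonal part.

```python
def dot(x, y):
    """Euclidean scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def decompose(u, w):
    """Split u into alpha * w plus a remainder orthogonal to w (the beta * v term)."""
    alpha = dot(u, w) / dot(w, w)
    remainder = [ui - alpha * wi for ui, wi in zip(u, w)]
    return alpha, remainder


# Hypothetical data: w plays the role of the density coefficient vector,
# u the role of the discrete image of an observable.
w = [0.5, 0.3, 0.2]
u = [1.0, 2.0, 0.5]

alpha, beta_v = decompose(u, w)
rebuilt = [alpha * wi + ri for wi, ri in zip(w, beta_v)]
```

Because the remainder is orthogonal to w by construction, only the alpha term survives in the plain scalar product of u with w.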
Quantum Similarity
Then, it is also noteworthy that observable expectation values may be related to MQS-SM because, substituting Eq. 96 into expression 95, the following approximation is obtained:
$$\langle\Omega\rangle_A \approx \alpha\theta_{AA} + \gamma\,\mathbf{v}^{T}\mathbf{T}\mathbf{w} = \alpha\theta_{AA} + \gamma\omega_A \qquad (97)$$
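where $\gamma = \beta a$. Now, after inspecting Eq. 97 and taking the self-similarity, $\theta_{AA}$, as an available variable, as the first term suggests, $\omega_A$ and thus $\langle\Omega\rangle_A$ can be developed in a power series such as
$$\langle\Omega\rangle_A \approx \sum_{p} a_p\,\theta_{AA}^{\,p} \qquad (98)$$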
In the above series, $\{a_p\}$ are appropriate coefficients, depending on the operator and the nature of the MQS-SM. This possible feature is currently being investigated in our laboratory. Preliminary interesting relationships between self-similarity and conformational energy have already been published (Ref. 41). Equation 98 also suggests that nonlinear QSPR may be found in some particular cases. The successive powers of the MQS-SM could be substituted by multiple QS-SM if necessary. For example, using the equivalence
9^M = ^S = Jp.('-r'^r
(99)
the integral definition can be changed according to computational convenience and the availability of the density matrix elements.

D. MQSM Topological Indices
Due to the connection between MQSM and molecular topology evidenced in the previous section, it can be assumed that the ASA metric matrices involved in the computation of self-similarity measures may act as good substitutes for the TM. Two interesting points must be stressed with respect to this association:

1. Several metric matrices may be accessible, and any of them can be used to construct a quantum similarity TM (QSTM). Depending on the weighting operator or the density matrix element chosen within the metric MQSM framework, a somewhat different matrix appears and thus another kind of QSTM becomes available.
2. Any chosen QSTM naturally carries information on the three-dimensional structure of the molecule so represented. The drawback of this approach lies in the necessary use of molecular atomic coordinates to compute the QSTM metric elements, in contrast to the classical TM, where only a molecular graph is needed.
R. CARBÓ-DORCA, L. AMAT, E. BESALÚ, and M. LOBATO
Once the possibility of using QSTM to compute molecular quantum topological indices (MQTI) is admitted, these can be computed in the same way as is customary in TI calculations.

Topology and Molecular Similarity Matrix Elements

The traditional description of a molecule by means of hydrogen-suppressed graphs has been widely used in QSPR studies, giving good results when numerical quantities derived from the graphs are correlated with physicochemical and biological properties (Ref. 42). The classical topological representation of a molecule by a graph can be coded by means of its TM, T, whose elements are 1 if the associated atoms can be considered connected (bonded) and 0 otherwise, as previously mentioned. Thus, only links between bonded atoms are taken into account when constructing the classical TM. This matrix is used in the definition of the TI, which are afterwards taken as molecular descriptors to be correlated with properties. The TI most widely and successfully applied in QSAR studies are those of Wiener and Randić. The subset of generalized connectivity indices defined by Kier and Hall (Ref. 16) has also proved to be extremely successful.

However, the three-dimensional molecular structure is often of key importance for describing some molecular properties, and taking it into account can lead to more accurate QSAR results, because part of the spatial information of the molecule is lost when a topological matrix is used. To include this information when defining molecular descriptors, Randić has introduced a geometrical distance matrix representation as a substitute input data set (Ref. 43). Among the TI that can be defined and may be relevant in this part of the present study are the well-known Wiener path number, the Randić, Schultz, and Balaban indices, the Harary number, and the generalized connectivity indices ${}^{m}\chi_t$. All of them are widely discussed in Ref. 44.
To compute the TI, it is only necessary to define some auxiliary matrices, as shown in Table 1. The TI definitions, and their relationship with the matrix elements shown in Table 1, are summarized in Table 2.
Table 1. Definition of Matrices Used in TI Calculations^a

| Matrix | Elements Definition |
|---|---|
| T (n × n) | Classical topological matrix: $T_{ij} = 1$ if atoms i and j are bonded, $T_{ij} = 0$ otherwise |
| D (n × n) | $D_{ij}$: topological length of the shortest path from atom i to atom j |
| V (n) | $V_i$: sum of entries in the ith row or column of matrix T |

Note: ^a n is the number of atoms.
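The matrices of Table 1 can be built from a list of bonds, as in the following minimal sketch. The function name `topological_matrices`, the zero-based atom numbering, and the isobutane example are assumptions made for illustration; the shortest path lengths are obtained here with a breadth-first search, which is one common choice and not necessarily the one used by the authors.

```python
from collections import deque


def topological_matrices(n, bonds):
    """Return (T, D, V) for a hydrogen-suppressed graph with n atoms.

    bonds is a list of (i, j) pairs of bonded atoms, numbered from 0.
    The graph is assumed to be connected.
    """
    T = [[0] * n for _ in range(n)]
    for i, j in bonds:
        T[i][j] = T[j][i] = 1

    # V_i: sum of the entries in row i of T (vertex degree).
    V = [sum(row) for row in T]

    # D_ij: number of bonds in the shortest path between atoms i and j,
    # found by a breadth-first search started from every atom.
    D = []
    for start in range(n):
        dist = [None] * n
        dist[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if T[i][j] and dist[j] is None:
                    dist[j] = dist[i] + 1
                    queue.append(j)
        D.append(dist)
    return T, D, V


# Hypothetical example: isobutane, central carbon 0 bonded to carbons 1, 2, 3.
T, D, V = topological_matrices(4, [(0, 1), (0, 2), (0, 3)])
```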
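Table 2. Definition of Several Assorted Topological Indices

| Index | Definition |
|---|---|
| Wiener path number | $W = \sum_{i=1}^{n}\sum_{j>i}^{n} D_{ij}$ |
| Randić index | $\chi = \sum_{i=1}^{n}\sum_{j>i}^{n} T_{ij}\,(V_i V_j)^{-1/2}$ |
| Schultz index (molecular topological index) | $MTI = \sum_{i=1}^{n} [\mathbf{V}(\mathbf{T}+\mathbf{D})]_i$ |
| Harary number | $H = \sum_{i=1}^{n}\sum_{j>i}^{n} D_{ij}^{-2}$ |
| Balaban index | $J = \dfrac{n_e}{\mu+1}\sum_{i=1}^{n}\sum_{j>i}^{n} T_{ij}\,\left[(D)_i (D)_j\right]^{-1/2}$; $\mu$: number of cycles; $n_e$: number of edges; $(D)_i$: sum of distances from vertex i |
| Generalized connectivity indices (order: m, type: t) | ${}^{m}\chi_t = \sum_{s=1}^{n_m}\prod_{i=1}^{m+1} (V_i)_s^{-1/2}$; $n_m$: number of connected subgraphs of type t |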
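Several of the indices in Table 2 can be evaluated directly from the matrices T and D, as sketched below. This is an illustrative implementation under stated assumptions: the function name `topological_indices` is hypothetical, the Harary number is taken with the exponent -2, the number of cycles is obtained from the edge and vertex counts of a connected graph, and the generalized connectivity indices are left out because they require a subgraph enumeration. The matrices are written out by hand for n-butane.

```python
import math


def topological_indices(T, D):
    """Evaluate some classical indices from the topological and distance matrices."""
    n = len(T)
    V = [sum(row) for row in T]                        # vertex degrees
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

    wiener = sum(D[i][j] for i, j in pairs)
    randic = sum(1.0 / math.sqrt(V[i] * V[j]) for i, j in pairs if T[i][j])
    harary = sum(D[i][j] ** -2 for i, j in pairs)      # assumed exponent -2
    # Schultz index: sum of the components of the row vector V (T + D).
    schultz = sum(V[i] * (T[i][j] + D[i][j]) for i in range(n) for j in range(n))

    n_edges = sum(V) // 2
    n_cycles = n_edges - n + 1                         # valid for a connected graph
    dist_sum = [sum(row) for row in D]                 # sum of distances from each vertex
    balaban = n_edges / (n_cycles + 1) * sum(
        1.0 / math.sqrt(dist_sum[i] * dist_sum[j]) for i, j in pairs if T[i][j]
    )
    return {"wiener": wiener, "randic": randic, "harary": harary,
            "schultz": schultz, "balaban": balaban}


# n-Butane as a hydrogen-suppressed path graph 0-1-2-3.
T_butane = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
D_butane = [[abs(i - j) for j in range(4)] for i in range(4)]
indices = topological_indices(T_butane, D_butane)
```

The same function accepts any pair of square matrices, so a QSTM and a matrix playing the role of D could be passed in place of the classical ones, provided the off-diagonal entries used in the denominators are nonzero.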
To obtain the new QSTI associated with the present framework, a set of matrices related to those used in QSM is evaluated for each molecule and introduced as the background for the computation of the indices. These new indices do not derive from the classical TM elements; they arise from the new TM set associated with QSM, the QSTM, whose elements are then used within the same computational structure as in classical TI evaluation over the TM. To compute MQTI, it is necessary to know some molecular geometry data and to use a simple set of atomic basis functions, sufficient to describe the molecular density in a very schematic way. The geometry can be obtained by means of either ab initio or semiempirical methodology, or by any other reliable procedure or source. Once the molecular geometry is known, the procedure associates a basis function with each atom. The active basis set can be defined using a normalized 1s GTO
function $\{g_a(\mathbf{r} - \mathbf{R}_a)\}$ for atom a, with a scale factor $\zeta_a$. In practice, these functions can be constructed by transforming single-zeta STO exponents, derived from atomic SCF energy optimization (Ref. 45), into GTO function exponents by minimizing the integral quadratic error between both function types. Other approaches may be imagined and are currently being tested in our laboratory.

The similarity matrix used in the computation of MQTI, which can be used to substitute the TM and the topological distance matrix of classical index evaluation, has to be defined as previously discussed. It must be noted that, while MQSM concepts are usually defined between molecular pairs, within the MQTI framework every matrix is defined using a single molecular structure, as in self-similarity calculations, according to the analysis carried out before. In the following definitions, a and b are atoms belonging to molecule A. Integrals are computed between two normalized 1s GTO functions corresponding to atoms a and b with centers $\mathbf{R}_a$ and $\mathbf{R}_b$, respectively. The most relevant QSTM elements are computed as follows:

Overlap-like S(Ω): Each element is computed taking into account the structure of the metric matrix with a positive definite operator, as the general integral:
The use of such an integral form may become justified once one is aware of the simplified structure associated with the present discussion, where only one 1s GTO function is ascribed to a given atom. Because spherical GTO can be taken directly as probability density functions, Eq. 100 corresponds to the next expression below:

Coulomb-like E(Ω): Each element $E_{ab}$ is defined as the two-electron Coulomb repulsion integral:
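Both integrals above, leading to the TM elements, can be described in a unique form, using a new, more general expression:
$$G_{ab} = \iint g_a(\mathbf{r}_1 - \mathbf{R}_a)^{p}\,\Omega(\mathbf{r}_1, \mathbf{r}_2)\,g_b(\mathbf{r}_2 - \mathbf{R}_b)^{p}\,d\mathbf{r}_1\,d\mathbf{r}_2 \qquad (102)$$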
where p = 1 or p = 2 produces the pair of previous integrals, respectively. Moreover, if p = 1 and $\Omega = \delta(\mathbf{r}_1 - \mathbf{r}_2)$, one has an overlap integral, and the corresponding overlap-like similarity integral is associated with the same operator but taking p = 2. A Coulomb repulsion matrix element is obtained with p = 1 or 2 but $\Omega = |\mathbf{r}_1 - \mathbf{r}_2|^{-1}$, while a gravitational matrix element is computed using the operator $\Omega = |\mathbf{r}_1 - \mathbf{r}_2|^{-2}$.
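For one normalized 1s GTO per atom, two of the operator choices above have simple closed forms, which the following sketch uses to fill a QSTM. The closed forms are the standard Gaussian product results for the overlap (p = 1 with the Dirac delta operator) and for the Coulomb repulsion between the squared functions (p = 2 with the inverse distance operator). The function names, the exponents, and the coordinates are hypothetical, atomic units are assumed, and this is not claimed to be the authors' code.

```python
import math


def overlap_1s(alpha, beta, r):
    """Overlap of two normalized 1s GTOs with exponents alpha and beta at distance r."""
    s = alpha + beta
    return (2.0 * math.sqrt(alpha * beta) / s) ** 1.5 * math.exp(-alpha * beta / s * r * r)


def coulomb_1s(alpha, beta, r):
    """Coulomb repulsion between the squares of two normalized 1s GTOs at distance r."""
    p, q = 2.0 * alpha, 2.0 * beta          # exponents of the squared functions
    rho = p * q / (p + q)
    if r < 1e-12:                           # limit of erf(sqrt(rho) r) / r for r -> 0
        return 2.0 * math.sqrt(rho / math.pi)
    return math.erf(math.sqrt(rho) * r) / r


def qstm(exponents, centers, element):
    """Build the n x n matrix whose (a, b) entry is element(exp_a, exp_b, R_ab)."""
    n = len(exponents)
    return [[element(exponents[a], exponents[b], math.dist(centers[a], centers[b]))
             for b in range(n)] for a in range(n)]


# Hypothetical three-atom chain with made-up exponents (atomic units).
exponents = [0.8, 1.1, 0.8]
centers = [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
S = qstm(exponents, centers, overlap_1s)
E = qstm(exponents, centers, coulomb_1s)
```

Unlike the classical TM, both matrices have nonzero entries for every atom pair, decaying with the interatomic distance, which is how the three-dimensional information enters.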
Exchange-like O(Ω): Besides the kind of definitions described in the general context of the last equation above, each element $O_{ab}$ of the topological matrix can be defined as the integral involving an exchange-like structure:
where, depending on the operator's choice, that is, when the operators are taken in the same order as before, the exchange-like measures known as the Cioslowski overlap-like, exchange repulsion, and gravitational exchange integrals are reproduced.

Also, some transformations can be applied to the above-defined matrix elements to avoid the presence of large values. The matrix transforms can be chosen to constrain all values within the interval between zero and one. Diagonal elements are defined as zeros, following the custom in the classical TM. The matrix transforms of the electron repulsion, gravitational, and overlap QSTM lead to three new matrices, where each transformed matrix C(A) is defined from the elements of the original matrix A using a Carbó index construction rule:
$$C(\mathbf{A})_{ab} = \frac{A_{ab}}{\sqrt{A_{aa}A_{bb}}} \qquad (104)$$
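The transform of Eq. 104 is simple to apply to any QSTM, as in this minimal sketch. The function name `carbo_transform` and the numerical matrix are hypothetical; the diagonal is set to zero as the text prescribes, and positive diagonal elements are assumed.

```python
import math


def carbo_transform(A):
    """Scale each off-diagonal element by the geometric mean of its two diagonal elements."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]       # diagonal elements stay at zero
    for a in range(n):
        for b in range(n):
            if a != b:
                C[a][b] = A[a][b] / math.sqrt(A[a][a] * A[b][b])
    return C


# Hypothetical symmetric matrix with positive diagonal elements.
A = [[4.0, 2.0, 1.0],
     [2.0, 9.0, 3.0],
     [1.0, 3.0, 16.0]]
C = carbo_transform(A)
```

For a positive definite matrix with nonnegative elements, the Cauchy-Schwarz inequality keeps every transformed element between zero and one.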
Thus, several QSTM can immediately be defined in this way: topological, overlap, Cioslowski-like overlap, electron repulsion, electron exchange, gravitational, and gravitational exchange; the matrix transforms of these matrices can also contribute quite generally to the TM construction. Finally, MQTI can be obtained from the same classical formulas used in the TI computation, as shown in Table 2, but using any of the QSTM elements instead.

Practical Computation and Use of MQTI

In a practical computation, all MQTI obtained from the SM or their transforms, as defined above, are evaluated. The generalized connectivity indices are computed from order m = 0 up to 6th order. This gives quite a large set of indices, which can be collected as molecular descriptors in an appropriate matrix. This matrix is manipulated in the same manner as the MQSM matrix Z. The practical MQTI selection works as follows: the best individual index is chosen; then all possible pairs are generated and the best correlation is saved; next, all three-index subsets are generated and the best is saved, and so on. In this context, best means the multilinear regression model achieving the highest cross-validated r² (Q²), which is the parameter that determines the model's prediction accuracy (Ref. 46). The whole procedure is performed by a parallelizable algorithm based on nested summation symbols (Ref. 34). The algorithm was coded into a Fortran 90 program, NESTED-MLR, developed in our laboratory (Ref. 47). Results will be published elsewhere.
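The selection scheme just described can be sketched as an exhaustive search over descriptor subsets scored by a leave-one-out Q². The sketch below is not the NESTED-MLR program: it replaces the nested summation symbol algorithm with `itertools.combinations`, solves the regression through the normal equations, and uses made-up descriptor values. All function names are assumptions of this example.

```python
from itertools import combinations


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit(X, y):
    """Least-squares coefficients from the normal equations (X rows include the intercept)."""
    m = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(m)]
    return solve(A, b)


def q2_loo(descriptors, y, cols):
    """Leave-one-out cross-validated Q2 of the model built on the chosen columns."""
    rows = [[1.0] + [d[c] for c in cols] for d in descriptors]
    press = 0.0
    for i in range(len(y)):
        coef = fit(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        predicted = sum(c * x for c, x in zip(coef, rows[i]))
        press += (y[i] - predicted) ** 2
    mean = sum(y) / len(y)
    return 1.0 - press / sum((yi - mean) ** 2 for yi in y)


def best_subsets(descriptors, y, max_size):
    """For each subset size, keep the (Q2, columns) pair with the highest Q2."""
    n_cols = len(descriptors[0])
    return {size: max((q2_loo(descriptors, y, cols), cols)
                      for cols in combinations(range(n_cols), size))
            for size in range(1, max_size + 1)}


# Made-up data: six molecules, three indices; the property depends on index 0 only.
descriptors = [[1.0, 5.0, 2.0], [2.0, 3.0, 7.0], [3.0, 8.0, 1.0],
               [4.0, 1.0, 6.0], [5.0, 9.0, 3.0], [6.0, 2.0, 8.0]]
y = [2.0 * d[0] + 1.0 for d in descriptors]
best = best_subsets(descriptors, y, max_size=2)
```

The cost grows combinatorially with the number of indices and the subset size, which is why a parallelizable formulation matters in practice. Collinear descriptor columns would make the normal equations singular and are not handled in this sketch.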
VII. SIMILARITY OVER ENERGY SURFACES

Until now, the principal characteristic of the functions attached to QOS elements has been their positive definiteness, being associated with normalizable probability distributions. But in quantum chemistry, under the Born-Oppenheimer approximation, there are several function types, mainly associated with electronic energy variation, which are not positive definite but nondefinite functions. Indeed, electronic energy surfaces (EES) (Ref. 28) may have either positive or negative zones. The most typical and widespread function of this kind is the molecular electrostatic potential (MEP), first defined by Bonaccorsi et al. (Ref. 48). Although the electronic part of the MEP has been inserted in the MQSM theoretical structure (Ref. 49), the whole MEP function is not definite. MEP functions even present evident discontinuities, making similarity measure integrals divergent, in the same way as the integral of the simple Coulomb potential function over the interval $[0, +\infty)$ diverges. Thus, comparisons of nondefinite functions of quantum theoretical origin are not yet included in a mathematically sound manner within the QSM structure. This section proposes a possible way to carry out this task.

The starting point of a possible QSM treatment of such functions is the agreement that only positive definite probability functions can be safely and generally used to compute measure integrals. Once this circumstance is accepted, the problem is to design a procedure to transform such functions into adequate measurable functions. Fortunately, the answer is already available in the form of statistical mechanics probability distributions. This simple statement is the starting point of the present discussion.

A. Boltzmann Distributions and Boltzmann Similarity Measures
Statistical mechanics probability distributions, such as the Boltzmann distribution (BD) (Ref. 50), or partition functions, are ideal constructions to relate EES to a probability distribution capable of being integrated to form some kind of similarity measure integral. The procedure necessarily requires transforming the original function into a positive definite BD. Once the new function is obtained, the procedure is the same as in the usual QSM methodology. Suppose that a QOS, $S = \{s_I\}$, is known and that a set of EES is associated with the objects in $S$: $E = \{e_I(\mathbf{R})\}$, where the coordinate vector $\mathbf{R}$ defines the variables on which every surface depends. The limitation here is that the vector $\mathbf{R}$ must be uniformly and coherently defined for every function in the set $E$, in the same way as it was when discussing QSM over density functions. It is possible to find a transformation producing a new set of functions $B = \{\beta_I(\mathbf{R})\}$, corresponding to a BD of every function in $E$, that is,
$$\forall s_I \in S \rightarrow \exists\, e_I(\mathbf{R}) \in E \;\wedge\; \Phi(e_I(\mathbf{R})) = \beta_I(\mathbf{R}) \in B \qquad (105)$$
The transformation $\Phi$ leading to the BD can be defined as in the usual theoretical definition:
$$\Phi(e_I(\mathbf{R})) = \beta_I(\mathbf{R}) = \exp\left(-e_I(\mathbf{R})\,(kT)^{-1}\right) \qquad (106)$$
where k is the Boltzmann constant and T the temperature in K. The BD must be normalizable in the sense that the integral over the volume V, containing the variation of R, converges:
and $\Theta(\mathbf{R})$ is a suitable operator to force, if necessary, the integral to converge. Then, the BD set $B = \{\beta_I\}$ can be normalized using the above integral, that is,
$$\beta_I^{N}(\mathbf{R}) = a_I^{-1}\,\beta_I(\mathbf{R}) \qquad (108)$$
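The transformation and normalization steps can be sketched numerically on a one-dimensional grid. Everything specific in this example is an assumption: the torsional energy profile is invented, energies are taken in kcal/mol, the convergence operator is set to unity because the grid is finite, and the integral is estimated with the trapezoidal rule.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def boltzmann_distribution(energies, temperature):
    """Apply exp(-e / (k T)) to every point of an energy profile given in kcal/mol."""
    kt = K_B * temperature
    return [math.exp(-e / kt) for e in energies]


def integrate(values, step):
    """Trapezoidal rule on a uniform grid with spacing step."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def normalize(beta, step):
    """Divide the distribution by its integral so that it integrates to one."""
    norm = integrate(beta, step)
    return [b / norm for b in beta]


# Hypothetical threefold torsional profile sampled every degree from 0 to 360.
step = 1.0
angles = [i * step for i in range(361)]
energies = [1.5 * (1.0 + math.cos(math.radians(3.0 * phi))) for phi in angles]

beta = boltzmann_distribution(energies, 298.15)
beta_n = normalize(beta, step)
```

Whatever the sign of the original energies, the transformed function is strictly positive, which is the property the similarity integrals require.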
The original or the new normalized BD set $B$ may then be used to compute QSM in the same way as in the usual electronic density distribution framework. That is, a Boltzmann similarity measure (BSM) involving two QO, {A, B}, corresponds to the integral
$$\mu_{AB}(\Omega) = \int \beta_A(\mathbf{R})\,\Omega(\mathbf{R})\,\beta_B(\mathbf{R})\,d\mathbf{R} \qquad (109)$$
Here, just as in Eq. 1, $\Omega(\mathbf{R})$ is a weighting operator, and the densities used are the associated BD $\{\beta_A, \beta_B\}$. Thus, everything said in the realm of electronic density distributions can be repeated here. A paper dealing with this kind of problem has just been published. Detailed examples are given there and will not be repeated here. That discussion has led to the definition, besides the pure QSM and BSM, of mixed quantum and Boltzmann similarity measures, where one of the probability density distributions appearing in the measure integral is of quantum origin and the other of Boltzmann origin.

B. General Distributions and Similarity Measures
Other transformations can be envisaged in the light of the BD form shown in Eq. 106, so as to obtain positive definite functional forms to be compared. An immediate one could be some sort of Gaussian distribution (GD) definition, although in this case the interesting statistical mechanics connection of the previous section is lost. That is, the BD transformation definition may be replaced by the alternative GD expression:
$$\Gamma(e_I(\mathbf{R})) = \gamma_I(\mathbf{R}) = \exp\left(-e_I^{2}(\mathbf{R})\,(kT)^{-2}\right) \qquad (110)$$
where all of the previous definitions hold, but the Boltzmann exponential has been transformed into a GD. The set of GD, $G = \{\gamma_I\}$, attached in this way to the QOS, $S$, can be used in the usual way to produce any kind of similarity measure.

The discussion above on the possible methodological extensions of QSM, using, for example, BD or GD to obtain other plausible molecular similarity measures, can be extended even further. Some general aspects of the question were discussed early on, and in Section IV.A, with respect to the fact that, prior to the computation of any QSM, the density matrix elements may in turn be transformed. In this manner, the similarity measure will contain these density transforms instead of the raw density distributions. Allan and Cooper employed a related procedure when they used a well-known Fourier transform so as to compute the QSM over the momentum space representation (Ref. 51). The situation can finally be generalized, because a general transformation rule over any quantum distribution function descriptor (QDFD) set $\Lambda = \{\lambda_I(\mathbf{R})\}$, having the positive definite structure that has been associated with such functions, can be formally written as the integral transform:
$$\varphi_I(\mathbf{r}) = \int \Omega(\mathbf{R}, \mathbf{r})\,\Phi(\lambda_I(\mathbf{R}))\,d\mathbf{R} \qquad (111)$$
where $\Phi(\lambda_I(\mathbf{R}))$ can be any of the previous suitable function transformations and $\Omega(\mathbf{R}, \mathbf{r})$ a convenient operator. The above integral transform produces, from the original set $\Lambda$, a new QDFD set $F = \{\varphi_I(\mathbf{r})\}$, which can be used afterwards in the calculation of QSM integrals.
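A similarity measure between two transformed energy profiles, in the spirit of the Boltzmann similarity measure of this section, can be sketched on a grid as follows. The profiles, the value of kT, and the function names are hypothetical; the weighting operator defaults to unity; the Gaussian-type option uses an assumed functional form (the squared reduced energy in the exponent); and the normalized index is built with the same Carbó-type rule used elsewhere in this chapter.

```python
import math


def transform(energies, kt, gaussian=False):
    """Map an energy profile onto a strictly positive function."""
    if gaussian:
        # Assumed Gaussian-type form: squared reduced energy in the exponent.
        return [math.exp(-(e / kt) ** 2) for e in energies]
    return [math.exp(-e / kt) for e in energies]   # Boltzmann form


def measure(f_a, f_b, step, weight=None):
    """Trapezoidal estimate of the integral of f_a * weight * f_b on a uniform grid."""
    if weight is None:
        weight = [1.0] * len(f_a)                  # unit weighting operator
    values = [a * w * b for a, w, b in zip(f_a, weight, f_b)]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def carbo_index(f_a, f_b, step):
    """Measure between two functions scaled by the geometric mean of the self-measures."""
    return measure(f_a, f_b, step) / math.sqrt(
        measure(f_a, f_a, step) * measure(f_b, f_b, step))


KT = 0.5925   # kcal/mol, roughly room temperature
step = 1.0
grid = [i * step for i in range(361)]
e_a = [1.5 * (1.0 + math.cos(math.radians(3.0 * x))) for x in grid]
e_b = [2.0 * (1.0 + math.cos(math.radians(3.0 * x + 20.0))) for x in grid]

beta_a, beta_b = transform(e_a, KT), transform(e_b, KT)
bsm_ab = measure(beta_a, beta_b, step)
index_ab = carbo_index(beta_a, beta_b, step)
index_gd = carbo_index(transform(e_a, KT, gaussian=True),
                       transform(e_b, KT, gaussian=True), step)
```

Both energy surfaces must be sampled on the same coordinate grid, which is the numerical counterpart of the requirement that the coordinate vector be coherently defined for every function in the set.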
VIII. CONCLUSIONS

It can be agreed at this stage of the MQSM question that a vast set of similarity measures, adaptable to almost any chemical problem, can be computed for any chemical system. Thus, a complete collection of tools to measure any aspect of chemical similarity is available. The descriptive possibilities of these procedures, being of quantum mechanical origin and thus generally defined, encompass the territory of chemistry and can even be thought to extend to other areas dealing with quantum objects, such as atomic nuclei, where density distributions can also be easily defined.
ACKNOWLEDGMENTS

This work has been partially sponsored by project SAP 96-0158 of the CICYT. L. A. is a fellow of the Ministerio de Educación y Cultura and M. L. benefits from a University of Girona grant. The authors are grateful for lively discussions with Dr. P. Constans, which enlightened many aspects of this work.
REFERENCES

1. Carbó, R.; Arnau, M.; Leyda, L. Int. J. Quantum Chem. 1980, 17, 1185-1189.
2. See, for example: (a) Carbó, R.; Calabuig, B. Comput. Phys. Commun. 1989, 55, 117-126. (b) Carbó, R.; Calabuig, B. J. Chem. Inf. Comput. Sci. 1992, 32, 600-606. (c) Carbó, R.; Calabuig, B. J. Mol. Struct. (Theochem) 1992, 254, 517-531. (d) Carbó, R.; Calabuig, B. In Computational Chemistry: Structure, Interactions and Reactivity, Vol. A; Fraga, S., Ed.; Elsevier: Amsterdam, 1992, pp. 300-324.
3. Carbó, R.; Calabuig, B. In Concepts and Applications of Molecular Similarity; Johnson, M. A.; Maggiora, G., Eds.; Wiley: New York, 1990, pp. 147-171.
4. Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Carbó, R., Ed.; Kluwer: Dordrecht, 1995.
5. Advances in Molecular Similarity; Carbó-Dorca, R.; Mezey, P. G., Eds.; JAI Press: Greenwich, CT, 1996, Vol. 1.
6. See, for example: (a) Concepts and Applications of Molecular Similarity; Johnson, M. A.; Maggiora, G., Eds.; Wiley: New York, 1990. (b) Molecular Similarity; Sen, K., Ed.; Topics in Current Chemistry, Vols. 173 & 174; Springer-Verlag: Berlin, 1995.
7. Carbó, R.; Besalú, E. In Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Carbó, R., Ed.; Kluwer: Dordrecht, 1995, pp. 3-30.
8. See, for example: (a) Carbó, R.; Besalú, E.; Amat, L.; Fradera, X. J. Math. Chem. 1995, 18, 237-246. (b) Carbó, R.; Besalú, E. Afinidad 1996, 53, 77-79.
9. See, for example: (a) Solà, M.; Mestres, J.; Carbó, R.; Duran, M. J. Am. Chem. Soc. 1994, 116, 5909-5915. (b) Solà, M.; Mestres, J.; Carbó, R.; Duran, M. J. Chem. Inf. Comput. Sci. 1994, 34, 1047-1053. (c) Mestres, J.; Solà, M.; Carbó, R.; Luque, F. J.; Orozco, M. J. Phys. Chem. 1996, 100, 606-610. (d) Fradera, X.; Amat, L.; Besalú, E.; Carbó-Dorca, R. Quant. Struct.-Act. Relat. 1997, 16, 25-32.
10. See, for example: (a) Constans, P.; Carbó, R. J. Chem. Inf. Comput. Sci. 1995, 35, 1046-1053. (b) Constans, P.; Amat, L.; Fradera, X.; Carbó-Dorca, R. In Advances in Molecular Similarity; Carbó-Dorca, R.; Mezey, P. G., Eds.; JAI Press: Greenwich, CT, 1996, Vol. 1, pp. 187-211.
11. See, for example: (a) Carbó-Dorca, R.; Besalú, E.; Amat, L.; Fradera, X. In Advances in Molecular Similarity; Carbó-Dorca, R.; Mezey, P. G., Eds.; JAI Press: Greenwich, CT, 1996, Vol. 1, pp. 1-42. (b) Amat, L.; Fradera, X.; Carbó, R. Scientia Gerundensis 1996, 22, 97-107.
12. See, for example: (a) Constans, P.; Amat, L.; Carbó-Dorca, R. J. Comput. Chem. 1997, 18, 826-846. (b) Amat, L.; Carbó, R.; Constans, P. Sci. Gerundensis 1996, 22, 109-121.
13. See, for example: (a) Carbó, R.; Calabuig, B. Int. J. Quantum Chem. 1992, 42, 1681-1693. (b) Carbó, R.; Calabuig, B. Int. J. Quantum Chem. 1992, 42, 1695-1709.
14. Bourbaki, N. Théorie des Ensembles. Éléments de Mathématique; Hermann: Paris, 1979.
15. Carbó, R.; Besalú, E.; Amat, L.; Fradera, X. J. Math. Chem. 1996, 19, 47-56.
16. See for example: Kier, L. B.; Hall, L. H. Molecular Connectivity and Drug Research; Academic Press: New York, 1976.
17. Carbó-Dorca, R.; Besalú, E. J. Math. Chem., in press. See also IQC Technical Report IT-IQC-011996.
18. Arsenin, V. Y. Basic Equations and Special Functions of Mathematical Physics; Iliffe Books: London, 1968.
19. See, for a computational formula over GTO's: Matsuoka, O. Int. J. Quantum Chem. 1973, 7, 365-381.
20. Bethe, H. A.; Salpeter, E. E. Quantum Mechanics of One- and Two-Electron Systems; Springer-Verlag: Berlin, 1957.
21. See, for example: Carbó, R.; Besalú, E.; Calabuig, B.; Vera, L. Adv. Quantum Chem. 1994, 25, 253-313.
22. Carbó, R.; Calabuig, B.; Besalú, E.; Martinez, A. Mol. Eng. 1992, 2, 43-64.
23. See, for example: (a) Carbó, R.; Domingo, L.; Peris, J. J. Adv. Quantum Chem. 1982, 15, 215-265. (b) Carbó, R.; Miró, J.; Domingo, L.; Novoa, J. J. Adv. Quantum Chem. 1989, 20, 375-441. (c) Carbó, R.; Domingo, L.; Peris, J. J.; Novoa, J. J. J. Mol. Struct. 1983, 95, 15-33. (d) Carbó, R.; Calabuig, B. Comput. Phys. Commun. 1989, 52, 345-354.
24. Carbó, R.; Molino, L.; Calabuig, B. J. Comput. Chem. 1992, 13, 155-159.
25. Mestres, J.; Solà, M.; Duran, M.; Carbó, R. J. Comput. Chem. 1994, 15, 1113-1120.
26. Pople, J. A.; Beveridge, D. L. Approximate Molecular Orbital Theory; McGraw-Hill: New York, 1970.
27. Cioslowski, J.; Piskorz, P. J. Chem. Phys. 1997, 106, 3607-3612.
28. See for a detailed analysis of EES: Mezey, P. G. Potential Energy Hypersurfaces; Studies in Physical and Theoretical Chemistry, Vol. 53; Elsevier: Amsterdam, 1987.
29. Jacobi, C. G. J. J. Reine Angew. Math. 1846, 30, 51-94.
30. GATOMIC Program; Amat, L.; Carbó-Dorca, R. Institute of Computational Chemistry, University of Girona, 1997.
31. See for example: Schmidt, M. W.; Ruedenberg, K. J. Chem. Phys. 1979, 71, 3951-3962.
32. See for example: Carbó, R.; Riera, J. M. A General SCF Theory; Lecture Notes in Chemistry, Vol. 5; Springer-Verlag: Berlin, 1978.
33. See for example: Wilkinson, J. H.; Reinsch, C. Linear Algebra; Springer-Verlag: Berlin, 1971.
34. See, for example: (a) Carbó, R.; Besalú, E. J. Math. Chem. 1993, 13, 331-342. (b) Besalú, E.; Carbó, R. In Strategies and Applications in Quantum Chemistry: from Astrophysics to Molecular Engineering; Defranceschi, M.; Ellinger, Y., Eds.; Kluwer: Dordrecht, 1996, pp. 229-248. (c) Besalú, E.; Carbó, R. J. Math. Chem. 1995, 18, 37-72.
35. Lunin, V. Y.; Lunina, N. L. Acta Crystallogr. 1996, A52, 365-368.
36. See for example: (a) Carbó, R.; Domingo, L. Int. J. Quantum Chem. 1987, 23, 517-545. (b) Carbó, R. Sci. Gerundensis 1987, 13, 177-184.
37. Hodgkin, E. E.; Richards, W. G. Int. J. Quantum Chem. 1987, 14, 105-110.
38. See for example: Carbó, R.; Martin, M.; Pons, V. Afinidad 1977, 34, 348-353.
39. Coulson, C. A.; Streitwieser, A., Jr. Dictionary of π-Electron Calculations; Pergamon Press: Oxford, 1965.
40. Cioslowski, J.; Fleischmann, E. D. J. Am. Chem. Soc. 1991, 113, 64-67.
41. Oliva, J. M.; Carbó-Dorca, R.; Mestres, J. In Advances in Molecular Similarity; Carbó-Dorca, R.; Mezey, P. G., Eds.; JAI Press: Greenwich, CT, 1996, Vol. 1, pp. 135-165.
42. See, for a review: Balaban, A. T. J. Chem. Inf. Comput. Sci. 1995, 35, 339-350.
43. Randić, M.; Jerman-Blažič, B.; Trinajstić, N. Comput. Chem. 1990, 14, 237-246.
44. Mihalić, Z.; Trinajstić, N. J. Chem. Educ. 1992, 69, 701-712.
45. Clementi, E.; Raimondi, D. L. J. Chem. Phys. 1963, 38, 2686-2689.
46. Montgomery, D. C.; Peck, E. A. Introduction to Linear Regression Analysis; Wiley: New York, 1992.
47. NESTED-MLR Program; Lobato, M.; Besalú, E. Institute of Computational Chemistry, University of Girona, 1996.
48. Bonaccorsi, R.; Scrocco, E.; Tomasi, J. J. Chem. Phys. 1970, 52, 5270.
49. See for example: (a) Carbó, R.; Suñé, E.; Lapena, F.; Perez, J. J. Biol. Phys. 1986, 14, 21-28. (b) Carbó, R.; Lapena, F.; Suñé, E. Afinidad 1986, 45, 483-485.
50. See for example: Eyring, H.; Henderson, D.; Stover, B. J.; Eyring, E. M. Statistical Mechanics and Dynamics; Wiley: New York, 1964.
51. See, for a recent review: Allan, N. L.; Cooper, D. L. Top. Curr. Chem. 1995, 173, 85-111.
FUZZY SETS AND BOOLEAN TAGGED SETS; VECTOR SEMISPACES AND CONVEX SETS; QUANTUM SIMILARITY MEASURES AND ASA DENSITY FUNCTIONS; DIAGONAL VECTOR SPACES AND QUANTUM CHEMISTRY
Ramon Carbó-Dorca
Abstract
I. Introduction
II. From Fuzzy Sets to Boolean Tagged Sets and Beyond
   A. Preliminary Definitions
   B. Tagged Classes
   C. Operations over Boolean Tagged Sets
   D. Boolean Tagged Vector Spaces
   E. Applications
   F. Extensions
III. Tagged Sets, Convex Sets, and QSM
   A. QSM
   B. Convex Sets
   C. ASA
   D. Convex Operators
   E. Finely Tuned QSAR
IV. On the Statistical Interpretation of Density Functions: Diagonal Vector Spaces and Related Problems
   A. The Nature of Discrete QO Representations
   B. The Structure of the Generating n-Dimensional VS: DVS
   C. Expression of the Density Functions and Other Problems
V. Conclusions
Acknowledgments
References

Advances in Molecular Similarity, Volume 2, pages 43-72. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
ABSTRACT

Fuzzy set structure is analyzed from the point of view of a new definition: Boolean tagged sets, which are constructed as a straightforward generalization of the fuzzy set conception, but fully adapted to computational purposes. Boolean tagged sets are simply structured as sets whose members are tagged by a binary string, attached in turn to the vertices of a unit hypercube. Boolean tagged sets are designed to be useful in any field of applied mathematics, but they can also be seen as obviously prepared for chemical applications associated with molecular information gathering and manipulation. Concepts and definitions of tagged and convex sets are applied afterwards, with the aim of discovering a general mathematical pattern enveloping the quantum similarity measures framework. As a consequence, several aspects of the quantum similarity theoretical structure become beautifully related to a mathematical construction, adopting the form of an interwoven essential formalism connecting quantum theory, molecular similarity, convexity, and tagged sets. Finally, following a statistical interpretation of the atomic shell approximation linear coefficient set as a discrete probability distribution, a new description of quantum object sets is given, using a discrete representation based on diagonal vector spaces. Previous definitions, involving essentially tagged sets and vector semispaces, are used.
I. INTRODUCTION

The development of the quantum similarity (QS) theoretical structure in recent times, as well as its applications, extensions, and prospective scenarios, has been studied collectively in various books published before the present volume.
Fuzzy Sets and Boolean Tagged Sets
From the initial naive QS formulation, where first-order density functions were approximated by a CNDO-like expression, up to the current atomic shell approximation (ASA) formalism, there has not been any published attempt to produce a comprehensive study of the mathematical background attached to this field. Even more lacking is a discussion, as complete as possible, of the possible consequences of the QS formalism for the discrete quantum-mechanical procedures so heavily related to the usual quantum chemistry computational practice. This work, which brings together a set of three separate studies, should be considered a primitive sketch to make up for the absence of a basic mathematical formalism in QS.

The present study has three main sections, as briefly described below. In Section II an alternative definition of fuzzy sets is given. This can be understood as a primitive resource, which can be employed to open the way to describe, in Section III, the mathematical formalism adapted to the purposes of quantum similarity measures (QSM). Section IV analyzes some consequences that the previous developments can produce within the realm of discrete quantum chemistry. The paper is thus organized as follows:

1. Section II is devoted to the definition and mathematical setting of an alternative, almost trivial, definition of the concept of fuzzy set. Boolean tagged sets are introduced as a natural way to describe molecular structures by means of an appropriate mathematical formulation. To construct and discuss this kind of tagged set structure, Section II is organized as follows. First, some definitions are given; then operations on tagged sets and tagged vector spaces are described; and finally some applications, sketching the main formalism provided later on, are presented.
2. In Section III, the role that the tagged set definition and convex sets could play together is studied within the framework of QSM and the related techniques and algorithms. This section is structured as follows. First, the appropriate definitions allowing construction of the general theoretical background of QSM theory are given. Then, some necessary notions associated with tagged set extensions and convex sets are provided. The connection between both subjects is discussed next. Finally, an application example involving quantitative structure-activity or structure-property relationships (QSAR, QSPR) is presented.
3. Section IV provides an analysis of the nature of the linear coefficients appearing in some expressions of the first-order density function (DF), as in the ASA formalism. This preliminary discussion reveals that the ASA coefficients can be associated with a discrete probability distribution. Finally, diagonal vector spaces (DVS) appear as a natural choice of the theory. This permits obtaining, in the discrete quantum theory framework, the same formalism as in the continuous case.
RAMON CARBÓ-DORCA
II. FROM FUZZY SETS TO BOOLEAN TAGGED SETS AND BEYOND

Since Zadeh's original definition of the fuzzy set concept, there has been a growing body of literature dealing with the theory and applications of this interesting generalization. Fuzzy sets are well suited to everyday situations in which no sharp cut exists between the logical contents, {true, false}, of a sentence. The aim of this section is to describe an alternative, general way of dealing with the same problems as fuzzy set theory does, while presenting an immediate connection with actual information structure and manipulation. The fact is that, owing to the current technological situation, information is gathered and manipulated within a pattern that can be reduced to the binary equivalent, {1, 0}, of the logical contents previously mentioned. Because of this, electromechanical computational devices deal with sequences of information units, bit strings, which in turn can easily be seen as the vertex set of unit n-dimensional cubes or hypercubes, whose origin can be arbitrarily placed at the vertex made of a {0} bit string of the appropriate dimension. Moreover, for any kind of information structure it can be realized, with a small effort of imagination, how this information can be mapped onto some vertex of the unit n-dimensional cube. Thus, as a set of obvious examples, one can describe, using a vertex bit string of a unit hypercube of appropriate dimension: the CPU contents of a given computer in a given time slice, the information structure of a hard disk, the contents of a CD-ROM, and so on. A less typical example may correspond to the books of a library translated into binary code.

In a completely similar way, the set of all molecules reported within Chemical Abstracts, with the known information on them translated to binary code and ordered to form a bit string, constitutes another obvious example of a set with the characteristics that will be described here. Another interesting example of a Boolean tagged set, connected to a particular discrete molecular description, may be briefly discussed. Consider a given set of molecular structures and take it as a background set. Construct the topological matrix attached to every molecule in the background set and order it as a column vector. Then homogenize the dimension of these column topological vectors by, for instance, appending the appropriate number of zeroes. Take the homogenized column topological vectors as the tag set. In this manner, a Boolean tagged set has been defined over the molecular set.

A. Preliminary Definitions
Definition 1. Hypercube. A unit-length n-dimensional cube or hypercube is a set of $2^n$ vertices, whose elements can be constructed with the binary string vectors associated with the integer sequence $\{0, 1, 2, \ldots, 2^n - 1\}$. Two vertices should be
Fuzzy Sets and Boolean Tagged Sets
singled out from the above sequence, namely $\mathbf{0} = (0, 0, 0, \ldots, 0)$, the zero vertex, where the origin of coordinates can be placed, and $\mathbf{1} = (1, 1, 1, \ldots, 1)$, the unit vertex. Let us use $K_n$ to denote such an n-dimensional regular polyhedron. Hypercubes of this sort can be constructed iteratively, $K_{n+1} \leftarrow K_n$, using the algorithm:

$$\forall \mathbf{v}_i \in K_n:\ \mathbf{v}'_i = \mathbf{v}_i \oplus (0) \wedge \mathbf{v}''_i = \mathbf{v}_i \oplus (1) \Rightarrow \{\mathbf{v}'_i, \mathbf{v}''_i\} \in K_{n+1} \qquad (1)$$
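The iterative construction of Eq. 1 can be illustrated with a minimal Python sketch. The function names (`extend_hypercube`, `hypercube`, `tag_value`) are hypothetical, and reading each bit string with its most significant bit first is an assumption made here so that the vertices come out in the integer order 0, 1, ..., 2^n - 1.

```python
from typing import List, Tuple

Vertex = Tuple[int, ...]


def extend_hypercube(k_n: List[Vertex]) -> List[Vertex]:
    """One step K_n -> K_{n+1}: append a 0 bit and a 1 bit to every vertex."""
    k_next: List[Vertex] = []
    for v in k_n:
        k_next.append(v + (0,))
        k_next.append(v + (1,))
    return k_next


def hypercube(n: int) -> List[Vertex]:
    """Vertex set of the unit n-dimensional hypercube, built from K_1."""
    if n < 1:
        raise ValueError("dimension must be at least 1")
    k: List[Vertex] = [(0,), (1,)]  # K_1
    for _ in range(n - 1):
        k = extend_hypercube(k)
    return k


def tag_value(v: Vertex) -> int:
    """Integer attached to a vertex (assumption: most significant bit first)."""
    return int("".join(str(bit) for bit in v), 2)


K3 = hypercube(3)
```

Each extension step doubles the number of vertices, so `hypercube(n)` holds 2^n bit strings, with the zero vertex first and the unit vertex last.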
The algorithm above amounts to the same as considering a direct sum construction: $K_{n+1} = K_n \oplus K_1$. This also admits more general constructs such as $K_{n+m} = K_n \oplus K_m$, and even more generic forms like $K_N = \bigoplus_i K_{n(i)}$ with $N = \sum_i n(i)$.

Definition 2. Boolean Tagged Set. Suppose any set, $S = \{s\}$, the background set, and a hypercube, $K_n$. A Boolean tagged set, $B_n$, is constructed by means of the ordered pairs:

$$B_n(S) = S \times K_n = \{ z \in B_n \mid z = [s, \mathbf{v}_i];\ s \in S \wedge \mathbf{v}_i \in K_n \} \qquad (2)$$
The order of a Boolean tagged set indicates the dimension, n, of the underlying hypercube $K_n$. To ease the notation, the equivalent symbols $B_n \equiv B_n(S)$ will be used whenever no confusion can arise. A classical set can be associated with a Boolean tagged set of unit order: $B_1$. □

B. Tagged Classes

It is apparent that the Boolean tag associated with every element of the background set S, placed in the second moiety of the ordered pairs when constructing Boolean tagged sets, can be understood as a membership tag class. A trivial example is given by the unit order Boolean tagged set, where the following interpretation holds: $\forall z \in B_1 \rightarrow \{z = [s, 1] \wedge s \in S\} \vee \{z = [s, 0] \wedge s \in S\}$. The integer value, k, of a given Boolean tag, $\mathbf{v}_k$, can be defined by the symbol $x(\mathbf{v}_k) = k$, so that, for example, $x(\mathbf{0}) = 0$ and $x(\mathbf{1}) = 2^n - 1$. In general, and following the same path as in fuzzy set theory, when observing any Boolean tagged set, $B_n$, only those background elements bearing the unit vertex Boolean tag can be considered true members of the background set S. Likewise, only those background elements bearing the zero vertex Boolean tag can be considered true nonmembers of S. The rest of the background elements may be interpreted as the marginal elements of $B_n$.

Definition 3. Tagged Class. A tagged class $C_k$ of a Boolean tagged set $B_n$ can be defined by means of the following condition: $x(\mathbf{v}_k) = k \wedge \tau = [t, \mathbf{v}_k] \in C_k \subset B_n$. A tagged class will be attached to a subset $T_k$ of the background set S, such that $\forall t \in T_k \subset S \rightarrow C_k = T_k \times \{\mathbf{v}_k\}$. □

From the definition of tagged classes, the following relationships hold:

$$\bigcup_k T_k = S \wedge T_k \cap T_l = \emptyset \quad (k \neq l) \qquad (3)$$
That is, $\forall s \in S: \exists \mathbf{v}_k \in K_n \rightarrow z = [s, \mathbf{v}_k] \in C_k \subset B_n$. Thus, Boolean tagged sets may be defined in such a way that every element is a member of some disjoint subset, associated with a well-chosen hypercube vertex, which also represents an integer value in the range $k \in \{0, 1, 2, \ldots, 2^n - 1\}$. A nondegenerate Boolean tagged set has each of the available $2^n$ classes attached to one or no element of the background set. On the other hand, degenerate Boolean tagged sets signal to the observer a lack of information in the tag moiety, whereas the presence of void tagged classes could signal that possible candidates of the background set remain unknown.

C. Operations over Boolean Tagged Sets

Any operation defined over the elements of a Boolean tagged set, $B_n$, has to be decomposed into two distinct operations, involving the background and the tag set parts separately. That is, a general situation may be depicted in the following way:

$$\forall a, b \in B_n:\ a * b \rightarrow [\alpha, \mu] * [\beta, \nu] \rightarrow [\alpha \circ \beta, \mu \bullet \nu] \in B_n \qquad (4)$$
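The partition into tagged classes (Eq. 3) and the componentwise operation of Eq. 4 can be sketched together in Python. Everything named here is hypothetical: the molecule labels and their two-bit tags are arbitrary placeholders rather than real descriptors, and the background and tag operations are passed in as functions because the text leaves both open.

```python
from collections import defaultdict


def tag_value(tag):
    """Integer k attached to a Boolean tag (most significant bit first)."""
    return int("".join(str(bit) for bit in tag), 2)


def tagged_classes(tagged_set):
    """Group background elements by the integer value of their tag."""
    classes = defaultdict(set)
    for element, tag in tagged_set:
        classes[tag_value(tag)].add(element)
    return dict(classes)


def combine(a, b, background_op, tag_op):
    """Apply one operation to the background parts and another, bit by bit, to the tags."""
    (alpha, mu), (beta, nu) = a, b
    return (background_op(alpha, beta),
            tuple(tag_op(m, n) for m, n in zip(mu, nu)))


# Hypothetical Boolean tagged set of order 2 (tags chosen arbitrarily).
B2 = [("methane", (1, 1)), ("ethane", (1, 1)), ("water", (0, 1)), ("argon", (0, 0))]
classes = tagged_classes(B2)
result = combine(("a", (1, 0)), ("b", (1, 1)),
                 background_op=lambda x, y: x + y,
                 tag_op=lambda m, n: m & n)
```

In this toy set, class 3 holds two elements, so the set is degenerate in the sense used above, and class 2 is void.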
At the same time, transformations of the elements of $B_n$ can be defined in three possible ways, according to how they affect the Boolean tagged set elements: they may act on the background part only, on the tag part only, or on both parts. Background, tag, or total transformations may be defined accordingly.

D. Boolean Tagged Vector Spaces

A Boolean tagged vector space is simply a Boolean tagged set whose background set is a vector space. The vector space structure is preserved provided that appropriate operations can be defined in the tag moiety. For the vector sum of the background part, a corresponding composition must be defined on the tag set part. That is, if $W_n$ corresponds to such a Boolean tagged vector space, then

$$\forall a, b \in W_n:\ a + b = [\alpha, \mu] + [\beta, \nu] = [\alpha + \beta, \mu \bullet \nu] \in W_n \qquad (5)$$
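A minimal Python sketch of the sum in Eq. 5 follows, under the assumption that the tag composition is taken to be the bitwise AND; this is only one possible choice of Boolean operator. The function name `tagged_sum` and the sample vectors are hypothetical.

```python
def tagged_sum(a, b):
    """Sum two tagged vectors: add the background parts, AND the tags bit by bit."""
    (alpha, mu), (beta, nu) = a, b
    if len(alpha) != len(beta) or len(mu) != len(nu):
        raise ValueError("background vectors and tags must have matching lengths")
    background = tuple(x + y for x, y in zip(alpha, beta))
    tag = tuple(m & n for m, n in zip(mu, nu))
    return background, tag


# Hypothetical elements of a tagged vector space of order 3.
member = ((1.0, 2.0), (1, 1, 1))       # unit-vertex tag: true member
nonmember = ((0.5, 0.5), (0, 0, 0))    # zero-vertex tag: true nonmember
marginal_a = ((0.0, 1.0), (1, 0, 1))   # marginal element
marginal_b = ((2.0, 0.0), (0, 1, 0))   # marginal element of another class
```

With this choice, summing within one tagged class keeps the class, while summing a true member with a true nonmember gives the zero tag.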
In this way, choosing the tag set part operation as $\bullet = \wedge$, for example, true members are preserved as such in the final sum, and when summed with true nonmembers the result is a true nonmember. Sums performed within a tagged class of marginal members preserve the tagged class of the result. Along the same lines, marginal members not belonging to the same tagged class may, when summed, yield true nonmembers. Other associations for the tag set part may be defined using different Boolean operators. For the product of a vector by a scalar, it is sufficient to consider this operation as a background set transformation, leaving the tag part invariant. As a consequence, linear combinations in Boolean tagged vector spaces are perfectly defined in this way.

Metric background vector spaces merit some attention, because scalar products or norms produce a scalar result in the background part. Thus, some sound transformation must be defined in the tag part too, providing the projection of the whole Boolean tagged space into a Boolean tagged set of unit order with a scalar background set. Metric Boolean tagged vector spaces are then easily defined. Distances can also be defined quite effortlessly over Boolean tagged vector spaces. The background part is immediate and can be obtained using any common form of distance in the corresponding space, while in the tag part a Minkowski formula may be used, which amounts to counting the number of noncoincident bits in the pair of tags involved (that is, their Hamming distance). The largest tag distance obtained in this way is the one between the extreme vectors $\mathbf{0}$ and $\mathbf{1}$.

E. Applications
Obvious applications may be envisaged in the domain of chemistry, as well as in other areas, like Boolean tagged logic, which do not directly concern this contribution. From a chemical point of view, the background set S may be taken as a set whose elements are molecular structures. The hypercube tag elements can be identified with a set of chosen molecular descriptors, transformed into a corresponding set of bit strings. In doing so, the molecular classes may be easily compared and ordered. Proceeding in this manner, it can be straightforwardly deduced that Boolean tagged sets perhaps lie in the background of all attempts and procedures to classify molecules. Molecular classification and ordering can be achieved by means of studies essentially based on molecular properties of any origin used as molecular descriptors. The molecular magnitudes, obtained through theoretical considerations or experimental measurements, being rational numbers in the worst case, can be easily transformed into bit strings. A molecular set can be associated with Boolean tag descriptors and transformed in this way into a Boolean tagged set. The atomic periodic table is nothing but a nice early example of this situation. Another trivial example of such molecular tagged sets can be constructed
using as a tag set part the molecular coordinates of every molecular structure in the background set. The coordinates are fixed at some conformation and can be understood as a set of positive definite rational numbers, describing a three-dimensional molecular structure. Positive definiteness of the coordinates can always be achieved by means of an appropriate translation of the origin. The realm of QSM, a field where the author has been active in recent years, provides a perfect example of the possible construction of a Boolean tagged molecular set. It is well known that molecular QSM (MQSM) can produce a discrete representation of the molecular structures within a given molecular set, the so-called molecular point-cloud of the set. Each molecule described in the molecular point-cloud is represented by a vector, a point-molecule, obtained by purely computational means. For this reason, the QSM discrete molecular description can be easily transformed into a bit string. A molecular point-cloud is, from this point of view, a representative example of a potential Boolean tagged set. It is not an accident that in early examples of molecular point-cloud representations in the form of graphs, n-dimensional hypercubes with the appropriate number of vertices were successfully used.

F. Extensions
There are multiple extensions of the previous description, but they lose the inherently direct Boolean computational flavor, which was the main objective of the previous Boolean tagged set definitions. However, this drawback can be compensated by the possibility of linking tagged sets with QSM. Suppose, again, a set S, and a set D made of positive definite (PD) functions of some variable(s): $D = \{\delta(\mathbf{x}) \mid \forall \mathbf{x} \in C_n \rightarrow \delta(\mathbf{x}) \in R^+\}$. The general Boolean tagged set structure may be transformed into a function tagged set, simply by defining the ordered pairs $B_D(S) = S \times D$. Then every element of S has attached to it not merely a bit string but a function possessing a continuous functional structure; that is,

$$B_D(S) = S \times D = \{ z \in B_D \mid z = [s, \delta(\mathbf{x})];\ s \in S \wedge \delta(\mathbf{x}) \in D \}$$
It is immediately seen that fuzzy sets are nothing but a very particular case of functional tagged sets. It is much more interesting, though, to recognize that functional tagged sets contain quantum object sets (QOS) as particular cases. For this purpose it is only necessary to associate the background set S with a collection of microscopic systems in a given well-defined state; according to the quantum-mechanical postulates, one can then consider D as the set of associated DF. This possibility will be studied in the next section. An even more general tagged set structure may be easily envisaged. Up to now a unique tag set has been used throughout our description of a tagged set. But nothing opposes the use of a broader definition, where the tag set part may be
formed by composite aggregates of two or more collections of appropriate tags. The earlier tag set part example, made of molecular coordinates, can be interpreted in this way.
III. TAGGED SETS, CONVEX SETS, AND QSM

QSM, in the first place, and tagged sets as defined above together with convex sets (CS), in the second, constitute interestingly interconnected fields. The relationship may be found in the associated characteristic definitions and mathematical structures, which have been developed over time in the realm of QSM. In the recent literature, the structure of QSM theory has always been associated with integrals involving the description of quantum objects (QO). A QO must be understood as a microscopic system that possesses all of its information in an associated PD function. Such descriptive power lies in assigning to the attached PD function a statistical probability distribution formalism, the so-called density function, which in turn is nothing but the result of manipulating the squared modulus of the system wave function. This initial information, in the form of wave functions, is obtained as a solution of the system's Schrödinger equation, in the well-known quantum mechanics scenario. The whole conceptual structure may be cast as a quantum-mechanical postulate. Besides their dependence on PD DF, QSM can also be easily attached to PD operators, as well as to the functional spaces to which PD DF belong. This evident PD mathematical background makes QSM theory a good candidate to be related, somehow, to the structure and definitions of convex sets, because this kind of set deals, preferentially, with PD linear combinations of vector space elements. This collection of problems will be discussed in the present section.

A. QSM

QSM have been defined in several previous papers (e.g., see Refs. 5-8), as have the basic concepts associated with the theoretical body, involving the ideas underlying the basic integral structure of QSM. To keep these primary ideas present while delving deeper into QSM theory, several definitions, which depend on the previously presented ones, are given below.

Definitions
Tagged Sets and QO. A definition parallel to the one associated with Boolean tagged sets, defined in Section II and the subject of a recent report as an alternative to fuzzy sets, will be employed to construct QOS. The term quantum object will be used from this point on as a synonym for any microscopic system composed of a numerable set of particles, associated with a
probability DF. Within the current quantum-mechanical structure, the QO DF acts as a PD descriptor, which possesses all of the information contained in the system.

Definition 4. Quantum Object. Any microscopic-sized system can be called a QO if it can be constructed with a numerable set of particles and all of its extractable information is contained within a probability distribution: the system's DF. The DF is a PD function of random variables, made up of the system's particle positions or momenta. When dealing with nonstationary system states, time should be added as an additional random variable. Owing to this attachment between QO and DF, one can refer to the DF as a QO continuous descriptor. □

Definition 5. Density Function Tagged Set. The Boolean tagged set definition can be easily generalized by considering the various possible tag set extensions presented in Section II.F. Suppose a known set of microscopic systems, S, and let us associate it with a background set. Suppose also a set made of PD DF, P, and let us call it the tag set. According to quantum mechanics, every element of S can be put in one-to-one correspondence with a DF of P. The situation may be cast into a new set, a density function tagged set, T, defined in the following way:

$$T = S \times P = \{ \tau \mid \forall s \in S, \exists \rho \in P \rightarrow \tau = (s, \rho) \} \qquad (6)$$
Within this definition a QOS may be associated with a density function tagged set, T, and a QO can consequently be considered an element of T. □

This structure may be considered the simplest case of a possible general tagged set form, which is not relevant here for the moment; Definition 5 will be used throughout. In fact, in this possible general definition, QOS may be considered tagged sets whose tag set part is formed not by a unique function but by the collection of all possible DF of the system's quantum states. From this point of view, the tag, $\rho$, would necessarily have a vector structure, whose elements are the state DF. An alternative can be associated with the tagged set formed with the elements of the microscopic system background set and one quantum state density as the tag part: in this case the tagged set elements will possess the same background part and diverse tags.

QSM. Let us primarily define a QSM as a composition involving two QO, constructed with the rule:

$$\forall a, b \in T \wedge \{a = (s_a, \rho_a);\ b = (s_b, \rho_b)\}:\ Z_{ab}(\Omega) = \iint \rho_a(\mathbf{r}_1)\, \Omega(\mathbf{r}_1, \mathbf{r}_2)\, \rho_b(\mathbf{r}_2)\, d\mathbf{r}_1\, d\mathbf{r}_2 \qquad (7)$$
where $\Omega$ is a PD operator, whose dependence on the variable set $\{\mathbf{r}_1, \mathbf{r}_2\}$ must be coherent with that of the tag functions of the QO involved. Computing a QSM when both systems involved are the same, a = b, produces a quantum self-similarity measure (QS-SM).

Definition 6. Convex Conditions. By convex conditions $K_n(\mathbf{c})$ holding over an n-dimensional vector $\mathbf{c}$, it is understood that its elements have the following properties:

$$K_n(\mathbf{c}) = \Big\{ \mathbf{c} = \{c_i\} \in V_n(R^+):\ c_i > 0,\ \forall i \wedge \sum_i c_i = 1 \Big\} \qquad (8)$$ □
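As a small illustration of Definition 6, the convex conditions can be checked numerically. This is only a sketch: the function name `satisfies_convex_conditions` and the tolerance `tol` are assumptions introduced here, since floating-point sums rarely equal 1 exactly.

```python
import math
from typing import Sequence


def satisfies_convex_conditions(c: Sequence[float], tol: float = 1e-12) -> bool:
    """True if every c_i is strictly positive and the c_i sum to one (within tol)."""
    all_positive = all(ci > 0.0 for ci in c)
    sums_to_one = math.isclose(sum(c), 1.0, rel_tol=0.0, abs_tol=tol)
    return all_positive and sums_to_one
```

For instance, `satisfies_convex_conditions([0.5, 0.3, 0.2])` holds, whereas a vector with a negative entry or a sum different from one fails the test.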
Definition 7. Positive Definite Operators. A PD operator, $\Omega > 0$, acting over a vector space, $\mathcal{V}$, such that $\Omega: \mathcal{V} \rightarrow \mathcal{V}$, can be defined in the usual way as follows:

$$\forall x \in \mathcal{V},\ x \neq 0:\ \langle x | \Omega | x \rangle > 0 \qquad (9)$$

Suppose that convex conditions $K_n(\mathbf{w})$ hold over a coefficient vector $\mathbf{w}$. Suppose a known set of PD operators $W = \{\Omega_a\}$; the convex linear combination

$$\Omega = \sum_a w_a \Omega_a \wedge K_n(\mathbf{w}) \qquad (10)$$

produces a new PD operator. DF can be considered PD operators too, so this property also applies to this collection of functions. □

Due to the PD nature of all the elements involved in the QSM, the values of the measure integrals described in Eq. 7 are always positive. Thus, a QSM, as previously defined, can be considered an operation transforming ordered pairs of tagged set elements into the set of positive real numbers: $Z: T \times T \rightarrow R^+$. Integral-defined scalar products with weights, associated with the PD operators $\Omega$, are good candidates to be connected to QSM too. The most popular operator used to date in QSM integral expressions corresponds to Dirac's delta function $\delta(\mathbf{r}_1 - \mathbf{r}_2)$, in which case integral 7 transforms into a so-called overlap-like QSM. A third element of the QOS can be used instead as the operator, producing several possible forms of triple QSM. Multiple products of DF substituting the operator will yield a multiple QSM.

Similarity Matrices and Discrete Representations of QO
In any case, given a QOS, the computation of QSM involving, at least, element pairs produces a new set, which has been discussed in the literature in many ways. In the simplest situation, every QO can be connected with the rest of the QOS elements, including itself.
Definition 8. Similarity Matrix. When all pairs of QOS elements are involved in the QSM calculation and the results are ordered, a symmetric matrix is obtained: the similarity matrix (SM), Z. The dimension of the SM depends on the cardinality of the QOS: if #(T) = n, then Dim(Z) = (n × n). The SM can be considered a row hypermatrix whose elements are n-dimensional column vectors, each collecting all of the matrix elements associated with a given QO. That is,

$$Z = (\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n) = \{\mathbf{z}_i\} \subset V_n(R^+) \wedge K_n(\mathbf{z}_i);\ \forall i \qquad (11)$$ □
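A similarity matrix of overlap-like QSM can be sketched numerically under strong simplifying assumptions: each quantum object is represented here by a single unit-volume spherical Gaussian density, for which the overlap integral has a closed form. The function names, exponents, and centers below are hypothetical and are not taken from the text.

```python
import numpy as np


def overlap_qsm(alpha, center_a, beta, center_b):
    """Overlap-like QSM of two unit-volume spherical Gaussian densities.

    Each density is (a/pi)**1.5 * exp(-a * |r - R|**2); the product of two
    Gaussians integrates to the closed form used below.
    """
    diff = np.asarray(center_a, dtype=float) - np.asarray(center_b, dtype=float)
    d2 = float(diff @ diff)
    s = alpha + beta
    return (alpha * beta / (np.pi * s)) ** 1.5 * np.exp(-alpha * beta / s * d2)


def similarity_matrix(objects):
    """Collect all pairwise measures, including self-similarities, into Z."""
    n = len(objects)
    Z = np.empty((n, n))
    for i, (a, ca) in enumerate(objects):
        for j, (b, cb) in enumerate(objects):
            Z[i, j] = overlap_qsm(a, ca, b, cb)
    return Z


# Hypothetical quantum objects: (Gaussian exponent, center) pairs.
objects = [(1.0, (0.0, 0.0, 0.0)), (0.7, (1.2, 0.0, 0.0)), (1.5, (0.0, 0.8, 0.5))]
Z = similarity_matrix(objects)
eigenvalues = np.linalg.eigvalsh(Z)
```

Each column of `Z` is then a discrete descriptor of one object. In this toy case the matrix is symmetric with positive entries and positive eigenvalues, as expected of a metric matrix over linearly independent densities.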
As already mentioned, the SM elements are computed within some kind of scalar product formalism, as in Eq. 7. Because the DF tag part can be considered a linearly independent function set if the background set elements are chosen to be sufficiently different, the collection of QSM may be easily seen as forming a metric matrix, computed over the DF tag set, and thus the SM, Z, may be considered a PD matrix too. This is so whenever the QSM are calculated with DF bearing the same orientation in the particle coordinate space for all matrix elements. □

An interesting feature, derived from the PD nature of the DF tag set and of the resulting QSM, concerns the elements of such vectors and matrices: the elements of the column vectors, $\{\mathbf{z}_i\}$, are positive too, as represented schematically in Eq. 11. Given a QOS constructed as a function tagged set, like the one previously defined for QO, and provided the QSM among the elements of the QOS, as defined in Eq. 7, Eq. 11 introduces the possibility of constructing another kind of tagged set. This can be done by employing the following definition, which has a structure parallel to Definition 5:

$$\Theta = S \times Z = \{ \theta \mid \forall s \in S, \exists \mathbf{z} \in Z \rightarrow \theta = (s, \mathbf{z}) \} \qquad (12)$$
where the symbol for the tag set, Z, has been chosen to be the same as the one used for the SM. The new vector tagged set is another possible representation of the QOS. The vector tagged set constructed as in Eq. 12 is immediately connectable to concepts defined early on in the QSM context. Indeed, when molecular QOS were studied, a QOS in the form of a vector tagged set was called a molecular point-cloud, and the elements contained in it point-molecules.

Definition 9. Point-Clouds. Vector tagged sets or QO point-clouds can be derived from the original function tagged sets, T, by projecting the PD DF tagged set, $T = S \times P$, into a PD vector tagged set, $\Theta = S \times Z$:

$$\mathcal{P}(T) = \Theta \Rightarrow \forall \tau \in T:\ \mathcal{P}(\tau) = \mathcal{P}((s, \rho)) = (s, \mathcal{P}(\rho)) = (s, \mathbf{z}) = \theta \in \Theta$$ □
The new projected QOS corresponds to a discrete QO description, where every system from the background set is no longer tagged by a continuous DF, but by an n-dimensional PD vector. As such, it can be easily transformed into a Boolean tagged set, according to comments in previous work, given also in Section II.E and based on the evidence of usual computational practice. Computations are made within the field of rational numbers, and so the QSM values forming the elements of the tag set part of any QO point-cloud can be associated with this numerical field. Rational numbers are expressed within the usual computational structure in the form of bit strings. In this numerical form, they can enter as the elements of the tag part of vector tagged sets, transforming them into the equivalent Boolean tagged sets.

The set P and the projection $Z = \mathcal{P}(P)$ constitute very peculiar ensembles. In fact, they are properly defined by elements that have values only in the PD set of real numbers. Both correspond to sets whose elements belong to some vector space, with operations defined in $R^+$. The vector addition cannot possess a complete group structure, but only a semigroup one, lacking reciprocal elements. Everything else can be considered preserved. One can refer to structures of this kind as vector semispaces.

Definition 10. Vector Semispace. A vector semispace (VSS) is a vector space defined over the positive real field, $R^+$, only, and provided with the usual operations of addition and product by a scalar. In a VSS the additive Abelian group is replaced by a semigroup. As a consequence, no reciprocal vector elements can be present in a VSS, and the linear combination coefficients are positive real numbers only. A two-dimensional image of this kind of VSS is given by the vectors lying in the positive $\{+x, +y\}$ quadrant of the real plane. □

The following definition will serve to summarize all of the concepts already discussed.
Definition 11. Discrete QO Description. Suppose known a QOS $Q = S \times P$. This means that $\forall s \in S$, in the QOS background set part, there exists a DF, $\rho \in P$, in the tag set part, which is a continuous descriptor of s. A discrete representation of the QO involved may be obtained by choosing a PD operator, $\Omega$, and defining the QSM:

$$Q = S \times P:\ \forall \rho_a, \rho_b \in P \Rightarrow z_{ab} = \int \rho_a\, \Omega\, \rho_b\, dV \qquad (13)$$
The SM, $Z = \{z_{ab}\}$, as described in Definition 8, which collects all of the integrals between the DF pairs of the QOS, can be considered a row hypervector formed by a set of column vectors $Z = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\}$. Every column vector $\mathbf{z}_a$, say, is formed by the QSM involving the DF, $\rho_a$, of the QO, $s_a$, with all of the QO elements in S, including itself. The operator involved being PD, and considering that the QOS is made of essentially different QO, the matrix Z can be considered PD. The whole matrix may belong to some n-dimensional VSS, $V_n(R^+)$. Starting from here, a new tagged set, $\Theta$, can be constructed with the same background set part as the former QOS Q, but with the tag set part formed by the columns of Z, that is, $\Theta = S \times Z$. This corresponds to obtaining a PD operator-dependent discrete representation of the QO. One can name the column vectors $\mathbf{z}_a \in Z$ discrete descriptors of the QO. □

B. Convex Sets

CS can be related to VSS for our purposes, although there is a vast literature full of interesting problems related to this concept. In the present discussion, a CS will be considered a collection of linear combinations of vectors in a VSS, thus having positive coefficients, which fulfill the extra constraint that the coefficients add up to unity. That is, a convex set is a subset $\mathcal{K}$ of a VSS $\mathcal{H}$, which is defined by means of
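$$\Big[ \exists \mathbf{a} = \{\alpha_i\} \subset R^+:\ \mathbf{x} = \sum_i \alpha_i \mathbf{e}_i \wedge \sum_i \alpha_i = 1 \Big] \Leftarrow K_n(\mathbf{a}) \qquad (14)$$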
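A curious question can now be formulated about how CS, as defined above, can be considered to be built using vectors of a general vector space. Suppose that a set of complex numbers is defined so as to form a normalized column vector:

$$\{\gamma_i\} \subset C \wedge \mathbf{g} = (\gamma_1, \gamma_2, \ldots, \gamma_n)^T \in V_n(C) \wedge \mathbf{g}^+\mathbf{g} = \sum_i |\gamma_i|^2 = 1 \wedge \{\mathbf{a} = \{\alpha_i = |\gamma_i|^2 \in R^+;\ \forall i\}\} \Rightarrow \sum_i \alpha_i = 1 \Rightarrow K_n(\mathbf{a}) \qquad (15)$$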
Then, CS may be constructed from an n-dimensional vector space over the complex field, $V_n(C)$, by associating normalized vectors of this general space with the positive coefficients summing to unity of Eq. 14. The squared moduli of the general vector elements are the required positive coefficients of the CS. A vector space such as $V_n(C)$ may be called a generating vector space of the CS, $\mathcal{K}$. By extension, any vector $\mathbf{g} \in V_n(C)$ will be called the generating vector of the attached CS element. A generating rule can be defined formally as

$$\exists \mathbf{g} \in V_n(C) \wedge \mathbf{g}^+\mathbf{g} = 1 \Rightarrow \mathcal{R}(\mathbf{g} \rightarrow \mathbf{a}) \qquad (16)$$
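The generating rule can be sketched in a few lines of Python: a complex vector is normalized and the squared moduli of its elements are taken as the convex coefficients. The function name `generate_convex_coefficients` and the sample vector are hypothetical; the input is normalized inside the function so that any nonzero vector may be supplied.

```python
import numpy as np


def generate_convex_coefficients(g):
    """Map a complex generating vector to coefficients a_i = |g_i|**2 after normalization."""
    g = np.asarray(g, dtype=complex)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        raise ValueError("the generating vector must be nonzero")
    g = g / norm                     # enforce g^+ g = 1
    return np.abs(g) ** 2            # squared moduli: non-negative, summing to one


a = generate_convex_coefficients([1 + 1j, 2, -1j])
```

Here the squared moduli of the unnormalized elements are 2, 4, and 1, so the coefficients are 2/7, 4/7, and 1/7. A generating vector with a zero element would give a zero coefficient, which sits on the boundary of the strict condition in Eq. 8.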
This bears, no doubt, a discrete quantum-mechanical flavor. Quantum-mechanical DF are computed as manipulated squared moduli of some complex-valued wave function, which can act as a generating ∞-dimensional vector. The generating vector space of DF is the Hilbert space whose elements are the quantum system's wave functions. This aspect of the question will be studied in detail below. It must be remarked now that DF constitute a kind of PD function, admitting only positive coefficients for linear combination purposes. Moreover, the volume subtended by any DF must be finite and constantly equal to some function of the particle number. As a consequence, DF can be considered, without loss of generality, normalized to unity, that is, possessing a unit volume. Thus, DF can be considered the elements of a CS defined in an ∞-dimensional space. Indeed, suppose a set of DF possessing coherent particle coordinates, $\{\rho_i(\mathbf{r})\} \subset P$, such that

$$\forall i:\ \int \rho_i(\mathbf{r})\, d\mathbf{r} = 1 \qquad (17)$$

and construct a new DF, $\rho(\mathbf{r})$, using a set of PD coefficients $\{c_i\} \subset R^+$:

$$\rho(\mathbf{r}) = \sum_i c_i \rho_i(\mathbf{r}) \Rightarrow \int \rho(\mathbf{r})\, d\mathbf{r} = \sum_i c_i \int \rho_i(\mathbf{r})\, d\mathbf{r} = \sum_i c_i \qquad (18)$$
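The volume argument can be checked numerically with a hedged one-dimensional sketch: unit-area Gaussians stand in for the unit-volume DF, and a simple Riemann sum on a uniform grid replaces the integral. The grid, the exponents, the centers, and the coefficients are arbitrary illustrative choices.

```python
import numpy as np


def gaussian_density(x, center, exponent):
    """Unit-area one-dimensional Gaussian, used as a stand-in for a unit-volume DF."""
    return np.sqrt(exponent / np.pi) * np.exp(-exponent * (x - center) ** 2)


x = np.linspace(-20.0, 20.0, 40001)
dx = x[1] - x[0]

densities = [
    gaussian_density(x, -1.0, 0.8),
    gaussian_density(x, 0.5, 1.5),
    gaussian_density(x, 2.0, 0.3),
]
c = np.array([0.2, 0.5, 0.3])              # positive coefficients summing to one

rho = sum(ci * rho_i for ci, rho_i in zip(c, densities))
volume = float(np.sum(rho) * dx)           # Riemann-sum estimate of the integral
```

Because the coefficients satisfy the convex conditions, the combined density stays non-negative and its estimated volume remains one, up to quadrature error.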
Thus, forcing new linear combinations of DF to have constant unit volumes is the same as embedding the DF set into a CS structure. It is now an easy task to construct the following:

Definition 12. Convex Set. A convex set is made of vectors as elements, whose linear combinations and elements altogether fulfill convex conditions. The set of linear combinations of a subset belonging to a VSS, whose coefficients and generating vectors are constrained within the unit sphere, is a convex set. □

C. ASA

Recently, various papers have been devoted to the problem of constructing properly defined first-order DF by using superpositions of spherical basis functions. The so-called ASA has been discussed in various molecular and atomic environments, as well as studied in practice through several methodological algorithms. According to the previous generalization and analysis, the ASA becomes nothing but a consequence of the nature of the DF and of the vector semispaces to which they belong.

ASA within a CS Environment
Suppose a PD basis set of totally symmetric functions $\Sigma = \{\sigma_i\}$, which can be built up from a normalized function set $\Phi = \{\varphi_i\}$. This can be obtained simply by using the equivalence $\sigma_i = |\varphi_i|^2;\ \forall i$. Taking into account the normalization conditions of the functions belonging to the set $\Phi$, the set $\Sigma$ automatically acquires the property of unit volume, mentioned earlier when dealing with the CS structure of DF:

$$\forall i:\ \int |\varphi_i(\mathbf{r})|^2\, d\mathbf{r} = \int \sigma_i(\mathbf{r})\, d\mathbf{r} = 1 \qquad (19)$$
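Suppose a DF, $\rho(\mathbf{r})$, with an appropriate coherent variable set, $\mathbf{r}$, like the one associated with the basis function set $\Sigma$. The ASA approach consists in expressing the DF as a CS element generated from the basis set $\Sigma$:

$$\forall \rho(\mathbf{r}) \rightarrow \Big\{ \exists \mathbf{c} = \{c_i\} \subset R^+ \wedge \sum_i c_i = 1 \Big\} \Rightarrow K_n(\mathbf{c}):\ \rho(\mathbf{r}) = \sum_i c_i \sigma_i(\mathbf{r}) \qquad (20)$$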
The coefficients $\{c_i\}$ of Eq. 20 are to be estimated using a constrained least-squares technique. Recently, it has been shown how elementary Jacobi rotations (EJR) can be employed efficiently to deal with problems of this kind. For atoms, the ASA procedure produces first-order DF with almost negligible quadratic error integral measures. It has also been investigated whether a promolecular approach can be constructed as an accurate substitute for molecular first-order DF. The promolecular formalism states that the first-order DF tag part for any molecule can be approximated by a simple sum of atomic ASA functions, computed by means of a constrained least-squares procedure as already mentioned. Results have been accurate enough for QSM computational purposes. More refined approaches can use a formalism completely equivalent to Eq. 20, employing optimal atomic ASA functions as the components of the basis set $\Sigma$.

ASA in Atoms
First-order atomic DF can be approximated by means of the so-called ASA function. These approximate functions for atoms can be constructed in the following fashion. Given a known set of n s-type AO, $\{s_i(\alpha_i, \mathbf{r})\}$, an atomic ASA function can be written as

$$\rho_A(\mathbf{r}) = \sum_i c_i |s_i(\mathbf{r})|^2 \wedge K_n(\mathbf{c} = \{c_i\}) \qquad (21)$$
where the coefficient vector $\mathbf{c} = \{c_i\}$ has to be fitted to any given ab initio atomic DF, previously computed by any meaningful procedure, using a constrained least-squares technique, so as to obtain the solution while keeping the convex conditions $K_n(\mathbf{c})$. Constrained fitting within the convex restrictions associated with Eq. 21 over the coefficient set can be easily accomplished using an EJR scheme. The main idea, before applying the EJR technique to fit the atomic ASA function to an exact DF, is to gather the coefficients into the column vector $\mathbf{c}$, and to associate with it a new column vector of the same dimension, the generating vector $\mathbf{x}$, in such a way that the following generating rule $\mathcal{R}_n(\mathbf{x} \rightarrow \mathbf{c})$ holds:

$$\mathcal{R}_n(\mathbf{x} \rightarrow \mathbf{c}) = \{\exists \mathbf{x} \in \mathcal{V}_n(\mathbf{C}) \wedge \mathbf{x}^+\mathbf{x} = 1 \Rightarrow \mathbf{c} = \{c_i = |x_i|^2\} \rightarrow K_n(\mathbf{c})\} \tag{22}$$
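The generating rule of Eq. 22 can be turned into a small constrained fit. The following Python sketch is an illustration under stated assumptions, not the published EJR algorithm: the names `quadratic_error` and `ejr_fit` are hypothetical, the basis overlap matrix `S`, the projections `b`, and the target self-overlap `rho2` are assumed to be precomputed, the generating vector is taken real, and the rotation angle is located by a coarse grid search rather than analytically. Each rotation mixes two elements of a unit-norm vector x, so the coefficients c_i = x_i² stay non-negative and keep unit sum throughout.

```python
import numpy as np

def quadratic_error(c, S, b, rho2):
    """Quadratic error integral: rho2 - 2 b.c + c.S.c."""
    return rho2 - 2.0 * b @ c + c @ S @ c

def ejr_fit(S, b, rho2, sweeps=30, n_angles=181):
    """Fit convex coefficients c = x**2 by pairwise rotations of x.

    Assumed inputs: S[i, j] = <sigma_i|sigma_j>, b[i] = <sigma_i|rho>,
    rho2 = <rho|rho>.
    """
    n = len(b)
    x = np.full(n, 1.0 / np.sqrt(n))      # unit norm, so c starts as 1/n
    # Grid of trial angles; it contains (numerically) zero, so a rotation
    # is never forced to make the fit worse.
    angles = np.linspace(-np.pi / 2, np.pi / 2, n_angles)
    cos_t, sin_t = np.cos(angles), np.sin(angles)
    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                # Rotated pairs: every candidate keeps x_i**2 + x_j**2 fixed.
                xi = cos_t * x[i] - sin_t * x[j]
                xj = sin_t * x[i] + cos_t * x[j]
                C = np.tile(x**2, (n_angles, 1))
                C[:, i], C[:, j] = xi**2, xj**2
                errs = rho2 - 2.0 * C @ b + np.einsum("ki,ij,kj->k", C, S, C)
                k = np.argmin(errs)
                x[i], x[j] = xi[k], xj[k]
    return x**2
```

The grid search is a deliberate simplification: the error is a quartic function of the sine and cosine of the rotation angle, so an analytic or one-dimensional minimization could replace it without changing the structure of the loop.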
Thus, EJR can be applied to transform the generating vector elements $\{x_i\}$, and then, indirectly, those of the ASA coefficient vector are varied, while preserving the convex conditions $K_n(\mathbf{c})$.

ASA Structure in Molecules

With the above considerations in mind, it is easy to see that, by performing a preliminary computational task, a set of fitted atomic DF, $A = \{\rho_A\}$, can be gathered, each having the ASA form of Eq. 21. The set $A$, being strictly formed by convex linear combinations of PD functions, can be taken as a PD operator set. Thus, $A$ can be used as a generating set of new ASA-type DF. A molecular DF, $\rho_M$, for example, can be approximated by a linear combination of $A$ elements, with a coefficient vector $\mathbf{w} = \{w_A\}$, fulfilling the convex conditions $K_n(\mathbf{w})$:

$$\rho_M(\mathbf{r}) = \sum_A w_A \rho_A(\mathbf{r} - \mathbf{r}_A) \wedge K_n(\mathbf{w}) \tag{23}$$
where $\{\mathbf{r}_A\}$ are the molecular atomic coordinates, on which the ASA DF are centered. According to the properties of the PD operators, described in Definition 7, $\rho_M$ is also a PD function. This allows the coefficient vector, $\mathbf{w}$, to be fitted to the molecular DF in the same fashion as described for the atomic case. It also means that a generating vector $\mathbf{u}$ can be defined, fulfilling a set of conditions similar to those shown in Eq. 22, provided $\mathbf{x} \leftarrow \mathbf{u}$ and $\mathbf{c} \leftarrow \mathbf{w}$, that is, defining a generating rule $\mathcal{R}_n(\mathbf{u} \rightarrow \mathbf{w})$. A promolecular approach to $\rho_M$ may correspond to choosing the ASA-type coefficient vector $\mathbf{w}$ with all of its components equal, that is, $\mathbf{w} = n^{-1}\mathbf{1}$, with $\mathbf{1} = (1, 1, \ldots, 1)^T$. Another possibility is to associate with each atomic density a coefficient $w_A = Z_A N^{-1}$, where $Z_A$ is the atomic nuclear charge and $N = \sum_A Z_A$ is the total number of electrons in the molecule. A CNDO approximation, as was employed in the early stages of QSM calculations, corresponds to using normalized Mulliken gross atomic populations as the elements of the vector $\mathbf{w}$.

Considerations around MO Theory

Another example of this first-order DF convex form may be found within MO theory. Indeed, if an orthonormalized MO set, $\{\varphi_i\}$, is at our disposal, then it is well known that the DF may be expressed as

$$\rho_M = \sum_i \omega_i |\varphi_i|^2 \tag{24}$$
where the set $\boldsymbol{\omega} = \{\omega_i\}$ can be constructed to fulfill the usual discrete convex conditions $K_n(\boldsymbol{\omega})$. In this case the elements of the vector $\boldsymbol{\omega}$ are interpreted as the set of MO occupation numbers, which in the usual SCF monoconfigurational theory are just integers, 2 or 1. In the present framework they must be trivially transformed in such a way as to fulfill the convex conditions. In other computational environments, as in MC SCF or in natural orbital algorithms, the MO occupations can be real numbers, so the convex conditions may be more naturally defined. Equation 24 has the same structure as the ASA expressions described previously. In closed-shell systems, studied in a monoconfigurational way, the vector $\boldsymbol{\omega}$ just corresponds to a simple promolecular approach with $\boldsymbol{\omega} = n^{-1}\mathbf{1}$, with $n$ being half the number of electrons. The dependence of the MO basis set on an AO basis, within the well-known LCAO MO approach, has not been exploited in the ASA approaches. This possibility will be studied in forthcoming work. Here it can be said that the LCAO form of MO might be related to the DF generating VS.

An interesting point to be noted is the possibility of transforming SCF theory and electronic energy expressions into a normalized form, so as to use a first-order DF with the appropriate $K_n(\boldsymbol{\omega})$ conditions. It is a trivial matter to use Eq. 24 with the adequate convex conditions in monoconfigurational energy expressions, and the effect of this point of view on the Fock operator definition will be immediately grasped. Keeping convex conditions over the DF will scale monoconfigurational electronic energies by a factor that corresponds to the inverse of the squared number of electrons: $\sigma^2 = N^{-2}$. This means that

1. Repulsion integrals need not be scaled at all.
2. The one-electron Hamiltonian must be scaled in full by a factor equal to the inverse of the number of electrons: $\sigma = N^{-1}$.

Taking these simple rules into account, one can see that any SCF computation under this convex scaling of the DF will translate the treatment of any atomic or molecular system into a computational algorithm whose structure is independent of the number of particles in the system.
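The two scaling rules can be checked with a short derivation. The sketch below assumes the standard density-matrix form of the monoconfigurational energy, with $\mathbf{P}$ the density matrix normalized to $N$ electrons, $\mathbf{h}$ the one-electron Hamiltonian matrix, and $\mathbf{G}(\mathbf{P})$ the two-electron part of the Fock operator, which is linear in $\mathbf{P}$. This notation does not appear in the text above and is introduced here only for illustration.

```latex
\begin{align*}
E &= \operatorname{Tr}(\mathbf{h}\mathbf{P})
   + \tfrac{1}{2}\operatorname{Tr}\bigl(\mathbf{P}\,\mathbf{G}(\mathbf{P})\bigr) \\
\mathbf{P}' &= N^{-1}\mathbf{P}, \qquad
\mathbf{h}' = \sigma\,\mathbf{h} = N^{-1}\mathbf{h}, \qquad
\mathbf{G}(\mathbf{P}') = N^{-1}\mathbf{G}(\mathbf{P}) \\
E' &= \operatorname{Tr}(\mathbf{h}'\mathbf{P}')
   + \tfrac{1}{2}\operatorname{Tr}\bigl(\mathbf{P}'\,\mathbf{G}(\mathbf{P}')\bigr)
   = N^{-2}\operatorname{Tr}(\mathbf{h}\mathbf{P})
   + \tfrac{1}{2}N^{-2}\operatorname{Tr}\bigl(\mathbf{P}\,\mathbf{G}(\mathbf{P})\bigr)
   = \sigma^{2}E
\end{align*}
```

The repulsion integrals themselves are untouched: the whole factor $N^{-2}$ of the two-electron term comes from the two scaled density matrices, whereas the one-electron term, being linear in the density, needs the extra factor $\sigma$ applied to $\mathbf{h}$.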
Thus, in this scaled framework the SCF iterative structure could be studied with respect to convergence, extrapolation, and so on, for all systems on an almost equal footing. The only variable aspects of the SCF process will derive from the basis set size and from the nature and three-dimensional position of the atomic centers. Another interesting aspect of this scaling process is the numerical size of the electronic energies, which become constrained within a shorter interval than before scaling. For example, absolute values of atomic SCF electronic energies, from H to U atoms, will belong to the approximate interval [0.5, 4]. Resizing them to the usual values needs only a scaling by a factor $N^2$.

The Continuous ASA Case
Equation 23 can be made much more general if it is studied as embedded in a continuous environment. Provided some PD DF basis set, $\{\rho(\mathbf{t}, \mathbf{r})\}$, is known beforehand, and considering that the DF basis elements may depend on some continuous parameter set $\mathbf{t}$, the continuous construction of a PD DF, $\rho_\infty(\mathbf{r})$, can be set up using the integral form

$$\rho_\infty(\mathbf{r}) = \int \omega(\mathbf{t})\,\rho(\mathbf{t}, \mathbf{r})\,d\mathbf{t} \wedge K_\infty(\omega(\mathbf{t})) \tag{25}$$
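where $\omega(\mathbf{t})$ is a function of the parameter set $\mathbf{t}$, fulfilling the continuous convex conditions $K_\infty(\omega(\mathbf{t}))$, which in this case shall be written as

$$K_\infty(\omega(\mathbf{t})) = \Bigl\{\forall \mathbf{t}: \omega(\mathbf{t}) \in \mathbf{R}^+ \wedge \int \omega(\mathbf{t})\,d\mathbf{t} = 1\Bigr\} \tag{26}$$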
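At the same time, a generating function, $\gamma(\mathbf{t})$, contained in some Hilbert space, can be defined, which fulfills a generating rule equivalent to the one shown in Eq. 22 for the discrete case:

$$\mathcal{R}_\infty(\gamma(\mathbf{t}) \rightarrow \omega(\mathbf{t})) = \Bigl\{\exists \gamma(\mathbf{t}) \in \mathcal{H}(\mathbf{C}) \wedge \int \gamma(\mathbf{t})^*\gamma(\mathbf{t})\,d\mathbf{t} = 1 \Rightarrow \omega(\mathbf{t}) = |\gamma(\mathbf{t})|^2 \rightarrow K_\infty(\omega(\mathbf{t}))\Bigr\} \tag{27}$$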
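A numerical illustration of Eqs. 25 to 27 may help to fix ideas. The Python sketch below rests on assumptions that are not part of the text: a single continuous parameter t playing the role of a Gaussian exponent, a kernel made of normalized three-dimensional s-type Gaussian densities, an arbitrarily chosen complex generating function, and a hypothetical helper `trapezoid` for quadrature. It only checks that a normalized generating function yields a coefficient function and a DF that both integrate to one.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule along the last axis (local helper, hypothetical name)."""
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x), axis=-1) / 2.0

t = np.linspace(0.2, 5.0, 401)      # continuous parameter: Gaussian exponent
r = np.linspace(0.0, 20.0, 4001)    # radial grid

# Arbitrary complex generating function, normalised so that the integral
# of |g|^2 over t is one (first part of the generating rule).
g = np.exp(-(t - 2.0) ** 2).astype(complex) * np.exp(1j * t)
g /= np.sqrt(trapezoid(np.abs(g) ** 2, t))
omega = np.abs(g) ** 2              # coefficient function omega(t)

# Kernel rho(t, r): normalised 3D s-type Gaussian densities.
kernel = (t[:, None] / np.pi) ** 1.5 * np.exp(-t[:, None] * r[None, :] ** 2)

# Integral transform over t for every r (t is axis 0, hence the transpose).
rho = trapezoid((omega[:, None] * kernel).T, t)

norm_omega = trapezoid(omega, t)                   # continuous convex condition
norm_rho = trapezoid(4.0 * np.pi * r**2 * rho, r)  # unit volume of the DF
```

The phase factor attached to the generating function drops out of `omega`, which mirrors the way a wave function phase drops out of a DF.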
In this way, it is easy to see how a discrete ASA framework can be extended into an ∞-dimensional environment. Equation 27 has interesting features, which give the generating function, $\gamma(\mathbf{t})$, a close kinship to a QO wave function. The coefficient function, $\omega(\mathbf{t})$, also closely resembles a DF itself, through the corresponding equivalent definition. On the other hand, Eq. 25 can also be considered as an integral transform of the coefficient function $\omega(\mathbf{t})$, with the DF $\rho(\mathbf{t}, \mathbf{r})$ acting as a transform kernel. Thus, the continuous DF linear combination of Eq. 25 can also be considered a scalar product between two functions of equivalent DF nature. A convolution transformation, instead of a typical integral transform, may also be considered as a particular, but nonetheless appealing, situation. However, this possibility would lead us far from the objectives of the present discussion, and will perhaps be considered elsewhere.

D. Convex Operators

In the previous analysis of the ASA approach, the CS nature of the DF has been fruitful in designing an efficient approximate algorithm for obtaining reasonably accurate approximations to first-order DF. From there, and from the relationship between CS and DF, as stated in Eq. 18, one can try to go beyond the functional CS and, using the PD operator nature of DF, extend the previous ideas to PD sets of operators. In such an extension, the QSM themselves will be affected, being based on PD operator weighted integrals. The present discussion, from this point on, will be devoted to this extension of the CS framework and, finally, as a corollary it will attempt to determine whether there is some possible application of the resulting formalism.

Convex Linear Combinations of PD Operators

Suppose a set of PD operators, $\Omega = \{\omega_\alpha\}$. They fulfill the following relationship, when studied over the coherently defined DF VSS, $P$:

$$\Omega = \{\omega \mid \omega: P \rightarrow \mathbf{R}^+\} \wedge \forall \omega \in \Omega;\ \forall \rho \in P \rightarrow \int \omega(\mathbf{r})\rho(\mathbf{r})\,d\mathbf{r} \in \mathbf{R}^+ \tag{28}$$
Nothing opposes considering the following situation:

$$\Omega \subset P \Rightarrow \forall \omega, \rho: \int \omega(\mathbf{r})\rho(\mathbf{r})\,d\mathbf{r} \in \mathbf{R}^+ \tag{29}$$
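where the application of the PD operator set over the PD DF set can be interpreted as a noncommutative scalar product, defined over the VSS to which both the PD operator and DF sets belong. Moreover, the scalar product in Eq. 29 can be regarded, according to the usual quantum-mechanical interpretation, as an expectation value. A new operator, $\gamma$, can then be built as a linear combination of the elements of $\Omega$, with a coefficient set $\mathbf{w} = \{w_\alpha\}$ constrained to be positive and to sum to unity:

$$\gamma = \sum_\alpha w_\alpha \omega_\alpha \wedge \Bigl\{\forall \alpha: w_\alpha \in \mathbf{R}^+ \wedge \sum_\alpha w_\alpha = 1\Bigr\} \tag{30}$$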
The second constraint has been introduced so as to obtain a pattern comparable to the CS structure of the VSS $P$ and transfer it to $\Omega$, but it is not strictly necessary to keep this unit coefficient sum if it is not needed. The most interesting point is the obvious result, according to Definition 7, that PD operators can yield, in a CS environment, new PD operators.

Tuned QSM, SM, and QO Descriptors

The conservation of the PD property under linear combinations of PD operators in a CS environment can be employed in the evaluation of new kinds of QSM, by constructing a new breed of PD operator weights. The $\gamma$-type operators appearing in Eq. 30 can be tuned, while maintaining the identity of the operator set, just by changing the values of the CS coefficient set, $\mathbf{w}$, while conserving the initially chosen constraints. A QSM, following the definition provided in Eq. 7, can be built up, under these circumstances, as

$$z_{AB}(\gamma) = \sum_\alpha w_\alpha z_{AB}(\omega_\alpha) \tag{31}$$

The resulting tuned SM elements, $z_{AB}(\gamma)$, produce another obvious result for the SM set, $\{\mathbf{Z}(\omega_\alpha)\}$, associated with the operators in $\Omega$. With each SM attached to a PD operator, every such SM can be considered as a discrete matrix representation of the associated operator in the basis set of the DF involved. These matrices, as already mentioned, can be considered as PD matrices. Thus, Eq. 31 can be written in whole matrix form as

$$\mathbf{Z}(\gamma) = \sum_\alpha w_\alpha \mathbf{Z}(\omega_\alpha) \tag{32}$$
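The resultant matrix is PD because, if all of the elements of the SM set, $\Theta = \{\mathbf{Z}(\omega_\alpha)\}$, are PD, then the following property holds:

$$\forall \mathbf{x} \in \mathbf{C}_n \wedge \forall \mathbf{Z}(\omega_\alpha) \in \Theta \rightarrow \mathbf{x}^+\mathbf{Z}(\omega_\alpha)\mathbf{x} \in \mathbf{R}^+ \Rightarrow \mathbf{x}^+\mathbf{Z}(\gamma)\mathbf{x} = \sum_\alpha w_\alpha \mathbf{x}^+\mathbf{Z}(\omega_\alpha)\mathbf{x} \in \mathbf{R}^+;\ \text{if } \forall \alpha: w_\alpha \in \mathbf{R}^+ \tag{33}$$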
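The property in Eq. 33 is easy to verify numerically. The Python sketch below uses toy data only: the matrices are random symmetric PD matrices standing in for similarity matrices, and the helper name `random_pd` is hypothetical. The weights are generated as squares of a unit-norm vector, so they obey the convex conditions by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pd(n):
    """Toy symmetric PD matrix standing in for a similarity matrix."""
    a = rng.normal(size=(n, n))
    return a @ a.T + 0.1 * np.eye(n)

sms = [random_pd(6) for _ in range(3)]           # toy set {Z(omega_alpha)}

x = rng.normal(size=3)
x /= np.linalg.norm(x)                            # generating vector, x.x = 1
w = x**2                                          # convex weights

z_gamma = sum(wa * za for wa, za in zip(w, sms))  # tuned matrix Z(gamma)

# Lowest eigenvalue of each matrix (eigvalsh returns ascending order).
lowest = [np.linalg.eigvalsh(za)[0] for za in sms]
lowest_gamma = np.linalg.eigvalsh(z_gamma)[0]
```

Under these assumptions the lowest eigenvalue of the combination is positive and cannot fall below the smallest of the individual lowest eigenvalues, which is a slightly stronger statement than positive definiteness alone.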
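These results demonstrate that a finely tuned set of QO descriptors can be obtained in this way. This is so because Eq. 32 holds for the SM columns too, in such a way that

$$\mathbf{z}_I(\gamma) \in \mathbf{Z}(\gamma) \wedge \mathbf{z}_I(\omega_\alpha) \in \mathbf{Z}(\omega_\alpha) \rightarrow \mathbf{z}_I(\gamma) = \sum_\alpha w_\alpha \mathbf{z}_I(\omega_\alpha) \tag{34}$$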
Thus, all of the findings and definitions up to this point can be summarized as follows. A QOS is chosen in the form of a DF tagged set. A PD set of suitable operators is used, as a set of weights, in the evaluation of QSM between QO. A set of SM is thus computed, one for each operator. A CS with suitable coefficients is chosen to combine the elements of the SM set. The resultant SM columns are convex descriptors of the corresponding QO, and provide a discrete vector tagged set representation of the QOS.

E. Finely Tuned QSAR
If an immediate application of all of the previous development has to be chosen, quantitative structure-activity or -property relationships (QSAR or QSPR) constitute a good candidate field. In our laboratory, the basic theory connecting QSM
and QSAR or QSPR was developed some time ago and various practical applications have been reported more recently. It has been deduced that molecular properties have to be, in some manner, related to the discrete representation of molecular descriptors furnished by the columns of SM, constructed in turn from QSM over the molecular QO. As a consequence of Eqs. 29 and 34, a given property value, $\pi$, for a particular molecular QO, described in turn by a discrete descriptor, $\mathbf{z}(\gamma)$, can be expressed as

$$\pi = \mathbf{u}^T\mathbf{z}(\gamma) \tag{35}$$
where the vector $\mathbf{u}$ corresponds to an unknown discrete representation of some operator over the same PD DF basis set used to construct the convex discrete molecular descriptor $\mathbf{z}(\gamma)$. The usual procedure is to use a least-squares algorithm so that, knowing the pairs $\{\pi, \mathbf{z}(\gamma)\}$ for a molecular QOS, the values of $\mathbf{u}$ can be obtained. Taking into account the tuned construction of the vectors $\mathbf{z}(\gamma)$, it can be easily seen that the vector $\mathbf{u}$ will depend on the tuning parameter set, $\gamma$. We use the least-squares solution of the problem

$$\mathbf{u} = \bigl(\mathbf{Z}(\gamma)^T\mathbf{Z}(\gamma)\bigr)^{-1}\mathbf{Z}(\gamma)^T\mathbf{p} \tag{36}$$
where the vector $\mathbf{p} = \{\pi_I\}$ contains the values of the property for each molecular QO, and $\mathbf{Z}(\gamma)$ is the SM of the QOS computed according to Eq. 32. Equation 36, however, has been written taking into account the possibility that the SM may no longer be square symmetric, but rectangular. This constitutes the more general case where, instead of a unique tagged set, two QOS with different cardinalities, $m$ and $n$, are used to compute the QSM. The resultant SM will be of dimension $(m \times n)$. From this definition, one can easily deduce that the vector $\mathbf{u}$ will depend on the tuning coefficients $\mathbf{w}$. Optimization of the tuning set coefficients $\mathbf{w}$ can be done at the same time as the classical least-squares problem is solved, keeping in mind the CS constraints that the tuning set $\mathbf{w}$ bears. A parallel nonlinear constrained optimization on a quadratic function of the $\mathbf{w}$ elements will appear. The interesting feature here is that the CS constraints can be handled in the same way as they are kept in the optimal ASA problem. Thus, alongside the usual least-squares problem involving the operator-associated vector, $\mathbf{u}$, there appears another least-squares equation, which starts by defining the residual vector:

$$\boldsymbol{\Delta} = \mathbf{p} - \mathbf{Z}(\gamma)\mathbf{u} = \mathbf{p} - \sum_\alpha w_\alpha \mathbf{Z}(\omega_\alpha)\mathbf{u} = \mathbf{p} - \sum_\alpha w_\alpha \mathbf{v}_\alpha = \mathbf{p} - \mathbf{V}\mathbf{w} \tag{37}$$
where the previous definitions of the matrices involved have been employed. The matrix $\mathbf{V}$ collects the vector set $\{\mathbf{v}_\alpha = \mathbf{Z}(\omega_\alpha)\mathbf{u}\}$, and the vector $\mathbf{w} = \{w_\alpha\}$ contains the coefficients of the tuning set. The residual vector of Eq. 37 obviously depends on the classical least-squares solution $\mathbf{u}$ of Eq. 36. From inspection of the residual vector $\boldsymbol{\Delta}$, it is easy to see that the quadratic error is a generalized quadratic function whose variable set is the new unknown vector $\mathbf{w}$. This least-squares problem has to be solved under the constraints associated with the PD nature of the vector $\mathbf{w}$. A CS constraint structure may be very convenient in normalizing the form of the problem. Thus, the quadratic function and the constraints may be written, using a compact matrix form, as follows:

$$\varepsilon^2 = \chi - 2\mathbf{q}^T\mathbf{w} + \mathbf{w}^T\mathbf{Q}\mathbf{w} \wedge K_n(\mathbf{w}) \tag{38}$$
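where the following simplifications have been used:

$$\chi = \mathbf{p}^T\mathbf{p} \wedge \mathbf{q} = \mathbf{V}^T\mathbf{p} \wedge \mathbf{Q} = \mathbf{V}^T\mathbf{V} \tag{39}$$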
It is important to define the appropriate SM set, $\{\mathbf{Z}(\omega_\alpha)\}$, so that the columns of the matrix $\mathbf{V} = \{\mathbf{v}_\alpha\}$ are linearly independent, in order to obtain a PD matrix $\mathbf{Q}$. This is equivalent to saying that the SM set must provide linearly independent images of the least-squares solution $\mathbf{u}$. The solution of the second optimization problem for the vector $\mathbf{w}$ could be sought using a generating vector, for example $\mathbf{x}$, which substitutes the $\mathbf{w}$ elements through the convex constraints $K_n(\mathbf{w})$ and the generating rule $\mathcal{R}_n(\mathbf{x} \rightarrow \mathbf{w})$. Expression 38 then transforms into a quartic function of the components of the vector $\mathbf{x}$. Optimization under the unit norm of the generating vector, $\mathbf{x}^+\mathbf{x} = 1$, may be carried out by means of EJR, as in the ASA case. The whole optimization process is carried out in an iterative manner:

1. Using a starting approximate tuning vector $\mathbf{w}$, obtain $\mathbf{u}$ by solving Eq. 36.
2. Knowing $\mathbf{u}$, compute a new $\mathbf{w}$ by minimizing function 38.
3. Go to step 1 while the vector pair $\{\mathbf{u}, \mathbf{w}\}$ remains inconsistent with respect to the previous iteration.

Changing the number and nature of the SM composing $\mathbf{Z}(\gamma)$ will obviously produce different results, but within a given choice these can be coherently tuned. This can add extraordinary possibilities to QSAR procedures.
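The three-step iteration can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: the function names (`solve_u`, `error`, `solve_w`, `tune`) are hypothetical, the similarity matrices are taken as rectangular arrays with one row per molecule, the generating vector is real, and the Jacobi rotation angle is found by a coarse grid search instead of an analytic minimization of the quartic function.

```python
import numpy as np

def solve_u(sms, w, p):
    """Step 1: least-squares u for fixed w (Eq. 36)."""
    z = sum(wa * za for wa, za in zip(w, sms))        # tuned SM, Eq. 32
    return np.linalg.lstsq(z, p, rcond=None)[0]

def error(w, chi, q, Q):
    """Quadratic error of Eq. 38."""
    return chi - 2.0 * q @ w + w @ Q @ w

def solve_w(sms, u, p, w, sweeps=5, n_angles=181):
    """Step 2: rotate pairs of x (with w = x**2) to lower Eq. 38."""
    V = np.column_stack([za @ u for za in sms])       # v_a = Z(omega_a) u
    chi, q, Q = p @ p, V.T @ p, V.T @ V               # Eq. 39
    x = np.sqrt(w)                                    # real generating vector
    angles = np.linspace(-np.pi / 2, np.pi / 2, n_angles)
    for _ in range(sweeps):
        for i in range(len(x) - 1):
            for j in range(i + 1, len(x)):
                best = (error(x**2, chi, q, Q), x[i], x[j])
                for t in angles:
                    xi = np.cos(t) * x[i] - np.sin(t) * x[j]
                    xj = np.sin(t) * x[i] + np.cos(t) * x[j]
                    trial = x.copy()
                    trial[i], trial[j] = xi, xj
                    e = error(trial**2, chi, q, Q)
                    if e < best[0]:                   # accept only improvements
                        best = (e, xi, xj)
                x[i], x[j] = best[1], best[2]
    return x**2

def tune(sms, p, max_iter=30, tol=1e-8):
    """Alternate steps 1 and 2 until the pair {u, w} stops changing (step 3)."""
    w = np.full(len(sms), 1.0 / len(sms))             # uniform starting weights
    u = solve_u(sms, w, p)
    for _ in range(max_iter):
        w_new = solve_w(sms, u, p, w)
        u_new = solve_u(sms, w_new, p)
        done = np.allclose(w_new, w, atol=tol) and np.allclose(u_new, u, atol=tol)
        u, w = u_new, w_new
        if done:
            break
    return u, w
```

Because each step only accepts a lower or equal residual for the variable it controls, the residual norm cannot increase from one iteration to the next. With square, nonsingular similarity matrices step 1 already reproduces the property vector exactly for any choice of weights, so the tuning is only informative in the rectangular case with more molecules than descriptor columns.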
IV. ON THE STATISTICAL INTERPRETATION OF DENSITY FUNCTIONS: DIAGONAL VECTOR SPACES AND RELATED PROBLEMS

One of the main contributions of the present paper is the definition of n-dimensional diagonal vector spaces (DVS). The objective in introducing DVS is to find a discrete vector representation that consistently reproduces some usual properties of the ∞-dimensional Hilbert spaces containing the relevant functions, which are subsequently employed to describe QO, in accordance with the von Neumann point of view. Thus, the main concern here will be to obtain, in a natural way, the CS structure of approximate DF within DVS, in the same natural manner as the DF is obtained from the squared module of the QO system wave function.

A. The Nature of Discrete QO Representations
Let us now suppose a QOS, $Q$, constructed in the usual way as a TS, that is, $Q = S \times P$. Let us also suppose that the elements of the tag set part are ASA-type DF, built as in Eqs. 21 or 23. Accepting this scenario is the same as considering that a QO is described under some finite PD functional basis set $\Phi = \{|\varphi_i(\mathbf{r})|^2\}$ with coordinates $\boldsymbol{\omega} = \{\omega_i\}$, fulfilling the convex conditions $K_n(\boldsymbol{\omega})$ and belonging to a given n-dimensional VSS, such as $\boldsymbol{\omega} \in \mathcal{V}_n(\mathbf{R}^+)$. The TS constructed as $Q_n = S \times \{\Omega \subset \mathcal{V}_n(\mathbf{R}^+)\}$ corresponds to a QOS that has as tag set part a subset of some VSS of finite dimension. A discrete representation of QO can thus be reached in this way, besides the one discussed in Section III.A. Considering the nature of the continuous ASA-type transform of Eq. 25, it can be seen that, in the discrete case, the PD basis set $\Phi$ and the coefficient vector $\boldsymbol{\omega}$ shall bear some equivalent structure. As the convex conditions $K_n(\boldsymbol{\omega})$ hold for the coefficient vector, it is easy to interpret the elements of the coefficient vector $\boldsymbol{\omega}$ as constituting a discrete probability distribution. For example, in a promolecular approach, as well as in an MO monoconfigurational closed-shell structure, $\boldsymbol{\omega}$ can appear as a homogeneous discrete probability distribution. Thus, the coefficient vector $\boldsymbol{\omega}$ bears, in discrete n-dimensional spaces, statistical features equivalent to those of a DF. It is not strange that there exists a generating vector, $\boldsymbol{\gamma} \in \mathcal{V}_n(\mathbf{C})$, producing the PD $\boldsymbol{\omega}$ elements by application of the generating rule $\mathcal{R}_n(\boldsymbol{\gamma} \rightarrow \boldsymbol{\omega})$. The rule does not have the structure of a linear transformation, but must bear a nonlinear form. This nonlinear relationship between generating vector and coefficient vector appears unnatural from an algebraic point of view. This is more obvious when examined in the continuous situation, as discussed previously.
The picture becomes even more conspicuous when the nature of the DF is observed from the quantum-mechanical side, because any DF has to be considered as the squared module of the QO wave function, which thus acts as a generating vector. The problem can be stated transparently using a simple, well-known mathematical device, associated with the most basic aspects of quantum mechanics. Suppose a QO wave function is known for some system state, $\Psi(\mathbf{r}) \in \mathcal{H}(\mathbf{C})$. The corresponding DF is simply computed as $\rho(\mathbf{r}) = |\Psi(\mathbf{r})|^2$. The interesting fact is that the DF thus defined may belong either to the Hilbert space direct product $\mathcal{H}(\mathbf{C}) \otimes \mathcal{H}(\mathbf{C})$, when considered as an operator, or to some functional VSS $\mathcal{H}(\mathbf{R}^+)$, when considered as a PD real-valued function. In this sense, the generating rule can be applied here, and immediately written as $\mathcal{R}_\infty(\Psi \rightarrow \rho)$. It can be interpreted in the following way, using the practically unmodified structure of Eq. 27:

$$\mathcal{R}_\infty(\Psi \rightarrow \rho) = \Bigl\{\exists \Psi(\mathbf{r}) \in \mathcal{H}(\mathbf{C}) \wedge \int |\Psi(\mathbf{r})|^2\,d\mathbf{r} = 1 \Rightarrow \rho(\mathbf{r}) = |\Psi(\mathbf{r})|^2 \rightarrow K_\infty(\rho)\Bigr\} \tag{40}$$
This continuous generating rule must imply a closer relationship between the vectors involved in the discrete case. There, the generating rule $\mathcal{R}_n(\boldsymbol{\gamma} \rightarrow \boldsymbol{\omega})$ means that, while the normalization part for $\boldsymbol{\gamma}$ can be computed by a simple algorithm, $1 = \boldsymbol{\gamma}^+\boldsymbol{\gamma}$, no such simple operation exists for the second part of the generating rule. That is, when one must attach the $\boldsymbol{\omega}$ coefficient vector elements to the squared modules of the elements of the generating vector $\boldsymbol{\gamma}$, the algorithm is not naturally isomorphic to the one in Eq. 40. To insist on the problem: a simple, naturally obtained, isomorphic operation is lacking in the second part of the discrete generating rule, as stated in Eq. 22, when compared with the continuous case as defined in Eqs. 27 or 40. A possible solution to this interesting situation is discussed next.

B. The Structure of the Generating n-Dimensional VS: DVS

The generating rules 22 and 40 are a shorthand notation for a nonlinear transformation involving the generating VS, $\mathcal{V}_n(\mathbf{C})$, and the final VSS containing the coefficient vectors, $\mathcal{V}_n(\mathbf{R}^+)$. The lack of a simple natural operation producing the results implicitly stated in the generating rule in the discrete case can be circumvented using the following scheme. Suppose an isomorphic pair of n-dimensional VS, which will be named $\mathcal{D}_n(\mathbf{C})$ and $\mathcal{D}_n(\mathbf{R}^+)$. Both can be good substitutes for the original $\mathcal{V}_n(\mathbf{C})$ and $\mathcal{V}_n(\mathbf{R}^+)$ VS described above, respectively. A sound isomorphism of column or row VS is constituted by DVS, whose elements possess the structure of diagonal matrices. Let us consider that the elements of the isomorphic $\mathcal{D}_n(\mathbf{C})$ and $\mathcal{D}_n(\mathbf{R}^+)$ VS are chosen as diagonal matrices. This choice of elements is not arbitrary, because matrix multiplication is closed in DVS; that is, matrix products of diagonal matrices yield new diagonal matrices. Moreover, diagonal matrix products are commutative.
Considering only the diagonal part of the matrix elements, and discarding the off-diagonal elements, the DVS possess the same dimension as their isomorphic column or row vector counterparts. It is then easy to see that, using this simple isomorphic device, both the discrete and continuous generating rules acquire the same formal structure. Indeed, the discrete rule in Eq. 22 is rewritten within any DVS framework as

$$\mathcal{R}_n(\mathbf{D} \rightarrow \boldsymbol{\Delta}) = \{\exists \mathbf{D} \in \mathcal{D}_n(\mathbf{C}) \wedge \boldsymbol{\Delta} = \mathbf{D}^*\mathbf{D} = \mathrm{Diag}(|d_i|^2) \wedge \langle\boldsymbol{\Delta}\rangle = 1 \rightarrow K_n(\boldsymbol{\Delta})\} \tag{41}$$

The symbol $\langle\mathbf{D}\rangle = \sum_i d_i$ is, in this case, equivalent to the trace of the corresponding diagonal matrix, but it can be used as a simple symbol meaning the sum of all matrix elements, and it was so defined and used for the same purposes before. The convex condition has to be slightly modified to take into account the new DVS element structure:

$$K_n(\boldsymbol{\Delta}) = \Bigl\{\boldsymbol{\Delta} = \mathrm{Diag}(\pi_i) \in \mathcal{D}_n(\mathbf{R}^+): \pi_i > 0,\ \forall i \wedge \langle\boldsymbol{\Delta}\rangle = \sum_i \pi_i = 1\Bigr\} \tag{42}$$
Thus, working with DVS instead of conventional VS and VSS, the coefficients in the discrete DF description possess the same structural properties as the DF themselves. The generating DVS elements, $\mathbf{D} \in \mathcal{D}_n(\mathbf{C})$, act in the same manner as do the QO wave functions. The resultant coefficient diagonal matrix, $\boldsymbol{\Delta} \in \mathcal{D}_n(\mathbf{R}^+)$, satisfying the convex conditions $K_n(\boldsymbol{\Delta})$, can be written as the squared module of the former diagonal matrix. This can be done using a discrete form of the generating rule, $\mathcal{R}_n(\mathbf{D} \rightarrow \boldsymbol{\Delta})$, similar to the wave function-DF generating rule $\mathcal{R}_\infty(\Psi \rightarrow \rho)$: $\boldsymbol{\Delta} = \mathbf{D}^*\mathbf{D} = \mathbf{D}\mathbf{D}^* = |\mathbf{D}|^2 = \mathrm{Diag}(|d_i|^2)$, as described in Eq. 41. The DVS of $\mathcal{D}_n(\mathbf{C})$ type may be considered normed spaces, with one of the possible norms defined as the trace of the squared matrix module. As a consequence, the elements of the DVSS $\mathcal{D}_n(\mathbf{R}^+)$ are constructed in such a way that their trace is always normalizable, and thus easily made unity. A diagonal TS, $\mathfrak{D}_n$, can be derived in the usual way by using a given background set part, $S$, and a DVSS convex subset, $\mathcal{K}$, as the tag set part, that is, $\mathfrak{D}_n = S \times \{\mathcal{K} \subset \mathcal{D}_n(\mathbf{R}^+)\}$. The question now need not be: why do the n-dimensional DVS fulfill in a natural way the same conditions as ∞-dimensional functional VS? It is much better stated as follows: what consequences, if any, will this situation have for the development of a discrete quantum chemistry framework? The next section tries to describe some of the possible features of this DVS structure.

C. Expression of the Density Functions and Other Problems
It has been shown that the discrete representation of the DF having an ASA-like form, as in Eqs. 21, 23, and 24, is better described as a diagonal matrix than as a vector, as is usually done. The scalar-like expression of the ASA DF type can then be redefined in terms of the natural operations presented in the preceding section. To obtain a coherent view of all of the possible redefinitions that follow from adopting the DVS representation, some preliminary considerations are made next. As the formal structure of ASA generating rules is better represented from the point of view of diagonal matrices rather than vectors, both the generating and coefficient vectors are transformed into elements of some DVS. The ASA forms discussed in Section III.C are associated, besides the coefficient vector, with a PD function set, which is in turn connected to the squared module of another function set, which belongs to another structure that can be termed a generating function VS. This situation can be managed in the same way as in the preceding discussion. Suppose that a function basis set is known: $\Phi = \{\varphi_i\}$. Nothing prevents the set $\Phi$ from being arranged, without loss of generality, into a diagonal matrix structure, constructed as $\boldsymbol{\Phi} = \mathrm{Diag}(\varphi_1, \varphi_2, \ldots, \varphi_n) \in F(\mathbf{C})$. It is then obvious that the diagonal matrix product $\mathbf{P} = \boldsymbol{\Phi}^*\boldsymbol{\Phi} = \mathrm{Diag}(|\varphi_1|^2, |\varphi_2|^2, \ldots, |\varphi_n|^2) \in P(\mathbf{R}^+)$ always produces a new diagonal matrix, whose elements belong to a special function VSS made of PD functions, that is, made of function squared modules. Thus, taking into account the definition of the diagonal product of the initial basis set, one can consider that the above result produces an entirely new PD basis set: $P = \{|\varphi_i|^2\}$. Also, having defined the generating and coefficient VS, one can construct the following hybrid diagonal matrix:

$$\forall \mathbf{D} = \mathrm{Diag}(d_i) \in \mathcal{D}_n(\mathbf{C}) \wedge \forall \boldsymbol{\Phi} = \mathrm{Diag}(\varphi_i) \in F(\mathbf{C}) \Rightarrow \boldsymbol{\Psi} = \mathbf{D}\boldsymbol{\Phi} = \mathrm{Diag}(d_i\varphi_i) \in \mathcal{D}_n(\mathbf{C}) \times F(\mathbf{C}) \tag{43}$$
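Once mixed structures of this kind are constructed, the ASA-like DF can be simply built by computing traces of squared modules of the diagonal structures, $\boldsymbol{\Psi}$, defined in Eq. 43. That is,

$$\rho = \langle\boldsymbol{\Psi}^*\boldsymbol{\Psi}\rangle = \sum_i |d_i|^2 |\varphi_i|^2 = \sum_i \omega_i |\varphi_i|^2 \tag{44}$$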
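The construction of Eqs. 41 to 44 can be followed step by step in a few lines of Python. The sketch relies on assumptions that are not part of the text: a toy basis of normalized one-dimensional Gaussians with hypothetical exponents `alphas`, random complex generating elements, and a grid representation of the functions. Since every matrix involved is diagonal, only the diagonals of the function matrices are stored.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Generating diagonal matrix D with complex entries, normalised so that
# the sum of the elements of D* D equals one.
d = rng.normal(size=n) + 1j * rng.normal(size=n)
d /= np.linalg.norm(d)
D = np.diag(d)

Delta = D.conj() @ D                   # Diag(|d_i|^2), as in Eq. 41
omega = np.real(np.diag(Delta))        # discrete probability distribution

# Toy basis: normalised 1D Gaussians phi_i(r), sampled on a grid.
r = np.linspace(-10.0, 10.0, 2001)
alphas = np.array([0.5, 1.0, 2.0, 4.0])   # hypothetical exponents
phi = (2.0 * alphas[:, None] / np.pi) ** 0.25 * np.exp(-alphas[:, None] * r**2)

# Psi = D Phi is diagonal too; its diagonal holds d_i * phi_i(r).
psi = d[:, None] * phi
rho = np.sum(np.abs(psi) ** 2, axis=0)    # sum of the elements of Psi* Psi

rho_asa = omega @ phi**2                  # sum_i omega_i |phi_i|^2
integral = np.sum((rho[1:] + rho[:-1]) * np.diff(r)) / 2.0
```

The arrays `rho` and `rho_asa` agree to rounding error, and the phases of the elements of `d` play no role in the resulting DF.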
The formalism is now clear on how to construct the necessary generating elements, and the road is open to obtain, in a very natural way, the structure of ASA-like DF. The most interesting feature of the whole procedure, perhaps, consists in finding out how closely the formal rules deducible from discrete DVS match the formalism based on continuous quantum mechanics. Nothing seems to oppose this possibility. In fact, it only remains to state how an expectation value $\langle\Omega\rangle$ of some observable, associated with an operator $\Omega$, can be computed within a DVS formalism. A possible way could be

$$\langle\Omega\rangle = \int \Omega\rho\,dV = \int \Omega\langle\boldsymbol{\Psi}^*\boldsymbol{\Psi}\rangle\,dV = \sum_i |d_i|^2 \int \Omega|\varphi_i|^2\,dV = \sum_i \omega_i \int \Omega\rho_i\,dV = \sum_i \omega_i \int \varphi_i^*\Omega\varphi_i\,dV = \int \langle\boldsymbol{\Psi}^*\Omega\boldsymbol{\Psi}\rangle\,dV \tag{45}$$
The last linear combination of integrals is suited to differential operators, and can be naturally obtained when considering the operator $\Omega$ as a scalar matrix $\Omega\mathbf{I}$.
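A numerical check of Eq. 45 can be sketched in Python for two kinds of operator. Everything specific here is an assumption made for illustration: a one-dimensional toy basis of normalized Gaussians with hypothetical exponents `alphas`, an arbitrary convex coefficient set `omega`, the multiplicative operator r², and the differential operator −½ d²/dr², whose action on each Gaussian is written analytically. The closed-form values used for comparison follow from standard Gaussian integrals.

```python
import numpy as np

r = np.linspace(-12.0, 12.0, 4801)
alphas = np.array([0.5, 1.0, 2.0])       # hypothetical exponents
omega = np.array([0.2, 0.5, 0.3])        # convex coefficients, unit sum

def integrate(y):
    """Trapezoidal quadrature on the grid r."""
    return np.sum((y[1:] + y[:-1]) * np.diff(r)) / 2.0

def phi(a):
    """Normalised 1D Gaussian with exponent a."""
    return (2.0 * a / np.pi) ** 0.25 * np.exp(-a * r**2)

def kinetic_on_phi(a):
    """Apply -1/2 d^2/dr^2 to phi analytically."""
    return -0.5 * (4.0 * a**2 * r**2 - 2.0 * a) * phi(a)

# Multiplicative operator: sum_i omega_i * int r^2 |phi_i|^2 dr
r2 = sum(w * integrate(r**2 * phi(a) ** 2) for w, a in zip(omega, alphas))

# Differential operator: sum_i omega_i * int phi_i (T phi_i) dr
kin = sum(w * integrate(phi(a) * kinetic_on_phi(a)) for w, a in zip(omega, alphas))

# Closed-form values for this toy basis.
r2_exact = np.sum(omega / (4.0 * alphas))
kin_exact = np.sum(omega * alphas / 2.0)
```

For the multiplicative operator the density form and the sandwich form of the integrals coincide, whereas for the differential operator only the sandwich form, with the operator acting on each basis function, is meaningful.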
V. CONCLUSIONS

A general framework, in which quantum objects can be described in a systematic way, has been constructed. The concept of density function tagged set encompasses an early generalization, namely the Boolean tagged sets, which is proposed as a sound substitute for fuzzy set definitions in describing molecular structures. At the same time, the definition of quantum-mechanical density functions has been used to bring out their essential positive definite nature. This fundamental property of density functions, often forgotten in the current literature, has also been used to connect quantum similarity measures, a simple concept that compares two or more quantum objects, with the spaces containing positive definite operators. Vector semispaces and the more conventional convex set algebra have been put into the context of the computation of approximate density functions, as in the ASA framework. This kind of computational algorithmic experience has been extended to positive definite operators and their matrix representations, the similarity matrices, from the point of view of quantum similarity measures. Positive definite operators can be used to construct a convex set of new positive definite operators, and consequently their matrix representations remain positive definite. In this way, a new window is opened for obtaining discrete, finely tuned molecular descriptors in the form of positive definite vectors belonging to n-dimensional vector semispaces. The utility of the presented theoretical results in the context of quantitative structure-activity relationships is but one of the vast prospective application fields. The quantum chemical, statistically coherent significance of the expansion coefficients in ASA-like DF forms, which satisfy convex conditions and can be considered as a discrete probability distribution, has been shown. Moreover, there is apparently no problem in using the fitted atomic densities to obtain expectation values of quantum chemical operators. The formalism, based on DVS and TS, becomes in this manner a fruitful tool on which further work can be based.
ACKNOWLEDGMENTS

This work was partially financed by CICYT Research Project SAP 96-0158. Professors J. Karwowski and P. G. Mezey are thanked for lively debates on the subject of fuzzy and tagged sets, and Dr. E. Besalú for constructive criticism and advice. The author warmly thanks Mr. Ll. Amat for stimulating conversations on ASA and QSAR, which led to various Fortran 90 implementations, as well as for patiently performing preliminary calculation tests on some tuned QSAR problems. Enlightening, informal discussions with Dr. J. Mestres took place in previous stages of this work.
REFERENCES
1. Zadeh, L. A. Inf. Control 1965, 8, 338.
2. Trillas, E.; Alsina, C.; Terricabras, J. M. Introducción a la Lógica Difusa; Ariel Matemática: Barcelona, 1995.
3. Carbó, R., Ed. Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Kluwer: Dordrecht, 1995.
4. Carbó-Dorca, R.; Mezey, P. G., Eds. Advances in Molecular Similarity, Vol. 1; JAI Press: Greenwich, CT, 1996.
5. Carbó, R.; Calabuig, B.; Vera, L.; Besalú, E. Adv. Quantum Chem. 1994, 25, 253-313.
6. Carbó, R.; Calabuig, B. Int. J. Quantum Chem. 1992, 42, 1681-1709.
7. Carbó, R.; Besalú, E. In Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Carbó, R., Ed.; Kluwer: Dordrecht, 1995, pp. 3-30.
8. Carbó, R.; Arnau, M.; Leyda, L. Int. J. Quantum Chem. 1980, 17, 1185-1189.
9. Stoer, J.; Witzgall, C. Die Grundlehren der mathematischen Wissenschaften in Einzeldarstellungen, Vol. 163; Springer-Verlag: Berlin, 1970.
10. See for example: (a) Carbó, R.; Calabuig, B. Comput. Phys. Commun. 1989, 55, 117-126. (b) Carbó, R.; Calabuig, B. J. Mol. Struct. (Theochem) 1992, 254, 517-531. (c) Carbó, R.; Calabuig, B. In Computational Chemistry: Structure, Interactions and Reactivity; Fraga, S., Ed.; Elsevier: Amsterdam, 1992, Vol. A, pp. 300-324.
11. See for example: (a) Löwdin, P. O. Phys. Rev. 1955, 97, 1474-1489. (b) McWeeny, R. Rev. Mod. Phys. 1960, 32, 335-369.
12. Carbó-Dorca, R. Fuzzy Sets and Boolean Tagged Sets; Technical Report IT-IQC-5-97, see also: J. Math. Chem. 1997, 22, 143-147.
13. Carbó, R.; Calabuig, B.; Besalú, E.; Martínez, A. Mol. Eng. 1992, 2, 43-64.
14. Carbó, R.; Calabuig, B. J. Chem. Inf. Comput. Sci. 1992, 32, 600-606.
15. Encyclopaedia of Mathematics; Reidel-Kluwer: Dordrecht, 1987.
16. See for example: (a) Constans, P.; Carbó, R. J. Chem. Inf. Comput. Sci. 1995, 35, 1046-1053. (b) Constans, P.; Amat, L.; Fradera, X.; Carbó-Dorca, R. In Advances in Molecular Similarity; Carbó-Dorca, R.; Mezey, P. G., Eds.; JAI Press: Greenwich, CT, 1996, Vol. 1, pp. 187-211. (c) Amat, L.; Carbó, R.; Constans, P. Sci. Gerundensis 1996, 22, 109-121. (d) Amat, L.; Carbó-Dorca, R. QSM and Expectation Values under ASA: First Order Density Fitting Using EJR; Technical Report IT-IQC-2-97, see also: J. Comput. Chem. 1997, 18, 2023-2039.
17. Jacobi, C. G. J. J. Reine Angew. Math. 1846, 30, 51-94.
18. Constans, P.; Amat, L.; Carbó-Dorca, R. J. Comput. Chem. 1997, 18, 826-846.
19. Carbó, R.; Besalú, E.; Amat, L.; Fradera, X. J. Math. Chem. 1995, 18, 237-246.
20. See for example: (a) Fradera, X.; Amat, L.; Besalú, E.; Carbó-Dorca, R. Quant. Struct.-Act. Relat. 1997, 16, 25-32. (b) Lobato, M.; Amat, L.; Besalú, E.; Carbó-Dorca, R. Estudi QSAR d'una familia de Quinolones; Technical Report IT-IQC-4-97. (c) Lobato, M.; Amat, L.; Besalú, E.; Carbó-Dorca, R. Structure-Activity Relationship of a Steroid Family using QSM and Topological QS Indices; Technical Report IT-IQC-8-97, see also: Quant. Struct.-Act. Relat. 1997, 16, 465-472.
21. Carbó, R.; Besalú, E. In Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Carbó, R., Ed.; Kluwer: Dordrecht, 1995, pp. 3-30.
22. Carbó-Dorca, R. Tagged Sets, Convex Sets and Quantum Similarity Measures; Technical Report IT-IQC-9-97, see also: J. Math. Chem. 1998, 23, 353-364.
23. Berberian, S. K. Introducción al Espacio de Hilbert; Editorial Teide: Barcelona, 1970.
24. Von Neumann, J. Mathematical Foundations of Quantum Mechanics; Princeton University Press: Princeton, NJ, 1955.
25. Encyclopaedia of Mathematics; Reidel-Kluwer: Dordrecht, 1987, Vol. 8, p. 249.
26. See for example: (a) Encyclopaedia of Mathematics; Reidel-Kluwer: Dordrecht, 1987, Vol. 5, p. 126. (b) Zemanian, A. H. Generalized Integral Transformations; Dover Publications: New York, 1987.
27. Carbó, R.; Besalú, E. J. Math. Chem. 1995, 18, 37-72.
28. Carbó-Dorca, R. On the Statistical Interpretation of Density Functions: ASA, Convex Sets, Discrete Quantum Chemical Molecular Representations, Diagonal Vector Spaces and Related Problems; Technical Report IT-IQC-10-97, see also: J. Math. Chem. 1998, 23, 365-375.
PATTERN RECOGNITION TECHNIQUES IN MOLECULAR SIMILARITY
W. Graham Richards and Daniel D. Robinson
Abstract
I. Introduction
II. Two-Dimensional Representations
III. Alignment
IV. Conclusion
Acknowledgment
References
ABSTRACT The speedup in molecular similarity calculations needed to cope with libraries of tens of thousands of compounds is achievable if we adopt techniques from pattern recognition and start with two-dimensional representations derived by nonlinear mapping of the three-dimensional distance matrices. Here we describe the use of invariant moments in this respect.
Advances in Molecular Similarity, Volume 2, pages 73-77. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
I. INTRODUCTION Molecular similarity has proved to be a useful tool since its introduction by Carbó et al.^1 and later extensions to the use of molecular electrostatic potential^2 and shape^3 as descriptors. In particular, the use of similarity matrices, where every member of a series of compounds is compared with a lead compound or even with every other member of the series, has been a powerful technique both for quantitative structure-activity studies^4 and as a measure of diversity. This utility was readily apparent when dealing with series of some tens of compounds. Now that combinatorial chemistry is providing actual libraries of tens of thousands of molecules and virtual libraries containing millions of compounds, new problems present themselves. These are difficulties of speed of calculation. If we are to align molecules optimally and then compute similarity, we need to gain an increase in speed of several orders of magnitude. In two-dimensional problems, such as optical character recognition, appropriate speeds can be achieved. Here we show how such pattern recognition approaches can be applied to molecular similarity if we can represent three-dimensional structures in two dimensions.
II. TWO-DIMENSIONAL REPRESENTATIONS The three-dimensional structure of a molecule may be represented by a distance matrix. We have shown^5,6 that nonlinear mapping permits one to produce two-dimensional representations that retain the majority of the distance information from three dimensions. These figures are suitable material for pattern recognition.
III. ALIGNMENT There are two aspects to the alignment of two-dimensional figures: putting the centers of the figures at the same point (translational invariance) and rotating to achieve maximal similarity (rotational invariance). Borrowing a technique from pattern recognition, Hu's method^7 of invariant moments may achieve both translational and rotational alignment. The method is based on a statistical analysis of the distribution and values of the property $\rho$ to be compared. The moments of order $(p, q)$ of this distribution take the form

$$m_{pq} = \iint x^p y^q \rho(x, y)\,dx\,dy \quad \text{for continuous data}$$

or

$$m_{pq} = \sum_x \sum_y x^p y^q \rho(x, y) \quad \text{for discrete data.}$$
Now from a uniqueness theorem due to Papoulis,^8 provided that $\rho(x, y)$ is continuous and has nonzero values only in a finite part of the $x, y$ plane, moments of all orders exist and the sequence of moments $m_{pq}$ is uniquely determined by $\rho(x, y)$. Conversely, it can be shown that the infinite sequence of $m_{pq}$ uniquely determines $\rho(x, y)$. Inherent in the equations for generating the moments is the assumption that the property $\rho(x, y)$ is centered on the calculation coordinates. This may not be the case. However, let us define two parameters as follows:

$$\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}}$$

These two parameters clearly give the center of the property $\rho(x, y)$ in the calculation coordinate system. These have been shown to be consistently placed for all reasonably comparable systems. Then let us define the central moment $\mu_{pq}$ by

$$\mu_{pq} = \iint (x - \bar{x})^p (y - \bar{y})^q \rho(x, y)\,dx\,dy \quad \text{for continuous data}$$

or

$$\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q \rho(x, y) \quad \text{for discrete data.}$$
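As an illustration of the discrete formulas, the following minimal Python sketch evaluates raw moments, the center, and central moments for a property sampled on a regular grid. The Gaussian "blob" test property, the grid, and all function names are assumptions made for the example; they are not taken from the authors' implementation.

```python
import math

def raw_moment(rho, xs, ys, p, q):
    """Discrete raw moment m_pq = sum_x sum_y x^p y^q rho(x, y)."""
    return sum(x**p * y**q * rho[i][j]
               for i, x in enumerate(xs) for j, y in enumerate(ys))

def centroid(rho, xs, ys):
    """Center (x_bar, y_bar) = (m_10 / m_00, m_01 / m_00)."""
    m00 = raw_moment(rho, xs, ys, 0, 0)
    return (raw_moment(rho, xs, ys, 1, 0) / m00,
            raw_moment(rho, xs, ys, 0, 1) / m00)

def central_moment(rho, xs, ys, p, q):
    """Discrete central moment mu_pq, taken about the center of rho."""
    xb, yb = centroid(rho, xs, ys)
    return sum((x - xb)**p * (y - yb)**q * rho[i][j]
               for i, x in enumerate(xs) for j, y in enumerate(ys))

# Hypothetical test property: an elliptical Gaussian blob on a square grid.
xs = ys = [-5.0 + 0.25 * i for i in range(41)]

def blob(cx, cy):
    return [[math.exp(-((x - cx)**2 + 2.0 * (y - cy)**2)) for y in ys]
            for x in xs]

rho_a = blob(0.0, 0.0)    # centered on the calculation coordinates
rho_b = blob(1.0, -0.5)   # same shape, shifted off-center

print("center of shifted blob:", centroid(rho_b, xs, ys))
print("raw m_20:    ", raw_moment(rho_a, xs, ys, 2, 0),
      raw_moment(rho_b, xs, ys, 2, 0))
print("central mu_20:", central_moment(rho_a, xs, ys, 2, 0),
      central_moment(rho_b, xs, ys, 2, 0))
```

In this toy case the raw moment $m_{20}$ changes when the blob is shifted, whereas the central moment $\mu_{20}$ is unchanged up to grid truncation error, which is the translational invariance the text relies on.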
These central moments have the desired invariance to translation. We have seen how aligning a structure along its principal axes enables us to remove any unwanted rotation. In the case of the moments, Hu utilizes this property of the principal axes to rotate $\rho(x, y)$ so that the property's distribution is aligned along the calculation axes. He does this by forming a variety of combinations of the three second-order central moments and the four third-order central moments. In his paper Hu explains these combinations, which are able to distinguish between structures that exhibit mirror and rotational symmetry:

$$\eta_0 = \mu_{20} + \mu_{02}$$

$$\eta_1 = (\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2$$

$$\eta_2 = (\mu_{30} - 3\mu_{12})^2 + (3\mu_{21} - \mu_{03})^2$$

$$\eta_3 = (\mu_{30} + \mu_{12})^2 + (\mu_{21} + \mu_{03})^2$$

$$\eta_4 = (\mu_{30} - 3\mu_{12})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right] + (3\mu_{21} - \mu_{03})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right]$$

$$\eta_5 = (\mu_{20} - \mu_{02})\left[(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right] + 4\mu_{11}(\mu_{30} + \mu_{12})(\mu_{21} + \mu_{03})$$

$$\eta_6 = (3\mu_{21} - \mu_{03})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right] + (3\mu_{12} - \mu_{30})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right]$$

These seven invariant central moments are all that is required to gain a high degree of sensitivity in pattern recognition. Indeed, in Hu's original paper he used only the first two invariant central moments to implement a crude character recognition system which by all accounts worked remarkably well. Utilizing the invariant central moments is fairly straightforward. All we have to do is to project our molecular property onto a grid and run through the calculations detailed above. This gives us a seven-dimensional vector $\eta_{A,2D}$ which represents the distribution of $\rho_{A,2D}(x, y)$. A second molecule B to be compared can also be subjected to the same treatment, remembering that we no longer have to bother aligning A and B to get the correct answer. This yields a second seven-element vector $\eta_{B,2D}$. The similarity, or more properly in this case the distance between the two molecules, is then given by the Euclidean distance of these two vectors in seven-dimensional space:

$$d_{AB} = \left\lVert \eta_{A,2D} - \eta_{B,2D} \right\rVert = \sqrt{\sum_{i=0}^{6} \left(\eta_{A,i} - \eta_{B,i}\right)^2}$$
Clearly the above equation can be calculated in a fraction of the time required for the Carbó or Hodgkin indices.
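The whole procedure can be sketched in a few lines of Python. The sketch below uses Hu's seven invariants in their standard published form, written in terms of unnormalized central moments as in the text and indexed from 0 to 6. The two-Gaussian toy "molecules", the grid, and every function name are assumptions introduced for illustration only.

```python
import math

def central_moments(rho, xs, ys, max_order=3):
    """Central moments mu[p, q] with p + q <= max_order for gridded rho."""
    m00 = sum(map(sum, rho))
    xb = sum(x * sum(row) for x, row in zip(xs, rho)) / m00
    yb = sum(y * rho[i][j]
             for i in range(len(xs)) for j, y in enumerate(ys)) / m00
    mu = {}
    for p in range(max_order + 1):
        for q in range(max_order + 1 - p):
            mu[p, q] = sum((x - xb)**p * (y - yb)**q * rho[i][j]
                           for i, x in enumerate(xs)
                           for j, y in enumerate(ys))
    return mu

def hu_invariants(mu):
    """Seven Hu invariants built from second- and third-order moments."""
    m20, m02, m11 = mu[2, 0], mu[0, 2], mu[1, 1]
    m30, m03, m21, m12 = mu[3, 0], mu[0, 3], mu[2, 1], mu[1, 2]
    a, b = m30 + m12, m21 + m03
    c, d = m30 - 3 * m12, 3 * m21 - m03
    return [
        m20 + m02,
        (m20 - m02)**2 + 4 * m11**2,
        c**2 + d**2,
        a**2 + b**2,
        c * a * (a**2 - 3 * b**2) + d * b * (3 * a**2 - b**2),
        (m20 - m02) * (a**2 - b**2) + 4 * m11 * a * b,
        d * a * (a**2 - 3 * b**2) - c * b * (3 * a**2 - b**2),
    ]

def moment_distance(eta_a, eta_b):
    """Euclidean distance between two seven-element invariant vectors."""
    return math.sqrt(sum((u - v)**2 for u, v in zip(eta_a, eta_b)))

# Hypothetical gridded properties built from weighted Gaussians.
n = 41
xs = ys = [-5.0 + 0.25 * i for i in range(n)]

def density(blobs):
    return [[sum(w * math.exp(-((x - cx)**2 + (y - cy)**2) / s)
                 for w, cx, cy, s in blobs) for y in ys] for x in xs]

mol_a = density([(1.0, -1.0, 0.0, 1.0), (0.5, 1.0, 0.5, 0.5)])
# The same property rotated by 90 degrees about the grid origin.
mol_a_rot = [[mol_a[j][n - 1 - i] for j in range(n)] for i in range(n)]
mol_b = density([(1.0, -1.5, 0.0, 2.0), (1.0, 1.5, 0.0, 2.0)])

eta_a = hu_invariants(central_moments(mol_a, xs, ys))
eta_rot = hu_invariants(central_moments(mol_a_rot, xs, ys))
eta_b = hu_invariants(central_moments(mol_b, xs, ys))

print("A vs rotated A:", moment_distance(eta_a, eta_rot))
print("A vs B:        ", moment_distance(eta_a, eta_b))
```

The rotated copy of A should give a distance that is zero to within rounding error, with no alignment step, while the differently shaped B gives a clearly nonzero distance. Because the invariants differ greatly in magnitude, a practical implementation would probably rescale them before taking the distance; that choice is not addressed in the text.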
IV. CONCLUSION Using pattern recognition techniques of this type fuels the hope that we might be able to scan a whole library of compounds for examples that are similar to a chosen set of leads or other sources of ideas. At the same time we retain the essentials of the three-dimensional structure which are so important in the binding between a small molecule and its target receptor.
ACKNOWLEDGMENT This work was carried out in part pursuant to a contract with the National Foundation of Cancer Research.
REFERENCES
1. Carbó, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185.
2. Hodgkin, E. E.; Richards, W. G. Int. J. Quantum Chem. Quantum Biol. Symp. 1987, 14, 105.
3. Meyer, A. M.; Richards, W. G. J. Comput.-Aided Mol. Des. 1991, 5, 426.
4. Good, A. C.; So, S.-S.; Richards, W. G. J. Med. Chem. 1993, 36, 433.
5. Barlow, T. W.; Richards, W. G. J. Mol. Graph. 1995, 13, 373.
6. Robinson, D. D.; Barlow, T. W.; Richards, W. G. J. Chem. Inf. Comput. Sci. 1997, 37, 943.
7. Hu, M. K. IRE Trans. Inf. Theory 1962, 179.
8. Papoulis, A. Probability, Random Variables and Stochastic Processes; McGraw-Hill: New York, 1965.
TOPOLOGY AND THE QUANTUM CHEMICAL SHAPE CONCEPT
Paul G. Mezey
I. Introduction
II. Topological Resolution and Molecular Shape
III. Molecular Similarity Measures Based on Topological Resolution of the Shape of Electron Density
IV. Summary
Acknowledgment
References
Advances in Molecular Similarity, Volume 2, pages 79-92. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5

I. INTRODUCTION The concept of molecular shape is not based on direct observation. The size of most molecules is too small for visual examination; moreover, the wavelength range of visible light is not suitable to provide a detailed enough resolution for molecules. This fact has influenced in a fundamental way the special evolution of the molecular shape concept, since, in the absence of direct observation, the shape of molecules is usually perceived to be that of the molecular models used for their representation. Naturally, the early, somewhat simplistic molecular models, such as the "ball and stick" or fused sphere "space-filling" models, could reflect only some highly simplified aspects of molecular shape, yet the shapes of these models have been treated by many chemists as if they were the actual shapes of the molecules. It is remarkable that, even today, one of the primary tools for conveying molecular shape information is the "ball and stick"-type stereodiagram used by many chemists. The more realistic, fuzzy, three-dimensional electron density cloud models, already easily calculable by quantum chemistry methods, have only recently started to become an appreciated tool for molecular shape representation. Molecular electron densities can be represented by three-dimensional density functions $\rho(K, \mathbf{r})$, where $K$ is a specified nuclear configuration and $\mathbf{r}$ is the three-dimensional position variable. With the introduction of the additive fuzzy density fragmentation (AFDF) methods, including the numerical MEDLA (molecular electron density loge assembler) technique and the more advanced analytical ADMA (adjustable density matrix assembler) method, ab initio quality electron densities $\rho(K, \mathbf{r})$ can be calculated for virtually any molecule of chemical and biochemical importance, even for macromolecules such as proteins. The shapes and similarities of these $\rho(K, \mathbf{r})$ electron density clouds can be analyzed in great detail, and additional properties can be studied, such as the forces acting on the various nuclei in macromolecules, leading to conformational changes and changes of folding patterns of long, polymeric chain molecules. Electron density clouds are rather fuzzy objects, where any sufficiently detailed graphical representation of the peripheral, low electron density range is likely to hide from view the higher density range closer to the nuclei. This fact, and the fact that it is difficult to construct any macroscopic model that properly reflects the fuzzy nature of molecules, have apparently contributed to the popularity of the simpler and easily visualizable "ball and stick" and fused sphere "space-filling" models, where molecules appear analogous to macroscopic, classical-mechanistic constructions. Whereas for classical-mechanical objects geometry is the appropriate tool of shape description, for fuzzy, quantum-mechanical molecules, geometrical methods are no longer efficient; by contrast, topology appears as an ideal tool of shape description. Topology allows for flexibility in a very natural way, and it also has the capacity to describe quantum-mechanical uncertainty and the associated fuzziness in a natural and intuitively transparent way. During the past decade, topology (specifically, algebraic topology) has been advocated as a powerful framework for molecular shape description that is compatible with the fundamental, quantum-mechanical nature of molecules. Among the relevant results, the introduction of the molecular shape group methods (SGM) has led both to novel theoretical interpretations of molecular shape properties and to various applications in molecular similarity and complementarity analysis in systematic pharmaceutical drug discovery approaches and in toxicological risk assessment.
In this chapter, some of the more recent advances in the systematization of topological methods of molecular similarity analysis, involving the concept of topological resolution of molecular electron densities $\rho(K, \mathbf{r})$ and their contour surfaces, will be reviewed.
II. TOPOLOGICAL RESOLUTION AND MOLECULAR SHAPE Before discussing the precise formulation of topological resolution and its applications to molecular similarity analysis, I shall briefly review the motivation for the approach and some of the relevant topological concepts. In the study of molecular similarity, structural features of molecules are the most commonly used properties that are analyzed and compared. Depending on the level of detail required, static shape features of molecules can be studied at various levels of resolution; these levels of resolution can be used as a tool for the introduction of various similarity measures. Evidently, if three objects A, B, and C are indistinguishable at some low level of resolution, but at some higher level of resolution A is distinguishable from B and C, while B and C are still indistinguishable, then B and C are more similar to each other than to object A. Hence, the level of resolution required to distinguish objects can be used as a measure of similarity. Since resolutions in the geometrical sense are easily characterized by numbers, this approach, the resolution-based similarity measures (RBSM) approach, provides numerical similarity measures. Originally, the RBSM approach was formulated in terms of geometrical resolution, for example, by placing objects on a regular, rectangular grid and using the occupied cells of the grid for comparisons. The finer the grid, the finer the resolution, and the grid size required served as a numerical measure of similarity. In this contribution I shall discuss a topological generalization of the idea of using resolution to measure the degree of similarity. The switch to topological resolution is simplest if one focuses on the static shape features of molecules.
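The geometrical, grid-based form of the RBSM idea can be illustrated with a minimal Python sketch. Objects are represented here as small sets of two-dimensional points, and the cell sizes are successive halvings of a square grid anchored at the origin; both choices, like the function names, are assumptions made for the example rather than details of the published method.

```python
from math import floor

def occupied_cells(points, cell):
    """Set of grid cells of edge length `cell` that contain a point."""
    return frozenset((floor(x / cell), floor(y / cell)) for x, y in points)

def distinguishing_resolution(obj1, obj2, cell_sizes):
    """Coarsest cell size at which the two objects occupy different cells.

    Returns None if the objects are indistinguishable at every listed size.
    """
    for cell in sorted(cell_sizes, reverse=True):
        if occupied_cells(obj1, cell) != occupied_cells(obj2, cell):
            return cell
    return None

# Three hypothetical objects: B and C differ only in a fine detail.
A = [(0.5, 0.5), (3.5, 0.5)]
B = [(0.5, 0.5), (1.5, 0.5)]
C = [(0.5, 0.5), (1.5, 0.3)]
sizes = [8.0, 4.0, 2.0, 1.0, 0.5]

print("A/B:", distinguishing_resolution(A, B, sizes))
print("A/C:", distinguishing_resolution(A, C, sizes))
print("B/C:", distinguishing_resolution(B, C, sizes))
```

In this toy case A separates from B and from C already at a coarse grid, whereas B and C are told apart only at the finest grid, so B and C are the more similar pair in the sense described above.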
However, static shape features provide a biased representation of the molecule, and in the evaluation of molecular similarity it is also important to use information on chemical reactions and conformational changes, involving molecular interactions. Such interactions are typically determined by local molecular shape properties. In the long-range interactions during the initial stages of a chemical reaction, typically the large-scale features of local molecular moieties are dominant. However, as the reaction progresses, which usually involves a close approach of one molecule by another, details of local shape features become increasingly important. Consequently, the level of resolution required for the analysis of shape features relevant in different stages of molecular interactions does change in the course of the reaction or conformational change induced by one molecule in another. Since the local shape features of both the static and the dynamic representations of molecules are characterized by their topological properties, it is natural to invoke the concept of topological resolution in molecular similarity analysis. In the following paragraphs some of the fundamental concepts of the relevant branches of point set topology are reviewed with special focus on topological resolution, followed by a description of a topological realization of RBSM. Consider a set $X$, and a family $T$ of subsets $T_\alpha$ of $X$, $X \supset T_\alpha$, where the members $T_\alpha$ of family $T$ fulfill the following three conditions:

(i) $$X, \emptyset \in T \tag{1}$$

that is, the original set $X$ and the empty set $\emptyset$ are included in the family $T$; furthermore,

(ii) $$\bigcup_\alpha T_\alpha \in T \tag{2}$$

for any number of sets $T_\alpha$ in the family $T$, and

(iii) $$T_\alpha \cap T_\beta \in T \tag{3}$$
for any two sets $T_\alpha, T_\beta \in T$. If properties (i)-(iii) are fulfilled, then the family $T$ is called a topology on set $X$, and the members $T_\alpha$ of family $T$ are called the $T$-open sets of set $X$. The pair $(X, T)$ is called a topological space. The structure of set $X$ provided with a topology $T$, and various functions defined on $X$, can be studied in terms of the $T$-open sets of $X$. Note that nowhere in the above discussion was it assumed that set $X$ has a geometrical structure, and there is no need even for a distance function for introducing a topology on a set $X$. Nevertheless, properties (i)-(iii) are precisely the most fundamental properties of open sets in a metric space, for example, in a Euclidean space provided with the ordinary, Pythagorean distance function. In a metric space, open sets are defined in terms of distance: a set $Y$ is open within a metric space if for any point $y$ of $Y$ one can find a ball of some nonzero radius, centered on the point $y$, such that the entire ball still falls within $Y$. Invoking balls with some nonzero radius involves the distance function of the metric space, since the ball is the collection of all points with distance from $y$ less than the specified distance chosen as radius. What provides topology with a remarkable versatility is the fact that many of the properties of open sets, and the implied properties for continuous functions defined in terms of these open sets, are fully operational without any reference to distance. This provides a well-controlled flexibility, which is the special hallmark of topology. Furthermore, as evident from requirements (i)-(iii), on any given set $X$ one can introduce many different topologies. This provides a whole range of possible degrees of "flexibility," an important concern in the chemistry of actual, quantum-mechanical, nonrigid molecules. Since there are many different ways a topology $T$ can be chosen, in the study of the topological properties of any object $X$ (for example, a molecule $X$), one must specify the actual topology $T$ used. The following relations are crucial for the introduction of the concept of topological resolution. Consider a set $X$, and assume that for two topologies $T_1$ and $T_2$ on $X$ the following relation holds: every $T_1$-open subset of $X$ is also a $T_2$-open set. This implies that $T_1$ is a subfamily of $T_2$,

$$T_2 \supset T_1 \tag{4}$$
If relation 4 holds, then topology $T_1$ is said to be coarser (or weaker) than topology $T_2$, or one can say that topology $T_2$ is finer (or stronger) than topology $T_1$. Two topologies on a set $X$ do not always relate to one another in such a clear-cut fashion; in fact, a relation such as 4 above is rather special. Two topologies are regarded as incomparable if neither is finer than the other. The finer-coarser relation between topologies on a given set $X$ gives only a partial ordering of the set of all topologies on the set $X$. Exploiting this partial order, the interrelations among topologies on a given set $X$ can be studied using lattice theory. The detailed analysis of a topology, as well as the construction of new topologies on a given set $X$, can be carried out using the concepts of base and subbase of a topology $T$. Consider a subfamily $B$ of family $T$:

$$T \supset B \tag{5}$$

This subfamily $B$ is a base for topology $T$ on $X$ if and only if every $T$-open set $G \in T$ is a union of some sets in $B$. Consider a subfamily $S$ of family $T$:

$$T \supset S \tag{6}$$
This subfamily $S$ is a subbase for topology $T$ on $X$ if and only if finite intersections of elements of $S$ form a base for $T$. The special role of subbases is illustrated by the fact that they can be used to define topologies. In such cases we refer to subfamily $S$ as a defining subbase. By choosing a family of subsets of $X$ as a subbase $S$, and generating a base $B$ by the above recipe (of generating all finite intersections), one can indeed generate a family $T$ that fulfills all three conditions (i)-(iii). Note that a special finite intersection, the empty intersection of subsets of space $X$, is the full space $X$; consequently, $X$ is automatically included in the base $B$ generated by this recipe, and hence $X$ is also included in the family $T$. The empty union of sets from the base $B$ is the empty set $\emptyset$, hence the empty set $\emptyset$ is automatically a member of the family $T$ generated by this recipe. The subbase-base approach provides a very versatile method for generating topologies. Also note that, if for two generating subbases $S_1$ and $S_2$ the relation

$$S_2 \supset S_1 \tag{7}$$

holds, then the corresponding topologies are necessarily comparable, and topology $T_2$ is finer than topology $T_1$,

$$T_2 \supset T_1 \tag{8}$$
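For a finite set, the subbase-base recipe can be carried out explicitly. The following Python sketch is an illustration only: the four-element set, the two subbases, and the function names are assumptions chosen for the example. It generates a topology from a defining subbase, checks conditions (i)-(iii), and confirms that enlarging the subbase yields a finer topology.

```python
from itertools import combinations

def topology_from_subbase(X, subbase):
    """Generate the topology on finite set X defined by a subbase."""
    X = frozenset(X)
    sub = [frozenset(s) for s in subbase]
    # Base: all finite intersections; the empty intersection is X itself.
    base = {X}
    for r in range(1, len(sub) + 1):
        for combo in combinations(sub, r):
            base.add(frozenset.intersection(*combo))
    # Topology: all unions of base sets; the empty union is the empty set.
    topology = {frozenset()}
    for b in base:
        topology |= {t | b for t in topology}
    return topology

def is_topology(X, family):
    """Check conditions (i)-(iii) for a finite family of subsets of X."""
    X = frozenset(X)
    if X not in family or frozenset() not in family:
        return False
    return all((a | b) in family and (a & b) in family
               for a in family for b in family)

X = {1, 2, 3, 4}
S1 = [{1, 2}]
S2 = [{1, 2}, {2, 3}]          # S2 contains S1

T1 = topology_from_subbase(X, S1)
T2 = topology_from_subbase(X, S2)

print(sorted(map(sorted, T1)))
print(sorted(map(sorted, T2)))
print("T2 finer than T1:", T1 <= T2)
```

For a finite family, closure under pairwise unions and intersections is enough to guarantee closure under all unions and all finite intersections, which is why `is_topology` tests only pairs.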
Consider a set $X$ and a family $\mathcal{T}$ of topologies $T_i$ on $X$, where these topologies $T_i$ are fully ordered by the finer-cruder relation:

$$\mathcal{T} = \{T_1, T_2, \ldots, T_i, \ldots\} \tag{9}$$

where

$$T_{i+1} \supset T_i \tag{10}$$

for every index $i$ for which $T_{i+1}$ is included in the family $\mathcal{T}$ of topologies. We shall also use the notation

$$(X, T_{i+1}) \supset (X, T_i) \tag{11}$$
to express the same fact in terms of topological spaces, if the specification of the underlying space $X$ is important. If relation 11 holds, we say that the topological space $(X, T_{i+1})$ is of higher topological resolution than topological space $(X, T_i)$. Topological resolution is defined in terms of the finer-cruder relations. In comparison 11, the former topological space, $(X, T_{i+1})$, provides a more detailed topological description of the underlying space $X$ than the topological space $(X, T_i)$. In particular, a function $f$ that maps $X$ onto itself may be $T_i$-continuous but not $T_{i+1}$-continuous (note that a function is continuous if and only if the inverse image of every open set is open, where openness is interpreted within the actual topologies used). A topological description with a higher level of topological resolution provides more information than one at a lower level of topological resolution.

We consider the three-dimensional, fuzzy electron densities of molecules embedded in the ordinary 3D Euclidean space $E^3$. The shape analysis of these electron densities, leading to the determination of the algebraic-topological shape groups, has been reviewed extensively, and only a brief summary will be given here. For each nuclear arrangement $K$, the molecular electron density function $\rho(K, \mathbf{r})$ can be represented by an infinite family of molecular isodensity contour (MIDCO) surfaces $G(K, a)$, where the density threshold $a$ can take values from the $[0, \infty)$ interval. Each MIDCO is defined as a set

$$G(K, a) = \{\mathbf{r} : \rho(K, \mathbf{r}) = a\} \tag{12}$$

that is, a surface with a specific shape for each nuclear configuration $K$ and each value of the electron density threshold $a$ along the MIDCO.
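A crude numerical picture of definition 12 can be obtained on a grid. In the Python sketch below, the density is a toy sum of two spherical Gaussians (an assumption standing in for a real quantum chemical $\rho(K, \mathbf{r})$), and the MIDCO is approximated by those grid points with density at or above the threshold that have at least one nearest neighbor below it. The function names, grid size, and thresholds are hypothetical.

```python
import math

def toy_density(r, nuclei):
    """Toy density: sum of spherical Gaussians (weight, exponent, center)."""
    return sum(w * math.exp(-alpha * sum((ri - ci)**2
                                         for ri, ci in zip(r, c)))
               for w, alpha, c in nuclei)

def midco_points(nuclei, a, lo=-3.0, hi=3.0, n=25):
    """Grid approximation to the density domain and its boundary G(K, a)."""
    h = (hi - lo) / (n - 1)
    axis = [lo + h * i for i in range(n)]
    inside = {(i, j, k)
              for i in range(n) for j in range(n) for k in range(n)
              if toy_density((axis[i], axis[j], axis[k]), nuclei) >= a}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    # Boundary points: inside the domain, with a neighbor outside it.
    surface = {p for p in inside
               if any((p[0] + d[0], p[1] + d[1], p[2] + d[2]) not in inside
                      for d in steps)}
    return inside, surface

# Hypothetical "diatomic" nuclear configuration K.
K = [(1.0, 1.0, (-0.7, 0.0, 0.0)), (1.0, 1.0, (0.7, 0.0, 0.0))]

inside_lo, surf_lo = midco_points(K, a=0.05)
inside_hi, surf_hi = midco_points(K, a=0.5)
print(len(inside_lo), len(surf_lo))
print(len(inside_hi), len(surf_hi))
```

Lowering the threshold $a$ encloses a larger region, so the contours for different thresholds are nested, as expected for a family of MIDCOs of the same density.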
The local shape of each MIDCO $G(K, a)$ is tested against a range of reference curvatures $b$. For each reference curvature value $b$, the points $\mathbf{r}$ along each MIDCO $G(K, a)$ are classified according to the local curvatures, as being a point where the contour surface is either
1. Locally convex relative to $b$ ($\mathbf{r}$ belonging to a domain of type $D_2(b)$),
2. Locally of the saddle type relative to reference curvature $b$ ($\mathbf{r}$ belonging to a domain of type $D_1(b)$), or
3. Locally concave relative to $b$ ($\mathbf{r}$ belonging to a domain of type $D_0(b)$).
For each MIDCO $G(K, a)$, the domains $D_\mu(b)$ for various curvature types $\mu$ with reference to a curvature value $b$ generate a pattern $P(K, a, b)$ of domains on the MIDCO $G(K, a)$. These patterns $P(K, a, b)$ can be analyzed by topological methods, leading to a description of the interrelations among the domains within each topologically distinct pattern $P(K, a, b)$. Whereas no actual construction of new objects is needed, it is useful to picture the process of focusing on a given curvature type within the actual pattern $P(K, a, b)$ by assuming that the corresponding domain is excised from the MIDCO surface $G(K, a)$. For each MIDCO $G(K, a)$, the domains $D_\mu(b)$ of a specified curvature type $\mu$ with reference to a curvature value $b$ are removed from the MIDCO, leading to a truncated object $G_\mu(K, a, b)$ with a certain set of holes. The algebraic-topological homology groups of the truncated MIDCO $G_\mu(K, a, b)$ are denoted by $H^k_\mu(K, a, b)$, and are, by definition, the shape groups of the molecule. There are three families of these groups, the zero-, one-, and two-dimensional shape groups, where in the notation $H^k_\mu(K, a, b)$ the dimension $k$, the truncation type $\mu$, the nuclear configuration $K$, the density threshold $a$, and the reference curvature $b$ are all specified. Note that some of these specifications are often omitted if they are evident from the context.
The shape groups $H^k_\mu(K, a, b)$ are invariant within small intervals of the threshold values $a$ for electron densities, also within small intervals of reference curvature $b$, and also for some small molecular deformations changing the nuclear arrangement $K$. Note that, for a molecule of $N$ nuclei ($N > 3$), the family of accessible nuclear configurations $K$ forms a subset of the nuclear configuration space $M$, where $M$ is a metric space of $3N - 6$ dimensions. The local invariances of the shape groups $H^k_\mu(K, a, b)$ within the parameter plane $(a, b)$ spanned by the density and curvature thresholds $a$ and $b$, and within domains of the nuclear configuration space $M$, imply that there are only a finite number of shape groups for each molecule. Usually a separate shape group analysis is performed for each specified nuclear configuration $K$ of interest. The finite number of shape groups $H^k_\mu(K, a, b)$ within the parameter plane $(a, b)$ can be characterized by their ranks, called Betti numbers, providing a numerical shape code for each conformation $K$ of each molecule. In some sense, the topological shape group approach represents a reduction of the information content of a three-dimensional continuum of a fuzzy electron density cloud $\rho(K, \mathbf{r})$ to a set of discrete Betti numbers, in a process that retains the essential shape information about molecules. The shape codes provide a concise representation of shape information. These shape codes can be compared numerically, in a process that is much simpler than the direct comparison of molecular electron densities. Note that in direct density comparisons the mutual orientation of the molecules must be optimized in the initial step of shape comparison. No such optimum superposition is needed when evaluating the similarity of molecules based on their shape codes. These direct, numerical comparisons of the topological shape codes are used to compute numerical shape similarity measures and complementarity measures for the fuzzy electron density clouds of molecules. The shape group approach provides a detailed shape description and shape comparison. In most instances, such a detailed shape description is required to detect and interpret the shape features of the electron density $\rho(K, \mathbf{r})$ relevant to a given chemical problem. However, in some cases, one does not need a complete shape analysis of the electron density $\rho(K, \mathbf{r})$, and the focus can be shifted from the details to some of the more prominent shape features. Furthermore, for large molecules, the large amount of detail obtained in a complete shape group analysis of the entire electron density $\rho(K, \mathbf{r})$ may render the computational task and the interpretation of the results cumbersome. In such cases, it is warranted to use alternative shape characterization methods where the level of detail studied can be appropriately modified. A natural condition that can be used to control the amount of detail is the level of resolution. In the next section, a new set of molecular similarity measures will be discussed, based on the concept of topological resolution of fuzzy electron density clouds $\rho(K, \mathbf{r})$.
III. MOLECULAR SIMILARITY MEASURES BASED ON TOPOLOGICAL RESOLUTION OF THE SHAPE OF ELECTRON DENSITY Different ranges of the electron density threshold parameter $a$ and of the reference curvature $b$ provide a natural approach to shifting the emphasis between the local details and the large-scale features of the shapes of molecular electron densities $\rho(K, \mathbf{r})$. For example, the MIDCO surfaces $G(K, a)$ corresponding to low electron density thresholds $a$ usually exhibit less detail than the high-density contours $G(K, a)$ running closer to the atomic nuclei in the molecule. However, for our present analysis, the role of range selection of the reference curvature parameter $b$ is more important than the choice of density threshold $a$. When considering a reference curvature $b$ of high negative value, one finds that at most points $\mathbf{r}$ along a MIDCO $G(K, a)$, even for MIDCOs with high values of the electron density threshold $a$, both of the local canonical curvatures of the surface (that is, both eigenvalues of the local Hessian matrix expressing the local curvatures at point $\mathbf{r}$) are greater than the reference curvature $b$. Consequently, most if not all points $\mathbf{r}$ of the MIDCO $G(K, a)$ belong to a curvature domain of type $D_0(b)$ of relative concavity with reference to the curvature $b$. This, in turn, implies that a truncation of type $\mu = 2$ eliminates only a few domains or no domain at all from the MIDCO $G(K, a)$. For example, if the local canonical curvatures at all points $\mathbf{r}$ of the MIDCO $G(K, a)$ are greater than the reference curvature $b$, then no truncation occurs, resulting in the coincidence

$$G_2(K, a, b) = G(K, a) \tag{13}$$

and in the associated trivial group as shape group $H^1_2(K, a, b)$,

$$H^1_2(K, a, b) = \{0\} \tag{14}$$
If the value of the reference curvature is gradually increased, eventually more and more local canonical curvatures fall below the value $b$, and more domains become subject to elimination from the MIDCO $G(K, a)$, resulting in topologically distinct objects,

$$G_2(K, a, b) \neq G(K, a) \tag{15}$$

as well as in nontrivial groups as shape groups $H^1_2(K, a, b)$. In fact, more detail of the shape properties of the MIDCO $G(K, a)$ of the molecular electron density function $\rho(K, \mathbf{r})$ becomes accessible. It is possible to consider all of the topological changes of the truncated objects $G_2(K, a, b)$ as the value of the reference curvature $b$ is increased, and this approach has merits if the goal is to detect the actual threshold values for $b$ where the topological change occurs. However, it is computationally simpler if one focuses on a finite series of selected $b$ values:

$$b_1, b_2, \ldots, b_i, \ldots, b_m \tag{16}$$

where

$$b_i < b_{i+1} \tag{17}$$

and determines the truncated MIDCO $G_2(K, a, b_i)$ for each $b_i$ value. This leads to the series of truncated MIDCOs

$$G_2(K, a, b_1),\ G_2(K, a, b_2),\ \ldots,\ G_2(K, a, b_i),\ \ldots,\ G_2(K, a, b_m) \tag{18}$$

and, if the focus is not restricted to the curvature domains $D_\mu(b)$ of type $\mu = 2$, to the associated pattern series

$$P(K, a, b_1),\ P(K, a, b_2),\ \ldots,\ P(K, a, b_i),\ \ldots,\ P(K, a, b_m) \tag{19}$$
Note that all of these patterns $P(K, a, b_i)$ are generated on the same MIDCO $G(K, a)$. Consider now the combined pattern
PAUL G. MEZEY
$$P(K, a, b_1 \ldots b_i) \tag{20}$$

obtained by superimposing all patterns on $G(K, a)$ involving the subsequence up to and including index $i$:

$$P(K, a, b_1),\; P(K, a, b_2),\; \ldots,\; P(K, a, b_i) \tag{21}$$

The set of $D_\mu(b_i)$ domains within each pattern $P(K, a, b_i)$ can be regarded as a defining subbase $S(K, a, b_i)$ for a particular topology $T(K, a, b_i)$ on the MIDCO $G(K, a)$,

$$S(K, a, b_i) = \{D_\mu(b_i)\} \tag{22}$$

each generating a formal topological space

$$(G(K, a),\; T(K, a, b_i)) \tag{23}$$
The superposition of patterns on the same MIDCO $G(K, a)$ generates finite intersections of domains belonging to various reference curvatures $b_j$. Consequently, the subbase $S'(K, a, b_1 \ldots b_i)$ taken as the set of all such intersections of domains $D_\mu(b_j)$ for $j = 1, 2, \ldots, i$ generates a topology $T'$ that is the same as the topology $T(K, a, b_1 \ldots b_i)$ obtained with a subbase $S(K, a, b_1 \ldots b_i)$ defined as the union of subbases $S(K, a, b_1), S(K, a, b_2), \ldots, S(K, a, b_i)$:

$$S(K, a, b_1 \ldots b_i) = \bigcup_{j=1}^{i} S(K, a, b_j) \tag{24}$$

where

$$S'(K, a, b_1 \ldots b_i) = S(K, a, b_1 \ldots b_i) \tag{25}$$

hence

$$T' = T(K, a, b_1 \ldots b_i) \tag{26}$$
Note that for the defining subbases $S(K, a, b_1 \ldots b_i)$ of these topologies $T(K, a, b_1 \ldots b_i)$ the relation

$$S(K, a, b_1 \ldots b_{i+1}) \supset S(K, a, b_1 \ldots b_i) \tag{27}$$

must hold for any choice of index $i$, $1 \le i \le m - 1$, as a consequence of the definition of subbase $S(K, a, b_1 \ldots b_i)$ as the union (24) of all of the individual subbases $S(K, a, b_k)$ of indices $k$ up to and including the index $i$. Consequently, the corresponding topologies $T(K, a, b_1 \ldots b_i)$ are also fully ordered,

$$T(K, a, b_1 \ldots b_m) \supset T(K, a, b_1 \ldots b_{m-1}) \supset \cdots \supset T(K, a, b_1 \ldots b_i) \supset \cdots \supset T(K, a, b_1) \tag{28}$$
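The nesting expressed by relations (27) and (28) can be checked on a small finite model. The sketch below is only an illustration under strong assumptions: the MIDCO is replaced by six hypothetical sample points, the curvature domains for three reference curvatures are invented for the example, and the function name `topology_from_subbase` is ours rather than part of any published program.

```python
def topology_from_subbase(points, subbase):
    """Return the topology generated by `subbase` on a finite point set."""
    whole = frozenset(points)
    # Base: all finite intersections of subbase members
    # (the empty intersection gives the whole set).
    base = {whole}
    for member in subbase:
        base |= {b & member for b in base}
    # Topology: all unions of base members
    # (the empty union gives the empty set).
    topology = {frozenset()}
    for b in base:
        topology |= {t | b for t in topology}
    return topology


# Hypothetical sample points on a MIDCO and invented domain patterns
# for three reference curvatures b1 < b2 < b3 (assumptions for illustration).
points = range(6)
patterns = [
    [frozenset({0, 1, 2, 3, 4, 5})],                             # b1: single domain
    [frozenset({0, 1, 2, 3}), frozenset({4, 5})],                # b2
    [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5})],   # b3
]

topologies = []
subbase = set()
for pattern in patterns:
    subbase |= set(pattern)  # cumulative subbase, in the spirit of Eq. (24)
    topologies.append(topology_from_subbase(points, subbase))

sizes = [len(t) for t in topologies]
nested = all(topologies[i] <= topologies[i + 1] for i in range(len(topologies) - 1))
print(sizes, nested)  # -> [2, 4, 16] True
```

In this toy model each added reference curvature can only enlarge the cumulative subbase, so each generated topology contains the previous one, which is the ordering stated in (28).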
This complete ordering implies that these topologies $T(K, a, b_1 \ldots b_i)$ provide a monotonic series suitable for a systematic adaptation of the techniques of topological resolution for the shape characterization of the MIDCO $G(K, a)$, using gradually increasing topological resolution as index $i$ sweeps over the interval $[1, m]$. Based on these topologies $T(K, a, b_1 \ldots b_i)$, an RBSM, or more precisely, a topological RBSM (in short, TRBSM), can be constructed for molecules, as follows. Take two molecules, $M_1$ and $M_2$, of nuclear configurations $K_1$ and $K_2$, respectively, and consider two respective density thresholds $a_1$ and $a_2$, and the associated two MIDCOs $G(K_1, a_1)$ and $G(K_2, a_2)$. Generate the series of topologies $T(K_1, a_1, b_1 \ldots b_i)$ and $T(K_2, a_2, b_1 \ldots b_i)$, respectively, for the range $1 \le i \le m$ of indices $i$. We say that the two MIDCOs $G(K_1, a_1)$ and $G(K_2, a_2)$ have equivalent shapes at the level $i$ of topological resolution if and only if there is a one-to-one and onto correspondence between the two defining subbases $S(K_1, a_1, b_1 \ldots b_i)$ and $S(K_2, a_2, b_1 \ldots b_i)$ that preserves the curvature index $\mu$ assignment of each element of the subbase for each sublevel $k$, $1 \le k \le i$. This equivalence is denoted by

$$G(K_1, a_1)\ \mathrm{eq}[b_1 \ldots b_i]\ G(K_2, a_2) \tag{29}$$

where the relevant reference curvatures $[b_1 \ldots b_i]$ are specified, or in short, by

$$G(K_1, a_1)\ \mathrm{eq}\ G(K_2, a_2) \tag{30}$$
if the choice of reference curvatures and the resulting topologies used are evident from the context. Note that, if the first reference curvature $b_1$ is chosen as a large enough negative value, then

$$G(K_1, a_1)\ \mathrm{eq}[b_1]\ G(K_2, a_2) \tag{31}$$

for practically any two molecules $M_1$ and $M_2$, of any two nuclear configurations $K_1$ and $K_2$, as long as both density thresholds are small enough so that both MIDCOs $G(K_1, a_1)$ and $G(K_2, a_2)$ are topological spheres, since then both patterns $P(K_1, a_1, b_1)$ and $P(K_2, a_2, b_1)$ contain only a single domain $D_0(b_1)$. That is, a common starting point exists for any two molecules $M_1$ and $M_2$ at a rather crude level of topological resolution where their shapes appear equivalent, and this is not affected by how different the two molecules and their two nuclear configurations $K_1$ and $K_2$ actually are. However, by gradually increasing the resolution of the topologies, the shape differences gradually manifest themselves, and at some, possibly high positive value of the reference curvature $b_i$, even slight conformational deviations of the same molecule are necessarily detected by the method, resulting in a nonequivalence of the shapes of the two
MIDCOs at the corresponding level $i$ of topological resolution, where this nonequivalence is denoted by

$$G(K_1, a_1)\ \mathrm{noneq}[b_1 \ldots b_i]\ G(K_2, a_2) \tag{32}$$
In general, a higher index $i$ for equivalence implies a finer (stronger) topology, that is, a higher level of topological resolution. Equivalence at a higher level of topological resolution implies a higher level of similarity. For example, if for three molecules $M_1$, $M_2$, and $M_3$ the maximum indices of equivalence for the various pairings are $i_{12}$ for $G(K_1, a_1)$ and $G(K_2, a_2)$, $i_{13}$ for $G(K_1, a_1)$ and $G(K_3, a_3)$, and $i_{23}$ for $G(K_2, a_2)$ and $G(K_3, a_3)$, where

$$i_{12} < i_{13} < i_{23} \tag{33}$$

then the similarity of the shapes of MIDCOs is the highest for the pair $G(K_2, a_2)$ and $G(K_3, a_3)$, and the lowest for the pair $G(K_1, a_1)$ and $G(K_2, a_2)$.

The similarity measure based on the above implementation of the concept of topological resolution is a tool that allows one to focus on the essential shape features of molecular electron densities in a systematic manner. By gradually increasing the topological resolution, finer and finer details of shape are detected by the method, and for each type of chemical problem, it is possible to select the level of resolution that reflects the relevant size, shape, and detail requirements. The quantity $m - i_{12}$ may serve as a dissimilarity measure, although $m - i_{12}$ is not a metric on the abstract space of all shapes, since for a fixed selection of the sequence of reference curvatures, two actual shapes of slight geometrical differences may exhibit topological equivalence, unless the highest selected reference curvature value is sufficiently increased.
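The maximum index of equivalence and the derived dissimilarity $m - i$ can be sketched in a few lines. The example below is a simplification under stated assumptions: each pattern is summarized only by the number of domains of each curvature index $\mu$ at each reference curvature (which is enough to decide whether a label-preserving one-to-one correspondence between finite subbases exists), and the three "molecules" and their domain counts are hypothetical.

```python
from collections import Counter


def max_equivalence_index(profile_a, profile_b):
    """Largest level i such that the domain counts agree for every sublevel k <= i."""
    level = 0
    for counts_a, counts_b in zip(profile_a, profile_b):
        if Counter(counts_a) != Counter(counts_b):
            break
        level += 1
    return level


# Hypothetical shape profiles for m = 4 reference curvatures b1 < ... < b4.
# Entry k maps curvature index mu -> number of D_mu(b_k) domains.
m = 4
profiles = {
    "M1": [{0: 1}, {0: 1, 1: 2}, {0: 2, 1: 2, 2: 1}, {0: 2, 1: 3, 2: 2}],
    "M2": [{0: 1}, {0: 2, 1: 1}, {0: 2, 1: 2, 2: 1}, {0: 3, 1: 3, 2: 2}],
    "M3": [{0: 1}, {0: 2, 1: 1}, {0: 2, 1: 2, 2: 1}, {0: 2, 1: 3, 2: 2}],
}

pairs = [("M1", "M2"), ("M1", "M3"), ("M2", "M3")]
index = {p: max_equivalence_index(profiles[p[0]], profiles[p[1]]) for p in pairs}
dissimilarity = {p: m - i for p, i in index.items()}

print(index)          # maximum level of equivalence for each pair
print(dissimilarity)  # m - i for each pair
```

All three toy profiles agree at the crude level $b_1$ (a single $D_0$ domain), mirroring the common starting point of relation (31); the pair that stays equivalent up to the highest level receives the smallest dissimilarity.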
IV. SUMMARY

After a brief review of the fundamental topological concepts of molecular shape analysis and topological similarity methods, a novel alternative method of similarity evaluation is described, based on the concept of topological resolution. The motivation and justification of the approach are presented in the context of the quantum-chemical shape properties of fuzzy molecular electron density distributions. An implementation of the topological resolution method is described, utilizing the local convexity shape domains already available from the earlier approach of SGM.
ACKNOWLEDGMENT

The original research reported in this account was supported by a research grant from the Natural Sciences and Engineering Research Council of Canada.
STRUCTURAL SIMILARITY ANALYSIS BASED ON TOPOLOGICAL FRAGMENT SPECTRA
Yoshimasa Takahashi, Hiroaki Ohoka, and Yuichi Ishiyama
Abstract
I. Introduction
II. Methods
   A. Topological Fragment Spectrum
   B. Quantitative Evaluation of Structural Similarity Based on TFS
III. Results and Discussion
   A. Structural Similarity Analysis of 42 Psychotropic Agents
   B. Subspectrum Use
   C. Application to Similar Structure Search in Chemical Database
IV. Concluding Remarks
Acknowledgment
References
Advances in Molecular Similarity, Volume 2, pages 93-104. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
ABSTRACT

This paper discusses an approach to structural similarity analysis of chemical compounds using the topological fragment spectrum. The latter is based on the enumeration and numerical characterization of all possible substructures from a chemical structure. Here, the subspectra based on structural fragments having up to a specified size (number of edges) were also used to describe the topological profile of a chemical structure. The performance of the approach was examined in a structural similarity analysis of 42 psychotropic drugs. The application of this method to similar structure searching in a chemical database is also discussed.
I. INTRODUCTION

The use of molecular similarity methods, especially structural similarity, is under active development in the area of drug design, for the selection of analogues for chemicals and for the estimation of molecular properties.[1,2] The rationale is that structurally similar compounds are likely to possess similar molecular properties and similar biological activities. There are two different approaches for structural similarity analysis. One approach focuses on the identification of common and similar parts between the molecules that are to be compared. There have been several important attempts to determine structural fragments that are common among a group of chemical compounds.[3-8] This is also closely related to the maximal common substructure (MCS) problem.[9-12] In these studies, for a group of compounds that show the same or similar biological activities although they are different in chemical structure, it is assumed that the activity in question should be attributed to a particular molecular feature that they all have in common.[13] The other approach focuses on the degree of similarity between structures. This is a problem of quantitatively evaluating the similarity of chemical structures. Substructural analysis described by Cramer et al.[14] and Adamson and Bush[15] gives us a basis for approaching this problem. It is based on the correlation of activities and properties of chemical compounds with the occurrence of particular partial structures that are embedded within their structures. Adamson and Bush[16] also reported the quantification of the similarity between chemical structures on the basis of their substructural analysis. In their work, 39 local anesthetics were classified numerically by calculating similarity or dissimilarity coefficients between pairs of the structures and applying cluster analysis to the obtained results. Using such an approach, Willett and Winterman[17,18] examined the effectiveness of various measures for the evaluation of molecular similarity for the clustering of chemical compounds. These approaches are based on the generation of substructural descriptors from chemical structures and their use in quantitative evaluation by any similarity measure. Graph theoretical analysis has also been applied to the ordering of chemical structures, followed by correlative analysis of molecular properties.[19,20] A
topological index is a numerical value representing the topological nature of molecular structure. The structural similarity or dissimilarity has been examined using some kinds of topological indices to provide the ordering of molecules.[20,21] However, in many cases, substructural descriptors are more often used to define intermolecular similarity.[22-27] In general, topological/2D substructure descriptors are extremely useful for large-scale data because most chemical databases contain only connection tables without three-dimensional information. Recently, Brown and Martin[28] have evaluated a variety of substructure-based clustering methods for use in compound selection. The results suggest that 2D descriptors and hierarchical clustering methods yield better results than 3D descriptors and other methods. However, quantification of such structural similarity still depends on the chosen set of substructures defined as the descriptors in advance. Nevertheless, the concept of structural similarity is quite important for the further intelligent use of computers in the chemical field. In that case, an approach is required to process structural information in a more flexible way so as to allow the automatic evaluation of the more ambiguous global structural similarity; in other words, a method to examine the similarity of structures when they are regarded as whole entities. Here we describe an alternative approach using the exhaustive fragmentation profile of a chemical structure for the quantification of structural similarity or diversity of chemical compounds.
II. METHODS

A. Topological Fragment Spectrum
The topological fragment spectrum (TFS) has been proposed by the authors[29] as a tool for the description of the topological structure profile of a molecule. The approach is based on enumeration of all possible substructures from a chemical structure and numerical characterization of them. The procedure involves two major steps: (1) enumeration of all possible substructures from each chemical structure and (2) numerical characterization of the substructures. Both are described in the following section.

Enumeration and Characterization of the Substructures
For a given structure represented as a chemical graph (hydrogen-suppressed graph), all of the possible substructures embedded in it are enumerated. Here, all of the substructures of the parent structure (the original chemical graph) are taken into account in the following processes even if they are isomorphic. Subsequently, all of the individual substructures are characterized with a numerical quantity. To perform this characterization we have used two methods in the present study as follows: (1) the overall sum of degrees of the nodes composing each subgraph and (2) the overall sum of the mass numbers of the atoms (atomic groups) corresponding
to the nodes of the subgraph. With the first method the chemical structure is represented by a simple graph where the hydrogen atoms are omitted, the characterization of the structure thus depending only on the topology of the structural skeleton. An illustrative scheme of the procedure is shown in Figure 1. For the second method, attached hydrogen atoms are taken into account as augmented
[Figure 1, panels (a)-(c): (a) subgraphs grouped by number of edges, from S(e) = 0 to S(e) = 4; (b) histogram of occurrences versus the sum of the degrees of nodes; (c) X = (3,1,2,2,2,3,3,1).]

Figure 1. An illustrative scheme for the generation of a topological fragment spectrum (TFS). (a) Enumeration of all possible subgraphs from a chemical graph of 2-methylbutane. (b) TFS characterized by the sum of degrees of nodes in the subgraphs. (c) Vector representation of the TFS.
atoms and are represented by correspondingly weighting their respective nodes in the graph.

Description of the Structural Profile by a Spectral Representation
We define as the "topological fragment spectrum" (TFS, or fragment spectrum in what follows) the histogram obtained by displaying the frequency distribution of a set of individually characterized substructures (structural fragments) according to the value of their characterization index. The fragment spectrum generated in this manner is a representation of the topological structural profile of a molecule. The fragment spectra of promazine as characterized by the two methods mentioned above are shown in Figure 2.

B. Quantitative Evaluation of Structural Similarity Based on TFS
Obviously, the fragment spectrum obtained by these methods can be described as a kind of multidimensional pattern vector. Consequently, using this pattern representation of a spectrum it is possible to apply various quantitative measures
[Figure 2, panels (a) and (b): (a) Fragment spectrum characterized by the sum of degrees of nodes in fragments. (b) Fragment spectrum characterized by the sum of atomic mass numbers in fragments.]

Figure 2. Promazine fragment spectra generated by different characterization methods.
for the evaluation of similarity. In the present work, the following measure was used for evaluating the similarity:

$$S(X_i, X_j) = 1 - \frac{D(X_i, X_j)}{D_{\max}} \tag{1}$$

where $X_i$ and $X_j$ are pattern vectors representing the fragment spectra of the $i$th and $j$th molecule, respectively, $D(X_i, X_j)$ is the Euclidean distance between patterns $X_i$ and $X_j$, and $D_{\max}$ is the maximum distance among the set of molecules under analysis. The different dimensionalities of the spectra to be compared are adjusted as follows: if

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{iq}) \quad \text{and} \quad X_j = (x_{j1}, x_{j2}, \ldots, x_{jq}, x_{j(q+1)}, \ldots, x_{jp}) \quad (q < p)$$

then $X_i$ is redefined as

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{iq}, x_{i(q+1)}, \ldots, x_{ip})$$

where $x_{i(q+1)} = x_{i(q+2)} = \cdots = x_{ip} = 0$.
This approach was fully computerized and used in the following structural similarity analysis and similar structure searching in a chemical database.
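The two steps of Section II.A and the similarity measure of Eq. 1 can be sketched together for very small graphs. This is a minimal illustration, not the authors' program: fragments are taken to be single atoms plus every connected set of edges, node degrees are taken in the parent graph (an assumption that reproduces the vector of Figure 1c), the enumeration is brute force, and all function names are hypothetical.

```python
from itertools import combinations
from math import dist


def is_connected(edge_subset):
    """True if the edges in `edge_subset` form one connected subgraph."""
    nodes = {n for edge in edge_subset for n in edge}
    start = edge_subset[0][0]
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in edge_subset:
            if a == current and b not in seen:
                seen.add(b)
                stack.append(b)
            elif b == current and a not in seen:
                seen.add(a)
                stack.append(a)
    return seen == nodes


def fragment_spectrum(edges):
    """TFS of a hydrogen-suppressed graph; each fragment is characterized
    by the sum of the parent-graph degrees of its nodes."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    values = list(degree.values())  # single atoms, S(e) = 0
    for size in range(1, len(edges) + 1):
        for subset in combinations(edges, size):
            if is_connected(subset):
                nodes = {n for edge in subset for n in edge}
                values.append(sum(degree[n] for n in nodes))
    spectrum = [0] * max(values)
    for value in values:
        spectrum[value - 1] += 1  # element k-1 counts fragments with index k
    return spectrum


def similarity_matrix(spectra):
    """S(Xi, Xj) = 1 - D(Xi, Xj) / Dmax, after zero-padding to equal length."""
    p = max(len(x) for x in spectra)
    padded = [list(x) + [0] * (p - len(x)) for x in spectra]
    d = [[dist(xi, xj) for xj in padded] for xi in padded]
    d_max = max(max(row) for row in d)
    return [[1 - dij / d_max for dij in row] for row in d]


# Toy data set: 2-methylbutane (Figure 1), n-butane, and propane.
methylbutane = [(1, 2), (2, 3), (3, 4), (2, 5)]
butane = [(1, 2), (2, 3), (3, 4)]
propane = [(1, 2), (2, 3)]

spectra = [fragment_spectrum(g) for g in (methylbutane, butane, propane)]
print(spectra[0])  # -> [3, 1, 2, 2, 2, 3, 3, 1]
s = similarity_matrix(spectra)
```

Because $D_{\max}$ is taken over the set under analysis, the least similar pair in the set always receives $S = 0$, so the values are relative to the data set rather than absolute.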
III. RESULTS AND DISCUSSION

A. Structural Similarity Analysis of 42 Psychotropic Agents
The approach described above was applied to the structural similarity analysis of 42 psychotropic agents. The sum of the atomic mass numbers was employed in the characterization of substructures. The evaluation of their structural similarity was carried out with Eq. 1. The result obtained was summarized as a spanning tree of the 42 psychotropic agents constructed in the nearest-neighbor manner. The spanning tree is shown in Figure 3. Each link in the spanning tree shows the most similar pair of structures. It is difficult to conclude whether the resulting intermolecular similarity relationship is reasonable or not. On the basis of our intuitive analysis, it seems that the present approach yields a successful result. It is worth noting, however, that promazine (structure 7 in Figure 3) and its analogues were not clustered in a clear or rational manner. The first five nearest neighbors of promazine were compounds 27, 31, 19, 29, and 9 in this order. We consider that the many "zero" elements introduced to adjust for the different dimensionalities of the fragment spectra being compared produced this unfavorable result.

B. Subspectrum Use
In the previous section, structural similarity analysis was carried out for 42 psychotropic agents using their full fragment spectra. However, the computational time required for the exhaustive enumeration of all possible substructures from a chemical structure is often very large especially for molecules that contain highly fused rings. In addition, a large difference in the dimensionality between the
Figure 3. A spanning tree of 42 psychotropic agents based on their structural similarities. The analysis was carried out with the full fragment spectra in which every fragment is characterized by the fragmental weight.
fragment spectra to be compared may lead to an unexpected result. To avoid these problems, an alternative approach based on the use of a subspectrum may be employed for such a similarity analysis, in which each spectrum is described with structural fragments up to a specified size in the number of edges (bonds). Figure 4 shows an example TFS subspectrum for promazine. In this way, the similarity analysis was carried out for the same data set used above. The result of the structural similarity analysis was also summarized as the spanning tree shown in Figure 5, which seems to show more reasonable interrelationships for their structural similarity. In fact, for this case the first five nearest neighbors of promazine (7) were compounds 31, 32, 30, 38, and 33 in this order. Figure 6 shows their subspectra and that of promazine.

[Figure 4, two panels: full fragment spectrum of promazine; subspectrum for the set of fragments with S(e) ≤ 5.]

Figure 4. TFS full spectrum of promazine and its subspectrum derived from the small fragments. The spectrum is exemplified for fragments having S(e) ≤ 5.

C. Application to Similar Structure Search in Chemical Database
Obviously, the TFS is applicable to similar structure searches in chemical databases. To test the applicability of the TFS approach to this problem, we prepared a structure database consisting of 1600 perfume and flavor chemicals taken from Arctander's compilation.[30] The database contains a variety of molecules (e.g.,
Figure 5. Result of similarity analysis for 42 psychotropic agents using TFS subspectra.
alcohols, terpenes, steroids) that are structurally diverse. The TFS subspectral approach was employed for the similar structure search. All of the spectra were characterized for fragments having five or fewer bonds. These spectra of the molecules in the database were generated and stored, in advance, as an index. A result of the similar structure search is shown in Figure 7, in which decamethylene oxalate was used as a query structure. Figure 7 shows the ten compounds most similar to the query structure, ranked according to their similarities. It is quite difficult to evaluate the result of the similar structure search, because the similarity evaluated depends on an individual's interests, experiences, and so on. Furthermore, it often depends on one's psychological situation. Nonetheless, we still believe that this result shows the structural similarities of these compounds very well. In conclusion, these results suggest that our approach succeeds in the evaluation of structural similarity or structural diversity of organic compounds in place of the
Figure 6. The first five nearest neighbors of promazine (in the order shown) and their TFS subspectra from fragments having S(e) ≤ 5.
[Figure 7: query structure (decamethylene oxalate) followed by the retrieved structures, numbered up to 10.]

Figure 7. Result of similar structure search in a structure database of flavor and perfume chemicals.
human eye and brain. The current approach is also applicable to similar structure search on a chemical compound database.
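The search procedure described above (subspectra limited to small fragments, stored in advance as an index, then ranked against a query) can be sketched as follows. Everything concrete here is an assumption made for illustration: the four toy molecules are not taken from Arctander's compilation, the node weights are approximate mass numbers of hydrogen-augmented atoms (CH3 = 15, CH2 = 14, O = 16, OH = 17), and the spectrum is stored sparsely as a counter so that missing entries act like the zero-padding of Eq. 1.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def connected(edge_subset):
    """True if the given edges form a single connected subgraph."""
    nodes = {n for e in edge_subset for n in e}
    seen, stack = set(), [edge_subset[0][0]]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for a, b in edge_subset:
            if a == current:
                stack.append(b)
            elif b == current:
                stack.append(a)
    return seen == nodes


def subspectrum(weights, edges, max_edges=5):
    """Sparse TFS: fragment weight -> number of fragments with <= max_edges bonds."""
    spectrum = Counter(weights.values())  # single atoms, S(e) = 0
    for size in range(1, min(max_edges, len(edges)) + 1):
        for subset in combinations(edges, size):
            if connected(subset):
                nodes = {n for e in subset for n in e}
                spectrum[sum(weights[n] for n in nodes)] += 1
    return spectrum


def distance(x, y):
    """Euclidean distance; keys absent from one spectrum count as zero."""
    return sqrt(sum((x[k] - y[k]) ** 2 for k in x.keys() | y.keys()))


def chain(n):
    """Edges of an unbranched chain of n nodes numbered 1..n."""
    return [(i, i + 1) for i in range(1, n)]


# Hypothetical toy database: name -> (node weights, edges).
database = {
    "butane": ({1: 15, 2: 14, 3: 14, 4: 15}, chain(4)),
    "diethyl ether": ({1: 15, 2: 14, 3: 16, 4: 14, 5: 15}, chain(5)),
    "propan-1-ol": ({1: 15, 2: 14, 3: 14, 4: 17}, chain(4)),
}

# Spectra are generated once and stored as an index.
index = {name: subspectrum(w, e) for name, (w, e) in database.items()}

# Query: butan-1-ol.
query = subspectrum({1: 15, 2: 14, 3: 14, 4: 14, 5: 17}, chain(5))
ranked = sorted(((name, distance(query, spec)) for name, spec in index.items()),
                key=lambda item: item[1])
for name, d in ranked:
    print(f"{name}: {d:.3f}")
```

Ranking by distance gives the same order as ranking by the similarity of Eq. 1, since $D_{\max}$ is a common scale factor; the sparse counter simply avoids allocating long vectors when fragment weights are large.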
IV. CONCLUDING REMARKS

The TFS can be viewed as a useful tool to describe the topological structural profile of a molecule. The work presented here suggests that it is applicable to structural similarity analysis in a quantitative manner. The TFS is also useful for similar structure search in chemical databases. The advantages are that a molecule's TFS can be computed on a PC from an ordinary connection table, and that generating it does not require any kind of a priori substructural definition such as a fragment dictionary. In the present work, we used fragment weight for the characterization of the enumerated structural fragments. This allows us to describe the structural profile of a molecule numerically. The approach can also be easily extended to use other characterization schemes for the fragments.
ACKNOWLEDGMENT

This work was supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science, Sports and Culture of Japan.
REFERENCES

1. Johnson, M. A.; Maggiora, G. M., Eds. Concepts and Applications of Molecular Similarity; Wiley: New York, 1990.
2. Sen, K., Ed. Molecular Similarity I, Top. Curr. Chem. 1995, 173. Sen, K., Ed. Molecular Similarity II, Top. Curr. Chem. 1995, 174.
3. Carhart, R. E.; Smith, D. H. J. Chem. Inf. Comput. Sci. 1985, 25, 64.
4. Bawden, D. In Concepts and Applications of Molecular Similarity; Johnson, M. A.; Maggiora, G. M., Eds.; Wiley: New York, 1990, p. 65.
5. Judson, P. N. J. Chem. Inf. Comput. Sci. 1992, 32, 657.
6. Barnard, J. M.; Downs, G. M. J. Chem. Inf. Comput. Sci. 1992, 32, 644.
7. Downs, G. M.; Willett, P. Rev. Comput. Chem. 1995, 7, 1.
8. Lipkus, A. H. J. Chem. Inf. Comput. Sci. 1997, 37, 92.
9. Varkony, T. H.; Shiloach, Y.; Smith, D. H. J. Chem. Inf. Comput. Sci. 1979, 19, 104.
10. Klopman, G. J. Am. Chem. Soc. 1984, 106, 7315.
11. Takahashi, Y.; Sukekawa, M.; Sasaki, S. J. Chem. Inf. Comput. Sci. 1992, 32, 639.
12. Bayada, D. M.; Johnson, P. In Molecular Similarity and Reactivity; Carbó, R., Ed.; Kluwer: Dordrecht, 1995, p. 243.
13. Takahashi, Y. Top. Curr. Chem. 1995, 174, 105.
14. Cramer, R. D., III; Redl, G.; Berkoff, C. E. J. Med. Chem. 1974, 17, 533.
15. Adamson, G. W.; Bush, J. A. Nature 1974, 248, 408; J. Chem. Inf. Comput. Sci. 1975, 15, 215.
16. Adamson, G. W.; Bush, J. A. J. Chem. Inf. Comput. Sci. 1975, 15, 55.
17. Willett, P.; Winterman, V. Quant. Struct.-Act. Relat. 1986, 5, 18.
18. Willett, P. Similarity and Clustering Technique in Chemical Information System; Research Studies Press: Letchworth, England, 1987.
19. Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research; Academic Press: New York, 1976.
20. Randic, M. J. Math. Chem. 1991, 17, 155.
21. Randic, M. J. Chem. Inf. Comput. Sci. 1992, 32, 686.
22. Downs, G. M.; Gill, G. S.; Willett, P.; Walsh, P. SAR QSAR Environ. Res. 1995, 3, 253.
23. Martin, E. J.; Blaney, J. M.; Siani, M. A.; Spellmeyer, D. C.; Wong, A. K.; Moos, W. H. J. Med. Chem. 1995, 38, 1431.
24. Karcher, W.; Karabunarliev, S. J. Chem. Inf. Comput. Sci. 1996, 36, 672.
25. Kearsley, S. K.; Sallamack, S.; Fluder, E. M.; Andose, J. D.; Mosley, R. T.; Sheridan, R. P. J. Chem. Inf. Comput. Sci. 1996, 36, 118.
26. Shemetulskis, N. E.; Weininger, D.; Blankley, C. J.; Yang, J. J.; Humblet, C. J. Chem. Inf. Comput. Sci. 1996, 36, 862.
27. Sheridan, R. P.; Miller, M. D.; Underwood, D. J.; Kearsley, S. K. J. Chem. Inf. Comput. Sci. 1996, 36, 128.
28. Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 1996, 36, 572.
29. Takahashi, Y.; Ishiyama, Y. In Molecular Similarity and Reactivity; Carbó, R., Ed.; Kluwer: Dordrecht, 1995, p. 311.
30. Arctander, S. Perfume and Flavor Chemicals I, II; 1969.
ANALYSIS OF THE TRANSFERABILITY OF SIMILARITY CALCULATIONS FROM SUBSTRUCTURES TO COMPLEX COMPOUNDS
Guido Sello and Manuela Termini
Abstract
I. Introduction
II. Methodology
III. Results and Discussion
   A. ED Variations: Changing Structure
   B. ED Variations: Changing Conformation
   C. ED Variations: Changing System
   D. Similarity Index: 2D and 3D
   E. Similarity Index: SP Influence
   F. Similarity Index: Juxtapositioned Pairs
   G. Similarity Index: Triplets
IV. Calculations, Transferability, Similarity Measures and Indices
V. Conclusion
Acknowledgments
References and Notes
Supplementary Material

Advances in Molecular Similarity, Volume 2, pages 105-136. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
ABSTRACT

The possibility of transferring calculation results from simple, easy-to-handle structures to complex compounds containing many atoms that can be dissected into pieces is attractive but can produce unwanted side effects. It is therefore necessary to check thoroughly the validity of the transfer. The problem becomes more intriguing in the field of similarity evaluation where an elusive concept must be transformed into a quantitative measure. The results obtained from a series of calculations are reported and discussed in this perspective and the validity of the introduced approximation is emphasized. The level of the simplification that can be used with confidence is consequently derived. The calculations are made using an empirical methodology for evaluating molecular electronic distribution and, as a consequence, molecular electronic energy.
I. INTRODUCTION The possibility of transferring the calculations done on simple substructures to complex compounds has been studied and used in many fields of molecular modeling. It is sufficient to cite the parametrization of molecular mechanics or semiempirical methodologies, which most of the time relies on calculations made on simple compounds at the ab initio level. More recently, the transfer of fragment values has also been employed in drug design, using structural characteristics such as bond lengths or atomic residual charges. The acceptability of those transfers has always been ensured by comparison of "true" and approximated calculations. Thus, it is common practice to "accept" a transfer if the results obtained at different levels of approximation are in agreement or if the property calculated by the model is correctly represented.

Similarity between organic compounds has been used in the past as a purely qualitative concept whose importance was seldom clearly identified. Thus, it was very common to speak about "similar reactivity" or "similar geometry" without attaching any particular meaning to it. In the late 1980s a different and more powerful view of similarity appeared and is now rapidly developing. The application fields are various and range from data management to quantitative biological activity prediction. This last area of similarity application can concern the specific case of QSAR measures as well as the wider and more general subject of drug design. It is not difficult to understand why similarity has been introduced in this field. In general we can identify two different methodological approaches to the problem of drug design, according to the information available:

1. Those approaches where the structure of the receptor is known; in this case the goal is the identification of compounds that can fit in the best way with the receptor (limited utility of similarity).
2. Those where, on the contrary, the structure of the receptor is unknown; in this case the only possibility is to look for compounds having functionalities that can provide a specific activity.

This second case suggests joining similarity with computational techniques, introducing a new methodological approach. The quantitative measure of similarity brings to light the problem of its transferability. From a qualitative point of view it is instinctive to combine the similarity of small pieces so as to assess the similarity of the whole. This is obvious in reactivity as well as in biological activity. The combination is possible unless the perturbations introduced by the added piece are so important as to change the characteristics of the entire construction. Then all of the general difficulties related to the possibility of transferring values must be considered and a way of validation found. In the case of similarity measures, validation by experimental proof is very difficult because similarity cannot be directly measured and the validity of its connection to a property is always questionable. Therefore, the most reliable approach must use calculation comparisons.

In recent years we became interested in the usage of the similarity concept, mainly following our ongoing interest in chemical reactivity. Our approach considers the possibility of defining the similarity between substructures using their electronic description, more exactly, the perturbation induced by each atom or atomic group on the electronic energy of the substructure. Thus, we developed an original method that can quickly calculate electronic energy differences. The method is based on a previously described approach to atomic charge calculation and on the well-known relation existing between chemical potential and electronic energy. The energy differences can then be used to evaluate the similarity between substructures.
We applied this approach to different substructure classes, from simple functional groups to complete molecules. In our first analysis we examined topological (2D) molecular representations only, but the extension of our method for electronic distribution calculation to 3D representations has stimulated subsequent work on the subject. First we repeated the calculation on 3D substructure representations and compared the results; then we calculated the similarity to get a feeling for the response of the approach to the introduced change. After having verified the reliability of the calculation, we went back to the analysis of one particular class of compounds that we had studied previously: the nucleic bases. The genetic information contained in nucleic acids is formally based on the juxtaposition of base triplets, whose sequence furnishes the correct composition of the genes and of the corresponding proteins. This means that the very long chains of bases present in genes can be subdivided into sequences of triplets. As there are only four fundamental bases, their combination can give only 64 different triplets. The consequence is a very high level of repetition of similar substructures. Therefore, we examined the possibility of transferring the calculations done on
simple substructures to complex compounds using those substructures. To verify this possibility we analyzed the influence of different factors on the results. As a final result we expected to obtain sufficient data supporting our hypothesis and therefore an easy way to simulate complex compounds without resorting to cumbersome calculations.
II. METHODOLOGY To be better understood, we briefly describe our calculation methodology. We developed an empirical method for calculating atomic charges. It is based on Gordy's approach to the calculation of atomic electronegativities, i.e., on the formula

$$\chi = 0.31 \times (n + 1)/r + 0.50$$

where $n$ is the number of valence electrons, as modified by Pritchard:

$$\chi = a \times Z_{\mathrm{eff}}/r + b$$

where $a$ and $b$ are constants depending on the element row and $Z_{\mathrm{eff}}$ is calculated from Slater. We extended this formula to nonconstant $Z$s and $r$s, obtaining

$$\chi = a \times Z_{\mathrm{eff}}(i)/r(i) + b$$

where $Z$ and $r$ refer to a hypothetical perturbed state, $Z_{\mathrm{eff}}$ is naturally related to the effective (also fractional) electron number, and $r$ is connected to the same factor by Pauling's equation:

$$r(i) = r \times Z'_{\mathrm{eff}}/Z_{\mathrm{eff}}(i)$$

where $Z'$ is equal to $Z$ for complete electron shielding. Another Pauling equation correlates bond ionic percent to the $\chi$ difference:

$$\% = 100 \times (1 - \exp(-1/4 \times \Delta\chi^2))$$

and, as a consequence,

$$Q = (\ldots) \times (1 - \exp(-1/4 \times \Delta\chi^2))$$

The equation can be extended to multiple interactions, giving

$$Q_T = \sum_i Q(i)$$
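As a rough numerical illustration of the two relations quoted above (Gordy's electronegativity formula and Pauling's ionic-percent relation), the following Python sketch evaluates them directly. The function names are hypothetical, r is assumed to be a covalent radius in ångströms, and the radii used for carbon and oxygen (0.77 and 0.66 Å) are illustrative assumptions, not values taken from this chapter. The charge-equalization procedure itself is not reproduced.

```python
import math


def gordy_electronegativity(n_valence, radius):
    """Gordy's formula: chi = 0.31 * (n + 1) / r + 0.50.

    n_valence: number of valence electrons.
    radius: assumed here to be a covalent radius in angstroms.
    """
    return 0.31 * (n_valence + 1) / radius + 0.50


def ionic_percent(delta_chi):
    """Pauling's relation: % = 100 * (1 - exp(-1/4 * delta_chi**2))."""
    return 100.0 * (1.0 - math.exp(-0.25 * delta_chi ** 2))


# Illustrative inputs only (assumed radii, not taken from the chapter).
chi_c = gordy_electronegativity(4, 0.77)  # carbon: 4 valence electrons
chi_o = gordy_electronegativity(6, 0.66)  # oxygen: 6 valence electrons

# Ionic percent of a hypothetical C-O interaction from the chi difference.
percent_co = ionic_percent(chi_o - chi_c)
print(round(chi_c, 3), round(chi_o, 3), round(percent_co, 1))
```

The ionic percent vanishes for equal electronegativities and grows monotonically with the difference, which is the behavior the charge-transfer expression relies on.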
At $\chi$ equalization we reach a stable state, yielding the atomic charges. Given the correlation between electronic energy (e.e.), $\chi$ (or chemical potential $\mu$), and hardness ($\eta$)

$$\eta = [d\mu/dn]_Z = [d^2(\mathrm{e.e.})/dn^2]$$
and considering the equation for $\chi$ calculation as a continuous function (at least in the valence shell) we can derive the corresponding formulas for hardness and e.e. by the following mathematical manipulations:

$$Z_{\mathrm{eff}} = N - (aN_1 + bN_2 + c(N_3 - 1))$$

$$\mu = -\chi = -k_1 \times Z_{\mathrm{eff}}/r + k_2 = -k_1 \times [N - (aN_1 + bN_2 + c(N_3 - 1))]/r + k_2$$

$$= -k_1 \times [N - (aN_1 + bN_2 + c(N_3 - 1))] \times [N - (aN_1 + bN_2 + cN_3)]/(\ldots \times r) + k_2$$

$$= k_3 \times [N - (aN_1 + bN_2 + c(N_3 - 1))] \times [N - (aN_1 + bN_2 + cN_3)] - k_2$$

and

$$\eta = [d\mu/dn] = d[k_3 \times (N_3 \times (N_3 - 1)) - k_2]/dN_3 = k_3 \times (2N_3 - 1)$$

$$\mathrm{e.e.} = \int \mu\,dn = \int \mu\,dN_3 = \int \{k_3 \times [N - (aN_1 + bN_2 + c(N_3 - 1))] \times [N - (aN_1 + bN_2 + cN_3)] - k_2\}\,dN_3 = k_3 \times (A + B + C) - k_2 \times N_3 + k$$

where

$$A = (N^2 + aN - 2NN_1 - 2bNN_2 + N_1^2 + 2bN_1N_2 - aN_1 + b^2N_2^2 - abN_2) \times N_3$$

$$B = 0.5 \times (-2aN + 2aN_1 + 2abN_2 - a^2) \times N_3^2$$

$$C = a^2/3 \times N_3^3$$

The formulas obtained permit the calculation of the atomic hardness and of the e.e.; for the latter quantity, the corresponding molecular energy is simply the sum of the atomic contributions. Each molecule, each perturbation of a molecule, and each transformation of a molecule has a different energy because the different electron distribution influences the occupancy of Slater's shells and, consequently, the energy calculation. Nevertheless, the direct comparison of molecular electronic energies does not give information on molecular similarity: while, for example, the energy of linear alkanes (C4 ↔ C…) constantly increases, thus allowing their comparison, there is no guarantee that the energies of different molecules are different. By contrast, a much more powerful source of information is furnished by the analysis of the variations of the electronic energy, particularly if they are compared at many points. It is the use of these variations that allowed us to introduce the already described new definition of functional groups. The immediate next step is
110
GUIDO SELLO and MANUELA TERMINI
the application of the just introduced concept to the analysis of the similarity between substructures. In fact, an interesting aspect of our approach is represented by the calculation mechanism. This is based on a very simple idea: If we think that the electronic energy of a molecule is strictly related to the energy of its atoms as results from their reciprocal interactions, we can affirm that the presence/absence of an atom at a specific position determines an energetic variation which can be thought of as the quantification of the difference between that molecule and a hypothetical molecule where the atom is isolated (annihilation principle). Therefore, if we consider two identical molecules and we work on atoms in complete correspondence, the energy variation will be identical; on the contrary, if we consider two different molecules and we work on similar atoms, or if we consider two identical molecules and we work on different atoms, the calculated energy differences will be different and they will represent a measure of the "importance" of the atom in the molecule. Finally, concerning the nucleic bases, we introduced a similarity index calculated as follows:
$$I = \sum_i \Delta_i$$

where $\Delta_i$ is the ED calculated for atom $i$, obtaining values that are representative of the stabilization level given by the base pairing. To get the most information we sequentially analyzed structures of increasing complexity, beginning with a neutral isolated base in 2D and ending with a charged triplet in 3D. In our approach the efficiency of the base pairing depends on the similarity gain that results after pairing in comparison to the isolated compounds; we also postulated that a model for the pairing mechanism can be represented by the starting compounds with a charge on the atoms involved in the pairing mechanism (Figure 1). In passing from 2D to 3D representations we needed to be completely confident about the transferability of the calculated values, and therefore thoroughly examined the perturbation carried by each possible change in the analysis methodology. The first change concerns the differences related to the transformation of the 2D into the 3D approach. Those effects were initially examined on neutral and charged isolated bases, considering the changes both of the energy differences and of the similarity indices. The 3D structures are built using a standard model builder without geometry optimization because of the complete rigidity of the compounds. A methyl group takes the place of the substituent on the nitrogen atom usually carrying the sugar-phosphate residue. As a second step we considered the same bases but with the correct substitution (as we are interested in RNA bases, the sugar is a ribose residue) to verify its influence on the calculation. The structures are considered both without
Figure 1. Base-base pairing and its model simulation.
and with geometry optimization.^^ The phosphate residue is always in the correct physiological ionic form. In our model we simulated the interaction between bases by charging the atoms involved in the coordination. To ensure the reliability of such approximation we also built some base pairs, and then calculated their similarity and compared the results with those from the charged isolated bases. The calculations were made in 2D and in 3D. In this regard a further problem can occur: the geometry of the nitrogen atom in the NH2 group. In fact, we found that the model builder uses, as a standard, a planar NH2 group and that such geometry is also usually reported in
Figure 2. Standard nucleic base chain conformation. The triplet shown is GUG.
biological papers. As we would have liked to eliminate all doubts concerning the correctness of this assumption, we also calculated our energy differences using tetrahedral nitrogens. The last but very important problem in the transfer of results is represented by the effects that vicinal bases can produce in a long base chain. More precisely, we would like to look at the interactions between bases in triplet chains particularly in the case of 3D calculations. Therefore, all 64 possible triplets were built and calculations made on both neutral and partially (i.e., with charged central bases) charged bases. We adopted the standard conformations of the builder^^ (Figure 2).
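The closed-form integral behind the e.e. expression of this section can be checked numerically. The sketch below is only a consistency check under stated assumptions: the expanded terms A, B, and C are taken to correspond to a shielding of the form N1 + b·N2 + a·(N3 − 1), that is, a unit coefficient for N1 and a as the coefficient of N3 (this reading is inferred from the expansion itself); the numerical coefficients 0.35 and 0.85 are the usual Slater values, assumed here; the constants k2, k3, and k are left out; and all function names are hypothetical.

```python
def effective_charge(n, n1, n2, n3, a=0.35, b=0.85):
    """Slater-type effective charge with unit shielding for inner electrons (assumed)."""
    return n - (n1 + b * n2 + a * (n3 - 1))


def integrand(n3, n, n1, n2, a=0.35, b=0.85):
    """Product of the two bracketed factors that is integrated over N3."""
    x = n - n1 - b * n2
    return (x - a * (n3 - 1)) * (x - a * n3)


def closed_form(n3, n, n1, n2, a=0.35, b=0.85):
    """A + B + C as written in the text (integration from 0 to N3)."""
    term_a = (n**2 + a*n - 2*n*n1 - 2*b*n*n2 + n1**2
              + 2*b*n1*n2 - a*n1 + b**2 * n2**2 - a*b*n2) * n3
    term_b = 0.5 * (-2*a*n + 2*a*n1 + 2*a*b*n2 - a**2) * n3**2
    term_c = a**2 / 3 * n3**3
    return term_a + term_b + term_c


def simpson(f, lo, hi, steps=200):
    """Composite Simpson rule; exact for the quadratic integrand used here."""
    if steps % 2:
        steps += 1
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3


# Carbon-like example (assumed): N = 6, no deeper shell, two 1s electrons, four valence electrons.
N, N1, N2, N3 = 6, 0, 2, 4
z_eff = effective_charge(N, N1, N2, N3)
numeric = simpson(lambda t: integrand(t, N, N1, N2), 0.0, N3)
analytic = closed_form(N3, N, N1, N2)
print(z_eff, numeric, analytic)
```

Agreement between the numerical and analytic values only confirms the algebra of the expansion; it says nothing about the empirical constants of the method.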
III. RESULTS AND DISCUSSION In Figure 3 the bases used for the analysis are shown together with the numbering of the examined atoms. For the sake of clarity we will compare ED variations and similarity index variations separately. A. ED Variations: Changing Structure The resulting presentation and discussion will be articulated as shown in Scheme 1; each branch of the 2D tree is compared with the corresponding branch of the 3D
Figure 3. Bases used in the calculations with atom numbering. Starred atoms are charged when simulating pairings. The bases shown are (from top left to bottom right): thymine (T), uracil (U), cytosine (C), adenine (A), guanine (G), adenine tautomer (A*), thymine tautomer (T*), inosine (I), methyl-guanine (M), xanthine (X).
Scheme 1. Branching levels, from root to leaves: ± Charges, ± SP residue, Nitrogen geometry.

NB (2D or 3D)
    NBneutral
        NBn + Sugar&Phosphate
            NBn+SP, nitrogen planar
            NBn+SP, nitrogen tetrahedral
        NBn - Sugar&Phosphate
            NBn-SP, nitrogen planar
            NBn-SP, nitrogen tetrahedral
    NBcharged
        NBc + Sugar&Phosphate
            NBc+SP, nitrogen planar
            NBc+SP, nitrogen tetrahedral
        NBc - Sugar&Phosphate
            NBc-SP, nitrogen planar
            NBc-SP, nitrogen tetrahedral
tree. The trees branch first on the charge status. Then the sugar-phosphate (SP) residue is introduced as a second element of complexity. Finally, the geometry of the nitrogen atoms of the amine functions is considered. Each tree consequently has eight final branches. The discussion follows each branch upward. Only three of the standard bases (A, G, C) possess amine functions that can be configured tetrahedrally (Table 1) or in a planar geometry (Tables 2, 3). In all cases we observe complete agreement between planar and tetrahedral nitrogens when the atoms are charged. On the other hand, neutral nitrogen atoms show a constant decrease (-0.05) of the calculated ED values on passing from planar to tetrahedral geometry. The presence of the SP residue on the separate bases does not greatly influence the ED values (Tables 2, 3). In fact, the only visible change is a small difference (±0.1) in the ED of amine nitrogen atoms when charged. This effect is common to the 2D and 3D cases. The difference is, however, negligible where the ED calculation is concerned. Going upward we can compare neutral and charged bases. The ED variations are very small or zero for all multiply bonded atoms, regardless of the situation (presence/absence of the SP residue, 2D or 3D tree). On the contrary, all singly bonded atoms (both N and O) are heavily influenced by the presence of the charge. But the observed variations are scarcely sensitive to the
Table 1, Values of Energy Differences of Compounds with Tetrahedral Nitrogen Compound^ Calc. type^ Adenine Guanine Cytosine Calc. type*^ Adenine Guanine Cytosine Calc. type*^ Adenine Guanine Cytosine Calc. type^ Adenine Guanine Cytosine Calc. type^ Adenine Guanine Cytosine Calc. type^ Adenine Guanine Cytosine Calc. type^ Adenine Guanine Cytosine Calc. type*^ Adenine Guanine Cytosine
Atom 1^
Atom^
0.451 2.872 0.452
3.692 2.886 2.820
1.851 2.845 2.843 1.852 1.851
3.701 2.878 2.876 2.847 2.850
0.471 2.846 0.468
3.692 2.894 2.823
1.840 2.810 2.811 1.871 1.871
3.704 2.890 2.891 2.849 2.851
0.453 2.873 0.455
3.690 2.887 2.822
1.854 2.850 2.850 1.854 1.854
3.712 2.880 2.879 2.846 2.850
0.443 2.848 0.468
3.691 2.896 2.823
1.791 2.804 2.804 1.810 1.811
3.713 2.894 2.894 2.852 2.854
Atom J(b
Atom 4^
2D, -SP, Nt, neutral 3.790 3.743 2.824 0.280 2.841 2.885 2D, -SP, Nt, (zharged1 3.752 3.809 1.492 2.825 2.824 1.493 2.874 2.866 2.864 2.882 3D, -SP, Nt, neutral 3.780 3.743 2.821 0.339 2.838 2.906 3D, -SP, Nt,
Atom^
0.455 2.898
1.854 0.455 2.891 2.911
0.480 2.808
1.835 0.477 2.774 2.817
0.454 2.899
1.851 0.451 2.891 2.910
0.482 2.792
1.838 0.480 2.749 2.802
Notes: ^Compound names refer to Figure 3. ^Atom numbering refers to Figure 3. *The specific structure and the calculation type are indicated in the column and refer to each whole block. 2D and 3D indicates the dimensionality; ± SP indicates the presence/absence of the sugarphosphate residue; std indicates the standard conformation of the SP residue; Nt indicates the tetrahedral structure of the amine nitrogen; neutral/charged indicates that calculation are made on neutral/charged compound.
Table 2. Values of Energy Differences of Neutral Compounds Compound ^
Atom 1^
Atom^
Atom 3^
Atom 4^
Calc. type*^ Adenine Guanine
0.503 2.872
3.691 2.885
Thymine
2.882
Uracil Cytosine Adenine**^
0.279 0.280
2.881 2.882
0.505
2.882 2.880 2.821
2.843
2.953
2.819
0.250
2.885 2.807
Thymine*^ Me-guanine
0.775 0.204
2.780
2.875
2.882
3.829
3.768
0.280
2.803
0.278
2.920
2.880
2D, -SP, Np, neutral 3.740 3.789 2.824 0.280
Atom^
0.506 2.902 2.903 2.898 2.899 0.504
Inosine
2.872
3.646 2.887
Xanthine
2.882
2.881
3.691
3.782
Guanine
0.527 2.847
2.895
0.340
2.823
0.532
Thymine
2.762
2.913
0.370
2.909
Uracil
2.825
0.373
2.910
2.786 2.787
Cytosine Adenine*^
0.522
2.900 2.824
2.839 0.260
2.906 2.804
Thymine*^
0.823 0.202
Calc. type^ Adenine
2.940
3D, -SP, Np, neutral
2.965
2.816 2.788
3.740
2.808
2.863
2.903
2.809
3.643
3.800
0.529
0.895
Xanthine
0.846 2.857
0.326 0.366
3.766 2.802 2.932
2.883
Calc. type^ Adenine Guanine Thymine Uracil Cytosine
0.502 2.872 2.882 2.880 0.505
3.692 2.886 2.882 2.880 2.823
3.790 0.279 0.280 0.280 2.844
2.823 2.880 2.880 2.884
2.953 0.774
2.820
0.249
2.808
2.778
2.875
2.882
3.646 2.887
3.830
3.769
Inosine
0.203 2.872
0.280
2.802
Xanthine
2.885
2.884
0.279
2.898
Me-guanine Inosine
Adenine*^ Thymine**^ Me-guanine
2.891
2D, +SPstd, Np, neutral 3.741 0.508 2.902 2.902 2.899 2.900 0.502 2.917
{continued)
Table 2, Continued Atom 1 ^
Atom 2^
Calc. type^ Adenine Guanine
0.495
3.692
Thymine Uracil Cytosine
2.830
2.898
0.499
Adenine*^
2.963
2.823 2.814
Thymine*^
0.810
Me-guanine Inosine
0.213 2.847
Xanthine
2.859
Compound^
Atom 3^
Atom 4 ^
Atom 5^
2.848
3D, +SPstd, Np, neutral 3.741 3.780 2.822 0.342 2.896
0.538
2.761
2.911
0.369 0.372
2.917
2.772
2.917
2.773
2.840
2.791
0.264
2.910 2.804
2.794
2.872
2.909
2.794
3.646
3.797
3.766
0.533
2.896
0.326
2.802
2.893
0.365
2.919
2.844
Notes: Compound names refer to Figure 3. Atom numbering refers to Figure 3. The specific structure and the calculation type are indicated in the column and refer to each whole block. 2D and 3D indicate the dimensionality; ±SP indicates the presence/absence of the sugar-phosphate residue; std indicates the standard conformation of the SP residue; Np indicates the planar structure of the amine nitrogen; neutral indicates that calculations are made on neutral compounds. Starred bases refer to the tautomeric forms of the corresponding standard bases.
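The planar-to-tetrahedral shift quoted in the text (about -0.05 for neutral amine nitrogens) can be illustrated with a few entries. In the sketch below the values are read from the 2D, -SP, neutral blocks of Tables 1 and 2; the assignment of each value to a compound and atom follows one reading of the flattened tables and should be treated as an assumption, and the dictionary layout is purely illustrative.

```python
# ED values for neutral amine nitrogens, 2D, -SP (assumed reading of Tables 1 and 2).
planar = {
    ("adenine", 1): 0.503,
    ("cytosine", 1): 0.505,
    ("guanine", 5): 0.506,
}
tetrahedral = {
    ("adenine", 1): 0.451,
    ("cytosine", 1): 0.452,
    ("guanine", 5): 0.455,
}

# Shift on going from planar to tetrahedral geometry, per compound and atom.
shifts = {key: round(tetrahedral[key] - planar[key], 3) for key in planar}

for (compound, atom), shift in shifts.items():
    print(f"{compound:9s} atom {atom}: {shift:+.3f}")
```

With these entries every shift falls close to -0.05, in line with the constant decrease described in the discussion.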
Table 3. Values of Energy Differences of Charged Compounds Compound^ Calc. type^ Adenine Guanine Thymine Uracil Cytosine Adenine* Thymine* Me-guanine Inosine Xanthine
At. 1^
At. 2^
1.861 2.843 2.845 2.861 2.881 2.889 2.880 1.861 1.862 3.003 1.630 1.631 0.196 2.844 2.858 2.878
3.699 2.876 2.877 2.870 2.882 2.867 2.882 2.848 2.850 2.853 2.811 2.816 3.661 2.878 2.869 2.881
At. 3^
At. 4^
2D, -SP, Np, charged 3.806 3.749 1.492 2.825 1.492 2.826 1.608 2.881 1.609 2.869 1.608 2.881 1.607 2.869 2.869 2.875 2.864 2.881 1.525 2.808 2.867 2.903 2.902 2.877 3.857 3.778 1.494 2.804 1.488 2.920 1.488 2.916
At. 5^
1.863 0.505 2.901 2.885 2.900 2.885 2.892 2.910 2.889 2.909 1.860 2.936 2.929
Notes ^
3 charges 2 charges shift shift 3 charges 2 charges 3 charges 2 charges
shift {continued)
Table 3, Continued Compound ^ Calc. type^ Adenine Guanine Thymine Uracil Cytosine Adenine* Thymine* Me-guanine Inosine Xanthine
Calc. type^ Adenine Guanine Thymine Uracil Cytosine Adenine* Thymine* Me-guanine Inosine Xanthine Calc. type"^ Adenine Guanine Thymine Uracil Cytosine
At. / ^
At:P
1.787 2.808 2.811 2.700 2.755 2.780 2.819 1.884 1.882 3.001 1.675 1.674 0.194 2.811 2.826 2.855
3.699 2.889 2.891 2.918 2.912 2.896 2.900 2.849 2.854 2.851 2.823 2.822 3.662 2.892 2.883 2.891
3D,-SP, Np, charged 3.771 3.748 1.387 2.821 2.822 1.366 1.441 2.912 1.441 2.914 1.435 2.911 1.437 2.914 2.904 2.858 2.857 2.902 2.807 1.484 2.901 2.885 2.897 2.881 3.776 3.815 2.803 1.385 2.932 1.311 2.936 1.311
1.861 2.850 2.850 2.860 2.880 2.859 2.879 1.861 1.853 3.005 1.631 0.120 2.849 2.863 2.882
3.712 2.880 2.879 2.868 2.882 2.866 2.881 2.848 2.850 2.852 2.814 3.671 2.879 2.872 2.881
2D, +SPstd, Np, charged 3.821 3.751 1.614 2.825 1.612 2.823 2.880 1.609 2.866 1.608 2.880 1.608 2.868 1.608 2.874 2.870 2.864 2.880 2.806 1.637 2.877 2.903 3.779 3.869 2.802 1.614 1.607 2.898 1.609 2.886
2.918 2.903
1.790 2.804 2.804 2.697 2.755 2.787 2.824 1.785 1.785
3.713 2.894 2.894 2.920 2.913 2.896 2.900 2.852 2.854
3D, +SPstd, Np, charged 3.752 3.802 1.497 2.820 2.819 1.491 2.919 1.440 1.442 2.923 2.916 1.436 2.927 1.439 2.918 2.855 2.908 2.855
1.840 0.482 2.767 2.707 2.767 2.711 2.748 2.801
At 3^
At. 4^
At. 5^
Notes ^
1.812 0.530 2.780 2.729 2.780 2.733 2.774 2.818
3 charges 2 charges
2.775 2.817 1.844 2.881 2.857
1.861 0.456 2.902 2.886 2.902 2.886 2.891 2.909 2.911 1.861
shift shift 3 charges 2 charges 3 charges 2 charges
shift
3 charges 2 charges shift shift 3 charges 2 charges 3 charges
shift
3 charges 2 charges shift shift 3 charges 2 charges
Tables, Continued Compound ^ Adenine* Thymine* Me-guanine Inosine Xanthine
Notes:
At. 1^ 3.000 1.710 0.210 2.802 2.820 2.851
At 2^ 2.848 2.830 3.671 2.899 2.888 2.892
At 3^ 1.599 2.898 3.810 1.511 1.432 1.435
At 4^ 2.805 2.904 3.778 2.853 2.919 2.922
At 5^ 2.807 1.829 2.834 2.795
Notes'^ 3 charges
shift
^Compound names refer to Figure 3. '^Atom numbering refers to Figure 3. Special notes are reported in this column indicating either the number of charges on the compound, or the need of shifting the base so as to have coordination therefore requiring a different position for the charges. ^The specific structure and the calculation type are indicated in the column and refer to each whole block. 2D and 3D indicate the dimensionality; ±SP indicates the presence/absence of the sugar-phosphate residue; std indicates the standard conformation of the SP residue; Np indicates the planar structure of the amine nitrogen; charged indicates that calculations are made on charged compounds.
particular case, always varying by 0.1 or 0.2 at most. The results are easily understandable considering the limited influence of the through space interactions. The last comparison concerns the differences existing between the 2D and the 3D calculations. Here it is again possible to observe an overall similarity between the ED values (Tables 2, 3). It is clear that the 3D calculations show, as a constant characteristic, a greater variability (e.g., the values for the €26^ carbonyl in thymine and uracil become differentiated). This effect is expected because more interactions are considered in 3D calculations. In conclusion, we can affirm that the ED values are consistently invariable and insensitive to the diverse changes introduced. However, it is fundamental to confine the comparisons to a specific calculation area (2D or 3D) without mixing the results. This is an obvious precaution to limit the chance of errors. B. ED Variations: Changing Conformation
In this section we discuss the variations of the ED values caused by changes in conformation. Two aspects are considered: the relative position of the SP residue, and the effects of small changes of the overall conformation. If we minimize the rotational position of the SP residue with respect to the base, the standard nucleotidic conformation is lost (we use the α-helix conformation as built by the model builder). As a consequence, the through-space influence of the SP residue on the base atoms could change. In fact, some small differences are found (e.g., for neutral amine Ns), but the ED values remain more or less constant (Table 4). We also made some calculations on two bases (guanine and thymine) in different conformations chosen among the minima of the potential energy surfaces
Table 4, Valuesof Energy Differences of Compounds with Minimized SP Residue Compound^ Calc. type*^ Adenine Guanine Thynnine Uracil Cytosine Adenine*^ Thymine*^ Me-Guanine Inosine Xanthine Calc. type^ Adenine Guanine Thymine Uracil Cytosine Calc. type^ Adenine Guanine Thymine Uracil Cytosine Adenine*^ Thymine*^ Me-Guanine Inosine Xanthine Calc. type^ Adenine Guanine Thymine Uracil Cytosine
Atom 1^
Atom 2^
0.503 2.872 2.882 2.880 0.505 2.952 0.774 0.204 2.872 2.883
3.691 2.886 2.881 2.880 2.823 2.819 2.780 3.646 2.887 2.882
1.861 2.847 2.849 2.860 2.858 1.862 1.861
3.713 2.877 2.878 2.868 2.867 2.848 2.852
0.495 2.835 2.768 2.830 0.481 2.970 0.816 0.200 2.846 2.857
3.692 2.898 2.914 2.901 2.824 2.818 2.789 3.643 2.897 2.892
1.794 2.873 2.786 2.706 2.785 1.889 1.890
3.713 2.896 2.897 2.920 2.898 2.854 2.855
Atom 3^
Atom 4^
2D , +SPmin, Np, neutral 3.788 3.740 0.279 2.823 0.280 2.881 0.280 2.880 2.844 2.887 0.250 2.806 2.881 2.875 3.767 3.828 2.802 0.280 0.277 2.921 ,2D, +SPmin, Np, charged 3.821 3.750 2.824 1.615 1.612 2.825 1.609 2.880 1.609 2.880 2.874 2.868 2.865 2.883 3D,, +SPmin, Np, neutral 3.780 3.740 0.339 2.821 0.370 2.918 2.921 0.368 2.841 2.916 0.262 2.805 2.869 2.912 3.800 3.766 0.327 2.799 0.365 2.937 ;3D, +SPmin, Np, charged 3.800 3.749 1.497 2.819 1.491 2.820 1.444 2.919 2.921 1.439 2.921 2.863 2.859 2.911
Atom^
0.506 2.902 2.902 2.900 2.899 0.504 2.941
1.864 0.506 2.900 2.902 2.891 2.911
0.496 2.754 2.760 2.760 2.761 0.526 2.864
1.812 0.496 2.747 2.739 2.709 2.769
Notes: Compound names refer to Figure 3. Atom numbering refers to Figure 3. The specific structure and the calculation type are indicated in the column and refer to each whole block. 2D and 3D indicate the dimensionality; +SP indicates the presence of the sugar-phosphate residue; min indicates the minimized conformation of the SP residue; Np indicates the planar structure of the amine nitrogen; neutral/charged indicates that calculations are made on neutral/charged compounds. Starred bases refer to the tautomeric forms of the corresponding standard bases.
Table 5. Values of Energy Differences of Guanine and Thymine Calculated in Different Conformational Minima
Calc. type: 3D, +SP, Np, neutral

Compound   Atom 1   Atom 2   Atom 3   Atom 4        Atom 5
Guanine    2.848    2.896    0.340    (illegible)   0.535
           2.861    2.892    0.325    2.811         0.564
           2.842    2.899    0.336    2.822         0.537
           2.848    2.896    0.340    2.822         0.535
           2.843    2.897    0.338    2.818         0.541
           2.837    2.898    0.337    2.819         0.538
           2.844    2.895    0.317    2.809         0.513
Thymine    2.774    2.912    0.369    2.917         2.780
           2.766    2.910    0.367    2.918         2.765
           2.780    2.913    0.369    2.917         2.782
           2.770    2.909    0.371    2.909         2.811
           2.766    2.911    0.358    2.913         2.777
           2.773    2.909    0.371    2.908         2.797
Notes: Compound names refer to Figure 3. Atom numbering refers to Figure 3. The specific structure and the calculation type are indicated in the column and refer to each whole block. 3D indicates the dimensionality; SP indicates the presence of the sugar-phosphate residue; Np indicates the planar structure of the amine nitrogen; neutral indicates that calculations are made on neutral compounds.
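The stability of the ED values across conformational minima can be summarized by the spread (maximum minus minimum) of each column of Table 5. The sketch below does this for Atom 1 only; the split of the listed values into seven guanine and six thymine conformations is an assumption based on the magnitudes of the entries, and the variable names are illustrative.

```python
# Atom 1 ED values over the conformational minima (assumed split of the Table 5 column).
atom1 = {
    "guanine": [2.848, 2.861, 2.842, 2.848, 2.843, 2.837, 2.844],
    "thymine": [2.774, 2.766, 2.780, 2.770, 2.766, 2.773],
}

# Spread of the ED values for each base, rounded to the precision of the table.
spread = {base: round(max(vals) - min(vals), 3) for base, vals in atom1.items()}

for base, value in spread.items():
    print(f"{base}: spread = {value:.3f}")
```

Both spreads are a few hundredths at most, small compared with the differences between atom types in the same table.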
obtained by rotation of two dihedral angles (between the base and the SP residue and between the sugar and the phosphate residue). The results are shown in Table 5 where we can observe an overall stability of the values therefore making consistent the assumption that the EDs would be constant for the conformational minima. It is obvious that the conformational changes can affect the ED values, but very small variations cannot change too much the weights of the atoms. C. ED Variations: Changing System
Last but not least of the analyzed perturbations come two massive changes of the reference system. The first concerns the pairing simulation by juxtaposition of the base pair without charging the atoms. The second analyzes the behavior of the base triplets. Positioning the bases at the standard pairing distance (as built by the model builder) it is possible to calculate EDs where the presence of the second base influences the result (Table 6). Of course, the comparison of those values with the EDs of the separate neutral bases shows small differences because the through space interaction is weak, but not zero. Therefore, extending the comparison to three
Table 6. Values of Energy Differences of Base Pairs Calculated in 3D without SP Residue, with Planar Nitrogen, and with Neutral Bases Pair^
At. 1^
At. 2^
At. 3^
At. 4^
Inosine
2.791
2.909
0.334
2.800
Adenine
0.610
3.682
3.772
3.740
Inosine
2.801
2.903
0.403
2.797
At. 5^
Notes^
Index^
0.121 2.740
shift
0.017
2.740
shift
0.030
2.913
2.789
shift
0.020
2.914
2.803
Uracil
2.835
2.896
0.468
2.915
Inosine
2.795
2.904
0.403
2.796
Thymine
2.771
2.907
0.471
2.915
Inosine
2.811
2.904
0.377
2.795
Xanthine
2.868
2.893
0.456
Xanthine
2.795
2.910
0.387
Adenine
0.592
3.683
3.782
3.735
Xanthine
2.786
2.907
0.435
2.903
2.836
Cytosine
0.595
2.810
2.835
2.906
2.794
Xanthine
2.819
2.900
0.459
2.910
2.827
0.106
Uracil
2.831
2.903
0.464
2.911
2.757
Xanthine
2.818
2.900
0.457
2.910
2.827
Thymine
2.765
2.914
0.461
2.911
Xanthine
2.866
2.891
0.457
2.913
0.071 shift
0.006
2.757
shift
0.009
2.784
shift
Guanine
2.809
2.902
0.384
2.813
0.504
Xanthine
2.819
2.900
0.427
2.913
2.856
Xanthine
2.867
2.890
0.462
2.908
2.803
Xanthine
2.833
2.916
0.413
2.925
2.800
Me-guanine
0.224
3.649
3.814
3.758
0.624
0.146
Adenine
0.578
3.683
3.775
3.741
Guanine
2.791
2.909
0.341
2.820
0.538
0.099
Adenine
0.572
3.683
3.781
3.737
Uracil
2.765
2.914
0.405
2.919
2.761
0.073
Adenine
0.582
3.685
3.781
3.736
Thymine
2.705
2.923
0.401
2.918
2.759
0.080
Adenine
0.598
3.689
3.773
3.740
Me-guanine
0.250
3.658
3.804
3.764
Adenine
0.556
3.689
3.765
3.739
Adenine*
2.934
2.814
0.297
2.803
Guanine
2.778
2.910
0.392
2.814
Cytosine
0.588
2.812
2.833
2.921
2.736
Guanine
2.808
2.901
0.421
2.823
0.499
0.565
0.021 shift
shift
0.008
0.023 0.006
0.576
Uracil
2.829
2.902
0.461
2.914
2.742
Guanine
2.807
2.901
0.419
2.822
0.498
Thymine
2.764
2.913
0.458
2.914
2.741
Guanine
2.805
2.906
0.413
2.813
0.572
Thymine*
0.863
2.775
2.864
2.917
2.744
Cytosine
0.608
2.812
2.831
2.917
2.762
Inosine
2.796
2.910
0.388
2.974
2 charges
0.077
shift
0.100
shift
0.004 0.064 0.066 {continued)
Table 6. Continued
Pair^
At. 1^
At. 2^
At. 3^
At. 4^
At. 5^
Cytosine Uracil Cytosine Me-guanine Cytosine Adenine* Thymine Cytosine Thymine Uracil Thymine Me-guanine Thymine Thymine* Thymine Adenine* Uracil Thymine Uracil Me-guanine Uracil Adenine* Uracil Thymine*
0.605 2.716 0.565 0.245 0.509 2.954 2.704 0.603 2.766 2.795 2.736 0.229 2.739 0.885 2.753 2.977 2.819 2.756 2.796 0.226 2.813 2.978 2.780 0.887
2.814 2.921 2.820 3.658 2.820 2.813 2.922 2.815 2.914 2.903 2.916 3.650 2.916 2.782 2.917 2.807 2.907 2.909 2.910 3.648 2.908 2.807 2.907 2.780
2.834 0.421 2.829 3.806 2.823 0.329 0.420 2.835 0.449 0.422 0.429 3.815 0.443 2.860 0.398 0.378 0.445 0.421 0.432 3.815 0.402 0.380 0.443 2.861
2.904 2.908 2.916 3.765 2.920 2.816 2.909 2.905 2.911 2.914 2.919 3.758 2.906 2.903 2.924 2.819 2.912 2.916 2.920 3.759 2.924 2.819 2.907 2.902
2.801 2.788 2.775 0.559 2.729 0.588 2.788 2.801 2.760 2.784 2.730 0.596 2.798 2.797 2.718 0.500 2.759 2.784 2.729 0.600 2.719
Notes^
Index^
0.140
shift
0.003 0.086 0.086 0.034 0.079 0.009
shift 0.169 shift
0.001 0.085
shift 0.172
2.796 2.798
0.017
Notes: ^a Compound names refer to Figure 3. ^b Atom numbering refers to Figure 3. ^c Special notes are reported in this column, indicating either the number of charges on the compound or the need to shift the base so as to have coordination, which requires a different position for the charges. ^d The similarity indices are obtained using ED values deriving from these pairs in place of the charged bases used in the previous tables.
elements (neutral separate base, neutral base in the pair, charged separate base) we can observe a consistent trend, in which the charged base shows the largest variations. The uniform behavior of the bases under different perturbations (the presence of the partner base in the first case; a suitably placed charge in the second) supports the validity of our pairing model. The last part of the present section concerns the calculations made on base triplets. These calculations are quite time-consuming (~2 h of CPU time, corresponding to ~6 h of real time, on a moderately loaded IBM RISC/6000 520), and therefore we limited our interest to the standard bases (A, C, G, U), building all 64 possible
combinations, in 3D only.^18 Two aspects are worth examining: the sensitivity of the method to the molecular neighborhood, and the possibility of saving time by using isolated bases as models of base chains. The triplets were built with the model builder available to us and, as a consequence, they show the standard substitution pattern with a 3' OH terminal and a 5' P terminal (Figure 4). This apparently small difference is nevertheless significant: both adenine and cytosine in the 3' OH position show evident changes (Table 7) in the ED of the nitrogen atoms, particularly when the central position of the triplet is occupied by guanine or uracil (a carbonyl group is then positioned exactly below the nitrogen atom). When positioned in the central or in the 5' P position, all of the bases show only small variations in the ED values. More precisely, we observe a general decrease in the ED values, with corresponding differences of < -0.08 for 5' P bases, < -0.13 for central bases, and < -0.23 for 3' OH bases (the values are obtained by comparing the EDs of the atoms of the corresponding isolated bases with those of the triplets and represent the greatest calculated differences). These differences are not negligible and should be considered when short base chains are analyzed; bases that are well inside a long chain, however, can be compared with the 5' P bases of the triplets. To verify the behavior of the bases in pairing we calculated some examples with the central base charged (Table 7) and one triplet example with all of the bases charged (Table 8). The comparison with the corresponding isolated charged bases shows a variation that is always positive and smaller than 0.04 and 0.06, respectively. This result is understandable if we consider that the perturbation introduced by the charge is great enough
[Figure 4: plot; the horizontal axis lists the base pairs.]
Figure 4. Comparison of similarity indices calculated using standard base pairing and simulated base pairing. The dashed line represents standard values.
Table 7. Values of Energy Differences of Base Triplets Calculated in 3D with SP Residue and Planar Nitrogen

Neutral triplets^b
Base      Atom^a   AGU     AUA     CAG     CGC     CUC
First     1        0.383   0.381   0.288   0.421   0.428
          2        3.690   3.694   2.818   2.815   2.815
          3        3.793   3.797   2.842   2.847   2.854
          4        3.742   3.743   2.917   2.920   2.912
          5        -       -       2.769   2.766   2.795
Central   1        2.763   2.721   0.460   2.737   2.716
          2        2.909   2.917   3.683   2.916   2.919
          3        0.330   0.329   3.792   0.339   0.339
          4        2.808   2.924   3.732   2.803   2.924
          5        0.492   2.718   -       0.517   2.721
Third     1        2.813   0.445   2.785   0.448   0.461
          2        2.909   3.687   2.914   2.815   2.814
          3        0.347   3.795   0.330   2.843   2.839
          4        2.923   3.733   2.813   2.918   2.918
          5        2.734   -       0.491   2.746   2.734

Base      Atom^a   GAG     GGG     GUG     UCA     UUU
First     1        2.762   2.785   2.795   2.746   2.770
          2        2.916   2.912   2.911   2.917   2.907
          3        0.320   0.327   0.330   0.359   0.372
          4        2.820   2.818   2.820   2.920   2.912
          5        0.501   0.507   0.506   2.773   2.779
Central   1        0.489   2.786   2.768   0.445   2.769
          2        3.680   2.903   2.906   2.805   2.901
          3        3.799   0.344   0.352   2.838   0.364
          4        3.731   2.806   2.920   2.919   2.915
          5        -       0.493   2.723   2.728   2.732
Third     1        2.790   2.818   2.825   0.437   2.810
          2        2.913   2.907   2.906   3.692   2.908
          3        0.328   0.342   0.337   3.788   0.361
          4        2.813   2.810   2.809   3.734   2.918
          5        0.486   0.491   0.502   -       2.720

Triplets with only the central base charged^c
Base      Atom^a   AGU     AUA     CAG     CGC     CUC
First     1        0.393   0.389   0.289   0.432   0.428
          2        3.688   3.692   2.817   2.812   2.814
          3        3.793   3.796   2.844   2.847   2.854
          4        3.740   3.742   2.917   2.919   2.912
          5        -       -       2.772   2.765   2.796
Central   1        2.683   2.629   1.837   2.656   2.714
          2        2.922   2.729   3.707   2.930   2.921
          3        1.486   1.465   3.811   1.487   1.467
          4        2.808   2.920   3.745   2.808   2.938
          5        0.492   2.711   -       0.516   2.631
Third     1        2.814   0.454   2.782   0.455   0.461
          2        2.907   3.684   2.914   2.813   2.814
          3        0.351   3.795   0.330   2.844   2.840
          4        2.921   3.731   2.814   2.917   2.921
          5        2.735   -       0.492   2.747   2.735

Base      Atom^a   GAG     GGG     GUG     UCA     UUU
First     1        2.762   2.786   2.796   2.746   2.775
          2        2.915   2.911   2.910   2.916   2.912
          3        0.320   0.329   0.323   0.359   0.373
          4        2.819   2.817   2.818   2.920   2.915
          5        0.502   0.507   0.491   2.774   2.779
Central   1        1.839   2.710   2.688   1.844   2.771
          2        3.703   2.917   2.917   2.836   2.906
          3        3.820   1.468   1.442   2.845   1.438
          4        2.743   2.806   2.918   2.916   2.936
          5        -       0.492   2.717   2.735   2.648
Third     1        2.787   2.820   2.826   0.438   2.810
          2        2.912   2.905   2.904   3.691   2.910
          3        0.329   0.345   0.340   3.786   0.359
          4        2.812   2.809   2.807   3.733   2.924
          5        0.488   0.492   0.504   -       2.726

Notes: ^a Atom numbering refers to Figure 3. ^b Triplets are indicated by the first letter of the names of the corresponding bases. Neutral triplets. ^c Triplets are indicated by the first letter of the names of the corresponding bases. Triplets have only the central base charged.
to make the charged centers less sensitive to their neighborhood. Even charging all of the bases in the triplet is not sufficient to introduce large changes in the calculated EDs.^19
D. Similarity Index: 2D and 3D
Our similarity indices are calculated as a combination of the ED values, so they change as much as the EDs change. As a consequence, we expect an overall
Table 8. Values of Energy Differences of Base Triplets Calculated in 3D with SP Residue and Planar Nitrogen

All three bases charged^b
Atom^a   C       C       A
1        1.835   1.842   1.853
2        2.849   2.839   3.711
3        2.866   2.832   3.805
4        2.910   2.918   3.756
5        2.804   2.727   -

All bases neutral^c
Atom^a   A       C       A
1        0.385   0.413   0.254
2        3.701   2.820   3.703
3        3.793   2.836   3.790
4        3.741   2.919   3.747
5        -       2.744   -

Notes: ^a Atom numbering refers to Figure 3. ^b Bases are indicated by the first letter of their names. All three bases are charged. ^c Bases are indicated by the first letter of their names. All of the bases are neutral. The 5' P adenine is rotated by 90° with respect to the usual conformation.
agreement with ED variations. It is worth repeating that how reliable the indices need to be depends entirely on their use, i.e., on the property they represent. In the present case the goal is to order base-base pairings by means of the corresponding similarity indices. The property (similarity of pairing) is difficult to measure; therefore, an experimental validation of the model is difficult to obtain. We could be satisfied by a qualitative agreement among the orderings given by different index approximations or, better, by a consistent stability of the ordering within each type of calculation (e.g., 2D or 3D). Table 9 shows the indices calculated in the 2D and 3D approaches using structures with different characteristics. Comparing the two halves of the table reveals an overall agreement between the two approaches, with index values increasing from left to right. In addition, we can locate six different groups in both halves: AT, AU, TC, UC; TG, UG; TU, UT; AC; AG; CG. The possibility of locating groups in all calculations (particularly in the 3D approach) is interesting because it signals a transferable influence of the main effects.
E. Similarity Index: SP Influence
Going beyond standard bases, we analyzed the indices where mutant or tautomeric bases are also involved. This analysis was done only in 3D, with the aim of studying the influence of the SP residue. From Table 10 we see that the SP residue changes the index values, as expected, but it is also possible to locate a more articulated grouping of the bases: A, M, C; U, T, X; I, G; A*; T*. In fact, with few exceptions, where one member of a group pairs with a base we also find the other members. This behavior is shown consistently by both classes of calculation (with or without SP). In addition, pairing among members of the first group always gives high indices, while in the second group homopairing
Table 9. Values of Similarity Indices Calculated Using Neutral and Charged Isolated Bases
-SP
+SPmin
+SPstd
Np^
Np^
Np^
Adenine-thymine
0.067
0.013
0.084
0.126
Adenine-uracil
0.068
0.013
0.085
0.128
Cytosine-guanine
0.194
0.068
0.058
0.113
Adenine-guanine
0.190
0.081
0.080
0.124
Thymine-guanine
0.143
0.018
0.018
0.019
shift
Uracil-guanine
0.140
0.019
0.017
0.018
shift
Thymine-cytosine
0.071
0.070
0.061
0.114
0.062
0.116
Pair^
+SPstd Nt
Notes^
2D
Uracil-cytosine
0.073
0.071
Thymine-uracil
0.018
0.021
Uracil-thymine
0.022
0.020
Adenine-cytosine
0.044
0.013
2 charges
shift shift inversion
0.033 3D
Adenine-thymine
0.241
0.306
0.313
0.362
Adenine-uracil
0.233
0.293
0.298
0.346
Cytosine-guanine
0.388
0.324
0.194
0.248
Adenine-guanine
0.260
0.215
0.213
0.259
Thymine-guanine
0.075
0.036
0.049
0.043
shift
Uracil-guanine
0.065
0.039
0.049
0.050
shift
Thymine-cytosine
0.370
0.414
0.294
0.351 0.335
Uracil-cytosine
0.361
0.401
0.279
Thymine-uracil
0.046
0.052
0.002
Uracil-thymine
0.084
0.042
0.014
Adenine-cytosine
0.128
0.109
0.019
2 charges
shift 0.011
shift inversion
Notes: ^a Pair names refer to Figure 3. ^b The specific structure and the calculation type are indicated in the column header. ±SP indicates the presence/absence of the sugar-phosphate residue; std/min indicates the standard/minimized conformation of the SP residue; Np/Nt indicates the planar/tetrahedral structure of the amine nitrogen. ^c Special notes are reported in this column, indicating either the number of charges on the pair or the need to shift/invert the base so as to have coordination, which requires a different position for the charges. If the bases need to be shifted, the first base is submitted to the shift.
is always represented by low indices. Finally, there are the special cases of A* and T*. A* is the only base that uses an imine NH group as proton acceptor; T*, on the other hand, presents an enolic OH group as proton donor. As a consequence, the A* pairings are always in the upper part of the table and the T* pairings in the lower part.^20 In conclusion, the indices can be considered transferable, at least as far as the grouping of similar bases is concerned.
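The grouping of similar bases can be probed with a small numerical check. The sketch below is only an illustration, not part of the authors' procedure: it takes the 3D indices of Table 9 for pairs that differ only by thymine versus uracil and reports how far apart the two analogues are in each calculation type. The dictionary and function names are hypothetical, and the assignment of values to columns (-SP, +SPmin, +SPstd, all with planar nitrogen) is an assumption based on reading the flattened table.

```python
# 3D similarity indices read from Table 9 for pairs that differ only by
# thymine (T) versus uracil (U). Columns: -SP, +SPmin, +SPstd (planar N).
# Assumption: the column assignment follows a reading of the flattened table.
INDICES_3D = {
    "AT": (0.241, 0.306, 0.313),
    "AU": (0.233, 0.293, 0.298),
    "TG": (0.075, 0.036, 0.049),
    "UG": (0.065, 0.039, 0.049),
    "TC": (0.370, 0.414, 0.294),
    "UC": (0.361, 0.401, 0.279),
}
ANALOGUE_PAIRS = [("AT", "AU"), ("TG", "UG"), ("TC", "UC")]
COLUMNS = ("-SP", "+SPmin", "+SPstd")


def analogue_gaps():
    """Absolute index difference between each T pair and its U analogue."""
    gaps = {}
    for t_pair, u_pair in ANALOGUE_PAIRS:
        for col, label in enumerate(COLUMNS):
            gaps[(t_pair, u_pair, label)] = abs(
                INDICES_3D[t_pair][col] - INDICES_3D[u_pair][col]
            )
    return gaps


if __name__ == "__main__":
    for key, gap in analogue_gaps().items():
        print(key, round(gap, 3))
```

Under these assumptions the thymine and uracil analogues stay close to each other in every column, even though the absolute index values change when the SP residue is added.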
Table 10. Ordered Values of Similarity Indices Calculated Using Neutral and Charged Isolated Bases^a in 3D with Planar Amine Nitrogen
Without SP residue
xc 0.464 XA 0.336 IT* 0.154 XU 0.096 AA* 0.010
T*M 0.460 UM 0.320 UT* 0.147 UT 0.084 IT 0.009
XM T*A* CT CG CU 0.411 0.391 0.388 0.361 0.370 TM UA* AT AG TA* 0.250 0.241 0.316 0.260 0.247 TT* GT* CA* IX AC 0.121 0.104 0.139 0.128 0.118 CM XG TU TC UG 0.071 0.074 0.065 0.048 0.046 XX TT UU 0.007 0.005 0.008 With SP residue in standard conformation
CI 0.355 AU 0.233 AM 0.100 XT* 0.044
XA* 0.341 lA 0.227 XT 0.100 lU 0.013
T*A* 0.446 CT 0.294 lA 0.177 CA* 0.069 AM 0.007
T*M 0.383 XA 0.289 CI 0.160 AA* 0.056 XX 0.007
UA* 0.368 XM 0.289 lU 0.135 TG 0.049 CM 0.006
TM 0.302 CG^ 0.196 XG 0.085 UU 0.017
AU 0.298 GT* 0.179 TT* 0.080 UT 0.014
TA* 0.365 CU 0.279 IT 0.132 UG 0.049 TU 0.002
XA* 0.352 XC 0.272 IX 0.119 XU 0.023 TT 0.001
AT 0.313 IT* 0.213 XT* 0.101 XT 0.020
UM 0.305 AG 0.213 UT* 0.095 AC 0.019
Notes: ^a A = adenine; C = cytosine; G = guanine; T = thymine; U = uracil; A* = adenine tautomer; T* = thymine tautomer; M = methylguanine; I = inosine; X = xanthine. ^b The CG index calculated on three points is 1.155 and a better comparison could be with a weighted percentage of this value (e.g., 67% = 0.77).
F. Similarity Index: Juxtapositioned Pairs
The index calculation for the pairings simulated by juxtaposing the bases gives values that are smaller than those calculated by our model (Table 6). This is due to the small changes in the EDs, which are induced only by the through-space interactions between hydrogen donors and acceptors. However, the indices group the pairings according to base similarity in a fashion comparable to the effects reported above. The groups are not so clearly formed for two principal reasons: (1) small values obtained as differences of large numbers inherently contain greater errors; (2) the juxtaposition of the bases for nonstandard pairings was forced by hand, particularly where purine-purine or pyrimidine-pyrimidine interactions are involved.
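Reason (1) can be made concrete with standard error propagation. The sketch below is an illustration only: the two ED values are of the size found in the tables, whereas the uncertainty `sigma` is a hypothetical figure chosen for the example and is not taken from the chapter. The function name is likewise an assumption.

```python
import math


def relative_error_of_difference(a, b, sigma):
    """Relative uncertainty of (a - b) when a and b each carry an
    independent absolute uncertainty sigma."""
    sigma_diff = math.sqrt(2.0) * sigma  # uncertainties add in quadrature
    return sigma_diff / abs(a - b)


if __name__ == "__main__":
    ed_a, ed_b = 2.915, 2.911   # two EDs of similar size
    sigma = 0.001               # hypothetical uncertainty, for illustration only
    print("relative error of one ED:", sigma / ed_a)
    print("relative error of the difference:",
          relative_error_of_difference(ed_a, ed_b, sigma))
```

With these numbers each ED is known to a few parts in ten thousand, while their difference is uncertain by roughly a third of its own value, which is why indices built from small ED differences separate the groups less sharply.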
G. Similarity Index: Triplets
The last case of index calculation represents a further modification of the pairing model. Here the pairing concerns triplet pairs where only the central base has been charged for the simulation. Some aspects of the results (Table 11) follow.
1. In all of the examples the index values increase with respect to the isolated bases (last line in the table). This effect was expected because the greater the number of interactions, the greater the weight of each atom, the weight being representative of the atom's importance.
2. Conversely, the smaller the index value of the isolated base, the greater the variance of the triplet indices. This was expected for the same reasons that make the least important atoms more sensitive to perturbations.
3. Even if the indices differ depending on the triplet composition, they are similar enough to one another to be equally usable.
For the sake of clarity, we briefly summarize the work done and the results obtained. To verify the possibility of transferring the similarity calculations from small substructures to complex compounds, we analyzed the changes introduced by adding new features to the starting simple model. We therefore added elements of complexity to the model step by step, analyzing the perturbations caused by each new element. The first and most important modification was the passage from topological to 3D calculations. Here the changes were important and, as a consequence, we should work either in 2D or in 3D without mixing results. Further, only the 3D values can distinguish different situations caused by through-space interactions. In this view, all of the other changes (presence/absence of SP substitution, planar or tetrahedral nitrogen atoms, isolated bases or triplets) introduced visible differences in the calculations. Strictly speaking, each case should be treated separately to obtain the correct results; but, for the purpose of studying similarity, we can choose a standard approximation [SP substitution in standard RNA conformation (see Tables 2 and 3), planar nitrogen atoms (see Tables 2 and 3), isolated bases] and use it with confidence. In fact, the increase in calculation accuracy does not justify the accompanying increase in computer time, since the quantitative calculation of similarity is still difficult to validate.^21

Table 11. Values of Similarity Indices of Base Triplets Calculated in 3D with SP Residue and Planar Nitrogen
A-U^a                A-G^a                C-G^a                C-U^a
Pair^b     Index     Pair^b     Index     Pair^b     Index     Pair^b     Index
CAG-GUG    0.386     CAG-GGG    0.387     GGG-UCA    0.398     UCA-GUG    0.396
GAG-GUG    0.361     CAG-CGC    0.368     CGC-UCA    0.379     UCA-AUA    0.361
CAG-AUA    0.351     CAG-AGU    0.359     AGU-UCA    0.370     UCA-UUU    0.330
GAG-AUA    0.326     GAG-GGG    0.323                          UCA-CUC    0.280
CAG-UUU    0.320     GAG-CGC    0.304
GAG-UUU    0.295     GAG-AGU    0.295
CAG-CUC    0.270
GAG-CUC    0.245
Notes: ^a Bases are identified by the first letter of their names. ^b The central bases of each triplet are used in pairing.

[Figure 5: structural formula of the triplet.]
Figure 5. An example triplet (ACG) used in the calculations.
IV. CALCULATIONS, TRANSFERABILITY, SIMILARITY MEASURES AND INDICES
In the Introduction we briefly mentioned the problems connected with the use of an elusive concept like similarity, and it was clear that even the idea of quantifying this concept can lead to erroneous conclusions and/or methodological approaches. In this respect, if we add to the mere calculations the aim of transferring results between structures and substructures so as to get similarity measures or indices, we could find ourselves in quicksand. In fact, we are adding the approximations inherent in the calculations to those arising from the transfer of values from simple to complex structures, with the objective of getting a quantitative measure of something, like similarity, that is difficult even to define. Therefore, we would like to reconsider the work we have presented. The initial idea was to develop a methodology for measuring the similarity between structures; this measure would be based on the calculation of a well-defined physical molecular characteristic (electronic energy difference). Supposing the calculation of the characteristic is exact (this is never true, but the approximation is clear), it would be possible to compare the values obtained for different structures. The passage from these values to a similarity measure is not immediate, but it is easy. The difficulties appear as soon as we try to attach a meaning to the similarity measure, for example by affirming that structures that are similar in our system will have similar chemical reactivity. Most of the time this step is taken by brute force. It is, therefore, clear that the transfer of calculations from simple to complex structures will be problematic if, and only if, it changes the connection between the numerical value and the similarity meaning attached to it. At this point the question is: did we demonstrate that the value transfer does not invalidate the connection between calculations and similarity measures? The answer is positive. The qualitative sense of similarity is maintained throughout the transfer, and the ordering of the structures is more or less constant.^22 We think that, at the present state of development of the project, it is possible to transfer ED values with confidence and index values with sufficient reliability, at least as far as the grouping of the base pairings is concerned.

[Figure 6: line plot; horizontal axis: AU, CG, UG, AG, CU, UU; vertical axis: 0 to 400.]
Figure 6. Comparison of similarity indices calculated using triplets and isolated (dashed line) bases.
V. CONCLUSION
The use of a modified version of the algorithm for the calculation of the similarity between substructures introduces the problem of the transferability of results from simple substructures to complex compounds. The results obtained clearly show that, at an optimal level of approximation, the transfer can be made with sufficient confidence. The level of approximation necessary to get reliable results is represented by calculations on isolated bases with SP substitution, using a 3D representation and planar amine nitrogen atoms.
ACKNOWLEDGMENTS
Partial financial support by the Consiglio Nazionale delle Ricerche and by the Ministero dell'Università e della Ricerca Scientifica e Tecnologica is gratefully acknowledged.
REFERENCES AND NOTES
1. (a) Chan, P. L.; Dean, P. M. J. Comput.-Aided Mol. Des. 1992, 6, 385-396. (b) Chan, P. L.; Dean, P. M. J. Comput.-Aided Mol. Des. 1992, 6, 407-426.
2. Rouvray, D. H. In Concepts and Applications of Molecular Similarity; Johnson, M. A.; Maggiora, G. M., Eds.; Wiley-Interscience: New York, 1990, pp. 15-42.
3. As recent general references see: Concepts and Applications of Molecular Similarity; Johnson, M. A.; Maggiora, G. M., Eds.; Wiley-Interscience: New York, 1990. Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Carbó, R., Ed.; Kluwer: Dordrecht, 1995.
4. (a) Good, A. C.; So, S.; Richards, W. G. J. Med. Chem. 1993, 36, 433-438. (b) Richards, W. G. Pure Appl. Chem. 1994, 66, 1589-1596.
5. Leoni, B.; Sello, G. In Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches; Carbó, R., Ed.; Kluwer: Dordrecht, 1995, pp. 267-289.
6. Sello, G. J. Chem. Inf. Comput. Sci. 1992, 32, 713-717, and references cited therein.
7. Baumer, L.; Sello, G. J. Chem. Inf. Comput. Sci. 1992, 32, 125-130.
8. (a) Baumer, L.; Sala, G.; Sello, G. Tetrahedron Comput. Method. 1989, 2, 37-46. (b) Baumer, L.; Sala, G.; Sello, G. ibid. 1989, 2, 93-103. (c) Baumer, L.; Sala, G.; Sello, G. ibid. 1989, 2, 105-118.
9. μ = (dE/dN)_Z.
10. Sello, G. Theochem 1995, 340, 15-28.
11. Sello, G.; Termini, M. Unpublished results.
12. Gordy, W.; Thomas, W. J. O. J. Chem. Phys. 1956, 24, 439-444.
13. Pritchard, H. O.; Skinner, H. A. Chem. Rev. 1955, 55, 745-786.
14. Pauling, L. The Nature of the Chemical Bond, 3rd ed.; Cornell University Press: Ithaca, 1960.
15. Molecular Advanced Design; Aquitaine Systemes: Paris, Version 2, 1990. The standard conformation used by the builder is a-helix, which is the most representative for mRNA and tRNA.
16. A fast search for rotational minima has been performed using a Monte Carlo-Metropolis algorithm.
17. Also 2D calculations feel the effect of the geometry because the bond lengths are different.
18. The full set of the calculated EDs is available as supplementary material.
19. To demonstrate the sensitivity of the method to the atom position in space we considered a last case concerning a triplet (ACA) where one base (the 5' P) has been rotated by 90° (Table 8). Both the triplet and the base to rotate were chosen arbitrarily. The variations of the ED values are quite large, as expected.
20. GT* is a three-point pairing whose index is considerably lower than the comparable GC index.
21. In similarity evaluation the possibility of having very accurate calculations could have a dangerous side effect. In fact, the more precise the model is, the less extensible is its applicability. This situation is not surprising because similarity can be seen as a fuzzy property.
22. The sensitivity of the measuring method does not exclude the possibility that, in some cases, the result will depend on the approximation level used.
SUPPLEMENTARY MATERIAL (Values calculated for the complete set of 64 base triplets.)
Values of Energy Differences of Base Triplets Calculated in 3D with SP Residue and Planar Nitrogen
Atom^
AAA^
AAC^
AAG^
AAU^
ACA^
ACC^
ACG^
ACU^
1 2 3 4 5
0.261 3.694 3.789 3.744
0.260 3.693 3.789 3.744
0.262 3.694 3.789 3.740
0.260 3.693 3.788 3.743
0.270 3.696 3.793 3.744
0.268 3.693 3.792 3.742
0.270 3.695 3.795 3.743
0.269 3.693 3.792 3.740
1 2 3 4 5
0.416 3.687 3.788 3.735
0.419 3.688 3.791 3.734
0.457 3.683 3.792 3.733
0.448 3.684 3.794 3.731
0.413 2.810 2.836 2.919 2.732
0.420 2.810 2.841 2.917 2.751
0.457 2.804 2.840 2.919 2.730
0.460 2.803 2.844 2.915 2.752
1 2 3 4 5
0.437 3.690 3.785 3.735
0.421 2.817 2.836 2.900 2.741
2.787 2.912 0.329 2.814 0.487
2.785 2.914 0.336 2.924 2.727
0.418 3.690 3.785 3.734
0.413 2.818 2.828 2.922 2.728
2.776 2.916 0.321 2.812 0.494
2.765 2.918 0.338 2.926 2.713
Notes: ^a Atom numbering refers to Figure 3. ^b Triplets are indicated by the first letter of the names of the corresponding bases. Neutral triplets.
Atom^
AGA^
AGC^
AGG^
AGU^
AUA^
AUC^
AUG^
AUU^
1 2 3 4 5
0.374 3.692 3.793 3.742
0.377 3.691 3.794 3.742
0.379 3.691 3.794 3.740
0.383 3.690 3.793 3.742
0.381 3.694 3.797 3.743
0.381 3.693 3.797 3.743
0.387 3.692 3.798 3.741
0.386 3.690 3.796 3.739
1 2 3 4 5
2.734 2.915 0.322 2.812 0.481
2.745 2.915 0.328 2.811 0.491
2.755 2.909 0.330 2.810 0.485
2.763 2.909 0.330 2.808 0.492
2.721 2.917 0.329 2.924 2.718
2.727 2.917 0.329 2.924 2.718
2.744 2.910 0.340 2.922 2.716
2.749 2.910 0.339 2.919 2.738
1 2 3 4 5
0.468 3.689 3.793 3.734
0.444 2.814 2.843 2.919 2.749
2.818 2.906 0.341 2.811 0.494
2.813 2.909 0.347 2.923 2.734
0.445 3.687 3.795 3.733
0.441 2.812 2.838 2.919 2.739
2.817 2.907 0.335 2.808 0.501
2.805 2.908 0.353 2.922 2.725
Atom^
CAA^
CAC^
CAG^
CAU^
CCA^
CCC^
CCG^
CCU^
1 2 3 4 5
0.288 2.818 2.842 2.917 2.770
0.287 2.818 2.842 2.918 2.770
0.288 2.818 2.842 2.917 2.769
0.288 2.816 2.844 2.918 2.770
0.295 2.819 2.848 2.915 2.791
0.297 2.818 2.850 2.914 2.794
0.395 2.816 2.851 2.914 2.792
0.297 2.818 2.850 2.913 2.794
1 2 3 4 5
0.420 3.687 3.787 3.734
0.423 3.687 3.790 3.733
0.460 3.683 3.792 3.732
0.452 3.683 3.794 3.730
0.415 2.811 2.829 2.922 2.718
0.421 2.811 2.834 2.920 2.736
0.461 2.804 2.833 2.922 2.715
0.461 2.805 2.839 2.919 2.739
1 2 3 4 5
0.442 3.690 3.785 3.735
0.428 2.818 2.837 2.920 2.739
2.785 2.914 0.330 2.813 0.491
2.782 2.915 0.336 2.923 2.725
0.436 3.690 3.786 3.734
0.434 2.819 2.829 2.923 2.724
2.782 2.915 0.331 2.812 0.511
2.769 2.919 0.344 2.927 2.710
Atom^
CGA^
CGC^
CGG^
CGU^
CUA^
CUC^
CUG^
CUU^
1 2 3 4 5
0.419 2.815 2.846 2.919 2.765
0.421 2.815 2.847 2.920 2.766
0.423 2.814 2.847 2.918 2.765
0.427 2.810 2.848 2.920 2.766
0.430 2.815 2.852 2.914 2.794
0.428 2.815 2.854 2.912 2.795
0.431 2.813 2.854 2.913 2.795
0.430 2.814 2.854 2.910 2.796
1 2 3 4 5
2.729 2.915 0.333 2.810 0.506
2.737 2.916 0.339 2.803 0.517
2.750 2.910 0.341 2.807 0.511
2.754 2.910 0.341 2.806 0.518
2.706 2.918 0.337 2.926 2.704
2.716 2.919 0.339 2.924 2.721
2.726 2.912 0.347 2.925 2.702
2.738 2.912 0.345 2.922 2.724
Atom^
CGA^
CGC^
CGG^
CGU^
CUA^
CUC^
CUG^
CUU^
1 2 3 4 5
0.469 3.689 3.793 3.733
0.448 2.815 2.843 2.918 2.746
2.816 2.907 0.342 2.810 0.497
2.809 2.910 0.347 2.922 2.731
0.463 3.688 3.795 3.732
0.461 2.814 2.839 2.918 2.734
2.822 2.907 0.342 2.808 0.520
2.809 2.909 0.357 2.923 2.721
Atom^
GAA^
GAC^
GAG^
GAU^
GCA^
GCC^
GCG^
GCU^
1 2 3 4 5
2.760 2.917 0.319 2.822 0.499
2.758 2.917 0.319 2.822 0.502
2.762 2.916 0.320 2.820 0.501
2.760 2.916 0.321 2.821 0.503
2.774 2.918 0.322 2.823 0.487
2.772 2.914 0.332 2.822 0.505
2.779 2.916 0.323 2.821 0.487
2.775 2.914 0.332 2.819 0.507
1 2 3 4 5
0.448 3.684 3.795 3.733
0.451 3.685 3.798 3.732
0.489 3.680 3.799 3.731
0.481 3.681 3.802 3.730
0.436 2.806 2.841 2.918 2.739
0.445 2.805 2.846 2.916 2.758
0.480 2.800 2.846 2.918 2.737
0.484 2.800 2.850 2.914 2.760
1 2 3 4 5
0.437 3.691 3.786 3.734
0.424 2.816 2.836 2.917 2.744
2.790 2.913 0.328 2.813 0.486
2.788 2.913 0.338 2.921 2.729
0.416 3.693 3.789 3.767
0.415 2.819 2.830 2.924 2.731
2.781 2.918 0.321 2.813 0.495
2.770 2.918 0.337 2.928 2.717
Atom^
GGA^
GGC^
GGG^
GGU^
GUA^
GUC^
GUG^
GUU^
1 2 3 4 5
2.782 2.912 0.325 2.820 0.504
2.776 2.912 0.322 2.820 0.504
2.785 2.912 0.327 2.818 0.507
2.780 2.910 0.323 2.820 0.506
2.791 2.913 0.320 2.821 0.490
2.789 2.901 0.330 2.820 0.506
2.795 2.911 0.330 2.820 0.506
2.793 2.908 0.332 2.816 0.510
1 2 3 4 5
2.762 2.909 0.336 2.809 0.488
2.711 2.908 0.342 2.808 0.498
2.786 2.903 0.344 2.806 0.493
2.791 2.902 0.344 2.805 0.500
2.744 2.912 0.340 2.923 2.726
2.747 2.912 0.344 2.920 2.743
2.768 2.906 0.352 2.920 2.723
2.770 2.904 0.351 2.917 2.745
1 2 3 4 5
0.467 3.689 3.794 3.733
0.443 2.810 2.842 2.916 2.748
2.818 2.907 0.342 2.810 0.491
2.818 2.905 0.349 2.920 2.736
0.449 3.687 3.800 3.734
0.443 2.813 2.840 2.920 2.742
2.825 2.906 0.337 2.809 0.502
2.807 2.909 0.354 2.924 2.727
Atom^
UAA^
UAC^
UAG^
UAU^
UCA^
UCC^
UCG^
UCU^
1 2 3 4 5
2.731 2.917 0.359 2.922 2.752
2.731 2.917 0.356 2.923 2.752
2.734 2.916 0.355 2.922 2.753
2.735 2.916 0.356 2.923 2.754
2.746 2.917 0.359 2.920 2.773
2.750 2.917 0.367 2.919 2.775
2.748 2.916 0.360 2.920 2.775
2.752 2.917 0.367 2.919 2.776
Atom^
UAA^
UAC^
UAG^
UAU^
UCA^
UCC^
UCG^
UCU^
1 2 3 4 5
0.451 3.682 3.796 3.731
0.454 3.683 3.799 3.730
0.491 3.679 3.801 3.730
0.483 3.679 3.803 3.728
0.445 2.805 2.838 2.919 2.728
0.452 2.805 2.843 2.916 2.748
0.491 2.799 2.842 2.919 2.726
0.491 2.799 2.847 2.915 2.750
1 2 3 4 5
0.443 3.688 3.787 3.734
0.429 2.814 2.838 2.918 2.741
2.792 2.911 0.330 2.814 0.493
2.788 2.913 0.337 2.922 2.726
0.437 3.692 3.788 3.734
0.434 2.818 2.830 2.922 2.727
2.786 2.916 0.332 2.812 0.511
2.771 2.918 0.344 2.926 2.713
Atom^
UGA^
UGC^
UGG^
UGU^
UUA^
UUC^
UUG^
UUU^
1 2 3 4 5
2.755 2.914 0.368 2.925 2.749
2.751 2.913 0.357 2.945 2.776
2.758 2.912 0.365 2.923 2.749
2.756 2.910 0.363 2.925 2.751
2.767 2.913 0.364 2.920 2.775
2.769 2.913 0.374 2.914 2.774
2.770 2.910 0.365 2.918 2.777
2.770 2.907 0.372 2.912 2.779
1 2 3 4 5
2.766 2.908 0.344 2.806 0.512
2.774 2.907 0.350 2.805 0.523
2.787 2.902 0.352 2.803 0.516
2.793 2.902 0.353 2.802 0.524
2.739 2.911 0.352 2.922 2.715
2.747 2.908 0.358 2.917 2.730
2.760 2.905 0.363 2.921 2.713
2.769 2.901 0.364 2.915 2.732
1 2 3 4 5
0.472 3.686 3.795 3.733
0.452 2.812 2.844 2.917 2.748
2.822 2.904 0.343 2.811 0.500
2.816 2.906 0.349 2.921 2.733
0.467 3.688 3.796 3.731
0.465 2.813 2.840 2.915 2.734
2.826 2.906 0.345 2.807 0.521
2.810 2.908 0.361 2.918 2.720
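The sensitivity of the first base to its neighbor can be read directly from these values. The sketch below is an illustration under stated assumptions: it copies the atom 1 ED of the first base for eight triplets from the supplementary table and reports the shift relative to the triplet that has adenine in the central position. The dictionary and function names are hypothetical, and the choice of the adenine-centered triplet as reference is made only for this example.

```python
# Atom 1 ED of the first base, copied from the supplementary table,
# keyed by triplet name (first, central, and third base).
FIRST_BASE_ATOM1 = {
    "AAA": 0.261, "ACA": 0.270, "AGA": 0.374, "AUA": 0.381,
    "CAA": 0.288, "CCA": 0.295, "CGA": 0.419, "CUA": 0.430,
}


def shift_vs_central_adenine(triplet):
    """ED shift of atom 1 relative to the same triplet with a central A.

    The reference choice is an assumption made for this illustration.
    """
    reference = triplet[0] + "A" + triplet[2]
    return FIRST_BASE_ATOM1[triplet] - FIRST_BASE_ATOM1[reference]


if __name__ == "__main__":
    for name in sorted(FIRST_BASE_ATOM1):
        print(name, round(shift_vs_central_adenine(name), 3))
```

In this subset the shift is below 0.01 when cytosine is the central base and above 0.1 when guanine or uracil is, for both adenine and cytosine as first base, in line with the neighbor effect described for Table 7.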
SIMILARITY IN ORGANIC SYNTHESIS DESIGN: COMPARING THE SYNTHESES OF DIFFERENT COMPOUNDS
Guido Sello
Abstract
I. Introduction
II. Similarity Measures
III. Comparison Methodology
IV. Results and Discussion
V. Conclusion
Acknowledgments
References
ABSTRACT
The possibility of using similarity concepts to compare syntheses of different compounds is examined and discussed. New similarity measures and indices are described; they analyze both the strategic and tactical aspects, suggesting a systematic approach to the problem. The examples reported are used to help the reader understand the principles and methods introduced. Discussion of the results illustrates the advantages that similarity can bring to synthesis planning and emphasizes the real applicability of the procedures developed.

Advances in Molecular Similarity, Volume 2, pages 137-151. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
I. INTRODUCTION
Organic synthesis planning is one of the most creative and difficult tasks that chemists can face. It draws on many human intellectual abilities, all of which contribute to the attainment of the final result: an efficient and often elegant synthesis. In this respect, we can predict a highly productive application of similarity, as is demonstrated by several literature references.^"^ However, the explicit use of similarity in this field is scarce and incomplete. Very few examples^"^^ exist that attempt to introduce similarity into the design of syntheses, and most of them just mention its possible use without making an accurate analysis of its importance and contribution. Yet, when examining many of the best-known syntheses, the impression of its presence is immediate: it is often possible to note the intelligent use of well-known synthetic steps in planning the synthesis of new and diverse compounds. Recently we became involved in a project fully dedicated to the introduction of similarity concepts into synthesis design.^"* We could thus elaborate on some preliminary ideas that helped the development of a rough initial system based on similarity measures. After some further modifications the system was applied to different evaluation phases, and its contribution was clear. However, the greatest part of the studies^ was devoted to the application of similarity to the synthetic design of a single target, with the aim of selecting the best path among many possibilities, locating alternative steps, and predicting the most different solutions. Now we are interested in applying the analysis to the comparison of the syntheses of different compounds. It is evident that this problem is more difficult to solve and that it could even be difficult to assess the quality of the solution.
Our previous experience has shown that even structurally similar compounds (e.g., the same compound) can be synthesized by very different routes where comparisons can be problematic. In addition, while the comparison of structures is possible by diverse methods, the comparison of transformations is often done by using substructure changes, neglecting reagents. As a consequence, it becomes hard to compare different molecules where substructures might not be similar at all. For example, the reduction of a carbon-oxygen double bond is keyed by a different substructure with respect to the reduction of a carbon-carbon double bond, despite the evident similarity from the viewpoint of synthesis planning. Herein I will address the complex problem of the comparison of the syntheses of different compounds, developing new ideas and calculation methods at the same time. The synthetic routes, used as examples, are taken directly from the literature^^
Similarity in Organic Synthesis Design
and are not the best possible routes, but they can serve to assess the utility of similarity as a tool in synthesis design.
II. SIMILARITY MEASURES
To compare synthetic routes we need both structure and reaction descriptors. In fact, I will develop a system that can consider the strategic aspect of the synthesis and its tactical realization. It is generally accepted that strategy is mainly a structure problem because the strategic approach to the synthesis of a target cannot be conditioned by the state-of-the-art development of transformation methodologies. On the contrary, tactics are concerned with the application of the strategic principles to the current target and depend on the transform management. During the development of a synthetic plan it is possible to conceive a system that, using similarity, can enhance strategic and tactical efficiency. However, when comparing existing synthetic routes we are forced to use similarity measures only to weigh the alternative options. We selected two structure descriptors and one reaction descriptor. In synthesis design it is important to consider two aspects of the similarity between structures: the first is a classical substructure comparison between educts and products (substructure similarity measure, SSM); the second aims at measuring the effectiveness of the synthetic step (globularity similarity measure, GSM). Every synthetic chemist instinctively feels that a good synthetic step must correlate two compounds that partially share structural features, but at the same time are as different as possible. In principle the best synthetic step transforms an educt into a product that is the most diverse where the change was predicted, but that maintains every other part of the structure unchanged. In other words, we can say that in a good educt-to-product passage all of the building blocks are conserved and all of the reacting blocks are affected. Our two structure descriptors aim at measuring how far this goal is realized.
On the contrary, the comparison between transformations is best represented by a single descriptor that can measure the efficiency of a transformation in association with its group. The use of the reaction classification scheme developed by us should sustain the system by defining a global transform similarity measure (GTSM). SSM is directly derived from our previous work in the field of substructure similarity.^^ We defined a substructure similarity index as
$$\mathrm{SSI} = N \times (A + B)/(A \times B) \tag{1}$$
where $N$ is the number of similar atoms, and $A$ and $B$ are the numbers of atoms in molecules A and B, respectively. This index is well suited for comparing structures. Nevertheless, we slightly modified its definition to take into account two problems: the first concerning the different weight that the similarity of a connected and an unconnected substructure
GUIDOSELLO
must have; the second changing the limits of the index, which now ranges between zero and one. The calculation is thus obtained as
$$\mathrm{SSM} = \sqrt{\sum_i \mathrm{SF}_i^2}, \qquad \mathrm{SF}_i = 2 \times N_i/(A + B) \tag{2}$$
where $N_i$ is the number of similar atoms in fragment $i$, and $A$ and $B$ are the numbers of significant atoms of molecules A and B, respectively. It follows that $0 \le \mathrm{SF}_i \le 1$, equal to 0 if $N_i$ is equal to 0, and equal to 1 if $N_i$ is equal to $A$ and $A$ equal to $B$, and consequently $0 \le \mathrm{SSM} \le 1$. Globularity is a measure of the structure complexity and of its distribution on the molecule.^'^ It is calculated by
$$G = \mathrm{MAXD}/\mathrm{COMP}_{\mathrm{TOT}} \tag{3}$$
where MAXD is the greatest of the smallest distances of atom pairs measured as atom complexity, and $\mathrm{COMP}_{\mathrm{TOT}}$ is the molecular complexity measured as the sum of all atom complexities; from this descriptor we derive the similarity measure as
$$\mathrm{GSM} = \Delta G_{AB} \tag{4}$$
where $\Delta G$ is the difference between the globularities of molecules A and B, A being the educt and B the product. From our reaction classification scheme we chose one descriptor that is used to measure the similarity between transformations. It is based on the calculated chemical potential^^ of changing atoms and is obtained as
μ = dE/dn = -IC,Z,^,^/(ZZXJ + k^ (5)
where
$$Z_{\mathrm{eff}} = Z - \sigma = Z - [N_1 + 0.85 N_2 + 0.35 (N_3 - 1)]$$
$$Z'_{\mathrm{eff}} = Z - (N_1 + 0.85 N_2 + 0.35 N_3)$$
$Z$ is the atomic nuclear charge, $\sigma$ is Slater's core screening factor, $N_1$ is the number of inner-shell electrons, $N_2$ is the number of medium-shell electrons, and $N_3$ is the number of outer-shell electrons. Using this descriptor we can obtain the corresponding similarity measure as
$$\mathrm{GTSM} = \sum_i \Delta\mu_i \tag{6}$$
where $\Delta\mu_i$ is the difference of chemical potential of atom $i$ in the product and the educt.
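As a rough illustration of the substructure measure, the Python sketch below computes the fragment terms SF_i = 2 N_i/(A + B) and combines them into SSM. It assumes that SSM is the square root of the sum of the squared SF_i values, which is one reading of Eq. 2 and not a confirmed definition; the function names and the two 10-atom molecules are hypothetical.

```python
import math


def fragment_similarity(n_i: int, a: int, b: int) -> float:
    """SF_i = 2 * N_i / (A + B) for one fragment of N_i similar atoms."""
    return 2.0 * n_i / (a + b)


def ssm(fragment_sizes, a: int, b: int) -> float:
    """Assumed reading of Eq. 2: SSM = sqrt(sum_i SF_i**2)."""
    return math.sqrt(sum(fragment_similarity(n, a, b) ** 2 for n in fragment_sizes))


# Hypothetical educt and product with 10 significant atoms each.
connected = ssm([8], 10, 10)     # 8 similar atoms in one connected fragment
scattered = ssm([4, 4], 10, 10)  # the same 8 atoms split over two fragments
identical = ssm([10], 10, 10)    # all atoms similar, A equal to B

print(round(connected, 3), round(scattered, 3), round(identical, 3))
```

Under this assumed form, eight similar atoms in one connected fragment score higher than the same eight atoms split over two fragments, which is the kind of weighting the modified index is meant to provide.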
III. COMPARISON METHODOLOGY The definition of the similarity measure is necessary for the realization of a system that can compare synthetic routes; nevertheless, it is also necessary to conceive a
methodology that can guarantee the correct use of similarity measures. The methodology must be clear and stable; but, in addition, it is worth remembering that we are comparing quite different objects using an approach that must conserve its flexibility and large scope. Consequently, I am not going to calculate similarity indexes by just combining the corresponding similarity measures, but instead will develop a procedure where the similarity measures are used following a precise scheme. The final result will still be a number but its value will be highly dependent on the current comparison. In other words, the same synthetic step either can contribute to the calculation of the similarity index with reference to one synthesis, or can be neglected when considering a different synthesis. We can define three similarity indexes, one for SSM, one for GSM, and one for GTSM, that are calculated as reported below. SSI (substructure similarity index) is obtained by taking the geometric mean of the similarity percentages of the SSMs of all of the synthetic steps that have an SSM similar to the corresponding partner to an extent greater than or equal to 80%. For example, consider step 1 of routes A and A', and define SSM of A equal to 0.75 and SSM of A' equal to 0.82; because their ratio is equal to 0.91, step 1 contributes to SSI. On the contrary, if step 2 shows an SSM of A equal to 0.70 and an SSM of A' equal to 0.50, because their ratio is equal to 0.71, step 2 does not contribute to SSI. The rationale is that a synthetic step of two different syntheses can be considered sufficiently similar if the level of similarity of the educts and the products is at least 80%, neglecting the absolute value of the SSM. It is clear that this rationale is valid only when comparing educts and products, i.e., when we can be confident that the compounds we are considering cannot be too dissimilar. Calculation of the index is as follows:
$$\mathrm{SSI} = \Bigl(\prod_n \mathrm{SSM}_i/\mathrm{SSM}_j\Bigr)^{1/n} \quad \forall\; \mathrm{SSM}_i/\mathrm{SSM}_j \ge 0.80 \tag{7}$$
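The SSI procedure can be sketched in Python using the two-step example given in the text (routes A and A'). This is only an illustration: the function names are hypothetical, and taking the ratio as the smaller SSM over the larger one is an assumption inferred from the quoted ratios 0.91 and 0.71.

```python
import math


def similarity_ratio(x: float, y: float) -> float:
    """Ratio of two positive measures, smaller over larger (assumed convention)."""
    return min(x, y) / max(x, y)


def ssi(ssm_a, ssm_b, threshold: float = 0.80):
    """Return (number of contributing steps, geometric mean of their ratios)."""
    ratios = [similarity_ratio(x, y) for x, y in zip(ssm_a, ssm_b)]
    kept = [r for r in ratios if r >= threshold]
    if not kept:
        return 0, 0.0
    return len(kept), math.prod(kept) ** (1.0 / len(kept))


# Example from the text: step 1 (0.75 vs 0.82) and step 2 (0.70 vs 0.50).
route_a = [0.75, 0.70]
route_a_prime = [0.82, 0.50]

steps_used, index = ssi(route_a, route_a_prime)
print(steps_used, round(index, 2))
```

Only step 1 passes the 80% threshold here, so the index reduces to the single ratio 0.75/0.82.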
GSI (globularity similarity index) is based on the use of the descriptor globularity to evaluate the strategic efficiency of a synthetic step. As shown above, G is a measure of the molecular complexity and has a lower value for more complex structures; therefore, it should increase going down from target to precursors. Nevertheless, in real syntheses, G can either increase or decrease and GSM can thus be positive or negative. The corresponding index must consider this situation, and we again chose to combine only GSMs that show similar trends, i.e., GSI is the geometric mean of the ratios of only those GSMs that have the same sign (positive or negative):
$$\mathrm{GSI} = \Bigl(\prod_n \mathrm{GSM}_i/\mathrm{GSM}_j\Bigr)^{1/n} \quad \forall\; \mathrm{GSM}_i/\mathrm{GSM}_j > 0 \tag{8}$$
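A minimal Python sketch of the sign filter in the GSI calculation follows. The GSM values are hypothetical, and the ratio is taken as the smaller magnitude over the larger one, which is an assumption made so that each term stays between 0 and 1.

```python
import math


def gsi(gsm_a, gsm_b):
    """Return (number of same-sign steps, geometric mean of their GSM ratios)."""
    kept = []
    for x, y in zip(gsm_a, gsm_b):
        if x * y > 0:  # both non-zero and of the same sign
            kept.append(min(abs(x), abs(y)) / max(abs(x), abs(y)))
    if not kept:
        return 0, 0.0
    return len(kept), math.prod(kept) ** (1.0 / len(kept))


# Hypothetical GSM values for three corresponding steps of two routes.
route_1 = [26, -46, 10]
route_2 = [13, 23, 20]

steps_used, index = gsi(route_1, route_2)
print(steps_used, index)
```

The second step is skipped because the two globularity changes have opposite signs, so only steps 1 and 3 enter the geometric mean.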
The rationale, in this case, is that we can compare, and then add their contribution to the similarity, only those changes in globularity that have the same strategic effect, i.e., either increase or decrease the molecular complexity. Those synthetic steps that have opposite strategic meaning cannot be compared. The importance of
the contribution is measured by the similarity between the complexity changes; thus, even very small complexity variations that are strategically meaningless can contribute to the overall strategic similarity between syntheses. GTSI (global transform similarity index) is naturally connected to the similarity between transformations, i.e., between reactions transforming educts into products. Its calculation uses the measure of the global changes in chemical potential of all of the atoms. Also in this case we must decide when two reactions in two different syntheses can be compared. We have already reported^^ a system for reaction classification based on two descriptors, electronic energy and chemical potential, that allows the hierarchical subdivision of reactions into ordered sets. Thus, we are in the position of using that classification scheme in our procedure. However, the classification scheme is too articulated and, for the sake of synthesis comparison, we decided to use only two levels of the hierarchy. GTSI is the geometric mean of the ratios of the GTSMs of the reactions that belong to the same energy class; i.e., we consider comparable only two reactions that are both additions, or eliminations, or substitutions. GTSI is calculated as follows:
$$\mathrm{GTSI} = \Bigl(\prod_n \mathrm{GTSM}_i/\mathrm{GTSM}_j\Bigr)^{1/n} \quad \text{iff}\; \mathrm{class}(R_i) = \mathrm{class}(R_j) \tag{9}$$
In this case it is clear that we can use for similarity evaluation only those reactions that have similar reactivity bases. The GTSI values vary much more than their structure counterparts and their mere comparison can be misleading. As a consequence we decided to keep track of the number of reactions participating in the calculation (participating reaction number, PRN) and to use both GTSI and PRN as indexes of synthesis similarity.
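The class filter and the PRN bookkeeping can be sketched as follows. This is an illustration under stated assumptions: the class labels, the GTSM values, and the function name are hypothetical, and comparing magnitudes (smaller over larger) is an assumption, since the text does not say how pairs of opposite sign are treated.

```python
import math


def gtsi(route_a, route_b):
    """Each route is a list of (reaction_class, gtsm) pairs, one per step.

    Returns (PRN, GTSI): the number of step pairs sharing a class and the
    geometric mean of their GTSM ratios.
    """
    kept = []
    for (class_a, g_a), (class_b, g_b) in zip(route_a, route_b):
        if class_a != class_b:
            continue  # reactions of different classes are not compared
        larger = max(abs(g_a), abs(g_b))
        if larger == 0:
            continue  # no chemical potential change to compare
        kept.append(min(abs(g_a), abs(g_b)) / larger)
    if not kept:
        return 0, 0.0
    return len(kept), math.prod(kept) ** (1.0 / len(kept))


# Hypothetical routes: "A" addition, "E" elimination, "S" substitution.
route_1 = [("A", 8.0), ("E", -10.0), ("S", 0.5)]
route_2 = [("A", 4.0), ("S", 1.0), ("S", 0.25)]

prn, index = gtsi(route_1, route_2)
print(prn, index)
```

Returning PRN together with the index reflects the point made above: a high GTSI built on one shared step says less than a moderate GTSI built on many.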
IV. RESULTS AND DISCUSSION
The results presented in the following concern two molecular sets: the first composed of four molecules of the same class (four prostaglandin derivatives), the second composed of the first set augmented by two different compounds, Sirenin and Methoxatin, chosen because the size and the length of their synthetic routes are of the same extent (Figure 1). The syntheses have been selected from the same literature source, thus their description is sufficiently uniform. However, not all of the synthetic steps have been explicitly considered, and in the course of the discussion I will point out the differences that can appear just because of different descriptions. Before beginning the discussion I will repeat some general guidelines. The analysis is always carried out in the synthetic sense; thus, when I speak about, for example, step 3, I mean the third step from the starting material. The values of the similarity measures (SSM, GSM, GTSM) are always obtained by comparing two intermediates of the same synthetic route (e.g., PG1-5 with PG1-6); on the contrary, the values of the similarity indexes (SSI, GSI, GTSI) are calculated by comparing similarity measures of corresponding steps in different syntheses.
Figure 1. Set of examined compounds.
Finally, it must be clear that, while the similarity measures represent the values of the descriptors and consequently have constant values, the similarity indexes are strictly dependent on the calculation procedure and can be changed very easily. In the first set we have four prostaglandin derivatives: PG1 and PG2 are the same intermediate in two syntheses of PGE1; PG3 and PG4 are also intermediates in the syntheses of PGE1, but they are different from PG1 and PG2. The four synthetic routes are sketched in Figures 2, 3, 4, and 5. The syntheses of PG3 and PG4 are very similar, differing only in the last step. Nevertheless, I have freely chosen to describe the two routes in two different ways at the second and third steps because we would like to verify the response of the system. Values of SSM, GSM, and GTSM are reported in Table 1. From Table 2 we can observe that the SSI factors for PG3 and PG4 are over 80% similar in the last three steps only, in agreement with the different weight that the ethereal chain has in the two structures; as a consequence the SSI is the smallest in the PG series. Looking at the GSI, on the contrary, it is immediately obvious that the two syntheses are strategically very similar; all six steps show homogeneous variations and the final result is very clear. In conclusion, the two syntheses are similar for the strategy concerned, but it must be clear that the use of big protective groups in small molecules can influence the overall yield. The transformations are obviously very similar, being in agreement for all but one of the steps. The GTSM values are influenced by the diverse reagents used only in step 5 and in step 3, where we deliberately used water for the hydrolysis of PG3, and NaOH for that of PG4. Comparing PG1 and PG2, which are exactly the same target, we can note that different synthetic routes of the same molecule are not necessarily similar. All of the
Figure 2. First synthesis of PGE2.
Figure 3. Second synthesis of PGE2.
Figure 4. Synthesis of one precursor of PGE2.
indexes have PRNs equal to 2 or 3, demonstrating the low similarity between the two syntheses. On the other hand, PG1 shares a good SSI with PG3 and PG2 shares a medium SSI with PG4. GSI and GTSI (Tables 3 and 4) are always very low; this permits us to affirm that the syntheses of PG1, PG2, and PG3 (or PG4) are different enough to represent synthetic alternatives. More interesting and more pertinent to our discussion is the analysis of the second molecule set because it can reveal the power of the similarity analysis applied to the syntheses of different
Figure 5. Synthesis of another precursor of PGE2.
Table 1. Values of Substructure Similarity Measure, Globularity Similarity Measure, and General Transform Similarity Measure
Step
Compound
/ +2
Methoxatin Sirenin
0.55 0.74
PCI
2+3
3+4
4+5
5 +6
6 +7
0.68
0.88 0.82 0.72
0.68 0.81
0.71
0.67
0.61
0.93
0.63
0.75
0.58 0.50
0.49
0.56
0.38 0.52
PG2
0.88 0.71
0.71
0.72
0.89
0.40 0.57
PG3
0.77
0.60
0.74
0.64
0.40
0.38
PG4
0.85
0.71
0.82
0.86
0.91
0.65
Methoxatin
26
Sirenin
-46
PGl PG2
-32
PG3
60 14
PG4 Methoxatin Sirenin PGl PG2 PG3 PG4
12
7 +8
-85 -14
39
-52
115
-54
-8
10
-35
-10
56
73 25
-24
0
-26
-37
-86
-7
5 -67
-15
22
-27 -27
2
-46 -23
12
-19
El-10.07 A l 8.93 A l 13.02 A 1 8.64 IC 1 0.08 A 1 5.09 E 1 -4.38 E 1 -5.41 El-6.11 IC 1 0.89 IC i 1.49 IC 1 1.71
ICI-0.10 ICI-1.73 IC 1 0.56 A 15.33 A 1 2.96 A l 3.39
10 21
-66
IC 1 -0.30 IC 1 2.74 A l 6.19 IC 1 -3.91 A 1 3.99 A l 5.23 ICI 1.66 A 1 6.55 A 1 4.86 ICI 1.32 A 1 4.07 IC 1 -0.54 A 1 5.75 E 1-10.84 A l 4.27 IC 1 0.48 ICI-0.99 A 18.16 IC 1 -3.34 IC 1 -2.45 A 1 8.30
compounds. Both Methoxatin and Sirenin have structures that are clearly dissimilar from prostaglandins. We have only taken care of the general complexity and of the length of the synthetic routes. These two characteristics are unnecessary to make a comparison, but it is obvious that they make it easier to understand the result. In the case of comparisons between syntheses of different lengths, we need to
Table 2. Values of Substructure Similarity Index^a

              Sirenin       PG1           PG2           PG3           PG4
Methoxatin    2/0.80/0.89   3/0.67/0.88   2/0.78/0.88   3/0.69/0.88   3/0.87/0.93
Sirenin                     4/0.51/0.85   5/0.69/0.93   2/0.86/0.93   3/0.82/0.94
PG1                                       3/0.72/0.89   5/0.73/0.94   3/0.74/0.90
PG2                                                     3/0.76/0.91   4/0.72/0.92
PG3                                                                   3/0.70/0.89

Note: ^a First value is Participating Reaction Number; second value is the product of SSMs; third value is the geometric mean.
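The three values in each entry are linked: under the definition of SSI, the third value should be the n-th root of the second, where n is the Participating Reaction Number. A small Python check of this relation, using four entries as printed in Table 2, is sketched below; the helper name is hypothetical, and the 0.01 tolerance is an assumption that allows for the two-decimal rounding of the printed values.

```python
# (PRN, product of ratios, geometric mean) for four entries printed in Table 2
entries = [
    (2, 0.80, 0.89),
    (4, 0.51, 0.85),
    (5, 0.73, 0.94),
    (3, 0.70, 0.89),
]


def consistent(prn: int, product: float, mean: float, tol: float = 0.01) -> bool:
    """True if mean matches product ** (1 / prn) within the rounding tolerance."""
    return abs(product ** (1.0 / prn) - mean) <= tol


for prn, product, mean in entries:
    recomputed = product ** (1.0 / prn)
    print(prn, product, mean, round(recomputed, 3), consistent(prn, product, mean))
```

The same check applies to Tables 3 and 4, whose entries have the same three-part layout.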
Table 3. Values of Globularity Similarity Index^a

              Sirenin       PG1            PG2            PG3            PG4
Methoxatin    2/0.03/0.17   1/0.69/0.69    3/0.005/0.17   6/0.005/0.41   6/0.002/0.35
Sirenin                     4/0.046/0.46   1/0.88/0.88    2/0.05/0.22    2/0.09/0.30
PG1                                        3/0.017/0.26   1/0.55/0.55    1/0.56/0.56
PG2                                                       3/0.04/0.34    3/0.06/0.39
PG3                                                                      6/0.02/0.52

Note: ^a First value is Participating Reaction Number; second value is the product of GSMs; third value is the geometric mean.
introduce a preliminary step that consists of the selection of the synthetic subroute of the longer synthesis. This selection can be made in two ways: either selecting the subroute containing the precursors of a size similar to the smaller compound, or selecting, after comparison, the subroute most similar to the shorter synthesis. In any case the analytical procedure is always the same, even if the result changes. Proceeding in the study of our molecule set we get the following results. Obviously the comparison cannot be expected to suggest much similarity between compounds and syntheses that are so different, but it is nevertheless intriguing to see what this approach can yield. Sirenin. Its synthesis is sketched in Figure 6 and shows five steps where the structure remains acyclic; in step 6 two cycles are formed giving rise to the final architecture. A similar trend is present in the synthesis of PG1 (cyclization at step 6) and is reflected by both SSI and GSI (PRNs equal to 4). Looking at the result of each step as measured by SSM and GSM, we can observe that (1) step 2 is a condensation of PG1 and not of Sirenin, thus the two measures are in disagreement; (2) step 4 shows a bigger change for Sirenin with respect to PG1, and this is again reflected by the similarity measures; (3) steps 5 and 7, where the differences are less clear, show disagreement in only one measure. All of the remaining steps are
Table 4. Values of General Transform Similarity Index^a

              Sirenin        PG1           PG2           PG3           PG4
Methoxatin    4/0.004/0.25   2/0.09/0.30   2/0.24/0.49   4/0.06/0.49   3/0.03/0.31
Sirenin                      3/0.12/0.49   2/0.10/0.32   2/0.07/0.26   2/0.54/0.73
PG1                                        2/0.81/0.90   0/0.00/0.00   1/0.07/0.07
PG2                                                      3/0.19/0.57   2/0.10/0.32
PG3                                                                    5/0.06/0.57 (4/0.43/0.81)

Note: ^a First value is Participating Reaction Number; second value is the product of GTSMs; third value is the geometric mean.
Figure 6. Synthesis of Sirenin.
similar for the strategy concerned. Also, PG2 shows an interesting similarity with Sirenin in SSM, despite the fact that PG2 cyclization is at the third step as evidenced by GSI; this result demonstrates that the similarity in the changed or maintained substructures can exist even for compounds that are quite dissimilar. The similarity between reactions is a different matter. PG1 has three steps that are classified similarly to those of Sirenin (step 3 is an addition; step 5 is a substitution; step 6 is an addition) and only step 3 has a similarity value that is of the same magnitude. The most similar synthesis is, however, that of Methoxatin with four of six steps falling in the same class (step 1 of Methoxatin and step 2 of Sirenin are additions; step 3 of Methoxatin and step 4 of Sirenin are eliminations; step 4 of Methoxatin and step 5 of Sirenin are substitutions; step 5 of Methoxatin and step 6 of Sirenin are additions) with the two addition steps showing similar values. Methoxatin. Methoxatin is a polycyclic aromatic compound; the synthesis of one of its precursors is sketched in Figure 7 and is compared with all other members of the set. Despite the clear diversity of this molecule with respect to the others, it is still possible to have some hints of the similarity of its synthetic route. SSI is always
Figure 7. Synthesis of Methoxatin.
relatively important (PRNs equal to 2 or 3) because the Methoxatin synthesis has much more structural change at each step than any other compound in the set. Somewhat surprisingly, GSI behaves differently. Both PG3 and PG4 show exactly the same trend as Methoxatin and the GSI values are very near those between PG3 and PG4 themselves, regularly alternating increases and decreases of globularity. In the case of Methoxatin the alternation is evidently justified by the corresponding alternation of condensation-cyclization steps. GTSI also shows some interesting similarity in the PG3 and Sirenin syntheses (PRNs equal to 4 of six compared steps for both). The addition and elimination steps contribute the most to the similarity, as expected. In concluding the analysis of this second molecule set, we can affirm that even the syntheses of quite different compounds can be compared providing interesting suggestions on similar reactivity or strategy weight of each route.
Some final remarks are necessary. Similarity has shown its potential use also in the analysis of synthetic routes; it is clearly possible to compare the syntheses of very different compounds; the results are often easy to understand. Concerning the values of SSM and GSM, we think their meaning can be immediately grasped, i.e., a great value of SSM means a largely conserved structure, a positive change of GSM means a molecule with a more distributed complexity. In the same way we can interpret SSI and GSI values, i.e., two synthetic steps can be considered similar if their SSM share the same level of substructure change or their GSM share the trend in complexity distribution. GTSM and GTSI are less easily bound to the reactivity behavior. First, we had to define a classification scheme to compare transformations; second, the calculated GTSMs are highly dependent on the reaction class. In particular, additions and eliminations generally have more similar values and, when it is the case, their comparison gives greater contributions. On the contrary, substitutions have, by definition, small values that change size and even sign and their comparison has a lesser influence. This fact must be kept in mind when comparing the values of GTSI, because the same number of similar transformations can give different index values if we are considering additions-eliminations or substitutions. Finally, it is important to remember that the PRNs must always be considered because two synthetic routes that share many steps are more similar than two routes that share few steps with higher values.
V. CONCLUSION
The importance of similarity measures has been demonstrated in the field of synthesis planning. Using a limited number of descriptors it is possible to effectively compare the syntheses of different compounds and to obtain, as a final result, the level of their relative efficiency. In addition, by defining a reference synthesis it could be possible to order synthetic routes, thus suggesting a preferential order of application. It is clear that the "best" synthesis does not exist, but it is likely that any reference synthesis can be used for the goal of determining the most appealing among a limited set of synthetic routes.
ACKNOWLEDGMENTS
Partial financial support by the Consiglio Nazionale delle Ricerche, and by the Ministero dell'Università e della Ricerca Scientifica e Tecnologica, is gratefully acknowledged. The author thanks the Organizing Committee of the 7th International Conference on Mathematical Chemistry for the invitation to present part of this work.
REFERENCES
1. Johnson, A. P.; Marshall, C.; Judson, P. N. Recl. Trav. Chim. Pays-Bas 1992, 111, 310.
2. (a) Hendrickson, J. B. Anal. Chim. Acta 1990, 235, 103. (b) Hendrickson, J. B. Recl. Trav. Chim. Pays-Bas 1992, 111, 323.
3. Gordeeva, E. V.; Lushnikov, D. E.; Zefirov, N. S. Tetrahedron 1992, 48, 3789.
4. (a) Barone, R.; Arbelot, M.; Chanon, M. Tetrahedron Comput. Method 1988, 7, 3. (b) Azario, P.; Arbelot, M.; Baldy, A.; Meyer, R.; Barone, R.; Chanon, M. New J. Chem. 1990, 14, 951.
5. (a) Long, A. K.; Kappos, J. C.; Rubinstein, S. D.; Walker, G. E. J. Chem. Inf. Comput. Sci. 1994, 34, 922. (b) Long, A. K.; Kappos, J. C. J. Chem. Inf. Comput. Sci. 1994, 34, 915. (c) Corey, E. J.; Long, A. K.; Lotto, G. I.; Rubinstein, S. D. Recl. Trav. Chim. Pays-Bas 1992, 111, 304.
6. Johnson, A. P.; Marshall, C.; Judson, P. N. Recl. Trav. Chim. Pays-Bas 1992, 111, 310.
7. Gasteiger, J.; Hondelmann, U.; Rose, R; Witzenbichler, W. J. Chem. Soc. Perkin Trans. 2 1995, 193.
8. Nakayama, T. J. J. Chem. Inf. Comput. Sci. 1995, 35, 885.
9. Gelernter, H.; Rose, J. R.; Chen, C. J. Chem. Inf. Comput. Sci. 1990, 30, 492.
10. Hamm, R; Jauffret, R; Kaufmann, G. Recl. Trav. Chim. Pays-Bas 1992, 111, 317.
11. Fontain, E. J. Chem. Inf. Comput. Sci. 1992, 32, 748.
12. (a) Gasteiger, J.; Ihlenfeldt, W. D.; Rose, R R. Recl. Trav. Chim. Pays-Bas 1992, 111, 270. (b) Ihlenfeldt, W.; Gasteiger, J. Angew. Chem. Int. Ed. Engl. 1995, 34, 2613. (c) Gasteiger, J.; Ihlenfeldt, W. D.; Rose, J. D. J. Chem. Inf. Comput. Sci. 1992, 32, 700.
13. Wochner, M.; Brandt, J.; v. Scholly-Pfab, A.; Ugi, I. Chimia 1988, 42, 111.
14. Sello, G.; Termini, M. Tetrahedron 1997, 53, 3729.
15. Corey, E. J.; Cheng, X. The Logic of Chemical Synthesis; Wiley: New York, 1989, pp. 141, 165, 251, 253, 255, 258.
16. Sello, G.; Termini, M. In Advances in Molecular Similarity; Carbo, R.; Mezey, P., Eds.; JAI Press: Greenwich, CT, 1996, Vol. 1, p. 213.
17. Baumer, L.; Campagnari, I.; Sala, G.; Sello, G. Recl. Trav. Chim. Pays-Bas 1992, 111, 297.
18. Sello, G. Theochem 1995, 340, 15.
19. Sello, G.; Termini, M. Tetrahedron 1997, 53, 14085.
BROWSABLE STRUCTURE-ACTIVITY DATASETS
Mark Johnson
Abstract
I. Introduction
II. The Problem of the Merchandiser
III. A First Look at a Structure-Activity Dataset
IV. Molecular Equivalence Numbers as Primary Structural Browsing Variables
V. Level Sets as Primary Structural Browsing Variables
VI. Similarity-Based Projections as Primary Browsing Variables
VII. Summary and Conclusions
References
ABSTRACT
Ever-larger structure-activity datasets are motivating the need for ways of rapidly finding and validating interesting structure-activity patterns. Patterns among chemical and biological-activity variables are easily viewed and browsed with currently available graphical methods. Substructure and similarity searching provide an analogous capability of finding interesting compounds represented in a structural database, but these techniques are slow and unsystematic. Structural browsing variables provide
Advances in Molecular Similarity, Volume 2, pages 153-170. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5 153
a complementary and systematic browsing alternative. The features that make these variables useful for structure browsing are discussed and illustrated using the perillartine sweetener dataset of Acton and Stone.
I. INTRODUCTION
In the pharmaceutical industry, the race to discover new chemical entities for treating disease is dictating larger project teams that encompass more extensive and diverse synthetic efforts directed at increasingly complicated activity spectra. These changes are now being fueled by rapid technological advances in high-throughput screening, combinatorial chemistry,^ and genomics."^ Chemists will increasingly be confronted by ever-larger structure-activity datasets in terms of both numbers of compounds and numbers of activities. This raises the browsing question: What is required of an effective method for efficiently gaining an intuitive and reasonably comprehensive view of the basic factors determining the structure-activity relationships buried in these datasets? The time-honored way of accessing structural information in a chemical structure database is substructure searching. It serves well if you have a clear hypothesis or idea of what types of structures you should be examining. The notion of browsing arises when structure-activity data must be organized and examined in a manner that facilitates the formulation of hypotheses and ideas and the isolation of structural effects that those hypotheses must explain. Similarity searching has been proposed as a browsing tool."* Each similarity search forms a group of those structures having a specified similarity with a query structure. By performing a sequence of similarity searches in which the next query structure is a "hit" from a preceding search, one has the sense of "wandering about in structure space." However, similarity searching restricts one's attention to those compounds that can be "reached" from the original query structure by always taking steps no larger than the similarity search radius. These reachability sets, also called level sets, can be confined to quite small regions of structure space when the compounds are highly clustered.
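The idea of a reachability set can be sketched as a breadth-first search in which each hit of a similarity search becomes a new query. The Python example below is only an illustration: the structures are represented as hypothetical sets of feature labels, Tanimoto similarity is an assumed choice of measure, and the search radius is expressed as a minimum similarity and not as a maximum distance.

```python
from collections import deque


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def reachability_set(query: str, structures: dict, min_similarity: float) -> set:
    """All structures reachable from `query` by repeated similarity searches."""
    reached = {query}
    queue = deque([query])
    while queue:
        current = queue.popleft()
        for name, features in structures.items():
            if name not in reached and tanimoto(structures[current], features) >= min_similarity:
                reached.add(name)
                queue.append(name)  # each hit becomes a later query
    return reached


# Hypothetical structures described by made-up feature labels.
structures = {
    "s1": frozenset("abc"),
    "s2": frozenset("abcd"),
    "s3": frozenset("bcde"),
    "s4": frozenset("xyz"),
}

print(sorted(reachability_set("s1", structures, 0.55)))
```

Here s3 is not a direct hit for s1 but is reached through s2, while s4 lies in a separate cluster and is never visited, which is the confinement effect described above.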
As the reachability sets grow in size, a second difficulty is encountered. Many structural regions in the reachability set will never be visited because of the unsystematic paths one is likely to take in a "similarity walk" through structure space. In this study, I address the browsing problem by stepping back to the very first steps of a structure-activity analysis. Here one is formulating ideas concerning the important substructures and critical structural comparisons that shape our intuitive understanding of the structure-activity relationship. Section 2 begins by analogically looking at the general nature of the problem raised by the browsing question. The concepts of primary and secondary browsing variables are motivated. Section 3 presents some data from an analysis of some oxime sweeteners and illustrates the use of activity variables as browsing variables.
Section 4 defines the concept of a molecular equivalence number (MEQN) and gives an example of an MEQN that works very effectively as a primary browsing variable. It also argues by example that many of the common quantitative chemical descriptors are not likely to be primary browsing variables, but may work well as secondary browsing variables. Section 5 shows that indexed level sets can be very effective primary browsing variables. Finally, Section 6 examines a particular combination of two one-dimensional multidimensional scaling projections that work well for browsing.
II. THE PROBLEM OF THE MERCHANDISER
The manager of a large department store encounters a problem analogous to that raised by the browsing question: How can the merchandise be structured so that a first-time shopper can effectively find what he or she wants in the store? A purely random assortment of the merchandise may suffice if one has only a dozen articles or so, but as the number increases, a one-dimensional arrangement is almost invariably employed, usually decomposed into linear or curvilinear segments defined by the aisles of the store. The shopper can systematically browse the items in the store because of this basic linear structure. Systematic browsing also requires that the items be meaningfully organized along the aisles. A purely random order is never used. Neither are size or color used even though they are critical attributes of merchandise. Rather the items are functionally grouped into men's wear, lady's wear, hardware, appliances, and so forth. These groups must then be indexed to positions along the aisles. A variety of indexings would work; it is only important that one be implemented. The formation of these groups and the selection of an indexing defines what I will term a primary browsing variable. Within these primary groupings, the articles are arranged by other attributes to influence or facilitate the shopper. For example, if the group is men's suits, they might be further ordered by size. Because color and size operate at this second ordering level, I will refer to them as secondary browsing variables. Mathematically, a variable is a mapping of a set of objects to the real line. If these objects are molecular structures, that variable is often called a chemical descriptor. As a variable, any chemical descriptor effectively orders groups of structures along the real line, the entity that corresponds to the aisles in our analogy.
If one thinks of a group as those molecular structures mapped to a particular number or interval of numbers, it becomes apparent that any chemical descriptor is a basis for defining groups of structures. But are these groups interesting, and how do we tell? Our merchandising analogy provides a clue. If the merchandiser is clever, he or she will form groups that facilitate the comparisons a shopper wishes to make in purchasing an item. In buying a suit, you will be wanting to make comparisons between suits of your size, and you can be reasonably assured of finding the suits
that interest you in close proximity to each other. You will not find those suits dispersed throughout the store. With that clue in mind, we seek a basis for forming groups of structures so that one can make interesting structural comparisons within a group.
III. A FIRST LOOK AT A STRUCTURE-ACTIVITY DATASET
Acton and Stone [5] present a number of analogues of perillartine, an oxime sweetener. The taste potency relative to sucrose along with percent sweetness and water solubility were reported for each analogue. These values are given in Table 1. Acton and Stone wanted a compound that would elicit a strong, sweet taste. As such, they required high values for percent sweetness, potency, and solubility. However, they found that the more potent compounds tended to be the more insoluble compounds. This trade-off between potency and solubility is given in Figure 1. The log of the potency plus the log of the solubility is an approximate measure of deliverable taste strength, which will be called delivered potency. It is apparent from Figure 1 that aldoxime, compound 44, represented the best combination of percent sweetness and delivered potency among the compounds in the study. Consequently, our analysis will focus on these two activity variables.
A simple listing of the structures found in Ref. 5 is not given here because the function of such a listing becomes largely archival as the number of structures gets large. This study emphasizes the role that systematic structure browsing plays in searching out the critical structures that shape one's perception of a structure-activity relationship. Consequently, only a few selected subsets of structures are presented as might be constructed by various browsing techniques. This results in some structures appearing in more than one figure. Such reporting is inefficient from an archival viewpoint, but helps to "explain" why such structures come to play such critical roles in our intuitive understanding of a structure-activity relationship. The development of this intuitive understanding is the principal function of structure-activity browsing.
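The delivered-potency definition above can be illustrated with a minimal sketch. The LogSol and LogPot values for compounds 1, 3, and 4 are read from Table 1; the names `compounds`, `delivered_potency`, and `ranking` are illustrative assumptions and are not part of the original study.

```python
# Delivered potency = log(solubility) + log(potency), as defined in the text.
# (LogSol, LogPot) pairs for compounds 1, 3, and 4 are read from Table 1.
compounds = {
    1: (-3.30, 2.57),
    3: (-0.22, 0.90),
    4: (-0.82, 1.60),
}


def delivered_potency(log_sol, log_pot):
    """Return LogSol + LogPot, rounded to the two decimals used in Table 1."""
    return round(log_sol + log_pot, 2)


del_pot = {num: delivered_potency(*vals) for num, vals in compounds.items()}

# Hypothetical helper: compound numbers ordered from highest to lowest
# delivered potency, the ordering used when browsing for activity "peaks".
ranking = sorted(del_pot, key=del_pot.get, reverse=True)

print(del_pot)
print(ranking)
```

Sorting on an activity variable in this way is the kind of structure-free browsing used in this section; it takes no account of how the compounds are related structurally.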
Activity as a function of structure has often been likened to a mountainous surface where the height corresponds to activity and the latitudinal and longitudinal coordinates correspond to locations in structural space. When first venturing into a large structure-activity dataset, it is natural to want to look at those structures with the most desired activity spectrum, i.e., the "highest" points of the activity landscape that have been visited in a lead optimization effort. The first row of Figure 2 lists the first five structures with the highest delivered potency. The diversity of structures gives one great freedom in formulating possible hypotheses as to the types of structural features associated with good delivered potency. We might seek hypotheses related to the minimal number of those structural features. This leads to compound 3. Alternatively, we might seek hypotheses related to the largest number of features consistent with good delivered potency.
Table 1. Structure-Activity Table. Biological and physico-chemical data from [5].
Columns: Num, LogSol, LogPot, %Swt, DelPot, ShrbSiz, Rnglso, BdSetO, BdDel2, BdDelRPI, BdDelMPI.
[Table values not reproduced: the column layout did not survive extraction.]
Note: Num corresponds to the Acton and Stone ordering [5]; LogSol and LogPot are the logs of the solubility and the potency; %Swt is percent sweetness; DelPot = LogSol + LogPot; ShrbSiz is the number of atoms outside the ring system; Rnglso is an arbitrary indexing of the ring systems; BdSetO (BdDel2) are the 0-level (2-level) sets based on the bond-set (edge-deletion) distance; BdDelRPI (BdDelMPI) are one-dimensional multidimensional-scaling projections of the edge-deletion pairwise distances between the ring systems (chemical graphs).
This leads to compound 16. As yet, we have no basis with which to reject the hypothesis that any structure "in between" these two extremes would have good delivered potency. All experimental bases for excluding hypotheses have been filtered out by our singular focus on compounds with good delivered potency.
Figure 1. Joint plot of potency, solubility, and sweetness showing compound 44, the selected sweetener aldoxime. (x-axis: log potency; symbol legend: sweetness.)
The second row of Figure 2 lists the first five structures with the highest percent sweetness. Four of the five structures are close analogues of each other and possess a common ring system. On the one hand, this greatly reduces the diversity of structures from which we might spawn hypotheses. On the other hand, these close analogues reveal a number of small structural changes at the para position that do not change percent sweetness. Thus, we have located part of a "sweetness plane" in this particular response region of structure space. Because compounds 42, 43, and 45 are not found in the first row of Figure 2, it might be suspected that we have identified a "delivered-potency cliff." We shall see that other browsing variables are more specifically designed for systematically locating those cliffs. Again, by construction, this set of structures excludes from our consideration those critical changes that do reduce percent sweetness, i.e., the "sweetness cliffs." To get a better feel for the activity planes and cliffs of our activity landscape, we would like to know more about the structural changes that have been tried and the effects of those changes. By suitably narrowing our focus, we can easily collect together all of the compounds in a small region of structure space. Substructure searching and similarity searching are common methods of "drilling down" and focusing on smaller regions of structure space. For example, one could list all of the 1,4-cyclohexadiene derivatives or perform a similarity search with compound 44 as the query structure. Both represent hypotheses that restrict one's attention, and by suitably narrowing one's focus, one eventually reaches the point where all
Figure 2. Structures with the best five delivered potency values (first row: first five compounds ordered by delivered potency) and the best five sweetness values (second row: first five compounds ordered by percent sweetness).
of the structures within a substructure search can be examined. This process of drilling down takes us out of the browsing mode which concerns us here. It is natural to return to the browsing mode when we seek the next interesting region of structure space.
IV. MOLECULAR EQUIVALENCE NUMBERS AS PRIMARY STRUCTURAL BROWSING VARIABLES
A subregion of structure space becomes interesting if it contains some of the peaks, planes, and cliffs of the structure-activity surface that stand out from the vast regions of inactive space. The preceding section used delivered potency and percent sweetness as browsing variables. By concentrating on the most desired values for these variables, we located the delivered potency and sweetness peaks and, in doing so, found a sweetness plane. These two browsing variables made no use of structure. The remaining part of this study will focus on browsing variables that take structure into account. We begin with a simple count, called ShrbSiz in Table 1, of the number of atoms outside of the ring system. Figure 3 plots ShrbSiz against our two activity variables. The values of ShrbSiz have been jittered by the addition of a small random number so as to distinguish compounds with tied values. No global trends are apparent in Figure 3
Figure 3. Joint plot of jittered ShrbSiz, delivered potency, and sweetness. (x-axis: number of atoms outside of the ring system.)
except that the better values for these two variables tended to be restricted to the lower values of ShrbSiz. The absence of interesting global trends does not rule out the possibility that by examining compounds with particular values of ShrbSiz we might find some interesting structural comparisons as we did in the bottom half of Figure 2. This possibility is ruled out by considering Figure 4, which contains those structures with a ShrbSiz of 6. One can find a few interesting structural comparisons, such as compounds 44 and 49. But even with this small set of compounds, such comparisons are difficult to locate because of a lack of a dominant and obvious commonality among the structures. ShrbSiz is an example of a globally quantitative chemical descriptor. A difference in ShrbSiz for two compounds always means that the two compounds differ in the number of atoms that lie outside of the ring system; it is only a matter of scale. Babaev and Hefferlin [6] provide interesting arguments for using the number, N, of atoms and the number, Z, of valence electrons in a molecule as globally quantitative chemical descriptors for organizing molecular structures when one is looking for molecular properties that reflect the fundamental periodicities found in atomic structure. However, in the regions of (N, Z)-space where most drugs are found, it has yet to be demonstrated that these two descriptors will prove effective as primary
Figure 4. Compounds with a ShrbSiz of 6.
browsing variables. This does not mean that quantitative properties are not useful in a browsing context. In fact, ShrbSiz will be seen to be very effective as a secondary browsing variable for viewing and comparing structures within a group defined by another primary browsing variable. Opposite the quantitative chemical descriptors are the nominal chemical descriptors. Here we only have to give a structural reason why two compounds have identical values for the descriptor. There is no requirement that compounds given different descriptor values have anything in common or are similar to each other in some regard. In other words, a difference in the value of a nominal descriptor has no meaning unless this difference is 0. A nominal chemical descriptor can be defined by simply specifying a manner in which two compounds are to be viewed as equivalent. For example, two molecules may have the same molecular formula, the same chemical graph, or the same ring system. Each of these equivalence relationships partitions every collection of compounds into disjoint classes so that each compound falls into one and only one class. An MEQN is an arbitrary indexing of the classes of an equivalence relation that is defined over all compounds. Rnglso in Table 1 groups the oximes according to a common ring system. (Acyclic structures are equivalent to methane in this equivalence relation, and the
Figure 5. Joint plot of jittered Rnglso, delivered potency, and sweetness. (x-axis: ring equivalence number.)
number of atoms in the ring system is defined to be 1.) By replacing ShrbSiz in Figure 3 with Rnglso, we obtain Figure 5. Because of its construction, we cannot expect to see global structure-activity trends between Rnglso and activity. However, by systematically examining the structural classes, we see that the most desired compounds have a Rnglso value of 10. These structures are listed in Figure 6. The structures in Figure 6 share an obvious commonality, namely, their ring system. Consequently, one can expect that almost any pairwise comparison would prove interesting, especially if one takes into account how medicinal chemists use functional groups to explore a structure-activity relationship. When one notes the disparities in the sweetness of compounds in this set, one knows that a region of structure-activity cliffs has been isolated. However, these cliffs are still to be specified structurally. This is easily done visually using structure-activity maps, one of which for the dataset can be found in Ref. 7. However, structure-activity maps are slow to construct relative to the rate at which large structure-activity datasets must be browsed. Consequently, there is a need for a simpler method of visually isolating the cliffs and planes. The structures in Figure 6 have been ordered by ShrbSiz. Using this ordering together with Figure 5, we see that adjacent compounds 42 and 43 are part of a sweetness plane and a delivered-potency cliff. Further on, adjacent compounds 49 and 44 are part of a sweetness cliff and a delivered-potency plane, and adjacent compounds 50 and 45 are again part of a sweetness cliff and possibly a delivered-potency cliff. If these observations are not to remain anecdotal curiosities, possibly explicable in terms of experimental variation or inexplicable because of the complexities of the underlying mechanisms, we must group them in terms of commonalities and
Figure 6. Compounds with a Rnglso of 10.
trends. For example, the sweetness cliffs defined by compounds 49 and 44 and compounds 50 and 45 are both associated with meta versus para substitution. As we begin to piece together these cliffs and planes, an intuitive understanding of a structure-activity relationship emerges that must be taken into account by any quantitative description of that relationship.
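The construction of an MEQN described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the compound records, ring labels, and the function name `meqn` are hypothetical and do not come from Table 1. The sketch indexes the classes of an equivalence relation (the primary browsing variable) and then orders the members of each class by a quantitative descriptor (the secondary browsing variable).

```python
from collections import defaultdict


def meqn(compounds, key):
    """Index the equivalence classes induced by `key`.

    Two compounds are equivalent when `key` returns the same value for both.
    The class index is arbitrary; here it follows order of first appearance.
    """
    index_of_class = {}
    numbers = {}
    for name, record in compounds.items():
        cls = key(record)
        if cls not in index_of_class:
            index_of_class[cls] = len(index_of_class) + 1
        numbers[name] = index_of_class[cls]
    return numbers


# Hypothetical records: a ring-system label and the number of atoms
# outside the ring system (the role played by ShrbSiz in the text).
compounds = {
    "a": {"ring": "cyclohexene", "shrb_siz": 4},
    "b": {"ring": "benzene", "shrb_siz": 6},
    "c": {"ring": "cyclohexene", "shrb_siz": 3},
    "d": {"ring": "benzene", "shrb_siz": 5},
}

# Primary browsing variable: an MEQN based on a common ring system.
ring_number = meqn(compounds, key=lambda rec: rec["ring"])

# Secondary browsing variable: order each class by shrb_siz.
groups = defaultdict(list)
for name in compounds:
    groups[ring_number[name]].append(name)
for members in groups.values():
    members.sort(key=lambda n: compounds[n]["shrb_siz"])

print(ring_number)
print(dict(groups))
```

Because the class index is arbitrary, only equality of two MEQN values carries meaning, which matches the description of a nominal descriptor given above.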
V. LEVEL SETS AS PRIMARY STRUCTURAL BROWSING VARIABLES
Molecular similarity measures are playing an increasingly important role in drug discovery [8-11]. A variety of algorithms exist for clustering a collection of compounds once the pairwise similarities or distances between the compounds have been calculated [12]. In hierarchical clustering, one can speak of the clusters associated with a given level, v, of similarity [12]. These clusters partition the collection of structures into disjoint sets of compounds, the level sets at level v. Unlike the equivalence classes used in the construction of an MEQN, the membership of these sets changes as the collection of compounds changes. By adding another compound to the collection it is possible for two level sets to merge into one even though the level value, v, does not change. Indexing the level sets of a hierarchical clustering of structures gives rise to another class of interesting browsing variables. Table 1 includes two level-set
Figure 7. Joint plot of jittered BdSetO, delivered potency, and sweetness. (x-axis: level sets; symbol legend: sweetness.)
browsing variables: BdSetO and BdDel2. BdSetO is an arbitrary indexing of the single-linkage level sets for the bond set distance at level 0. (The bond set distance between molecules A and B is the total number of bonds in both molecules minus the number of bonds they have in common when ignoring connectivity.) At a level of 0, two compounds are in the same level set if and only if they are indistinguishable by the bond set distance. BdDel2 is an arbitrary indexing of the level sets of the bond deletion distance at level 2. (The bond deletion distance between molecules A and B is the minimum number of bonds that must be deleted from either A or B that results in a common substructure. This common substructure need not be connected.) By letting BdSetO be the primary browsing variable we obtain Figure 7. Looking at compounds 44 and 49, we know we have located a sweetness cliff and a delivered-potency plane. Turning to Figure 8, which contains the structures of a number of level sets in Figure 7, we see that the sweetness cliff is defined by differences associated with the meta and para positions. This cliff was noted in the preceding section when we confined our attention to a single ring system. Level sets often allow comparisons across ring systems. For example, a comparison of compounds 42 and 34 suggests that a 1,4-cyclohexadiene analogue will have better sweetness and potency than the 1,3-cyclohexadiene analogue. This "substructure" hypothesis is not only confirmed by the comparison of compounds 45 and 37, but suggests that compound 45 may be approaching an optimal structure because the same structural
Figure 8. Structures of annotated points in Figure 7.
Figure 9. Joint plot of jittered BdDel2, delivered potency, and sweetness. (x-axis: level sets.)
change had a much more dramatic effect, an argument very much in the spirit of the Pfeiffer rule [13,14]. Unfortunately, the 1,3-cyclohexadiene analogue of compound 44 was not isolated [5].
Figure 10. Structures of annotated points in Figure 9.
By letting BdDel2 be the primary browsing variable, we obtain Figure 9. Using a different similarity measure gives rise to slightly different groupings, but the nature of the arguments remains unchanged. Compounds 44 and 45 define a sweetness plane and a modest delivered-potency cliff. A similar statement can be made for compounds 49 and 50, except now the delivered-potency cliff is more pronounced and neither compound is sweet. Looking at the corresponding structures in Figure 10, we see the negative effect on sweetness is associated with substitution at the meta position and the negative effect on delivered potency is associated with a methyl addition alpha to the ring.
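The level sets used in this section can be sketched as the connected groups obtained by linking every pair of compounds whose distance does not exceed the level v, which is what single-linkage clustering yields at that level. The following is a minimal illustration; the function name `level_sets` and both distance matrices are hypothetical and are not the bond set or bond deletion distances of the study. The second call shows how adding one compound can merge two level sets while v stays fixed.

```python
def level_sets(dist, level):
    """Single-linkage level sets: compounds i and j share a set when a chain
    of compounds links them with every consecutive distance <= level."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= level:
                parent[find(i)] = find(j)

    # Arbitrary indexing of the sets, by order of first appearance.
    index, labels = {}, []
    for i in range(n):
        root = find(i)
        labels.append(index.setdefault(root, len(index) + 1))
    return labels


# Hypothetical distances between three compounds: no pair is within level 2.
three = [
    [0, 4, 5],
    [4, 0, 4],
    [5, 4, 0],
]
before = level_sets(three, level=2)

# A fourth compound within distance 2 of compounds 0 and 1 merges their sets.
four = [
    [0, 4, 5, 2],
    [4, 0, 4, 2],
    [5, 4, 0, 6],
    [2, 2, 6, 0],
]
after = level_sets(four, level=2)

print(before)
print(after)
```

At level 0 the same routine groups compounds that the distance cannot distinguish, which is how BdSetO is described above.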
VI. SIMILARITY-BASED PROJECTIONS AS PRIMARY BROWSING VARIABLES
Given the pairwise similarities or distances between the compounds, multidimensional scaling methods can array the compounds in R^n so that their pairwise distances in R^n optimally correlate with their pairwise similarities or distances in structure space. By letting n = 1, we obtain another potential browsing variable. The variables BdDelRPI and BdDelMPI are two such projections. BdDelMPI represents a projection of the distance matrix computed over all molecular structures using the bond deletion distance. To form BdDelMPI, the multidimensional scaling function cmdscale of Splus [15] was run on the pairwise bond deletion distances calculated on the hydrogen-reduced chemical graphs of the molecules using the R^2 projection option. BdDelMPI is the x-axis component of that projection. BdDelRPI was computed in a similar manner except the pairwise distances were based on the chemical graphs of the ring systems. BdDelMPI may be thought of as a one-dimensional molecular projection and BdDelRPI as a one-dimensional ring-system projection.
Recall that Rnglso was an arbitrary indexing of the ring systems. The x-axis of Figure 5 represents only one of the many indexings that would order the compounds along a line while preserving the integrity of the ring system groups. However, one might be interested in arranging these ring systems along the line in a manner that puts similar ring systems close to one another. In our department store analogy, this would be analogous to arranging the item groups so that men's shirts are close to men's suits. The ring projection, BdDelRPI, can be viewed as replacing the arbitrary indexing in Rnglso with a more relevant ordering without destroying the makeup of the ring system groupings. Figure 11 plots the ring and molecular projections, BdDelRPI and BdDelMPI, against each other.
Percent sweetness is expressed as the size of the plotted point and delivered potency as the type character used for the plotted point. The best sweetener would be a large, solid disk. Compound 14 is represented by a solid but small disk. Compound 44, the selected sweetener, is represented by a large circle indicating a better than 90% sweet taste and better than average delivered potency.
Figure 11. Joint plot of jittered BdDelRPI, BdDelMPI, delivered potency, and sweetness. (x-axis: ring projection; symbol legends: delivered potency, sweetness.)
The sweet compounds, indicated by the large plotted characters, cluster in the middle of the figure. The five labeled compounds in this region form the upper row of Figure 12. This is a highly organized set of structures with an easily recognized commonality consisting of a six-membered ring substituted with analogously sized substituents in the para position.
Figure 12. Structures of annotated points in Figure 11.
Other regions of this plot also organize structures for meaningful visual comparison. The two other sets of annotated points at the top and bottom of the plot illustrate two of these regions. These corresponding compounds are not sweet as judged by the small characters associated with the plotted points. The corresponding structures are given in the second and third rows of Figure 12. The second row of structures shares the same ring system as compound 28 in the center of the first row, but has a different group of substituents in the para position. Interestingly, the acid and salt, compounds 6 and 7, are separated in Figure 11 from the remaining three compounds in the second row.
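The one-dimensional projections used in this section can be illustrated with a small sketch of classical (Torgerson) scaling, the technique that `cmdscale` implements. The code below is a pure-Python stand-in written for illustration, not the Splus routine: the function name `classical_mds_1d`, the power-iteration solver, and the three-point distance matrix are all assumptions. It double-centers the squared distances and takes the leading eigenvector as the coordinate along the line.

```python
import math


def classical_mds_1d(dist, iterations=200):
    """Project objects onto a line by classical (Torgerson) scaling.

    dist is a symmetric matrix of pairwise distances. Returns one coordinate
    per object, defined only up to sign. Assumes the leading eigenvalue of
    the centered matrix is positive, as it is for Euclidean distances.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J, with J the centering matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    def times_b(vec):
        return [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]

    # Power iteration for the leading eigenvector of B.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = times_b(v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(x * y for x, y in zip(v, times_b(v)))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]


# Hypothetical example: three objects lying on a line at 0, 1, and 3.
dist = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
coords = classical_mds_1d(dist)
print(coords)
```

For distances that already fit on a line, the projection reproduces them exactly; for structural distances such as the bond deletion distance, the single coordinate is only the best one-dimensional summary, which is why it serves as a browsing variable rather than as a faithful map.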
VII. SUMMARY AND CONCLUSIONS
This study has compared a number of chemical descriptors with regard to their potential utility in a single context involving roughly 50 compounds. In this limited context, the browsing variables have accomplished the purpose for which they were designed: to group structures in meaningful ways that facilitate the construction of an intuitive understanding of the activity cliffs and planes associated with the structure-activity surface. Clearly no single browsing variable is going to solve all needs. This raises the question as to what combinations of browsing variables are needed to most quickly bring one to an intuitive understanding of the structure-activity relationship. In this regard, the department store analogy begins to break down. As was pointed out to me by Doug Klein at the Girona Conference on Mathematical Chemistry and Molecular Similarity, the store manager changes his or her arrangement of the store items at great cost. With computers, we can rapidly reorganize the structures to suit a variety of browsing needs. The ability to define collections of browsing variables creates the need for flexible and powerful software tools for dynamically exploring visual presentations of tabled data. As the number of points grows and the number of variables increases, it becomes increasingly important to have access to a dynamic data visualization package for putting together various visual combinations of browsing variables and activity variables. For this study, Spotfire [16] was used for dynamic data visualization, and Splus [15] was used to construct the printed figures.
REFERENCES
1. Harris, A. L. Pharm. News 1995, 2, 26.
2. Czarnik, A. W.; Ellman, J. A., Eds.; Combinatorial Chemistry. Special issue of Acc. Chem. Res. 1996, 29.
3. Beeley, L. J.; Duckworth, D. M. Drug Discovery Today 1996, 1, 474.
4. Bawden, D. In Concepts and Applications of Molecular Similarity; Johnson, M. A.; Maggiora, G. M., Eds.; Wiley-Interscience: New York, 1989, pp. 65-76.
5. Acton, E. M.; Stone, H. Science 1976, 193, 584.
6. Babaev, E. V.; Hefferlin, R. In Concepts in Chemistry: A Contemporary Challenge; Rouvray, D. H., Ed.; Wiley-Interscience: New York, 1997, pp. 41-100.
7. Johnson, M. A. J. Biopharm. Stat. 1993, 3, 203.
8. Willett, P. Similarity and Clustering in Chemical Information Systems; Research Studies Press: Letchworth, 1987.
9. Johnson, M. A.; Maggiora, G. M., Eds.; Concepts and Applications of Molecular Similarity; Wiley-Interscience: New York, 1989.
10. Dean, P. M., Ed.; Molecular Similarity in Drug Design; Blackie Academic & Professional: London, 1995.
11. Carbó, R., Ed.; Molecular Similarity and Reactivity: From Quantum Chemistry to Phenomenological Approaches; Kluwer: Dordrecht, 1995.
12. Jain, A. K.; Dubes, R. C. Algorithms for Clustering Data; Prentice-Hall: Englewood Cliffs, NJ, 1988, p. 66.
13. Pfeiffer, C. C. Science 1956, 124, 29.
14. Lehmann, P. A. F. Quant. Struct.-Act. Relat. 1987, 6, 57.
15. Splus is a commercial statistics package marketed by StatSci, a division of MathSoft, Seattle, Washington, http://www.mathsoft.com.
16. Spotfire is a commercial data visualization package marketed by Spotfire AB, Göteborg, Sweden, http://www.spotfire.com.
CHARACTERIZATION OF THE MOLECULAR SIMILARITY OF CHEMICALS USING TOPOLOGICAL INVARIANTS
Subhash C. Basak, Brian D. Gute, and Gregory D. Grunwald
Abstract
I. Introduction
II. Methods
  A. Database
  B. Calculation of Indices
  C. Classification of the Indices
  D. Statistical Methods and Computation of Similarity
III. Results
  A. Principal Component Analysis
  B. Analogue Selection
  C. K-Nearest-Neighbor Property Estimation
IV. Discussion
Acknowledgments
References

Advances in Molecular Similarity, Volume 2, pages 171-185. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
ABSTRACT
Three similarity spaces were used in the selection of analogues and K-nearest-neighbor (KNN)-based estimation of normal boiling points for a diverse set of 2926 chemicals. The similarity spaces consisted of principal components derived from (1) 40 topostructural indices, (2) 61 topochemical parameters, and (3) the full set of 101 topostructural and topochemical indices. The three methods selected sets of analogues with a substantial number of structurally analogous molecules. For the KNN method of property estimation, the similarity space that used the full set of indices was superior to either of the subsets (topostructural or topochemical). For all three methods, K = 6-10 gave the best estimated values for boiling point.
I. INTRODUCTION
Interest in quantifying the similarity of molecules using computational methods has increased. In particular, a recent trend in the characterization of similarity/dissimilarity of chemicals makes use of graph invariants. Molecular structures can be represented by planar graphs, G = [V,E], where the nonempty set V represents the set of atoms and the set E generally represents covalent bonds. These graphs can be used to adequately represent the pattern of connectedness of atoms within a molecule. Graph invariants, values derived from planar graphs, are graph theoretic properties which are identical for isomorphic graphs. A numerical graph invariant or topological index maps a chemical structure into the set of real numbers. Various graph invariants have been used in ordering and partial ordering of sets of molecules. Various topological indices (TIs) and principal components (PCs) derived from TIs have been used in quantifying the similarity/dissimilarity of molecules and in the similarity-based estimation of physical and toxicological properties. Such TIs include those derived from simple planar graphs which contain adjacency and distance information for vertices. These TIs could be considered topostructural indices. Other TIs, which are derived from weighted chemical graphs, could be regarded as topochemical indices because they contain explicit information regarding the chemical nature of the atoms (vertices) and bonds (edges) in the molecular structure, in addition to quantifying the adjacency and distance relationships within the graph. Our earlier studies made use of a combination of topostructural and topochemical indices to select analogues of chemicals and estimate properties of molecules in large and diverse databases using the K-nearest-neighbor (KNN) method.
In this paper we have carried out a comparative analysis of similarity-based analogue selection and KNN-based estimation of normal boiling point using: (1) a set of 40 topostructural indices, (2) a group of 61 topochemical indices, and (3) the combined set of 101 indices.
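As a rough sketch of the KNN-based property estimation named above, the snippet below averages the property values of the K compounds closest to a probe compound in a similarity space. Euclidean distance and an unweighted mean are assumptions made for illustration; the function name `knn_estimate`, the two-dimensional scores, and the boiling points are hypothetical and are not data from the study.

```python
import math


def knn_estimate(query, neighbors, k):
    """Estimate a property as the mean over the k nearest neighbors.

    query: coordinates of the probe compound in the similarity space.
    neighbors: list of (coordinates, property_value) pairs.
    Assumes Euclidean distance and an unweighted mean.
    """
    ranked = sorted(neighbors, key=lambda item: math.dist(query, item[0]))
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Hypothetical principal-component scores and boiling points (deg C).
training = [
    ((0.0, 0.0), 60.0),
    ((1.0, 0.0), 70.0),
    ((0.0, 1.0), 80.0),
    ((5.0, 5.0), 200.0),
]
estimate = knn_estimate((0.1, 0.1), training, k=3)
print(estimate)
```

With K = 3 the distant fourth compound is excluded from the average, which shows why the choice of K and of the similarity space both affect the estimate.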
II. METHODS
A. Database
The normal boiling point database consisted of 2926 compounds taken from the U.S. EPA ASTER system. The data comprised a set for which chemical structures and normal boiling point values were available, and for which it was possible to compute all 101 TIs.
B. Calculation of Indices
The TIs calculated for this study are listed in Table 1 and include the Wiener number, molecular connectivity indices as calculated by Randić and by Kier and Hall, frequency of path lengths of varying size, information-theoretic indices defined on distance matrices of graphs using the methods of Bonchev and Trinajstić as well as those of Raychaudhury et al., parameters defined on the neighborhood complexity of vertices in hydrogen-filled molecular graphs, and Balaban's J indices. The majority of the TIs were calculated using POLLY 2.3. The J indices were calculated using software developed by the authors.
The Wiener index (W), the first topological index reported in the chemical literature, may be calculated from the distance matrix D(G) of a hydrogen-suppressed chemical graph G as the sum of the entries in the upper triangular distance submatrix. The distance matrix D(G) of a nondirected graph G with n vertices is a symmetric n × n matrix $(d_{ij})$, where $d_{ij}$ is equal to the distance between vertices $v_i$ and $v_j$ in G. Each diagonal element $d_{ii}$ of D(G) is zero. We give below the distance matrix $D(G_1)$ of the unlabeled hydrogen-suppressed graph $G_1$ of n-propanol (Figure 1):

$$
D(G_1) =
\begin{array}{c|cccc}
    & (1) & (2) & (3) & (4) \\ \hline
(1) & 0 & 1 & 2 & 3 \\
(2) & 1 & 0 & 1 & 2 \\
(3) & 2 & 1 & 0 & 1 \\
(4) & 3 & 2 & 1 & 0
\end{array}
$$

W is calculated as

$$
W = \frac{1}{2} \sum_{i,j} d_{ij} = \sum_{h} h \cdot g_h \qquad (1)
$$

where $g_h$ is the number of unordered pairs of vertices whose distance is h. Thus, for $D(G_1)$, where $g_1 = 3$, $g_2 = 2$, and $g_3 = 1$, W has a value of ten: $W = 1 \cdot 3 + 2 \cdot 2 + 3 \cdot 1 = 10$. Randić's connectivity index, and higher order connectivity path, cluster, path-cluster, and chain types of simple, bond and valence connectivity parameters were
SUBHASH C. BASAK, BRIAN D. GUTE, and GREGORY D. GRUNWALD
Table 1. Symbols, Definitions, and Classifications of Topological Parameters

Topostructural
$I_D^W$: Information index for the magnitudes of distances between all possible pairs of vertices of a graph
$\bar{I}_D^W$: Mean information index for the magnitude of distance
$W$: Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
$I^D$: Degree complexity
$H^V$: Graph vertex complexity
$H^D$: Graph distance complexity
$\overline{IC}$: Information content of the distance matrix partitioned by frequency of occurrences of distance h
$O$: Order of neighborhood when $IC_r$ reaches its maximum value for the hydrogen-filled graph
$M_1$: A Zagreb group parameter = sum of square of degree over all vertices
$M_2$: A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
$^h\chi$: Path connectivity index of order h = 0-6
$^h\chi_C$: Cluster connectivity index of order h = 3-6
$^h\chi_{PC}$: Path-cluster connectivity index of order h = 4-6
$^h\chi_{Ch}$: Chain connectivity index of order h = 3-6
$P_h$: Number of paths of length h = 0-10
$J$: Balaban's J index based on distance

Topochemical
$I_{ORB}$: Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
$IC_r$: Mean information content or complexity of a graph based on the rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$: Structural information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$CIC_r$: Complementary information content for rth (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$^h\chi^b$: Bond path connectivity index of order h = 0-6
$^h\chi^b_C$: Bond cluster connectivity index of order h = 3-6
$^h\chi^b_{Ch}$: Bond chain connectivity index of order h = 3-6
$^h\chi^b_{PC}$: Bond path-cluster connectivity index of order h = 4-6
$^h\chi^v$: Valence path connectivity index of order h = 0-6
$^h\chi^v_C$: Valence cluster connectivity index of order h = 3-6
$^h\chi^v_{Ch}$: Valence chain connectivity index of order h = 3-6
$^h\chi^v_{PC}$: Valence path-cluster connectivity index of order h = 4-6
$J^B$: Balaban's J index based on bond types
$J^X$: Balaban's J index based on relative electronegativities
$J^Y$: Balaban's J index based on relative covalent radii
Figure 1. The unlabeled hydrogen-suppressed graph (G_1) of n-propanol, with vertices numbered (1) to (4).
calculated using the method of Kier and Hall.^21 The generalized form of the simple path connectivity index is as follows:

$$^h\chi = \sum_{\text{paths}} (v_i v_j \cdots v_{h+1})^{-1/2} \qquad (2)$$

where v_i, v_j, ..., v_{h+1} are the degrees of the vertices in the path of length h. The path length parameters (P_h), number of paths of length h (h = 0, 1, ..., 10) in the hydrogen-suppressed graph, are calculated using standard algorithms.

Information-theoretic TIs are calculated by the application of information theory on chemical graphs. An appropriate set A of n elements is derived from a molecular graph G depending on certain structural characteristics. On the basis of an equivalence relation defined on A, the set A is partitioned into disjoint subsets A_i of order n_i (i = 1, 2, ..., h; Σ_i n_i = n). A probability distribution is then assigned to the set of equivalence classes:

$$\begin{array}{cccc} A_1, & A_2, & \ldots, & A_h \\ p_1, & p_2, & \ldots, & p_h \end{array}$$

where p_i = n_i/n is the probability that a randomly selected element of A will occur in the ith subset. The mean information content of an element of A is defined by Shannon's relation:^31

$$IC = -\sum_{i=1}^{h} p_i \log_2 p_i \qquad (3)$$

The logarithm is taken at base 2 for measuring the information content in bits. The total information content of the set A is then n × IC.

To account for the chemical nature of vertices as well as their bonding pattern, Sarkar et al.^32 calculated the information content of chemical graphs on the basis of an equivalence relation where two atoms of the same element are considered equivalent if they possess an identical first-order topological neighborhood. Since properties of atoms or reaction centers are often modulated by stereoelectronic characteristics of distant neighbors, i.e., neighbors of neighbors, it was deemed
essential to extend this approach to account for higher order neighbors of vertices. This can be accomplished by defining open spheres for all vertices of a chemical graph. If r is any nonnegative real number and v is a vertex of the graph G, then the open sphere S(v, r) is defined as the set consisting of all vertices v_i in G such that d(v, v_i) < r. Therefore, S(v, 0) = ∅, S(v, r) = {v} for 0 < r < 1, and S(v, r) is the set consisting of v and all vertices v_i of G situated at unit distance from v, if 1 < r < 2. One can construct such open spheres for higher integral values of r. For a particular value of r, the collection of all such open spheres S(v, r), where v runs over the whole vertex set V, forms a neighborhood system of the vertices of G. A suitably defined equivalence relation can then partition V into disjoint subsets consisting of vertices that are topologically equivalent for rth-order neighborhood. Such an approach has been developed, and the information-theoretic indices calculated based on this idea are called indices of neighborhood symmetry.

In this method, chemicals are symbolized by weighted linear graphs. Two vertices u_0 and v_0 of a molecular graph are said to be equivalent with respect to rth-order neighborhood if and only if, corresponding to each path u_0, u_1, ..., u_r of length r, there is a distinct path v_0, v_1, ..., v_r of the same length such that the paths have similar edge weights, and both u_0 and v_0 are connected to the same number and type of atoms up to the rth-order bonded neighbors. The detailed equivalence relation has been described in earlier studies.

Once partitioning of the vertex set for a particular order of neighborhood is completed, IC_r is calculated by Eq. 3. Basak et al. defined another information-theoretic measure, structural information content (SIC_r), which is calculated as

$$SIC_r = IC_r / \log_2 n \qquad (4)$$

where IC_r is calculated from Eq. 3 and n is the total number of vertices of the graph. Another information-theoretic invariant, complementary information content (CIC_r), is defined as

$$CIC_r = \log_2 n - IC_r \qquad (5)$$

CIC_r represents the difference between the maximum possible complexity of a graph (where each vertex belongs to a separate equivalence class) and the realized topological information of a chemical species as defined by IC_r. In Figure 2, the calculation of IC_2, SIC_2, and CIC_2 is demonstrated for the labeled hydrogen-filled graph (G_2) of n-propanol.

The information-theoretic index on graph distance, I_D^W, is calculated from the distance matrix D(G) of a chemical graph G as follows:

$$I_D^W = W \log_2 W - \sum_{h} g_h\, h \log_2 h \qquad (6)$$
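The distance-based quantities defined in Eqs. 1, 2, and 6 can be checked numerically for the hydrogen-suppressed graph G_1 of n-propanol. The following is a minimal sketch, not the POLLY implementation: the adjacency-list encoding, the function names, and the restriction to a connected graph and to the first-order path connectivity index are assumptions made for illustration.

```python
import math
from collections import deque


def distance_matrix(adj):
    """All-pairs topological distances (edge counts) by breadth-first search.

    Assumes a connected, undirected graph given as an adjacency list.
    """
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def distance_counts(dist):
    """g_h: number of unordered vertex pairs separated by distance h."""
    counts = {}
    n = len(dist)
    for i in range(n):
        for j in range(i + 1, n):
            counts[dist[i][j]] = counts.get(dist[i][j], 0) + 1
    return counts


def wiener_index(dist):
    """Eq. 1: sum of the upper-triangular entries of the distance matrix."""
    return sum(h * g for h, g in distance_counts(dist).items())


def information_index(dist):
    """Eq. 6: W log2 W minus the sum over h of g_h * h * log2 h."""
    w = wiener_index(dist)
    g = distance_counts(dist)
    return w * math.log2(w) - sum(gh * h * math.log2(h) for h, gh in g.items())


def path_connectivity_order1(adj):
    """Eq. 2 for h = 1: sum over edges of (deg_i * deg_j) ** -1/2."""
    total = 0.0
    for u in range(len(adj)):
        for v in adj[u]:
            if u < v:  # count each edge once
                total += (len(adj[u]) * len(adj[v])) ** -0.5
    return total


# G_1: the four-vertex chain C-C-C-O (hydrogens suppressed), vertices 0..3.
G1 = [[1], [0, 2], [1, 3], [2]]
D = distance_matrix(G1)
W = wiener_index(D)
I_DW = information_index(D)
chi1 = path_connectivity_order1(G1)

print("D(G1) =", D)
print("W =", W)
print("I_D^W =", round(I_DW, 3), " mean =", round(I_DW / W, 3))
print("1-chi =", round(chi1, 4))
```

The computed distance matrix and the value W = 10 agree with the worked example given for D(G_1) in the text.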
Subsets of the 12 vertices of the hydrogen-filled graph G_2 of n-propanol, partitioned by second-order neighborhood:

Subset   Vertices   Probability
I        H1         1/12
II       H2-H3      2/12
III      H4-H5      2/12
IV       H6-H8      3/12
V        O1         1/12
VI       C1         1/12
VII      C2         1/12
VIII     C3         1/12

IC_2 = 5 × (1/12) × log2 12 + 2 × (2/12) × log2 (12/2) + (3/12) × log2 (12/3) = 2.855 bits
SIC_2 = IC_2 / log2 12 = 0.796 bits
CIC_2 = log2 12 − IC_2 = 0.730 bits

Figure 2. Calculation of the indices IC_2, SIC_2, and CIC_2 for the hydrogen-filled, labeled graph (G_2) of n-propanol.
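The arithmetic in Figure 2 can be reproduced directly from the sizes of the eight equivalence classes. This is a minimal sketch under the assumption that the partition of the 12 vertices is already known (class sizes 1, 2, 2, 3, 1, 1, 1, 1); it does not construct the neighborhood equivalence classes itself, and the function names are illustrative.

```python
import math


def information_content(class_sizes):
    """Shannon's relation: IC = -sum p_i log2 p_i with p_i = n_i / n."""
    n = sum(class_sizes)
    return -sum((k / n) * math.log2(k / n) for k in class_sizes)


def structural_information_content(class_sizes):
    """SIC_r = IC_r / log2 n."""
    return information_content(class_sizes) / math.log2(sum(class_sizes))


def complementary_information_content(class_sizes):
    """CIC_r = log2 n - IC_r."""
    return math.log2(sum(class_sizes)) - information_content(class_sizes)


# Second-order classes of n-propanol: subsets I..VIII as listed in Figure 2.
sizes = [1, 2, 2, 3, 1, 1, 1, 1]

ic2 = information_content(sizes)
sic2 = structural_information_content(sizes)
cic2 = complementary_information_content(sizes)

print(f"IC2  = {ic2:.3f}")
print(f"SIC2 = {sic2:.3f}")
print(f"CIC2 = {cic2:.3f}")
```

Rounded to three decimals, the three values match those quoted in the figure (2.855, 0.796, and 0.730).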
The mean information index, $\bar{I}_D^W$, is found by dividing the information index $I_D^W$ by W. The information-theoretic parameters defined on the distance matrix, $H^D$ and $H^V$, were calculated by the method of Raychaudhury et al.^23 Balaban defined a series of indices based on distance sums within the distance matrix for a chemical graph, which he designated as J indices.^27-29 These indices are highly discriminating with low degeneracy. Unlike W, the J indices have a range of values that is independent of molecular size. The general form of the J index calculation is as follows:

$$J = q(\mu + 1)^{-1} \sum_{i,j\ \text{edges}} (s_i s_j)^{-1/2} \qquad (7)$$

where the cyclomatic number μ (or number of rings in the graph) is μ = q − n + 1 with q edges and n vertices, s_i is the sum of the distances of atom i to all other atoms, and s_j is the sum of the distances of atom j to all other atoms. Variants were proposed by Balaban for incorporating information on bond type, relative electronegativities, and relative covalent radii.

C. Classification of the Indices
The set of 101 TIs was partitioned into two distinct subsets: topostructural indices and topochemical indices. Topostructural indices encode information about the adjacency and distances of atoms (vertices) in molecular structures (graphs) irrespective of atom type or factors such as hybridization states and number of core/valence electrons in individual atoms. Topochemical indices quantify information regarding specific chemical properties of the atoms comprising a molecule as well as the topology (connectivity of atoms). Topochemical indices are derived from weighted molecular graphs where each vertex (atom) is properly weighted with selected chemical/physical properties. These subsets are shown in Table 1.

D. Statistical Methods and Computation of Similarity

Data Reduction
Initially, all TIs were transformed by the natural logarithm of the index plus one. This was done because the scale of some TIs may be several orders of magnitude greater than that of other TIs. A principal component analysis (PCA) was used on the transformed indices to minimize intercorrelation of indices. The PCA was accomplished using the SAS procedure PRINCOMP.^34 The PCA produces linear combinations of the TIs, called principal components (PCs), which are derived from the correlation matrix. The first PC has the largest variance, or eigenvalue, of the linear combinations of TIs. Each subsequent PC explains the maximal index variance orthogonal to the previous PCs, eliminating any redundancies that could occur within the set of TIs. The maximum number of PCs generated is equal to the number of TIs available. For the purposes of this study, only PCs with eigenvalues greater than one were retained. A more detailed explanation of this approach has been provided in a previous study by Basak et al. These PCs were subsequently used in determining similarity scores as described below.

Similarity Measures

Intermolecular similarity was measured by the Euclidean distance (ED) within an n-dimensional space. This n-dimensional space consisted of orthogonal variables (PCs) derived from the TIs as described above. ED between molecules i and j is defined as

$$ED_{ij} = \left[\sum_{k=1}^{n} (D_{ik} - D_{jk})^2\right]^{1/2} \qquad (8)$$

where n is the number of dimensions or PCs retained from the PCA. D_ik and D_jk are the data values of the kth dimension for chemicals i and j, respectively.

K-Nearest-Neighbor Selection and Property Estimation

Following the quantification of intermolecular similarity of the 2926 chemicals, the K-nearest neighbors (K = 1-10, 15, 20, 25) were determined on the basis of ED. This procedure can be used to select structural analogues (neighbors) of a probe compound, or the neighbors can be used in property estimation. In estimating the normal boiling point of the probe compound, the mean observed normal boiling point of the K-nearest neighbors was used as the estimate, and the standard error (s) of the estimate was used to assess the efficacy of the set of indices.
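The data-reduction and estimation steps described above (log transform, PCA on the correlation matrix, retention of PCs with eigenvalues greater than one, Euclidean distance, and the K-nearest-neighbor mean) can be sketched as follows. This is an illustration with NumPy rather than SAS PRINCOMP. The synthetic index matrix, the exclusion of the probe from its own neighbor list, the use of standardized variables for the scores, and the root-mean-square form of the error are all assumptions, not details confirmed by the text.

```python
import numpy as np


def principal_components(X):
    """Scores on the PCs of log(x + 1)-transformed indices.

    PCs come from the correlation matrix; only those with eigenvalue > 1
    are kept. Assumes nonnegative indices and no constant columns.
    """
    Z = np.log(X + 1.0)
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardize columns
    R = np.corrcoef(Z, rowvar=False)                   # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    return Z @ eigvecs[:, keep], eigvals[keep]


def knn_estimates(scores, y, k):
    """Estimate y for each compound as the mean of its k nearest neighbors.

    Neighbors are ranked by Euclidean distance (Eq. 8) in PC space; the
    probe itself is excluded (an assumption of this sketch).
    """
    diff = scores[:, None, :] - scores[None, :, :]
    ed = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(ed, np.inf)
    nearest = np.argsort(ed, axis=1)[:, :k]
    return y[nearest].mean(axis=1)


# Synthetic stand-in for a table of indices: 200 "compounds", 6 "indices"
# driven by two hypothetical latent factors (size and branching).
rng = np.random.default_rng(0)
size = rng.uniform(1.0, 10.0, 200)
branch = rng.uniform(0.0, 3.0, 200)
X = np.column_stack([
    size,
    size ** 2,
    3.0 * size + rng.uniform(0.0, 0.5, 200),
    branch,
    branch * size,
    branch + rng.uniform(0.0, 0.5, 200),
])
y = 20.0 * size + 5.0 * branch + rng.normal(0.0, 2.0, 200)  # mock property

scores, eigvals = principal_components(X)
estimates = knn_estimates(scores, y, k=5)
r = np.corrcoef(y, estimates)[0, 1]
s = np.sqrt(np.mean((y - estimates) ** 2))

print("retained eigenvalues:", np.round(eigvals, 2))
print("r =", round(float(r), 3), " s =", round(float(s), 2))
```

Repeating the last three lines over a range of k values gives curves of r and s against the number of neighbors, which is the kind of comparison the study reports for its three index sets.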
III. RESULTS

A. Principal Component Analysis

From the PCA of the 40 topostructural indices, seven PCs with eigenvalues greater than one were retained. These seven PCs explained, cumulatively, 90.8% of the total variance within the TI data. Table 2 lists the eigenvalues of the seven PCs, the proportion of variance explained by each PC, the cumulative variance explained, and the three TIs most correlated with each individual PC. The PCA of the 61 topochemical indices resulted in the selection of ten PCs, all having eigenvalues greater than one. The ten PCs explain a total of 92.1% of the variance within the TI data. Table 3 presents a summary of the information regarding these ten PCs.
Table 2. Summary of Principal Component Analysis of 40 Topostructural Indices for 2926 Chemicals

PC   Eigenvalue   Proportion of Explained Variance (%)   Cumulative Explained Variance (%)
1    28.2         46.2                                   46.2
2    11.0         18.0                                   64.3
3    5.9          9.6                                    73.9
4    4.1          6.7                                    80.6
5    2.8          4.6                                    85.2
6    1.9          3.1                                    88.3
7    1.5          2.4                                    90.8

The entries of the column "Top Three Correlated Indices" are not legible in this copy.
Table 3. Summary of Principal Component Analysis of 61 Topochemical Indices for 2926 Chemicals

PC   Eigenvalue   Proportion of Explained Variance (%)   Cumulative Explained Variance (%)
1    20.4         33.5                                   33.5
2    10.8         17.8                                   51.2
3    8.1          13.3                                   64.6
4    6.1          9.9                                    74.5
5    3.0          5.0                                    79.5
6    2.4          3.9                                    83.4
7    1.7          2.8                                    86.2
8    1.4          2.2                                    88.4
9    1.2          2.0                                    90.4
10   1.1          1.8                                    92.1

The entries of the column "Top Three Correlated Indices" are not legible in this copy.

Table 4. Summary of Principal Component Analysis of 101 Topological Indices for 2926 Chemicals

PC   Eigenvalue   Proportion of Explained Variance (%)   Cumulative Explained Variance (%)
1    42.6         41.6                                   41.6
2    13.3         13.0                                   54.7
3    11.4         11.1                                   65.8
4    8.9          8.7                                    74.5
5    5.1          5.0                                    79.6
6    3.7          3.6                                    83.2
7    2.6          2.6                                    85.8
8    2.0          1.9                                    87.7
9    1.7          1.7                                    89.4
10   1.4          1.4                                    90.8
11   1.1          1.1                                    91.9
12   1.0          1.0                                    92.8

The entries of the column "Top Three Correlated Indices" are not legible in this copy.
Twelve PCs were retained from the PCA of the full set of 101 TIs. Each of these PCs had an eigenvalue greater than one and, cumulatively, they explained 92.8% of the variance within the full set of TIs. These PCs are summarized in Table 4.
Figure 3. The five analogues selected for the probe 3-methyl-4-chlorophenol using three molecular similarity spaces: topostructural, topochemical, and all indices. The numbers under the structures indicate the ranking of the analogues and the Euclidean distance to the probe.
Table 5. Comparison of the Three Sets of TIs and Their Derivative PCs for Prediction of Normal Boiling Point (°C) Using K-Nearest-Neighbors (n = 2926)

Indices                         K    r       s
Topostructural                  10   0.881   39.0
Topochemical                    6    0.883   38.6
Topostructural + topochemical   8    0.896   36.6
Figure 4. Pattern of (top) correlation (r) and (bottom) standard error (s) of the estimates according to the K-nearest-neighbor selection for 2926 normal boiling points using three molecular similarity spaces (topostructural indices, topochemical indices, and all indices), plotted against the number of neighbors (K).
B. Analogue Selection
Figure 3 shows an example of analogue selection using PCs to derive a Euclidean distance space. The first five analogues (neighbors) for the probe compound, 3-methyl-4-chlorophenol, are presented for each of the three similarity spaces. The analogues selected by the topostructural model show a repetition of the same skeletal structure, ignoring substituents, throughout the first five analogues. In the topochemical model and the full set model some variability in the skeletal structure arises (chemical analogues 2 and 5, full set analogue 4). Also of interest is the repetition of chemicals between the sets of analogues. While the ordering varies between the methods, the topostructural and topochemical models select two identical structures, the topostructural and the full set have three analogues in common, and the topochemical and full set select four of the same analogues. 2-Chloro-5-methylphenol appears in all three sets, while there are only three unique compounds (topostructural analogues 4 and 5, topochemical analogue 5).

C. K-Nearest-Neighbor Property Estimation

Figure 4 presents the correlation (r) and the standard error (s) of the prediction of the normal boiling points for the 2926 chemicals for the three groups of indices over the full range of K values examined (K = 1-10, 15, 20, 25). Table 5 shows the best normal boiling point model for each set of indices. The best boiling point estimates for all three sets were for K in the range of 6 to 10. The full set of indices gave the best result, although there was only a small difference between models.
IV. DISCUSSION

The purpose of this paper was to study the relative effectiveness of three similarity spaces derived from graph invariants in the selection of structural analogues and in the KNN-based estimation of properties. The similarity spaces were created using a PCA of calculated graph invariants. Tables 2-4 summarize the results of the PCA of the three sets of indices. The first PC is always correlated with indices that quantify molecular size. In the case of the topostructural indices, the second PC is most correlated with branching indices. In the case of PCs derived from either topochemical or the full set of topostructural and topochemical parameters, the first PC was strongly correlated with molecular size, while the second PC was highly associated with the molecular complexity indices. These results are in line with our earlier studies on different sets of chemicals.

All three spaces were used in the selection of five analogues of a particular structure (Figure 3). Perusal of the three sets of structures shows that there is a substantial degree of similarity among the three groups of five chemicals selected. It is interesting to note that all five nearest neighbors of the probe selected by the topostructural method had isomorphic skeletal graphs when hydrogen atoms are suppressed. For the two similarity spaces created by topochemical indices alone and the combined set of topostructural and topochemical indices, four of the five selected neighbors are common (Figure 3), although the ordering of the molecules is different. This shows that these two similarity methods are not intrinsically very different. Our earlier results showed that similarity methods derived from experimental physical properties, atom pairs, and TIs select very similar sets of analogues.

In the case of KNN-based estimation of boiling points of chemicals from their analogues, K was varied from 1 to 25. The best estimated value was obtained in the range of K = 6-10. This is in line with our earlier studies with different properties.

In conclusion, the three similarity spaces derived in this paper have reasonable power for selecting analogous molecules from a very diverse database of chemicals. The KNN-based estimation shows that selected analogues can be used for the estimation of boiling points of diverse chemicals if more accurate methods are not available.
ACKNOWLEDGMENTS

This is contribution number 161 from the Center for Water and the Environment of the Natural Resources Research Institute. Research reported herein was supported in part by grants F49620-94-1-0401 and F49620-96-1-0330 from the United States Air Force, a grant from Exxon Corporation, and the Structure-Activity Relationship Consortium (SARCON) of the Natural Resources Research Institute of the University of Minnesota.
REFERENCES

1. Johnson, M. A.; Maggiora, G. M., Eds. Concepts and Applications of Molecular Similarity; Wiley: New York, 1990.
2. Carbó, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185.
3. Bowen-Jenkins, P. E.; Cooper, D. L.; Richards, G. J. Phys. Chem. 1985, 89, 2195.
4. Basak, S. C.; Magnuson, V. R.; Niemi, G. J.; Regal, R. R. Discrete Appl. Math. 1988, 19, 17.
5. Basak, S. C.; Bertelsen, S.; Grunwald, G. J. Chem. Inf. Comput. Sci. 1994, 34, 270.
6. Rum, G.; Herndon, W. C. J. Am. Chem. Soc. 1991, 113, 9055.
7. Willett, P.; Winterman, V. Quant. Struct.-Act. Relat. 1986, 5, 18.
8. Wilkins, C. L.; Randić, M. Theor. Chim. Acta 1980, 58, 45.
9. Trinajstić, N. Chemical Graph Theory, Vols. I & II; CRC Press: Boca Raton, FL, 1983.
10. Basak, S. C.; Grunwald, G. D. Math. Model. Sci. Comput., in press.
11. Basak, S. C.; Grunwald, G. D. SAR QSAR Environ. Res. 1994, 2, 289.
12. Basak, S. C.; Grunwald, G. D. New J. Chem. 1995, 19, 231.
13. Basak, S. C.; Grunwald, G. D. J. Chem. Inf. Comput. Sci. 1995, 35, 366.
14. Basak, S. C.; Grunwald, G. D. SAR QSAR Environ. Res. 1995, 3, 265.
15. Basak, S. C.; Grunwald, G. D. Chemosphere 1995, 31, 2529.
16. Basak, S. C.; Gute, B. D.; Grunwald, G. D. Croat. Chim. Acta 1996, 69, 1159.
17. Lajiness, M. S. In Computational Chemical Graph Theory; Rouvray, D. H., Ed.; Nova Science Publishers: New York, 1990, p. 300.
18. Russom, C. L. Assessment Tools for the Evaluation of Risk (ASTER) v. 7.0; U.S. Environmental Protection Agency, 1992.
19. Wiener, H. J. Am. Chem. Soc. 1947, 69, 17.
20. Randić, M. J. Am. Chem. Soc. 1975, 97, 6609.
21. Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Research Studies Press: Hertfordshire, U.K., 1986.
22. Bonchev, D.; Trinajstić, N. J. Chem. Phys. 1977, 67, 4517.
23. Raychaudhury, C.; Ray, S. K.; Ghosh, J. J.; Roy, A. B.; Basak, S. C. J. Comput. Chem. 1984, 5, 581.
24. Basak, S. C.; Roy, A. B.; Ghosh, J. J. In Proceedings of the Second International Conference on Mathematical Modelling; Avula, X. J. R.; Bellman, R.; Luke, Y. L.; Rigler, A. K., Eds.; University of Missouri-Rolla, 1980, p. 851.
25. Basak, S. C.; Magnuson, V. R. Arzneim.-Forsch. Drug Res. 1983, 33, 501.
26. Roy, A. B.; Basak, S. C.; Harriss, D. K.; Magnuson, V. R. In Mathematical Modelling in Science and Technology; Avula, X. J. R.; Kalman, R. E.; Liapis, A. I.; Rodin, E. Y., Eds.; Pergamon Press: New York, 1984, p. 745.
27. Balaban, A. T. Chem. Phys. Lett. 1982, 89, 399.
28. Balaban, A. T. Pure Appl. Chem. 1983, 55, 199.
29. Balaban, A. T. Math. Chem. (MATCH) 1985, 21, 115.
30. Basak, S. C.; Harriss, D. K.; Magnuson, V. R. POLLY v. 2.3 (copyright University of Minnesota), 1988.
31. Shannon, C. E. Bell Syst. Tech. J. 1948, 27, 379.
32. Sarkar, R.; Roy, A. B.; Sarkar, R. K. Math. Biosci. 1978, 39, 299.
33. Magnuson, V. R.; Harriss, D. K.; Basak, S. C. In Studies in Physical and Theoretical Chemistry; King, R. B., Ed.; Elsevier: Amsterdam, 1983, p. 178.
34. SAS Institute Inc. In SAS/STAT User's Guide, Release 6.03 Edition; SAS Institute Inc.: Cary, NC, 1988, p. 751.
35. Basak, S. C.; Niemi, G. J.; Veith, G. D. J. Math. Chem. 1991, 7, 243.
36. Basak, S. C.; Magnuson, V. R.; Niemi, G. J.; Regal, R. R.; Veith, G. D. Math. Model. 1987, 8, 300.
OPTIMIZING HYBRID DENSITY FUNCTIONALS BY MEANS OF QUANTUM MOLECULAR SIMILARITY TECHNIQUES
Miquel Sola, Marta Fores, and Miquel Duran
Abstract
I. Introduction
II. Methodology
III. Results and Discussion
   A. The CO System
   B. The N2 System
   C. The LiF System
IV. Conclusions
Acknowledgment
References and Notes

Advances in Molecular Similarity, Volume 2, pages 187-203. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
ABSTRACT

The $a_0$, $a_x$, and $a_c$ semiempirical parameters of the original three-parameter method of Becke have been optimized by minimizing the difference between the density delivered by this method and the singles and doubles quadratic configuration interaction (QCISD) generalized density using quantum molecular similarity measures. The optimization is performed employing the relaxed geometry at each set of $a_0$, $a_x$, and $a_c$ parameters. This method has been applied to a series of small molecules (N2, CO, and LiF) that have experimentally known properties and molecular bonds of diverse degrees of ionicity and covalency. Results show that, at least in these diatomic molecules, it is possible to obtain a set of parameters that reproduces almost exactly the electron density obtained from the QCISD methodology. Especially interesting are the values obtained for the $a_0$ parameter, which reflect how much exact exchange should be included in the description of a particular system.
I. INTRODUCTION

The density-functional theory (DFT) of electronic structure has seen significant advances since its original formulation by Hohenberg, Kohn, and Sham. The use of density gradients in exchange and correlation corrections to the so-called local spin-density approximation (LSDA) has largely improved the computed molecular properties like geometries, vibrational frequencies, dipole moments, and particularly the molecular bond energies. Despite the unquestionable success of DFT in many fields, there is still a need to improve the current DFT schemes. For instance, the difficulties that DFT has in describing weak interactions like hydrogen bonds, charge transfer, and van der Waals complexes are well established, and they will be surmounted only with the advent of more precise exchange-correlation potentials. In this sense, the analysis of the performance and the improvement of different exchange-correlation functionals is a subject of great current interest.

So far, two approaches have been followed. On one hand, some authors have investigated nonlocal schemes of the generalized gradient approximation (GGA) which include the Laplacian of the electron density and the kinetic energy density. On the other hand, different hybrid schemes have been proposed in which the Hartree-Fock (HF) exact treatment of exchange is incorporated to some extent into the available functionals. These hybrid DFT methods, which are based on the adiabatic connection formula, attempt to improve the exchange part of the exchange-correlation functional. In fact, it has been shown that errors in molecular descriptions arise mainly from the treatment of exchange, which is the dominant part of the exchange-correlation energy. So, it is commonly argued that a partial inclusion of the exact exchange must improve the overall accuracy of the exchange-correlation functional. Indeed, results show that these hybrid methods yield energetic and structural results of an accuracy comparable to those obtained by methods that are much more demanding computationally.

Probably the most popular hybrid scheme for the exchange-correlation functional is Becke's three-parameter method, which was originally formulated as

$$E_{xc} = E_{xc}^{LSDA} + a_0 \left(E_x^{exact} - E_x^{LSDA}\right) + a_x \Delta E_x^{B88} + a_c \Delta E_c^{PW91} \qquad (1)$$

Here the $E_x^{exact}$ term corresponds to the HF exchange energy based on Kohn-Sham orbitals, while $E_{xc}^{LSDA}$ is the uniform electron gas exchange-correlation energy, $\Delta E_x^{B88}$ is Becke's 1988 gradient correction for exchange, and $\Delta E_c^{PW91}$ is Perdew and Wang's gradient correction for correlation. Commonly, this procedure is referred to as the B3PW91 method. The coefficients $a_0$, $a_x$, and $a_c$ were determined by Becke by fitting 56 atomization energies, 42 ionization potentials, and 8 proton affinities. The values obtained, which are in some sense semiempirical, were $a_0$ = 0.20, $a_x$ = 0.72, and $a_c$ = 0.81. It is worth noting that, in this fitting, single point energy calculations were performed at experimental geometries. Furthermore, the exact exchange and the gradient corrections were added in the evaluation of energies in a non-self-consistent fashion using converged LSDA densities.

Probably, the so-called B3LYP functional is even more popular than the B3PW91 method. In the Gaussian-94 implementation the expression used is similar to Eq. 1 with slight differences:

$$E_{xc} = E_x^{LSDA} + a_0 \left(E_x^{exact} - E_x^{LSDA}\right) + a_x \Delta E_x^{B88} + E_c^{VWN} + a_c \left(\Delta E_c^{LYP} - E_c^{VWN}\right) \qquad (2)$$

As the Lee-Yang-Parr (LYP) functional already contains a local part and a gradient correction, one has to remove the local part to obtain a coherent implementation. This can be done in an approximate way by subtracting $E_c^{VWN}$ from $\Delta E_c^{LYP}$. The method has been normally used with the same three parameters derived originally for the B3PW91 functional. Other common hybrid schemes are those based on Becke's half-and-half (BHandH) linear interpolation of the adiabatic connection integral, which takes the values of $a_0$ = 0.5 and $a_x$ = $a_c$ = 0 in Eq. 1.

If one has a high-quality electron density for a particular system, the DFT functionals, and in particular the three parameters of the B3LYP method, can be optimized so that one can obtain an electron density in better agreement with the reference density for the system considered. Ideally, the reference density should be taken from experimental results. In practice, however, it is very difficult to obtain a reliable electron density from a direct experimental measurement. In this case, an electron density obtained from a high-level ab initio method, such as the singles and doubles quadratic configuration interaction (QCISD), can be used as the reference density.

The aim of this work is to optimize the $a_0$, $a_x$, and $a_c$ semiempirical parameters of the B3LYP method by minimizing the differences between the density furnished by the B3LYP method and the QCISD generalized density, which is taken as the reference density. This minimization is performed by maximizing the quantum molecular similarity measure (QMSM) between the B3LYP and the QCISD densities. The expression used to compute the QMSM between two first-order electron density functions {$\rho_I$, $\rho_J$} is given by

$$Z_{IJ}(\Theta) = \iint \rho_I(\mathbf{r}_1)\, \Theta(\mathbf{r}_1, \mathbf{r}_2)\, \rho_J(\mathbf{r}_2)\, d\mathbf{r}_1\, d\mathbf{r}_2 \qquad (3)$$

$\Theta(\mathbf{r}_1, \mathbf{r}_2)$ being a positive definite operator depending on two-electron coordinates. Overlap-like QMSM are obtained when the $\Theta(\mathbf{r}_1, \mathbf{r}_2)$ operator is chosen as the Dirac delta function $\delta(\mathbf{r}_1 - \mathbf{r}_2)$. Use of the operator $1/r_{12}$ or $1/r_{12}^2$ gives rise to Coulomb-like QMSM and gravitational-like QMSM, respectively. In the particular case that $\rho_I = \rho_J$, one gets $Z_{II}$, which is the so-called self-QMSM. In previous works, it has been shown that comparison between densities through use of QMSM can provide detailed information on the similarities and differences between various methods. Such comparative studies should increase our understanding of the behavior of the different DFT schemes, and they should also provide hints toward improved treatments. In this work, we show how QMSM can be used to improve standard hybrid methodologies.
II. METHODOLOGY

Let us define the function difference:

$$\beta(\mathbf{r}) = \rho_{B3LYP}(\mathbf{r}) - \rho_{QCISD}(\mathbf{r}) \qquad (4)$$

If Becke's $a_0$, $a_x$, and $a_c$ parameters are allowed to change, the function difference $\beta(\mathbf{r})$ will depend on these three parameters. Then, one can obtain the set of parameters that yield the B3LYP density closest to the QCISD density by minimizing the quadratic error integral:

$$D = \int \beta^2(\mathbf{r})\, d\mathbf{r} \qquad (5)$$

which is obtained as a function of $a_0$, $a_x$, and $a_c$.

The Gaussian-94 program has been used to perform singles and doubles quadratic configuration interaction (QCISD) and DF (B3PW91, B3LYP, BHandHLYP, and BHandH) calculations. In all of these DF calculations, the nonlocal energy functional is incorporated into the optimization of both electronic and nuclear degrees of freedom self-consistently. B3LYP and B3PW91 calculations with $a_0$, $a_x$, and $a_c$ coefficients different from the standard coefficients provided by Becke have been performed with Gaussian-94 using internal options. To minimize basis set effects, which may produce relevant QMSM differences, the 6-311++G** basis set has been used throughout. All calculations have been done within the restricted Hartree-Fock formalism except ionization potentials, which have been calculated with the unrestricted Hartree-Fock method.
Optimizing Hybrid Density Functionals
QMSM have been obtained from the Gaussian-94 electron densities using the Messem program developed in our group.^^ For QCISD calculations, generalized densities^^ have been used. Likewise, the DF electron densities have been calculated from self-consistently converged Kohn-Sham orbitals. All QMSM are Coulomb-like, i.e., $\Theta(\mathbf{r}_1, \mathbf{r}_2) = 1/r_{12}$ in Eq. 3, except self-QMSM, which are overlap-like. Gradients of the D function in Eq. 5 with respect to the $a_0$, $a_x$, and $a_c$ parameters have been computed numerically, and the optimization of this function has been performed using the quasi-Newton Davidon-Fletcher-Powell (DFP) algorithm.^^
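Scheme 1. Program structure chart: INITIALIZE $a_0$, $a_x$, and $a_c$; OBTAIN $D(a_0, a_x, a_c)$; COMPUTE $\partial D/\partial a_k$ (OBTAIN $D(a_0, a_x, a_c)$ at $a_k = a_k + \varepsilon_i$ and $a_k = a_k - \varepsilon_i$, $i = i + 1$); Converged? If yes, final $a_0$, $a_x$, and $a_c$; if no, DFP OPTIMIZATION gives a NEW SET OF $a_0$, $a_x$, $a_c$.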
MIQUEL SOLA, MARTA FORES, and MIQUEL DURAN
Scheme 1 depicts the program structure chart of the present algorithm. In this chart, shaded rectangles indicate calculations that use the Gaussian-94 program for geometry optimization and for obtaining the electron density. Thus, starting from the initial set of $a_0$, $a_x$, and $a_c$ parameters given by Becke, the QCISD and B3LYP electron densities are computed allowing complete electronic and nuclear relaxation of the system. With these two densities and using the Messem program one can obtain the value of the D function through Eq. 5. After that, the gradients of the D function with respect to the $a_0$, $a_x$, and $a_c$ coefficients are computed numerically using the central difference approximation, i.e.,

$$\frac{\partial D}{\partial a_k} = \frac{D(a_k + \delta) - D(a_k - \delta)}{2\delta} \qquad (6)$$

with $\delta$ = 10"^. Finally, after checking the convergence of the process, a DFP optimization step is performed. We have considered the optimization converged when the norm of the gradient vector was below 2 x 10"^ au. Throughout this contribution, the final converged $a_0$, $a_x$, and $a_c$ parameters will be referred to as the QCISD-density-optimized parameters. Bader topological analyses^^ and maps of electron density differences were carried out with the Electra program developed in our group.^^
III. RESULTS AND DISCUSSION

To test the methodology reported above, a study of the CO, N2, and LiF systems has been carried out. We have selected these three molecules because, from the point of view of Bader's atoms-in-molecules theory,^^ they exhibit different bonding nature: LiF is a typical case of closed-shell ionic interaction; N2 is an appropriate example of a molecule with a shared interaction; and CO is a well-known case of an intermediate interaction. Analyses of B3LYP electron densities in these molecules can reveal the behavior of the $a_0$, $a_x$, and $a_c$ parameters in these different types of bonding.

A. The CO System
Before starting the optimization, it is interesting to represent the quadratic error integral as a function of the $a_0$, $a_x$, and $a_c$ parameters to obtain information about possible multiple minima on this surface. Figures 1-3 plot the quadratic error integral as a function of two of Becke's parameters, while the remaining parameter is kept frozen. For all three surfaces we have computed 100 points, changing each variable parameter from 0.1 to 1 in steps of 0.1. During the calculation of these surfaces, the geometry of the CO molecule has been kept frozen at the QCISD-optimized geometry.
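The surface scan described above (two parameters varied from 0.1 to 1 in steps of 0.1, the third frozen) can be sketched as below. The function `toy_error_integral` is a hypothetical stand-in for the real quadratic error integral, and the helper name and argument layout are assumptions made for this example.

```python
def toy_error_integral(params):
    # Hypothetical stand-in for D(a0, ax, ac); not the real error surface of CO.
    a0, ax, ac = params
    return (a0 + ax - 1.0) ** 2 + 0.5 * (a0 - 0.1) ** 2 + 0.5 * (ac - 0.9) ** 2

def scan_surface(func, frozen_index, frozen_value, start=0.1, stop=1.0, step=0.1):
    """Evaluate func on a square grid of the two free parameters.

    Parameters are ordered (a0, ax, ac); frozen_index selects the one kept fixed.
    Returns a dict mapping (u, v) grid coordinates to the function value."""
    n = int(round((stop - start) / step)) + 1
    values = [round(start + k * step, 10) for k in range(n)]
    free = [k for k in range(3) if k != frozen_index]
    surface = {}
    for u in values:
        for v in values:
            params = [0.0, 0.0, 0.0]
            params[frozen_index] = frozen_value
            params[free[0]] = u
            params[free[1]] = v
            surface[(u, v)] = func(params)
    return surface

if __name__ == "__main__":
    # Analogue of Figure 1: a0 and ax varied, ac frozen at 0.81.
    surface = scan_surface(toy_error_integral, frozen_index=2, frozen_value=0.81)
    best = min(surface, key=surface.get)
    print(len(surface), "points; lowest grid value at (a0, ax) =", best)
```

A coarse scan of this kind is mainly useful for spotting multiple minima or flat valleys before a local optimizer is started.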
Figure 1. Plot of the quadratic error integral (in au) as a function of $1 - a_0$ and $a_x$ computed at the QCISD/6-311++G** optimized geometry. In this graph $a_c$ has been kept frozen at 0.81.
Figure 1 depicts the quadratic error integral as a function of the $a_0$ and $a_x$ parameters. The $a_c$ coefficient has been kept frozen at 0.81, as suggested by Becke in the B3PW91 method.^ As can be seen, the quadratic error integral is minimized approximately along the $1 - a_0 = a_x$ line. This is not surprising, and indicates that the presence of the $E_x^{exact}$ term reduces the need for the nonlocal correction $\Delta E_x^{B88}$ to approximately the same extent. For the same reason, Becke in his original work^ pointed out that $a_x$ should be lower than 1. The value of $a_0$ is especially interesting because it reflects how much exact exchange should be included in the description of a particular system. What is more striking is that the value of the $a_0$ coefficient does not have much influence as long as one takes $a_x = 1 - a_0$. In fact, along the $1 - a_0 = a_x$ line, the quadratic error integral is almost constant, reaching a minimum for ca. $a_0 = 0.5$. Interestingly, a recent simplification sets $a_x = 1 - a_0$ and $a_c = 1$, with $a_0 = 0.16$ or $a_0 = 0.28$,^^ and thus a single parameter is needed to define this new hybrid DFT method. Further, in a recent paper, Perdew et al.^^ have provided a qualitative physical explanation of this relationship and have indicated that the $a_0$ parameter must have a value of ca. 0.25. Our result reinforces the existence of this simple correlation between the $a_0$ and $a_x$ parameters.
Figure 2. Plot of the quadratic error integral (in au) as a function of $1 - a_0$ and $a_c$ computed at the QCISD/6-311++G** optimized geometry. In this graph $a_x$ has been kept frozen at 0.72.
Figure 2 illustrates the quadratic error integral as a function of the $a_0$ and $a_c$ coefficients; $a_x$ has been kept fixed at 0.72. Interestingly, this surface has a small dependence on the $a_c$ coefficient. The minimum is found approximately along the line $1 - a_0 = a_x$ (0.72). Finally, in Figure 3 the quadratic error integral is represented as a function of $a_x$ and $a_c$, for the $a_0$ value of 0.2. Again, the surface has a small dependence on the $a_c$ parameter, and presents a minimum near the $a_x = 1 - a_0$ (0.8) value. Analysis of the shape of these plots shows that in all cases there is only one minimum on the surface, and so multiple minima problems are not expected. However, the presence of flat regions on the surface around the minima may complicate the optimization. The QCISD-density-optimized Becke parameters obtained for the CO molecule are 0.099, 0.857, and 0.935 for the $a_0$, $a_x$, and $a_c$ parameters, respectively. As expected, the $a_x$ value is smaller than 1 and follows approximately the $1 - a_0$ relationship. The results obtained for the different hybrid schemes analyzed, together with some experimental quantities, are presented in Table 1.
Figure 3. Plot of the quadratic error integral (in au) as a function of $a_x$ and $a_c$ computed at the QCISD/6-311++G** optimized geometry. In this graph $a_0$ has been kept frozen at 0.20.
Comparison between the results obtained from the different hybrid methodologies studied and those yielded by QCISD and experiment reveals that the largest differences appear in the two functionals that make use of Becke's half-and-half exchange functional.^ This is especially true in the case of the CO dipole moment, which has the wrong direction when computed with the BHandH procedure. However, it is well known that most DFT functionals succeed in computing the sign of dipole moments with values close to zero, like those of CO and NO,^^ and in fact the rest of the functionals analyzed give the correct sign for the CO dipole moment. The two B3LYP schemes analyzed here (the so-called B3LYPBE, which uses the standard three parameters of Becke, and the so-called B3LYPNP, which employs the QCISD-density-optimized parameters) give values quite close to the QCISD and experimental results. The proportion of exact exchange in the B3LYP method with QCISD-optimized parameters is lower (9.9%) than in the standard B3LYP method (20%). The B3LYP method with QCISD-density-optimized parameters reproduces the QCISD geometry. Also, there is a significant improvement from B3LYPBE to
Table 1. CO Experimental and B3LYP,^a BHandHLYP, BHandH, and QCISD 6-311++G** Optimized Bond Lengths,^b Harmonic Frequencies,^c Dipole Moments,^d Ionization Potentials,^e Overlap-like Self-QMSM,^f and Density^f and Laplacian of the Electron Density^f at the Bond Critical Point

              BHandH      BHandHLYP   B3LYPBE^a   B3LYPNP^a   QCISD       Exp.^g
r(C-O)        1.111       1.114       1.127       1.133       1.133       1.128
ν(C-O)        2354.2      2325.7      2210.3      2162.1      2177.2      2170
μ             -0.019      0.027       0.072       0.104       0.090       0.112
IP            14.00       14.18       14.20       14.04       13.71       14.05
Z_II          112.34402   113.39620   113.23945   113.27957   113.41322   —
ρ(r_c)        0.5136      0.5085      0.4930      0.4860      0.4776      —
∇²ρ(r_c)      0.7889      0.7755      0.5499      0.4659      0.7854      —

Notes: ^a B3LYPBE refers to the standard B3LYP procedure, while B3LYPNP indicates the B3LYP method with the $a_0$, $a_x$, and $a_c$ QCISD-density-optimized parameters. ^b In Å. ^c All frequencies are harmonic and are reported in cm^-1. ^d In debyes. ^e In eV. ^f In au. ^g From Ref. 3a.
B3LYPNP as far as harmonic frequency, dipole moment, and ionization potential are concerned. In particular, the improvement in the ionization potential was unexpected because Becke's original three parameters were optimized so as to reproduce experimental ionization potentials among other energy-related properties.^ As a whole one may conclude that the density afforded by the QCISD-optimized parameters is closer to the QCISD density than the B3LYPBE density. The good quality of the B3LYPNP density is also demonstrated by the value of the density at the bond critical point,^^ which is the closest to the QCISD one. Indeed, Figure 4 reveals the small differences between the B3LYPNP and QCISD densities. The maximum difference is 5.781 au (2.18% of error) and is located at the oxygen nucleus. There is a slight reduction of density at the nuclei when going from QCISD to B3LYPNP, which is transferred to the bonding region and around the nuclei. As already reported,^^ the LSDA electron density is too diffuse in the region near the nuclei. Nonlocal corrections pull the density toward the regions close to the nuclei, although, as can be seen in Figure 4, they only partially correct the LSDA deficiency. This B3LYPNP density reduction at the nuclei when compared with the QCISD density is also reflected by the value of the self-QMSM measure in Table 1. As pointed out earlier,^^ quantum molecular overlap self-similarity measures are good indexes to estimate the concentration of electronic charge in molecules: the larger the self-QMSM, the more concentrated the charge density. Values of self-QMSM in Table 1 already indicate that the QCISD density is slightly more concentrated than the B3LYPNP
Figure 4. Plot of the 6-311++G** electron density difference comparing the density obtained from the QCISD methodology with that computed at the B3LYPNP level, for the CO molecule at their optimized geometries. In this map the carbon nucleus is on top. The minimum contour is $1 \times 10^{-4}$ au and increases to 2, 4, 8, 20, 40, 80, ... $\times 10^{-4}$ au. Dashed lines correspond to negative values, that is, points where the QCISD density is larger.
density. Further, the B3LYPNP Laplacian of the density at the bond critical point is the worst one when compared with the QCISD result. Remarkably, most DFT schemes provide strikingly low values for the Laplacian of the electron density at the bond critical point of the CO molecule as compared with the QCISD values.^^ However, because the CO electron density distribution is quite complex, one cannot infer from this result that B3LYP Laplacian values are defective. Finally, as Becke's three parameters were originally optimized together with the PW91 nonlocal correlation correction (B3PW91),^ we have investigated how the B3LYP QCISD-density-optimized parameters are modified when the nonlocal correlation correction is changed from $E_c^{LYP} - E_c^{VWN}$ to $\Delta E_c^{PW91}$. The optimization of the $a_0$, $a_x$, and $a_c$ parameters with the B3PW91 functional has yielded the values of 0.708, 0.853, and 0.964, respectively. The differences with the parameters
Table 2. N2 Experimental and B3LYP,^a BHandHLYP, BHandH, and QCISD 6-311++G** Optimized Bond Lengths,^b Harmonic Frequencies,^c Ionization Potentials,^d Overlap-like Self-QMSM,^e and Density^e and Laplacian of the Electron Density^e at the Bond Critical Point

              BHandH      BHandHLYP   B3LYPBE^a   B3LYPNP^a   QCISD       Exp.^f
r(N-N)        1.080       1.082       1.095       1.104       1.104       1.098
ν(N-N)        2582.4      2441.4      2607.6      2353.7      2387.4      2360
IP            16.03       16.15       15.88       15.68       15.37       15.58
Z_II          104.39067   105.41418   105.25520   104.96788   105.40628   —
ρ(r_c)        0.7119      0.7090      0.6827      0.6657      0.6649      —
∇²ρ(r_c)      -2.8776     -2.8544     -2.5843     -2.4135     -2.4425     —

Notes: ^a B3LYPBE refers to the standard B3LYP procedure, while B3LYPNP indicates the B3LYP method with the $a_0$, $a_x$, and $a_c$ QCISD-density-optimized parameters. ^b In Å. ^c All frequencies are harmonic and are reported in cm^-1. ^d In eV. ^e In au. ^f From Ref. 3a.
Figure 5. Plot of the 6-311++G** electron density difference comparing the density obtained from the QCISD methodology with that computed at the B3LYPNP level, for the N2 molecule at their optimized geometries. The minimum contour is $1 \times 10^{-4}$ au and increases to 2, 4, 8, 20, 40, 80, ... $\times 10^{-4}$ au. Dashed lines correspond to negative values, that is, points where the QCISD density is larger.
obtained using the B3LYP functional are quite small. As expected, the largest difference corresponds to the $a_c$ parameter, which has changed by 0.029 unit. The small change in the parameters observed when going from B3LYP to B3PW91 justifies the use of Becke's B3PW91 optimized parameters in B3LYP calculations.

B. The N2 System
For the N2 molecule we have obtained $a_0 = 0.021$, $a_x = 0.761$, and $a_c = 0.704$. The results calculated for the different hybrid schemes analyzed, together with some experimental quantities, are shown in Table 2. Here too, the largest errors are provided by methods that use Becke's half-and-half exchange. The B3LYP method with standard parameters yields excellent numbers. However, the B3LYP method with the QCISD-optimized parameters provides a significantly better harmonic frequency and ionization potential. Again, the values of the density and the Laplacian of the density at the bond critical point show that the B3LYPNP method yields a density closer to the QCISD density than does B3LYPBE. The good quality of the B3LYPNP density can also be seen in Figure 5. The largest difference is now 0.337 au (0.20% of error) and is located at the nitrogen nuclei. All B3LYPNP computed values are closer than B3LYPBE values to QCISD results, except the self-QMSM, although in this case this is probably related to the shorter N-N bond length when computed with the
Table 3. LiF Experimental and B3LYP,^a BHandHLYP, BHandH, and QCISD 6-311++G** Optimized Bond Lengths,^b Harmonic Frequencies,^c Dipole Moments,^d Ionization Potentials,^e Overlap-like Self-QMSM,^f and Density^f and Laplacian of the Electron Density^f at the Bond Critical Point

              BHandH      BHandHLYP   B3LYPBE^a   B3LYPNP^a   QCISD       Exp.^g
r(Li-F)       1.553       1.568       1.582       1.595       1.595       1.564
ν(Li-F)       952.8       925.4       903.9       880.6       886.8       914
μ             6.339       6.431       6.355       6.317       6.544       6.33
IP            10.68       11.74       12.23       12.48       11.55       —
Z_II          122.01616   122.99495   122.87479   123.02987   123.11443   —
ρ(r_c)        0.0765      0.0721      0.0697      0.0670      0.0663      —
∇²ρ(r_c)      0.7654      0.7259      0.6785      0.6415      0.6550      —

Notes: ^a B3LYPBE refers to the standard B3LYP procedure, while B3LYPNP indicates the B3LYP method with the $a_0$, $a_x$, and $a_c$ QCISD-density-optimized parameters. ^b In Å. ^c All frequencies are harmonic and are reported in cm^-1. ^d In debyes. ^e In eV. ^f In au. ^g From Ref. 3a.
Figure 6. Plot of the 6-311++G** electron density difference comparing the density obtained from the QCISD methodology with that computed at the B3LYPNP level, for the LiF molecule at their optimized geometries. In this map the fluorine nucleus is on top. The minimum contour is $1 \times 10^{-4}$ au and increases to 2, 4, 8, 20, 40, 80, . . .
B3LYPBE methodology. A shorter N-N bond length produces an increase in the concentration of electron density. As for the CO case, the B3LYPNP ionization potential is closer to the QCISD and experimental ionization potentials than the B3LYPBE value.

C. The LiF System
The QCISD-density-optimized Becke parameters obtained for the LiF system are $a_0 = 0.029$, $a_x = 1.003$, and $a_c = 0.730$. The small value of $a_0$ allows $a_x$ to take a value slightly larger than 1, which nevertheless follows approximately the $1 - a_0$ relationship. Table 3 presents the results obtained for the different hybrid schemes analyzed. Experimental results are also included. For this molecule, all methods perform similarly when compared with the available experimental results. The most interesting fact again is the important
similarity between the B3LYPNP and QCISD densities, which can also be appreciated in Figure 6. Both the electron density and the Laplacian of the density, as well as the self-QMSM, provided by the B3LYPNP method are the closest ones to the QCISD values.
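The closeness of the B3LYPNP results to QCISD can be checked with simple arithmetic on the LiF values. The sketch below uses only the bond length, harmonic frequency, and bond-critical-point rows of Table 3; the dictionary layout and helper name are assumptions made for this example.

```python
# LiF values read from Table 3 (B3LYPBE, B3LYPNP, QCISD columns only).
lif = {
    "bond length (A)":            {"B3LYPBE": 1.582,  "B3LYPNP": 1.595,  "QCISD": 1.595},
    "harmonic frequency (cm^-1)": {"B3LYPBE": 903.9,  "B3LYPNP": 880.6,  "QCISD": 886.8},
    "density at BCP (au)":        {"B3LYPBE": 0.0697, "B3LYPNP": 0.0670, "QCISD": 0.0663},
    "Laplacian at BCP (au)":      {"B3LYPBE": 0.6785, "B3LYPNP": 0.6415, "QCISD": 0.6550},
}

def deviations_from_qcisd(table):
    """Absolute deviation of each B3LYP variant from the QCISD reference."""
    result = {}
    for name, row in table.items():
        result[name] = {
            method: abs(row[method] - row["QCISD"])
            for method in ("B3LYPBE", "B3LYPNP")
        }
    return result

if __name__ == "__main__":
    for name, dev in deviations_from_qcisd(lif).items():
        closer = min(dev, key=dev.get)
        print(f"{name}: BE {dev['B3LYPBE']:.4f}, NP {dev['B3LYPNP']:.4f}, closer: {closer}")
```

For these four properties the density-optimized parameters give the smaller deviation in every case.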
IV. CONCLUSIONS

In this work, we have analyzed a methodology that allows one to obtain a set of $a_0$, $a_x$, and $a_c$ parameters that properly reproduces any reference density. In particular, the QCISD density has been chosen as the reference density because it yields reliable geometries and energetic parameters, although in principle any density could be used. In the three systems studied, the proportion of exact exchange is found to be smaller than 20%, which is the percentage used in the standard B3PW91 method, although we expect that in more correlated systems or during bond breaking processes this percentage may be larger. For both CO and LiF, the relationship $a_x = 1 - a_0$ is essentially preserved, while for N2 the $a_x$ value is somewhat smaller than $1 - a_0$. The QCISD-optimized value of the $a_c$ parameter ranges from 0.935 in CO to 0.704 in N2. Quadratic error integral surfaces for the CO molecule have provided evidence that this parameter ($a_c$) does not play a leading role in defining hybrid schemes.

Despite the conclusions summarized above, the examples studied in this contribution do not provide consistent suggestions and clear conclusions about which are the best QCISD-density-optimized Becke parameters. Thus, to draw general conclusions that can be applied to most molecular systems, we will have to analyze a larger set of molecules. Furthermore, we think that the present algorithm would provide more interesting results when applied to the description of chemical species with strong correlation or which are not well described by the current functionals, like weakly interacting complexes and excited states. In this case, the application of this methodology can offer the most convenient parameters. Research in this direction is currently under way in our laboratory.
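The statements about the $a_x = 1 - a_0$ relationship and the proportion of exact exchange can be verified directly from the optimized parameters quoted in Section III. The dictionary and function names below are assumptions made for this short check.

```python
# QCISD-density-optimized (a0, ax, ac) parameters quoted in Section III.
optimized = {
    "CO":  (0.099, 0.857, 0.935),
    "N2":  (0.021, 0.761, 0.704),
    "LiF": (0.029, 1.003, 0.730),
}

def deviation_from_line(a0, ax):
    """Signed distance of ax from the a_x = 1 - a_0 relationship."""
    return ax - (1.0 - a0)

if __name__ == "__main__":
    for molecule, (a0, ax, ac) in optimized.items():
        print(f"{molecule}: exact exchange {100 * a0:.1f}%, "
              f"ax - (1 - a0) = {deviation_from_line(a0, ax):+.3f}, ac = {ac}")
```

The deviation is a few hundredths for CO and LiF and roughly 0.2 for N2, in line with the text.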
ACKNOWLEDGMENT

This work has been funded through Spanish DGICYT Project No. PB95-0762.
REFERENCES AND NOTES

1. (a) Parr, R. G.; Yang, W. Density-Functional Theory of Atoms and Molecules; Oxford University Press: London, 1989. (b) Ziegler, T. Chem. Rev. 1991, 91, 651. (c) Ziegler, T. Can. J. Chem. 1995, 73, 743.
2. (a) Hohenberg, P.; Kohn, W. Phys. Rev. B 1964, 136, 864. (b) Kohn, W.; Sham, L. J. Phys. Rev. A 1965, 140, 1133.
3. (a) Johnson, B. G.; Gill, P. M. W.; Pople, J. A. J. Chem. Phys. 1993, 98, 5612. (b) Murray, C. W.; Laming, G. J.; Handy, N. C.; Amos, R. D. Chem. Phys. Lett. 1992, 199, 551.
4. Andzelm, J.; Wimmer, E. Physica B 1991, 172, 307.
5. (a) Stanton, R. V.; Merz, K. M., Jr. J. Chem. Phys. 1994, 100, 434. (b) van Leeuwen, R.; Baerends, E. J. Int. J. Quantum Chem. 1994, 52, 711. (c) Becke, A. D. J. Chem. Phys. 1986, 84, 4524. (d) Andzelm, J.; Wimmer, E. J. Chem. Phys. 1992, 96, 1280. (e) Schmiedekamp, A. M.; Topol, I. A.; Burt, S. K.; Razafinjanahary, H.; Chermette, H.; Pfaltzgraff, T.; Michejda, C. J. J. Comp. Chem. 1994, 15, 875.
6. (a) Ruiz, E.; Salahub, D. R.; Vela, A. J. Am. Chem. Soc. 1995, 117, 1141. (b) Kim, K.; Jordan, K. D. J. Chem. Phys. 1994, 98, 10089. (c) Del Bene, J. E.; Person, W. B.; Szczepaniak, K. J. Phys. Chem. 1995, 99, 10705. (d) Pudzianowski, A. J. Phys. Chem. 1996, 100, 4781.
7. Proynov, E. I.; Vela, A.; Salahub, D. R. Chem. Phys. Lett. 1994, 230, 419.
8. Becke, A. D. J. Chem. Phys. 1993, 98, 1372.
9. Becke, A. D. J. Chem. Phys. 1993, 98, 5648.
10. Becke, A. D. J. Chem. Phys. 1996, 104, 1040.
11. Gorling, A.; Levy, M. J. Chem. Phys. 1997, 106, 2675.
12. Gritsenko, O. V.; van Leeuwen, R.; Baerends, E. J. Int. J. Quantum Chem. 1996, 30, 163.
13. (a) Harris, J.; Jones, R. O. J. Phys. F 1974, 4, 1170. (b) Gunnarsson, O.; Lunqvist, B. I. Phys. Rev. B 1976, 46, 6671. (c) Langreth, D. C.; Perdew, J. P. Phys. Rev. B 1977, 15, 2884. (d) Harris, J. Phys. Rev. A 1984, 29, 1648.
14. Gunnarsson, O.; Jones, R. O. Phys. Rev. B 1985, 31, 7588.
15. (a) Bauschlicher, C. W., Jr.; Partridge, H. Chem. Phys. Lett. 1995, 240, 533. (b) Wang, J.; Eriksson, L. A.; Johnson, B. G.; Boyd, R. J. J. Phys. Chem. 1996, 100, 5274. (c) Torrent, M.; Duran, M.; Sola, M. J. Mol. Struct. (Theochem) 1996, 362, 163. (d) Barone, V.; Adamo, C. Chem. Phys. Lett. 1994, 224, 432. (e) Barone, V. Chem. Phys. Lett. 1994, 226, 392. (f) Bauschlicher, C. W., Jr. Chem. Phys. Lett. 1995, 246, 40. (g) Hack, M. D.; Maclagan, R. G. A. R.; Scuseria, G. E. J. Chem. Phys. 1996, 104, 6628. (h) Jursic, B. S. J. Chem. Soc. Perkin Trans. II 1996, 697. (i) Jursic, B. S. J. Chem. Soc. Perkin Trans. II 1997, 637. (j) De Proft, P.; Geerlings, P. J. Chem. Phys. 1997, 106, 3270.
16. Becke, A. D. Phys. Rev. A 1988, 38, 3098.
17. Perdew, J. P.; Chevary, J. A.; Vosko, S. H.; Jackson, K. A.; Pederson, M. R.; Singh, D. J.; Fiolhais, C. Phys. Rev. B 1992, 46, 6671.
18. Adamo, C.; Lelj, F. J. Chem. Phys. 1995, 103, 10605.
19. Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Gill, P. M. W.; Johnson, B. G.; Robb, M. A.; Cheeseman, J. R.; Keith, T.; Petersson, G. A.; Montgomery, J. A.; Raghavachari, K.; Al-Laham, M. A.; Zakrzewski, V. G.; Ortiz, J. V.; Foresman, J. B.; Peng, C. Y.; Ayala, P. Y.; Chen, W.; Wong, M. W.; Andres, J. L.; Replogle, E. S.; Gomperts, R.; Martin, R. L.; Fox, D. J.; Binkley, J. S.; Defrees, D. J.; Baker, J.; Stewart, J. J. P.; Head-Gordon, M.; Gonzalez, C.; Pople, J. A. Gaussian 94, Revision A.1; Gaussian, Inc.: Pittsburgh, PA, 1995.
20. (a) Lee, C.; Yang, W.; Parr, R. G. Phys. Rev. B 1988, 37, 785. (b) Miehlich, B.; Savin, A.; Stoll, H.; Preuss, H. Chem. Phys. Lett. 1989, 157, 200.
21. Pople, J. A.; Head-Gordon, M.; Raghavachari, K. J. Chem. Phys. 1987, 87, 5968.
22. (a) Handy, N. C.; Schaefer, H. F., III. J. Chem. Phys. 1984, 81, 5031. (b) Wiberg, K. B.; Hadad, C. M.; LePage, T. J.; Breneman, C. M.; Frisch, M. J. J. Phys. Chem. 1992, 96, 671.
23. (a) Carbó, R.; Arnau, M.; Leyda, L. Int. J. Quantum Chem. 1980, 17, 1185. (b) Carbó, R.; Calabuig, B. Int. J. Quantum Chem. 1992, 42, 1681. (c) Carbó, R.; Calabuig, B. Comp. Phys. Commun. 1989, 55, 117. (d) Carbó, R.; Calabuig, B. J. Mol. Struct. (Theochem) 1992, 254, 517. (e) Carbó, R.; Calabuig, B. Int. J. Quantum Chem. 1992, 42, 1695. (f) Carbó, R.; Calabuig, B.; Vera, L.; Besalu, E. Adv. Quantum Chem. 1994, 25, 253.
24. Besalu, E.; Carbó, R.; Mestres, J.; Sola, M. Molecular Similarity. Topics in Current Chemistry Series; Springer-Verlag: Berlin, 1995; Vol. 173, pp. 31-62.
25. Sola, M.; Mestres, J.; Oliva, J. M.; Duran, M.; Carbó, R. Int. J. Quantum Chem. 1996, 55, 361.
26. (a) Sola, M.; Mestres, J.; Carbó, R.; Duran, M. J. Chem. Phys. 1996, 104, 606. (b) Torrent, M.; Duran, M.; Sola, M. Adv. Mol. Sim. 1996, 1, 167.
27. In this case, the route must contain the following internal options: #B3LYP IOP(5/45 = $P_1 P_2$) IOP(5/46 = $P_3 P_4$) IOP(5/47 = $P_5 P_6$). The values $P_1$ = 1000, $P_2$ = 0200, $P_3$ = 0720, $P_4$ = 0800, $P_5$ = 1000, and $P_6$ = 0810 reproduce the B3LYP results. The formula for the B3LYP energy is $E_{B3LYP} = p_2 E_x^{HF} + p_1(p_4 E_x^{Slater} + p_3 \Delta E_x^{B88}) + p_6 E_c^{VWN} + p_5(E_c^{LYP} - E_c^{VWN})$, where $p_i = P_i/1000$, for i = 1 to 6. Thus, Becke's parameters $a_0$, $a_x$, and $a_c$ are $p_2$ or $1 - p_4$, $p_3$, and $p_5$, respectively. The $p_1$ parameter has been fixed at 1, while $p_6$ has been 1 in all B3LYP calculations and zero in all other cases. For instance, BHandH results are obtained for $p_1$ = 1, $p_2$ = 0.5, $p_3$ = 0, $p_4$ = 0.5, and $p_5 = p_6 = 0$.
28. Carbó, R.; Domingo, L. Int. J. Quantum Chem. 1987, 32, 517.
29. (a) Krishnan, R.; Binkley, J. S.; Seeger, R.; Pople, J. A. J. Chem. Phys. 1980, 72, 650. (b) Clark, T.; Chandrasekhar, J.; Spitznagel, G. W.; Schleyer, P. v. R. J. Comp. Chem. 1983, 4, 294. (c) Frisch, M. J.; Pople, J. A.; Binkley, J. S. J. Chem. Phys. 1984, 80, 3265.
30. (a) Mestres, J.; Sola, M.; Besalu, E.; Duran, M.; Carbó, R. Messem, Girona, CAT, 1993. (b) Besalu, E.; Carbó, R.; Duran, M.; Mestres, J.; Sola, M. Modern Techniques in Computational Chemistry: METTEC-95; Clementi, E.; Corongiu, G., Eds.; STEF: Cagliari, 1995; pp. 491-508.
31. (a) Davidon, W. Variable Metric Method for Minimization; Argonne National Laboratory Report ANL-5990, Argonne, IL, 1959. (b) Fletcher, R.; Powell, M. J. D. Comput. J. 1963, 6, 163.
32. (a) Bader, R. F. W. Acc. Chem. Res. 1985, 18, 9. (b) Bader, R. F. W. Atoms in Molecules: A Quantum Theory; Clarendon Press: Oxford, 1990.
33. Mestres, J. Electra, Girona, CAT, 1994.
34. Ernzerhof, M.; Perdew, J. P.; Burke, K. Density Functional Theory; Nalewajski, R., Ed.; Springer: Berlin, 1996.
35. Perdew, J. P.; Ernzerhof, M.; Burke, K. J. Chem. Phys. 1996, 105, 9982.
36. (a) Aarnst, M.; Stufkens, D. J.; Sola, M.; Baerends, E. J. Organometallics 1997, 16, 13.
37. (a) Wang, J.; Shi, Z.; Boyd, R. J.; Gonzalez, C. A. J. Phys. Chem. 1994, 98, 6988. (b) Wang, J.; Eriksson, L. A.; Johnson, B. G.; Boyd, R. J. J. Phys. Chem. 1996, 100, 5274.
ATOMIC SIMILARITY THROUGH A NEURAL NETWORK: SELF-ASSOCIATIVE PERIODIC TABLE OF ELEMENTS
Jose Fayos
Abstract
I. Introduction
II. Architecture and Function of a Neural Network for the Periodic Table
III. Self-Association of Elements and Properties
IV. Prediction of Properties for Elements
V. Conclusions
Acknowledgment
References
ABSTRACT

Atomic similarity analysis by a neural network produces self-association of chemical elements into three main groups preserving the conventional periodic table structure. Seventy-two elements and seven solid-state properties of each are taken as neurons to build an IAC (interactive activation and competition between neurons)-type neural network. To create associations, the range of values for each property is divided into about seven intervals or categories, and the net is genetically determined by 446 excitatory or activating contacts between the elements and their known property categories. The neural competition consists of inhibitory contacts within elements and within categories of the same property. Besides the simpler database retrieval functions, like that of finding elements sharing one property, this artificial net reproduces more intelligent (Mendeleyev-like) properties of human memory. Thus, starting from an all-inactive neuron pattern, when a neuron is activated by an external input, multiple parallel activation and inhibition processes expand through the net, converging to a pattern of activated neurons, which are revealed to be self-associated. If such self-associations are produced by activating successively, one by one, the above elements and categories, the following results ensue: (1) the 72 elements are approximately grouped into only three main families, each sharing close properties; (2) although no assumption about the atomic structure is made, these families pack by filling three different areas in a conventional periodic table, showing that the electronic configuration of the atoms is implicit in the solid-state properties; (3) one of the atomic families contains halogen and rare-gas atoms together with alkaline and alkaline-earth atoms, which gives rise to a closed cylindrical periodic table; and (4) the net is able to predict unknown properties for elements, with an average error of less than one category.

Advances in Molecular Similarity, Volume 2, pages 205-213. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
I. INTRODUCTION

The parallel processes between neurons of an artificial neural network allow neuron self-associations, without including any explicit rule. We use this advantage to discover hidden associations between chemical elements when they share close properties. The IAC net proposed by McClelland and Rumelhart^ is applied to create associations between 72 elements (those of the first six periods of the periodic table excluding lanthanides) and seven solid-state properties; namely, density, crystal structure, atomic diameter, cohesion energy, ionization energy, melting point, and compressibility. The observed values of these properties were taken from tables in Chapters 1 and 3 of Kittel's Introduction to Solid State Physics.^ To generate such associations, there must be elements sharing some properties; thus the range of the observed values for each property was divided into 7 or 8 intervals or categories, trying to homogenize the intervals by their length and by the number of values included. Table 1 shows the average of the values taken in each interval, which is included in the name of each of the resulting 49 categories. An exception is made for crystal structures (STR), as they are already categorical; only the 5 most common are considered as categories for the net. In addition, some other ambiguous or uncommon properties were also excluded. Thus, of the $72 \times 7 = 504$ possible element-category connections, we considered 446.
Table 1. The Property Layers: Density (RHO), Crystal Structure (STR), Atomic Diameter (DIA), Cohesion Energy (COH), Ionization Energy (ION), Melting Point (MEL), and Compressibility (COM)^a

RHO (0.1-22.6 g/cm^3)
  R1.1   R2.7   R4.6   R6.1   R7.4   R9.3   R12.4  R20.4
  12     10     7      7      5      9      8      7
STR
  HCP    BCC    FCC    DIAMOND  RHOMBOHEDRAL
  17     13     16     4        5
DIA (1.4-5.2 Å)
  D1.5   D2.2   D2.6   D2.9   D3.2   D3.7   D4.6
  2      4      18     14     9      9      6
COH (0.1-8.9 eV/at)
  C0.1   C1.0   C1.8   C3.0   C4.2   C5.7   C7.6
  5      9      12     14     11     8      10
ION (3.9-24.6 eV/at)
  I4.1   I5.7   I7.5   I9.6   I11.7  I13.7  I16.6  I23.1
  3      10     30     16     3      5      2      2
MEL (20-3700 K)
  M90    M390   M740   M1170  M1720  M2260  M3160
  8      15     8      9      9      10     7
COM (0.2-56x10" m^2/N)
  CM0.4  CM1.2  CM2.4  CM3.8  CM6.1  CM10.5 CM42.2
  21     11     11     5      2      4      4
COL
  IAIIA  IIIBIVB  VBVIBVIIB  VIII  IBIIBIIIA  IVAVA  VIAVIIA  VIIIA
  11     6        9          9     11         10     10       6

Note: ^a For each property, the names of the categories (including the average of the numerical values) follow and, below, the number of elements associated. In parentheses is shown the total range of values for each property. The last row corresponds to categories of the chemical group (COL) included to predict properties of elements.
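The totals quoted in the Introduction (49 categories, 446 of 504 possible connections) follow from the counts in Table 1, as the short tally below shows. The dictionary simply copies the element counts of the seven property layers; the variable names are assumptions made for this example.

```python
# Number of elements associated with each category, copied from Table 1
# (the COL row is left out because it is not one of the seven property layers).
property_layers = {
    "RHO": [12, 10, 7, 7, 5, 9, 8, 7],
    "STR": [17, 13, 16, 4, 5],
    "DIA": [2, 4, 18, 14, 9, 9, 6],
    "COH": [5, 9, 12, 14, 11, 8, 10],
    "ION": [3, 10, 30, 16, 3, 5, 2, 2],
    "MEL": [8, 15, 8, 9, 9, 10, 7],
    "COM": [21, 11, 11, 5, 2, 4, 4],
}

n_elements = 72
n_categories = sum(len(counts) for counts in property_layers.values())
n_connections = sum(sum(counts) for counts in property_layers.values())
n_possible = n_elements * len(property_layers)

if __name__ == "__main__":
    print("categories:", n_categories)
    print("element-category connections:", n_connections, "of", n_possible)
```

The 58 missing connections correspond to the excluded crystal structures and other ambiguous or unavailable property values.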
II. ARCHITECTURE AND FUNCTION OF A NEURAL NETWORK FOR THE PERIODIC TABLE

Figure 1 represents a fragment of our neural network. A total of 193 neurons are distributed among nine layers: one central hidden layer containing 72 hidden elements (X*) and eight external layers. The hidden layer is included to increase competition between elements; of the external layers, one holds the 72 elements and the other seven are property layers, which together hold the 49 categories. Elements, hidden elements, and categories are all interconnected neurons that process incoming signals in the same way. The neural contacts, or synapses, are of two kinds. Inside each layer there are inhibitory contacts (not represented in Figure 1) between every two neurons, a total of 10,524, allowing full intralayer competition. Between layers the contacts are excitatory: each hidden element connects with its element and with its known categories, giving 72 + 446 = 518 excitatory contacts, which constitute the genetic (built-in) base of knowledge of the net.
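The neuron and contact counts quoted above can be checked with a few lines of arithmetic. The layer sizes are read from Table 1; counting every intralayer pair twice (once per direction) is an assumption, adopted here because it reproduces the quoted total of inhibitory contacts.

```python
from math import comb

n_elements = 72
# Number of categories in each property layer, as listed in Table 1.
property_layer_sizes = {"RHO": 8, "STR": 5, "DIA": 7, "COH": 7,
                        "ION": 8, "MEL": 7, "COM": 7}
known_links = 446  # element-category associations kept in the net

# Element layer, hidden layer, and the seven property layers.
layer_sizes = [n_elements, n_elements] + list(property_layer_sizes.values())
n_neurons = sum(layer_sizes)

# Assumption: each unordered intralayer pair inhibits in both directions.
n_inhibitory = 2 * sum(comb(size, 2) for size in layer_sizes)

# Each hidden element links to its element and to its known categories.
n_excitatory = n_elements + known_links

print(n_neurons, n_inhibitory, n_excitatory)
```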
JOSE FAYOS
[Figure 1: diagram of the network, showing property layers (R1.1, R2.7, ..., R20.4; HCP, BCC, ..., RHOM; CM0.4, CM1.2, ..., CM42.2), the hidden layer (H*, He*, Li*, ..., Rn*), and the element layer (H, He, Li, ..., Rn).]

Figure 1. Schematic architecture of the IAC neural network for the elements of the periodic table and their properties. Only some layers, some elements or categories, and some connections have been represented for clarity.
The external neurons can also be activated by a foreign input. Each neuron processes the input stimuli arriving either from the foreign input or from connected neurons, changes its own activation $a$ by $\Delta a$, and sends its new activation (only if it is positive) back to the connected neurons. Thus, in the IAC neural process,^1 a target neuron can be excited by the outputs of connected neurons through $e = \alpha \sum \text{exc-outputs}$, or it can be inhibited by neurons of the same layer through $i = \gamma \sum \text{inh-outputs}$, where $\alpha$ and $\gamma$ are overall filter parameters. The net neuron input is $net = e - i + strength \times forinput$, where the last term is the strength of the foreign input, if it exists. Finally, $\Delta a = (a_{\max} - a)\,net - (a - rest)\,decay$ if $net > 0$, or $\Delta a = (a - a_{\min})\,net - (a - rest)\,decay$ if $net < 0$, which allows $a_i > 0$, propagating the activation through the net. Nevertheless, the activation does not reach the whole net, as neural competition allows only one or a few neurons of each layer to remain active. In fact, along the steps, the activations $a_i$ change and oscillate until they usually reach a pattern of stable values.
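The update rule just described can be written as a short routine. The sketch below is a minimal illustration, not the program used by the author: the function name `iac_step`, the synchronous update, the clamping of activations to [a_min, a_max], the default values of `a_max` and `a_min`, and the three-neuron toy net are all assumptions made for the example.

```python
def iac_step(a, exc, inh, ext, alpha=0.10, gamma=0.04, strength=0.4,
             a_max=1.0, a_min=-0.2, rest=-0.1, decay=0.04):
    """One synchronous IAC update (a_max and a_min are assumed values).

    a   : list of current activations
    exc : exc[i] lists the neurons that excite neuron i
    inh : inh[i] lists the neurons that inhibit neuron i
    ext : foreign input per neuron (0 where there is none)
    """
    out = [max(x, 0.0) for x in a]          # only positive activations are sent
    new_a = []
    for i, a_i in enumerate(a):
        e = alpha * sum(out[j] for j in exc[i])
        inhibition = gamma * sum(out[j] for j in inh[i])
        net = e - inhibition + strength * ext[i]
        if net > 0:
            delta = (a_max - a_i) * net - (a_i - rest) * decay
        else:
            delta = (a_i - a_min) * net - (a_i - rest) * decay
        # Clamping is an added safeguard, not part of the quoted rule.
        new_a.append(min(a_max, max(a_min, a_i + delta)))
    return new_a

# Toy net: element (0) <-> hidden element (1) <-> category (2), no inhibition.
exc = [[1], [0, 2], [1]]
inh = [[], [], []]
a = [-0.1, -0.1, -0.1]                      # every neuron starts at rest
first = iac_step(a, exc, inh, ext=[1.0, 0.0, 0.0])
for _ in range(300):
    a = iac_step(a, exc, inh, ext=[1.0, 0.0, 0.0])
print([round(x, 3) for x in a])
```

In this toy chain the foreign input first raises the element, whose positive output then reaches the hidden element and, through it, the category; neurons at rest send nothing, which is why activation has to become positive before it can propagate.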
Thus, once the net is generated, it does not learn (no changes in the synapse weights are produced), but it allows useful neuron associations. In fact, the IAC neural network "illustrates some properties of the human memory."^1 The most obvious application is as a database retriever, such as finding the property categories of one element or the elements sharing one category, or finding the common categories of given elements or the elements sharing given categories. A more interesting task is to find an element from its categories, even if one of the categories is wrong. But the most striking application is to discover hidden associations, or atomic similarity; that is, the ability of the net to self-associate elements and categories into a few families, which also implies the ability to predict unknown property categories for elements.
III. SELF-ASSOCIATION OF ELEMENTS AND PROPERTIES

Self-association occurs by applying a permanent forinput = 1 to one element and allowing the parallel evolution of the net from the equilibrium rest state (i.e., all $a_i = rest < 0$) to another stable distribution pattern (with some $a_i > 0$). To achieve good convergence, we selected the following optimal parameters: $\alpha = 0.10$, $\gamma = 0.04$, strength = 0.4, $a_{\max} = 1.0$, $a_{\min} = -$[value illegible], rest = -0.1, and decay = 0.04. Then, after about 300 cycles the net stabilized with positive activation of some elements, hidden elements, and shared categories. We call a group of elements well self-associated if the whole group is retrieved by initially shooting (activating) any one of its elements. Repeating the process for each of the 72 elements, three such separate groups arise: (a) Li, Na, K, Ca, Rb, Sr, Cs, and Ba, which produce 59 of the 8 × 8 expected element-element associations, giving in addition 4 extra associations to elements not included in the group; (b) Mo, Tc, Ru, Ta, W, Re, Os, and Ir, confirming all 64 expected associations, but giving 4 extra ones; and (c) As, Cd, Sb, Te, Hg, Pb, and Bi, which is the worst self-association, confirming 33 out of 49 possible associations and giving 17 extra ones. Moreover, most of the remaining 49 elements are significantly more associated with one of the three main groups than with the others. Table 2 shows how the three groups are distributed in three areas that preserve the structure of the periodic table. Thus, this neural network reproduces Mendeleyev's work when he associated elements in a periodic table based on their similarities. Note that no direct information about the electronic structure of the atoms was included in the net, and yet the net was able to reveal that hidden information, implicit in the solid-state properties.
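The bookkeeping behind figures such as "59 of the 8 × 8 expected associations, plus 4 extra ones" can be sketched as below. The function `association_score`, the placeholder element labels, and the retrieval sets are hypothetical; counting the shot element itself among the expected associations is an assumption inferred from the 8 × 8 total.

```python
def association_score(group, retrieved):
    """Count confirmed, expected, and extra associations for a group.

    retrieved[x] is the set of elements left with positive activation
    after a permanent foreign input has been applied to element x.
    """
    members = set(group)
    confirmed = sum(len(retrieved[x] & members) for x in group)
    extra = sum(len(retrieved[x] - members) for x in group)
    expected = len(group) ** 2
    return confirmed, expected, extra

# Hypothetical retrieval sets for a three-element group (labels are placeholders).
retrieved = {
    "A": {"A", "B", "C"},
    "B": {"A", "B", "C", "D"},   # D lies outside the group: one extra association
    "C": {"B", "C"},             # A is missed: one expected association is lost
}
score = association_score(["A", "B", "C"], retrieved)
print(score)
```

A group is perfectly self-associated when the confirmed count equals the expected count and there are no extras, as reported for group (b) apart from its 4 extra associations.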
In fact, almost the same three areas were obtained by repeating the whole self-association process after adding to the net one more layer with eight categories for the chemical groups, namely, IAIIA, IIIBIVB, VBVIBVIIB, VIII, IBIIBIIIA, IVAVA, VIAVIIA, and VIIIA (included as the last row in Table 1). Besides this spontaneous aggregation of 58 elements into only three families, Table 2 shows another interesting point: the elements on the right side of the periodic table are associated (into area a) with those on the left side, which closes a cylindrical periodic table. The three areas around the periodic table mainly correspond to (a) alkaline, alkaline-earth, halogen, and rare-gas elements, (b) transition elements, and (c) the less metallic metals and some nonmetallic elements. There are, however, 14 undefined elements between areas, and a part of area b (around Si) is isolated within area c. It is interesting to recall that in 1862 A.-E. Béguyer de Chancourtois proposed his telluric screw, wherein he arranged a helicoidal strip of elements (ordered by their weights) around a cylinder, in such a way that similar elements were on the same vertical.

Table 2. The 72 Selected Elements Divided into Three Self-Associated Groups, Closing a Cylindrical Periodic Table^a

[Table body: the 72 elements in periodic-table layout, with areas a, b, and c outlined.]

Note: ^a Area a is surrounded by a continuous line, area b by long dashes, and area c by short dashes. Undefined elements are in italic.

The property categories activated by each family of elements are more poorly resolved, as some properties are common to two families of elements. Nevertheless, we can consider three less well self-associated groups of categories corresponding to the three families of elements. Hence, as an overall result, the net allows self-association of three general groups, each including some elements and their own properties, where one complete group is almost entirely retrieved by activating only one element of the group. The above self-associated groups of categories also arise by applying a foreign input to each of the 49 property categories and letting the net evolve from rest (now taking $\gamma = 0.14$, as there are few categories per layer). Figure 2 shows schematically those groups of categories corresponding to the a, b, and c groups of elements.
[Figure 2: diagram in which the categories of ION, DIA, RHO, COH, MEL, and COM are grouped into the regions a, b, and c.]

Figure 2. Schematic drawing of the self-associated groups of property categories, corresponding to the three main families of elements shown in Table 2. The units of the category values are shown in Table 1.
IV. PREDICTION OF PROPERTIES FOR ELEMENTS

This neural network is able to predict property categories for elements reasonably well. Let us first examine how the net recovers the 446 known element-category associations when they are ignored. To begin, we ignore one of these associations by cutting the connection between one category and its hidden element, in both directions. Then we activate the corresponding element with a permanent forinput = 1, and we allow the parallel evolution of the net from rest. Finally, we examine which category or categories (if any) of the cut property are now indirectly activated (predicted) and compare them with the known category. If the same run is repeated for the 446 associations (using the higher inhibition $\gamma = 0.14$), the net predicts 400 categories (90%) with an overall average deviation $\langle\delta\rangle = \left(\sum |pred - known|\right)/N = 0.84$ category, including 46% of coincidences. Hence, as a set of 446 random predictions gives $\langle\delta\rangle = 2.28$ categories, the net predicts 2.28/0.84 = 2.7 times better than random. By including in the net the above-mentioned layer (COL) with 8 categories for the chemical groups, the prediction is slightly better: now (using the same parameters), of the 519 associations the net predicts 487 (94%), with an overall average deviation of 0.69, which is 3.3 times better than random. Figure 3 shows the $\langle\delta\rangle$ distribution in this case, with average deviations running from 0.35 for compressibility to 1.31 for density. Consequently, we tried to predict, using the 519 associations, unknown categories of elements, that is, categories never included in the net. We tested this by predicting the crystal structures of ten problematic elements: those having several possible structures, or those having an uncommon structure not included in the five categories selected for the net in Table 1.
While the net was unable to predict any structure for S, Se, and Re and made a bad prediction for Mn (hcp instead of bcc or fcc), it predicted well for the six remaining elements, giving: rhom for P, from the existing cubic and rhom phases; fcc for Ga, which is orthorhombic pseudo-fcc; fcc for In, which is fc-tetragonal pseudo-fcc; rhom for Te, which can be hexagonal or rhom; hcp and fcc for La, which can be hcp, fcc, or bcc; and rhom for Po, which is cubic or rhom.

[Figure 3: plot of predicted categories (y-direction) against the 57 known categories (x-direction), separated into the properties COL, RHO, STR, DIA, COH, ION, MEL, and COM.]

Figure 3. Predicted versus known categories, separated by properties. The names of the eight properties used are those of Table 1, including COL for the chemical groups. Each of the 57 known categories (along the x-direction) is common to several elements, which have their predicted categories (along the y-direction). Coincidences are on the diagonal. Some intermediate points in the y-direction are the average of two neighboring categories predicted simultaneously. Coincident points due to different elements are made visible by a slight displacement in the x-direction. The deviations are not significant for the structure type, where only coincidences are relevant. The large deviations in COL for the first and last categories, IAIIA and VIIIA, agree with a cylindrical periodic table.
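The average deviation and the comparison with random prediction used in this section can be illustrated as follows. How the random value of 2.28 was obtained is not stated in the text; the sketch assumes two independent uniform draws from a layer of 7 categories, which gives a value close to the quoted one. Both function names are hypothetical.

```python
def mean_abs_deviation(pairs):
    """Average |predicted - known| over (predicted, known) category indices."""
    return sum(abs(p - k) for p, k in pairs) / len(pairs)

def random_baseline(n_categories):
    """Expected |i - j| for two independent uniform draws (an assumed model)."""
    n = n_categories
    total = sum(abs(i - j) for i in range(n) for j in range(n))
    return total / (n * n)

# Three toy predictions: one coincidence, one off by two, one off by one.
toy = mean_abs_deviation([(1, 1), (2, 4), (3, 2)])
baseline_7 = random_baseline(7)
print(toy, round(baseline_7, 3))

# Ratios quoted in the text, recomputed from the quoted deviations.
print(round(2.28 / 0.84, 1), round(2.28 / 0.69, 1))
```

Under this assumption a 7-category layer gives 48/21, about 2.29, so the quoted baseline is of the size expected for purely random guesses; layers with 5 or 8 categories would shift the figure somewhat.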
V. CONCLUSIONS

Some hidden associations, which are implicit in the elements-properties database, are revealed by an interactive activation and competition (IAC) neural network,^1 showing that the chemical elements are self-associated into a few groups. The net is built with 72 elements of the periodic table and seven of their solid-state properties, each divided into about seven categories. Besides the more obvious database retrieval functions, this net is able to self-associate elements and properties into only three main groups, which have two interesting properties shown in Table 2. First, these groups preserve the structure of the conventional periodic table, as they fill three different areas in it. Second, elements on the right and on the left belong to the same group, which closes a cylindrical periodic table. The first of these results is all the more striking because the element associations are not biased by any assumption about atomic structure, which, if included as a new layer of the net, scarcely modifies the above self-associations. On the other hand, as a consequence of those spontaneous element associations, the net is able to predict properties of elements with an average deviation of ±0.7 category. Hence, by using atomic similarity over solid-state properties, this net reproduces the work of Mendeleyev, giving the same periodic pattern, but now organizing three groups of elements around a cylindrical periodic table. The periodic laws, which are a paradigm of generalization in science, are thus revealed by this neural network as hidden self-associations of elements. Although there are still many variables to tune, such as the weights of the synapses, the properties selected, or even the number of categories per property, we hope this is a good approach for applying a neural network to the periodic table database.
ACKNOWLEDGMENT

I thank my colleague Dr. Manfred Stoud for his encouragement and advice.

REFERENCES

1. McClelland, J.L.; Rumelhart, D.E. Explorations in Parallel Distributed Processing, 6th ed.; MIT Press: Cambridge, MA, 1994.
2. Kittel, C. Introduction to Solid State Physics, 6th ed.; Wiley: New York, 1986.
COMPARISON OF QUANTUM SIMILARITY MEASURES DERIVED FROM ONE-ELECTRON, INTRACULE, AND EXTRACULE DENSITIES
Xavier Fradera, Miquel Duran, and Jordi Mestres
Abstract
I. Introduction
II. Computational Details
   A. Calculation of Intracule and Extracule Densities
   B. Calculation of Second-Order Quantum Similarity Measures
III. Application Examples
   A. Two-Electron Atomic Systems: H⁻, He, Li⁺, Be²⁺
   B. Diatomic Molecules: N₂, CO, LiF
IV. Conclusions
Acknowledgments
References

Advances in Molecular Similarity, Volume 2, pages 215-243. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
ABSTRACT

Quantum similarity measures (QSM) are widely used to quantify the degree of similarity between molecules. However, only first-order QSM between one-electron density functions have been computed so far. Second-order QSM have not been available because of the inherent complexity of electron-pair density functions and their high computational cost. To overcome these problems, second-order QSM can be computed using intracule and extracule densities. Thus, this contribution presents a comparative study of QSM derived from one-electron, intracule, and extracule densities. As illustrative examples, results obtained for a series of two-electron atomic systems (H⁻, He, Li⁺, Be²⁺) and a series of diatomic molecules (N₂, CO, LiF) are presented and discussed. For the molecular series, the QSM computed from one-electron and extracule densities are maximized as a function of the relative superposition of two molecules, and the corresponding differences in their topologies are analyzed.
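I. INTRODUCTION

A molecule, as a quantum object, is completely described by its corresponding N-electron wave function, $\Psi(\mathbf{x}_1,\dots,\mathbf{x}_N)$, from which the nth-order reduced density matrix can be obtained by integrating over (N − n) space-spin coordinates:

$$\rho_n(\mathbf{x}_1,\dots,\mathbf{x}_n;\mathbf{x}'_1,\dots,\mathbf{x}'_n) = \binom{N}{n}\int \Psi^*(\mathbf{x}_1,\dots,\mathbf{x}_n,\mathbf{x}_{n+1},\dots,\mathbf{x}_N)\,\Psi(\mathbf{x}'_1,\dots,\mathbf{x}'_n,\mathbf{x}_{n+1},\dots,\mathbf{x}_N)\,d\mathbf{x}_{n+1}\cdots d\mathbf{x}_N \tag{1}$$

The nth-order density function is the diagonal part of the corresponding density matrix, that is,

$$\rho_n(\mathbf{x}_1,\dots,\mathbf{x}_n) = \binom{N}{n}\int \Psi^*(\mathbf{x}_1,\dots,\mathbf{x}_N)\,\Psi(\mathbf{x}_1,\dots,\mathbf{x}_N)\,d\mathbf{x}_{n+1}\cdots d\mathbf{x}_N \tag{2}$$

Spinless density functions can be obtained by further integrating over all spin coordinates. Thus, the first-order spinless density function of a molecule is

$$\rho(\mathbf{r}) = N\int \Psi^*(\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_N)\,\Psi(\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_N)\,ds_1\,d\mathbf{x}_2\cdots d\mathbf{x}_N \tag{3}$$

while the spinless second-order density function is

$$\Gamma(\mathbf{r}_1,\mathbf{r}_2) = \binom{N}{2}\int \Psi^*(\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_N)\,\Psi(\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_N)\,ds_1\,ds_2\,d\mathbf{x}_3\cdots d\mathbf{x}_N \tag{4}$$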
From a probabilistic point of view, Eq. 3 describes the probability of finding any one of the N electrons at each point $\mathbf{r}$ of three-dimensional space, while Eq. 4 describes the probability of finding one electron at position $\mathbf{r}_1$ while another electron is simultaneously positioned at $\mathbf{r}_2$. Within the LCAO framework, first-order density functions are expressed as double sums over pairs of basis functions,

$$\rho(\mathbf{r}) = \sum_{ij}^{m} D_{ij}\,\varphi_i^*(\mathbf{r})\,\varphi_j(\mathbf{r}); \qquad \int \rho(\mathbf{r})\,d\mathbf{r} = N \tag{5}$$

while second-order density functions are expressed as four-indexed sums over quartets of basis functions:

$$\Gamma(\mathbf{r}_1,\mathbf{r}_2) = \sum_{ijkl}^{m} D_{ijkl}\,\varphi_i^*(\mathbf{r}_1)\,\varphi_j(\mathbf{r}_1)\,\varphi_k^*(\mathbf{r}_2)\,\varphi_l(\mathbf{r}_2); \qquad \iint \Gamma(\mathbf{r}_1,\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2 = \binom{N}{2} \tag{6}$$

$\{D_{ij}\}$ in Eq. 5 and $\{D_{ijkl}\}$ in Eq. 6 are the first- and second-order density matrix coefficients, respectively; $\{\varphi_i(\mathbf{r})\}$ are the atomic orbitals or basis functions, and m is the number of these atomic orbitals. An overlap-like QSM employing first-order density functions was first described by Carbó et al.:

$$Z_{AB} = \int \rho_A(\mathbf{r})\,\rho_B(\mathbf{r})\,d\mathbf{r} \tag{7}$$
$Z_{AB}$ quantifies the degree of overlap of the density functions $\rho_A(\mathbf{r})$ and $\rho_B(\mathbf{r})$. The value obtained for $Z_{AB}$ depends on the mutual alignment of molecules A and B. Thus, $Z_{AB}$ is in fact a function of six variables, corresponding to three translational and three rotational degrees of freedom. A cosine-like similarity index ($C_{AB}$) was also originally proposed by Carbó et al. as

$$C_{AB} = \frac{Z_{AB}}{\left(Z_{AA}\,Z_{BB}\right)^{1/2}} \tag{8}$$

where $Z_{AA}$ and $Z_{BB}$ are the self-similarities of molecules A and B, respectively. For positive-definite molecular fields (such as the electron density), $C_{AB}$ values range from 0 to 1: the value of 1 is achieved only when the molecules under comparison are identical; any dissimilarity between the two molecules is reflected by $C_{AB}$ values within the (0,1) range; finally, the value of 0 corresponds to the mathematical limit of zero overlap. So far, QSM applications in chemistry have been based mainly on first-order density functions. In fact, not only QSM but also most of the tools applied to study electron distributions in molecules, such as the well-known theory of atoms in molecules, density maps, and others, have relied on this first-order description, although in many aspects of chemistry it would be desirable to go beyond it and analyze second-order density functions directly. This would be especially important in studies in which electron correlation plays an important role. A definition of an overlap-like second-order QSM using two-electron density functions can be obtained as an extension of measure 7, as described by Carbó et al.:

$$Z^{(2)}_{AB} = \iint \Gamma_A(\mathbf{r}_1,\mathbf{r}_2)\,\Gamma_B(\mathbf{r}_1,\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2 \tag{9}$$
but, so far, only semiempirical approximations to second-order QSM have been described. There are several reasons why second-order QSM have not come into general use as a convenient tool for the analysis of molecular electron-pair distributions. First of all, second-order density functions are more difficult to visualize than first-order ones because of their higher dimensionality. Moreover, an overlap-like QSM between ab initio second-order density functions, as in Eq. 9, is computationally too expensive to be applicable to molecular systems, even to small ones. This situation can be alleviated by reducing second-order density functions to intracule and extracule densities. That is, from the coordinates that describe the simultaneous position of two electrons, $\mathbf{r}_1$ and $\mathbf{r}_2$, an intracule coordinate, $\mathbf{r}$, and an extracule coordinate, $\mathbf{R}$, can be defined:

$$\mathbf{r} = \mathbf{r}_1 - \mathbf{r}_2 \tag{10}$$

$$\mathbf{R} = \frac{\mathbf{r}_1 + \mathbf{r}_2}{2} \tag{11}$$

Then, the intracule density function, $I(\mathbf{r})$, is defined as

$$I(\mathbf{r}) = \iint \Gamma(\mathbf{r}_1,\mathbf{r}_2)\,\delta\big((\mathbf{r}_1 - \mathbf{r}_2) - \mathbf{r}\big)\,d\mathbf{r}_1\,d\mathbf{r}_2; \qquad \int I(\mathbf{r})\,d\mathbf{r} = \binom{N}{2} \tag{12}$$

and the extracule density function, $E(\mathbf{R})$, as

$$E(\mathbf{R}) = \iint \Gamma(\mathbf{r}_1,\mathbf{r}_2)\,\delta\!\left(\frac{\mathbf{r}_1 + \mathbf{r}_2}{2} - \mathbf{R}\right)d\mathbf{r}_1\,d\mathbf{r}_2; \qquad \int E(\mathbf{R})\,d\mathbf{R} = \binom{N}{2} \tag{13}$$
$I(\mathbf{r})$ and $E(\mathbf{R})$ are the probability density functions for the electron-electron distance and for the electron-pair center of mass, respectively. Like the second-order density itself, both $I(\mathbf{r})$ and $E(\mathbf{R})$ must integrate to the number of electron pairs. $I(\mathbf{r})$ and $E(\mathbf{R})$ have the advantage of reducing the six-dimensionality of the original second-order density function while keeping an electron-pair character. So, $I(\mathbf{r})$ and $E(\mathbf{R})$ are three-dimensional functions, like the first-order density function, and are easily visualized. Since the intracule coordinate, $\mathbf{r}$, depends only on the relative positioning of the electrons in the molecule, $I(\mathbf{r})$ is invariant to any molecular translation. Another remarkable property of $I(\mathbf{r})$ is that it always shows an inversion center at the point $I(\mathbf{0})$, regardless of the symmetry of the molecule. Additional symmetry elements present in the molecule are also reflected in $I(\mathbf{r})$. On the other hand, the extracule coordinate, $\mathbf{R}$, is directly related to the molecular three-dimensional space, and $E(\mathbf{R})$ shows the same symmetry elements as $\rho(\mathbf{r})$. Calculations of $I(\mathbf{r})$ and $E(\mathbf{R})$ require evaluating many costly four-indexed two-electron integrals, whose number depends on the fourth power of the number of primitive basis functions. In the past, the lack of proper algorithms to deal efficiently with the computation of those integrals on large grids of points restricted $I(\mathbf{r})$ and $E(\mathbf{R})$ calculations to atoms and small molecules and, in many cases, only along longitudinal or transversal atomic or molecular axes rather than on rectangular grids. Recently, Cioslowski and Liu have developed a computational scheme that allows for faster calculations of $I(\mathbf{r})$ and $E(\mathbf{R})$ on large grids of points,^33 which has permitted a deeper understanding of the topological characteristics of molecular $I(\mathbf{r})$ and $E(\mathbf{R})$ distributions and their Laplacians. The possibility of obtaining $I(\mathbf{r})$ and $E(\mathbf{R})$ distributions of atoms and molecules in a very feasible way opens the path for calculating second-order QSM, as a natural extension of the originally proposed first-order QSM. Thus, overlap-like intracule QSM, $Y_{AB}$,

$$Y_{AB} = \int I_A(\mathbf{r})\,I_B(\mathbf{r})\,d\mathbf{r} \tag{14}$$

and overlap-like extracule QSM, $X_{AB}$,

$$X_{AB} = \int E_A(\mathbf{R})\,E_B(\mathbf{R})\,d\mathbf{R} \tag{15}$$

can be computed, which are quantitative measures of the similarity of molecules A and B as represented by their contracted second-order electron-pair densities, $I(\mathbf{r})$ or $E(\mathbf{R})$. As for one-electron densities, in the particular case of A = B, $Y_{AA}$ and $X_{AA}$ are the self-similarity measures quantifying how locally concentrated the $I(\mathbf{r})$ or $E(\mathbf{R})$ distributions of molecule A are. Consideration of self-similarity measures allows second-order QSM to be normalized through the definition of a Carbó second-order similarity index,

$$C^{(2)}_{AB} = \frac{Z^{(2)}_{AB}}{\left(Z^{(2)}_{AA}\,Z^{(2)}_{BB}\right)^{1/2}} \tag{16}$$
following the original form of the Carbó similarity index. In Eq. 16, $Z^{(2)}_{AB}$ generally represents $Y_{AB}$ or $X_{AB}$, depending on the use of $I(\mathbf{r})$ or $E(\mathbf{R})$, respectively. The objective of this contribution is to compare the values and trends of quantum similarity measures and indices computed from one-electron, intracule, and extracule densities. The following sections contain, first, a description of the computational details used for evaluating $I(\mathbf{r})$ and $E(\mathbf{R})$ and, second, two illustrative numerical applications on atoms and linear diatomic molecules.
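The contraction of a pair density into intracule and extracule densities can be illustrated with a one-dimensional toy model. Everything in the sketch is an assumption made for illustration: the problem is reduced to one dimension, the pair density is that of two electrons sharing a single Gaussian orbital (so that it is one quarter of the product of one-electron densities), and the orbital is deliberately centered away from the origin.

```python
from math import exp, pi, sqrt

def rho(x, center=0.5, a=1.0):
    """Model density of two electrons in one 1D Gaussian orbital."""
    return 2.0 * sqrt(a / pi) * exp(-a * (x - center) ** 2)

n, h = 161, 0.1                                  # grid points and spacing
xs = [(k - n // 2) * h for k in range(n)]        # grid from -8.0 to 8.0

intracule = [0.0] * (2 * n - 1)   # entry i - j + n - 1 holds u = x_i - x_j
extracule = [0.0] * (2 * n - 1)   # entry i + j holds R = (x_i + x_j) / 2
for i, x1 in enumerate(xs):
    for j, x2 in enumerate(xs):
        pair = 0.25 * rho(x1) * rho(x2)          # closed-shell pair density
        intracule[i - j + n - 1] += pair * h     # sum over x2 at fixed u
        extracule[i + j] += pair * 2.0 * h       # sum over x1 - x2 at fixed R

pairs_from_I = sum(intracule) * h                # u is sampled with spacing h
pairs_from_E = sum(extracule) * (h / 2.0)        # R is sampled with spacing h/2
print(round(pairs_from_I, 6), round(pairs_from_E, 6))
```

Both contractions add up to the same number of electron pairs (one pair here). The intracule comes out symmetric under u → −u and peaks at u = 0 even though the orbital is not centered at the origin, while the extracule peaks at the orbital center, in line with the translational invariance and symmetry properties stated above.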
II. COMPUTATIONAL DETAILS

In this section, the approaches actually used for calculating first $I(\mathbf{r})$ and $E(\mathbf{R})$, and then $Y_{AB}$ and $X_{AB}$, are briefly described, with specific mention of the numerical integration schemes used for particular systems having spherical or cylindrical symmetry. Throughout this work, ab initio second-order density matrices have been computed at the Hartree-Fock (HF) level of theory by means of the programs Gaussian 94 and GAMESS.

A. Calculation of Intracule and Extracule Densities
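Within the HF approximation, the second-order $D_{ijkl}$ matrix elements appearing in Eq. 6 can be obtained from first-order $D_{ij}$ matrix elements. For closed-shell systems they are evaluated as

$$D_{ijkl} = \tfrac{1}{2}\,D_{ij}D_{kl} - \tfrac{1}{4}\,D_{il}D_{kj} \tag{17}$$

whereas for open-shell systems they are computed in a UHF framework as

$$D_{ijkl} = \tfrac{1}{2}\,D_{ij}D_{kl} - \tfrac{1}{2}\left(D^{\alpha}_{il}D^{\alpha}_{kj} + D^{\beta}_{il}D^{\beta}_{kj}\right) \tag{18}$$

where first-order elements are split into $\alpha$ and $\beta$ spin contributions:

$$D_{ij} = D^{\alpha}_{ij} + D^{\beta}_{ij} \tag{19}$$

Equation 17 is the special case of Eq. 18 in which $D^{\alpha}_{ij} = D^{\beta}_{ij} = D_{ij}/2$.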
Although $I(\mathbf{r})$ distributions calculated at the HF level do not satisfy the characteristic electron-electron cusp condition at the origin, it has been shown that the main topological features of $I(\mathbf{r})$ and $E(\mathbf{R})$ are already manifested at this level of theory. Future work will be directed toward analyzing the effect of electron correlation on the topology of $I(\mathbf{r})$ and $E(\mathbf{R})$ distributions. At present, there are no analytical expressions for $I(\mathbf{r})$ and $E(\mathbf{R})$ such as Eqs. 5 and 6 for first- and second-order density functions, so the only feasible approach for assessing $I(\mathbf{r})$ and $E(\mathbf{R})$ is the numerical integration of Eqs. 12 and 13 on large grids of points. To perform these numerical integrations in a fast and feasible way, the algorithmic scheme proposed recently by Cioslowski and Liu^33 has been followed, which divides the computational load into a grid-dependent and a grid-independent part, and reduces the number of integrals that need to be computed by discarding those integrals below an arbitrary significance threshold. A detailed description of this algorithm can be found in Ref. 33. Integration of $I(\mathbf{r})$ or $E(\mathbf{R})$ over all space should return the total number of electron pairs. However, only an approximation to the exact number of electron pairs will be obtained because of the use of numerical integration on three-dimensional grids:

$$\binom{N}{2} \approx \sum I(\mathbf{r})\,\Delta\mathbf{r} \tag{20}$$

$$\binom{N}{2} \approx \sum E(\mathbf{R})\,\Delta\mathbf{R} \tag{21}$$

where N is the number of electrons of the system studied, and $\Delta\mathbf{r}$ and $\Delta\mathbf{R}$ are the grid spacings for the three Cartesian components of the intracule and extracule coordinates, respectively. Throughout this contribution the same grid spacing will be taken for the three Cartesian components. The approximate number of electron pairs obtained from Eqs. 20 and 21 will be used to assess quantitatively the validity and quality of the numerical integration performed. The dependence of this value on the grid extension and spacing will be examined.

B. Calculation of Second-Order Quantum Similarity Measures

Following the numerical integration scheme presented above, the evaluation of overlap-like second-order QSM between $I(\mathbf{r})$ or $E(\mathbf{R})$ distributions is straightforward:

$$Y_{AB} \approx \sum I_A(\mathbf{r})\,I_B(\mathbf{r})\,\Delta\mathbf{r} \tag{22}$$

$$X_{AB} \approx \sum E_A(\mathbf{R})\,E_B(\mathbf{R})\,\Delta\mathbf{R} \tag{23}$$
As stated above for the number of electron pairs, the quality of the approximate values obtained for $Y_{AB}$ and $X_{AB}$ will depend on the extension and spacing of the grid employed in the numerical integration, as well as on the integral screening threshold used in the calculation of $I(\mathbf{r})$ and $E(\mathbf{R})$, respectively. All of these aspects will be investigated below. Equations 20-23 generally assume the definition of three-dimensional density grids, which means considering a very large number of points. However, in particular cases, atomic or molecular symmetry can be used to reduce the dimensionality of the grids needed to compute $Y_{AB}$ and $X_{AB}$. For instance, the spherical symmetry of atomic systems can be exploited to compute $I(\mathbf{r})$ and $E(\mathbf{R})$ solely along an axis starting at the nuclear position. In this case, second-order QSM between two atoms can be computed as

$$Y_{AB} \approx 4\pi \sum I_A(r)\,I_B(r)\,r^2\,\Delta r \tag{24}$$

$$X_{AB} \approx 4\pi \sum E_A(r)\,E_B(r)\,r^2\,\Delta r \tag{25}$$

For linear systems, cylindrical $I(\mathbf{r})$ and $E(\mathbf{R})$ distributions can be generated by rotating a planar grid around the internuclear axis. Thus, second-order QSM will be evaluated as

$$Y_{AB} \approx 2\pi \sum I_A(x,z)\,I_B(x,z)\,x\,\Delta x\,\Delta z \tag{26}$$

$$X_{AB} \approx 2\pi \sum E_A(x,z)\,E_B(x,z)\,x\,\Delta x\,\Delta z \tag{27}$$

where z is the rotation axis and x is an axis perpendicular to it. From a practical point of view, it is important that any two grids being compared have the same grid spacing. When the extension of the grids is not the same, the integration can be carried out only over the region common to both grids. This does not imply a significant loss of accuracy, as long as both grids are large enough to include all regions that contribute significantly to the $I(\mathbf{r})$ or $E(\mathbf{R})$ distributions. Then, assuming a zero contribution from regions where data from only one of the two grids are available should be a reasonable approach. Another aspect worth taking into account is that the similarity between two molecules depends on their superposition. As a consequence, the two molecules being compared have to be mutually aligned so as to maximize the corresponding QSM. The optimization of the similarity function, which depends on the particular density definition employed to evaluate the QSM, will also be one of the points of discussion later in this work.
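The spherical integration along an axis can be tried out on a density whose integrals are known in closed form. The sketch below is an illustration under stated assumptions: it uses a model density of two electrons in one 1s Slater orbital with the textbook variational exponent 27/16 for helium (not an HF/6-31G density), samples the axis at interval midpoints (the point placement used in the chapter is not stated), and the helper names are hypothetical.

```python
from math import exp, pi

def slater_density(r, zeta):
    """Two electrons in one 1s Slater orbital (a model density)."""
    return 2.0 * zeta ** 3 / pi * exp(-2.0 * zeta * r)

def radial_sum(f, n_points, spacing):
    """4*pi * sum of f(r) r^2 dr along an axis starting at the nucleus."""
    total = 0.0
    for k in range(n_points):
        r = (k + 0.5) * spacing          # midpoint sampling, an assumption
        total += f(r) * r * r
    return 4.0 * pi * total * spacing

zeta = 1.6875                             # 27/16, variational value for He
exact_z_aa = zeta ** 3 / (2.0 * pi)       # closed-form self-similarity

results = {}
for n_points, spacing in [(38, 0.2), (1500, 0.01)]:
    n_el = radial_sum(lambda r: slater_density(r, zeta), n_points, spacing)
    z_aa = radial_sum(lambda r: slater_density(r, zeta) ** 2, n_points, spacing)
    results[(n_points, spacing)] = (n_el, z_aa)
    print(n_points, spacing, round(n_el, 6), round(z_aa, 6))
print("closed form:", round(exact_z_aa, 6))
```

The two grids follow the coarsest and finest axial grids used in this chapter (38 points at 0.2 au and 1500 points at 0.01 au). Because the squared density decays twice as fast as the density itself, the self-similarity is the quantity that suffers most from a coarse grid.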
III. APPLICATION EXAMPLES

Two application examples are presented to illustrate the use of second-order QSM for the quantitative analysis of atomic and molecular electron-pair density distributions. A series of two-electron atomic systems (H⁻, He, Li⁺, Be²⁺) is considered first, as the simplest case where the values and trends followed by first- and second-order QSM can be analyzed and compared. One-electron, intracule, and extracule similarity matrices for a series of diatomic molecules (N₂, CO, LiF) are presented next, and the topologies of the similarity functions arising from the maximization of the different QSM are discussed.

A. Two-Electron Atomic Systems: H⁻, He, Li⁺, Be²⁺
As the simplest case, the H", He, Li"^, and Be^"^ two-electron isoelectronic series of atomic systems is studied first. First- and second-order density matrices were computed at the HF/6-31G level of theory by means of the Gaussian 94 package.^^ Since all of these systems are spherically symmetric, p(r), /(r), and £(R) values were computed only along an axis starting at the nucleus. Evaluation of the number of electrons and Z^^ from p(r), and the number of electron pairs, and Y^^ and Xj^^ from /(r) and £'(R), respectively, was performed numerically by spherical integration. The number of electrons, the number of electron pairs, and analytical evaluation of Z^^ can then be used to validate the quality of the numerical integration. Table 1 shows the results obtained using four different combinations of length and spacing for the axial calculations. For all systems considered, the correct number of electrons for p(r) and electron pairs for /(r) and E(R) is reproduced with
Comparison of Quantum Similarity Measures
223
Table 1. Number of Electrons or Electron Pairs and Self-Similarities Computed from Several Grids for H", He, Li"^, and Be^"^, Analytical Z^A
m
Atom
H"
grid^
1 2 3 4
n.e. 1.999995 2.000000 2.000000 2.000000
exact
He
1 2 3 4
1.998323 2.000000 2.000000 2.000000
exact
Li^
1 2 3 4
1.987350 1.999321 1.999962 2.000000
exact Be2^
1 2 3 4 exact
1.964233 1.997353 1.999741 2.000000
E(R)
i
p(r) ZAA
n.e.p.
^AA
0.08844 0.999806 0.00646 0.08845 1.000000 0.00646 1.000000 0.00646 0.08845 0.08845 1.000000 0.00646 0.08845 0.74720 1.000000 0.04503 0.76010 1.000000 0.04503 0.76012 1.000000 0.04503 0.76012 1.000000 0.04503 0.76012 2.83197 - 0.999984 0.18766 3.05588 1.000000 0.18769 3.07256 1.000000 0.18769 3.07376 1.000000 0.18769 3.07376 6.42902 0.999882 0.49051 7.76495 0.999999 0.49133 7.91138 1.000000 0.49133 7.92676 1.000000 0.49133 7.92676
n.e.p.
'^AA
1.000000 1.000000 1.000000 1.000000
0.05165 0.05165 0.05165 0.05165
0.999918 1.000000 1.000000 1.000000
0.35983 0.36022 0.36022 0.36022
0.998087 0.999986 1.000000 1.000000
1.46224 1.50129 1.50153 1.50153
0.987046 0.999882 0.999999 1.000000
3.37330 3.92408 3.93061 3.93063
^Grid definitions: 1: length 7.5 au, spacing 0.2 au (38 points) 2: length 10.0 au, spacing 0.1 au (100 points) 3: length 10.0 au, spacing 0.05 au (200 points) 4: length 15.0 au, spacing 0.01 au (1500 points)
at least two decimal figures when using the coarsest grid (only 38 points along the axis) and five decimal figures when using the finest grid (1500 points along the axis). Note, however, that the number of electron pairs converges to the exact value faster than the number of electrons as the grid used for the numerical integrations is systematically refined. Values of Z_AA obtained from the finest numerical integration scheme reproduce the corresponding analytical values for the four atomic systems to within five decimal figures. However, finer grids are needed to obtain accurate Z_AA values on going from H⁻ to Be²⁺. This is because the ρ(r) attractors become sharper as the number of protons in the nucleus grows, and hence more
XAVIER FRADERA, MIQUEL DURAN, and JORDI MESTRES
precise integrations are required. As regards Y_AA and X_AA, although no analytical data are available for comparison, fast convergence is achieved in all cases when the numerical integration is systematically refined, which supports the correctness of these values. Z_AA is very sensitive both to the number of electrons in the system considered and to the shape of the corresponding ρ(r) distribution, and it is also strongly dependent on local charge density concentrations. Consequently, Z_AA has been used as a quantitative measure of the concentration of the electron density distribution in atoms and molecules. For instance, within an isoelectronic series, larger Z_AA values correspond to systems whose electron densities are more locally concentrated, whereas smaller Z_AA values are found for systems whose electron densities are more uniformly distributed. This is indeed the case in the H⁻, He, Li⁺, and Be²⁺ two-electron series, where the trend observed in the respective Z_AA values (0.088, 0.760, 3.074, and 7.927) reflects the local concentration of the electron density toward the nucleus as the atomic number increases. In this case, the Y_AA (0.006, 0.045, 0.188, and 0.491) and X_AA (0.052, 0.360, 1.501, and 3.931) values follow the same trend along this series. As this is the simplest case in which one-electron densities (two-electron systems) and two-electron densities (one-electron-pair systems) can be compared, it is not surprising that Z_AA, Y_AA, and X_AA are closely related. To compare the electron density distributions of the systems along this isoelectronic series, similarity matrices were constructed by computing the corresponding similarity elements using the finest grid. Since all atomic density distributions of these systems show a single maximum centered at the origin, maximization of the similarity was not necessary in this case.
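The spherical integration used to validate the axial grids can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes a hypothetical single-exponential (Slater-type) two-electron model density ρ(r) = (2ζ³/π) exp(-2ζr) in place of the HF/6-31G densities, a plain midpoint rule, and the illustrative function names slater_density and radial_integrals. For this model density the exact results are N = 2 and Z_AA = ζ³/(2π), so the numerical values can be checked in the same spirit as in Table 1.

```python
import math


def slater_density(r, zeta):
    """Assumed model density rho(r) = (2*zeta**3/pi) * exp(-2*zeta*r)."""
    return 2.0 * zeta**3 / math.pi * math.exp(-2.0 * zeta * r)


def radial_integrals(zeta, length, spacing):
    """Return (number of electrons, Z_AA) by midpoint integration along r.

    Each radial point carries the spherical-shell weight 4*pi*r**2*spacing.
    """
    n_points = int(round(length / spacing))
    n_electrons = 0.0
    z_aa = 0.0
    for i in range(n_points):
        r = (i + 0.5) * spacing          # midpoint of the i-th radial interval
        rho = slater_density(r, zeta)
        weight = 4.0 * math.pi * r * r * spacing
        n_electrons += weight * rho      # integral of rho
        z_aa += weight * rho * rho       # integral of rho squared
    return n_electrons, z_aa


if __name__ == "__main__":
    zeta = 1.6875  # assumed exponent (variational value for a He-like 1s orbital)
    exact_z = zeta**3 / (2.0 * math.pi)
    # Length/spacing pairs taken from the grid definitions of Table 1.
    for length, spacing in [(7.5, 0.2), (10.0, 0.1), (10.0, 0.05), (15.0, 0.01)]:
        n_el, z_aa = radial_integrals(zeta, length, spacing)
        print(f"L={length:5.1f} h={spacing:5.2f}  n.e.={n_el:.6f}  "
              f"Z_AA={z_aa:.5f}  (exact {exact_z:.5f})")
```

Increasing ζ in this sketch sharpens the model attractor, which gives a simple way to see why finer grids are needed for the heavier members of the series.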
The three similarity matrices containing the Z_AB, Y_AB, and X_AB similarity measures, respectively, and their corresponding first- and second-order Carbó similarity indices are collected in Table 2. Again, the overall trend is much the same for the three similarity matrices, but some points are worthy of comment. For any given pair, the magnitude of the QSM values follows the order Z_AB > X_AB > Y_AB. Comparison between one-electron and electron-pair QSM is not straightforward, since they are related to different numbers of particles or particle interactions. For instance, there are two electrons and only one electron pair in this particular case. This certainly contributes to making Z_AB values larger. As the number of electron pairs grows approximately as half the square of the number of electrons, this effect would be reversed as the atomic number increases, because the number of electron pairs will be much larger than the number of electrons. More consistent comparisons can be made between Y_AB and X_AB values. Because of the inherent definition of intracule and extracule coordinates, I(r) distributions are always more disperse than E(R). Consequently, X_AB values are found to be larger than Y_AB. As regards similarity indices for a given pair of elements, first-order indices are found to be larger than second-order indices, thus indicating that these systems are more similar from the point of view of the one-electron density than from that of the intracule or extracule densities. Interestingly,
Table 2. Similarity Matrices for Two-Electron Systems(a)

One-electron similarity matrix
         H⁻       He       Li⁺      Be²⁺
H⁻       0.0884   0.7880   0.5733   0.4335
He       0.2043   0.7601   0.9330   0.8220
Li⁺      0.2989   1.4261   3.0738   0.9672
Be²⁺     0.3629   2.0177   4.7740   7.9268

Intracule similarity matrix
         H⁻       He       Li⁺      Be²⁺
H⁻       0.0065   0.7596   0.4911   0.3380
He       0.0129   0.0450   0.8959   0.7383
Li⁺      0.0170   0.0824   0.1877   0.9496
Be²⁺     0.0190   0.1098   0.2884   0.4913

Extracule similarity matrix
         H⁻       He       Li⁺      Be²⁺
H⁻       0.0516   0.7596   0.4911   0.3380
He       0.1036   0.3602   0.8959   0.7383
Li⁺      0.1367   0.6589   1.5015   0.9496
Be²⁺     0.1523   0.8785   2.3069   3.9306

Note: (a) Diagonal elements are self-similarities (boldface in the original); elements below the diagonal are QSM (roman); elements above the diagonal are Carbó indices (italic).
due to the spherical symmetry of atomic systems, the Y_AB and X_AB second-order Carbó similarity indices are exactly the same for any atomic pair in this series.

B. Diatomic Molecules: N2, CO, LiF
This section presents the results for the series of N2, CO, and LiF diatomic molecules and is organized as follows. First, the profiles of ρ(r), I(r), and E(R) along the internuclear axis of these molecules are presented to show the different topological characteristics of each density distribution. Next, the grid quality needed in numerical I(r) and E(R) calculations to evaluate Y_AB and X_AB values with sufficient accuracy is studied systematically. The section continues with a detailed analysis of the topology of the corresponding pairwise similarity functions in terms of the molecular alignments associated with each local similarity maximum. Finally, similarity matrices are constructed, and the Z_AB, Y_AB, and X_AB values for the three possible molecular pairs in this series are compared and discussed.

Topology of One-Electron, Intracule, and Extracule Densities

All molecular geometries were optimized at the HF/6-31G* level by means of the GAMESS package. First- and second-order density functions obtained at this
level of theory were then used to calculate ρ(r), I(r), and E(R). The profiles of ρ(r), I(r), and E(R) evaluated along the internuclear axis of N2, CO, and LiF are depicted in Figures 1-3, respectively. Interpretation of the topology of the ρ(r) distributions for the three molecules considered is very simple, all of them having two attractors located at the nuclear positions. The height of the attractors is directly related to the amount of electron density associated with each atom. This can be seen clearly for the LiF molecule, where the Li peak has a value of ca. 10 au while the F peak is about 400 au high (see Figure 3). The I(r) and E(R) profiles present two attractors located at the positions defined by the positive and negative values of the internuclear distance and at the nuclear positions, respectively, and an additional attractor located at the origin. Because of the electron-pair nature of the I(r) and E(R) distributions, the interpretation of attractors in I(r) and E(R) is significantly different from that in ρ(r). For instance, within the Hartree-Fock approximation, intra-atomic electron pairs furnish the attractor at the origin in I(r), while contributing to the attractors at the nuclear positions in E(R). On the other hand, in this particular series of diatomic molecules, interatomic electron-pair interactions are responsible for the attractors at internuclear distance positions in I(r), while furnishing the attractor at the origin in E(R) (provided that the molecules were previously centered). The topological characteristics of the electron density distributions of the systems under comparison will ultimately determine the topology of the similarity function evaluated from them (vide infra).
[Figure 1. ρ(r), I(r), and E(R) profiles along the internuclear axis (a.u.) for N2. Panels: N2 one-electron density, N2 intracule density, N2 extracule density.]

[Figure 2. ρ(r), I(r), and E(R) profiles along the internuclear axis (a.u.) for CO. Panels: CO one-electron density, CO intracule density, CO extracule density.]

[Figure 3. ρ(r), I(r), and E(R) profiles along the internuclear axis (a.u.) for LiF. Panels: FLi one-electron density, FLi intracule density, FLi extracule density.]
Computation of Second-Order Quantum Similarity Measures
To investigate the grid extension and spacing required to obtain a sufficiently accurate number of electron pairs, several definitions of two-dimensional grids were examined for computing I(r) and E(R). Values for Y_AA and X_AA can then be evaluated following Eqs. 26 and 27, respectively, defined above. For E(R) distributions, grid limits were defined by adding an extension value to the coordinates of the two atoms. For I(r) grids, grid limits were determined by adding the extension value to the interatomic distance, in the same plane. In all cases, density values were evaluated in the plane formed by the internuclear axis and one of the axes perpendicular to it. Owing to the particular definition of intracule and extracule coordinates, the extension value and grid step used for evaluating E(R) were both doubled when evaluating I(r). In this way, for the same number of points, Y_AA and X_AA values computed numerically from these grid definitions are expected to achieve comparable accuracy. In addition to the dependence of the number of electron pairs, Y_AA, and X_AA on the extension value and the step of the grids, the effect of using different values for the integral neglect threshold when computing I(r) and E(R) was also studied. Results obtained from I(r) and E(R) for the series of N2, CO, and LiF molecules are listed in Tables 3 and 4, respectively. The grids employed for evaluating I(r) and E(R) are given in order of increasing number of points. As a general trend, the accuracy of the number of electron pairs obtained by numerical integration of the I(r) and E(R) distributions (Eqs. 20 and 21, respectively) increases systematically as the grid is further extended and refined. It is also worth noting that the results obtained with the two integral neglect thresholds tested do not differ significantly.
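The statement that doubling both the extension and the step for I(r) keeps the number of grid points unchanged can be checked with a short sketch. It rests on one reading of the grid definitions above, taken here as an assumption: along the internuclear axis the E(R) grid runs from one nucleus minus the extension to the other nucleus plus the extension, and the I(r) grid runs from -(d + extension) to +(d + extension), where d is the internuclear distance. The function names are illustrative only.

```python
def axis_points(lower, upper, step):
    """Number of equally spaced points from lower to upper, both ends included."""
    return int(round((upper - lower) / step)) + 1


def extracule_axis_points(bond, ext, step):
    """Points along the axis for E(R): nuclei at -bond/2 and +bond/2, padded by ext."""
    half = bond / 2.0
    return axis_points(-half - ext, half + ext, step)


def intracule_axis_points(bond, ext_e, step_e):
    """Points along the axis for I(r), using twice the E(R) extension and step."""
    ext_i = 2.0 * ext_e
    step_i = 2.0 * step_e
    return axis_points(-(bond + ext_i), bond + ext_i, step_i)


if __name__ == "__main__":
    bond_n2 = 2.0378  # N-N distance in au, as quoted in the text
    # (extension, step) pairs for E(R) as listed in Table 4
    for ext, step in [(2.5, 0.10), (5.0, 0.10), (5.0, 0.05), (5.0, 0.02), (5.0, 0.01)]:
        print(ext, step,
              extracule_axis_points(bond_n2, ext, step),
              intracule_axis_points(bond_n2, ext, step))
```

Under this reading the I(r) axis is exactly twice as long as the E(R) axis, so the doubled step yields the same point count for both distributions.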
For that reason, I(r) and E(R) were evaluated on the two finest, and thus computationally most expensive, grids with a single value of the integral neglect threshold. So far, the number of electron pairs obtained numerically for the different molecules considered has been used to check the accuracy of I(r) and E(R) calculations on grids of points. In this sense, it is reasonable to expect a similar trend in accuracy for the second-order quantum self-similarity measures, Y_AA and X_AA. However, since the evaluation of similarity measures involves products of density values, Y_AA and X_AA are expected to depend strongly on the grid step. This effect will be particularly large in the regions around the attractors that contribute significantly to the total similarity. In contrast, the dependence on the extension of the grid is smaller, because the superposition in external regions involves products of low-density values, which contribute very little to the total value of the similarity measure. As an example of this critical effect for the N2 molecule, comparison of the accuracy in the number of electron pairs and in the Y_AA and X_AA values in Tables 3 and 4, respectively, shows that while the number of electron pairs in grids 3 and 8 [where the only difference is that the grid step has been refined from 0.20 to 0.02 au in I(r) and from 0.10 to 0.01 au in E(R)] changes from 90.011618
Table 3. Number of Electron Pairs (n.e.p.) and Self-Similarities (Y_AA) Computed from I(r) for the N2, CO, and LiF Molecules

                         N2                        CO                        LiF
Grid  Step   Ext.   n.e.p.      Y_AA         n.e.p.      Y_AA         n.e.p.      Y_AA
1     0.20    5.0   89.954418   128.650437   89.958387   132.475271   65.246080    97.756019
2     0.20    5.0   89.954417   128.650442   89.958293   132.475262   65.248329    97.755845
3     0.20   10.0   90.011618   128.650455   90.011981   132.475555   65.341073    97.756030
4     0.20   10.0   90.011617   128.650449   90.011886   132.475271   65.345114    97.755856
5     0.10   10.0   90.767124   140.695092   90.767405   144.324330   65.838921   103.210131
6     0.10   10.0   90.767122   140.695098   90.767311   144.324043   65.842961   103.219955
7     0.04   10.0   90.963268   143.524521   90.963391   147.097547   65.970634   104.591502
8     0.02   10.0   90.990813   143.906903   90.990918   147.473385   65.989160   104.776202

Step and Ext. are given in au. Thres.: the integral neglect threshold of each grid is not legible in the source; grids 1 and 2, 3 and 4, and 5 and 6 share the same step and extension and differ only in this threshold.
Table 4. Number of Electron Pairs (n.e.p.) and Self-Similarities (X_AA) Computed from E(R) for the N2, CO, and LiF Molecules

                         N2                         CO                         LiF
Grid  Step   Ext.   n.e.p.      X_AA          n.e.p.      X_AA          n.e.p.      X_AA
1     0.10    2.5   89.942172   1249.788832   89.940503   1294.987260   65.210030    958.056875
2     0.10    2.5   89.942172   1249.788832   89.940495   1294.987060   65.210924    958.059042
3     0.10    5.0   89.988485   1249.788903   89.984618   1294.987321   65.303102    958.056971
4     0.10    5.0   89.988485   1249.788902   89.984609   1294.987121   65.304345    958.059139
5     0.05    5.0   90.761349   1392.135885   90.760501   1432.041380   65.833487   1014.828059
6     0.05    5.0   90.761349   1392.135884   90.760493   1432.041152   65.833729   1014.830244
7     0.02    5.0   90.962346   1425.586745   90.962221   1464.330352   65.972046   1028.901076
8     0.01    5.0   90.990581   1430.118215   90.990561   1468.707001   65.991678   1030.790246

Step and Ext. are given in au. Thres.: the integral neglect threshold of each grid is not legible in the source; grids 1 and 2, 3 and 4, and 5 and 6 share the same step and extension and differ only in this threshold.
to 90.990813 (in Table 3) and from 89.988485 to 90.990581 (in Table 4), Y_AA (in Table 3) changes from 128.650455 to 143.906903, and X_AA (in Table 4) changes from 1249.788903 to 1430.118215. Comparable effects are found for the CO and LiF molecules. Taken together, the results in Tables 3 and 4 establish that small grid steps [0.02 au for I(r) and 0.01 au for E(R)] are required to obtain values for the number of electron pairs, Y_AA, and X_AA with acceptable accuracy. Consequently, all second-order similarities discussed below were evaluated using the definition of grid 8 in Tables 3 and 4.

Maximization of the Similarity Functions
In this section, the dependence of Z_AB, Y_AB, and X_AB on the molecular alignment is discussed. A point that deserves special comment here is that I(r) distributions are invariant to molecular translations. Consequently, an important aspect of evaluating second-order similarities from I(r) is that, provided that linear molecules are previously defined along the same axis, no similarity maximization is needed for Y_AB. In fact, a Y_AB similarity measure cannot be assigned to a unique molecular alignment in real molecular space, but only to a set of molecular alignments that share the same orientation. Thus, in general, for rigid matchings between two molecules A and B with coordinates r_A and r_B, the Z_AB(r_A, r_B) and X_AB(r_A, r_B) similarity functions must be optimized in a six-dimensional space (three translational and three rotational degrees of freedom), whereas the Y_AB(r_A, r_B) similarity function needs to be optimized only in a three-dimensional space (three rotational degrees of freedom). The question of Z_AB similarity maximization has recently been studied, and it has been shown that, since ρ(r) distributions are strongly localized on atomic nuclei, the optimal superpositions are achieved when atomic nuclei overlap strongly: the heavier the atoms, the more important their contribution to the total similarity and the more dominant their overlap in the superposition. The same situation is expected to occur for E(R) distributions. As only linear molecules are considered in this work, similarity maxima for ρ(r) and E(R) distributions will arise from alignments in which the internuclear axes of both molecules coincide. Consequently, Z_AB and X_AB maxima can be located by overlapping both molecular axes and allowing one of the molecules under comparison to move along the overlapped axis. This was done for the three possible molecular pairs from the current set: {N2,CO}, {N2,LiF}, and {CO,LiF}.
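The one-dimensional search described above, in which one molecule slides along the common axis while a Carbó-type index is monitored, can be illustrated with a toy model. The sketch below is not the authors' procedure and uses no ab initio densities: each atom is assumed to be a single spherical Gaussian exp(-α|r - p|²) weighted by its atomic number, the exponent α = 10 is arbitrary, and all function names are hypothetical. Only the bond lengths (2.0378 au for N2, 2.1047 au for CO) come from the text.

```python
import math


def overlap(mol_a, mol_b, shift, alpha):
    """Overlap integral of two Gaussian model densities; mol_a is shifted along the axis.

    For two s-type Gaussians with equal exponent alpha separated by d, the 3D
    overlap is (pi / (2*alpha))**1.5 * exp(-alpha * d**2 / 2).
    """
    prefactor = (math.pi / (2.0 * alpha)) ** 1.5
    total = 0.0
    for weight_a, pos_a in mol_a:
        for weight_b, pos_b in mol_b:
            d = (pos_a + shift) - pos_b
            total += weight_a * weight_b * prefactor * math.exp(-0.5 * alpha * d * d)
    return total


def carbo_scan(mol_a, mol_b, alpha=10.0, start=-3.0, stop=3.0, step=0.01):
    """Return [(displacement, index)] with index = Z_AB / sqrt(Z_AA * Z_BB)."""
    norm = math.sqrt(overlap(mol_a, mol_a, 0.0, alpha) * overlap(mol_b, mol_b, 0.0, alpha))
    n_steps = int(round((stop - start) / step))
    return [(start + i * step, overlap(mol_a, mol_b, start + i * step, alpha) / norm)
            for i in range(n_steps + 1)]


def local_maxima(curve):
    """Interior points of the scan that are higher than their neighbours."""
    return [curve[i] for i in range(1, len(curve) - 1)
            if curve[i][1] > curve[i - 1][1] and curve[i][1] >= curve[i + 1][1]]


# Toy molecules as (weight, position on the axis in au), centered at the bond midpoint.
N2 = [(7, -1.0189), (7, 1.0189)]
CO = [(6, -1.05235), (8, 1.05235)]   # C on the negative side, O on the positive side

if __name__ == "__main__":
    for displacement, index in local_maxima(carbo_scan(N2, CO)):
        print(f"D = {displacement:6.2f} au   index = {index:.4f}")
```

Even this crude model yields three maxima for a first-order-like index: a dominant one near zero displacement, where both atom pairs overlap, and two weaker ones where only a single N atom sits on C or on O.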
In addition, for the {CO,LiF} pair, two orientations have been taken into account, corresponding to the two possible relative orientations between these two molecules.

[Figure 4. C^Z_AB (dotted line) and C^X_AB (solid line) similarity functions for the {N2,CO} pair, plotted against the displacement (a.u.).]

Figure 4 depicts the variation of the first-order (C^Z_AB) and second-order extracule (C^X_AB) Carbó similarity indices for the {N2,CO} pair as N2 is translated over the CO molecule, which is kept at the origin. The use of similarity indices instead of similarity measures allows for a better comparison between the two types of similarity obtained when molecules are represented by their ρ(r) and E(R) distributions. Each maximum of C^Z_AB and C^X_AB in Figure 4 has been associated with a given molecular alignment and labeled accordingly. The five molecular matchings recognized when comparing {N2,CO}, together with the relative position of the molecules and the similarity index values, are collected in Table 5. Three sharp maxima appear in the C^Z_AB function: the global maximum (3, C^Z_AB = 0.9410) arises from aligning the CO and N2 molecules by optimally matching one N atom with the C atom and the other N atom with the O atom (N-C, N-O). Since N-N and C-O
Table 5. Displacement between the Centers of the Molecules (D, in au) and C^Z_AB and C^X_AB Values for the Different Maxima (Shown in Boldface in the Original) Located in the Corresponding Similarity Functions for the {N2,CO} Pair (See Figure 4)

Matchings in {N2,CO}      D       C^Z_AB    C^X_AB
1. (N-C)                 -2.06    0.3766    0.1393
2. (N-•, •-C)            -0.98    0.0530    0.5299
3. (N-C, N-O)             0.01    0.9410    0.9782
4. (N-•, •-O)             1.01    0.0602    0.7368
5. (N-O)                  2.06    0.5984    0.2476
distances are quite similar (2.0378 au and 2.1047 au, respectively), atoms from the two molecules can be closely superposed. The other two maxima appear when aligning one N with C (1, C^Z_AB = 0.3766) and one N with O (5, C^Z_AB = 0.5984). This result is fully consistent with previous studies on Z_AB measures: since ρ(r) distributions are strongly localized around atomic nuclei, the major contributions to the Z_AB measure come from very close atom-atom overlaps. As atoms begin to separate, their contribution to the similarity function diminishes very quickly. The situation is quite different for the C^X_AB profile. In this case, there are also three maxima, but while the global maximum (3, C^X_AB = 0.9782) is assigned again to the matching of the (N-C, N-O) atoms and of the attractors at the centers of the two molecules (labeled •-•), the other two maxima arise, respectively, from matching the center of the N2 molecule with the C atom and the center of the CO molecule with one of the N atoms (2, C^X_AB = 0.5299), and from matching the N2 center with the O atom and the center of the CO molecule with one N atom (4, C^X_AB = 0.7368). For the (N-C) and (N-O) alignments located as maxima in C^Z_AB, only slight shoulders appear in the C^X_AB function (matchings 1 and 5). To understand the differences between the Z_AB and X_AB similarity spaces, one must go back to the topologies of the corresponding ρ(r) and E(R) distributions (Figures 1 and 2). As stated above, ρ(r) distributions present strong, sharp attractors around nuclei, and ρ(r) values decay quickly away from these attractors. On the other hand, E(R) distributions present attractors at the nuclei, but also at the centers of the molecules (due to interatomic electron-electron interactions). For N2 and CO, the strongest peaks in the E(R) distributions are those located at the origin (the centers of the molecules).
The E(R) values at the attractors are consistent with the number of electron pairs contributing to them, calculated from the number of electrons that would be formally assigned to each atom. Following this qualitative approach, we should find 28 electron pairs for the O attractor, 21 for N, 15 for C, 49 for the N2 center, and 48 for the CO center. The relation between these figures is in good qualitative agreement with the actual E(R) values at the corresponding attractors. It may now be clearer that the global maximum in C^X_AB collects contributions from the matchings (N-C, N-O, •-• in 3), whereas the other two maxima get contributions from the (•-N, •-C in 2) alignments and from the (•-N, •-O in 4) alignments (see Table 5). The other two maxima identified in the C^Z_AB function (matchings 1 and 5) appear only as shoulders in C^X_AB because only a single attractor from each E(R) distribution is overlapping (N-C in 1 and N-O in 5). Finally, another point worth mentioning is that the shape of the C^X_AB function is significantly smoother than the shape of the C^Z_AB function, mainly due to the close proximity of maxima in C^X_AB.

Similarity-index functions for the {N2,LiF} pair are depicted in Figure 5. As in the previous case, attractor alignments are labeled in the figure and listed in Table 6. The first interesting result from Figure 5 is that the global maxima for C^Z_AB and C^X_AB are associated with different molecular alignments. It is also observed that the topology of the C^Z_AB and C^X_AB functions for the {N2,LiF} pair is now more
complicated than it was for the {N2,CO} pair. Focusing first on the variation of C^Z_AB when translating N2 over LiF, two important maxima are clearly visible, matchings 1 and 4, which are associated with external and internal (N-F) alignments with C^Z_AB values of 0.6865 and 0.6922, respectively. Furthermore, two additional maxima appear, matchings 6 and 9, which can be assigned to internal and external (N-Li) alignments with C^Z_AB values of 0.1238 and 0.0911, respectively. The presence of four maxima when comparing {N2,LiF} from ρ(r) distributions, instead of the three maxima found in the comparison of {N2,CO}, is due to the fact that the internuclear distance in N2 (2.0378 au) is significantly smaller than in LiF (2.9384 au). In contrast, the N2 internuclear distance was comparable to that in CO (2.1047 au). Consequently, the four atoms cannot be matched simultaneously, as was the case in {N2,CO}, and an additional maximum appears. Three maxima appear in the C^X_AB function for {N2,LiF}. The global maximum (2, C^X_AB = 0.8244) is assigned to the overlap of the (•-F) attractors, and the other two are assigned to the internal (N-F) alignment (4, C^X_AB = 0.6348) and to the overlap of the two central (•-•) attractors in E(R) (5, C^X_AB = 0.6497). Alignments of other attractors in the topology of E(R) do not give rise to a maximum in this case, but to a shoulder in the shape of the C^X_AB function. The internal (N-F) alignment is the only one giving rise to a maximum (matching 4) in both the C^Z_AB and C^X_AB functions (see Table 6). The differences between the C^Z_AB and C^X_AB similarity functions for the {N2,CO} and {N2,LiF} pairs are due to the differences in the electron density distributions of CO and LiF. From the one-electron density for the CO molecule, it is observed that the density at the position of the O atom is approximately twice as high as that at the C atom (see Figure 2).
For the LiF molecule, the density at the position of the Li atom is more than ten times smaller than that at the F atom (see Figure 3). This is reflected by the C^Z_AB values for matchings 1 (N-C) and 5 (N-O) of the {N2,CO} pair (see Table 5) and matchings 1 (N-F) and 9 (N-Li) of the {N2,LiF} pair (see Table 6), from which the following order for atom-atom matchings is observed: (N-F) > (N-O) > (N-C) > (N-Li). Moreover, as pointed out above, while for the {N2,CO} pair the global maximum arises from a double (N-C, N-O) alignment (matching 3 in Figure 4), an (N-F, N-Li) alignment for the {N2,LiF} pair is not possible because of the large LiF interatomic distance. As regards the C^X_AB function, the results show that the global maxima for {N2,CO} and {N2,LiF} arise from matching the two highest attractors in the corresponding E(R) distributions which, in N2 and CO, correspond to the attractor at the center of mass (furnished by interatomic electron-pair interactions) but, in LiF, correspond to the attractor at the position of the F atom (furnished by intra-atomic electron-pair interactions). This is the reason why the global maximum for {N2,LiF} in Figure 5 aligns the center of the N2 molecule with the F atom, and not with the center of the LiF molecule. The results of the similarity study on the {CO,LiF} pair can be anticipated from the discussion above for the {N2,CO} and {N2,LiF} pairs. However, the similarity study of the {CO,LiF} pair has the additional interest of requiring the exploration of the similarity functions for two possible relative orientations of one
molecule with respect to the other. The results can be visually analyzed in Figures 6 and 7. Values of C^Z_AB and C^X_AB at each similarity-index maximum are gathered in Tables 7 and 8. Following the arguments stated above, regardless of the relative orientation of the two molecules, C^Z_AB has a maximum when (O-F) are aligned (matching 1 for the [CO,FLi] orientation in Figure 6 and matching 4 for the [OC,FLi] orientation in Figure 7). For the [OC,FLi] orientation, C^Z_AB at the global maximum (0.8403) is slightly larger than that for the [CO,FLi] orientation (0.8338), because of the small additional overlap of the C atom with the Li atom. The second maximum in importance occurs when (C-F) are aligned. In this case, the C^Z_AB value for the [CO,FLi] orientation (0.5120) is slightly larger than that for the [OC,FLi] orientation (0.5013), because of the extra overlap of the O atom with the Li atom. On the other hand, the global maxima of the C^X_AB functions are achieved when matching the center of the CO molecule with the F atom. The overlap of the O atom (matching 2 in the [CO,FLi] orientation) or of the C atom (matching 2 in the [OC,FLi] orientation) with the center of the LiF molecule also contributes to the C^X_AB value at the global maximum, the former giving rise to a larger C^X_AB value (0.8499) than the latter (0.7645).

Construction of Similarity Matrices

Similarity matrices containing the values of first- and second-order similarity measures and indices at the global maxima located in the previous section for each molecular pair are presented in Table 9.

Molecular Self-Similarities. Self-similarity values will be discussed first. Self-similarities are reported on the diagonal of the similarity matrices in Table 9.
According to these values, the following orderings can be derived:

Z_AA: LiF > CO > N2
Y_AA: CO > N2 > LiF
X_AA: CO > N2 > LiF

The usefulness of Z_AA as a quantitative measure of electronic concentration (or dispersion) has already been discussed. In our series of molecules, LiF (121.8149) is the molecule showing the highest concentration of the one-electron density, despite having fewer electrons (12) than CO and N2 (14). This is due to the fact that most of the electron density in LiF is locally concentrated around the F atom. By the same argument, the one-electron density in CO (112.5459) is more locally concentrated (around the O atom) than in N2 (104.6178), which has its one-electron density more uniformly distributed: while CO has one attractor ca. 300 au high (on O) and one of ca. 125 au (on C), N2 has two attractors ca. 200 au high (see the one-electron density distributions in Figures 1 and 2).
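The electron and electron-pair counts invoked in this discussion follow from simple combinatorics and can be reproduced with the sketch below. It assumes the formal assignment used in the text, in which each neutral atom carries as many electrons as its atomic number; n electrons on one atom form n(n - 1)/2 intra-atomic pairs, and two atoms with n and m electrons share n·m interatomic pairs. The function name pair_partition is illustrative.

```python
from math import comb


def pair_partition(electrons_per_atom):
    """Split the electron pairs of a molecule into intra- and interatomic pairs.

    electrons_per_atom: list of (label, number of electrons formally on that atom).
    Returns (intra-atomic pairs per atom, interatomic pairs, total pairs).
    """
    intra = [(label, comb(n, 2)) for label, n in electrons_per_atom]
    counts = [n for _, n in electrons_per_atom]
    inter = sum(counts[i] * counts[j]
                for i in range(len(counts))
                for j in range(i + 1, len(counts)))
    total = comb(sum(counts), 2)
    return intra, inter, total


if __name__ == "__main__":
    molecules = {
        "N2": [("N", 7), ("N", 7)],
        "CO": [("C", 6), ("O", 8)],
        "LiF": [("Li", 3), ("F", 9)],
    }
    for name, atoms in molecules.items():
        intra, inter, total = pair_partition(atoms)
        print(name, "intra:", intra, "inter:", inter, "total:", total)
```

The intra-atomic and interatomic counts always add up to the total number of pairs, which is the quantity recovered by numerical integration of I(r) or E(R).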
[Figure 5. C^Z_AB (dotted line) and C^X_AB (solid line) similarity functions for the {N2,LiF} pair, plotted against the displacement (a.u.).]

Table 6. Displacement between the Centers of the Molecules (D, in au) and C^Z_AB and C^X_AB Values for the Different Maxima (Shown in Boldface in the Original) Located in the Corresponding Similarity Functions for the {N2,LiF} Pair (See Figure 5)

Matchings in {N2,LiF}     D        C^Z_AB    C^X_AB
1. (N-F)                 -2.48     0.6865    0.3602
2. (•-F)                 -1.44     0.0532    0.8244
3. (N-•)                 --(a)     --        --
4. (N-F)                 -0.45     0.6922    0.6348
5. (•-•)                 -0.06     0.1086    0.6497
6. (N-Li)                 0.42     0.1238    0.3702
7. (N-•)                 --        --        --
8. (•-Li)                --        --        --
9. (N-Li)                 2.48     0.0911    0.0194

Note: (a) Dash (--) indicates absence of a maximum in both C^Z_AB and C^X_AB.

The ordering becomes quite different when comparing the values obtained for the two-electron self-similarities. According to Y_AA and X_AA, the molecules are ordered as CO > N2 > LiF, although CO and N2 are much closer to each other than N2 and LiF. This trend can be easily rationalized if one realizes that only 66 electron-pair interactions
[Figure 6. C^Z_AB (solid line) and C^X_AB (dotted line) similarity functions for the {CO,LiF} pair ([CO,FLi] orientation), plotted against the displacement (a.u.).]

Table 7. Displacement between the Centers of the Molecules (D, in au) and C^Z_AB and C^X_AB Values for the Different Maxima (Shown in Boldface in the Original) Located in the Corresponding Similarity Functions for the {CO,LiF} Pair in the [CO,FLi] Orientation (See Figure 6)

Matchings in {CO,LiF},
[CO,FLi] orientation      D        C^Z_AB    C^X_AB
1. (O-F)                 -2.51     0.8338    0.4804
2. (•-F)                 -1.44     0.0478    0.8499
3. (O-•)                 --(a)     --        --
4. (C-F)                 -0.41     0.5120    0.5467
5. (•-•)                 -0.05     0.1102    0.5776
6. (O-Li)                 0.39     0.1306    0.3379
7. (C-•)                 --        --        --
8. (•-Li)                --        --        --
9. (C-Li)                 2.51     0.0729    0.0145

Note: (a) Dash (--) indicates absence of a maximum in both C^Z_AB and C^X_AB.
[Figure 7. C^Z_AB (solid line) and C^X_AB (dotted line) similarity functions for the {CO,LiF} pair ([OC,FLi] orientation), plotted against the displacement (a.u.).]

Table 8. Displacement between the Centers of the Molecules (D, in au) and C^Z_AB and C^X_AB Values for the Different Maxima (Shown in Boldface in the Original) Located in the Corresponding Similarity Functions for the {CO,LiF} Pair in the [OC,FLi] Orientation (See Figure 7)

Matchings in {CO,LiF}, [OC,FLi] orientation: 1. (C-F); 2. (•-F); 3. (C-•); 4. (O-F); 5. (•-•); 6. (C-Li); 7. (•-Li); 8. (O-•); 9. (O-Li)
D: -2.51, -1.44, --(a), -0.41, -0.37, -0.10, 0.36, --, 0.92, 2.51
C^Z_AB: 0.5013, 0.0486, 0.8403, 0.7960, 0.1519, 0.1171, 0.0355, 0.1036
C^X_AB: 0.2521, 0.7645, 0.7276, 0.7298, 0.7328, 0.4472, 0.3191, 0.0254

Note: (a) Dash (--) indicates absence of a maximum in both C^Z_AB and C^X_AB.
Table 9. Similarity Matrices for the Set of N2, CO, and LiF Diatomic Molecules(a)

One-electron similarity matrix
         N2           CO           LiF
N2       104.6178     0.9410       0.6922
CO       102.1060     112.5459     0.8403
LiF      78.1381      97.6297      121.8149

Intracule similarity matrix
         N2           CO           LiF
N2       143.9069     0.9919       0.8129
CO       144.4941     147.4734     0.8391
LiF      99.8228      104.3031     104.7762

Extracule similarity matrix
         N2           CO           LiF
N2       1430.1182    0.9782       0.8244
CO       1417.7297    1468.7070    0.8499
LiF      1000.8985    1045.8006    1030.7902

Note: (a) Diagonal elements are self-similarities (boldface in the original); elements below the diagonal are QSM (roman); elements above the diagonal are Carbó indices (italic).
are possible in LiF, in comparison with the 91 electron-pair interactions in CO and N2. However, of the two isoelectronic molecules, CO has I(r) and E(R) distributions that are slightly more concentrated than those of N2. Furthermore, comparison of Z_AA with Y_AA and X_AA reveals that the density redistribution between N2 and CO is not as important for the I(r) and E(R) distributions as it was for the ρ(r) distributions. For instance, it can be observed from the E(R) distributions presented in Figures 1 and 2 that the attractors at the center of mass of both molecules are ca. 400 au high and, in fact, the attractor in N2 is slightly higher than that in CO. This result is consistent with the formal assignment of 49 and 48 interatomic electron-electron interactions in N2 and CO, respectively.

Pairwise Molecular Similarities. Pairwise comparisons between molecules can be performed by analyzing the nondiagonal terms in the similarity matrices. The following discussion is based on Carbó similarity indices, which provide a more convenient means of comparing molecules across different types of similarity measures. From the values in Table 9, it follows that for all similarity matrices the ordering of the three nondiagonal elements is {N2,CO} > {CO,LiF} > {N2,LiF}, as could be qualitatively expected from the electronic nature of the molecules under study. A more detailed analysis reveals that for the {N2,CO} and {CO,LiF} pairs the differences between first-order (C^Z_AB) and second-order (C^Y_AB and C^X_AB) similarity indices are relatively small. For example, for the {N2,CO} pair, they are ordered as
C^g > C^^B > C \ B, while for the {CO, LiF} pair the ordering is C^g > C^^g > C^g. In contrast, for the {N2,LiF} pair, while the relative ordering is again $C_{AB}^{E} > C_{AB}^{I} > C_{AB}^{\rho}$, there is a large quantitative difference between the respective values: 0.8500, 0.8391, and 0.6922. These trends can be understood by looking at the ρ(r), I(r), and E(R) profiles in Figures 1-3. Essentially, the low height (ca. 200 au) of the attractors on N in the ρ(r) distribution for N2 and the greater height (ca. 400 au) of the attractor on F in the ρ(r) distribution for LiF are responsible for the $C_{AB}^{\rho}$ value of 0.6922 for the similarity between N2 and LiF (matching 4 in Figure 5). The situation is reversed when comparing the E(R) distributions of the two molecules. In this case, the central attractor for N2 (ca. 400 au) is higher than that for LiF (ca. 250 au), which gives a $C_{AB}^{E}$ value of 0.8499.
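The relation between the similarity measures and the Carbo indices discussed above can be checked numerically. The following is a minimal sketch, assuming the usual definition of the Carbo index, C_AB = Z_AB / sqrt(Z_AA Z_BB), and using the intracule values quoted in Table 9. The assignment of each tabulated number to a molecule pair is inferred from the table layout and is an assumption, as are the function and variable names.

```python
import math

def carbo_index(z_ab, z_aa, z_bb):
    """Carbo index: similarity measure normalized by both self-similarities."""
    return z_ab / math.sqrt(z_aa * z_bb)

# Intracule self-similarities and pairwise measures (pair assignment assumed).
self_sim = {"N2": 143.9069, "CO": 147.4734, "LiF": 104.7762}
cross = {
    ("N2", "CO"): 144.4941,
    ("N2", "LiF"): 99.8228,
    ("CO", "LiF"): 104.3031,
}

carbo = {
    pair: carbo_index(z, self_sim[pair[0]], self_sim[pair[1]])
    for pair, z in cross.items()
}

for (a, b), c in carbo.items():
    print(f"C({a},{b}) = {c:.4f}")
```

Because the index is normalized by the self-similarities, values obtained from ρ(r), I(r), and E(R) can be compared directly even though the underlying measures differ by an order of magnitude.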
IV. CONCLUSIONS
A comparison of one-electron, intracule, and extracule similarity measures and indices computed from the respective density distributions for a series of atomic and molecular systems has revealed that, although similar trends can be observed in some cases, in general the values for the three types of similarity do not have to follow the same trend. Furthermore, it has been shown how the topological characteristics of one-electron, intracule, and extracule density distributions determine the topology of similarity functions. As a consequence, different similarity measures can lead to different optimal alignments associated with their global similarity maximum. We hope that future algorithmic and computational developments will allow computing I(r) and E(R) distributions on large grids of points for larger molecular systems. This would allow comparing the behavior of first- and second-order similarities for a larger series of molecules, and may reveal applications for which the newly defined second-order similarities perform better than the widely used first-order similarities. In particular, second-order similarities computed from intracule densities appear to be a good choice for analyzing the quality of wave functions calculated at different levels of theory, because of their inherent advantage during the alignment procedure and the special sensitivity of the I(0) attractor to correlation effects.
ACKNOWLEDGMENTS
This work has been supported by the Spanish DGICYT Project No. PB95-0762. X.F. benefits from a doctoral fellowship from the University of Girona. We also thank the Centre de Supercomputació de Catalunya (CESCA) for a generous allocation of computing time.
REFERENCES
1. Löwdin, P. O. Phys. Rev. 1955, 97, 1474.
2. Carbo, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185.
3. Sola, M.; Mestres, J.; Oliva, J. M.; Duran, M.; Carbo, R. Int. J. Quantum Chem. 1996, 58, 361.
4. Carbo, R.; Domingo, L. Int. J. Quantum Chem. 1987, 32, 517.
5. Cioslowski, J.; Fleischmann, E. D. J. Am. Chem. Soc. 1991, 113, 64.
6. Cooper, D. L.; Allan, N. L. J. Am. Chem. Soc. 1992, 114, 4773.
7. Carbo, R.; Calabuig, B.; Vera, L.; Besalu, E. Adv. Quantum Chem. 1994, 25, 253.
8. Mestres, J.; Sola, M.; Duran, M.; Carbo, R. J. Comput. Chem. 1994, 15, 1113.
9. Sola, M.; Mestres, J.; Carbo, R.; Duran, M. J. Am. Chem. Soc. 1994, 116, 5909.
10. Besalu, E.; Carbo, R.; Mestres, J.; Sola, M. Top. Curr. Chem. 1995, 173, 31.
11. Constans, P.; Carbo, R. J. Chem. Inf. Comput. Sci. 1995, 35, 1046.
12. Mestres, J.; Sola, M.; Carbo, R.; Luque, F. J.; Orozco, M. J. Phys. Chem. 1996, 100, 606.
13. Sola, M.; Mestres, J.; Carbo, R.; Duran, M. J. Chem. Phys. 1996, 104, 636.
14. Carbo, R.; Besalu, E.; Amat, L.; Fradera, X. J. Math. Chem. 1996, 19, 47.
15. Cioslowski, J.; Stefanov, B.; Constans, P. J. Comput. Chem. 1996, 17, 1352.
16. Carbo-Dorca, R.; Mezey, P. G., Eds. Advances in Molecular Similarity, Vol. 1; JAI Press: Greenwich, CT, 1996.
17. Fradera, X.; Amat, L.; Besalu, E.; Carbo-Dorca, R. Quant. Struct.-Act. Relat. 1997, 16, 25.
18. Constans, P.; Amat, L.; Carbo-Dorca, R. J. Comput. Chem. 1997, 18, 826.
19. Bader, R. F. W. Atoms in Molecules: A Quantum Theory; Clarendon: London, 1990.
20. Ponec, R. In Ref. 16.
21. Strnad, M.; Ponec, R. Int. J. Quantum Chem. 1994, 49, 35.
22. Coleman, A. J. Int. J. Quantum Chem. 1967, 1S, 457.
23. Thakkar, A. J.; Smith, V. H., Jr. Chem. Phys. Lett. 1976, 42, 476.
24. Carlsson, A. E.; Ashcroft, N. W. Phys. Rev. B 1982, 25, 3474.
25. Thakkar, A. J. J. Chem. Phys. 1986, 84, 6830.
26. Cioslowski, J.; Stefanov, B.; Tang, A.; Umrigar, C. J. J. Chem. Phys. 1995, 103, 6093.
27. Wang, J.; Smith, V. H., Jr. Chem. Phys. Lett. 1994, 220, 331.
28. Sarasola, C.; Dominguez, L.; Aguado, M.; Ugalde, J. M. J. Chem. Phys. 1992, 96, 6778.
29. Thakkar, A. J.; Tripathi, A. N.; Smith, V. H., Jr. Int. J. Quantum Chem. 1984, 26, 157.
30. Breitenstein, M.; Meyer, H.; Schweig, A. Chem. Phys. 1988, 124, 47.
31. Wang, J.; Smith, V. H., Jr. Int. J. Quantum Chem. 1994, 49, 147.
32. Ugalde, J. M.; Sarasola, C. Phys. Rev. A 1994, 49, 3081.
33. Cioslowski, J.; Liu, G. J. Chem. Phys. 1996, 105, 4151.
34. Cioslowski, J.; Liu, G. J. Chem. Phys. 1996, 105, 8187.
35. Fradera, X.; Duran, M.; Mestres, J. J. Chem. Phys. 1997, 107, 3576.
36. Fradera, X.; Duran, M.; Mestres, J. Theor. Chem. Acc. 1998, 99, 44.
37. Gaussian 94: Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Gill, P. M. W.; Johnson, B. G.; Robb, M. A.; Cheeseman, J. R.; Keith, T.; Petersson, G. A.; Montgomery, J. A.; Raghavachari, K.; Al-Laham, M. A.; Zakrzewski, V. G.; Ortiz, J. V.; Foresman, J. B.; Peng, C. Y.; Ayala, P. Y.; Chen, W.; Wong, M. W.; Andres, J. L.; Replogle, E. S.; Gomperts, R.; Martin, R. L.; Fox, D. J.; Binkley, J. S.; Defrees, D. J.; Baker, J.; Stewart, J. P.; Head-Gordon, M.; Gonzalez, C.; Pople, J. A. Gaussian, Inc.: Pittsburgh, PA, 1995.
38. Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S. J.; Windus, T. L.; Dupuis, M.; Montgomery, J. A. J. Comput. Chem. 1993, 14, 1347.
39. Constans, P.; Amat, L.; Fradera, X.; Carbo-Dorca, R. In Ref. 16.
THE COMPLEMENTARITY PRINCIPLE AND ITS USES IN MOLECULAR SIMILARITY AND RELATED ASPECTS
Jerry Ray Dias
Abstract
I. Introduction
II. Basic Definitions
III. Aufbau Principle
IV. Results and Discussion
   A. Self-Complementary Molecular Graphs
   B. Infinite Series of Molecular Graphs that Are Pairwise Strongly Subspectral
V. Conclusion
References
ABSTRACT Properties and theorems of complementary molecular graphs are delineated. Aufbau constructions that constitute inductive proofs are included. Collections of strongly subspectral molecular graphs are tabulated.
Advances in Molecular Similarity, Volume 2, pages 245-258. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
I. INTRODUCTION
Molecular modeling involves the analysis of a given structure in terms of its elementary substructures, stereochemistry, symmetry, shape, size, and similarity to other structures. These six S's (structure, stereochemistry, symmetry, shape, size, and similarity) are intricately related and interwoven.^1 When comparing two molecular structures, some type of similarity is sought whereby one might be characterized by what is known about the other. Similarity is the degree of overlap between two or more structures and has been the subject of numerous studies.^1 The more elementary substructures (e.g., atoms, bonds, fragments, subgraphs, functional groups) two molecules have in common and the closer they are in size and symmetry, the more similar they are. Similarity serves as a conceptual and molecular modeling tool that allows existing knowledge about molecular systems to be correlated, assembled, and integrated, parameters that are difficult or impossible to measure to be calculated, hypotheses to be formulated and inexpensively tested, and gaps in knowledge to be pinpointed. Shape, size, and stereochemistry measure different spatial characteristics of molecules. Both geometrical and orbital symmetry play a vital role in the interpretation and understanding of electronic, vibrational, and NMR spectra of molecules.^2 The fact that even isospectral molecules (isomeric molecules having the same eigenvalue set) with different symmetries will have different photoelectron ionization spectra^3,4 emphasizes the importance of this variable when considering similarity. The importance of symmetry in similarity comparisons of conjugated polyenes is also emphasized by the fact that molecular graphs with greater than twofold symmetry are guaranteed to have a doubly degenerate eigenvalue subset.
The search for and study of structural invariants is a vital undertaking in similarity studies and the development of topological indices.^5,6 Allied to this endeavor, the search for and discovery of elementary substructures with specific relations among their eigenvalues is of great relevance for the qualitative understanding of chemical systems. Characteristic polynomials, eigenvectors, recurring eigenvalues,^7 embedding fragments (substructures),^8,9 and right-hand mirror-plane fragments^'^^ are just some examples of quantum chemical-based invariants. Two molecular graphs are subspectral if they have one or more eigenvalues in common. The more eigenvalues two subspectral molecular graphs have in common, the more similar they are, other things being equal. Subspectrality is one kind of measure of similarity that is maximized if the frontier molecular orbitals are included in the common eigenvalues. The HMO model is particularly important when dealing with π-electron systems. HMO does not include other variables, such as strain-related components, which must be determined separately. Embedding fragments and right-hand mirror-plane fragments are molecular orbital functional groups.^"^^ This chapter reports our recent studies on complementary molecular graphs, which are correlated by an expanded version of our aufbau principle.^11
II. BASIC DEFINITIONS
A molecular graph is the C-C σ-bond skeleton representation of a fully conjugated polyene molecule. Such a graph, therefore, omits the C and H atoms and the C-H and pπ bonds. Since most polycyclic conjugated polyenes can have more than one arrangement of their π-bonds, the molecular graph representation avoids artificially representing these molecular systems by writing only one of these arrangements. Molecular energy level and eigenvalue are synonymous, as are wave function and eigenvector. The highest occupied MO (HOMO) and the lowest unoccupied MO (LUMO) are called the frontier MOs (FMOs). Strongly subspectral molecular graphs have a preponderance of common eigenvalues. Isospectral molecular graphs have precisely the same eigenvalue spectrum. Almost-isospectral molecular graphs are strongly subspectral molecular graphs with 0,0, ±1, or ±2 as unique eigenvalue pairs."^ Functional groups are substructures (groups of interconnected atoms) having a characteristic set of properties that are conveyed to the whole structure. If two eigenvalues (X) within a single molecular graph or two related mirror-plane fragment graphs sum to zero ($X_1 + X_2 = 0$), they are said to be paired. The well-known pairing theorem states that all eigenvalues in a conjugated alternant hydrocarbon (AH) are either zero (nonbonding) or paired (bonding and antibonding). AHs have no odd-size rings, and every other carbon vertex can be starred so that no two starred and no two unstarred positions are adjacent. The eigenvector coefficients for the starred positions of the AH are unchanged in going from one eigenvalue ($X_1$) to its paired partner ($X_2$), and for the unstarred positions the sign (but not the magnitude) changes in going from one eigenvalue to its paired partner; if an eigenvalue has no paired partner (i.e., X = 0), then the coefficients of the unstarred positions are zero.
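The pairing theorem stated above can be illustrated on the smallest nontrivial AH. The following is a minimal sketch for the molecular graph of 1,3-butadiene, using the closed-form eigenpair of a four-vertex path; the vertex numbering, the choice of starred positions, and all names are illustrative assumptions.

```python
import math

def matvec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def is_eigenpair(a, vec, val, tol=1e-12):
    """True if a @ vec equals val * vec within tol."""
    return all(abs(lhs - val * c) < tol for lhs, c in zip(matvec(a, vec), vec))

# Adjacency matrix of the butadiene molecular graph (path C1-C2-C3-C4).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
starred = [True, False, True, False]  # C1 and C3 starred (assumed choice)

# Closed-form eigenpair of a 4-vertex path: X = 2 cos(pi/5), c_j = sin(j pi/5).
x = 2 * math.cos(math.pi / 5)
v = [math.sin(j * math.pi / 5) for j in range(1, 5)]

# Paired partner: starred coefficients unchanged, unstarred coefficients flipped.
partner = [c if s else -c for c, s in zip(v, starred)]

print(is_eigenpair(A, v, x))         # eigenvector for  X
print(is_eigenpair(A, partner, -x))  # eigenvector for -X
```

Flipping the sign of the unstarred coefficients turns the eigenvector of X into that of -X, which is the relationship the complementarity principle later modifies to X and -1 - X.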
When an internal mirror-plane of symmetry divides a molecular graph into two parts, the vertices on the mirror-plane remain with the left-hand fragment, and vertices in the right-hand fragment originally connected by a bisected edge have weights of -1.^^ If two eigenvalues in a single molecular graph, a single right-hand mirror-plane fragment, or two related molecular graphs or right-hand mirror-plane fragments sum to minus one ($X_1 + X_2 = -1$), they are said to be complementary.}^ Two equal-sized right-hand mirror-plane fragments are complementary if all of their eigenvalues are complementary; the normal vertices of one of the complementary right-hand fragments correspond to -1 weighted vertices in the other, and both have the same sets of normalized eigenvector coefficients whose relative signs are fixed for the starred positions in going from one to the other. Two AH molecular graphs are complementary if their right-hand mirror-plane fragments containing normal and -1 weighted vertices are complementary. If a molecular graph has a right-hand mirror-plane fragment that contains an equal number of normal and -1 weighted vertices which, when interchanged, give the same fragment, then both this molecular graph and its right-hand fragment are said to be self-complementary. For
a given eigenvalue, the McClelland mirror-plane of symmetry^10 defines an antisymmetric relationship among the coefficients of the relevant eigenvector.
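Mirror-plane fragmentation can be made concrete with 1,3-butadiene, whose mirror plane bisects the central C2-C3 bond. The sketch below is illustrative only: the right-hand fragment carries the -1 weight described above, while the +1 weight on the left-hand fragment follows the usual McClelland convention and is an assumption not stated in the text. Function and variable names are hypothetical.

```python
import math

def eigenvalues_2x2(w1, w2):
    """Eigenvalues of [[w1, 1], [1, w2]]: two bonded vertices with weights w1, w2."""
    mean = (w1 + w2) / 2
    radius = math.sqrt(((w1 - w2) / 2) ** 2 + 1)
    return [mean - radius, mean + radius]

left = eigenvalues_2x2(0, 1)    # C1-C2 fragment, C2 weighted +1 (assumed convention)
right = eigenvalues_2x2(-1, 0)  # C3-C4 fragment, C3 weighted -1

# Eigenvalues of the full butadiene graph: 2 cos(k pi/5), k = 1..4.
butadiene = sorted(2 * math.cos(k * math.pi / 5) for k in range(1, 5))
fragments = sorted(left + right)

print([round(x, 4) for x in fragments])
print([round(x, 4) for x in butadiene])
print(round(sum(right), 4))  # the two right-hand eigenvalues sum to -1
```

The two fragment spectra together reproduce the butadiene spectrum, and the two right-hand eigenvalues are complementary to each other, consistent with the right-hand fragment of 1,3-butadiene being self-complementary.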
III. AUFBAU PRINCIPLE
All benzenoid (polyhex) structures of a given $C_nH_s$ ($n = N_c$ and $s = N_H$) formula can be generated by a combination of the following three types of attachments to the perimeter of all of its precursor isomeric benzenoids: (1) attachment of C4H2 units to the ^(2,2) edges of all isomeric benzenoids with the formula $C_{n-4}H_{s-2}$, (2) attachment of C3H units to the vee regions of all isomeric benzenoids with the formula $C_{n-3}H_{s-1}$, and (3) attachment of C2 units to the bay regions of all isomeric benzenoids with the formula $C_{n-2}H_s$. Taking all of the above combinatorial attachments and deleting duplicates gives all of the benzenoids of a given $C_nH_s$ formula. In benzenoid enumeration and structure generation, C2, C3H, and C4H2 are elementary aufbau units, as all other benzenoid aufbau units can be built by some successive union of these elementary units.^11 In this chapter, other aufbau units will be used in construction proofs and in the generation of infinite pairs of series composed of strongly subspectral molecular graphs.
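The formula bookkeeping behind the three attachment types can be sketched in a few lines. This is only an illustration of the arithmetic: it returns the candidate precursor formulas for a target formula and does not check whether benzenoids with those formulas exist or enumerate any structures. The function name and the example target (C24H14) are our own choices.

```python
def precursor_formulas(n_c, n_h):
    """Candidate precursor formulas (carbons, hydrogens) for a C_n H_s target."""
    return {
        "C4H2 attachment": (n_c - 4, n_h - 2),
        "C3H attachment": (n_c - 3, n_h - 1),
        "C2 attachment": (n_c - 2, n_h),
    }

for unit, (c, h) in precursor_formulas(24, 14).items():
    print(f"{unit}: precursor C{c}H{h}")
```

For C24H14 this lists C20H12, C21H13, and C22H14 as the precursor formulas whose isomers would have to be extended and then de-duplicated.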
IV. RESULTS AND DISCUSSION
Figures 1-3 present chemically relevant examples of complementary molecular graphs and their eigenvector relationships. At the head of each column in these figures is the corresponding right-hand mirror-plane fragment. For each eigenvalue belonging to these mirror-plane fragments, the eigenvector coefficients are indicated at each position on the molecular graph. The information displayed in each figure for two adjacent columns corresponds to complementary molecular systems. Recall that the mirror-plane defines an antisymmetric relationship for the eigenvector coefficients of each eigenvalue. If the complementary right-hand mirror-plane fragments are starred identically, the signs of the coefficients of the starred positions remain unchanged in going from the structure in one column to the other for a given complementary set of eigenvalues, whereas the signs of the coefficients of the unstarred positions do change. Let a given right-hand mirror-plane fragment be designated by $M$ and its complementary by $\bar{M}$. If k is the index number of a specified starred position of normal weight in $M$, then k is also the index number for the same starred position of -1 weight in $\bar{M}$; starred normal weighted vertices in $M$ become starred -1 weighted vertices in $\bar{M}$.
Theorem 1. The associated eigenvalues (X) of two complementary right-hand mirror-plane fragments are related by $X(M) + X(\bar{M}) = -1$.
Theorem 2. If the eigenvector $\psi(M)$ of a right-hand mirror-plane fragment is given by
$$\psi(M) = \sum_i a_i^{*}\phi_i^{*} + \sum_j a_j^{\circ}\phi_j^{\circ} \quad \text{for eigenvalue } X(M),$$
where $\phi^{*}$ is the p AO of a starred atomic vertex and $\phi^{\circ}$ that of an unstarred atomic vertex, then the eigenvector of its complementary is given by
$$\psi(\bar{M}) = \sum_i a_i^{*}\phi_i^{*} - \sum_j a_j^{\circ}\phi_j^{\circ} \quad \text{for eigenvalue } X(\bar{M}).$$
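Theorems 1 and 2 can be verified numerically on a small weighted graph. The sketch below uses a five-vertex path with -1 weights on vertices 1, 3, 5 for one fragment and on vertices 2, 4 for its complementary; reading these as the right-hand fragments of naphthalene and 1,2,4,5-tetramethylenebenzene is our interpretation of Figure 1, not something the text spells out. The helper names are hypothetical, and the closed-form eigenvalues in the code are our own working.

```python
import math

def char_poly(weights, x):
    """det(x*I - H) for a path graph whose k-th vertex carries weight weights[k]."""
    prev, cur = 1.0, x - weights[0]
    for w in weights[1:]:
        prev, cur = cur, (x - w) * cur - prev
    return cur

def eigenvector(weights, x):
    """Unnormalized eigenvector for eigenvalue x, first coefficient set to 1."""
    v = [1.0, x - weights[0]]
    for w in weights[1:-1]:
        v.append((x - w) * v[-1] - v[-2])
    return v

# Assumed fragments: five-vertex paths with -1 weights on alternating vertices.
M = [-1, 0, -1, 0, -1]      # read here as the naphthalene fragment
M_BAR = [0, -1, 0, -1, 0]   # normal and -1 weighted vertices interchanged
STARRED = [True, False, True, False, True]

# Eigenvalues of M in closed form: (-1 +/- sqrt(13))/2, (-1 +/- sqrt(5))/2, -1.
SPECTRUM_M = [(-1 + s * math.sqrt(r)) / 2 for r in (13, 5) for s in (1, -1)] + [-1.0]

def check_theorems(tol=1e-9):
    """Check X(M) + X(M_bar) = -1 and the sign rule for unstarred coefficients."""
    for x in SPECTRUM_M:
        x_bar = -1 - x  # Theorem 1
        if abs(char_poly(M, x)) > tol or abs(char_poly(M_BAR, x_bar)) > tol:
            return False
        v, v_bar = eigenvector(M, x), eigenvector(M_BAR, x_bar)
        flipped = [c if s else -c for c, s in zip(v, STARRED)]  # Theorem 2
        if any(abs(a - b) > tol for a, b in zip(v_bar, flipped)):
            return False
    return True

print(check_theorems())
```

The characteristic polynomial is evaluated with the three-term recurrence for tridiagonal matrices, so no eigenvalue solver is needed; the same recurrence generates the eigenvector coefficients once an eigenvalue is known.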
Once half of the eigenvalues/eigenvectors of an AH molecular graph have been calculated, the pairing relationship allows one to obtain the remaining values by inspection. Similarly, from the complementary relationship, if the eigenvalues/eigenvectors of one complementary molecular graph are known, then these quantities for the other can be obtained without calculation. These eigenvalue/eigenvector theorems are illustrated by the complementary pairs depicted in Figures 1-3. Naphthalene and 1,2,4,5-tetramethylenebenzene (Figure 1) have been extensively studied, both experimentally and theoretically."^ The molecular graphs in Figures 2 and 3 are strongly subspectral. The molecular graph of tetravinylethylene and its complementary in Figure 2 are strongly subspectral to the corresponding complementary molecular graph pair in Figure 3. While tetravinylethylene (HOMO = 0.3111) itself has been synthesized,^13 only air-sensitive derivatives of benzodicyclobutadiene (HOMO = 0) have been synthesized,^14 results that are consistent with the relative energies of their frontier orbitals (Figure 2) and conjugated circuit resonance energies.^15
A. Self-Complementary Molecular Graphs
Figure 4 gives an example of a self-complementary molecular graph and its corresponding eigenvalue/eigenvector relationships. If every pair of eigenvalues in a single right-hand mirror-plane fragment sums to minus one ($X_1 + X_2 = -1$), this mirror-plane fragment and its corresponding AH molecular graph are said to be self-complementary. From the example in Figure 4, it should be evident that one needs to determine only one-fourth of the eigenvalues/eigenvectors of self-complementary molecular graphs and can then use the complementarity principle and pairing theorem to determine the remaining eigenvalues/eigenvectors. Thus, self-complementary molecular graphs possess a type of hidden symmetry.^16
Theorem 3. Starting with 1,3-butadiene and the C4H2 set of aufbau units, all (nonbranched) self-complementary molecular graphs are generated per Figure 5.
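The "one-fourth" statement can be illustrated with 1,3-butadiene, the starting member named in Theorem 3. In this minimal sketch a single seed eigenvalue is expanded by the complementarity principle (X to -1 - X) and the pairing theorem (X to -X); the function name is hypothetical, and the butadiene spectrum used for comparison is the standard closed form 2 cos(k pi/5).

```python
import math

def spectrum_from_quarter(seed_eigenvalues):
    """Expand one-fourth of a self-complementary AH spectrum to the full set."""
    values = []
    for x in seed_eigenvalues:
        complement = -1 - x          # complementarity principle
        values += [x, complement]
        values += [-x, -complement]  # pairing theorem
    return sorted(values)

seed = [(math.sqrt(5) - 1) / 2]  # one eigenvalue of butadiene, about 0.618
generated = spectrum_from_quarter(seed)
butadiene = sorted(2 * math.cos(k * math.pi / 5) for k in range(1, 5))

print([round(x, 4) for x in generated])
print([round(x, 4) for x in butadiene])
```

Both lists agree, so one computed eigenvalue suffices for all four levels of butadiene; for larger self-complementary graphs the seed list would hold one-fourth of the eigenvalues.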
[Figure 1: eigenvector coefficient diagrams for naphthalene and 1,2,4,5-tetramethylenebenzene, each column headed by its complementary right-hand mirror-plane fragment.]
Figure 1. Corresponding eigenvectors for complementary eigenvalues belonging to complementary molecular graphs.
[Figure 2: eigenvector coefficient diagrams for tetravinylethylene and benzodicyclobutadiene, each column headed by its complementary right-hand mirror-plane fragment.]
Figure 2. Corresponding eigenvectors for complementary eigenvalues belonging to complementary molecular graphs.
[Figure 3: eigenvector coefficient diagrams for a pair of complementary molecular graphs, each column headed by its complementary right-hand mirror-plane fragment; recovered panel label: tetravinylcyclobutadiene.]
Figure 3. Corresponding eigenvectors for complementary eigenvalues belonging to complementary molecular graphs.
[Figure 4: eigenvector coefficient diagrams headed by self-complementary right-hand mirror-plane fragments.]
Figure 4. Corresponding eigenvectors of complementary eigenvalues belonging to a self-complementary molecular graph.
An interesting example of a molecular graph having a right-hand mirror-plane fragment that is not self-complementary but still possesses a self-complementary eigenvalue set (eigenvalues of both L^ and L^) is given by the following structure.
[Structure drawing not reproduced. Eigenvalues: ±0.4450, ±0.5550, ±0.8019, ±1.8019, ±1.2470, ±2.2470]
In this case, the corresponding branched right-hand mirror-plane fragment has eigenvalues following $X_1 - X_2 = -1$. On inspection of the eigenvalues in Figure 1, it should now be evident that the molecular graphs of naphthalene and 1,2,4,5-tetramethylenebenzene are both almost-isospectral and almost-self-complementary. The latter can be understood because mirror-plane fragmentation of the mirror-plane fragments listed at the top of Figure 1 leads to the same ultimate right-hand
Figure 5. Aufbau generation of all self-complementary molecular graphs having (nonbranched) right-hand mirror-plane fragments.
mirror-plane fragment which is self-complementary and belongs to 1,3-butadiene. Similarly, each corresponding member pair of molecular graphs in Figure 6 is both almost-isospectral and almost-self-complementary. Double fragmentation by perpendicular mirror-planes of all of the molecular graphs in Figure 6 gives self-complementary right-hand mirror-plane fragments.
B. Infinite Series of Molecular Graphs that Are Pairwise Strongly Subspectral
Figures 6-10 display five matching pairs of infinite series that are strongly subspectral. In the infinite limit, each pair of series approaches an identical density of states. A unique aspect is illustrated by each pair of series in Figures 6-10. The series in Figure 6 are pairwise almost-isospectral and almost-self-complementary,
[Figure 6: two series of molecular graphs; recovered labels for the first-generation structures: bisallyl and 1,2,4,5-tetramethylenebenzene.]
Figure 6. Two series of complementary molecular graphs that are also almost-isospectral. The unmatched eigenvalues are indicated at the beginning next to the first-generation structures of each series.
and each pair of corresponding molecular graphs in Figure 7 is complementary to the pair of molecular graphs in Figure 8 belonging to the same generation. The first three pairs of series (Figures 6-8) have all members with D2h symmetry, whereas the last two pairs of series (Figures 9 and 10) do not. Of over a dozen pairs of strongly subspectral infinite series discovered and studied by the author,'^'^'''^^ the infinite series in Figure 9 represents the first one without D2h symmetry (or
Figure 7. Two series of strongly subspectral molecular graphs that approach isospectrality in the infinite limit. The unmatched eigenvalues are indicated next to the first-generation molecular graphs of each series.
Figure 8. Two series of strongly subspectral molecular graphs that approach isospectrality in the infinite limit. The unmatched eigenvalues are indicated next to the first-generation molecular graphs of each series.
pseudosymmetry). The matching members of these series grow by successive addition of a CH unit. The unique feature of the pair of strongly subspectral series in Figure 10 is that the unmatched zero eigenvalues escalate in the lower series with the increase in size. The 1,3,1',3'-phenylene triradical member of the upper series in Figure 10 has been the subject of a recent theoretical study.^19 The formulas of the aufbau units used to successively build up the series in Figures 6-10 are indicated in the upper left-hand corners. Two different attachment modes for the elementary C4H2 aufbau unit were used to generate the matching series in Figure 6.^11 More complicated recursive constructions using C^^
Figure 9. Two series of almost-isospectral molecular graphs that approach isospectrality in the infinite limit. The unmatched zero eigenvalue is indicated next to the first-generation molecular graph of the lower series.
Figure 10. Two series of almost-isospectral molecular graphs. The unmatched zero eigenvalues are indicated next to the corresponding structure.
aufbau units were employed for the series in Figure 7; the upper series is generated by successive attachment of 1,3,5-hexatriene-2,3,4,5-tetrayl and the lower series by splicing in the tetrayl of bisallyl. Successive attachment of the tetrayl of 3,4-dimethylenylcyclobutene generates the upper series and successive splicing in 1,2,4,5-tetraylbenzene generates the lower series in Figure 8. The two pairs of matching series in Figures 6 and 7 successively increase one ring at a time, whereas the two matching series in Figure 8 successively increase two rings at a time. Successive addition of CH aufbau units to benzene and trivinylmethyl in Figure 9 represents the simplest example of aufbau construction of strongly subspectral series. The same aufbau increments were used in the generation of each pair of series in Figures 6-9, but the series in Figure 10 are built up by different aufbau increments, which explains why the number of zero eigenvalues in the lower series escalates.
V. CONCLUSION
Complementary molecular graphs have right-hand mirror-plane fragments with the following characteristics: the normal weighted vertices in one are -1 weighted vertices in the other, the starred vertices in both have identical eigenvector coefficients and the unstarred vertices have eigenvector coefficients of opposite sign, and their eigenvalues X are related by $X(M) + X(\bar{M}) = -1$. Complementary molecular graphs correspond to AH molecules with at least twofold symmetry, and their complementary eigenvalues correspond to antisymmetric eigenvectors. The aufbau constructions in Figures 5-10 contain the essence of inductive proofs. The complete set of self-complementary molecular graphs and their corresponding right-hand mirror-plane fragments are contained in Figure 5. Many unique strongly
subspectral pairs of series are evolved by aufbau constructions. This work has more completely revealed the various relationships between structure, symmetry (both obvious and hidden), size, and similarity.
REFERENCES
1. Johnson, M.; Maggiora, G. M. Similarity in Chemistry; Wiley: New York, 1991; Trinajstić, N. Chemical Graph Theory; CRC Press: Boca Raton, FL, 1992; Randić, M. J. Chem. Inf. Comput. Sci. 1992, 32, 686-692; Sen, K., Ed. Molecular Similarity I and II; Springer-Verlag: Berlin, 1995; Klein, D. J. J. Math. Chem. 1995, 18, 321-348; Mezey, P. G. Shape in Chemistry; VCH: New York, 1993.
2. Hargittai, I.; Hargittai, M. Symmetry through the Eyes of a Chemist, 2nd ed.; Plenum: New York, 1995; Halevi, E. A. Orbital Symmetry and Reaction Mechanism; Springer-Verlag: Berlin, 1992.
3. Heilbronner, E.; Jones, T. B. J. Am. Chem. Soc. 1978, 100, 6506-6507.
4. Dias, J. R. Chem. Phys. Lett. 1996, 253, 305-312.
5. Balasubramanian, K. SAR QSAR Environ. Res. 1994, 2, 59-77; Chem. Rev. 1985, 85, 599-618; Carbo, R., Ed. Molecular Similarity and Reactivity: Quantum Chemical to Phenomenological Approaches; Kluwer: Dordrecht, 1995.
6. Randić, M. J. Math. Chem. 1992, 9, 97-146.
7. Jiang, Y.; Yu, W.; Kirby, E. C. J. Chem. Soc. Faraday Trans. 1991, 87, 3631-3640.
8. Dias, J. R. Molecular Orbital Calculations Using Chemical Graph Theory; Springer: Berlin, 1993.
9. Hall, G. G. Trans. Faraday Soc. 1957, 53, 573-581; Bull. Inst. Math. Appl. 1981, 17, 70-72; J. Math. Chem. 1993, 13, 191-203.
10. McClelland, B. J. J. Chem. Soc. Faraday Trans. 2 1974, 70, 1453-1456; J. Chem. Soc. Faraday Trans. 2 1982, 78, 911-916; Mol. Phys. 1982, 45, 189-190.
11. Dias, J. R. Z. Naturforsch. 1989, 44a, 765-771; J. Math. Chem. 1990, 4, 17-29.
12. Dias, J. R. Molec. Phys. 1996, 88, 407-417.
13. Skattebøl, L.; Charlton, J. L.; deMayo, P. Tetrahedron Lett. 1966, 2257-2260.
14. Toda, F.; Garratt, P. Chem. Rev. 1992, 92, 1685-1707.
15. Randić, M. Tetrahedron 1977, 33, 1905-1920.
16. Liu, J. J. Chem. Soc. Faraday Trans. 1997, 93, 5-9.
17. Dias, J. R. J. Mol. Struct. (Theochem) 1997, 417, 49-67.
18. Dias, J. R. J. Phys. Chem. A 1997, 101, 7167-7175.
19. Zhang, J.; Baumgarten, M. Chem. Phys. Lett. 1997, 269, 187-192.
CORRELATIONS AND APPLICATIONS OF THE CIRCUMSCRIBING/EXCISED INTERNAL STRUCTURE CONCEPT
Jerry Ray Dias
Abstract
I. Introduction
II. History
III. Constant-Isomer Benzenoid Series
IV. Constant-Isomer Series of Fluoranthenoids/Fluorenoids and Indacenoids
V. Other Applications
References
ABSTRACT The circumscribing/excised internal structure concept has been used to generate constant-isomer series of strictly pericondensed benzenoids, fluoranthenoids, indacenoids, and related conjugated polycyclic hydrocarbons and identify their topological properties.
Advances in Molecular Similarity, Volume 2, pages 259-264. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
I. INTRODUCTION
The search for and discovery of new elementary substructures is an essential strategy in the quest to understand chemical phenomena. Atoms, bonds, and functional groups are examples of the most fundamental elementary substructures that are used to describe molecules and to decipher their properties. To determine chemical properties, one must first analyze the properties of isolated molecules and then determine how these result in the observable bulk chemical properties. For example, while a paramagnetic molecule gives rise to paramagnetic material, there is no such thing as a ferromagnetic molecule, for ferromagnetic materials require that all of the magnetic moments associated with paramagnetic molecules in the bulk phase be permanently aligned in the same direction. Similarly, a molecule has no melting point transition. Melting point is a bulk property associated with a conglomerate of interacting molecules. Thus, it is necessary for one to first understand molecular properties and then deduce chemical properties from the cooperative effect of many molecules. Consideration of cooperative effects is not necessary for spectroscopy of molecules in the gas phase, but this is not true for most other types of physical property determinations.
The excised internal structure (EIS) is an elementary substructure of recent origin. It was originally defined as the conjugated hydrocarbon formed when the internal carbons of a strictly pericondensed benzenoid are excised by stripping away the perimeter carbon ring;^ in other words, the EIS was defined as the subgraph spanned by the internal vertices of a strictly pericondensed benzenoid system."^ The reverse process is called circumscribing. The EIS may be more generally defined as the connected subgraph spanned by the internal vertices of a strictly pericondensed polycyclic conjugated system. Strictly pericondensed systems have no catacondensed appendages. ^'^
II. HISTORY
The excised internal structure was forecast by Platt's perimeter rule^ and the subsequent spectroscopic distinction of the insular versus perimeter orbitals in pericondensed benzenoids.'^''^ Clar"^'^ named two highly condensed benzenoids using a circo/circum terminology: circobiphenyl (C38H16, K = 136) and circumanthracene (C40H16, K = 105). Because these are the only examples for which this terminology was used, it remained relatively unknown. Subsequently, the concept of the one-isomer coronene series put forth by Dias^ involved successive circumscribing of benzene to coronene (C6H6 to C24H12), coronene to circumcoronene (C24H12 to C54H18), circumcoronene to dicircumcoronene (C54H18 to C96H24), and so on. Shortly thereafter, the excised internal structure/circumscribing concept was more fully developed. ^'^ Hall showed that the dualist (inner dual) graph of the dualist graph of a strictly pericondensed benzenoid is the excised internal structure.^
Circumscribing/Excised Internal Structure Concept
III. CONSTANT-ISOMER BENZENOID SERIES
If an EIS consisting of only hexagonal rings and/or polyene branches with no less than two-carbon gaps is wrapped (circumscribed) by a perimeter of hexagonal rings, a benzenoid is generated. Every strictly pericondensed benzenoid isomer has a unique EIS. Constant-isomer series are infinite series of benzenoid hydrocarbons that successively increase in formula per $N_c' = N_c + 2N_H + 6$ and $N_H' = N_H + 6$ and have the same number of isomers at each stage of increase. They are generated by successive circumscribing with a perimeter of $2N_H + 6$ carbon atoms and incrementing with six hydrogens. Starting with the only three possible C4H6 polyene isomers (trimethylenemethane diradical, s-trans-1,3-butadiene, and s-cis-1,3-butadiene), the only three C22H12 benzenoid isomers are generated (Figure 1). Circumscribing these first-generation C22H12 benzenoid isomers gives the only three possible C52H18 benzenoids. Continuing to circumscribe in a successive fashion gives the 3-isomer benzenoid series. Symmetry, the number of bay regions and selective lineations, and the radicaloid cardinality of the benzenoid members of constant-isomer series are conserved on successive circumscribing. We have shown that as one moves downward on the left-hand staircase edge of Table PAH6, a constant-isomer number pattern of . . . abb . . . is observed (ref 8). For those
[Figure 1 structure labels: s-trans-1,3-butadiene; s-cis-1,3-butadiene; isotopological mates; circum(30)triangulene; circum(30)anthanthrene; circum(30)benzo[ghi]perylene]
Figure 1. Illustration of the excised internal structure concept in enumeration of all of the benzenoid isomers of C22H12, C52H18, C94H24, and so on.
JERRY RAY DIAS
constant-isomer series with the same cardinality of b, there exists a one-to-one topological matching of their benzenoid membership (ref 8). Constant-isomer benzenoid formulas only occur on the left-hand staircase edge of Table PAH6.
IV. CONSTANT-ISOMER SERIES OF FLUORANTHENOIDS/FLUORENOIDS AND INDACENOIDS
Polyhexes are molecular graphs that correspond to benzenoid hydrocarbons. The concepts of EIS, circumscribing, and constant-isomer series can be generalized to polypent/polyhex composite systems where $r_5 \le 6$. Fluoranthenoids/fluorenoids, which contain one pentagonal ring, and indacenoids, which contain two pentagonal rings among otherwise hexagonal rings, have been shown to possess constant-isomer series (ref 9). A polypent/polyhex system consists of interlocking pentagons and hexagons where the degree-2 vertices correspond to methine >C-H units and degree-3 vertices correspond to =C< carbon units. The numbers of degree-2 vertices, carbon vertices, edges, and rings are given by $N_H$, $N_c = n$, $q$, and $r$, respectively. Denote the circumscription of a polypent/polyhex (polyene) system by P → circum-P = P'. It has been previously shown that $N_c = N_H + N_{pc} + N_{Ic}$, where $N_{pc}$, $N_{Ic}$, and $q_p$ are the numbers of perimeter and internal degree-3 vertices and perimeter edges, respectively. For P → P', $N_c \to N_{Ic}'$ and $N_H \to N_{pc}'$. Thus, for circum-P, $N_{pc}' = N_H' - 6 + r_5 = N_H$, giving $N_H' = N_H + 6 - r_5$, and, similarly, $q_p' = N_{pc}' + N_H'$ and $N_c' = N_{Ic}' + N_{pc}' + N_H'$, giving $N_c' = N_c + 2N_H + 6 - r_5$. These recursive equations are useful for monitoring the progress of successive circumscription. It is presumed that polypent/polyhex constant-isomer series can successively increase without limit, and since $N_H' = N_H + 6 - r_5$ should not decrease, this places the constraint that $r_5 \le 6$.
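The recursion for successive circumscription can be checked numerically. The following is a minimal sketch, assuming only the two relations $N_c' = N_c + 2N_H + 6 - r_5$ and $N_H' = N_H + 6 - r_5$ stated above; the function names `circumscribe` and `series` are illustrative and do not come from the original work.

```python
def circumscribe(n_c, n_h, r5=0):
    """Apply one circumscription step P -> circum-P to the formula C(n_c)H(n_h).

    r5 is the number of pentagonal rings (0 for a pure polyhex/benzenoid).
    """
    if not 0 <= r5 <= 6:
        raise ValueError("r5 must lie between 0 and 6 for the series to continue")
    new_n_c = n_c + 2 * n_h + 6 - r5   # N_c' = N_c + 2 N_H + 6 - r5
    new_n_h = n_h + 6 - r5             # N_H' = N_H + 6 - r5
    return new_n_c, new_n_h


def series(n_c, n_h, r5=0, steps=3):
    """Return the formulas generated by `steps` successive circumscriptions."""
    formulas = [(n_c, n_h)]
    for _ in range(steps):
        n_c, n_h = circumscribe(n_c, n_h, r5)
        formulas.append((n_c, n_h))
    return formulas


if __name__ == "__main__":
    # C4H6 polyene EIS -> the 3-isomer benzenoid series of Figure 1
    print(["C%dH%d" % f for f in series(4, 6)])
    # benzene -> the one-isomer coronene series
    print(["C%dH%d" % f for f in series(6, 6)])
```

Starting from C4H6 the sketch reproduces the formulas named in the Figure 1 caption, and starting from benzene it reproduces the coronene series; with $r_5 = 6$ the hydrogen count stays constant from step to step, which is the limiting case of the constraint.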
V. OTHER APPLICATIONS
The coronene one-isomer series has been shown by Pisanski et al. to be a rotagraph Wg(r2;X) (ref 10), and Klavzar and Gutman have shown this series to have the lowest Wiener indices (ref 11). Scott and Necula have used our EIS concept to explain the relative 1H NMR shielding of C20H10 indacenoids (ref 12). Per their interpretation, indacenoids 9, 10, 18, and 19 (Figure 2) should have 1H NMR chemical shifts that are relatively more deshielded than the other C20H10 indacenoids in Figure 2 because their trimethylenemethane EISs prevent the antiaromatic perimeter ring current from participating. Thus, 9, which has a trimethylenemethane diradical EIS, was observed to have resonances of all of its hydrogens shifted downfield by 0.4-0.7 ppm relative to the corresponding hydrogens in 1 and 5, which have a closed-shell 1,3-butadiene EIS (ref 12). Those benzenoid structures having an EIS with K = 1 will have a monoquinone isomer with K = 1 (ref 13). A two-dimensional map of a family of benzenoids has been shown to have a one-to-one almost-isospectral matching to another two-dimensional map of related EIS (ref 14).
Figure 2. All 19 possible C20H10 indacenoid isomers.
REFERENCES
1. Dias, J. R. J. Chem. Inf. Comput. Sci. 1984, 24, 124-135.
2. Dias, J. R. Can. J. Chem. 1984, 62, 2914-2922.
3. Platt, J. R. J. Chem. Phys. 1954, 22, 144.
4. Clar, E.; Roberson, R.; Schlogl, R.; Schmidt, W. J. Am. Chem. Soc. 1981, 103, 1320.
5. Clar, E. Polycyclic Hydrocarbons; Wiley: New York, 1964, Vols. 1 and 2.
6. Dias, J. R. J. Chem. Inf. Comput. Sci. 1982, 22, 15-22.
7. Hall, G. G. Theor. Chim. Acta 1988, 73, 425-435.
8. Dias, J. R. J. Chem. Inf. Comput. Sci. 1990, 30, 251-256.
9. Dias, J. R. J. Chem. Inf. Comput. Sci. 1993, 33, 117-130.
10. Pisanski, T.; Zitnik, A.; Graovac, A.; Baumgartner, A. J. Chem. Inf. Comput. Sci. 1994, 34, 1090-1093.
11. Klavzar, S.; Gutman, I. J. Chem. Inf. Comput. Sci. 1996, 36, 1001-1003.
12. Scott, L. T.; Necula, A. J. Org. Chem. 1996, 61, 386-388.
13. Dias, J. R. J. Chem. Inf. Comput. Sci. 1990, 30, 53-61.
14. Dias, J. R. J. Phys. Chem. A 1997, 101, 7167-7175.
LEAST-SQUARES AND NEURAL-NETWORK FORECASTING FROM CRITICAL DATA: DIATOMIC MOLECULAR $r_e$ AND TRIATOMIC $\Delta H_a$ AND IP
Jason Wohlers, W. Blake Laing, Ray Hefferlin, and W. Bradford Davis
Abstract
I. Introduction
II. Theory
III. Data
IV. Results for Diatomic-Molecular $r_e$
   A. Least-Squares Results
   B. Neural Network Results
   C. Graphical Representations of Neural Network Results
V. Results for Triatomic-Molecular IP and $\Delta H_a$
   A. Least-Squares Results
   B. Neural Network Results
   C. Graphical Results
VI. Discussion
References
Advances in Molecular Similarity, Volume 2, pages 265-287. Copyright © 1998 by JAI Press Inc. All rights of reproduction in any form reserved. ISBN: 0-7623-0258-5
ABSTRACT
Multiple regression was used to predict 299 diatomic internuclear separations using atomic period and group numbers as a basis. Van der Waals molecules were excluded. The standard deviation $\sigma$ of the differences of predictions from 150 tabulated data is 4.128%. Neural networks, one with van der Waals molecules in the learning set and one without, each predicted the property for 2145 real and nonredundant molecules; the $\sigma$ of the differences of the predictions from 342 and 316 tabulated data are 25.00 and 8.63%. For comparable cases, the least-squares technique was more accurate. Multiple regression has been used to predict 205 triatomic ionization potentials using a 3D basis consisting of combinations of atomic period and group numbers. The $\sigma$ of differences from 80 tabulated data is 14.65%. Neural networks using the same 3D basis and using a 6D period and group number basis have predicted the IP for 2596 and 5148 molecules; the $\sigma$ of the differences from 69 and 92 tabulated data are 12.35 and 10.97%. The neural network method was more accurate than the least-squares technique. Neural networks using the same two bases have predicted the bonding energies for 16324 and 5418 molecules; the $\sigma$ of the differences from 79 and 117 tabulated data are 22.67 and 15.13%.
I. INTRODUCTION Systematization of existing data for small molecules holds out hope for enhanced learning by chemistry students, for more efficient preparation of computer databases, for a better understanding of how molecular periodicity and molecular periodic systems are related to atomic periodicity and the chart of the elements, and even for relating those understandings to the observed periodicities and corresponding periodic tables of nuclei and nucleons and of hadrons and quarks.^"^ Systematization also makes possible the rapid forecasting of approximate data for large numbers of molecules, data that should be useful until experiments or ab initio computations produce much more precise values. This systematization of small-molecule data has been carried out with graphical,^'^ statistical,^ and least-squares (LS) techniques,^'^ and also has just begun with the construction of neural networks (NN).^^"^^ LS methods have the advantage that fitting and predicting errors can be well known, while NN have the ability to "learn" and predict data without being supplied a smoothing equation. The present report will not repeat information about LS^'^ or NN^^^^ techniques, but it will evaluate and compare some of the tens of thousands of predictions of molecular data obtained by these two methods.
Forecasting re, AHa, and IP
The properties considered here are the diatomic-molecular equilibrium internuclear separation in angstroms (here called $r_e$), and the triatomic-molecular heat of atomization in kilojoules per mole ($\Delta H_a$) and ionization potential in electronvolts (IP).
II. THEORY
The periodicity of atomic data is associated with the period number $R$ (the principal quantum number) and the group number $C$ (associated with the angular momentum and the magnetic quantum numbers). Detailed studies have shown that the periodicity of diatomic-molecular data can be very successfully associated with the independent variable basis set $\{R_1, C_1, R_2, C_2\}$, where the subscripts refer to the two atoms in the molecules. This basis is the foundation of the matrix-product theory of molecular periodicity (Refs. 14, 15), has been used extensively in the least-squares fitting and prediction of data,^ and was used in the study of main-group, neutral, ground-state, diatomic-molecular data being reported here. Diatomic-molecular data were symmetrized so that each appears with independent variables $\{R_i, C_i, R_j, C_j\}$ and $\{R_j, C_j, R_i, C_i\}$ except for the homonuclear cases $R_i = R_j$ and $C_i = C_j$ (mirror-image molecules are identical).
The matrix-product theory of molecular periodicity specifies the six-dimensional basis $\{R_1, C_1, R_2, C_2, R_3, C_3\}$, where the central atom is number 2, for the analysis of triatomic-molecular data (Refs. 14, 15). Triatomic data were symmetrized such that each appears with independent variables $\{R_i, C_i, R_j, C_j, R_k, C_k\}$ and $\{R_k, C_k, R_j, C_j, R_i, C_i\}$, except for the cases where $R_i = R_k$ and $C_i = C_k$.
Graphical study^ of triatomic-molecular data has shown that the three-dimensional basis $\{(R_1 R_2 + R_2 R_3), (C_1 + C_2 + C_3), C_2\} = \{f(r), n_e, C_2\}$ is also useful because: (1) $f(r)$ is the reduced variable in $R_i$ along which isovalent molecular (fixed $C_i$) data are most monotonic, (2) $n_e$ enumerates series of isoelectronic molecules, and (3) $C_2$ allows for the very weak tendency of $C_2 = 4$ molecules to be more stable.^ Of course, there may be more than one triatomic molecule for a given $\{f(r), n_e, C_2\}$, and their data are learned and predicted as a single value. Triatomic data in the 3D basis require no symmetrization.
LS predictions have already been made in this basis.^ This paper includes results using both the 6D and 3D bases for the study of acyclic, main-group, neutral, ground-state triatomic molecules.
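The two triatomic bases can be illustrated with a short sketch. It assumes that the group number runs from 1 (Li, Na) to 8 (Ne, Ar), which is consistent with the coordinates quoted for OCO, (6,4,6), and FOF, (7,6,7), in the caption of Figure 1; the atom table and the function names `basis_6d`, `symmetrized_6d`, and `basis_3d` are illustrative and are not taken from the authors' programs.

```python
# (period R, group C) for the row-2 and row-3 main-group atoms.
# Assumption: group numbers run 1..8 across each row.
ATOMS = {
    "Li": (2, 1), "Be": (2, 2), "B": (2, 3), "C": (2, 4),
    "N": (2, 5), "O": (2, 6), "F": (2, 7), "Ne": (2, 8),
    "Na": (3, 1), "Mg": (3, 2), "Al": (3, 3), "Si": (3, 4),
    "P": (3, 5), "S": (3, 6), "Cl": (3, 7), "Ar": (3, 8),
}


def basis_6d(left, central, right):
    """Return {R1, C1, R2, C2, R3, C3}; atom 2 is the central atom."""
    r1, c1 = ATOMS[left]
    r2, c2 = ATOMS[central]
    r3, c3 = ATOMS[right]
    return (r1, c1, r2, c2, r3, c3)


def symmetrized_6d(left, central, right):
    """Return the address and its mirror image, or one address if they coincide."""
    forward = basis_6d(left, central, right)
    mirror = basis_6d(right, central, left)
    return [forward] if forward == mirror else [forward, mirror]


def basis_3d(left, central, right):
    """Return {f(r), n_e, C2} with f(r) = R1*R2 + R2*R3 and n_e = C1 + C2 + C3."""
    r1, c1, r2, c2, r3, c3 = basis_6d(left, central, right)
    return (r1 * r2 + r2 * r3, c1 + c2 + c3, c2)


if __name__ == "__main__":
    for molecule in [("O", "C", "O"), ("N", "C", "O"), ("C", "C", "F")]:
        print(molecule, basis_3d(*molecule), len(symmetrized_6d(*molecule)))
```

In this sketch NCO and CCF receive the same 3D address, which illustrates why data for several molecules can be learned and predicted as a single value in the 3D basis, while the 6D basis keeps them apart at the cost of needing the mirror-image entry.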
III. DATA
The diatomic-molecular $r_e$ in angstroms are from Ref. 16. The triatomic-molecular $\Delta H_a$ and IP are from Refs. 17 and 18. Instead of a 4D LS fit to the diatomic data, series of 2D fits were made. For fixed values of $(R_1, R_2)$, data were fitted to smoothing equations in $C_1$ and $C_2$ to obtain
Table 1. Internuclear Separations in Angstroms: Tabulated Data, Least Squares and Neural Network Predictions, and Differences from Tabulated Values
Least Squares Tabulated
AlAl AIB AIC AlCl AIF
N
(r,
03
AIN AIO AIP AIS AlSi
2.466
2.1 30 1.654 1.786 1.61 8 2.029
BB
1.590
BC B CI
1.715
BF BN BO BP
1.262 1.281 1.204
Area
Values
33 23 23 33 CG 23
2.349 1.897 1.754 2.075 1.983 1.651 1.634 1.665 1.631 2.1 37 2.088 2.224 1.529 1.368 1.657 1.716 1.267 1.366 1.270 1.236 1.682
cc 23 23 33 33 33 22 22 23 CG 22 CG 22 22 23
Average
Neural Network # I 0
% diff
Values
Neural Network #2
% diff.
Values
% diff.
4.74
2.085 1.613 1.613
-1 5.47
1.851 1.591 1.562
-24.94
2.029
0.023
4.74
2.084
2.71
2.1 52
1.05
1.643
0.005
4.67 -6.77 0.80
1.671 1.543 1.556 1.996 1.878 2.024 1.330 1.305
1.69 -1 3.61 -3.82
1.542 1.535 1.529 1.934 2.01 1 1.885 1.31 2 1.297
6.78 -1 4.04 -5.52
-1.98
2.91 -3.84
1.687
0.01 8
-1.63
1.654
1.316
0.038
4.28 -0.86 2.66
1.367 1.297 1.295 1.524
-7.45 -1 6.36
3.86 1.27 7.59
-0.89 -1 7.48
1.562
-8.91
1.288 1.285 1.283 1.535
2.02 0.34 6.57
BS B Si
1.609
BeAl
h,
rn
a
BeB BeC BeCl BeF BeN Be0 BeP Be5 BeSi
cc c CI
1.797 1.361 1.331 1.741 1.242
CF CN
1.272 1.1 72 1.128
CP
cs
1S 6 2 1.535
ClCl
1.987
CIF
1.628
co
23 23 23 22 22 23 22 22 22 23 23 23 22 23 22 22 22 DF 23 23 DF 33
CG 23 CC
1.646 1.766 2.096 1.755 1.564 1.797 1.373 1.436 1.373 1.851 1.800 1.949 1.236 1.572 1.224 1.168 1.164 1.215 1.568 1.546 1.51 5 1.972 1.910 1.645 1.623
2.30
0.00 0.88 3.1 6 3.39 -0.48 -3.77 -0.34 3.19 5.41 0.38
1.512 1.613 1.842 1.435 1.387 2.085 1.598 1.380 1.394 1.654 1.716 1.716 1.294 1.606 1.338 1.283
6.01
4.59
5.1 6 9.50
1.535 1.535 1.766 1.492 1.469 1.897 1.428 1.443 1.428 1.786 1.840 1.766 1.290 1.542 1.283 1.281
1.279 1.501
7.54 -3.92
1.277 1.529
13.19 -2.14
16.00 17.44 4.76 -1.43 4.1 5
5.58 4.96 7.32 5.67 3.84 0.87 9.30
1.189
0.021
1.531
0.010
4.26
1.484
-3.06
1.529
4.42
1.941
0.01 6
-2.32
2.100
8.1 9
2.079
4.66
1.634
0.007
0.37
1.614
-1.25
1.475
-9.41 (continued)
Table 1. Continued Neural Network # I
Least Squares
FF
t 4
u
Tabulated
Area
Values
1.412
22 GG 23 22 22 22 23 AG 22 AG 22 AA 23 22 22 23 23 23 33 23 23 33
1.481 1.337 2.348 2.044 2.330 1.823 1.991 1.963 1.544 1.588 2.679 2.528 2.557 1.666 1.573 2.074 2.009 2.1 88 2.510 2.077 1.919 2.194
LiAl Lit3 LiBe LiC LiCl
2.021
LiF
1.564
LiLi
2.673
0
LiMg LiN LiO Lip LiS LiSi MgAl MgB MgC MgCl
2.199
% dift
% diff.
(3
1.409
0.051
-0.21
1.357 1.878 1.459 1.842 1.402
-3.66
1.269 1.807 1.535 1.862 1.492
-10.13
1.977
0.007
-2.1 8
2.1 65
9.50
1.959
-3.08
1.566
0.01 4
0.1 3
1.645
5.06
1.453
-7.08
2.603
0.029
-2.62
1.928 2.597 1.391 1.410 1.680 1.755 1.745 2.270 1.735 1.654 2.345
-25.92
1.897 2.108 1.464 1.453 1.829 1.885 1.807 2.123 1.874 1.874 2.447
-29.02
6.63
Values
% diff
Average
4.23
Values
Neural Network #2
11.30
MgF MgN MgO MgP Mg5 MgSi N CI
N
v,
1.750 1.749 2.142
23
1.772
23
1.815
23
1.766
33
2.277
33
2.21 7
33
2.375
23
1.542
1.317
EC 22 EC
1.593
NF NN
1.098
22
NO NS NaAl
1.1 51 1.494
7.31
0.97
1.671
4.48
0.01 5
1.277
1.256
0.01 2
EE
1.1 30 1.187
1.159
0.025
22
1.156
23
1.501
0.47
33
2.708
23 23
2.054
9.1 0
1.885
7.80
2.1 37
2.01 0 3.50
1.909 1.862
1.621
1.559
4.11
2.245
2.21 6
2.1 23
1.550
1.529
4.81
-4.63
1.314
4.58
1.279
-2.89
5.56
1.270
16.10
1.271
9.58 10.46
1.275
0.43
1.275
10.76
1.449
-3.02
1.516
1.47
2.198
2.304
2.487 1.776
2.531
2.364
2.245
1.679
1.921
23
2.131
2.361
33
2.350
1.926
AC 23 AC
2.292
NaF NaLi
2.81 0
23
2.81 3
AA
2.894
33
2.891
3.449
23
2.01 3
NaN
1.878
1.246
NaB NaBe NaC NaCl
NaMg
1.26
2.321
0.01 3
1.929
0.006
2.854
0.01 4
1.921
2.445
5.34
2.538
7.51
0.1 6
1.955
1.33
1.972
2.36
1.57
2.424
-1 5.06
2.310
-17.80
-1.69
1.940 1.91 7
1.645
1.921 (continued
Table 1. Continued Neural Network # I
Least Squares Tabulated
Area
Values
Average
3.079
33 AA
3.1 86
NaO
23
3.111 3.260 1.949
NaP
33 33 33
2.455 2.384 2.563
23 22 22 FF
1.566 1.331 1.212 1.176
33
1.949 1.883 1.552
NaNa
NaS NaSi
0 CI OF 00 h,
1.570 1.207
PCI
-4
EC
h,
% diff
(5
0.023
-0.25
1.194
0.01 5
1.927
0.01 6
1.567 1.508
1.557
0.005
PF
1.589
PN
1.491
23 EC 23
1.474 1.503 1.969
0.01 1
1.476 1.893
EE 23 33
1.491
PO PP
EE 33 33
1.760 1.940 1.942
1.865
0.056
23
1.575
PS s CI SF
1.601
3.48
-1.08
Values
3.82 1 1.698 2.069 2.1 00 2.306 1.570 1.31 1 1.273
% diff
19.93
-0.02
6.60
2.01 0 -2.01 0.00 1.83 -1.48
Values
3.1 67 1.921 2.198 2.294 2.1 83 1.504 1.275 1.269
% diff 2.85
4.22
5.14
2.1 52
1.584
1.72
1.522
-4.20
1.495 1.506
0.28 2.06
1.504 1.504
0.85 1.88
1.915
2.69
2.024 2.079 2.1 23
6.93
-0.1 6
1.504
-6.08
1.842 1.941 -1.62
Neural Network #Z
1.598
so ss
N
2
Sic SiCl SiF SIN SiO Si P SiS
SiSi Average
1.481 1.889
23
1.510
FF
1.484
33
1.922
1.497
0.009
1.857
0.035
1.08
1.495
-0.1 3
1.486
0.34
1.878
1.12
2.094
10.83
FF
1.792
23
1.637
2.058
33
2.1 52
4.58
-1.44
2.069 1.629
0.55
23
1.993 1.578
-3.1 6
1.601
1.76
1.529
4.52
1.571
23
1.563
-0.51
1.524
-2.97
1.522
-3.10
1.51 0
23
1.543 0.33
1.543
1.85
1.510
-0.01
1.929 2.246
Standard deviation
DF
1.486
33
2.035
33
1.995
DF
1.787
33
2.111
-1.69
1.542
1.598
1.515
0.01 9
1.982 1.891
0.055
1.959
-1.97
1.890
-0.05
2.038
5.64
-6.01
2.010
-1 0.51
1.922
-1 4.45
0.020
-0.220
0.741
-0.442
0.014
2.609
7.951
8.736
[Figure 1 plot labels: D at 298 K (eV), from Sauval; X = right atom, Y = left atom, Z = central atom; symbol key in bins of width 2.5 eV from 2.50 to 17.49 eV]
Figure 1. An example of the triatomic molecular data. Shown are unpublished heats of atomization in eV from Ref. 19, in $C_1$ (right front), $C_2$ (vertical), and $C_3$ (right rear) coordinates; $C_2$ pertains to the central atom; $0 < C_i < 8$; $R_1 = R_2 = R_3 = 2$. The data are separated into bins indicated by symbols. The bins are of equal "width" and lie between the largest $\Delta H_a$, i.e., + for OCO at coordinates (6,4,6), and the smallest, i.e., A for FOF at (7,6,7). Isoelectronic molecules lie on tilted parallel planes whose intersections with planes of constant $C_2$ are indicated by small numbers ($n_e$).
Table 2. Ionization Potentials in eV: Tabulated Data, Least Squares and Neural Network Predictions, 95% Confidence-Limit Errors, and Differences from Tabulated Values

Molecule    Tabulated         Least Squares (3D)          Neural Network (6D)         Neural Network (3D)
            Value    Error    Value   Error   % diff.     Value   Error   % diff.     Value   Error   % diff.
FBO         13.400   0.251    12.15   0.22     -9.34      12.33   0.90     -7.95      12.84   ERR      -4.21
NNO         12.890   0.528    12.15   0.22     -5.75      12.33   0.90     -4.31      12.83   ERR      -0.44
FBF          8.400   0.155     8.14   0.25     -3.07      12.14   0.89     44.50      12.75   1.24     51.73
OOF         12.600   0.155    12.60   0.28      0.01      12.56   0.92     -0.35      11.91   1.16     -5.48
FOF         13.700   0.155    12.61   0.34     -7.98      13.90   1.02      1.46      12.46   1.21     -9.07
OCO         13.790   0.155    12.15   0.22    -11.90      12.33   0.90    -10.56      12.42   1.21     -9.93
NCF         13.320   0.528    12.15   0.22     -8.79      12.33   0.90     -7.40      12.50   1.21     -6.16
OSO         12.340   0.155    11.96   0.21     -3.07      11.51   0.84     -6.76      11.50   1.12     -6.78
FSiF        11.000   0.251    11.96   0.21      8.74      11.51   0.84      4.60      12.23   1.19     11.15
OSiO        11.700   0.251    11.59   0.17     -0.96      11.43   0.84     -2.28      12.06   1.17      3.11
Average                                        -4.21                        1.10                        2.39
Standard deviation                              5.67                       15.14                       17.51
Table 3. Heats of Atomization in kJ/mol: Tabulated Data, Least Squares and Neural Network Predictions, 95% Confidence-Limit Errors, and Differences from Tabulated Values

Molecule    Tabulated           Least Squares              Neural Network (6D)        Neural Network (3D)
            Value      Error    Value   Error   % diff.    Value   Error   % diff.    Value   Error   % diff.
C3          1302.550   10.1     1187     12      -8.9      1274    157      -2.2      1232    251      -5.4
BOB         1100.000    3.2     1101      1       0.1       900    111     -18.1      1125     23       2.3
FBO         1477.058    1.6     1213    264     -17.9      1349    166      -8.7      1258    257     -14.8
NNO         1103.390    0.3     1144     41       3.7      1127    139       2.2      1246    254      12.9
FBF         1216.550    1.6     1147     69      -5.7      1276    157       4.9      1167    238      -4.0
O3           595.892    1.6      937    342      57.3       862    106      44.6       613    125       2.8
CNN         1252.821    3.2     1194     58      -4.7      1095    135     -12.6      1323     27       5.6
NCO         1251.786    1.6     1221     31      -2.5      1298    160       3.7      1346    275       7.5
CCO         1382.153    1.6     1233    149     -10.8      1441    177       4.2      1339    273      -3.1
FOF          374.578    1.6      654    279      74.5       491     60      31.0       417     85      11.2
OCO         1597.893    0.3       --     --        --      1258    155     -21.3      1282    262     -19.8
OBO         1378.566    1.6     1251    128      -9.3      1302    160      -5.6      1317    269       4.5
FCO         1215.243    3.2     1117     98      -8.1      1240    153       2.1      1161    237      -4.5
BCC         1200.000    1.6     1132     68      -5.7      1465    180      22.1      1141    233      -4.9
ONO          927.384    0.3     1079    152      16.3      1095    135      18.1      1055    215      13.8
N3           973.021    3.2     1182    209      21.5      1102    136      13.2      1328    271      36.5
ONF          857.508    1.6      984    127      14.8       953    117      11.2       846    173      -1.3
FCF         1046.213    3.2     1023     23      -2.2       973    120      -7.0      1022     28      -2.3
FNF          588.368    3.2      859    270      46.0       632     78       7.5       660    135      12.3
NCF         1225.278    1.6     1183     43      -3.5      1229    151       0.3      1282    262       4.7
FSiF        1192.217    1.6      905    287     -24.1      1153    142      -3.3      1048    214     -12.1
OSiO        1259.950    1.6       --     --        --      1174    145      -6.8      1165    238      -7.6
Average                                           6.5                        3.6                        1.2
Standard deviation                               24.9                       15.2                       11.7
the coefficients of their equations and the subsequent predicted data. Then, for fixed values of $(C_1, C_2)$, data were fitted to separate smoothing equations in $R_1$ and $R_2$. These variables are limited by $1 < R_i < 8$ and $0 < C_i < 8$; however, alkaline-earth pairs at $(C_1, C_2) = (2,2)$ were excluded. In cases where a datum was fitted both times, the average of the two fitted values was used. Columns 1 and 2 of Table 1 list some of the tabulated data for molecules with the additional restriction (to limit the chapter size) that the molecules are formed from row-2 (Li-Ne) and row-3 atoms. The 95% confidence-limit errors in $r_e$ are so small as to be negligible.
For $\Delta H_a$ and IP, the triatomic-molecular data tend to lie in well-defined regions. For $f(r) = 8$ ($R_1 = R_2 = R_3 = 2$), tabulated data lie mostly within $2 < C_2 < 8$ and $n_e + C_2 > 17$ (Figure 1). For $f(r) > 8$, the domains are at most within the tighter limits $3 < C_2 < 7$ and $12 < C_1 + C_3 < 14$. Some of the data and their errors are given in columns 1 to 3 of Tables 2 and 3.
IV. RESULTS FOR DIATOMIC-MOLECULAR $r_e$
A. Least-Squares Results^
Column 3 of Table 1 shows whether the predictions for $r_e$ were obtained with fixed $R_1$ and $R_2$ (numbers) or fixed $C_1$ and $C_2$ (letters: A = 1, B = 2, ...). Columns 4 to 7 show the LS predictions, the averages and standard deviations in cases of double predictions, and the percent differences between these LS and the tabulated values. The average of these percent differences is -0.22, clearly not different from zero given the standard deviation of 2.609. Size limitations dictated that $R_i < 4$ in Table 1, which is the reason that in the summary of the LS smoothing for all periods (Table 4) the average of the percent differences is different, i.e., -0.030 with $\sigma$ = 4.128.
B. Neural Network Results
The remaining columns of Table 1 show the predictions, and their percent differences from the tabulated data, for two neural networks. The averages of the percent differences of the NNs in Table 1 are 0.741 and -0.442 (neither statistically different from zero to one standard deviation), whereas in Table 4 they are 26.39 (statistically different from zero) and -0.53 (with $\sigma$ = 8.63). The numbers for NN #1 differ so much primarily because no van der Waals molecules are listed in Table 1 and secondarily because of the $R_i < 4$ limitation. NN #2 learned tabulated data with van der Waals molecules culled out, so its fitting of the tabulated data is more comparable to the LS smoothing, for which the average percent difference is -0.030 with $\sigma$ = 4.128. These and the following statistics in Section IV are summarized in Table 4. NN #2 has no prediction as low as the minimum tabulated datum but has a maximum prediction higher than the maximum tabulated datum; the latter shows
Table 4. Characteristics of Neural-Network Learning, Least-Squares Smoothing, and Predictions: $\{R_1, R_2, C_1, C_2\}$ 4D and $\{R_1, R_2, R_3, C_1, C_2, C_3\}$ 6D Bases
Diatomic Molecules, Internuclear Separation: Neural Network #1
Neural Network #2
Trlatomlc Molecules Ionization Potential
Heat of Atomlzatlon
Neural network information Learning file Number of points
342
316
92
117
Points with no partner
0
0
0
0
Duplicated points
0
0
2
0
Minimum R Maximum R
2
2
2
2
6 1
6 1 7
6 1 7
6
0
0
Minimum C Maximum C Rare-gas molecules Alkalil-earth atoms bonded Minimum tabulated datum Maximum tabulated datum
8 33 2
0 0 1.150 5.100
0
Minimum prediction
1.270
1.265
3.72
298.000 1597.893 289.842
Maximum prediction
2.643
4.981
12.99
1464.713
-0.53
-1.92
-1.61
8.63 6.23 5.99
10.97 7.56 8.14
15.13 10.05 11.39
35 2.25 10.77 7.53
12 -0.04 7.33 5.74
-1.78 12.32 9.77
7.93
4.21
7.14
Average % difference Standard deviation Average abs. % difference Standard deviation Validation file Number of data Average % difference Standard deviation Average abs. % difference Standard deviation
1.150 5.200
0 3.2 14.7
1 7
26.39 25.00 31.32 30.43 37 18.55 28.95 28.39 22.41
12
Global predictions Number of predictions New maximum R
2145
Minimum prediction
7 1.270
Maximum prediction
5.994
2145 7
Least-squares information Tabulated data Number of molecular data Average percent error Least-squares smoothing Average % difference Standard deviation Number of global predictions
150 2.63 -0.030
■
4.128 299
5418
5418
1.265
3.65
45.270
4.981
13.24
1775.672
Figure 2. Neural network predictions (•) of internuclear separation for $(R_1, R_2) = (2,2)$. Known data are shown by x.
Figure 3. Mesh surface fitted to the NN predictions in Figure 2. Note the slight asymmetries, and the maximum at left caused by data for various alkaline-earth pairs at $(C_1, C_2) = (2,2)$.
Figure 4. Same as Figure 2 except for $(R_1, R_2) = (3,3)$.
Figure 5. Same as Figure 2 but now the groups are fixed, $(C_1, C_2) = (2,6)$, and the periods vary. The tabulated data for these alkaline-earth chalcogenides are not symmetrically located.
Figure 6. Mesh surface fitted to the NN predictions in Figure 5.
Figure 7. Same as Figure 5 for dihalides, $(C_1, C_2) = (7,7)$.
that networks do not have to plateau when predictions reach extrema in the learned data. To the extent that the average absolute percent differences and standard deviations of the validation files are similar to those of the learning files, there is good indication that NN #1 and NN #2 learned the tabulated data well. The standard deviations of $r_e$ global predictions made by these NN can be tentatively set by the learning file values of $\sigma$ = 25.00 and 8.63%, respectively.
C. Graphical Representations of Neural Network Results
Figures 2 to 7 show NN #1 results plotted on fixed-row and fixed-column coordinates. They show where the data tend to be concentrated in portions of the base planes. The figures show slight asymmetries. These asymmetries can be quantified by computing the centroid for the data, $\sum_i [(C_{1i} - C_{2i}) \times r_{e,i}] / \sum_i [r_{e,i}]$, summed over all of the data $i$. For the globally predicted values, this centroid is 0.055. In fixed-row graphs for $r_e$, NN predicted surfaces seem to match known surfaces''^ fairly well. NN predictions in the fixed-column graphs for $r_e$ are also quite faithful (Figures 5 to 7). The surfaces are in qualitative agreement with the $\log(R_1 R_2)$ formula presented in Ref. 20. There are plateaus in the predicted data for $(R_1, R_2) = (6,6)$, $(6,7)$, and $(7,7)$.
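The asymmetry centroid defined above is a weighted mean of $C_1 - C_2$ with $r_e$ as the weight, so it vanishes for a perfectly symmetrized data set. The sketch below is only an illustration of that formula: the function name `asymmetry_centroid` and the sample points are hypothetical and are not data from Table 1.

```python
def asymmetry_centroid(points):
    """Return sum((C1 - C2) * r_e) / sum(r_e) over (C1, C2, r_e) triples."""
    numerator = sum((c1 - c2) * r_e for c1, c2, r_e in points)
    denominator = sum(r_e for _, _, r_e in points)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical symmetrized points: each molecule appears with its mirror image.
    symmetric = [(1, 7, 1.6), (7, 1, 1.6), (4, 4, 1.3)]
    # Hypothetical asymmetric predictions: the two mirror images disagree.
    skewed = [(1, 7, 1.5), (7, 1, 1.7)]
    print(asymmetry_centroid(symmetric))
    print(asymmetry_centroid(skewed))
```

A value near zero, such as the 0.055 quoted for the global predictions, therefore indicates that the network treats a molecule and its mirror image almost identically.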
V. RESULTS FOR TRIATOMIC-MOLECULAR IP AND $\Delta H_a$
A. Least-Squares Results^
Columns 4 to 6 of Tables 2 and 3 show the LS predictions and 95% confidence-limit errors. The LS predictions were made in the 3D basis only, because it was impossible to guess at any fitting formulas from inspection of the data in the 6D basis. The molecule OCO was omitted in the LS analysis of $\Delta H_a$ because its high numerical value distorted the fitting; OSiO was also omitted.
B. Neural Network Results
Columns 7-9 and 10-12 of Tables 2 and 3 show NN predictions using the 6D and 3D bases. The prediction for a given address in the 3D basis can pertain to several molecules (Section II). Table 4 shows that the predictions for IP are inside the extrema of the tabulated data for the 6D basis; Table 5 shows that they are outside the extrema for the 3D basis. Exactly the opposite is true for $\Delta H_a$. Again, the point is that NNs can extrapolate beyond the extrema of the learned data. The standard deviations of global predictions for IP made by these NN in the 6D and 3D bases are given by $\sigma$ = 10.97% (Table 4) and 12.35% (Table 5), respectively. For $\Delta H_a$, the respective standard deviations are 15.13 and 22.67%.
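The percent differences and their summary statistics in Table 2 can be reproduced from the tabulated and predicted values. The sketch below uses the tabulated IP values and the least-squares predictions from Table 2; the formula 100 × (prediction − tabulated) / tabulated and the use of the population standard deviation (division by N) are assumptions that happen to reproduce the printed entries, not a statement of the authors' procedure.

```python
from statistics import mean, pstdev

# Tabulated IP (eV) and least-squares (3D basis) predictions from Table 2.
TABULATED = {
    "FBO": 13.400, "NNO": 12.890, "FBF": 8.400, "OOF": 12.600, "FOF": 13.700,
    "OCO": 13.790, "NCF": 13.320, "OSO": 12.340, "FSiF": 11.000, "OSiO": 11.700,
}
LS_3D = {
    "FBO": 12.15, "NNO": 12.15, "FBF": 8.14, "OOF": 12.60, "FOF": 12.61,
    "OCO": 12.15, "NCF": 12.15, "OSO": 11.96, "FSiF": 11.96, "OSiO": 11.59,
}


def percent_differences(tabulated, predicted):
    """Return 100 * (predicted - tabulated) / tabulated for each molecule."""
    return {
        name: 100.0 * (predicted[name] - tabulated[name]) / tabulated[name]
        for name in tabulated
    }


def summarize(differences):
    """Return the average and the population standard deviation of the differences."""
    values = list(differences.values())
    return mean(values), pstdev(values)


if __name__ == "__main__":
    diffs = percent_differences(TABULATED, LS_3D)
    average, sigma = summarize(diffs)
    for name, value in diffs.items():
        print(f"{name:5s} {value:7.2f}")
    print(f"average {average:.2f}, standard deviation {sigma:.2f}")
```

Under these assumptions the least-squares column averages to about -4.2 with a standard deviation of about 5.7, and the same two functions can be applied to the neural-network columns by swapping in their predicted values.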
Table 5. Characteristics of Neural-Network Learning, Least-Squares Smoothing, and Predictions: $\{f(r), n_e, C_2\}$ 3D Basis
IP (eV)
$\Delta H_a$
Neural network information Learning file Number of points
69
79
M i n i m u m fid
8
8
Maximum fir)
72
60
M i n i m u m n^
3
8
Maximum n^
20
20 2
M i n i m u m C2
2
Maximum C2
7
M i n i m u m magnitude Maximum magnitude
3.2
298.00
14.7
1597.89
7
M i n i m u m prediction Maximum prediction
2.812
416.53
13.900
1346.23
Average % difference Standard deviation
1.39 12.35
Average abs. % difference Standard deviation
4.91 22.67
8.77
15.02
8.74
17.62
Validation file Number of data
8
9
Average % difference
1.37
4.99
Standard deviation
9.72
20.4
Average abs. % difference Standard deviation
8.07
14.73
4.39
14.12
Global predictions Same independent-variable limits Number of predictions M i n i m u m prediction Maximum prediction New independent-variable limits Number of predictions New maximum fir) New minimum n^ New maximum n^ New minimum C2
5724 376.38 1346.23 2596
16324
75
98
24
3 24 1
Maximum C2
8
M i n i m u m prediction
0.000
367.59
Maximum prediction
16.484
1556.87
Least-squares information Tabulated data Number of molecular data Average percent error
80
91
4.23
2.63
11.42
2.92 26.82
Least-squares smoothing Average % difference Standard deviation Number of global predictions
14.65 205
254
C. Graphical Results
Figure 8 shows global NN predictions for $\Delta H_a$, when $R_1 = R_2 = R_3 = 2$, plotted on the independent variables $n_e$ and $C_2$. Most of the tabulated data lie in the region between the solid line and the near edges of the figure. Figure 9 shows a contour map of the same surface. The region bounded by dotted lines and the edges of the figure is of considerable interest, because the contours have slopes of approximately -1. Thus the contours are described approximately by $n_e + C_2 = C_1 + 2C_2 + C_3 = (C_1 + C_2) + (C_2 + C_3)$ = constant. It appears that in this region, molecules with similar $\Delta H_a$ are not isoelectronic in the usual sense, $n_e$ = constant, but in the "adjacent-DIM" sense.^^ The phenomenon appears to be restricted, for this property, to molecules formed of atoms with high electronegativities. Figure 10 shows a contour map of the surface of global predictions for $(R_1, R_2, R_3) = (2,2,3)$ [and of course (3,2,2)]. Most of the tabulated data lie in a much smaller region than in Figures 8 and 9. In the region at the top right of the figure,
[Figure 8 point labels: CCF, NCO; FCN, OCO; CCN, BCO, BeCF; C3, BCN, BeCO, LiCF]
Figure 8. Neural-network global predictions for heat of atomization $\Delta H_a$ plotted in the 3D basis [on $n_e$ and $C_2$, for $f(r) = 8$]. As explained in the text, this basis results in there being more than one molecule for most addresses. All conceivable molecules are considered, whether or not they exist under currently studied conditions. Most of the meaningful predictions (i.e., those in the domain where most tabulated data lie) are in front of the diagonal solid line or in the corridor extending from that line along $C_2 = 4$ to the right to $n_e = 12$ (C3, BCN, BeCO, and LiCF).
[Figure 9 legend: contour bands of width 60 from 380-440 to 1340-1400; horizontal axis ticks from 11 to 20]
Figure 9. A contour map of the surface in Figure 8 (but with different intervals).
[Figure 10 legend: contour bands of width 60 from 380-440 to 1280-1340; horizontal axis ticks from 11 to 20]
Figure 10. Same as Figure 8 except that $f(r) = 10$; see text for associated values of $R_1$, $R_2$, and $R_3$.
[Figure 11 legend: contour bands from 380-440 to 1220-1280]
Figure 11. Same as Figure 8 except that $f(r) = 12$. Comparing this figure with the previous two figures makes it easy to see the periodicity of the predicted data, their monotonic decline, and the shrinking of the region with slope -1 as $f(r)$ increases.
similar molecules are again approximately isoelectronic in the "adjacent-DIM" sense. Figure 11 shows the contour map pertaining to $(R_1, R_2, R_3) = (2,2,4)$ [and (4,2,2)], (2,3,2), and (3,2,3). Now the region bounded by the dotted lines is much smaller. At larger values of $f(r)$, the phenomenon of "adjacent-DIM" isoelectronic similarity disappears. All IP predictions with the same $f(r)$ and $n_e$ were the same, and so the contours on the graphs (not shown) all consist of lines parallel to the $C_2$ axis.
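The arithmetic behind the slope $-1$ contours can be illustrated with a minimal Python sketch. It assumes that each $C_i$ is the periodic-table group number (valence-electron count) of atom $i$, so that $n_e = C_1 + C_2 + C_3$, which is what the contour equation implies. The `GROUP` table, the function names, and the example molecules are illustrative choices, not values taken from the chapter's tables.

```python
from collections import defaultdict

# Assumption: C_i is the group number (valence-electron count) of atom i.
GROUP = {"Li": 1, "Be": 2, "B": 3, "C": 4, "N": 5, "O": 6, "F": 7}


def descriptors(atoms):
    """Return (n_e, C_2, adjacent-pair sum) for a triatomic given as 3 symbols."""
    c1, c2, c3 = (GROUP[a] for a in atoms)
    n_e = c1 + c2 + c3                  # usual isoelectronic count
    adjacent = (c1 + c2) + (c2 + c3)    # sum over the two adjacent atom pairs
    return n_e, c2, adjacent


def group_by_adjacent_sum(molecules):
    """Collect molecules that share the same value of n_e + C_2."""
    groups = defaultdict(list)
    for atoms in molecules:
        n_e, c2, adjacent = descriptors(atoms)
        # The two forms of the contour equation agree term by term.
        assert adjacent == n_e + c2
        groups[adjacent].append("".join(atoms))
    return dict(groups)


if __name__ == "__main__":
    examples = [("O", "C", "O"), ("C", "N", "O"), ("C", "C", "C"), ("B", "C", "N")]
    for key, names in sorted(group_by_adjacent_sum(examples).items()):
        print(key, names)
```

In this sketch OCO ($n_e = 16$, $C_2 = 4$) and CNO ($n_e = 15$, $C_2 = 5$) fall on the same line $n_e + C_2 = 20$ although they are not isoelectronic in the usual sense. Whether their $\Delta H_a$ values are actually similar is a question for the tabulated data, not something the sketch shows.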
VI. DISCUSSION
This paper assumes, just as do Refs. 14-16, that molecules exist, in spite of the questions raised in Refs. 22-24. For diatomic-molecular $r_e$ and triatomic-molecular $\Delta H_a$, the LS results are more accurate; for triatomic-molecular IP, the NN results are more accurate. The predictions of both methods might be improved if the tabulated data were culled so as to keep only the diatomic or triatomic molecules with the same ground-state terms; this could be an area of future work. NNs are very sensitive to the presence of additional independent variables in the basis (e.g., n^, n^), may be sensitive to how evenly the learning data are distributed in the space of independent variables, and are not very sensitive to the particular partition of the tabulated data into learning and validation files. These aspects of NNs, in the context of molecular classification, are under study now.
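One way to probe sensitivity to the learning/validation partition is to repeat the fit over several random partitions and compare the validation errors. The following is a minimal sketch under stated assumptions: a one-variable least-squares line stands in for either the LS or the NN model, the data are synthetic, and all names (`fit_line`, `rms_error`, `partition_errors`) are hypothetical rather than taken from the chapter.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rms_error(a, b, xs, ys):
    """Root-mean-square deviation of the fitted line from the given points."""
    return (sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def partition_errors(xs, ys, n_partitions=5, validation_fraction=0.25, seed=0):
    """Validation RMS error for several random learning/validation splits."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_val = max(1, int(len(idx) * validation_fraction))
    errors = []
    for _ in range(n_partitions):
        rng.shuffle(idx)
        val, learn = idx[:n_val], idx[n_val:]
        a, b = fit_line([xs[i] for i in learn], [ys[i] for i in learn])
        errors.append(rms_error(a, b, [xs[i] for i in val], [ys[i] for i in val]))
    return errors


if __name__ == "__main__":
    # Synthetic data only: a noisy straight line, not molecular data.
    rng = random.Random(1)
    xs = list(range(20))
    ys = [3.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    print(partition_errors(xs, ys))
```

A small spread among the returned errors would indicate low sensitivity to the partition. The same loop could wrap any other model by replacing `fit_line` and `rms_error`.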
Inquiries concerning predictions for molecules not listed in Tables 1 through 3 should be directed to R.H.

REFERENCES
1. R. Hefferlin, J. Phys. Chem. 1995, 99, Sill.
2. R. Hefferlin, Periodic Systems of Molecules and their Relation to the Systematic Analysis of Molecular Data (Edwin Mellen Press, Lewiston, New York, 1989).
3. E. V. Babaev and R. Hefferlin, in: Concepts in Chemistry, ed. D. H. Rouvray (Research Studies Press/John Wiley, Chichester, U.K., 1997).
4. C. M. Carlson, R. J. Cavanaugh, R. A. Hefferlin, and G. V. Zhuvikin, J. Chem. Inf. Comp. Sci. 1996, 36, 396.
5. R. Hefferlin and M. Kutzner, J. Chem. Phys. 1981, 75, 1035.
6. Ref. 2, pp. xxiii-xxxiv, 190-233.
7. Ref. 2, pp. 234-249.
8. Ref. 2, pp. 262-289.
9. C. Carlson, J. Gilkeson, K. Linderman, S. LeBlanc, and R. Hefferlin, Estimation of Properties of Triatomic Molecules from Tabulated Data Using Least-square Fitting, Croatica Chem. Acta, in press for the June, 1997, issue.
10. B. Davis, B. Laing, and R. Hefferlin, in: Proceedings of the 1997 International Arctic Seminar (Pedagogical Institute, Murmansk, Russia, 1997), pp. 31-36.
11. T. R. Cundari and E. Moody, J. Chem. Inf. Comp. Sci. 1997, 32, 871.
12. T. R. Cundari and E. Moody, J. Mol. Struct. (Theochem) 1998, 425, 43.
13. J. Lawrence, Introduction to Neural Network Design, Theory, and Applications (California Scientific Software Press, Nevada City, CA, 1994).
14. R. Hefferlin and G. Zhuvikin, J. Quant. Spectrosc. Radiat. Transfer 1984, 32, 151.
15. R. Hefferlin, J. Chem. Inf. Comput. Sci. 1994, 34, 314.
16. K. Huber and G. Herzberg, Constants of Diatomic Molecules (D. Van Nostrand Reinhold Co. Inc., New York, 1979).
17. L. V. Gurvich, et al., Thermodinamicheskie Svoistva Individual'nikh Veschestv, Vols. 1-4 (Nauka, Moscow, 1978, 1979, 1981, 1982).
18. L. V. Gurvich, et al., Energii Razryva Khimicheskikh Svyazei. Potentialy Ionizatzii i Srodstvo k Electronu (Nauka, Moscow, 1974), pp. 229-289. [An earlier edition was translated into English: V. I. Vedeneyev, et al., Bond Energies, Ionization Potentials, and Electron Affinities (St. Martins, New York, 1966).]
19. A. J. Sauval and J. B. Tatum [computations for triatomic molecules done at the same time as those for diatomic molecules, the latter appearing in Astrophys. J. Suppl. 1984, 56, 193].
20. R. F. Nalewajski, J. Phys. Chem. 1979, 83, 2677.
21. R. Cavanaugh, R. Marsa, J. Robertson, and R. Hefferlin, J. Mol. Struct. 1996, 382, 137.
22. H. Primas, Chemistry, Quantum Mechanics and Reductionism: Perspectives in Theoretical Chemistry (Springer-Verlag, Berlin, Germany, 1983).
23. V. V. Nefedova, A. I. Boldyrev, and J. Simons, J. Chem. Phys. 1993, 98, 8801.
24. A. I. Boldyrev, Structure and Dynamics of Non-Rigid Molecular Systems (Kluwer Academic Publishers, Dordrecht, The Netherlands, 1995).