Anal. Chem. 1997, 69, 2398-2405
Infrared Spectra Simulation of Substituted Benzene Derivatives on the Basis of a 3D Structure Representation
Jan Schuur and Johann Gasteiger*
Computer-Chemie-Centrum, Institut für Organische Chemie, Universität Erlangen-Nürnberg, Nägelsbachstrasse 5, D-91052 Erlangen, Germany
The identification of chemical compounds from their infrared spectra faces new challenges from novel experimental techniques such as combinatorial chemistry. To rapidly provide estimates for the infrared spectra of candidate structures, an empirical approach to modeling the relationship between the 3D structure of a molecule and its infrared spectrum has been developed. This method is based on a novel 3D structure representation and a powerful modeling technique, a counterpropagation neural network. A dataset of 871 mono-, di-, and trisubstituted benzene derivatives is analyzed with this approach.

Each day, chemists run reactions, isolate the products, and then have to elucidate their structures. In this way, 500 000 new compounds are synthesized each year, not to mention the many known compounds that are made or isolated from natural or environmental sources and also have to be analyzed. This requires massive amounts of spectroscopic data to be analyzed, and it is therefore not surprising that computer methods are being developed to automate some of the processes in structure elucidation.

Several systems, developed over the last two decades, strive for automatic structure elucidation on the basis of spectroscopic data: CHEMICS,1 SpecInfo,2 and SESAMI,3 to name just the more prominent ones. These systems derive substructure information from various spectroscopic methods and assemble these substructures into a mosaic of the entire structure. This usually results in quite a few candidate structures; therefore, additional filtering methods are necessary to identify the correct one. Spectra, either taken from a database or simulated by some computational approach, play a central role in this process.
Infrared spectra have not yet been utilized extensively because databases of infrared spectra are still far from comprehensive: the largest one comprises 90 000 spectra, a minute number in comparison to the 15 million compounds presently known. Furthermore, the calculation of infrared spectra still has some problems. Calculation by ab initio quantum mechanical methods requires quite extensive basis sets or a density functional theory approach, which is computationally quite demanding. Empirical approaches only consider the constitution of a molecule and are mostly built on dissecting the molecule into fragments. Systems with 640 and 720 fragments have been developed,4,5 but a fragment-based approach will always be incomplete because the number of fragments is unlimited. Another approach, by Clerc and Terkovics,6 utilizes a series of descriptors derived from the constitution of a molecule.

However, infrared spectra are the result of vibrations of different molecular parts in 3D space. If one does not take into account the three-dimensional arrangement of atoms in a molecule, the correlation of IR spectra with structural features necessarily has serious limitations. We present a coding scheme for the three-dimensional structure of molecules that has recently been published,7 and we show its merits for the correlation between IR spectra and 3D structures of substituted benzene derivatives.

The 3D-MoRSE code was introduced in ref 7, which gives examples of its application to the classification of biologically active molecules and first attempts at the prediction of IR spectra. Reference 8 gives an overview of our work in representing 3D chemical information. Reference 9 describes the optimization of the 3D-MoRSE code for a specific task, and ref 10 briefly communicates some results for the simulation of IR spectra of aromatic compounds. Here, we present a detailed discussion of these investigations.

REPRESENTATION OF THE 3D STRUCTURE OF MOLECULES
Clearly, any attempt to correlate IR spectra with the 3D structures of molecules only makes sense when the 3D structures are available for a wide range of compounds. The number of experimentally determined 3D structures is small in comparison to the number of known compounds, and therefore automatic 3D structure generators have recently been developed.11 The 3D structure generator CORINA, developed in our group,12,13 was shown to have a broad scope and a high conversion rate, leading to structures in good correspondence with experimental data.14

(1) Funatsu, K.; Sasaki, S. J. Chem. Inf. Comput. Sci. 1996, 36, 190-204.
(2) Available from Chemical Concepts, Weinheim, FRG.
(3) Munk, M. E.; Madison, M. S.; Robb, E. W. J. Chem. Inf. Comput. Sci. 1996, 36, 231-238.
(4) Dubois, J. E.; Mathieu, G.; Peguet, P.; Panaye, A.; Doucet, J. P. J. Chem. Inf. Comput. Sci. 1990, 30, 290-302.
(5) Huixiao, H.; Xinquan, X. J. Chem. Inf. Comput. Sci. 1990, 30, 203-210.
(6) Clerc, J. T.; Terkovics, A. L. Anal. Chim. Acta 1990, 235, 93-102.
(7) Schuur, J.; Selzer, P.; Gasteiger, J. J. Chem. Inf. Comput. Sci. 1996, 36, 334-344.
(8) Gasteiger, J.; Sadowski, J.; Schuur, J.; Selzer, P.; Steinhauer, L.; Steinhauer, V. J. Chem. Inf. Comput. Sci. 1996, 36, 1030-1037.
(9) Schuur, J.; Gasteiger, J. In Software Development in Chemistry 10; Gasteiger, J., Ed.; Gesellschaft Deutscher Chemiker: Frankfurt am Main, 1996; p 94.
(10) Selzer, P.; Schuur, J.; Gasteiger, J. In Software Development in Chemistry 10; Gasteiger, J., Ed.; Gesellschaft Deutscher Chemiker: Frankfurt am Main, 1996; p 293.
(11) Sadowski, J.; Gasteiger, J. Chem. Rev. 1993, 93, 2567-2581.

2398 Analytical Chemistry, Vol. 69, No. 13, July 1, 1997
S0003-2700(96)01107-9 CCC: $14.00
© 1997 American Chemical Society
The question is now how to represent the 3D structures for correlations with infrared spectra. We have recently developed a mathematical transformation of the molecular 3D structure that gives a fixed number of variables for 3D structure representation, as required by empirical modeling methods. This transformation builds on equations used in the analysis of the intensity distributions obtained in electron diffraction experiments. The new 3D structure code was, therefore, named 3D molecule representation of structures based on electron diffraction code (3D-MoRSE code).

The radially symmetric intensity, I, in an electron diffraction experiment depends on the scattering factor, fi, of the electrons of the ith atom, the coordinates of this atom, and the interference function associated with each pair of atoms, i and j. The intensity as a function of the scattering variable, s, is usually given in the form of eq 1, as used since the electron diffraction studies of Wierl.15

$$I(s) = K \sum_{i=2}^{N} \sum_{j=1}^{i-1} f_i f_j \int_0^{\infty} P_{ij}(r)\, \frac{\sin sr}{sr}\, dr \tag{1}$$
The definition of s is given by

$$s = \frac{4\pi \sin(\vartheta/2)}{\lambda} \tag{2}$$
with ϑ being the scattering angle and λ the wavelength. I(s) is the intensity of the scattered radiation, r represents the interatomic distances, Pij(r) is the probability distribution of the vibrational variation in the distance between atoms i and j, fi and fj are the form factors of atoms i and j, and K collects various constants that depend on the instrument. Following Soltzberg and Wilkins,16 we have made the simplifications,
$$K = 1, \qquad P_{ij}(r) = \delta(r - r_{ij})$$
effectively assuming the atoms to be point scatterers and the molecule to be rigid. In our definition of the 3D-MoRSE code, the form factors, fi, in eq 1 are generalized to an atomic property, Ai, such as atomic mass, polarizability, partial charge, or atomic number, to be selected by the user. This leads to eq 3:

$$I(s) = \sum_{i=2}^{N} \sum_{j=1}^{i-1} A_i A_j\, \frac{\sin s r_{ij}}{s r_{ij}} \tag{3}$$
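As an illustration of eq 3, the following minimal Python sketch evaluates the double sum at equally spaced values of s. It is not the authors' implementation: the function name `morse_code`, the NumPy-based layout, the default sampling of 32 points from 0 to 31 Å⁻¹, and the toy geometry with made-up charges are all assumptions for illustration, and no scaling of the resulting values is applied.

```python
import numpy as np

def morse_code(coords, props, n_points=32, s_max=31.0):
    """Evaluate eq 3 at n_points equally spaced s values in [0, s_max] (in 1/Angstrom).

    coords: (N, 3) Cartesian coordinates in Angstrom; props: (N,) atomic property A_i.
    Scaling of the resulting values is deliberately left out of this sketch.
    """
    coords = np.asarray(coords, dtype=float)
    props = np.asarray(props, dtype=float)
    i, j = np.tril_indices(len(props), k=-1)              # all atom pairs with j < i
    r_ij = np.linalg.norm(coords[i] - coords[j], axis=1)  # interatomic distances
    weights = props[i] * props[j]                         # A_i * A_j for each pair
    s = np.linspace(0.0, s_max, n_points)
    # np.sinc(x) = sin(pi x)/(pi x), so sinc(s r / pi) = sin(s r)/(s r), with value 1 at s = 0.
    terms = np.sinc(np.outer(s, r_ij) / np.pi)
    return s, terms @ weights

# Hypothetical three-atom geometry with made-up charges, for illustration only.
coords = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
charges = [-0.4, 0.2, 0.2]
s, code = morse_code(coords, charges)
print(code.shape)  # (32,)
```

Using `np.sinc` avoids a division by zero at s = 0, where each pair term takes its limiting value of 1.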
Furthermore, the function I(s) is made discrete, reporting its values only at equally spaced values of s within a certain range. In the investigation here, I(s) is calculated at 32 equidistant values in the range between 0.0 and 31.0 Å⁻¹. Partial atomic charges, qtot,i, were chosen as the atomic property, Ai, because of the importance of changes in the dipole moment for infrared activity. The size of the datasets to be studied requires rapid methods for the calculation of atomic charges. It was therefore decided to use the partial equalization of orbital electronegativities (PEOE) method17 and its extension to conjugated π-systems.18 Scaling coefficients for the 3D-MoRSE values have been derived on the basis of a standard dataset of 24 organic molecules, comprising a series of functional groups and ring systems so as to represent a broad range of organic chemistry.

(12) Gasteiger, J.; Rudolph, C.; Sadowski, J. Tetrahedron Comput. Methodol. 1992, 3, 537-547.
(13) Sadowski, J.; Rudolph, C.; Gasteiger, J. Anal. Chim. Acta 1992, 265, 233-241.
(14) Sadowski, J.; Gasteiger, J.; Klebe, G. J. Chem. Inf. Comput. Sci. 1994, 34, 1000-1008.
(15) Wierl, R. Ann. Phys. (Leipzig) 1931, 8, 521-564.
(16) Soltzberg, L. J.; Wilkins, C. L. J. Am. Chem. Soc. 1977, 99, 439-443.

REPRESENTATION OF INFRARED SPECTRA
All infrared spectra were taken from the SpecInfo database.2 All spectra were used as absorbance spectra; transmission spectra were first converted to absorbance. The spectra were then brought to a common resolution over the frequency range by interpolation. From 3500 to 2000 cm⁻¹, the infrared spectra were digitized to 150 equally spaced points, one per 10 cm⁻¹. In the range from 2000 to 552 cm⁻¹, the spectra were digitized to 362 points, one per 4 cm⁻¹. These 512 absorbance values were transformed into 512 Hadamard coefficients. The Hadamard coefficients 129-512 were set to zero. The 512 coefficients were then transformed back into absorbances by the reverse Hadamard transformation. This results in 512 values consisting of 128 groups of four equal values. For the representation of the infrared spectrum, only every fourth value was taken, leading to 128 absorbance values with a resolution of 40 cm⁻¹ between 3500 and 2020 cm⁻¹ and a resolution of 16 cm⁻¹ between 2000 and 560 cm⁻¹. This is similar to the IR spectra representation suggested by Novic and Zupan.19

COUNTERPROPAGATION NEURAL NETWORK
The relationship between the structure of an organic compound and its IR spectrum is complex in nature.
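The spectrum encoding described under REPRESENTATION OF INFRARED SPECTRA can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the exact start points of the two sampling ranges, the sequency (sign-change) ordering of the Hadamard rows, the function names, and the synthetic one-band spectrum are all assumed here. The text does not say which ordering of Hadamard coefficients was used; sequency ordering is chosen because it reproduces the groups of four equal values that the text reports.

```python
import numpy as np

def sampling_grid():
    """512 wavenumbers: 150 points at 10 cm^-1 spacing, then 362 at 4 cm^-1 (assumed start points)."""
    high = 3500.0 - 10.0 * np.arange(150)   # 3500 ... 2010 cm^-1
    low = 2000.0 - 4.0 * np.arange(362)     # 2000 ... 556 cm^-1
    return np.concatenate([high, low])

def walsh_matrix(n):
    """Hadamard matrix of order n (a power of two), rows sorted by number of sign changes."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), h)  # Sylvester construction
    sign_changes = (np.diff(h, axis=1) != 0).sum(axis=1)
    return h[np.argsort(sign_changes)]

def compress(absorbance_512, keep=128):
    """Zero all but the first `keep` coefficients, transform back, take every fourth value."""
    w = walsh_matrix(absorbance_512.size)
    coeff = w @ absorbance_512 / absorbance_512.size   # forward transform
    coeff[keep:] = 0.0                                 # drop coefficients 129-512
    smoothed = w.T @ coeff                             # reverse transform
    return smoothed[::4]

# Synthetic one-band spectrum (not real data), interpolated onto the grid.
wavenumber = np.linspace(400.0, 4000.0, 1801)
absorbance = np.exp(-((wavenumber - 1700.0) / 15.0) ** 2)
a512 = np.interp(sampling_grid(), wavenumber, absorbance)
code = compress(a512)
print(code.shape)  # (128,)
```

With sequency ordering, this truncation is equivalent to averaging each group of four neighboring points, so the 128 retained values are block means of the 512-point spectrum.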
It has been established in recent years that artificial neural networks are able to model complex relationships implicitly in their weights, without the mathematical form of these relationships having to be specified.20,21 Most applications of neural networks in chemistry for modeling purposes use multilayer feedforward networks trained by the backpropagation algorithm.20-22 In our study, we have chosen a counterpropagation (CPG) neural network23 for two reasons. First, the two-dimensional arrangement of neurons in a CPG neural network gives more flexibility for modeling complex relationships.21 Second, the input (upper) part of a counterpropagation network is basically a Kohonen network,24,25 and its topology-preserving nature keeps it from overmodeling. Figure 1 gives a general overview of the architecture of a CPG neural network. For further information, see refs 21 and 23.

(17) Gasteiger, J.; Marsili, M. Tetrahedron 1980, 36, 3219-3228.
(18) Gasteiger, J.; Saller, H. Angew. Chem. 1985, 97, 699-701; Angew. Chem., Int. Ed. Engl. 1985, 24, 687-689.
(19) Novic, M.; Zupan, J. J. Chem. Inf. Comput. Sci. 1995, 35, 454-466.
(20) Gasteiger, J.; Zupan, J. Angew. Chem. 1993, 105, 510-536; Angew. Chem., Int. Ed. Engl. 1993, 32, 503-527.
(21) Zupan, J.; Gasteiger, J. Neural Networks for Chemists: An Introduction; VCH: Weinheim, 1993.
(22) Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. Nature 1986, 323, 533-536.
(23) Hecht-Nielsen, R. Appl. Opt. 1987, 26, 4979-4984.
(24) Kohonen, T. Self-Organisation and Associative Memory, 3rd ed.; Springer: Berlin, 1989.
(25) Kohonen, T. Biol. Cybern. 1982, 43, 59-69.
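The counterpropagation scheme outlined in this section can be sketched as follows. This is a generic toy version, not the network used in the study: the class name `CPGNetwork`, the map size, the random initialization, the linearly decaying learning rate, and the triangular neighborhood function are all assumptions. In the setting of this paper, the input vector would be the 32 3D-MoRSE values and the output vector the 128 absorbance values.

```python
import numpy as np

class CPGNetwork:
    """Toy counterpropagation network: a Kohonen input layer coupled to an output layer."""

    def __init__(self, rows, cols, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.random((rows, cols, n_in))    # input (Kohonen) weights
        self.w_out = rng.random((rows, cols, n_out))  # output weights
        r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
        self.grid = np.stack([r, c], axis=-1)         # map coordinates of each neuron

    def winner(self, x):
        """Map position of the neuron whose input weights are closest to x."""
        dist2 = ((self.w_in - x) ** 2).sum(axis=-1)
        return np.unravel_index(np.argmin(dist2), dist2.shape)

    def train(self, X, Y, epochs=50, lr0=0.5):
        radius0 = max(self.w_in.shape[:2]) / 2.0
        for epoch in range(epochs):
            frac = 1.0 - epoch / epochs               # decays linearly towards 0
            lr, radius = lr0 * frac, radius0 * frac
            for x, y in zip(X, Y):
                win = np.array(self.winner(x))
                d = np.abs(self.grid - win).max(axis=-1)   # ring distance on the map
                h = lr * np.clip(1.0 - d / (radius + 1.0), 0.0, None)
                # The winner is chosen from the input alone; both layers are then
                # moved towards the training pair with the same neighborhood weight.
                self.w_in += h[..., None] * (x - self.w_in)
                self.w_out += h[..., None] * (y - self.w_out)

    def predict(self, x):
        """Output weights stored at the winning neuron."""
        return self.w_out[self.winner(x)]

# Toy data only: two 2D inputs with scalar targets.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0]])
net = CPGNetwork(5, 5, n_in=2, n_out=1)
net.train(X, Y)
print(net.predict(X[0]), net.predict(X[1]))
```

Prediction needs only the input layer to locate the winning neuron; the simulated output is then read from the output weights at the same map position.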
Table 1. Distribution of the Correlation Coefficients for Training and Test Sets correlation coefficient