Atom-Type-Based AI Topological Descriptors: Application in Structure

Atom-Type-Based AI Topological Descriptors: Application in Structure-Boiling ...... MLR Models Obtained for the Entire Data Set Based on the and AI In...
0 downloads 0 Views 101KB Size
J. Chem. Inf. Comput. Sci. 2003, 43, 1121-1131

1121

Atom-Type-Based AI Topological Descriptors: Application in Structure-Boiling Point Correlations of Oxo Organic Compounds Biye Ren* Research Institute of Materials Science, South China University of Technology, Guangzhou 510640, P. R. China Received December 10, 2002

Structure-boiling point relationships are studied for a series of oxo organic compounds by means of multiple linear regression (MLR) analysis. Excellent MLR models based on the recently introduced Xu index and the atom-type-based AI indices are obtained for the two subsets containing respectively 77 ethers and 107 carbonyl compounds and a combined set of 184 oxo compounds. The best models are tested using the leave-one-out cross-validation and an external test set, respectively. The MLR model produces a correlation coefficient of r ) 0.9977 and a standard error of s ) 3.99 °C for the training set of 184 compounds, and rcv ) 0.9974 and scv ) 4.16 °C for the cross-validation set, and rpred ) 0.9949 and spred ) 4.38 °C for the prediction set of 21 compounds. For the two subsets containing respectively 77 ethers and 107 carbonyl compounds, the quality of the models is further improved. The standard errors are reduced to 3.30 and 3.02 °C, respectively. Furthermore, the results obtained from this study indicate that the boiling points of the studied oxo compound dominantly depend on molecular size and also depend on individual atom types, especially oxygen heteroatoms in molecules due to strong polar interactions between molecules. These excellent structure-boiling point models not only provide profound insights into the role of structural features in a molecule but also illustrate the usefulness of these indices in QSPR/QSAR modeling of complex compounds. 1. INTRODUCTION

The construction of quantitative structure-property/activity relationship (QSPR/QSAR) models has been received considerable attentions in the past decade years.1 Numerous QSPR/QSAR models have reported based on various known physicochemical parameters and molecular descriptors such as geometric, electronic or electrostatic, polar, steric, topological indices.2 Among the most-used descriptors are topological indices because they can be easily and directly calculated from the molecular structures without any experimental effort.3-6 Hence, use of topological indices in QSPR/QSAR provides simple and straightforward means for the molecular design and the estimation of properties and activities of a molecule. In general, traditional 2D and 3D topological indices characterize a molecule as a whole, i.e., molecular size or shape, such as the well-known molecular connectivity index (χ),5,6 Balaban’s index (J),7 Wiener’s index (W),8 and the recently introduced Xu index.9 However, these conventional indices are being faced with some challenges because the QSPR/QSAR models based on a single or few conventional topological indices cannot directly reflect the role of individual atom types or groups in molecules and their relative importance to the property or activity of a molecule. In recent years, attention has been focused on the use of the atom-type-based topological indices.10 Unlike the conventional topological indices, the atomic level topological indices code the structural environment of each atom type * Corresponding author phone/fax: +86-20-8711-2886; e-mail: renbiye@ 163.net.

in a molecule and further describe the structural information of a molecule at the atomic level, especially since they offer the possibility of understanding the role of individual atom types or groups in a molecule such as -OH, -COOH, -NH2, >CO, etc. One of the most interesting indices of this type is the electrotopological state (E-state) index introduced by Kier and Hall.11 The E-state indices have been successfully used in a variety of QSPR/QSAR studies.10-14 Very recently, a new type of atom-type-based topological index, AI, was proposed in our laboratory.15 The AI indices have been shown to be very effective in modeling six physical properties of alkanes,16 physical properties and biological activities of alcohols,17 three physical properties of a mixed set of alkanes and alcohols,18 gas chromatographic retention index of a combined set of aldehydes and ketones,19 and the normal boiling points (BP) of a mixed set of sulfides and thiols.20 However, the potential of these AI indices in QSPR/ QSAR modeling has yet to be further verified in different systems. The boiling point of a compound is an important physical property for the design and optimization of the processing engineering in chemical and petrochemical industries. Therefore, it is required to develop quantitative structure-boiling point models for estimation of boiling points of untested compounds or selection of reliable experimental boiling points of compounds.21 Recently, Toropov and Toropova22 studied the structure-boiling point relationships for a data set of 200 carbonyl compounds (mono- and dialdehydes, mono- and diketones, keto aldehydes, and esters of monocarboxylic acids). The best one-variable model by means of optimization of correlation weights of local graph invariants

10.1021/ci025651o CCC: $25.00 © 2003 American Chemical Society Published on Web 05/21/2003

1122 J. Chem. Inf. Comput. Sci., Vol. 43, No. 4, 2003

REN

based on nearest neighboring codes was obtained and produced a standard error of s ) 6.12 °C for the training set and s ) 6.0 °C for the test set. In the present study, to further illustrate the usefulness of the AI indices in QSPR studies, a mixed set of 184 oxygencontaining compounds, including 77 acyclic aliphatic ethers, 33 aldehydes, and 77 ketones with a wide range of nonhydrogen atoms, are chosen to construct the structure-boiling point models using the Xu and AI indices. In addition, we wish to see what structural features or groups are important to the boiling points of oxo compounds using these indices. 2. METHOD

For a molecular graph G ) {V, E} with n vertices, where V (G) and E (G) are the vertex and edge sets, respectively, vertices correspond to individual atoms in a graph G, and edges correspond to chemical bonds between them. The vertex-adjacency matrix, A ) [aij]n×n, is a square symmetric matrix. The elements aij of matrix A are 1 if vertices i and j are adjacent and 0 otherwise, where n is the number of vertices. The distance matrix, D ) [dij]n×n, is also a square symmetric matrix. The entries dij of matrix D are the length of the shortest path between the vertices i and j in a graph G. The sum over row i or column j of matrix A yields local vertex-degree Vi; analogously, the sum over row i or column j of matrix D yields distance sums si. The Xu index can be expressed below9

Xu ) n1/2log

( / ) n

n

i)1

i)1

∑Vi si2 ∑Vi si

(1)

where the sum is over all i vertices in a molecular graph. For any atom i that belongs to jth atom type in a graph G, the corresponding topological index value is expressed as15

AIi (j) ) 1 + φi (j)

(2)

with

φi (j) ) Vi(j)si2(j)

/

n

Vi si ∑ i)1

(3)

where the parameter φ is considered as a perturbing term reflecting the effects of the structural environment of ith atom on its topological index value, and the sum is over all i vertices in a graph G. According to the above definition, for jth atom-type in a graph G, the corresponding atom type topological index, AI(j), is the sum of all AIi(j) values of the same atom type l

l

i)1

i)1

AI(j) ) ∑ AIi (j) ) l + ∑ φi (j) ) l

/

n

l + ∑ Vi (j)si2(j) ∑Vi si (4) i)1

i)1

where l is the count of the atoms or groups of the same type. Clearly, the AI index depends on both the account of the same atom-types and their microenvironment in molecule. To modify these indices for complex compounds, a new type of vertex degree Vm was proposed to differentiate

heteroatoms and carbon atoms with multiple bonds in a graph G17

Vm ) δ + k

(5)

k ) 1/[(2/N)2δV + 1]

(6)

where δ is the number of connections (edges) of that atom; parameter k is a perturbing term reflecting the effects of heteroatoms and multiple bonds; N is the principal quantum number of the valence shell; and δV is the valence connectivity of Kier-Hall.5,6 For heteroatoms in a graph G, the δV value is expressed as

δV ) (Z V - h)/(Z - Z V - 1)

(7)

The δV value of carbon atoms with multiple bonds in a graph G is expressed as

δV ) Z V - h

(8)

where h is the number of hydrogen atoms connected to the heteroatom. Z and ZV are the atomic number and the number of valence electrons for heteroatoms, respectively. Consequently, for any molecular graph, both the Xu and AI indices can be expressed with the same formula defined above (eqs 1 and 4) by using proposed vertex degree Vm instead of the local vertex degree Vi. This approach has been shown to be very effective in modifying these indices for structure-property/activity correlations of organic compounds.15-20 3. DATA SET AND REGRESSION ANALYSIS

Data Set. The data set consists of 205 oxo compounds including 86 ethers, 33 aldehydes, and 86 ketones with up to 17 non-hydrogen atoms (Table 1). Most of the experimental BP values for 86 ethers are taken from refs 23-27, and some are from ref 28 and Beilstein’s Handbuch der Organischen Chemie. The majority of experimental BP data for aldehydes and ketones are collected from refs 23-27, and a few compounds are from ref 29. In the majority of the cases, the data among these sources agree fairly well, but there are some discrepancies between different sources. For compounds with a boiling point range of 1-4 °C we quote their arithmetic mean BP. However, the main problem with exploiting QSPR/QSAR models is the need for reliable experimental BP values. Hence, when a compound has different experimental boiling points reported in the literature, we use chemical intuition to make a choice between the disagreed values. The data set is analyzed from two different approaches to evaluate the applicability of these indices in complex systems. First, the whole data set is divided into two subsets: one contains only 86 ethers and the other 119 carbonyl compounds. Approximately 10% of compounds are randomly selected from the original subset as a test set, while the other 90% remained as the training set, and then individual correlations for the two training subsets and the whole training set are carried out. Regression Analysis. Multiple linear regression (MLR) method using the modified Xu (represented as X mu ) and all AI indices present in molecules as variables is performed to

ATOM-TYPE-BASED AI TOPOLOGICAL DESCRIPTORS

J. Chem. Inf. Comput. Sci., Vol. 43, No. 4, 2003 1123

generate linear structure-boiling point models. The final models have the following general form (eq 9)

BP ) a0 + a1X mu + b1AI(1) + ... + bj AI(j)

(9)

where a0 is a constant, a1 is the contribution coefficient of the X mu index, and bj is the contribution coefficient of jth atom type (group) AI index, AI(j). Each coefficient describes the sensitivity of BP to each of the individual indices, so the coefficients of these parameters would measure the relative importance of each index. The significance of each variable is evaluated by monitoring the t-values and Fisher ratios (F) so as to choose the powerful descriptors.30-32 The quality of the model is considered as statistically satisfactory on the basis of the correlation coefficient (r), standard deviation (s), and F values when all the variables in the final model are significant at the 95% confidence level. Model Validity. Two strategies are used to test the validity of the selected MLR model. First, the model for the training set is tested by the cross-validation using the leave-one-out method. In this way, each time one molecule is removed from the training set a model is developed with the remaining N-1 molecules. At each step, the BP value of the removed molecule is predicted by the model obtained. This process is repeated until each molecule is predicted once. In principle, the predictive ability of the model is quantified in terms of the corresponding leave-one-out cross-validated parameters, rcv and scv values, which are defined as33

rcv2 ) 1.0 - PRESS/SSY

(10)

scv ) xPRESS/(N - M - 1)

(11)

where PRESS is the predicted residual sum of square and SSY is the sum of the squares of the experimental values. N is the number of samples used for model building. M is the number of descriptors involved in the model. In addition, the actual prediction ability of the model is validated using an external prediction set not included in the training set. The performance of the model (its predictive ability) can be given by the predictive correlation coefficient (rpred) (or the PRESS/SSY ratio) and the standard error (spred) of prediction, which is defined as34

spred ) xPRESS/N

(12)

For a reliable model, the PRESS/SSY ratio should be < 0.6 or the rcv2 or rpred2 values should be > 0.6. The model is considered to be excellent if the PRESS/SSY ratio is < 0.1. 4. RESULTS AND DISCUSSION

In general, there are two different directions in which the MLR analysis is usually used in QSPR/QSAR studies.35,36 The first approach is based on a large number of structurally diverse compounds; the other way only deals with a smaller group of structurally related compounds. Both approaches have their merits and faults, and they serve somewhat different purposes. However, the best structure-property/ activity correlations can be constructed for a rather narrow series of homologous compounds. In addition, it may be possible to manifest the role of all possible atom types or

groups in molecules and their relative importance to the properties or activities studied from the QSPR/QSAR models obtained for a small homologous set of compounds.37 In this study, first the separate correlations for the two data subsets are analyzed; second, the study is further extended to the whole data set of oxygen-containing compounds. Ether Subset. The subset of 86 ethers is randomly divided into a training set and a prediction set. The 77 ethers as the training set are used to generate the boiling point models in terms of the X mu index and all five AI indices present in molecules. Two suitable models are obtained (Table 2). As an illustration, the best five-parameter model (eq 13), along with the statistical parameters and the leave-one-out crossvalidated statistics, is shown as follows:

BP ) - 66.4027 (( 2.4346) + 49.2263 (( 1.2710) X mu + 3.9722 (( 0.6158) AI(-O-) 0.6742 (( 0.09244) AI(>CH2) 2.1384 (( 0.2326) AI(>CH-) 2.4418 (( 0.3507) AI(>CCH2) AI(>CH-) AI(>CCH2) AI(>CH-) AI(>CCH2) AI(>CH-) AI(>CCd) indices for ketones. d Calculated from the two subset correlations. e Calculated from the whole data set correlation. f From ref 29.

the criterion on high quality QSPR models proposed by Mihalic´ and Trinajstic´38 (e.g., r > 0.995 and s < 5.0 °C for BP). Of course, the linear regression using a single X mu index produces a slightly poor correlation for 77 compounds and gives r ) 0.9983 and s ) 6.47 °C. Obviously, the standard error in the best model (eq 13) is significantly reduced by 49% compared to the linear model with the X mu index. The results support our anticipation (eq 9) and also

show the usefulness of these indices in structure-boiling point correlations of ethers. Recently, Balaban et al.28 reported two best structureboiling point models in terms of three topological indices for the same series of 72 ethers. In their study, the first model based on the first-order connectivity 1χ index and Balaban’s modified Jhet shape index produced r ) 0.993 (r2 ) 0.978) and s ) 6.49 °C. The second model with the electrotopo-

ATOM-TYPE-BASED AI TOPOLOGICAL DESCRIPTORS

J. Chem. Inf. Comput. Sci., Vol. 43, No. 4, 2003 1127

Table 2. MLR Models Obtained for the Ether Subset Based on the X mu and AI Indices model 1 indices

coefficients

constant X mu AI(-O-) AI(-CH3) AI(>CH2) AI(>CH-) AI(>CCH2 >CH>C
AI(>CH2) (4.5%) > AI(>CH-) (3.0%) > AI(>CCH2), AI(>CH-), and AI(>CCd) AI(-CH3) AI(>CH2) AI(>CH-) AI(>CCd(O) -CH3 >CH2 >CH>C
CHd) 6.2511 (( 2.2958) AI(>Cd) 0.4365 (( 0.1058) AI(>CH2) 1.8273 (( 0.2046) AI(>CH-) 3.3337 (( 0.0.2686) AI(>C AI(>Cd) (7.0%) > AI(-CHd) (4.1%) >AI(>CH2) (2.5%) > AI(>CH-) (1.5%) > AI(>CCO or -CHO). The slightly large ψf value for the AI(dO) index may be attributed to the influence of oxygen atoms of carbonyl groups, which would create polar interactions between molecules and thus

ATOM-TYPE-BASED AI TOPOLOGICAL DESCRIPTORS

J. Chem. Inf. Comput. Sci., Vol. 43, No. 4, 2003 1129

Table 4. MLR Models Obtained for the Entire Data Set Based on the X mu and AI Indices model 1 indices constant X mu AI(-O-) AI(dO) AI(-CHd) AI(>Cd) AI(-CH3) AI(>CH2) AI(>CH-) AI(>CCO or >CHd) move from the peripheral toward to the middle of the chain. This fact implies that the CdO bond of carbonyl groups in molecules behaves like a pseudobranch (although the CdO bond of aldehydes is present in the main chain). Hence, the AI(>Cd) and AI(-CHd) indices, along with the AI(>CH2), AI(>CH-), and AI(>CCH-) 14.9103 (( 0.7660) AI(>Cd) 0.3835 (( 0.07391) AI(>CH2) 1.7079 (( 0.1820) AI(>CH-) 2.8046 (( 0.2627) AI(>CH-) (17) r ) 0.9977; rcv ) 0.9974; s ) 3.99; scv ) 4.16; F ) 4834; P < 0.0001; N ) 184 All parameters are highly significant according to the t-values (p < 0.0001). This model explains more than 99% of the variance in the experimental BP values for 184 compounds. One can see again that a combined use of the X mu and AI indices produces a good correlation and gives r ) 0.9977 and s ) 3.99 °C for the training set and rcv ) 0.9974 (rcv2 ) 0.9949) and scv ) 4.16 °C for the cross-

errors

t-values

-54.1919 44.7638 3.9919 33.8787 -19.7491 -14.9103

2.1938 0.9714 0.6283 1.5288 0.9747 0.7660

-24.70 46.08 6.353 22.16 -20.26 -19.46

-0.3835 -1.7079 -2.8046

0.07391 0.1820 0.2627

-5.189 -9.384 -10.68

corresponding atom types

-OdO -CHd(O) >Cd(O) -CH3 >CH2 >CH>C
3 × s). The final model (eq 17) is therefore considered as an excellent QSPR model judging from the statistics and the plot in Figure 3. The reason acetaldehyde and acetone have larger residuals is not clear. Of course, if the two compounds are excluded from the combined set, a better correlation using the same variables can be obtained (r ) 0.9980 and s ) 3.67 °C) and gives rpred ) 0.9958 and spred ) 3.92 °C. On the other hand, the ψf values of individual indices are calculated as follows: X mu (58.3%), AI(dO) (21.7%), AI(-O-) (1.8%), AI(>Cd) (8.7%), AI(-CHd) (4.1%), AI(>CH2) (1.9%), AI(>CH-) (1.5%) > AI(>C