Investigation of the Relationships between Molecular

Jan 15, 2009 - the variations of the descriptor values in homologous series. It .... residual (CNRj) ensure that the selected descriptors contain valu...
2 downloads 0 Views
Ind. Eng. Chem. Res. 2009, 48, 9723–9734

9723

Investigation of the Relationships between Molecular Structure, Molecular Descriptors, and Physical Properties Inga Paster,† Mordechai Shacham,*,† and Neima Brauner‡ Department of Chemical Engineering, Ben-Gurion UniVersity of the NegeV, Beer-SheVa 84105, Israel, and School of Engineering, Tel-AViV UniVersity, Tel-AViV 69978, Israel

The use of databases containing thousands of molecular descriptors, including 3-D descriptors, for predicting physical properties is discussed. It is shown that the use of 3-D descriptors for property prediction via quantitative structure property relations (QSPR) limits considerably their applicability, as 3-D structure files must be obtained from the same reliable source for all predictive and target compounds. A modified targeted QSPR (TQSPR) algorithm is presented, which includes a new technique for selecting training sets belonging to the homologous series of the target compound (if such compounds are available in the database). The method is employed for predicting seven properties for five homologous series. It is shown that most properties can be predicted on experimental error level, using training sets of 10 compounds and a maximum of 2 (non 3-D) descriptors. The exclusion of the 3-D descriptors enhances considerably the applicability of the TQSPRs, and the use of a small number of descriptors reduces the probability of “chance correlations”. Introduction Pure-compound property data are at present available only for a small fraction of the compounds, pertaining to such diverse areas as chemistry and chemical engineering, environmental engineering environmental impact assessment, hazard, and operability analysis. Therefore, methods for reliable prediction of property data are needed. Current methods used to predict physical and thermodynamic properties can be classified into group contribution methods,1 methods based on the corresponding-states principle,2 asymptotic behavior correlations,3,4 and quantitative structure property relationships.5 Recently we have developed the targeted QSPR (TQSPR) method,6,7 which enables predicting properties within experimental error level. Unlike in the traditional QSPR methods, the TQSPR method is targeted to a particular compound, or a group of compounds, and relies on the identification of a relatively small number of structurally similar compounds. Hence, it can provide reliable predictions and estimates of the prediction error, while avoiding the need to model the highly nonlinear relationships between molecular descriptors and properties that may require a large amount of experimental data. In recent years computer programs that can calculate several thousands of descriptors have emerged. This raised concerns regarding the accuracy and consistency of the molecular descriptors used and the probability of obtaining chance correlations,8 while applying stepwise regression procedures to large descriptor databases in order to obtain a QSPR or TQSPR for representing a particular property. It is practically impossible to check the accuracy and consistency of the individual-descriptor values because of the large number of descriptors and compounds involved. However, some of the inconsistencies can be detected when examining the variations of the descriptor values in homologous series. It is well-known that most physical properties change in a consistent manner in homologous series when plotted versus the number of carbon atoms (nC)3,4 and many molecular * To whom correspondence should be addressed. E-mail: shacham@ bgu.ac.il. Tel.: +972-8-6461481. Fax: +972-8-6472916. † Ben-Gurion University of the Negev. ‡ Tel-Aviv University.

descriptors follow the same trend as properties.9 In this manuscript we investigate the behavior of molecular descriptors associated with some homologous series in order to identify clear trends in the behavior of the descriptors and to find potential causes of inconsistencies among the descriptors. The probability of obtaining a chance correlation increases considerably when a large number of descriptors are used in the QSPR or TQSPR.8 In the following we investigate the option of using one or a maximum two-descriptor TQSPR for predicting gas, liquid, and solid properties within their experimentalerror level. The Targeted-QSPR Method. The targeted QSPR (TQSPR) technique is described in detail by Brauner et al.,6 Shacham et al.,7 and Kahrs et al.10 The basic principles of the method are briefly reviewed here. Let us assume that the vector of properties of the target compound yt (the dependent variable) is potentially related to a set of m vectors of properties of predictive compounds (independent variables) x1, x2,..., xm. The following partition of the yt and x vectors to subvectors is used: yt )

{ }

yct ; ypt

xi )

{ } xci xpi

(1)

where yct is an N vector of known properties, ypt is a K vector of unknown properties. Both the N vector xci and the K vector xpi contain known properties. Typically, the subvectors yct and xci contain properties, which are directly related to the molecular structure and can be calculated with high accuracy (molecular descriptors), while the subvectors ypt and xpi contain measured properties with various levels of experimental error. The practical application of the TQSPR method requires preparation of a bank of potential predictive compounds as a database. The same set of molecular descriptors must be defined for all compounds included in the database, while the span of the molecular descriptors should reflect the difference between the compounds in the database. Having the corresponding molecular descriptors for a target compound, yc+, defined as well, a stepwise regression procedure is applied in order to identify the most appropriate predictive compounds to be included in the similarity group and the training set associated

10.1021/ie801318y CCC: $40.75  2009 American Chemical Society Published on Web 01/15/2009

9724

Ind. Eng. Chem. Res., Vol. 48, No. 21, 2009

with the particular target compound. The similarity between potential predictive compounds and the target compound is measured by the partial correlation coefficient, rti, between the vector of the molecular descriptors of the target compound, yct, and that of a potential predictive compound xci. The partial correlation coefficient is defined as rti ) yjctxjTci, where yjct and xjci are row vectors, centered (by subtracting the mean) and normalized to unit length (by dividing by the Euclidean norm of the vector). Absolute rti values close to one (|rti| ≈ 1) indicate high correlation between vectors yct and xci, and thus, a high level of similarity between the molecular structures of the target compound and the predictive compound i. Various alternative methods of descriptor scaling, measures for similarity of structures, and clustering algorithms for indentifying the similarity group were tested by Kahrs et al.10 The similarity group is used to obtain a training set for the development of the QSPR for compounds that has been identified as structurally similar to the target compound. The training set is established by selecting the np compounds with highest |rti| value for which experimental property values ypi are available. The remaining compounds in the similarity group are used for validation. For development of a TQSPR for a particular property of the target compound, a linear structure-property relation is assumed of the form yp ) β0 + β1ζ1 + β2ζ2...βmζm + ε

(2)

where yp is a np-dimensional vector of the respective property values, ζ1, ζ2,..., ζm are np-dimensional vectors of predictive molecular descriptors, β0, β1, β2,..., βm are the corresponding model parameters to be estimated, and ε is a np-dimensional vector of random measurement errors. A stepwise regression program is used to determine which molecular descriptors should be included in the TQSPR to best represent the measured property data of the training set and to calculate the TQSPR parameter values. In each step one molecular descriptor that reduces the prediction error most strongly is added into the model. The descriptors are selected to the model in a stepwise manner according to the value of the partial correlation coefficient, |Fyj| between the vector of the property values yp, and that of a potential predictive descriptor ζj. The partial correlation coefficient is defined as Fyj ) yjpζjTj , where yj and ζjj are row vectors, centered (by subtracting the mean) and normalized to a unit length. Values close to 1 indicate high correlation between the molecular descriptor and the property. The TQSPR so-obtained can be subsequently employed for calculating estimated property values for the target compound and for other compounds in the similarity group that do not have measured data by y˜pt ) β0 + β1ζt1 + β2ζt2... + βmζtm

(3)

where y˜pt is the estimated unknown property value of the respective compound and ζt1, ζ t2,..., ζtm are its corresponding molecular descriptor values. The selection of a suitable set of predictive molecular descriptors for eq 2 is a challenging problem, since the number of candidates is in the order of θ(103), which prohibits the determination of the best of all possible sets of predictive molecular descriptors by a full search procedure. The stepwise regression program SROV11 is used, which selects in each step one molecular descriptor that reduces the prediction error most strongly. Two criteria for measuring the signal-to-noise ratio in the jth candidate descriptor (TNRj) and in the partial

correlation of the jth candidate descriptor with the prediction residual (CNRj) ensure that the selected descriptors contain valuable information. A detailed description of the TNR and CNR criteria and further algorithmic details can be found in Shacham and Brauner.11 Additionally, model refinement (i.e, addition of more descriptors to the model) stops when the variance of the model prediction error for the training set s2 )

1 p-m-1

np

∑ (y

- (β0 + β1ζ1i + β2ζ2i... + βmζmi))2

pi

i)1

(4) falls below a prespecified threshold value sgoal2. For an optimally refined model, the prediction error should be approximately as large as the measurement error that is present in the property data. Thus, the threshold value sgoal2 can be estimated from the relative measurement error εri of the property data by 2

sgoal

1 ) p

np

∑ (ε y

2

ri pi)

(5)

i)1

Methodology. The Dragon program12,13 was used to calculate 1664 descriptors for the compounds in the database from minimized energy molecular models. The molecular geometries were optimized using the CNDO (complete neglect of differential overlap) semiempirical method implemented in the HyperChem package.14 The number of descriptors was reduced to 1280 by removing the descriptors that had the same value for all compounds. Property data (measured and predicted) were taken from the DIPPR15 and NIST16 databases. A modified version of the stepwise regression program (SROV)11 was used for the identification of the most appropriate QSPRs. Similarity of n-Hexane and n-Heptane, a Motivating Example. Compounds that are immediate neighbors in homologous series are considered as very similar. Shacham et al.17 demonstrated the similarity by plotting molecular mass and some measured properties of n-hexane versus those of n-heptane. The measured properties include the critical temperature, critical volume and compressibility factor, melting point, triple point temperature and pressure, normal boiling temperature, liquid molar volume, refractive index, flash point, lower and upper flammability limits, and lower flammability limit temperature. A straight line could be fitted to the data with slope of 1.069 and correlation coefficient value of R2 ) 0.9994. Shacham et al.17 prepared a similar plot which included normalized values of 99 descriptors18 for the two compounds. A straight line could be fitted to the descriptors, as well, with similar slope of 1.088 and R2 ) 0.9965. Thus, the similarity between the two compounds is represented by collinearity between the vectors of their properties and the vectors of their descriptors. In Figure 1 the 1280 normalized Dragon descriptors of n-heptane are plotted versus these of n-hexane. Observe that there is a considerable decrease in the correlation coefficient value (R2 ) 0.917), the slope is reduced to 0.982. A possible source of the deterioration of the correlation is the existence of significant number of descriptors which are apparently outliers. It is important to investigate such anomalies as they may reduce the precision of the property prediction. Analysis of the Molecular Descriptor Database. The Dragon program generates the following types of descriptors: 0-D (e.g., molecular weight), 1-D (e.g., numbers of functional groups, atoms, bonds), 2-D (e.g., topological descriptors), 3-D (e.g., geometrical descriptors) charge and molecular properties (e.g., drug-like index, toxicity) descriptors.

Ind. Eng. Chem. Res., Vol. 48, No. 21, 2009

Figure 1. Plot of normalized molecular descriptors of n-heptane versus those of n-hexane.

The calculation of molecular descriptors is a complex process which may lead to different types of inconsistencies. A major source of inconsistencies is the optimization of the 3-D molecular geometries. The software used may yield different results depending on the tolerances used and may even converge to a local minimum, and the calculated values of some of the descriptors are highly sensitive to small variations in the (optimized) molecular geometry. Another source of difficulty is when dealing with descriptors which are undefined for some of the compounds. In some cases the descriptor values are categorized as “missing data” (marked by a value of -999 by Dragon) or assigned a value of zero. Obviously, an assigned zero value (e.g., a descriptor value of n-alkane with n1 carbon atoms, which is defined for n > n1) may not have the same role as a zero value of a physical significance (such as the number of oxygen atoms in a hydrocarbon). To check the accuracy and consistency of the descriptors we plot them versus nC. This procedure will be demonstrated for the 1-alkene homologous series. In Table 1 the members of these series, nC and the values of the descriptors GGI1, ADDD, AGDD, ASP, H4m, Gm, EEig10d, PJI2, ICR, and Mor11e are shown. Each of these descriptors describes in a specific way the chemical structure, and its physical meaning (hence its variation with nC) is defined by the feature of the structure it represents.19 Short descriptions of these descriptors (from Dragon’s help section, where references to their calculations are provided) are shown in Table 2 (additional descriptors used in this study are listed in the Appendix). For a homologous series represented by H(CH2)nR, the relative effect of the R group is different for various descriptors, hence they exhibit different variation with increasing nC. The values of the descriptor GGI1 (4th column of Table 1) is an example of Category I descriptors which are constant for the whole series. This descriptor belongs to “topological charge indices”,12,19,20 which reflects the distribution of charge within molecular structure and is unaffected by adding a CH2 group (due to symmetry). Therefore, the topological charge indexes are constant for all the members of the series. The descriptor ADDD versus nC for the 1-alkene series is depicted in Figure 2. It is a geometrical descriptor, which represents the average value of the elements of a matrix DD (DDij ) Gij/Dij, where G and D are the geometrical distance and topological distance matrixes, respectively). The descriptor value increases linearly with nC, following the relationship

9725

ADDD ) 3.364nC - 3.388 with R ) 0.9995. A similar linear (or nearly linear) change can be observed, for example, for the liquid molar volume of n-paraffins and n-olefins.3 The normalized values of the descriptors AGDD, ASP, and H4m for the 1-alkene series are plotted versus nC in Figure 3. In all three cases the descriptor value increases nonlinearly with nC. For AGDD the change is monotonic with an increasing slope. A similar behavior can be observed for critical volume (Vc) of n-paraffins and n-olefins.3 For the descriptor ASP the change is monotonic with a decreasing slope, and the descriptor approaches a constant value for high nC. Similar behavior can be observed for normal boiling and critical temperatures (Tc) of n-paraffins and n-olefins.3 The variation of the descriptor H4m with nC is similar to that of ASP except that there is an abrupt change in the slope (and the level of the nonlinearity) at a particular value of nC. The plot of the descriptor Gm versus nC (Figure 4) does not reveal any consistent variation, and such a behavior will be designated as random. The values of the descriptor Eeig10d for 1-alkene are shown in Figure 5. The value of the descriptor is zero for up to nC ) 10; at this point it abruptly changes to -1.9 and increases monotonically in a nonlinear manner. The definition of Eeig10d (from DRAGON’s Help) is “eigenvalue 10 from edge adjacency matrix weighted by dipole moments”. It is obvious that there are two separate regions with regard to the values of this descriptor for the 1-alkene series. For nC < 11 the descriptor is assigned a zero value, while for nC g 11 there is a functional relationship between the molecular structure and the descriptor value. Additional examples of the same behavior are V descriptors are high charge indexes (related to high intermolecular distances). As they are not defined for small molecules, they are assigned by DRAGON a value of zero (Category V descriptor). A plot of ICR and PJI2 versus nC (Figure 6) shows that these descriptors are characterized by a different behavior of the odd and even carbon number compounds. The value of PJI2 is constant for odd nC and nonlinearly, monotonically increasing for even nC. In the case of ICR, two separate monotonically increasing curves for odd and even nC compounds are observed. Such a behavior is typical for normal melting temperature (Tm) n-paraffins and n-olefins.3 Indeed, the calculation of ICR is based on Shanon’s21 entropy, and therefore its variation with nC can be expected to be correlated with the variation of Tm. The descriptor Mor11e (Figure 7) exhibits a periodical (sinusoidal) change of this descriptor with nC. A similar descriptor analysis and characterization was carried out for members of the n-alkane, 1-alkene, cycloalkane, alkylcyclohexane, alkyl-cycloheptane, n-alkylbenzene, n-alcohol, and n-alyphatic acid homologous series for 1280 descriptors of the database. The results of this study are summarized in Table 3. Most of the descriptors (types II, IIIA and IIIB, 46.2% of the total) increase monotonically with nC. Additional 21.9% are defined at zero for nC lower than a threshold value and increase monotonically thereafter. Thus, approximately 68% of the descriptors are correlated with nC. Approximately 8.5% of the descriptors (type I) have constant value for a particular series. A large portion of the descriptors (22.9%, type IV descriptors) appears to vary inconsistently with nC, apparently, in a random manner. There are only 0.4% descriptors with separate curves for odd and even carbon number compounds (type VI) and 0.2% descriptors with a periodic (sinusoidal) change (type VII). In the last column of Table 3 the percentages of the 3D descriptors in each category are listed. Among type II and IIIA descriptors, which are the important ones for most properties, 2

9726

Ind. Eng. Chem. Res., Vol. 48, No. 21, 2009

Table 1. Selected Descriptor Values for the 1-Alkene Homologous Series no.

compd

nC

GGI1

ADDD

AGDD

ASP

H4m

Gm

EEig10d

PJI2

ICR

Mor11e

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

propylene 1-butene 1-pentene 1-hexene 1-heptene 1-octene 1-nonene 1-decene 1-undecene 1-dodecene 1-tridecene 1-tetradecene 1-pentadecene 1-hexadecene 1-heptadecene 1-octadecene 1-nonadecene 1-eicosene 1-heneicosene 1-docosene 1-tricosene 1-tetracosene 1-pentacosene 1-hexacosene 1-heptacosene 1-octacosene 1-nonacosene 1-triacontene

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

8.250 11.196 14.197 17.226 20.318 23.448 26.627 29.840 33.091 36.371 39.681 43.015 46.373 49.752 53.150 56.566 59.999 63.445 66.906 70.380 73.867 77.365 80.872 84.389 87.916 91.452 94.995 98.547

18.964 29.661 42.653 57.807 75.418 95.377 117.83 142.71 170.11 199.97 232.36 267.24 304.64 344.56 387.01 431.98 479.48 529.5 582.07 637.16 694.8 754.97 817.67 882.9 950.69 1021 1093.9 1169.3

0.553 0.548 0.686 0.741 0.809 0.843 0.877 0.897 0.916 0.927 0.939 0.947 0.954 0.959 0.964 0.968 0.971 0.974 0.976 0.978 0.980 0.982 0.983 0.985 0.986 0.987 0.988 0.989

0.001 0.002 0.006 0.012 0.015 0.02 0.025 0.042 0.074 0.098 0.12 0.146 0.165 0.184 0.201 0.216 0.229 0.242 0.254 0.264 0.274 0.283 0.291 0.298 0.305 0.312 0.318 0.324

0.386 0.286 0.234 0.205 0.254 0.183 0.181 0.254 0.239 0.197 0.180 0.181 0.171 0.243 0.228 0.231 0.210 0.220 0.281 0.251 0.305 0.232 0.336 0.253 0.336 0.243 0.322 0.264

0 0 0 0 0 0 0 0 -1.9 -1.678 -1.416 -1.157 -0.921 -0.711 -0.519 -0.337 -0.163 0 0.151 0.289 0.415 0.529 0.632 0.726 0.812 0.89 0.96 1.025

1 0.5 1 0.667 1 0.75 1 0.8 1 0.833 1 0.857 1 0.875 1 0.889 1 0.9 1 0.909 1 0.917 1 0.923 1 0.929 1 0.933

0.918 1 1.522 1.585 1.95 2 2.281 2.322 2.55 2.585 2.777 2.807 2.974 3 3.146 3.17 3.301 3.322 3.44 3.459 3.567 3.585 3.684 3.7 3.792 3.807 3.892 3.907

-0.380 -0.473 -0.627 -0.799 -0.912 -0.997 -1.034 -1.055 -1.035 -1.007 -0.947 -0.885 -0.802 -0.719 -0.627 -0.551 -0.477 -0.414 -0.372 -0.335 -0.321 -0.335 -0.362 -0.415 -0.494 -0.586 -0.700 -0.837

Table 2. Definition of the Descriptors Shown in Table 1 (from the Help Section of Dragon) name of descriptor nC GGI1 ADDD AGDD ASP H4m Gm EEig10d PJI2 ICR Mor11e

definition

type

class

number of carbon atoms topological charge index of order 1 average distance/distance degree average geometric distance degree asphericity H autocorrelation of lag 4/weighted by atomic masses G total symmetry index/weighted by atomic masses eigenvalue 10 from edge adj. matrix weighted by dipole moments 2D Petitjean shape index radial centric information index 3D-MoRSE - signal 11/weighted by atomic Sanderson electronegativities

0-D 2-D 3-D 3-D 3-D 3-D 3-D 2-D 2-D 2-D 3-D

constitutional descriptors topological charge indices geometrical descriptors geometrical descriptors geometrical descriptors GETAWAY descriptors WHIM descriptors edge adjacency indices topological descriptors topological descriptors 3D-MoRSE descriptors

only 41.7% and 32.1%, respectively, are 3D descriptors. In contrast, all the type VI descriptors, that are used for Tm prediction, are 3D. The large percentage of the 3D descriptors in category IV raises the suspicion that some of the inconsistency may result from the route adopted for obtaining the 3-D representation of the molecule. This assumption will be examined in the next section.

Figure 2. Plot of the average distance/distance degree (ADDD) geometrical descriptor versus nc for the 1-alkene homologous series.

The Influence of the 3-D Structure on the Molecular Descriptor Values. Molecular structure files are available for many compounds in the NIST16 database. The structure is defined in 2-D and 3-D MOL files. The 2-D MOL files are available for all the compounds included in the NIST database, while 3-D MOL files are available only for part of the compounds. Alternatively, commercial programs such as HyprChem14 and Gaussian22 can be used to draw the molecules and to generate 2-D MOL and 3-D MOL files. The molecule’s structure (in the form of 2-D or 3-D MOL files or similar structure files) can be loaded into the Dragon program to generate molecular descriptors. Obviously, a 2-D MOL file cannot represent the 3-D molecular structure, and it is essential to use the 3-D optimized structure in order to obtain meaningful values of the 3-D descriptors. The generation of the 3-D structure (and the respective MOL file) requires minimization of the total energy of the molecule. The computation of the 3-D structure may involve the use of several different programs in order to increase the probability of convergence to a global minimum, and most programs allow use of different models for representing the energy and various convergence tolerances. For a more detailed discussion of this subject, including references, see for example the “Computed 3-D Structures” section in the NIST database (http://webbook. nist.gov/chemistry/3d-structs/).

Ind. Eng. Chem. Res., Vol. 48, No. 21, 2009

9727

Figure 3. Plot of the normalized values of the descriptors AGDD, ASP, and H4m versus the number of carbon atoms for the 1-alkene homologous series.

Figure 4. Plot of the descriptor Gm versus the number of carbon atoms for the 1-alkene homologous series.

Figure 5. Plot of the descriptor EEig10d versus the number of carbon atoms for the 1-alkene homologous series.

To study the influence of the structure representation in 2-D and 3-D MOL files on the values of the molecular descriptors we have generated Dragon descriptors using six different MOL files:

I. 2-D MOL files from the NIST database; II. 3-D mole files generated from 2-D NIST MOL files using the Gaussion 03 program,22 semiempirical molecular orbital theory, known as PM3 with a “tight” convergence tolerance; III. 3-D mole files generated from 2-D NIST MOL files using the HyperChem 0514 complete neglect of differential overlap (CNDO) theory, and a convergence tolerance of ∆rms < 0.1; IV. 3-D NIST MOL file; V. the same as file II but started from a 3-D NIST MOL file; VI. the same as file III but started from a 3-D NIST MOL file. Table 4 shows selected descriptor values for 1-tridecanol that were obtained using the above listed six different options. For the 3-D descriptors there can be very large differences between the various options (see for example the descriptors Mor12u, RDF075m, and As in the first 3 rows of Table 4). In some cases the initial configuration used (for calculating the optimized 3-D structure) has a major effect on the descriptor values (see for example descriptors HATSp and H2u in rows 9 and 8). Nevertheless, some of the 3-D descriptors can be insensitive to the minimization method used (see for example, H0u and MEcc, rows 14 and 19, respectively). Note that 2-D, 1-D, and 0-D descriptors are not included in Table 4 as their values are exactly the same regardless of the computational method used. Values of the 3-D descriptor Mor30v, computed with three different techniques, are shown in Figure 8 for the first 13 members of the 1-alcohol homologous series. In all cases the starting configuration of the 3D NIST MOL file was used. Observe that the descriptor values are very similar for the Gaussian and HyperChem program computations when the PM3 algorithm is used in both programs. Using the CNDO algorithm with the HyperChem program yields completely different results and the differences increase with increasing nC. This part of the study shows that 3-D descriptors can be used with some confidence in QSPRs if the 3-D MOL files of the target and predictive compounds come from the same reliable source. Thus, the inclusion of 3-D descriptors in a TQSPR can limit considerably its practical use in cases where such 3-D

9728

Ind. Eng. Chem. Res., Vol. 48, No. 21, 2009

Figure 6. Plot of the descriptors: 2-D Petitjean shape index (PJI2) and the radial centric information (ICR) index versus nc for the 1-alkene homologous series.

Figure 7. Plot of the descriptor Mor11e versus the number of carbon atoms for the 1-alkene homologous series.

MOL files are not available for the target compounds. It can be very beneficial to develop TQSPRs without 3-D descriptors, if similar property prediction accuracy to QSPRs which do include 3-D descriptors can be achieved. This option will be investigated in the following sections. Selecting Members of the Homologous Series to the Training Set. For the derivation of a reliable TQSPR (which can provide a reliable estimate for the target-compound property value), it is essential that the training set contains predictive compounds which are “similar” to the target. Similarity, with respect to a property value, can be defined as follows: the property values of the members of the training set and the target compound follow the same (possibly unknown) rule. Typical examples of groups which contain “similar” members, in this context, are members of homologous series. In such series most properties change with nC according to specific rules, which can be often approximated by ABC correlations. Because of the similarity between the members of homologous series, it is important to be able to include in the training set all the members of the series to which the target compound belongs and for whom the pertinent property values are available. To achieve that we attempted to identify the subset of descriptors which can be used to select the members of similarity groups/training sets so that the maximal number of homologous series members are included. In Table 5 the first 10 compounds selected to the training sets of 1-butene, 1-undecene, and 1-nonadecene (target com-

pounds) are shown while using all 1256 descriptors in the selection procedure (the total number of 1-alkenes in the database is 28). Observe that in the training set of 1-butene, only 2 members of the 1-alkene series are included, for 1-undecene the number increases to 6, and for 1-nonadecene all 10 selected compounds are members of the 1-alkene homologous series. Thus, the composition of the training set depends very much on the location of the target compound within the homologous series. Similar results were obtained by Kahrs et al.,10 while testing several measures for similarity of structures and/or clustering algorithms. We have tried to use different subsets of the descriptors to achieve maximal utilization of the available predictive compounds belonging to the target-compound homologous series. The most successful option turned out to be the use of a twostep procedure. In the first step, only Category I descriptors, which have constant values for homologous series, are used. After removing the 3-D descriptors, 61 of this type of descriptors remain in the database. Using Category I descriptors yields |rti| ) 1 values for all the members of the homologous series to which the target compound belongs. Thus, this step enables isolation of the homologous series subset from the rest of the database; however, it does not arrange the members of homologous series according to their similarity with the target compound. To achieve that, the second step of the similarity group selection is carried out where only type II descriptors (after excluding the 3-D descriptors) are used to identify the training set from the subset of compounds that were identified in step 1. In Table 6 the members of the training set for the target compound: n-dodecane (nC ) 12) that were identified using the two-step procedure are shown. Observe that the |rti| values are such that the predictive compounds are arranged according to their distance (in terms of nC) from the target compound, while the compound with the higher nC always precedes the compound with the lower nC at the same distance. This type of training set is the most appropriate for predicting gas and liquid properties. However, it may not be the best for predicting solid properties, such as Tm, as for some series there are different similarity rules for odd and even nC compounds. In such cases it may be preferable to remove from the training set compounds whose nC type (odd or even) does not match that of the target compound. In Table 7, for example, the members of the training set for the target compound n-dodecane are shown, where the odd nC compounds have been excluded.

Ind. Eng. Chem. Res., Vol. 48, No. 21, 2009

9729

Table 3. Trend of Change of Descriptors with nC for Homologous Series category

trend of change of the descriptor with nC

% of descriptors in the database

% of 3D descriptors

I II IIIA IIIB IV

constant linear or nearly linear increase nonlinear monotonic increase or decrease with decreasing slope nonlinear monotonic increase or decrease with increasing slope inconsistent, no particular trend or different trends for different homologous series zero value for some nC, nonlinear monotonic increase for others separate curves for odd and even nC periodic

8.5 10.9 25.2 10.1 22.9

7.3 41.7 32.1 66.7 83.6

21.9 0.4 0.2

62.9 100.0 100.0

V VI VII

Table 4. Selected Descriptor Values for 1-Tridecanol Using Various Starting Configurationsa, 3-D Structure Determination Programs, Algorithms, and Convergence Tolerances starting configuration: 2-D NIST MOL file

starting configuration: 3-D NIST MOL file

no.

descriptor

type

I

II

III

IV

V

VI

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

RDF075m Mor12u As G3e DP19 SP19 L2m H2u HATSp H4e P1m HIC G1e H0u H8m ISH ITH L3p MEcc

3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D 3-D

0.002 -0.023 0.837 0.169 5.148 5.148 0.153 1.217 1.778 2.430 0.980 5.147 0.156 3.000 0.007 1.000 53.303 0.320 1.000

7.731 -1.690 5.013 1.000 12.115 12.115 0.315 1.180 1.832 2.416 0.985 5.170 0.156 3.000 0.007 1.000 53.303 0.320 1.000

1.220 -1.748 34.041 0.238 8.261 8.261 1.798 2.263 2.118 2.311 0.895 5.128 0.169 3.000 0.006 0.836 44.548 0.461 1.000

2.876 -2.325 27.839 0.180 1.348 1.348 2.304 3.142 2.340 1.248 0.763 5.203 0.156 3.000 0.002 0.836 44.548 0.881 0.998

2.406 -2.282 27.723 0.169 1.327 1.327 2.315 3.144 2.341 1.253 0.761 5.204 0.156 3.000 0.002 0.836 44.548 0.878 0.998

1.798 -2.103 24.051 0.180 0.865 0.865 2.192 3.216 2.324 1.167 0.756 5.214 0.156 3.000 0.003 0.911 48.548 0.917 0.997

a Definitions: (I) No change from starting configuration. (II) Gaussion 03 program, PM3 algorithm, “tight” convergence tolerance. (III) HyperChem 05 program, CNDO algorithm, convergence tolerance: ∆RMS < 0.1. (IV) No change from starting configuration. (V) Gaussion 03 program, PM3 algorithm, “tight” convergence tolerance. (VI) HyperChem 05 program, CNDO algorithm, convergence tolerance: ∆RMS < 0.1.

Figure 8. Values of the 3-D descriptor: Mor30v for members of the 1-alcohol homologous series.

Comparison of Property Prediction with and without 3-D Descriptors. To check the assumption that property predictions can be carried out with the TQSPR method in a similar level of accuracy with or without the use of 3-D descriptors, seven properties were predicted for members of five

homologous series using the two alternatives. The properties predicted included critical temperature (TC), critical pressure (PC), critical volume (VC), normal melting temperature (Tm), normal boiling temperature (Tb), liquid molar volume (Vm), and refractive index (RI). The following homologous series were

9730

Ind. Eng. Chem. Res., Vol. 48, No. 21, 2009

Table 5. The First 10 Compounds Selected to the Training Set of the (Target) Compounds 1-Butene, 1-Undecene and 1-Nonadecene (Using All the Descriptors of the Database) target compound compounds in the training set correlation coefficient 1-butene

1-undecene

1-nonadecene

1-pentene 2-methyl-1-butene 2-methylbutane 3-methyl-1-butene isoprene 2-methyl-2-butene 2-methyl-1-pentene 4-methyl-1-pentene 1-hexene trans-2-hexene 1-dodecene 1-decene 1-tridecene 1-nonene 1-tetradecene 2-methylnonane 1-pentadecene hexylcyclopentane heptylcyclopentane n-undecane 1-eicosene 1-octadecene 1-heneicosene 1-heptadecene 1-docosene 1-hexadecene 1-tricosene 1-tetracosene 1-pentadecene 1-pentacosene

0.97325 0.96638 0.95847 0.95722 0.95388 0.94798 0.94638 0.94618 0.94451 0.94311 0.99389 0.99334 0.98894 0.98591 0.98129 0.97861 0.97547 0.97198 0.96967 0.96825 0.99882 0.99867 0.9978 0.99734 0.99539 0.99358 0.99275 0.98985 0.98808 0.98559

Table 6. Members of the Training Set for the Target Compound: n-Dodecane (nC ) 12) no.

compound

nC

|rti|a

1 2 3 4 5 6 7 8 9 10

n-tridecane n-undecane n-tetradecane n-decane n-pentadecane n-nonane n-hexadecane n-octane n-heptadecane n-heptane

13 11 14 10 15 9 16 8 17 7

0.99948 0.99946 0.99798 0.99779 0.99555 0.99493 0.99227 0.99082 0.98822 0.98533

a Type II descriptors (3-D descriptors excluded) used for calculation of |rti|.

Table 7. Members of the Training Set for the Target Compound: n-Dodecane (nC) 12) with Odd nC Compounds Excluded no.

compound

nC

|rti|a

1 2 3 4 5 6 7 8 9 10

n-tetradecane n-decane n-hexadecane n-octane n-octadecane n-hexane n-eicosane n-docosane n-tetracosane n-hexacosane

14 10 16 8 18 6 20 22 24 26

0.99798 0.99779 0.99227 0.99082 0.98348 0.97836 0.97211 0.95879 0.94394 0.92798

a Type II descriptors (3-D descriptors excluded) used for calculation of |rti|.

considered: n-alkanes, from n-butane (nC ) 4) to n-hexatriacontene (nC ) 36); 1-alkenes, from 1-butene (nC ) 4) to 1-eicosene (nC ) 20); n-alkylbenzenes, from butylbenzene (nC ) 10) to n-octadecylbenzene (nC ) 24); 1-alcohols, from 1-butanol (nC ) 4) to 1-docosanol (nC ) 22); and aliphatic acids, from butanoic-acid (nC ) 4) to eicosanoic-acid (nC ) 20).

Each compound in these series was targeted individually. The two-step approach for selection of the training set, which was described in the previous section, was used for predicting all the properties except Tm. For Tm the first compounds to enter the training set were the ones that matched the nC type (odd or even) of the target. TQSPRs including one or maximum two descriptors were used, (according to the SROV signal-to-noise ratio criteria, upon introducing the first descriptor into the model). The predicted property value for the target compound, y˜pt, was compared with the property value recommended by DIPPR (yt,publ) by examining its absolute deviation (in %): δ ) 100|y˜pt - yt,publ |/yt,publ

(6)

In Table 8 the mean, the median, and the standard deviation of the δ values are summarized for the five homologous series and the seven properties, for the case where the 3-D descriptors are excluded from the TQSPRs. Observe that TC, Tb, Vm, and RI are predicted for essentially all the compounds (except for Vm of n-aliphatic acids) with deviations from the DIPPR recommended values being δ < 1%. For PC and VC, there are some cases where δ > 1%. The larger deviations are caused in this case by large uncertainties in the DIPPR recommended property values, which may reach 10%-25%, especially for high nC compounds. For Tm there are also cases where δ > 1%. As the uncertainty in the recommended Tm values is usually