Ind. Eng. Chem. Res. 2006, 45, 8430-8437
Structurally “Targeted” Quantitative Structure-Property Relationship Method for Property Prediction
Neima Brauner,*,† Roumiana P. Stateva,‡ Georgi St. Cholakov,§ and Mordechai Shacham|
School of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel, Institute of Chemical Engineering, Bulgarian Academy of Sciences, Sofia 1113, Bulgaria, Department of Organic Synthesis and Fuels, University of Chemical Technology and Metallurgy, Sofia 1756, Bulgaria, and Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
Prediction of unknown properties for a target compound using quantitative structure-property relationships (QSPRs) is reliable only if the compound is within the model applicability domain. To improve the prediction reliability, a “targeted” QSPR (TQSPR) method is developed in which a training set containing only compounds structurally similar to the target compound is first identified. Similarity is measured by the partial correlation coefficients between the vectors of the molecular descriptors of the target compound and each potential predictive compound. Available property data in the training set are then used in the usual manner to select molecular descriptors for QSPRs, predicting the properties of the target and the rest of the compounds in the set. Preliminary results show that the method proposed yields predictions within the experimental error for compounds well represented in the database and fairly reliable estimates for complex compounds that are sparsely represented. The cutoff value of the partial correlation coefficient provides an indication of the expected prediction error.
1. Introduction
Experimental property data are available only for a small fraction of the pure compounds pertaining to chemistry and chemical engineering, environmental engineering and environmental impact assessment, hazard and operability analysis, etc. Therefore, methods for reliable prediction of property data are needed. One class of modern property prediction methods is based on the identification of quantitative structure-property relationships (QSPRs) between a compound and its properties, so the adequate numerical description of chemical structure is essential. The introduction of topological, graph theoretical, topochemical, quantum chemical, structural, physicochemical, electrotopological, and other molecular descriptors enabled considerable improvement of the presentation of chemical structures and, hence, of the accuracy of the property predictions.
Neural network models as well as “most significant common features” QSPRs have been developed in order to take advantage of the new, sophisticated molecular descriptors for prediction of properties. A recent review of the field was published by Dearden.1 In the present work, we concentrate on the most significant common features QSPR methods, as defined in Wakeham et al.,2 which we shall henceforth call QSPRs for short. These QSPRs can be schematically represented by the following equation:
$$y_p = f(x_{s1}, x_{s2}, \ldots, x_{sk};\; x_{p1}, x_{p2}, \ldots, x_{pm};\; \beta_0, \beta_1, \ldots, \beta_n) \qquad (1)$$
where $x_{s1}, x_{s2}, \ldots, x_{sk}$ are the molecular structure descriptors of a particular pure compound, $x_{p1}, x_{p2}, \ldots, x_{pm}$ are measurable properties of the same compound (such as boiling temperature, melting temperature, toxicity, etc.), $\beta_0, \beta_1, \ldots, \beta_n$ are the QSPR parameters, and $y_p$ is the target property (to be predicted) of the same compound.
* To whom correspondence should be addressed. Tel.: +972-3-6408127. Fax: +972-3-6407334. E-mail: [email protected].
† Tel-Aviv University.
‡ Bulgarian Academy of Sciences.
§ University of Chemical Technology and Metallurgy.
| Ben-Gurion University of the Negev.
To derive the QSPR, the available data are divided into a “training set” and a “validation set”. Using the training set, multiple linear or nonlinear regression and partial least squares techniques are employed to select the molecular descriptors and/or properties to be included in the right-hand side (RHS) of eq 1 and to calculate the model parameter values. Model validation is carried out using the validation set. Applications of QSPRs to the prediction of physical and thermodynamic properties have been reported by many authors.2-6 Independent studies of QSPR prediction errors (e.g., Poling et al.7 and Yan et al.8) show that if the chemical structure of the target compound is well represented in the training set, the prediction can be expected to be much more accurate than when its structure is sparsely represented. In the latter case, extrapolation of a particular QSPR toward a target compound with unmeasured pure component constants can be rather risky, and at present, the prediction accuracy cannot be assessed. With no feedback on the prediction error, it is not possible to choose among different prediction methods and to advocate the best. Tropsha et al.9 have pointed out that no matter how robust, significant, and validated a QSPR may be, it cannot be expected to reliably predict the modeled property for the entire universe of chemicals. On the basis of this observation, they recommended defining a “model applicability domain” in order to obtain reliable prediction of properties. The model applicability domain is determined by the variation of chemical structures in the database and tested by the training set.
Therefore, the compounds in the training set should not be selected at random; rather, the training set should contain compounds structurally related to the target compounds, the properties of which are to be predicted. Moreover, new methods for identification of structurally related compounds are needed not only for training sets but also for the computer aided molecular design of compounds structurally similar to a compound with known valuable properties. These problems and approaches for dealing with them have been recently reviewed by Basak et al.12
10.1021/ie051155o CCC: $33.50 © 2006 American Chemical Society. Published on Web 02/11/2006
In an attempt to overcome the limitations of QSPRs, Shacham et al.10 and Brauner et al.11 developed a new method, called quantitative structure-structure property relationship (QS2PR). It is based on the assumption that the structure (the vector of molecular descriptors) of a target compound can be represented by a linear combination of the vectors of molecular descriptors of structurally related predictive compounds (a “structure-structure correlation”). A stepwise linear regression algorithm12 was used to identify the parameters of the structure-structure correlation. The parameters of this correlation are then used in property-property correlations, capable of predicting all needed properties of a target compound and providing a realistic estimate of the prediction accuracy.10,11 However, it has been observed that a target compound with a complex molecular structure may require a large number of predictive compounds in order to obtain a structure-structure correlation of satisfactory precision. For the property prediction stage, measured values of the particular property for all predictive compounds must be available. Hence, in some cases it will be difficult to apply the QS2PR technique because too few predictive compounds have reliable measured property values.
In this paper a new, targeted QSPR (TQSPR) method, which complements the QSPR and QS2PR methods, is presented. To some extent, it follows ideas suggested by Basak et al.12 and Chen et al.13 who employed cluster analysis on molecular descriptor data to identify similarity groups of compounds. In our work, the partial correlation coefficient between the vector of the molecular descriptors of the target compound and that of a potential predictive compound is adopted as a quantitative measure of the similarity between molecules.
The group of compounds in a database, which are structurally similar to a target compound of unknown properties, is then used as a targeted training set to develop QSPRs for unknown properties.
2. Targeted QSPR Method
The targeted QSPR method attempts to identify a training set structurally related to a particular compound (a target compound). For effective use of the targeted QSPR method, a database of molecular descriptors, $x_{ij}$, and physical properties, $y_{ij}$, for the predictive compounds is required, where i is the number of the compound and j is the number of the descriptor/property. Molecular descriptors for the target compound ($x_{tj}$) should also be available. The same set of molecular descriptors is defined for all compounds included in the database, while the span of molecular descriptors should reflect the difference between any two compounds in the database. In principle, the database should be as large as possible, as adding more molecular descriptors and more compounds to the database can increase its predictive capability.
At the first stage of the targeted QSPR method, a similarity group (cluster, training set) for the target compound is established. The similarity group includes the predictive compounds, which are identified as being structurally similar to the target compound. The similarity is measured by the partial correlation coefficient, $r_{ti}$, between the vector of the molecular descriptors of the target compound, $\mathbf{x}_t$, and that of a potential predictive compound, $\mathbf{x}_i$. The partial correlation coefficient is defined as $r_{ti} = \mathbf{x}_t \mathbf{x}_i^T$, where $\mathbf{x}_t$ and $\mathbf{x}_i$ are row vectors, centered (by subtracting the mean) and normalized to unit length (by dividing by the Euclidean norm of the vector). Absolute $r_{ti}$ values close to one ($|r_{ti}| \approx 1$) indicate high correlation between
vectors xt and xi and, thus, high levels of similarity between the molecular structures of the target compound and the predictive compound i. The similarity group is established by selecting the first p compounds with the highest |rti| values. Another option is to include in the similarity group only those compounds for which the |rti| values exceed a prescribed threshold value. For development of a QSPR for a particular property of the target compound, applicable also for all members of the similarity group, only members of the similarity group for which data for the particular property are available are considered (N compounds). Considering the limited variability of the property values within the similarity group, a linear structure-property relation is assumed of the form:
$$\mathbf{y} = \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \cdots + \beta_m \mathbf{x}_m \qquad (2)$$
where y is an N vector of the respective property values (N is the number of compounds included in the similarity group), x1, x2, ..., xm are N vectors of predictive molecular descriptors (to be identified via a stepwise regression algorithm), and β0, β1, β2, ..., βm are the corresponding model parameters to be estimated. It should be noted that, if necessary, nonlinear functions of molecular descriptors may also be considered in the RHS of eq 2. The signal-to-noise ratio in the partial correlation coefficient (CNRj) is used as a criterion for determining the number of the molecular descriptors that should be included in the model (m). The signal-to-noise ratio obtained at step k of the stepwise regression procedure is defined by
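$$\mathrm{CNR}_j^k = \left\{ \frac{\left| (\mathbf{y}^k)^T \mathbf{x}_j^k \right|}{\sum_{i=1}^{N} \left( \left| x_{ij}^k \varepsilon_i^k \right| + \left| y_i^k \delta_{ij}^k \right| \right)} \right\} \qquad (3)$$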
where $\varepsilon_i^k$ and $\delta_{ij}^k$ are the error (noise) values in the target property and in the jth molecular descriptor values at the kth step of the stepwise regression procedure. The calculation of CNRj requires specification of error levels for the target property and the molecular descriptor data. For the target property, the estimate of the measurement error is used. The error (noise) in the molecular descriptors is assumed to be of the order of the round-off error of the calculated values. For integer data (number of carbon atoms, for example), the noise level is the computer precision. Addition of new descriptors to the model can continue as long as the CNRj is greater than one for at least one of the descriptors not yet included. A detailed description of this stopping criterion can be found in Shacham and Brauner.11
As in a typical most significant common features method,2 a stepwise regression program is used to determine which molecular descriptors should be included in the QSPR to best represent the measured property data of the similarity group and to calculate the QSPR parameter values. The QSPR so obtained can subsequently be used to estimate the corresponding property value for the target compound and for other compounds in the group that do not have measured data, i.e., using the following equation:
$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_m x_{tm} \qquad (4)$$
where $y_t$ is the estimated unknown property value of the respective compound and $x_{t1}, x_{t2}, \ldots, x_{tm}$ are its corresponding molecular descriptor values.
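As a rough illustration of eqs 2-4, the following Python sketch fits the linear relation by ordinary least squares for descriptors that have already been chosen, evaluates eq 4 for a target compound, and computes the signal-to-noise ratio of eq 3 for one candidate descriptor. It is a minimal sketch under stated assumptions, not the SROV program: the function names (`cnr`, `fit_qspr`, `predict_property`) are hypothetical, NumPy's `lstsq` stands in for the stepwise regression, and the vectors passed to `cnr` are assumed to be whatever the stepwise procedure holds for the property and the candidate descriptor at step k.

```python
import numpy as np


def cnr(y, x, eps, delta):
    """Signal-to-noise ratio of eq 3 for one candidate descriptor.

    y, x  : length-N vectors (property and candidate descriptor at step k)
    eps   : assumed error (noise) in the property values
    delta : assumed error (noise) in the descriptor values
    """
    signal = abs(y @ x)
    noise = np.sum(np.abs(x * eps) + np.abs(y * delta))
    return signal / noise


def fit_qspr(X, y):
    """Least-squares estimate of beta_0..beta_m in eq 2.

    X is an N x m array holding the m selected descriptors (assumption:
    the selection itself has been done elsewhere).
    """
    A = np.column_stack([np.ones(len(y)), X])  # column of ones for beta_0
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict_property(beta, x_t):
    """Eq 4: estimate for a compound with descriptor values x_t."""
    return beta[0] + x_t @ beta[1:]
```

In this sketch a candidate descriptor would be admitted only while `cnr(...)` exceeds one, which mirrors the stopping rule stated above; how the vectors are updated between steps is left to the cited description of the algorithm.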
The targeted QSPR method ensures that the most pertinent information available in the database (as measured values and molecular descriptors) is used for prediction of each property of the structurally similar compounds.
3. Application of the Targeted QSPR Method for Property Prediction
For practical study of the TQSPR method, we used the molecular descriptor and property database of Cholakov et al.3 and Wakeham et al.2 This database contains 260 hydrocarbons, the molecular structure of which is represented by 99 molecular descriptors, and values for 5 physical properties. Additional measured (and in a few cases, which are identified below, predicted) property data from the DIPPR17 database were used for calculating prediction errors. The properties included in the database are the normal boiling temperature ($T_b$), relative liquid density at 20 °C ($d_4^{20}$), critical temperature ($T_c$), critical pressure ($P_c$), and critical volume ($V_c$). The list of the hydrocarbons in the database and the sources and quality of the property data are given in the corresponding references.2,3
In general, the molecular descriptors include the molar mass along with carbon atom descriptors, descriptors from simulated molecular mechanics (total energy, bond stretch energy, standard heat of formation, etc.), and some of the most popular topological indices,15 calculated with unit bond lengths and with the bond lengths of the minimized molecular model, obtained by molecular mechanics. Appendix A shows the names and numbers of the descriptors selected for the targeted QSPRs discussed in the present paper. A complete list of all molecular descriptors in the database can be found elsewhere.11 The 99 molecular descriptors in the database were normalized by dividing each descriptor by its maximal absolute value over the 260 compounds in the database.
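The normalization just described, followed by the similarity ranking of Section 2, can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, the descriptor matrix `D` (one row per compound, one column per descriptor) is assumed to include the target compound as row `t`, and the guard for all-zero columns is an added assumption rather than something stated in the text.

```python
import numpy as np


def normalize_descriptors(D):
    """Divide each descriptor (column) by its maximal absolute value."""
    scale = np.max(np.abs(D), axis=0)
    scale[scale == 0] = 1.0  # assumption: leave all-zero descriptors unchanged
    return D / scale


def similarity(D, t):
    """|r_ti| between the target row t and every row of D.

    Each row is centered by its mean and scaled to unit Euclidean length,
    so the dot product of two rows is their correlation coefficient.
    Assumes no row is constant (a constant row would have zero norm).
    """
    Z = D - D.mean(axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.abs(Z @ Z[t])


def similarity_group(D, t, p):
    """Indices of the p compounds most similar to the target, plus all |r_ti|."""
    r = similarity(D, t)
    order = np.argsort(-r)        # descending |r_ti|
    order = order[order != t]     # the target itself is not a predictive compound
    return order[:p], r
```

Selecting by a threshold instead of a fixed count would amount to replacing `order[:p]` with the indices for which `r` exceeds the prescribed cutoff.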
The stepwise regression program SROV12 was used for identification of the similarity group, by sorting the compounds in descending order according to their $|r_{ti}|$ values. The first p = 50 compounds were included in the similarity group. This number was set arbitrarily. The SROV program was also used for deriving the structure-property relation (eq 2). The three examples that follow illustrate the practical application of the targeted QSPR method.
Example 1. Prediction of the Properties of n-Tetradecane.
The compound n-tetradecane is representative of compounds for which accurate experimental data are available for most physical properties; it is densely represented in the database (meaning that there are many similar compounds included), and its properties can be predicted fairly well with existing QSPRs and homologous series techniques. The partial correlation coefficients of the 50 compounds with the highest $|r_{ti}|$ values for n-tetradecane as the target compound are shown in Figure 1. A list of some representative $|r_{ti}|$ values for 10 out of the 50 compounds is given in Table 1. It can be seen that the database contains a large number of compounds with a high level of similarity to n-tetradecane ($|r_{ti}|$ between 0.93195 and 0.99968). The highest correlations are with the immediate neighbors of the target compound in the homologous series, n-pentadecane and n-tridecane. The lowest $|r_{ti}|$ is with 1-nonacosene. Three homologous series are represented in this similarity group. The first and the obvious group is the homologous series of n-alkanes, starting with n-pentane and ending with n-pentatriacontane. The similarity decreases when the distance (in terms of number of the carbon atoms) from the target compound increases, while the rate of decrease
is higher in the direction of decreasing carbon atoms. The second series is the branched alkanes, where 2-methylnonane has the highest correlation with the target compound ($|r_{ti}|$ = 0.973). It is interesting to note that this correlation coefficient is higher than the correlation coefficient of many of the n-alkanes in the similarity group, even though this value is somewhat lower than the $|r_{ti}|$ value for the unbranched alkane n-decane ($|r_{ti}|$ = 0.98986), which has the same number of carbon atoms as 2-methylnonane. Nonbranched alkenes are also represented in the similarity group, with 1-docosene having the highest correlation coefficient value ($|r_{ti}|$ = 0.943). It should be noted that the $|r_{ti}|$ value of 1-tetradecene, which has the same number of carbon atoms as the target compound, is lower than the cutoff, and therefore, it has not been included in the similarity group.
This similarity group was used with eq 2 to derive QSPRs for $T_b$, $d_4^{20}$, $T_c$, $P_c$, and $V_c$ for compounds structurally related to n-tetradecane. Equation 4 was used to predict the respective properties of the target and of the compounds in its similarity group with unknown properties, as discussed in the next example. The results obtained by SROV with the TQSPR for $T_b$ are shown in Table 2. It should be noted that published $T_b$ values are available for all 50 members of the similarity group.3

Figure 1. Partial correlation coefficients in the group of compounds similar to n-tetradecane.

Table 1. Correlation Coefficient Values for Some of the Compounds Similar to n-Tetradecane

compound no. in the database   name                  |rti|
4                              n-pentane             0.94061
12                             n-tridecane           0.99963
13 (target)                    n-tetradecane         1
14                             n-pentadecane         0.99968
31                             n-pentatriacontane    0.93594
43                             2-methylhexane        0.93513
76                             2-methylnonane        0.97276
80                             2,5-dimethyldecane    0.94999
100                            1-hexadecene          0.93276
106                            1-docosene            0.94294
113                            1-nonacosene          0.93195

Table 2. Structure-Property Correlation Parameters for the Normal Boiling Temperature of the n-Tetradecane Similarity Group^a

coefficient   descriptor   value       confidence interval
β0            -            111.8972    4.9395
β1            x3           5.6415      0.32126
β2            x84          -0.00078    8.67 × 10^-5
β3            x85          55.7106     1.3288
β4            x86          0.014299    0.0030137

^a Number of compounds with measured data used, N = 50. Variance = 1.9511. Linear correlation coefficient (R2) = 0.99988.
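The fit statistics quoted with Table 2 (variance and R2) can be reproduced for any data set along the lines of the sketch below. The source does not define the variance, so this sketch assumes it is the residual variance with N - (m + 1) degrees of freedom; the function name `fit_statistics` is hypothetical, and the confidence intervals of Table 2 are not computed here.

```python
import numpy as np


def fit_statistics(X, y):
    """Least-squares fit of eq 2 with residual variance and R^2.

    X : N x m array of selected descriptors, y : length-N property vector.
    Assumption: variance = SSE / (N - (m + 1)).
    """
    N = len(y)
    A = np.column_stack([np.ones(N), X])          # intercept plus descriptors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    sse = float(residuals @ residuals)            # sum of squared residuals
    sst = float(np.sum((y - y.mean()) ** 2))      # total sum of squares
    variance = sse / (N - A.shape[1])
    r2 = 1.0 - sse / sst
    return beta, variance, r2
```

With a tightly clustered similarity group the residuals are small relative to the spread of the property values, which is what drives R2 values close to one.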
Table 3. Summary of Structure-Property Correlations for Various Properties of n-Tetradecane

                                                               prediction error (%)
property   descriptors            R2        exp (DIPPR)   TQSPR   QSPR^a   QS2PR^b
Tb         x3, x84, x85, x86      0.99988
d20_4      x3, x42, x88, x95      0.99932
Tc         x59, x88, x92, x95     0.99956
Pc         x65, x77, x85          0.99946
Vc         x72, x86, x95, x98     0.99987

^a Eight descriptors.2,3