Quantitative Structure–Property Relationship Predictions of Critical

Apr 14, 2015 - *E-mail: [email protected]. ... Daniel C. Elton , Zois Boukouvalas , Mark S. Butrico , Mark D. Fuge , Peter W. Chung. Scientific R...
0 downloads 0 Views 2MB Size
Article pubs.acs.org/jced

Quantitative Structure−Property Relationship Predictions of Critical Properties and Acentric Factors for Pure Compounds Wendy Hawley Carande,*,†,‡ Andrei Kazakov,† Chris Muzny,† and Michael Frenkel† †

National Institute of Standards and Technology, 325 Broadway, MS 647, Boulder, Colorado 80305-3337, United States University of Colorado Boulder, Boulder, Colorado 80309, United States



S Supporting Information *

ABSTRACT: Knowledge of critical constants and phase boundary pressure properties is essential to understanding thermodynamic behavior of substances and is often required in practical process design applications. Where critically evaluated data are unavailable, a quantitative structure−property relationship (QSPR) regression method can be used to relate molecular properties (descriptors) to properties of interest. The relationship is trained and tested using existing critically evaluated data and is dynamic; as new data become available, the relationship can be updated to reflect changes. In this work, we use support vector regression (SVR) to develop estimation methods for critical properties and acentric factors based on critically evaluated data for over 900 pure compounds. From threedimensional geometry and connectivity information, we calculate over 500 descriptors for each compound. A matrix of descriptor values defines the input vectors for SVR, whereas critically evaluated data for critical temperature, the ratio of critical temperature to critical pressure, and saturation reduced pressure form the targets. We determine optimal SVR parameters by minimizing the sum of absolute deviations between the SVR outputs and the target values. We use a genetic algorithm to find the Pareto front points that optimize the output fit while reducing the number of input vectors (descriptors). We use a single Pareto front point to make a final evaluation in SVR. To define uncertainties of predicted values, we use uncertainty propagation calculations based on a Monte Carlo method that employs Latin hypercube sampling.



INTRODUCTION As data technology evolves, our ability to access and analyze historical data has improved rapidly, and dynamic modern databases allow rapid access to up-to-date thermodynamic values and their uncertainties.1 Critical properties (such as critical temperature, Tc, and critical pressure, pc) of substances are essential to understanding their thermodynamic behavior, in particular when that behavior is predicted based on an equation of state. We use critically evaluated property values generated through assessment of experimental data.2 Hereafter, we will refer to critically evaluated values as “experimental” values for brevity. Where experimental critical property data are unavailable, various estimation techniques are utilized to obtain critical values.3 A quantitative structure−property relationship (QSPR) regression method can be used to relate molecular properties (descriptors) to critical values.4−13 The relationship is trained and tested on existing experimental data and is dynamic; as new data become available, the relationship can be updated to reflect the changes. QSPR has long been used to predict chemical properties,14 and a number of advanced methods have been developed to implement predictive QSPR algorithms.4−12 The fundamental assumption of QSPR is that there is a relationship between quantities that describe the structure of a molecule (descriptors) and the thermophysical properties that are measured experimentally (such as critical temperature). There are two types of descriptors: experimental and theoretical. This article not subject to U.S. Copyright. Published XXXX by the American Chemical Society

Experimental descriptors are measured directly, whereas theoretical descriptors are derived from a symbolic representation of a molecule. Although many proprietary programs offer thousands of descriptors, several hundreds of descriptors can be generated using open source tools (e.g., RDKit, CDK, and Open Babel) alone. Support vector regression (SVR) offers many advantages for computing the relationship between descriptors and thermophysical properties.15 SVR is effective in high dimensional spaces, memory-efficient, and versatile. However, when the number of features (descriptors) approaches the number of samples (experimental data), the relationship easily becomes overfit to the set of data on which it is trained. This is why a large data set is desirable, and measures are taken to trim the descriptor set to a reasonable size. Additionally, probability estimates for robustness and uncertainty are not directly provided by SVR and require additional cross-validation and uncertainty modeling steps that can be computationally costly. The goal of this work is to combine a large set of critically evaluated data with modern computational techniques to predict thermodynamic properties. In our previous work,12 we presented the basic ideas for developing such a system and demonstrated its performance for predicting critical properties Received: December 1, 2014 Accepted: March 31, 2015

A

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

Figure 1. Overall workflow for generation of predictive methods.

Creation of 3D MOL files. 3D molecular structure representations needed for calculation of molecular descriptors were generated from two-dimensional (2D) structure definitions in MOL17 format stored in the NIST/TRC SOURCE data archival system.18 The generation procedure was developed to provide diverse sampling of low-energy conformations consistent with the original 2D representation. The initial pool of 3D conformers is generated using several available utilities.19−26 All produced structures are optimized with TINKER27 utility using its implementation of the MMFF94.28 Optimized structures are further analyzed. Conformers that are either inconsistent with the original 2D representation (as established by analysis of the resulting InChI strings29) or redundant are removed from the pool. In some cases, when permitted by the original 2D representations, different stereoisomers are present in the resulting pool. If present, distinct stereoisomers are separated into groups also using InChI representation. For each distinct stereoisomer, a more detailed conformational analysis is performed. Depending on the number of rotatable bonds and the number of flexible ring atoms, either systematic or stochastic searches of conformational space are performed for each stereoisomer. Systematic search is carried out with the CONFAB utility.30 When systematic search becomes combinatorically prohibitive, 200 simulated annealing runs with TINKER are performed instead. The number of annealing runs was chosen in our previous work12 and represents a compromise between the computational cost and search effectiveness for the molecules within the scope of our data set. As previously, all newly generated structures were optimized with the MMFF94, and inconsistent or redundant structures are removed. At this stage, we retain up to 200 conformations per stereoisomer; the larger structure collections are truncated based on the MMFF94 energy and diversity measured as root-mean-squared distance (RMSD)30 from the lowest-energy conformer. The entire procedure as described above is implemented programmatically and executed without manual intervention. A single representative 3D conformer to be used in QSPR was generated by reoptimization of all structures in final collection with the semiempirical PM6 method31 and choosing the structure with the lowest free energy at standard conditions, as was suggested by us previously.12 PM6 represents substantial improvement over previous semiempirical methods both in terms of its accuracy and the scope. PM6 use in QSPR modeling was reported to produce models of quality comparable with those generated with much more computationally expensive density

of pure substances. Here, we develop a substantially improved version of our earlier methodology, use an updated data set for model training, and extend model development to one more property. As previously, we target critical temperature (Tc) and the ratio of critical temperature to critical pressure (Tc/pc). The choice of Tc/pc over just pc was explained in detail previously:12 simple theoretical considerations relate this ratio to the molecular volume. Although accurate correlations do require more than one parameter to model Tc/pc, many fewer parameters (11 as opposed to over 30 according to the tests performed in this work) are needed overall as compared to describing pc itself. We also consider an additional property, the acentric factor (ω), as defined by eq 1. The combination of Tc, pc, and ω represents a minimal set of properties needed for approximate equation of state, and the resulting correlations can be used for compound screening based on their performance in thermodynamic cycles.16 Initial tests indicated that, for the property related to ω, the saturation reduced pressure at Tr = T/Tc= 0.7, psat r , better correlations can be produced than for ω itself. Therefore, psat r was used for an actual target, and value ω can be obtained from it following its formal definition (eq 1) ω = − log10 prsat |Tr = 0.7 − 1

(1)

In the first section of this paper, we discuss our research methods, including how we preprocessed the data and our use of support vector machines and genetic algorithms to make predictive relationships, as well as our uncertainty propagation and cross-validation techniques. In the second section, we relay the results of our predictive models. Finally, we include concluding remarks and suggestions for further work that can be done in this field.



METHODS Overview. In this section, we will discuss the methods used to make predictions of critical properties and acentric factors for pure compounds. First, three-dimensional (3D) molecular models are created for each compound to be studied. A set of descriptor values is then calculated for each compound using the 3D model. SVR is used to relate the descriptor values to the experimental data. The SVR parameters are optimized using a genetic algorithm, which includes the reduction of overall number of descriptors used in the model. An overall workflow for our process is depicted in Figure 1. The flow of the prediction algorithm is shown in Figure 2. B

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

Figure 2. Prediction algorithm.

functional theory (DFT) methods.32 All PM6 calculations in this work were performed with Gaussian 09 package.33 Preprocessing of Data. Critical temperatures, critical pressures, acentric factors, and the corresponding uncertainties for pure compounds were produced via dynamic evaluations of up-to-date collection of experimental data18 using NIST/TRC ThermoData Engine.34 Only original experimental data (or, for critical pressures and acentric factors, robust interpolations/ extrapolations of original experimental data) were used. The final compilation contained 942 compounds for which at least one of the three properties was available experimentally. The data archive18 is one of the most complete (and continuously updated) collections of original experimental data, and the

generated data set covers the majority of available experimental information. Larger collections of critical properties were reportedly used for correlation development.11,35 However, as we pointed out previously,12 these larger data sets are based on compilations that include significant fractions of nonexperimental values typically estimated with other empirical methods. Descriptor Calculation. Three open source tools were used to create the descriptor set: RDKit,22 CDK,36 and Open Babel.20 In cases of identical descriptors generated by different tools, the RDKit values were used. A number of additional descriptors were generated in-house from the results of PM6 calculations, as described in the Supporting Information. The C

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

single Pareto front point was chosen and the descriptors from that set were used in a final SVR run. The Pareto front point with the lowest sum of absolute deviations is used as a baseline, and the point with the lowest number of descriptors within a range of 3−4% of the baseline is chosen for the final SVR model. This strategy for descriptor subset selection is consistent with recommendations for empirical model building based on large data sets.42 Uncertainties. Uncertainties were estimated by a Monte Carlo method that employs Latin hypercube sampling, following the approach used previously.12 Similar methodologies were reported recently by Hajipour et al.43,44 Python mcerp package45 was adopted in the present work. SVR itself does not provide a path to explore uncertainty propagation, which must be performed separately. This method allows the creation of uncertainty metrics based on a distribution of likely values based on the input experimental uncertainty. A total of ,000 sample points were used for sampling using a normal distribution (with a σ given by half the experimental uncertainty). This means SVR was run 4000 times before a median value was taken for the predicted quantities. The uncertainties are calculated based on the spread of the predicted quantity distribution.

combined list included 533 descriptors. Descriptors were discarded before running SVR if they met one of the following criteria: they had empty values, they had zero range (every molecule had the same value), they were highly discrete (e.g., number of aromatic nitrogens), or they were strongly correlated with other descriptors. A full listing of the descriptors used is given in the Supporting Information. Removal of Outliers. Some molecules always exhibited behavior inconsistent with the rest of the set. Water, nitrogen, hydrogen peroxide, and carbon monoxide are notoriously difficult to model using our approach and, thus, were discarded from the data set. These outliers are well-studied molecules with associated specific models. ν-Support Vector Regression (ν-SVR). In the ν-SVR formulation,37 a parameter ν controls the number of support vectors by providing a lower bound for the fraction of support vectors. Scikit-learn,38 a machine learning algorithm for Python, was used to perform SVR calculations. Each experimental value used has an associated combined expanded uncertainty,39 the reciprocal of which is used as a weighting term in SVR. We set maximum weights of 1/3, 100, and 250 for Tc, Tc/pc, and psat r , respectively, in order to avoid overweighting of any particular point with comparatively low uncertainty. The weights were normalized before use in the SVR calculation. The built-in normalization routine in scikit-learn was also used on the descriptor values. All SVR modeling was performed with the third-order polynomial kernel function; the advantages of polynomial kernel were discussed in our previous work.12 Once the SVR parameters were selected, the SVR routine was run with all the descriptor values constituting the input vectors. SVR is done in three steps: 1. The data are split into three groups: training, validation, and testing. This selection is random and is done with a 2:1:1 ratio of training:validation:testing. In our previous work,12 we used rational selection for division of the data set. For the present workflow, random selection is more appropriate (due to the added part for reduction of the number of descriptors, see below). On the other hand, recent analysis40 suggests that rational selection presents no significant benefit over random selection in terms of predictive power of the resulting models. 2. Run SVR using the training set to create the correlations and the validation set to tune the SVR parameters. SVR is highly sensitive to three parameters: the penalty parameter (C), an upper bound on the fraction of training weights and a lower bound of the fraction of support vectors (ν), and the independent term in the kernel function (r). A three-dimensional grid search is used to maximize the coefficient of determination (R2) between the predicted values and the targets. 3. Combine the training and validation sets and create the SVR correlations using the combined set, then test the correlations on the testing set. After following these SVR steps, the next task is to reduce the number of descriptors used while maintaining a good fit to the target data. This was done by using a genetic algorithm in DEAP41 to find the Pareto front values that represent a tradeoff between quality of the fit (the sum of absolute deviations, hereafter SAD) and the number of descriptors. Each individual in the population to be evolved was a set of descriptors to run as the input vectors in SVR, ranging from single descriptor values to the entire descriptor set. After 100 generations, a



RESULTS For each target property, a description of these results, as well as accompanying figures, are provided below. Results for Tc. Figure 3 shows a comparison of SVR predictions with experimental values for Tc for the training (gray points) and testing and validation combined (red points) sets. The left plot is the comparison of predicted versus experimental Tc for all the descriptors available (top) and

Figure 3. Left-top panel: comparison of SVR predictions using all descriptors with the experimental values for critical temperature. The gray points: combined training and validation sets, the red points: testing set. Right panel: the Pareto front of the sum of absolute deviations (SAD) versus the number of descriptors for critical temperature. The Pareto front point chosen for the final model is marked with a red ×. Left-bottom panel: comparison of final SVR model predictions with the experimental values for critical temperature. D

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

Pareto front-selected descriptors (bottom); the right plot shows the Pareto front. The y uncertainty bars on the SVR plots indicate the uncertainty calculated from the Latin hypercube sampling method. The experimental (x axis) uncertainty bars are also provided. 106 descriptors are used in the first SVR run after initial filtering described in the previous section. After the Pareto front analysis, 33 descriptors listed in Table 1 are chosen for the final SVR model. The SAD for the final calculation is 3.69 × 103 K, above the initial run SAD of 3.31 × 103 K. Table 1. Descriptors Used in the Final Model for Tc Ordered by the Importance Index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

gvdWV Weighted Path 1 BCUT 5 gCOSMO-PCD gCOSMO-NCD Topological polar surface area Carbon Types 3 Molecular distance edge 00 Molecular distance edge 10 Spectrophore 23 Spectrophore 47 gIRAFreq gLUMO BCUT 4 Spectrophore 48 BCUT 1 Weighted Path 3 Balaban J value Molecular distance edge 07 gNCD Atom additive LogP 2 gHLgap gCOSMO-VAR MOE MR VSA Descriptor 06 (2.75 ≤ x < 3.05) BCUT 0 MOE Charge VSA Descriptor 08 (0.00 ≤ x < 0.05) Molecular distance edge 02 Carbon Types 4 Molecular distance edge 08 Molecular distance edge 12 Carbon Types 5 BCUT 3 Carbon Types 0

Figure 4. Left-top panel: comparison of SVR predictions using all descriptors with the experimental values for the critical temperature to critical pressure ratio. The gray points: combined training and validation sets, the red points: testing set. Right panel: the Pareto front of the SAD versus the number of descriptors for the critical temperature to critical pressure ratio. The Pareto front point chosen for the final model is marked with a red ×. Left-bottom panel: comparison of final SVR model predictions with the experimental values for the critical temperature to critical pressure ratio.

Table 2. Descriptors Used in the Final Model for Tc/pc Ordered by the Importance Index 1 2 3 4 5 6 7 8 9 10 11

gvdWV gPCD Kier and Hall Chi cluster index 5 Molecular distance edge 04 Carbon Types 5 BCUT 1 MOE logP VSA Descriptor 01 (−∞ < x < −0.40) gCOSMO-SKW gSumIR Carbon Types 2 MOE MR VSA Descriptor 06 (2.75 ≤ x < 3.05)

the SAD for the final run. However, most of that increase in SAD is due to the sparsity of values for high Tc/pc; the left side of the plot remains virtually unchanged. For the k-fold crossvalidation over 5 folds, the mean MAD is 1.31 × 10−2 K/kPa and the standard deviation is 1.56 × 10−3 K/kPa. sat Results for psat r . For the final target property, pr , \Figure 5 shows predicted versus experimental values. The plots use the same setup and colors as Figures 3 and 4. For psat r , 99 descriptors pass in to the initial calculation. The SAD for the testing set in the initial SVR run is 0.412. Although this target property displays higher scatter than the previous two, the predicted values generally fall within the limits of the experimental uncertainties. After the Pareto front calculations, 30 descriptors remain for the final SVR model accompanied by an increase in the SAD, which has a final run value of 0.479. This increase in the SAD is a reasonable trade-off for reducing the number of descriptors used in the calculation. Full listing of descriptors used in the final model for psat r is given in Table 3.

A k-fold cross-validation method is used to test the robustness of the final model. In this method, the data are randomly divided into k subsets, then the model is run with one of the folds as the testing set and the combined other folds as the training set. In our case, we used k = 5 and evaluate the mean absolute deviation, or MAD for each fold. For the five folds, the mean MAD is 13.4 K with a standard deviation of 1.6 K. The low standard deviation between folds indicates that the model is robust. Results for Tc/pc. SVR predictions of Tc/pc compared to experimental values are shown in Figure 4. The plots use the same setup and colors as Figure 3. For Tc/pc, 100 descriptors are used in the first run. After the Pareto front calculation, 11 descriptors (see Table 2) remain for the final SVR calculations. The SAD for the testing set for the initial and final runs, respectively, are 2.47 and 2.65 K/kPa, indicating an increase in E

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

Table 3. Descriptors Used in the Final Model for psat r Ordered by the Importance Index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Figure 5. Left-top panel: comparison of SVR predictions using all descriptors with the experimental values for the saturation reduced pressure. The gray points: combined training and validation sets, the red points: testing set. Right panel: the Pareto front of the SAD versus the number of descriptors for the saturation reduced pressure. The Pareto front point chosen for the final model is marked with a red ×. Left-bottom panel: comparison of final SVR model predictions with the experimental values for the saturation reduced pressure.

For the k-fold cross-validation over 5 folds, the mean MAD is 3.42 × 10−3 and the standard deviation is 4.00 × 10−4.



DISCUSSION Important Descriptors. The number of descriptors (and which ones are chosen by the genetic algorithm) are different for Tc, Tc/pc and psat r . In order to access the relative importance of chosen variables (descriptors) for a given property, a firstorder sensitivity analysis of the final SVR models is performed. We define sensitivity coefficients with respect to each feature (scaled descriptor) as sij =

has the least, meaning that the latter requires fewer features to model than the former. Additionally, the relative importance of features drops off rapidly for Tc/pc, but shows only a slow decline for Tc. Tc/pc can be modeled with fewer features because the relative importance of features decreases heavily after the first few. A full list of descriptors for Tc sorted by their importance is given in Table 1. The five most important descriptors for Tc are gvdWV, Weighted Path 1, BCUT 5, gCOSMO-PCD, and gCOSMO-NCD. gvdWV is the van der Waals volume. gCOSMO-PCD and gCOSMO-NCD are, respectively, the positive and negative charge densities on COSMO46 cavity surface. The Weighted Path descriptors are formulated by CDK and characterize molecular branching. The BCUT descriptors47 are derived from a weighted version of the Burden matrix (B)48 calculated in CDK. B is a symmetric matrix with the dimension equal to the number of nonhydrogen atoms in the molecule. Off-diagonal elements of B characterize connectivity within the molecule. Diagonal elements contain atomic properties relevant to intermolecular interactions. CDK implements three types of Burden matrices with diagonal elements defined as atomic weights (B1), empirically derived atomic partial charges (B2), and atomic polarizabilities (B3). The BCUT descriptors are the highest and lowest eigenvalues for each of the three matrices: BCUT 0/ BCUT 1, BCUT 2/BCUT 3, and BCUT 4/BCUT 5 are the highest/lowest eigenvalues of B1, B2, and B3, respectively. Specifically, BCUT 5 is the lowest eigenvalue of the matrix weighted by polarizability. In spite of being somewhat

∂yi ∂pj

(2)

where yi is the property for the ith compound and pj the jth scaled descriptor. Then, we define the importance index for the jth descriptor as

Ij = A ∑ |sij| i

gCOSMO-POS Auto Correlation Charge 1 gvdWV gGAFreq gPNSA1 gNCD MOE Charge VSA Descriptor 01 (−∞ < x < −0.30) Molecular distance edge 04 gCOSMO-NCD EState VSA Descriptor 09 (4.69 ≤ x < 9.17) Weighted Path 1 Hall Kier Alpha gCOSMO-PCD gSumIR MOE Charge VSA Descriptor 03 (−0.25 ≤ x < −0.20) gCOSMO-VAR MOE MR VSA Descriptor 04 (2.24 ≤ x < 2.45) Molecular distance edge 07 Balaban J value MOE Charge VSA Descriptor 09 (0.05 ≤ x < 0.10) gDipole MOE logP VSA Descriptor 04 (0.00 ≤ x < 0.10) Weighted Path 4 MOE logP VSA Descriptor 01 (−∞ < x < −0.40) Kier and Hall Chi path cluster index 5 gPCD MOE MR VSA Descriptor 09 (3.80 ≤ x < 4.00) Kappa3 MOE logP VSA Descriptor 08 (0.25 ≤ x < 0.30) Molecular distance edge 14

(3)

where the summation is performed over all compounds in the data set, and the normalization constant A is chosen to set the maximum Ij to 1. In actual calculations, partial derivatives in eq 2 were evaluated numerically with the central differencing scheme. The importance index introduced in such a manner provides a metric for ranking of individual descriptors included in the final model. The results of this analysis are shown in Figure 6. Tables 1, 2, and 3 list the descriptor names linked to the x-axis labels in Figure 6. Not only is each predicted quantity modeled with a different number of features but the shape of the importance index curve varies as well. Tc has the greatest number of features, and Tc/pc F

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

Figure 6. Importance indices for the predicted properties. The descriptor names corresponding to their sequential numbers on each panel are given in Tables 1, 2, and 3.

on COSMO cavity surfaces that strongly affect intermolecular interactions. Comparison with the Previous Work. Our previous models12 have shown superior performance in predicting Tc and Tc/pc as compared to several commonly used group contribution-based methods. Here, we compare the performance of our earlier models with those developed in this work. The comparison is done for the new, updated data set used here. The results, in a form of statistical distributions of deviations, are presented in Figure 7 for Tc and in Figure 8 for

nonintuitive, BCUT descriptors effectively combine molecular information relevant to modeling of a range of different properties, including normal boiling points.49 For Tc/pc, gvdWV, gPCD, Kier and Hall Chi cluster index 5, Molecular distance edge 04, and Carbon Types 5 are the five most important descriptors (as shown in Table 2). As expected,12 the van der Waals volume, gvdWV, is by far the most dominant factor in the Tc/pc correlation. It is followed by gPCD, positive-charge surface density based on the PM6Mulliken partial charges projected on the van der Waals surface. Molecular distance edges50 are calculated by CDK, which also provides the Kier and Hall descriptor that evaluates the Chi cluster index. gCOSMO-POS, Auto Correlation Charge 1, gvdWV, gGAFreq, and gPNSA1 constitute the most important descriptors for psat r (see Table 3). Auto Correlation Charge 1 is a Moreau-Broto autocorrelation descriptor51 that uses empirical partial charges. gCOSMO-POS is the partial positive charge on the COSMO cavity surface. gPNSA1 is a partial negatively charged surface area, and gGAFreq is a geometricmean vibrational frequency. Van der Waals volume is the most prominent descriptor across the board, appearing in the final cuts for all three predicted properties, ranking first for Tc, Tc/pc, and third for psat r . As it represents the measure of molecular size (and, to some extent, atomic composition), its importance is consistent with well-established empirical observations. Also of note is strong presence of COSMO-based descriptors introduced in this work. As explained in Supporting Information, these descriptors were derived from computed charge distributions

Figure 7. Comparison of deviation distribution between the current study and12 for critical temperature. δexp is the experimental uncertainty. G

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

Table 4. Comparison of Data Set Uncertainties and Deviations between the Data and Model Predictions critical temperature absolutea data set size this work (expt) DIPPR (2006)b expt all DIPPR (2013)b expt all

Figure 8. Comparison of deviation distribution between the current study and12 for the ratio of critical temperature to critical pressure. δexp is the experimental uncertainty.

ref ref ref ref ref ref ref ref ref ref

Tc/pc. As seen, the new results exhibit a noticeable improvement over the previous work. The percentage of compounds for which Tc was predicted within 6 K or experimental uncertainty increased from 78% to 91%. The percentage of compounds for which Tc/pc was predicted within 3% or experimental uncertainty slightly increased, from 82% to 83%. The decrease in the deviation distribution functions with an increase of the deviation for the experiment also appears faster for the present models. This is primarily due to improved methodology used here (better strategies and theoretical models used to generate 3D geometries, extended list of descriptors, improved approach to the model generation and feature selection). Another source of improvement is the updates to the data set due to addition of new data, correction of identified erroneous entries, and improvements in data-evaluation algorithms. The methodology based on dynamic model generation as both the quantity and the quality of data increase clearly represents a better strategy for development of estimation methods. The demonstration of the method performance in the form of discrete deviation distributions when the first “bin” also includes the cases predicted within the experimental uncertainty was used by us before12 and represents a compromise between the rigor and the relevance in a context of practical use. On the other hand, a dominant practice is to report a single-valued metric, such as correlation coefficient, average absolute deviation or root-mean-squared (RMS) deviation. These values are typically used to compare and rank the performance of different models. However, such comparisons are only meaningful in a context of error-free data (or, in practical terms, when the deviations between the model and the data significantly exceed the data uncertainty). As the accuracy of the model increases and approaches the level of the input data, the global metrics can no longer be used. One can fit the data achieving “accuracy” better than the input data uncertainty which has no practical meaning and constitutes overfitting. Considering the dominant use of single-valued metrics and recent reports of high-accuracy correlations, it is of interest to perform a systematic comparison of the values reported in the literature for the studied properties with the actual levels of experimental uncertainties. The relevant data are compiled in Table 4. For the experimental uncertainties, the data sets from the present work and two versions of the DIPPR52 database were

4 54 6 5 7 8 10 11 35 13

average

RMS

RMS

Data Set Uncertainties 917 2.7 4.9

0.40

0.70

572 1912

0.98 4.9

6.0 46

636 5.8 0.94 2116 43 4.7 Deviations between Model and Data 147 12 285 4.9 7.0 165 7.7 165 11 15 530 1.4 5.6 856 14 153 12 1230 3.7 1697 0.90 407 23 critical pressure absolutea data set size

this work (expt) DIPPR (2006)b expt all DIPPR (2013)b expt all ref ref ref ref ref ref ref

54 6 7 10 11 35 13

this work (expt) ref ref ref ref ref

55 56 35 53 13

relative (%) average

average

RMS

0.24

0.96

relative (%) average

RMS

Data Set Uncertainties 684 150 250

5.0

8.0

432 1899

2.5 7.5

130 620

480 120 2103 580 Deviations between Model and Data 269 110 200 165 160 463 20 80 139 250 1230 49 1696 0.40 382 260 acentric factor absolutea data set size average RMS Data Set Uncertainties 614 0.031 0.048 Deviations between Model and Data 219 4.2 181 3.0 1691 3.7 477 0.057 5.7 331 0.039

2.4 7.4 2.9 0.52

2.5

relative (%) average RMS 6.9

9.8

a

The units for critical temperature and critical pressure are K and kPa, respectively. bDIPPR represents uncertainties with discrete ranges;57 range medians were used for averaging.

analyzed. DIPPR represents a popular source of evaluated data frequently used in correlation development. The version from 2006 was cited by refs 35 and 53. The version from 2013 is the H

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

as follows: the present model performs better against the considered set of compounds; the performance of the model of ref 35 is good but not to the extent of the accuracy that was claimed.

most recent at the time of this writing. To maintain its standard of completeness, in addition to experimental data, the DIPPR database includes estimated values for compounds with no experimental data available. When analyzing the data, the subset based only on experimental data was used; analysis of all (combined experimental and predicted) data is also shown for illustration. DIPPR data for the acentric factor are not included due to the difficulty of separating the experiment-derived data from the estimates. The reported uncertainties are presented in the form of averages and RMS (absolute and relative). The present data sets for all three properties are larger than those based on experimental data from DIPPR due to somewhat different venues of the two projects: NIST/TRC collects any experimental data within its scope, and DIPPR focuses on the most complete characterization of the selected set of compounds while gradually expanding this selection. The averages of experimental uncertainties from the present data set and DIPPR are generally consistent within a factor of 2, and the differences between the two versions of DIPPR are very small. Combined, this information gives the scale of typical experimental uncertainty levels for large data sets. Smaller uncertainty scales can obviously be achieved within smaller subsets obtained by selective sampling based on uncertainty values. Various error metrics reported for a range of QSPR and group contribution-based estimation methods are also listed in Table 4 for all three properties. As seen, the majority of reported models does not exhibit obvious conflicts with established levels of experimental uncertainties; somewhat low error metrics observed in a few cases based on small data sets can be attributed to overall higher accuracy of these samples. However, three reports7,11,35 based on large data sets claiming the highest accuracy among other methods exhibit clear signs of overfitting across all considered properties. The problem is particularly pronounced for the critical pressure: the predictions of ref 7 show average absolute deviation that is a factor of 6 to 7 below the average experimental uncertainty,11 report average absolute deviation 2.5 to 3 times smaller than average experimental uncertainty, and35 result for the average relative deviation is a factor of 6 to 12 lower than the corresponding measure of experimental uncertainty. The data used in the study of7 were compiled from the multiple printed sources and their analysis is beyond the scope of this work. The data used by11 and35 were taken from the DIPPR database, versions 1999 and 2006, respectively. Data point statistics shown in Table 4 points to the source of the problem: both studies did not filter out nonexperimental data, and the data sets used contained about 60% to 70% of predicted data. Formally, this makes the problem even worse since, as seen in Table 4, addition of estimated data points greatly increases the overall data uncertainties, thus resulting in even greater degree of overfitting. It should be noted that overfitting is normally expected to be detected and prevented with the statistical procedures applied in a course of machine learning (use of testing set, cross-validation, etc.). However, these procedures fail when the data set is dominated by the predicted data exhibiting errors that are mainly systematic in nature. As an additional verification of the above findings, we performed a comparison of the predictions of the present model and the model of35 using the experimental data from this study. Compounds common for both studies were selected for each property; the resulting data sets included 609, 507, and 484 compounds for Tc, pc, and ω, respectively. Detailed results are given in Supporting Information, and the main conclusions are



CONCLUDING REMARKS A system for generation of empirical property-estimation methods based on large experimental data sets is presented. The system includes software-based critical evaluation of available experimental data stored in a large-scale data archival system, systematic generation of three-dimensional molecular models from stored two-dimensional structure definitions, calculation of molecular descriptors, and development of correlations using machine learning methods. Correlations for three properties, the critical temperature, the ratio of critical temperature and critical pressure, and the reduced saturation pressure (related to the acentric factor), were produced. Feature (variable) selection for the final models was performed using multiobjective genetic algorithm. Generated models exhibited excellent performance against the available experimental data as well as other estimation methods available. The predictive accuracy of the models was also confirmed with the k-fold cross-validation. During feature selection, it was shown that, for large experimental data sets used here, one needs to use relatively large number of features to effectively utilize the available information and produce more accurate models. Traditionally, QSPR models were limited to just a few variables; here, the resulting correlations, after feature selection, included between 11 and 33 variables. Several descriptors derived directly from quantum-chemical calculations and typically not used in QSPR were found to be useful for modeling of critical properties and acentric factors. Descriptors based on charge distributions on COSMO cavity surfaces were, expectedly, among the most important variables. Some added descriptors related to vibrational and spectroscopic analysis were also found to be important (geometric mean vibrational frequency, average vibrational frequency weighted by IR intensity, total IR intensity). Also of note are identified cases of overly optimistic accuracy claims made in the recent literature. The problem originates from the misuse of the reference data sources that resulted in overfitting. It appears that modern machine learning methods reached the point when empirical models for property predictions can be generated very efficiently. However, the quality of the produced models is still controlled by the quantity and the quality of the input data.



ASSOCIATED CONTENT

* Supporting Information S

Table of in-house descriptors, table of all calculated descriptors, and figures comparing deviation distributions with other studies. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone:(303)497-4648. Fax: (303)497-6682. Funding

This research was funded in part by the U.S. Department of Energy, Office of Energy Efficiency and Renewable Energy, I

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

(15) Ivanciuc, O. In Reviews in Computational Chemistry; Lipkowitz, K. B., Cundari, T. R., Eds.; Vol. Wiley-VCH: Weinheim, 2007; Vol 23, pp 291−400. (16) McLinden, M. O.; Kazakov, A. F.; Brown, J. S.; Domanski, P. A thermodynamic analysis of refrigerants: Possibilities and tradeoffs for Low-GWP refrigerants. Int. J. Refrig. 2014, 38, 80−92. (17) CTfile Formats. Accelrys, 2011. http://accelrys.com/products/ informatics/cheminformatics/ctfile-formats/no-fee.php (accessed Feb 2015). (18) Kazakov, A.; Muzny, C. D.; Kroenlein, K.; Diky, V.; Chirico, R. D.; Magee, J. W.; Abdulagatov, I. M.; Frenkel, M. NIST/TRC SOURCE Data Archival System: The Next-Generation Data Model for Storage of Thermophysical Properties. Int. J. Thermophys. 2012, 33, 22−33. (19) Gilbert, K., Guha, R. SMI23D3D Coordinate Generation, 2010. http://sourceforge.net/p/cicc-grid/code/HEAD/tree/cicc-grid/ smi23d (accessed Feb 2015). (20) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An open chemical toolbox. J. Cheminf. 2011, 3, 33. (21) The Open Babel package, 2014. https://github.com/ openbabel/openbabel (accessed Feb 2015). (22) Landrum, G. RDKit: Open-source cheminformatics. http:// www.rdkit.org (accessed Feb 2015). (23) RDKit, version Q3, 2013. http://sourceforge.net/projects/ rdkit/files (accessed Feb 2015). (24) Lagorce, D.; Pencheva, T.; Villoutreix, B. O.; Miteva, M. DGAMMOS: A New tool to generate 3D conformation of small molecules using Distance Geometry and Automated Molecular Mechanics Optimization for in silico Screening. BMC Chem. Biol. 2009, 9, 6. (25) Vainio, M. J.; Johnson, M. S. Generating conformer ensembles using a multiobjective genetic algorithm. J. Chem. Inf. Model. 2007, 47, 2462−2474. (26) Balloon package, version 1.5.0.1143.2014. http://users.abo.fi/ mivainio/balloon/ (accessed Feb 2015). (27) Ponder, J. W. TINKER: Software Tools for Molecular Design. Version 5.1.09. 2010. (28) Halgren, T. J. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 1996, 17, 490−519. (29) IUPAC International Chemical Identifier (InChI) Programs. InChI version 1, software version 1.04.2011. http://www.inchi-trust. org (accessed Feb 2014). (30) O’Boyle, N. M.; Vandermeersch, T.; Flynn, C. J.; Maguire, A. R.; Hutchison, G. R. ConfabSystematic generation of diverse lowenergy conformers. J. Cheminf 2011, 3, 8. (31) Stewart, J. J. P. Optimization of parameters for semiempirical methods V: Modification of NDDO approximations and application to 70 elements. J. Mol. Model. 2007, 13, 1173−1213. (32) Puzyn, T.; Suzuki, N.; Haranczyk, M.; Rak, J. J. Calculation of Quantum-Mechanical Descriptors for QSPR at the DFT Level: Is It Necessary? Chem. Inf. Model. 2008, 48, 1174−1180. (33) Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Mennucci, B.; Petersson, G. A.; Nakatsuji, H.; Caricato, M.; Li, X.; Hratchian, H. P.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, J. A.Jr.; Peralta, J. E.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; Millam, J. M.; Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.; Farkas, O.; Foresman, J. B.; Ortiz, J. V.; Cioslowski, J.; Fox, D. J. Gaussian 09, Revision C.01; Gaussian, Inc.: Wallingford, CT, 2010.

under Grant No. DE-EE0002057, and by the Professional Research Experience Program (PREP) Postdoctoral Fellowship, a collaboration between the University of Colorado at Boulder and the National Institute of Standards and Technology. Notes

The authors declare no competing financial interest. Contribution of the U.S. National Institute of Standards and Technology and not subject to copyright in the United States. Trade names are provided only to specify procedures adequately and do not imply endorsement by the National Institute of Standards and Technology. Similar products by other manufacturers may be found to work as well or better.



REFERENCES

(1) Frenkel, M. Global Information Systems in Science: Application to the Field of Thermodynamics. J. Chem. Eng. Data 2009, 54, 2411− 2428. (2) Frenkel, M.; Chirico, R. D.; Diky, V. V.; Marsh, K. N.; Dymond, J. H.; Wakeham, W. A. ThermoML: An XML-Based Approach for Storage and Exchange of Experimental and Critically Evaluated Thermophysical and Thermochemical Property Data. 3. Critically Evaluated Data, Predicted Data, and Equation Representation. J. Chem. Eng. Data 2004, 49, 381−393. (3) Poling, B. E.; Prausnitz, J. M.; O’Connell, J. P. The Properties of Gases and Liquids, 5th ed.; McGraw-Hill: New York, 2000. (4) Egolf, L. M.; Wessel, M. D.; Jurs, P. C. Prediction of boiling points and critical temperatures of industrially important organic compounds from molecular structure. J. Chem. Inf. Comput. Sci. 1994, 34, 947−956. (5) Katritzky, A. R.; Mu, L.; Karelson, M. Relationships of critical temperatures to calculated molecular properties. J. Chem. Inf. Comput. Sci. 1998, 38, 293−299. (6) Turner, B. E.; Costello, C. L.; Jurs, P. C. Prediction of critical temperatures and pressures of industrially important organic compounds from molecular structure. J. Chem. Inf. Comput. Sci. 1998, 38, 639−645. (7) Espinosa, G.; Yaffe, D.; Arenas, A.; Cohen, Y.; Giralt, F. A fuzzy ARTMAP-based quantitative structure-property relationship (QSPR) for predicting physical properties of organic compounds. Ind. Eng. Chem. Res. 2001, 40, 2757−2766. (8) Yao, X.; Zhang, X.; Zhang, R.; Liu, M.; Hu, Z.; Fan, B. Radial basis function neural network based QSPR for the prediction of critical pressures of substituted benzenes. Comput. Chem. 2002, 26, 159−169. (9) Yang, S.; Lu, W.; Chen, N.; Hu, Q. Support vector regression based QSPR for the prediction of some physicochemical properties of alkyl benzenes. J. Mol. Struct.: THEOCHEM 2005, 719, 119−127. (10) Sola, D.; Ferri, A.; Banchero, M.; Manna, L.; Sicardi, S. QSPR prediction of N-boiling point and critical properties of organic compounds and comparison with a group-contribution method. Fluid Phase Equilib. 2008, 263, 33−42. (11) Godavarthy, S. S.; Robinson, R. L. J.; Gasem, K. A. M. Improved structure-property relationship models for prediction of critical properties. Fluid Phase Equilib. 2008, 264, 122−136. (12) Kazakov, A.; Muzny, C. D.; Diky, V.; Chirico, R. D.; Frenkel, M. Predictive correlations based on large experimental datasets: Critical constants for pure compounds. Fluid Phase Equilib. 2010, 298, 131− 142. (13) Mokshina, E. G.; Kuz’min, V. E.; Nedostup, V. I. QSPR Modeling of Critical Parameters of Organic Compounds Belonging to Different Classes in Terms of the Simplex Representation of Molecular Structure. Russ. J. Org. Chem. 2014, 50, 314−321. (14) Grigoras, S. A structural approach to calculate physicalproperties of pure organic-substances - the critical temperature, critical volume and related properties. J. Comput. Chem. 1990, 11, 493−510. J

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX

Journal of Chemical & Engineering Data

Article

(56) Constantinou, L.; Gani, R.; O’Connell, J. P. Estimation of the acentric factor and the liquid molar volume at 298 K using a new group contribution method. Fluid Phase Equilib. 1995, 103, 11−12. (57) Ericksen, D.; Wilding, W. V.; Oscarson, J. L.; Rowley, R. L. Use of the DIPPR Database for Development of QSPR Correlations: Normal Boiling Point. J. Chem. Eng. Data 2002, 47, 1293−1302.

(34) Frenkel, M.; Chirico, R. D.; Diky, V.; Yan, X. J.; Dong, Q.; Muzny, C. ThermoData Engine (TDE): Software implementation of the dynamic data evaluation concept. J. Chem. Inf. Model. 2005, 45, 816−838. (35) Gharagheizi, F.; Eslamimanesh, A.; Mohammadi, A. H.; Richon, D. Determination of Critical Properties and Acentric Factors of Pure Compounds Using the Artificial Neural Network Group Contribution Algorithm. J. Chem. Eng. Data 2011, 56, 2460−2476. (36) Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The Chemistry Development Kit (CDK): an opensource Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci. 2003, 43, 493−500. (37) Schölkopf, B., Bartlett, P., Smola, A., Williamson, R. In Advances in Neural Information Processing Systems; Kearns, M. S., Solla, S. A., Cohn, D. A., Eds.; MIT Press: Cambridge, MA, 199911330−336. (38) Pedregosa, F.; Weiss, R.; Brucher, M. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830. (39) Chirico, R. D.; Frenkel, M.; Diky, V. V.; Marsh, K. N.; Wilhoit, R. C. ThermoML-An XML-based approach for storage and exchange of experimental and critically evaluated thermophysical and thermochemical property data. 2. Uncertainties. J. Chem. Eng. Data 2003, 48, 1344−1359. (40) Martin, T. M.; Harten, P.; Young, D. M.; Muratov, E. N.; Golbraikh, A.; Zhu, H.; Tropsha, A. Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling? J. Chem. Inf. Model. 2012, 52, 2570−2578. (41) Fortin, F.-A.; de Rainville, F.-M.; Gardner, M.-A.; Parizeau, M.; Gagne, C. DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. 2012, 13, 2171−2175. (42) Wisnowski, J. W.; Simpson, J. R.; Montgomery, D. C.; Runger, G. Resampling methods for variable selection in robust regression. Comput. Stat. Data Anal. 2003, 43, 341−355. (43) Hajipour, S.; Satyro, M. Uncertainty analysis applied to thermodynamic models and process design. 1. Pure Components. Fluid Phase Equilib. 2011, 307, 78−94. (44) Hajipour, S.; Satyro, M.; Foley, M. Uncertainty analysis applied to thermodynamic models and process design. 2. Binary Mixtures. Fluid Phase Equilib. 2014, 364, 15−20. (45) Lee, A. Real-time Latin-hypercube sampling-based Monte Carlo error propagation. https://github.com/tisimst/mcerp (Feb 2014). (46) Klamt, A.; Shüürmann, G. COSMO: A new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J. Chem. Soc., Perkin Trans. 1993, 2, 799−805. (47) Pearlman, R. S.; Smith, K. M. Novel Software Tools for Chemical Diversity. Perspect. Drug Discovery Des. 1998, 9−11, 339− 353. (48) Burden, F. R. Molecular identification number for substructure searches. J. Chem. Inf. Comput. Sci. 1989, 29, 225−227. (49) Stanton, D. T. Evaluation and Use of BCUT Descriptors in QSAR and QSPR Studies. J. Chem. Inf. Comput. Sci. 1999, 39, 11−20. (50) Liu, S.; Cao, C.; Li, Z. Approach to Estimation and Prediction for Normal Boiling Point (NBP) of Alkanes Based on a Novel Molecular Distance-Edge (MDE) Vector, lambda. J. Chem. Inf. Comput. Sci. 1998, 38, 387−394. (51) Moreau, G.; Broto, P. The autocorrelation of a topological structure: A new molecular descriptor. Nouv. J. Chim. 1980, 4, 359− 360. (52) Evaluated Standard Thermophysical Property Values. DIPPR (Design Institute for Physical Properties) 801. 2013 (53) Wang, Q.; Jia, Q.; Ma, P. Prediction of the Acentric Factor of Organic Compounds with the Positional Distributive Contribution Method. J. Chem. Eng. Data 2012, 57, 169−189. (54) Constantinou, L.; Gani, R. New Group-Contribution Method for Estimating Properties of Pure Compounds. AIChE J. 1994, 40, 1697−1710. (55) Han, B.; Peng, D.-Y. A Group-Contribution Correlation for Predicting the Acentric Factors of Organic Compounds. Can. J. Chem. Eng. 1993, 71, 332−334. K

DOI: 10.1021/je501093v J. Chem. Eng. Data XXXX, XXX, XXX−XXX