Reliable and Versatile Model for the Density of ... - ACS Publications

Jump to Additive Fragment Volumes - The volume increments Vk derived from the fit of the model against the training set are reported in Table 1 with ...
1 downloads 0 Views 1010KB Size
Correlation pubs.acs.org/IECR

Reliable and Versatile Model for the Density of Liquids Based on Additive Volume Increments Didier Mathieu and Rémi Bouteloup* CEA, DAM, Le Ripault, 37260 Monts, France S Supporting Information *

ABSTRACT: A new procedure is put forward to predict the standard density (ρ) of pure fluids, using a newly compiled database, including 8870 organic and metal−organic compounds. This model is simpler, easier to apply, and more general than existing methods, being applicable to a broad range of compounds, including elements C, H, N, O, F, Cl, Br, I, B, Al, Si, P, S, Ge, Se, Sn, Pb, and Ti. The average relative error of 1000 compounds, ranging from 1.3% to slightly over 2%.12−15 In the case of the GCVOL method and its Oldenburg revision GCVOL-OL,12−14 this good performance is further supported by subsequent applications to new compounds.29−31 Earlier additivity schemes also yield fair predictions.3

Figure 1. Schematic representation of the relative performance of available methods to estimate liquid densities.

More approximate densities are obtained using semiempical methods based on three-dimensional (3D) van der Waals models, with ARE values ranging from 2.4%24 to 3%.32 The advantage of these methods stems from the very small number of fitting parameters involved and the insight provided into the molecular factors that determining the values actually observed for ρ.24 Finally, QSPR methods prove to be the least reliable. For instance, models based on restricted pools of quantum chemical descriptors yield ARE values of >3.5%.16,17 Similar approaches based on QSPR methodologies implemented in the CODESSA software package do not demonstrate significantly better performances.18,19 Another QSPR study based on topological indices reported AAE values of 3.9% and 3.3% for models based respectively on multilinear regression and neural networks.20 Lower ARE values are only obtained for restricted families of compounds. For instance, a QSPR model focused on amines yields AAE = 2.7%.21 In contrast, a recent QSPR study based on a large dataset of 1741 liquids comprising 76 diverse chemical classes yields disappointing results (AREV = 8.3%), despite the use of advanced techniques, including the consideration of a pool of 3224 descriptors and the use of a biologically inspired rankbased ant system.1 This relatively poor performance of QSPR methods comes as no surprise. These techniques are remarkably powerful to build predictive models for complex properties not amenable to theoretical approaches and for which extensive amounts of data are available, especially in biochemical sciences. However, for the physicochemical properties of condensed phases, taking advantage of elementary physical considerations, even in a very approximate manner, systematically leads to improved predictions, as observed for many properties.33−36 With regard to the prediction of liquid densities, the superior performance of GC methods comes at the cost of an extensive parametrization and, consequently, reduced generality. Such methods are only applicable to compounds made of the most common functional groups, as emphasized by Bagheri et al.1 Another drawback of these methods lies in fact that the fragmentation of molecules into groups may be ambiguous and error prone in the lack of dedicated software. For this reason, there were not considered in a recent comparison of predictive models for liquid densities.3 Moreover, all currently existing methods are fitted and validated against datasets with typically 100−1000 compounds, despite the fact that experimental ρ values are available for thousands of liquid compounds proposed by vendors of chemical products. 12971

DOI: 10.1021/acs.iecr.6b03809 Ind. Eng. Chem. Res. 2016, 55, 12970−12980

Correlation

Industrial & Engineering Chemistry Research



PRESENT MODELING APPROACH The following three ingredients should contribute to overcoming the limitations of predictive tools available in the current literature: (1) taking fuller advantage of the great deal of experimental data available; (2) sticking to a physically sound approach consistent with the extensive character of Vm (the fact that it scales linearly with the molecular size); and (3) minimizing overfitting issues through a suitable definition of the additive contributions to Vm. An approach along these lines was previously applied to the prediction of crystal densities.37 The resulting model is among the most reliable and general, providing performance on par with more heavily parametrized GC methods.38−40 The present one for liquids follows exactly the same strategy. It is fitted and validated using a specially large database described in the next section and provided as Supporting Information. Unlike existing additivity schemes for Vm, such as the Girolami method,10 GCVOL and derivatives12−14 or GC+,15 the present one relies on contributions specifically designed in view of providing reliable predictions with a relatively small number of adjustable parameters. To this objective, the contribution Vk of any atom k to the molar volume is assumed to be dependent only on three parameters, namely (1) the atomic number Zk of the atom; (2) its coordination number nk; and (3) the number nHk of hydrogen atoms among the nk neighbors. Hydrogen atoms are distinguished from the others because of their especially small size. In addition to the empirical formula of the compound, this approach accounts for the overlap between van der Waals atomic spheres associated with atoms either covalently bonded or in geminal positions. With these assumptions, the procedure to split a molecule into additive fragments is very simple. Indeed, each fragment consists in a non-hydrogen atom and all attached hydrogen neighbors. Therefore, a single molar volume increment Vk = V(Z,n,nH) is introduced for all non-hydrogen atoms k that share a common atomic number (Zk = Z), a common coordination number (nk = n), and a common number of hydrogen atoms attached (nHk = nH). These increments Vk are fitted against a training set of experimental data. In contrast to most GC schemes, the present fragmentation procedures ignores bond orders. Therefore, a single parameter V(C,3,1) is used for any sp2 carbon atom in either benzene or 2butene. Similarly, V(C,2,0) applies to sp carbons in both 2butyne (−C≡) and allene (C). For simplicity, we use notations such as −CH and −C≡ for such fragments. It must then be kept in mind that they may represent similar substructures that differ only through the bond orders. To account for the possible influence of rings on the molar volume, four additional volume increments are introduced. They are denoted V6 for rings with, respectively, 6 atoms. Finally, Vm is simply obtained as follows: Vm =

∑ V (Zk , nk , nH ) + Vr k

k

where n6 represent the number of rings with, respectively, 6 atoms. V6 are the corresponding volume increments. Finally, na is the number of aromatic rings in the molecule, and Va is a volume increment that accounts for the role of aromaticity on the molar volume. Because the additive contributions to Vm are designed on the basis of simple geometrical considerations, they have been referred to as geometrical fragments.41,42 Accordingly, the present model is referenced as the GF method throughout this paper. Quite remarkably, this approach which involves only trivial additions is even simpler than the Girolami method.10 However, as shown in what follows, it is significantly more reliable due to a more extensive and physically motivated parametrization. Its utmost simplicity is illustrated by the worked-out examples provided as Supporting Information.



DATABASE Most existing procedures to predict liquid densities rely on datasets including a few hundred compounds or even less. Notable exceptions are the recent QSPR model of Bagheri et al.,1 based on 1741 compounds from the DIPPR database,43 as well as GCVOL-OL14 and GC+,15 which are fitted against over 1000 compounds from the DDB-Pure44 and CAPEC45 databases, respectively. Density values taken from such critically evaluated databases are deemed reliable, which is clearly advantageous for model development. However, it was recently pointed out that data availability is likely to be at least as important as data quality to develop accurate QSPRs.46 This result may be explained by the fact that noise in the data is averaged out in large datasets. Therefore, data availability is probably of primary importance for additivity methods as well. In order to get the most from existing ρ data, for this work, we have compiled an extensive set of measured densities coming from a variety of sources: the DIPPR database already used by Bagheri et al.,1 the Hazardous Substance Data Bank,47 the Aldrich catalog,48 and web sites compiling data from chemical vendors and literature.49−55 Other data retrieved from these databases, when available, are the melting and boiling points (MP and BP) of the compounds and their physical state under ambient conditions. The curation of the database is carried out by first discarding salts and mixtures, as well as compounds explicitly reported to be solids or gases, or for which values of >27 °C for MP or 0.001 g/cm3 for any given compound come from distinct measurements. From all 5015 compounds with distinct ρ values reported, we considered the lowest (ρmin) and largest (ρmax) measured densities. Furthermore, we restrict our attention to the 5001 entries with ρmin > 0.5 g/cm3 and ρmax − ρmin < 0.5 g/cm3. From this data, the following average values are obtained: 0.017 g/cm3 for Δρ = ρmax − ρmin and 1.5% for δρ = (ρmax/ρmin) − 1. The distribution of Δρ is shown on Figure 2. The fact that a smaller

Figure 2. Distribution of Δρ values for the present database.

number of Δρ values is found in the first bin (0−0.005 g/cm3), compared to the second bin (0.005−0.010 g/cm3), stems from the fact that many values ≤0.001 g/cm3 are discarded because they are assumed to be derived from the same measurement, as mentioned above. These data give us a clear appreciation of the uncertainties associated with the experimental densities used in this work. In particular, it is clear than the ARE of any density prediction scheme fitted against the present data cannot be BH −Br C −C ≡CH >C −CH CH2 >C< −CH< >CH2 −CH3 −Cl −F >Ge< −GeH< −I N −N NH −N< >NH −NH2 O >O −OH −P< >P< −PH< >Pb< S >S −SH >S >S< >Se >Si< −SiH< >Sn< >Ti
BH, which is depends on only two compounds in the training set. Similarly, the fragments >Al−, >GeH−, >PbS, and >Ti< are represented in only four compounds. The last quantity reported in Table 1 is the intrinsic density ρk of individual fragments, which is defined as the ratio of the mass of the fragment (including hydrogen atoms in addition to the central non-hydrogen atom) to the corresponding volume increment. These data allow molecular designers to know immediately whether adding a given fragment to a molecule will increase or decrease the density. If ρ is the density of a molecule, the addition of a fragment k will increase this density if and only if

PARAMETER FITTING The only adjustable parameters of the present GF method are the additive volume increments involved in eqs 1 and 2. Their values and associated standard errors are straightforwardly obtained through a multilinear ordinary least-squares regression aimed at minimizing the root-mean-square error on calculated molar volumes, presently denoted RMSEV. This regression is carried out using the Statsmodel package.56



ADDITIVE FRAGMENT VOLUMES The volume increments Vk derived from the fit of the model against the training set are reported in Table 1 with corresponding standard deviations σk. In this table, Nocc,k is the 12973

DOI: 10.1021/acs.iecr.6b03809 Ind. Eng. Chem. Res. 2016, 55, 12970−12980

Correlation

Industrial & Engineering Chemistry Research the fragment density ρk is larger than ρ. Obviously, fragments associated with negative volume increments necessarily increase the density. At first glance, it might appear surprising to note that some fragments exhibit negative volume increments. This is the case when the central atom is bonded to several non-hydrogen neighbors. Such negative values clearly arise because of the enhanced interatomic overlap between the neighboring atoms in geminal positions. As any volume increment Vk represents the contribution of the corresponding fragment k to the molar volume Vm, it can be split into two components: an intramolecular component Vint,k arising from the contribution of the fragment to the van der Waals volume of the molecule, and an intermolecular component Vext,k arising from its contribution to the empty space between the van der Waals spheres of neighboring molecules. Therefore, every increment Vk is dependent on the atomic number Zk of the central atom through its van der Waals radius rw,k, or equivalently through the corresponding van der Waals volume Vw,k = (4/ 3)πrw,k3. This dependence is illustrated on Figure 3. Since both Vint,k and Vext,k should increase with Vw,k, it is no surprise to note that Vk exhibits a similar increase.

(which is dependent on only four measurements) and, to a lesser extent, −P< (trivalent P in phosphites). On the other hand, the hydroxyl and primary amine groups (−OH and −NH2) exhibit especially small Vk values, because of the fact they include labile protons whose contributions to Vm are very small, as emphasized previously.24 In view of the correlation observed between Vk and Vw,k, it should be possible to derive a variant of the present model where the volume increments are expressed directly as a function of van der Waals radii, rather than fitted individually for the various elements. Regarding the Vk parameters listed in Table 1, it may also be noted that hydrogens bonded to large atoms do not significantly affect Vm. For instance, V(S,1,0) ≃ V(S,2,1). In other words, a sulfur atom bonded to a single neighbor through a double bond (S), as in thioketones and thioaldehydes, occupies roughly the same volume as the thiol group (−SH). Similarly, considering V(P,3,0) and V(P,4,1), it may be observed that only a very small decrease of the fragment volume is observed upon going from >P− (tricoordinated P atoms not bonded to any hydrogen, such as that observed in tertiary phosphines) to >PH− (tetracoordinated P atoms bonded to one H atom, such as that observed in dimethyl phosphite). Another pair of volumes very similar are V(N,2,0) (typically  N− moieties in imines) and V(N,3,1) (>NH in secondary amines). Here, the H atom induces only a very small volume increase. This could be explained by a contraction of the volume of the lone pair as an additional H atom is attached to nitrogen. Additional systematic trends may be noted for the dependence of V(Z,n,nH) on nH, as shown in Figure 4. In this plot, nH is varied

Figure 3. Dependence of Vk on the van der Waals volume of the central atom k (Vw,k). Equality between Vk and Vw,k is shown by a dashed line. Straight lines are obtained through linear regressions for every possible chemical environement of the central atom, denoted n/nHH where n and nH are, respectively, its coordination number and the number of hydrogen neighbors. Figure 4. Dependence of Vk on the number of hydrogen atoms attached to the central atom.

Upon going from Vw,k to the actual increment Vk, the volume increases as Vext,k gets included, along with the volume of neighboring hydrogen atoms, and decreases due to the overlap of the central atom with non-hydrogen neighbors. Therefore, it comes as no surprise that Vk can be either smaller or larger than Vw,k, as illustrated on Figure 3. However, it is especially interesting to note that Vk may be estimated as ak Vw,k − bk, where the regression coefficients ak and bk are dependent only on the numbers of hydrogen and nonhydrogen neighbors. Furthermore, ak is essentially independent of the environment of the atom, with a typical value of 1.6 suitable for almost all fragments. The only four exceptions are outlined on Figure 3. Two fragments exhibit a larger Vk value than expected from their van der Waals radii, namely, >Pb
O to −OH yields a smaller volume increase as H atoms bonded to O atoms occupy virtually no volume in liquids.24 The consistency of the volume increments plotted on Figure 4 contributes to validate their values and transferability. For instance, it was mentioned above that V(B,3,1) might be illdefined as the corresponding value of 22.09 cm3/mol is dependent on only two compounds. However, the fact that this value is ∼14.5 cm3/mol larger than V(B,3,0), which is consistent with the general trend, attracts confidence. Other trends are shown on Figure 5, where the volume increment V(Z,n,nH) is plotted against n for a constant number of

despite the fact that it slightly deteriorates the statistics for the training set.



RING CORRECTIONS The above-mentioned observations regarding the volume of the additive fragments considered here are reminiscent of corresponding data derived previously to predict the density of organic crystals.37 In this earlier study, rings increments were motivated by the need to account for a decrease of crystal volumes associated with the enhanced overlap of the atoms involved in the ring. The magnitude of this volume contraction steadily increases from ∼0.5 cm3/mol to >5 cm3/mol on going from 3-membered rings to large ones with 8 atoms or more.37 Ring corrections are also introduced in the present GF model for liquids. However, a completely different picture emerges regarding their influence on Vm. First, it turns out that rings increase the molar volume of liquids, in sharp contrast to their influence on crystal volumes. As shown in Table 2, the Table 2. Ring Corrections, Associated Standard Deviation, and Number of Occurrences ring increment correction, V V6 Va

10.89 9.41 6.89 3.75 1.82

associated standard deviation, σ (cm3/mol)

number of occurrences, Nocc

0.32 0.23 0.22 0.56 0.36

124 386 2049 42 1920

corresponding volume increments are positive. This implies that the above-mentioned volume contraction associated with the enhanced interatomic overlap of van der Waals spheres plays only a secondary role in fluids. This is all the more obvious as the magnitude of ring corrections is significantly larger for liquids, by comparison with crystals.37 Another difference is the fact these corrections are most important for small rings (starting from a maximum value close to 11 cm3/mol for 4-membered rings) and decrease down to