Ind. Eng. Chem. Res. 2010, 49, 10149–10152


Prediction of Henry’s Law Constant of Organic Compounds in Water from a New Group-Contribution-Based Model

Farhad Gharagheizi,*,†,‡ Reza Abbasi,‡ and Behnam Tirandazi§

Department of Chemical Engineering, Faculty of Engineering, University of Tehran, P.O. Box 11365-4563, Tehran, Iran, Saman Energy Giti Co., Postal Code 3331619636, Tehran, Iran, and Department of Chemical Engineering, Iran University of Science and Technology, Tehran, Iran

In this work, a new model is presented for estimation of the Henry’s law constant of pure compounds in water at 25 °C (H). The model combines a group contribution method with neural networks. The parameters needed by the model are the numbers of occurrences of a new collection of 107 functional groups. On the basis of these 107 functional groups, a feed forward neural network is presented to estimate the H of pure compounds. The squared correlation coefficient, absolute percent error, standard deviation error, and root-mean-square error of the model over a diverse set of 1940 pure compounds are, respectively, 0.9981, 2.84%, 2.4, and 0.1 (all values obtained using log H based data). Therefore, the model is comprehensive and accurate and can be used to predict the H of a wide range of chemical families of pure compounds in water better than previously presented models.

Introduction

One of the key processes affecting the fate of many organic compounds in the environment is the transfer of chemicals between the air and aqueous phases.1 One of the most important parameters used to describe this transfer is the Henry’s law constant of a compound in water, denoted by H.1,2 The H is usually referred to as the ratio of a chemical’s concentration in water to its concentration in air, so reliable data for H are needed to track the fate of chemicals in the environment. Generally, accurate measurement of H is difficult and expensive because of the adsorption of minute amounts of solute on the walls of the apparatus and the analytical detection limits for the low concentrations of very hydrophobic compounds.3,4 Therefore, an accurate estimation method for H is of great importance.

A number of methods have been presented to directly estimate the H of pure compounds in water from chemical structure.
It should be noted that there are some indirect methods for estimating H from other vapor-liquid equilibrium data such as activity coefficients,5,6 but the application of those methods to the estimation of H has not been rigorously evaluated, so in the present work we focus on methods that directly estimate the H of pure compounds from their chemical structure. These correlations can be classified into two main classes based on the type of parameters they use.

Class 1 includes correlations that use other physical properties of the compound, such as vapor pressure and aqueous solubility, to estimate H. The best-known method of this class is the correlation presented by Mackay et al.7 These correlations have some important disadvantages. Their accuracy is directly tied to the accuracy of the required physical properties, or of the methods used to estimate those properties. Furthermore, if even one of the required properties is missing, H cannot be calculated.

* To whom correspondence should be addressed. Fax: +98 21 77926580. E-mail: [email protected]; [email protected]. † University of Tehran. ‡ Saman Energy Giti Co. § Iran University of Science and Technology.

Class 2 contains correlations called quantitative structure-property relationships (QSPR), which use only molecular-based parameters to predict H. The best-known correlations of this class are those presented by Hine and Mookerjee,8 Meylan and Howard,9,10 Abraham et al.,11 Katritzky et al.,12 Dearden et al.,13-15 English and Carrol,16 Yao et al.,17 Lin and Sandler,18 and Yaffe et al.19 The most important disadvantage of the majority of these correlations is the complex procedure needed to compute their molecular-based parameters, so most of them are not simple to apply. The simplest correlations of this class appear to be the group contribution (GC) methods, in which the numbers of occurrences of several functional groups are used to estimate various physical properties.

In this study, a new comprehensive model is presented to estimate the H of pure compounds. It combines a new collection of functional groups, used as the parameters of the model, with neural networks, used to develop the model.

Materials and Methods

Materials. The comprehensiveness of a molecular-based model is directly related to the comprehensiveness of the data set of compounds used in its development. This comprehensiveness covers both the diversity of the chemical families and the number of compounds in the data set. Our literature survey showed that one of the most comprehensive data sets for the H of pure compounds is the compilation provided by Yaws,20 so the 1940 pure compounds found in the handbook and their H values were extracted and used as the main data set in this work. It should be noted that the H values were compiled in units of atm·m3·mol-1 (mol fraction basis) and presented as the decimal logarithm of H at 25 °C. The values range from -13.461 to 6.238.
On the basis of our literature survey, this is the most comprehensive data set ever used to develop a model for predicting the H of pure compounds in water. The data set is presented as Supporting Information.

10.1021/ie101532e © 2010 American Chemical Society. Published on Web 09/21/2010.

Ind. Eng. Chem. Res., Vol. 49, No. 20, 2010

Figure 1. Schematic structure of the three-layer feed forward neural network used in this study.

Developing New Group Contributions. After providing the data set, the chemical structures of all 1940 compounds were
analyzed, and finally 107 functional groups were found to be useful for estimating H. It should be noted that some of these 107 functional groups have been used in earlier group contribution methods for estimating H, but this particular set of 107 functional groups has not yet been used to estimate any physical property. The functional groups found and used in this study are presented in full as Supporting Information, as are their numbers of occurrences in each of the 1940 pure compounds. The numbers of occurrences of these functional groups are used as the input parameters of the model.

Development of Neural Network-Group Contribution Model. After providing the group contributions table, the problem is defined as finding a relationship between these groups and H. The simplest approach is to assume a multilinear relationship between the groups and H, which is the general approach of the classic group contribution method. Applied to this problem, that approach failed because the resulting multilinear model lacked accuracy. Therefore, nonlinear methods such as neural networks were considered. Neural networks are used extensively in various scientific and engineering areas, including the estimation of physical and chemical properties.21 These powerful tools are usually applied to the study of complicated systems such as the problem defined here. Theoretical explanations of neural networks can be found in many references, such as ref 22. Using the Neural Network toolbox of MATLAB (Mathworks Inc.), three-layer feed forward neural networks were evaluated for the problem. The typical structure of a three-layer feed forward neural network is shown schematically in Figure 1.
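As a point of reference for the classic approach mentioned above, the multilinear group-contribution form (an intercept plus a sum of occurrence counts times per-group contributions) can be sketched in a few lines of Python. This is an illustration only: the function name, the three-group vector, and the contribution values are hypothetical placeholders, not fitted parameters from this work.

```python
from typing import Sequence


def multilinear_gc_estimate(counts: Sequence[int],
                            contributions: Sequence[float],
                            intercept: float = 0.0) -> float:
    """Classic multilinear group-contribution form: intercept + sum(n_i * c_i)."""
    if len(counts) != len(contributions):
        raise ValueError("counts and contributions must have the same length")
    return intercept + sum(n * c for n, c in zip(counts, contributions))


# Hypothetical three-group molecule; all numbers are made up for illustration.
counts = [2, 1, 0]                 # occurrences of each functional group
contributions = [0.5, -1.25, 3.0]  # placeholder per-group contributions
estimate = multilinear_gc_estimate(counts, contributions, intercept=-1.0)
print(estimate)
```

In the paper's setting the count vector would have 107 entries, one per functional group.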
This type of neural network has been used by the authors in previous works; detailed explanations of the three-layer feed forward neural networks used in this study can therefore be found elsewhere.23-38

All 107 functional-group inputs and the log H values should be normalized between -1 and +1 to decrease computational errors. This can be done using the maximum and minimum values of each of the 107 functional groups (inputs) and the maximum and minimum values of log H (output). After this step, the main data set is divided into three new data sets: a training set, a validation set, and a test set. The training set is used to generate the neural network, the validation set is used to optimize it, and the test set is used only to check the validity of the obtained model. The division of the main data set is usually performed randomly. For this purpose, 85%, 10%, and 5% of the main data set were randomly assigned to the training set (1649 compounds), the validation set (194 compounds), and the test set (97 compounds), respectively. The effect of the allocation percentages of the training, validation, and test sets on the accuracy of the neural networks has been studied by the author.35
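The two preprocessing steps described above, min-max scaling onto [-1, +1] and a random 85/10/5 split, can be sketched in plain Python as follows. This is a minimal sketch rather than the authors' MATLAB procedure: the function names, the fixed seed, and the handling of a constant column are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def scale_to_unit_range(values: Sequence[float]) -> List[float]:
    """Linearly map values onto [-1, +1] using their own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly split range(n) into 85% training, 10% validation, 5% test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(0.85 * n)
    n_val = round(0.10 * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# The 1940-compound data set yields the set sizes quoted in the text.
train, val, test = split_indices(1940)
print(len(train), len(val), len(test))  # 1649 194 97

# The reported log H extremes map onto the ends of the scaled range.
print(scale_to_unit_range([-13.461, 6.238]))  # [-1.0, 1.0]
```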

Figure 2. Comparison between the estimated and experimental values of log H.

Table 1. Statistical Parameters of the Obtained Model

parameter                   training set   validation set   test set   training + validation + test set
R²                          0.9989         0.9917           0.9974     0.9981
average percent error       2.50%          5.89%            2.72%      2.84%
standard deviation error    2.41           2.41             2.17       2.41
root mean square error      0.08           0.22             0.12       0.1
n                           1649           194              97         1940
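For readers who want to recompute statistics of the kind listed in Table 1, the following Python sketch shows one way to do it. It rests on assumptions: the squared correlation coefficient is taken as the squared Pearson correlation, the average percent error as the mean of 100 × |(yexpt - ycalc)/yexpt| (the definition given in the Figure 3 caption), and the function names are hypothetical. The "standard deviation error" is not defined in the text, so it is not attempted here.

```python
import math
from typing import Sequence


def squared_correlation(y_expt: Sequence[float], y_calc: Sequence[float]) -> float:
    """Squared Pearson correlation between experimental and calculated values."""
    n = len(y_expt)
    mean_e = sum(y_expt) / n
    mean_c = sum(y_calc) / n
    cov = sum((e - mean_e) * (c - mean_c) for e, c in zip(y_expt, y_calc))
    var_e = sum((e - mean_e) ** 2 for e in y_expt)
    var_c = sum((c - mean_c) ** 2 for c in y_calc)
    return cov * cov / (var_e * var_c)


def average_absolute_percent_error(y_expt: Sequence[float],
                                   y_calc: Sequence[float]) -> float:
    """Mean of 100 * |(y_expt - y_calc) / y_expt| over all compounds."""
    return sum(100.0 * abs((e - c) / e)
               for e, c in zip(y_expt, y_calc)) / len(y_expt)


def root_mean_square_error(y_expt: Sequence[float], y_calc: Sequence[float]) -> float:
    """Square root of the mean squared difference."""
    return math.sqrt(sum((e - c) ** 2
                         for e, c in zip(y_expt, y_calc)) / len(y_expt))


# Single-compound check with the abietic acid values quoted in the Appendix:
# experimental log H = -5.451, estimated log H = -5.4835 (about 0.6%).
ape = average_absolute_percent_error([-5.451], [-5.4835])
print(round(ape, 1))
```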

Generating a neural network means determining its weight matrices and bias vectors. As shown in Figure 1, a three-layer feed forward neural network has two weight matrices and two bias vectors (W1 and W2, b1 and b2). These parameters are obtained by minimizing an objective function. The objective function used in this study is the sum of squared errors between the outputs of the neural network (estimated log H) and the target values (experimental log H). The minimization was performed with the Levenberg-Marquardt algorithm, which is rapid and accurate for training neural networks.21,22

Results and Discussion

By the presented procedure, an optimized feed forward neural network was obtained for predicting log H. To determine the number of neurons in the hidden layer, values from 1 through 50 were checked; 10 neurons gave the best results, so the best three-layer feed forward neural network has the structure 107-10-1. The mat file (MATLAB file format) containing all parameters of the obtained model can be obtained from the corresponding author by e-mail.

The log H values predicted by this model are compared with the experimental values in Figure 2 and are also reported as Supporting Information. The results obtained by the model are presented in Table 1. Over the training set, the validation set, the test set, and the main data set, respectively, the squared correlation coefficient is 0.9989, 0.9917, 0.9974, and 0.9981; the absolute percent error is 2.5%, 5.89%, 2.72%, and 2.84%; the standard deviation error is 2.41, 2.41, 2.72, and 2.41; and the root-mean-square error is 0.08, 0.22, 0.12, and 0.1 (all values obtained using log H). The absolute percent error obtained by the model over all 1940 compounds is shown in Figure 3. As can be seen, the obtained model predicts the log H of pure compounds accurately.
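To make the 107-10-1 structure with W1, b1, W2, and b2 concrete, the forward pass of such a network can be sketched in plain Python. Several things here are assumptions and not taken from the paper: the hidden layer uses tanh and the output layer is linear, the weights are random placeholders (the fitted values live in the authors' mat file), and the function names are hypothetical. The log H bounds used for rescaling are the data-set extremes quoted in the Materials section.

```python
import math
import random
from typing import List, Sequence


def forward(x: Sequence[float],
            W1: Sequence[Sequence[float]], b1: Sequence[float],
            W2: Sequence[float], b2: float) -> float:
    """One forward pass: tanh hidden layer (assumed), linear single output (assumed)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bias)
              for row, bias in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2


def denormalize(y_scaled: float, lo: float, hi: float) -> float:
    """Map a value from the scaled range [-1, +1] back to [lo, hi]."""
    return (y_scaled + 1.0) * (hi - lo) / 2.0 + lo


N_INPUTS, N_HIDDEN = 107, 10
rng = random.Random(0)

# Placeholder parameters only; these are NOT the fitted weights of the model.
W1: List[List[float]] = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)]
                         for _ in range(N_HIDDEN)]
b1 = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
W2 = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
b2 = 0.0

x = [-1.0] * N_INPUTS  # a scaled input vector with every group at its minimum
y_scaled = forward(x, W1, b1, W2, b2)
log_h = denormalize(y_scaled, -13.461, 6.238)
print(log_h)
```

With the real parameters loaded in place of the placeholders, the same two steps (forward pass, then rescaling) would give the model's log H estimate.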
Figure 3. Absolute percent error of the obtained model over the 1940 pure compounds. The absolute percent error is defined as 100 × |(yexpt - ycalc)/yexpt|.

To compare this model with previously presented models, several points should be considered. The first point that needs careful attention is the comprehensiveness of the model. The presented model is the most comprehensive in comparison with previously presented models, because it was developed over a diverse set of 1940 pure compounds from various chemical families. It should be noted that the largest data set used in previous studies was that of Modarresi et al.,15 which contained 940 pure compounds.

The second point is the root-mean-square error (RMSE) of the presented model. The RMSE of the obtained model is 0.1 on log H-based data. This value is considerably lower than those of the best previously presented models, such as the models of Lin and Sandler18 (RMSE of 0.34 over 395 pure compounds), Meylan and Howard9,10 (RMSEs of 0.52 and 0.42 over the same data set used by Lin and Sandler18), and Modarresi et al.15 (RMSE of 0.564 over 940 pure compounds).

Another point to consider is that, based on the model estimations, only 14 compounds have log H values with more than 30% absolute deviation (absolute percent error), and only seven have log H values with more than 50% absolute deviation. These seven compounds are methyl alcohol, diphenylmethane, anisole, dibromomethane, tridecanoic acid, dibenzofuran, and 1,2-dimethylnaphthalene. There appears to be no relation among these compounds that would indicate a weakness in predicting the log H values of particular chemical families, so it is probable that the log H values reported for these compounds are inaccurate or otherwise erroneous.

Conclusion

In this study, a molecular-based model was presented for estimating the Henry’s law constant of pure compounds in water. The model combines a group contribution method with feed forward neural networks. The parameters needed by the model are the numbers of occurrences of 107 functional groups in each molecule. It should be noted that most of these 107 functional groups do not occur simultaneously in any particular molecule; therefore, computing these parameters from the chemical structure of a molecule is simple. The model was developed using 1940 pure compounds; therefore, it can be used to predict the Henry’s law constant of regular compounds, with some limitations. These 1940 pure compounds cover many chemical families; therefore, the model has a wide range of applicability, but its application is restricted to compounds that are similar to those used to develop it. Applying the model to compounds that are completely different from the compounds used is not recommended. Also, comparison between the presented model and previously presented models shows that the presented model is more comprehensive and more accurate.

Supporting Information Available: There are two supplementary tables. The first contains the set of 107 functional groups used in this study. The second contains the numbers of occurrences of the 107 functional groups in all 1940 pure compounds used in this study, their values reported by Yaws, and the results obtained by the model. This material is available free of charge via the Internet at http://pubs.acs.org.

Appendix

The model is easy to apply. All that is needed is to drag and drop the mat file (freely available from the corresponding author) into the workspace of the MATLAB environment (any version). Let us obtain a response from the model step by step. Assume we want the model response for the log H of abietic acid. First, the group-contribution parameters are computed from the chemical structure of abietic acid (it is the first row in the second supplementary table, from log H1 to log H107). After that, drag and drop the mat file and enter the following in the MATLAB workspace:

    GC = [6 0 0 1 2 2 0 4 0 2 2 0 0 0 ...
          3 0 1 1 0 0 0 0 0 0 0 0 0 0 ...
          2 0 0 0 0 1 0 0 0 0 0 0 0 0 ...
          6 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
          1 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
          0 0 0 0 1 0 27 2 0 0 0 0 0 0 ...
          0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
          0 0 0 0 0 0 0 0 0];
    logH = sim(net, GC')

Therefore, one will observe the estimated log H = -5.4835. The experimental value of this log H is -5.451. Therefore, the absolute relative deviation of the calculated value from the experimental one is 0.6%.

Literature Cited

(1) Hemond, H. F.; Flechner-Levy, E. J. Chemical Fate and Transport in Environment, 2nd ed.; Academic Press: San Diego, CA, 2000.
(2) Schwarzenbach, R. P.; Gschwend, P. M.; Imboden, D. M. Environmental Chemistry, 2nd ed.; Wiley-Interscience: NJ, 2003.
(3) Altschuh, J.; Bruggemann, R.; Sntl, H.; Eichinger, G.; Pringer, O. G. Henry’s law constant for a diverse set of organic chemicals: Experimental determination and comparison of estimation methods. Chemosphere 1999, 39, 1871.
(4) Staudinger, J.; Roberts, P. V. A critical review of Henry’s law constants for environmental applications. Crit. Rev. Environ. Sci. Technol. 1996, 26, 205.
(5) Valles, H. R.; Estevez, L. A.; Durate, H. A neural network approach to predict activity coefficients. Can. J. Chem. Eng. 2009, 87, 748.
(6) Petersen, R.; Fredenslund, A.; Rasmusen, P. Artificial neural networks as a predictive tool for vapor-liquid equilibrium. Comput. Chem. Eng. 1994, 18 (suppl), S63.
(7) Mackay, D.; Shiu, W. S.; Ma, K. C. Henry’s law constant. In Handbook of Property Estimation Methods for Chemicals: Environmental and Health Sciences; Boethling, R. S., Mackay, D., Eds.; Lewis: Boca Raton, FL, 2000; pp 69-87.
(8) Hine, J.; Mookerjee, P. K. The intrinsic hydrophilic character of organic compounds. Correlations in terms of structural contributions. J. Org. Chem. 1975, 40, 292.


(9) Meylan, W. M.; Howard, P. H. Bond contribution method for estimating Henry’s law constants. Environ. Sci. Technol. 1991, 10, 1283.
(10) Meylan, W. M.; Howard, P. H. HENRYWIN 3.10; Syracuse Research: Syracuse, NY, 2000.
(11) Abraham, M. H.; Andonian-Haftvan, J.; Whiting, G. S.; Leo, A.; Taft, R. S. Hydrogen bonding. Part 34. The factors that influence the solubility of gases and vapours in water at 298 K, and a new method for its determination. J. Chem. Soc., Perkin Trans. 1994, 2, 1777.
(12) Katritzky, A. R.; Mu, L.; Karelson, M. A QSPR study of the solubility of gases and vapors in water. J. Chem. Inf. Comput. Sci. 1996, 36, 1162.
(13) Dearden, J. C.; Cronin, M. T. D.; Sharra, J. A.; Higgins, C.; Boxall, A. B. A.; Watts, C. D. The Prediction of Henry’s Law Constant: A QSPR from Fundamental Considerations. In Quantitative Structure-Activity Relationships in Environmental Sciences 7; Chen, F., Schüürmann, G., Eds.; Pensacola, FL, 1997; pp 135-142.
(14) Dearden, J. C.; Ahmad, S. A.; Cronin, M. T. D.; Sharra, J. A. QSPR Prediction of Henry’s Law Constant: Improved Correlation with New Parameters. In Molecular Modeling and Prediction of Bioactivity; Gundertofte, K., Jørgensen, F. S., Eds.; Plenum: New York, 2000; pp 273-274.
(15) Modarresi, H.; Modarress, H.; Dearden, J. C. QSPR model of Henry’s law constant for a diverse set of organic chemicals based on genetic algorithm-radial basis function network approach. Chemosphere 2007, 66, 2067.
(16) English, N. J.; Carrol, D. G. Prediction of Henry’s law constants by a quantitative structure property relationship and neural networks. J. Chem. Inf. Comput. Sci. 2001, 41, 1150.
(17) Yao, X.; Liu, M.; Zhang, X.; Hu, Z.; Fan, B. Radial basis function network-based quantitative structure-property relationship for the prediction of Henry’s law constant. Anal. Chim. Acta 2002, 462, 101.
(18) Lin, S. T.; Sandler, S. I. Henry’s law constant of organic compounds in water from a group contribution model with multipole corrections. Chem. Eng. Sci. 2002, 57, 2727.
(19) Yaffe, D.; Cohen, Y.; Espinosa, G.; Arenas, A.; Giralt, F. A fuzzy ARTMAP-based quantitative structure-property relationship (QSPR) for the Henry’s law constant of organic compounds. J. Chem. Inf. Comput. Sci. 2003, 43, 85.
(20) Yaws, C. L. Yaws’ Handbook of Thermodynamic and Physical Properties of Chemical Compounds; Knovel: Norwich, NY, 2003.
(21) Taskinen, J.; Yliruusi, J. Prediction of physicochemical properties based on neural network modeling. Adv. Drug Delivery Rev. 2003, 55, 1163.
(22) Hagan, M.; Demuth, H. B.; Beale, M. H. Neural Network Design; International Thomson: Andover, MA, 2002.
(23) Gharagheizi, F. A new group contribution-based method for estimation of lower flammability limit of pure compounds. J. Hazard Mater. 2009, 170, 595.
(24) Gharagheizi, F. New neural network group contribution model for estimation of lower flammability limit temperature of pure compounds. Ind. Eng. Chem. Res. 2009, 48, 7406.
(25) Gharagheizi, F.; Sattari, M. Estimation of molecular diffusivity of pure chemicals in water: A quantitative structure-property relationship study. SAR QSAR Environ. Res. 2009, 20, 267.
(26) Gharagheizi, F. Prediction of standard enthalpy of formation of pure compounds using molecular structure. Aust. J. Chem. 2009, 62, 376.
(27) Gharagheizi, F.; Tirandazi, B.; Barzin, R. Estimation of Aniline point temperature of pure hydrocarbons: A quantitative structure-property relationship approach. Ind. Eng. Chem. Res. 2009, 48, 1678.
(28) Gharagheizi, F.; Mehrpooya, M. Prediction of some important physical properties of sulfur compounds using QSPR models. Mol. Divers. 2008, 12, 143.
(29) Sattari, M.; Gharagheizi, F. Prediction of molecular diffusivity of pure components into air: A QSPR approach. Chemosphere 2008, 72, 1298.
(30) Gharagheizi, F.; Alamdari, R. F.; Angaji, M. T. A new neural network-group contribution method for estimation of flash point. Energy Fuels 2008, 22, 1628.
(31) Gharagheizi, F.; Fazeli, A. Prediction of Watson characterization factor of hydrocarbon compounds from their molecular properties. QSAR Comb. Sci. 2008, 27, 758.
(32) Gharagheizi, F.; Alamdari, R. F. A Molecular-based model for prediction of solubility of C60 fullerene in various solvents. Fullerenes, Nanotubes Carbon Nanostruct. 2008, 16, 40.
(33) Gharagheizi, F. A new neural network quantitative structure-property relationship for prediction of (lower critical solution temperature) of polymer solutions. e-Polym. 2007, article no. 114.
(34) Gharagheizi, F. QSPR studies for solubility parameter by means of genetic algorithm-based multivariate linear regression and generalized regression neural network. QSAR Comb. Sci. 2008, 27, 165.
(35) Gharagheizi, F. QSPR analysis for intrinsic viscosity of polymer solutions by means of GA-MLR and RBFNN. Comput. Mater. Sci. 2007, 40, 159.
(36) Gharagheizi, F. A chemical structure-based model for estimation of upper flammability limit of pure compounds. Energy Fuels 2010, 24, 3867.
(37) Vatani, A.; Mehrpooya, M.; Gharagheizi, F. Prediction of standard enthalpy of formation by a QSPR model. Int. J. Mol. Sci. 2007, 8, 407.
(38) Mehrpooya, M.; Gharagheizi, F. A molecular approach for prediction of sulfur compounds solubility parameters. Phosphorus, Sulfur Silicon Relat. Elem. 2010, 185, 204.

Received for review July 18, 2010. Revised manuscript received August 27, 2010. Accepted September 2, 2010. IE101532E