7378
Ind. Eng. Chem. Res. 2009, 48, 7378–7387
QSPR Flash Point Prediction of Solvents Using Topological Indices for Application in Computer Aided Molecular Design Suhani J. Patel, Dedy Ng, and M. Sam Mannan* Mary Kay O’Connor Process Safety Center, Department of Chemical Engineering, Texas A&M UniVersity, College Station, Texas 77843-3122
Incorporating consideration for safety issues while selecting solvents for processes has become crucial in light of the chemical process accidents involving solvents that have taken place in recent years. Computer aided molecular design (CAMD) is a methodology that has been researched recently for designing compounds with required target properties and can be applied for selection of safer solvents as well. An important aspect of this methodology concerns the prediction of properties given the structure of the molecule. This paper utilizes one such emerging method for prediction of a hazardous property, flash point, which is indicative of the flammability of solvents. Quantitative structure property relationship (QSPR) and topological indices have been used in this paper to predict flash point properties of different classes of solvents. Multiple linear regression and back-propagation neural network analysis were used to model the flash point. The neural network model showed higher accuracy (training set, r ) 0.948, R2 ) 0.898). However, there are certain limitations associated with using QSPR in CAMD which have been discussed and need further work. This paper advances the “forward problem” of CAMD using QSPR which has not been researched extensively in the past. Introduction Solvents are widely used in chemical industries in several different processes. Certain characteristics and physical/chemical properties of solvents make them useful, while other properties make them extremely hazardous. Solvents are generally organic chemicals that tend to be highly flammable and toxic. Issues such as human and ecological toxicity, process safety hazards, and waste/pollution management are a concern for solvent processes. Thus, hazards and risk associated with solvent processes need to be assessed and mitigated at early stages of process design. Chemical industries reduce risk by placing emphasis on proper storage and handling procedures, operator training, proper ventilation systems, and minimization of ignition sources. Despite abundant precautions in chemical plants, many incidents and accidents have taken place in recent years. The incidents at the CAI/Arnel facility in Massachusetts (November 2006),1 and the Barton Solvents facilities in Kansas and Iowa (July 2007 and October 2007),2,3 involving fire and explosions, were escalated due to large quantities of solvents being used in those facilities. Apart from fire and explosion incidents, there have also been innumerable reports of workers being exposed to solvent vapors with toxic and adverse health effects. Such incidents result in loss of property, and at times, they result in loss of human lives as well. Thus, it is imperative to consider safety in solvent processes during the conceptual phase of process design. Integrating safety in solvent processes can be achieved by exploring inherently safer design (ISD) concepts.4 One of the key principles of inherent safety is “substitution” of a morehazardous compound with a less-hazardous one. The benefits of “substitution” in ISD have been demonstrated in practical applications such as the replacement of benzene with a less toxic solvent such as cyclohexane, and the usage of supercritical carbon dioxide instead of hexane, ethanol, and ethyl acetate in the food industry. To incorporate inherently safer substitution into chemical processes, consideration for hazardous properties * To whom correspondence should be addressed. E-mail: mannan@ tamu.edu.
(such as flash point, flammability limits, and toxicity) needs to be embedded into the solvent-selection process. Novel solvents and more-complex solvents being synthesized in the laboratories often lack experimental data on hazardous properties. Among the properties that describe a material’s flammability, flash point provides a stronger indication of flammability. Flash point is the minimum temperature at which the liquid (or solid) emits sufficient vapor to form an ignitable mixture with air. NFPA ratings for flammability, as described in NFPA 704,5 are also based on the flash point values for chemicals. Thus, the emphasis of this work is on predicting flash points for solvents. Models for prediction of flash point have been developed in the past. Boiling points of organic compounds have been correlated to flash point using quadratic relationships by Hshieh et al.6 and an exponential relationship by Satyanarayana and Rao.7 Molecular structure information was also used by some researchers to predict flash point. Prugh8 incorporated stoichiometric concentration and boiling point for flash point prediction. Suzuki et al.9 used a combination of structural factors such as molecular connectivity index and group contributions to predict the flash point. A particular group-contribution-based model has also been developed by Stefanis et al.10 The quantitative structure property relationship (QSPR) method has been applied for flash point prediction by Katritzky et al.11,12 using molecular descriptors which were of the types geometric, electrostatic,
Figure 1. Schematic for QSPR application in CAMD.
10.1021/ie9000794 CCC: $40.75 2009 American Chemical Society Published on Web 07/08/2009
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009
7379
Figure 4. Box plots of test set and training set. Figure 2. Distribution of molecular weights in data set (n ) 236).
Figure 3. Distribution of flash point values in data set (n ) 236).
quantum mechanical, and constitutional.13 The QSPR method relies on predicting properties based on computable molecular descriptors which in turn are evaluated from information derived from the molecular structure. It has been used as a technique for prediction of properties such as critical temperature, boiling point, refractive index, octanol-water partition coefficient, and many others, with a higher level of accuracy.14-16 In addition, computer aided molecular design (CAMD) approaches have been used successfully to develop novel materials given specified target properties. A general CAMD problem can be formulated following a mixed integer nonlinear programming (MINLP) approach which consists of property constraints.17 Among the different types of predictive models described before, the group contribution approach for property estimation is employed by most studies in CAMD,18-21 while some studies have explored the applicability of QSPR/quantitative structure-activity relationship (QSAR) in CAMD.22,23 Computer aided molecular design using QSPR/QSAR consists of two parts. (1) The first is the forward problem: a method to predict properties given the molecular structure. (2) The second is the inverse problem: applying the forward problem solution
Figure 5. Distribution of flash point values for each class.
to obtain molecular structures that satisfy given target properties. Among the models described above, quantitative structure property relationship (QSPR) remains the choice of method for its predictive and new molecular design purposes. A schematic of the approach for applying QSPR in CAMD has been depicted in Figure 1. Topological indices prove to be more suitable for CAMD since they are calculated using information obtained from atomic constitution and bond characteristics of a molecule. Molecular descriptors based on topological information provide a higher level of molecular representation compared to functional groups
7380
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009
Table 1. Results of Multiple Linear Regression on Different Classes of Solvents for Flash Point Predictiona training set
test set
class of solvents
equation
r
R2
R2(CV)
F
r
R2
monohydric alcohols
Tf ) 9.248(SC:2) 34.953(3χcluster) + 286.053 Tf ) 13.064(2κ) 8.665(3κ) + 385.328 Tf ) 33.638(0χ) + 164.386 Tf ) 50.264(1χ - 31.887(2χv) + 210.783 Tf ) 30.045(1χ) + 186.184
0.925
0.855
0.613
109.06
0.933
0.870
0.608
0.370
-
7.62
0.500
0.251
0.918
0.842
0.822
203.21
0.572
0.327
0.691
0.477
0.22
16.89
0.396
0.157
0.825
0.680
0.600
80.75
0.875
0.770
polyhydric alcohols hydrocarbons amines ethers
a 2 R ) coefficient of determination; r ) correlation coefficient; R2(CV) ) R2 for the cross-validation set, F ) Fisher test statistic; SC:2 ) subgraph counts (second order): path; 3χcluster ) chi(3): cluster; nκ ) kappa-n; nχ ) chi(n); 2χv ) chi(2) (valence modified).
Table 3. Results of Neural Network Analysis for Flash Point Prediction (n ) 236, Network Configuration ) 16:5:1)
training set test set
Figure 6. Plot of calculated versus experimental values of flash point using MLR (graph depicts correlations from Table 1 for all classes). Table 2. Sixteen Input (Predictive) Variables to the Artificial Neural Network Kier shape indices41
1
κ κ κ 1 κ 2 κ SC:0
kappa-1 kappa-2 kappa-3 kappa-1 (alpha modified) kappa-2 (alpha modified) subgraph counts (0): path
SC:1 SC:2 0 χ
subgraph counts (1): path subgraph counts (2): path chi(0)
1
chi(1) chi(2) chi(3): path chi(0) (valence modified) chi(1) (valence modified) chi(2) (valence modified) chi(3): path (valence modified)
2 3
Kier and Hall subgraph count indices Kier and Hall molecular connectivity indices42
χ χ χ 0 χv 1 χv 2 χv 3 χv 2 3
or molecular fragment counts. In calculating topological indices, graph theory is used to evaluate information about the constituent atoms and the connecting bonds between them. This makes it easier to visualize the molecular structure and simultaneously evaluate properties for the molecule. The possibility of solving the inverse problem for certain topological indices has been explored using graph reconstruction methods to obtain the exact
r
R2
R2(CV)
0.948 0.814
0.898 0.662
0.577 0.710
molecular structure based on the index values.24-26 In recent years, more work has been done to resolve the inverse problem for QSPR, but few researchers have approached solving the forward problem using topological indices. Topological indices encode information on molecular connectivity which, in principle, would yield more accurate correlations than simpler group contribution methods. Some properties other than flash point have been modeled using topological indices (such as boiling point, molar volume, heat of vaporization for alkanes27,28), but only for distinct groups of chemicals. While previous attempts did not address heteroatoms and multiple bonds, the necessity of extending this approach to make it more inclusive of different types of chemicals and variation in properties has been advocated in their work. In this work, the forward problem has been examined using topological descriptors (indices) to predict the flash point of solvents that are diverse in terms of chemical constitution, bond saturation/unsaturation, and cyclic/straight-chain/branch characteristics. Application of such predictive models to CAMD will aid in selecting solvents that are inherently safer. Methodology The experimental flash point data for 236 solvents were collected from the Acros29 and Aldrich catalogs30 and the Industrial Solvents handbook.31 Major classes of solvents were used to form the data set, such as monohydric alcohols, polyhydric alcohols, amines, ethers, and aliphatic and aromatic hydrocarbons. These solvents were selected in this study because of their frequent usage in the petrochemical industries, where safety concerns are escalated due to solvent processes. The distribution of the molecular weights and flash points for the entire data set has been shown in Figures 2 and 3, respectively. The solvents used for this study show wide variability in composition and experimental flash point values (range of flash points ) 603.15 K - 227.15 K ) 376 K, standard deviation ) 71.7 K). The complete data set is shown in Table 4, along with the experimental and calculated flash point values. Structures for the chemical compounds were obtained from the PubChem Database32 in the standard data format (SDF). This format is able to provide sufficient information to calculate topological indices using molecular graphs and related matrices
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009
7381
Table 4. Solvent Data Set with Experimental and Predicted (MLR and ANN) Flash Points Tf(predicted) (K) no.
IUPAC/CAS name
Tf(exptl) (K)
MLR
ANN
283.706 286.483 298.150 295.372 303.150 284.261 313.706 294.261 347.039 331.483 358.150 353.150 377.594 373.706 325.372 356.483 393.150 324.150 386.150 332.150 346.150 314.150 306.150 354.150 386.150 313.150 341.150 302.150 333.150 314.150 293.150 302.150 330.150 350.150 333.150 327.150 418.150 327.150 377.150 389.150 284.817 309.817 330.372 324.817 340.928 363.706 295.372 347.039 383.150 -
286.053 295.302 304.550 308.776 308.776 271.635 318.024 296.240 332.294 334.407 345.769 360.038 369.287 362.152 309.714 352.904 396.931 308.776 331.452 336.521 341.542 327.272 318.024 350.790 392.489 322.204 349.948 296.240 351.918 322.251 296.240 305.488 334.407 352.904 331.320 336.342 406.279 349.106 364.265 397.031 293.617 313.798 323.046 318.024 349.948 350.790 304.550 352.904 361.058 349.948
284.340 285.456 274.910 301.396 316.581 288.146 305.520 286.685 330.810 326.348 325.464 374.841 387.239 382.601 330.277 360.835 402.285 335.076 382.980 316.964 345.096 317.700 321.946 360.680 433.132 307.145 371.141 310.841 336.049 331.725 286.685 329.785 326.348 333.723 340.316 324.966 413.206 333.735 352.081 410.084 270.555 302.735 315.048 325.130 317.034 360.680 320.431 326.006 371.088 364.4354
392.594 411.483 435.928 477.594 377.594 397.039 416.483 433.150 382.039 394.261 402.594 372.150 352.150 366.150 358.150 401.150 425.150 420.150 374.150 380.150
389.860 411.722 423.681 438.116 380.062 420.520 409.900 401.138 380.919 404.435 411.722 380.062 402.925 401.138 388.364 404.435 404.435 414.388 374.379 419.210
361.7896 413.696 435.501 434.547 361.3137 420.358 421.93 432.076 373.788 374.179 380.36 341.9739 370.4675 367.5868 335.6638 399.387 404.609 377.063 382.259 375.556
Class: Monohydric Alcohols 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
methanol ethanol propan-1-ol butan-2-ol 2-methylpropan-1-ol 2-methylpropan-2-ol pentan-2-ol 2-methylbutan-2-ol hexan-1-ol 2-ethylbutan-1-ol octan-2-ol nonan-1-ol decan-1-ol phenylmethanol 4-hydroxy-4-methylpentan-2-one 2-furylmethanol 2-methylpropanoic acid (3-hydroxy-2,2,4-trimethylpentyl) ester 1-chloropropan-2-ol 1,4-dibromobutan-2-ol heptan-2-ol heptan-1-ol hexan-2-ol 1-methoxypropan-2-ol octan-1-ol (2R)-2,4-dihydroxy-N-(3-hydroxypropyl)-3,3-dimethylbutanamide pentan-3-ol oct-1-en-3-ol 2,2,2-trifluoroethanol 2,3,4-trimethylpentan-1-ol 4-methylpentan-2-ol 2-methylbutan-2-ol 3,3-dimethylbutan-1-ol 2-ethylbutan-1-ol 2-ethylhexan-1-ol 2,2,4-trimethylpentan-1-ol 3-methylheptan-3-ol tetradecan-1-ol 5-methylheptan-3-ol 8-methylnonan-1-ol tridecan-1-ol propan-2-ola butan-1-ola pentan-1-ola 3-methylbutan-1-ola cyclohexanola octan-1-ola prop-2-en-1-ola 2-tetrahydrofuranylmethanola (1R)-1-(2-furyl)ethanola 5-methylheptan-1-ola
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
ethylene glycol 2-(2-hydroxyethoxy)ethanol 2-[2-(2-hydroxyethoxy)ethoxy]ethanol 2-[2-[2-(2-hydroxyethoxy)ethoxy]ethoxy]ethanol propane-1,2-diol 3-(3-hydroxypropoxy)propan-1-ol 2-[2-(2-hydroxypropoxy)propoxy]propan-1-ol glycerol butane-1,3-diol butane-1,4-diol pentane-1,5-diol propane-1,1-diol propane-1,3-diol butane-1,2-diol butane-2,3-diol (E)-but-2-ene-1,4-diol but-2-yne-1,4-diol hexane-1,6-diol hexane-2,5-diol 2,2-diethylpropane-1,3-diol
Class: Polyhydric Alcohols
7382
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009
Table 4. Continued Tf(predicted) (K) no.
IUPAC/CAS name
Tf(exptl) (K)
MLR
ANN
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
2,5-dimethylhex-3-yne-2,5-diol benzoic acid [4-[(oxo-phenylmethoxy)methyl]cyclohexyl]methyl ester [4-(hydroxymethyl)phenyl]methanol 2-butyl-2-ethylpropane-1,3-diol 3,6-dimethyloct-4-yne-3,6-diol 2-ethyl-2-(hydroxymethyl)propane-1,3-diol 2,2-bis(hydroxymethyl)propane-1,3-diol (2R,3R,4R,5S)-hexane-1,2,3,4,5,6-hexol 2-(hydroxymethyl)-2-methylpropane-1,3-diol 2,2-dimethylpropane-1,3-diola 2-methylpentane-2,4-diola (2S)-butane-1,2,4-triola 2,3-dimethylbutane-2,3-diola 2-ethylhexane-1,3-diola 2-(2-hydroxyethylthio)ethanola hexane-1,2,6-triola
347.050 434.150 460.950 386.150 382.150 445.150 513.150 422.050 433.150 424.817 374.817 385.150 350.150 409.150 433.150 471.150
344.495 471.191 417.780 430.521 395.317 419.210 419.210 428.185 406.717 392.840 355.860 406.488 388.935 428.496 411.722 414.417
334.8306 434.367 417.132 388.488 373.551 459.232 396.468 491.91 424.135 454.466 342.5066 440.035 315.5053 386.009 429.347 471.752
87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
pentane hexane heptanes octane nonane decane dodecane cyclohexane cyclohexene benzene methylbenzene 1,2-dimethylbenzene 1,4-dimethylbenzene 1,3,5-trimethylbenzene isopropylbenzene tridecane tetradecane hexadecane heptadecane nonadecane isopentane isohexane 3-methylpentane 2,3-dimethylbutane buta-1,3-diene; vinylbenzene but-2-ene (Z)-but-2-ene but-1-ene 2-methylprop-1-ene (Z)-pent-2-ene pent-1-ene dec-1-ene hept-1-ene cyclooctane cyclopenta-1,3-diene cyclopentane cyclopentene 2-methylheptane 4-vinylcyclohexene 4-methylpent-1-yne ethylbenzenea undecanea pentadecanea octadecanea tricosanea 2,2-dimethylbutanea 2-methylhexanea but-2-ynea 2,4,4-trimethylpent-1-enea 2,3,4-trimethylpent-2-enea
309.150 342.150 371.150 399.150 424.150 447.150 489.150 354.150 355.150 353.150 384.150 417.150 411.150 436.150 426.150 507.150 525.150 560.150 575.150 603.150 303.150 335.150 337.150 331.150 268.760 274.040 277.150 266.850 266.150 309.150 303.150 443.150 367.150 301.150 273.150 236.150 243.150 277.150 294.150 269.150 409.150 469.150 543.150 590.150 507.150 323.150 363.150 297.150 267.150 275.150
303.019 326.804 350.590 374.376 398.161 421.947 469.518 307.100 307.100 307.100 336.373 365.646 365.646 394.919 389.432 493.304 517.090 564.661 588.447 636.018 308.506 332.292 332.292 337.780 360.159 279.233 279.233 279.233 284.721 303.019 303.019 421.947 350.590 354.671 283.314 283.314 283.314 379.863 360.159 332.292 360.159 445.733 540.875 612.232 731.161 339.542 356.078 279.233 392.601 390.839
320.262 348.362 369.100 396.714 420.665 448.598 495.179 257.539 274.537 291.387 337.451 389.847 381.027 381.639 392.782 512.634 528.382 547.604 552.692 558.614 316.463 305.518 338.813 354.536 358.556 252.188 252.188 258.673 252.479 217.813 266.690 406.188 337.276 281.062 261.165 249.880 247.459 338.342 326.274 301.790 372.344 471.715 538.855 556.795 559.588 308.681 321.960 236.707 273.992 299.718
137 138 139 140
methoxymethane ethoxyethane 2-methoxy-2-methylpropane 2-isopropoxypropane
232.039 233.150 238.706 245.372
228.675 258.720 263.120 280.103
239.213 228.781 243.283 278.387
Class: Hydrocarbons
Class: Ethers
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009
7383
Table 4. Continued Tf(predicted) (K) no.
IUPAC/CAS name
Tf(exptl) (K)
MLR
ANN
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
1-butoxybutane 1-pentoxypentane 1-ethenoxybutane 2-methyloxirane 2-ethyloxirane 1,4-dioxane (2,2-dimethyl-1,3-dioxolan-4-yl)methanol 2-methylfuran tetrahydrofuran tetrahydropyran methoxybenzene 1,2-bis(2-methoxyethoxy)ethane 2-(phenoxymethyl)oxirane 2-methoxy-2-methylbutane 2-(2-hydroxyethoxy)ethanol ethoxyethylene allyloxybenzene 2-methoxy-2-methylpropane phenylmethoxymethylbenzene 1-chloro-2-(2-chloroethoxy)ethane 1-chloro-2-methoxyethane 1-[2-(2-butoxyethoxy)ethoxy]butane 3-allyloxyprop-1-ene 1-chloro-4-(phenoxy)benzene 1-isopentyloxy-3-methylbutane 2-(allyloxymethyl)oxirane chloro-(chloromethoxy)methane 2-(butoxymethyl)oxirane 1-chloro-1-(1-chloroethoxy)ethane 1-butoxybutane 1,2-dimethoxybenzene 1,2-dimethoxyethane (3S,3aR,6R,6aR)-3,6-dimethoxy-2,3,3a,5,6,6a-hexahydrofuro[3,2-b]furan phenoxybenzene 1,3,5-trioxane ethoxyethylene 1-hexoxyhexanea ethoxyethylenea 1-ethenoxy-2-methylpropanea furana phenoxybenzenea 4-(4-aminophenoxy)anilinea chloromethoxymethanea 4,7,7-trimethyl-8-oxabicyclo[2.2.2]octanea 1-pentoxypentanea 1-ethenoxypropanea
304.261 330.372 263.706 235.928 260.928 291.483 267.039 243.150 247.594 253.150 324.817 386.150 388.150 262.150 416.150 228.150 335.150 245.150 408.150 328.150 288.150 374.150 266.150 386.150 319.150 321.150 292.150 314.150 328.150 298.150 345.150 271.150 376.150 388.150 318.150 227.150 349.817 227.594 263.706 237.594 388.150 491.150 289.150 322.150 330.150 247.150
318.811 348.856 288.765 243.085 259.250 276.320 310.541 273.131 261.298 276.320 304.318 363.879 349.916 279.965 288.765 258.720 334.363 263.120 410.007 288.765 258.720 408.947 288.765 391.794 340.193 304.318 258.720 319.341 280.103 318.811 332.822 273.743 362.361 379.961 276.320 258.720 378.901 258.720 284.434 261.297 379.961 403.628 243.697 336.993 348.856 273.743
296.757 312.100 283.769 236.360 250.855 320.828 283.432 311.857 279.045 309.917 346.959 379.106 388.447 271.696 413.696 233.210 337.870 243.283 412.252 323.477 260.097 384.509 302.843 413.976 305.138 291.064 338.792 332.015 308.603 296.757 356.996 265.427 373.741 393.353 302.339 233.210 315.229 233.210 316.458 268.924 393.353 414.861 263.456 319.235 312.100 258.428
187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211
acetamide 1-(2-pyridyl)ethanone 2-(bis(2-hydroxyethyl)amino)ethanol prop-2-en-1-amine 6-methyl-2-pyridinamine 2-(2-aminoethoxy)ethanol 2-pyridinamine aniline phenylmethanamine 3-bromopyridine butan-2-amine 2-methylpropan-2-amine 3-chloroaniline 2-chloroaniline 2-chloropyridine 2,4,6-trimethylpyridine 2-pyridinecarbonitrile 3-pyridinecarbonitrile 4-pyridinecarbonitrile 1-cyclohexyl-2-pyrrolidinone cyclohexanamine N-pentylpentan-1-amine 2,6-di-tert-butylpyridine ethane-1,2-diamine N-cyclohexylcyclohexanamine
315.150 349.150 452.150 245.150 376.150 400.150 236.150 343.150 345.150 324.150 254.150 235.150 391.150 371.150 337.150 330.150 362.150 357.150 361.150 418.150 300.150 277.150 345.150 330.350 376.150
278.371 370.670 390.196 290.279 346.839 346.534 341.367 336.390 354.535 314.827 280.705 235.866 337.668 340.514 332.276 343.568 366.108 365.256 365.441 379.197 304.905 382.028 334.552 288.590 380.740
262.030 357.829 451.632 297.846 348.889 391.355 335.281 347.643 378.437 297.980 292.878 292.165 369.945 395.286 337.524 360.574 344.020 346.343 347.554 405.431 298.473 342.516 368.770 312.476 383.147
Class: Amines
7384
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009
Table 4. Continued Tf(predicted) (K)
a
no.
IUPAC/CAS name
Tf(exptl) (K)
MLR
ANN
212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236
N,N-diethylacetamide N-ethylethanamine 2-diethylaminoethanol diethylcyanamide N-(2-aminoethyl)ethane-1,2-diamine N,N-diethylformamide 1-(2-hydroxypropylamino)propan-2-ol N-isopropylpropan-2-amine N-methylmethanamine N,N-dimethylaniline 2-dimethylaminoethanol N,N-dimethylformamide 3,5-dimethylpiperidine heptan-2-amine heptan-1-amine 3-methyl-2-pyridinaminea propan-1-aminea 2-bromopyridinea butan-1-aminea 4-chloroanilinea N-allylprop-2-en-1-aminea N-butylbutan-1-aminea 2-(2-hydroxyethylamino)ethanola N,N-dimethylacetamidea 2-aminoethanola
343.150 245.150 324.650 342.150 363.150 333.150 399.150 266.150 255.150 336.150 313.150 330.150 294.650 327.150 317.150 384.150 243.150 327.150 261.150 461.150 280.150 312.150 411.150 336.150 358.150
343.773 301.613 348.285 338.401 340.071 336.669 343.034 288.950 265.924 356.040 301.330 290.798 302.183 321.424 337.285 347.631 281.851 318.717 295.710 337.778 342.997 354.311 344.221 298.548 290.664
344.431 238.239 344.764 318.526 357.922 316.788 402.600 274.742 237.551 360.444 318.491 301.131 288.842 310.311 352.557 387.669 271.757 325.941 298.990 384.012 298.738 312.677 407.429 307.027 335.417
The solvents belonging to the test set for MLR calculations.
(such as distance and adjacency matrices). The molecular descriptors (29 topological indices) were then calculated using the Materials Studio 4.3 software (Accelrys Software Inc.), and the correlations were obtained using the Materials Studio Software package.33 Both multiple linear regression (MLR) analysis and artificial neural network (ANN) were used to evaluate the models and corresponding accuracy. Multiple regression models can be depicted using eq 1. y ) a1x1 + a2x2 + ... + anxn + c
(1)
where a1, a2, etc., and c are constants chosen to give the smallest possible sum of least-squares difference between true y values and the y′ values predicted using this equation. Neural network is a model-building technique that may better represent nonlinear functions. ANN typically consists of three layers: an input layer, a hidden layer, and an output layer. Each layer is connected to the next layer, and the connections are associated with certain “weights”. The connection weights are generally adjusted through a training method. Back-propagation has been used here for training the model. The algorithm comprises of the forward pass initially, wherein the input layer propagates a component of the input vector to each node in the middle hidden layer. Consequently, the middle layer computes output values, which become inputs to the nodes in the output layer. The output layer computes the network output for a particular input vector. These steps comprise the forward pass which is based on the current state of the network weights. The network weights are initially given as random values; thus, prior to training the weights it is unlikely that reasonable outputs will be obtained. Hence the weights are adjusted to reduce the error by backward propagation through the network. This process is known as the backward pass. The error values are computed for each node, based on the known desired output. The error for the middle-layer nodes is then calculated by assigning a portion of the error at the output layer node to the middle node. The amount of error attributed depends on the magnitude of the connection weight.34 Further-
more, the weight values are adjusted to improve the network performance according to the BFGS (Broyden-FletcherGoldfarb-Shanno) method or steepest descent algorithm.35 Advantages of back-propagation are its simplicity and reasonable speed. This method enables the network to model complex nonlinear functions for engineering applications.36 Equation 2 best describes a neural network.37
( ( nH
yk ) f
∑w
kj f
j)1
d
∑w x
ji i
) )
+ wj0 + wk0
i)1
(2)
where yk denotes the output, nH is the number of hidden nodes, wkj is the hidden-to-output layer weights at the output layer k, wji is the input-to-hidden layer weights at the hidden unit j, xi is the ith input of total d inputs, and wj0 and wk0 are known as the bias. Also, f is the nonlinear transfer function which calculates the output at a node. The transfer function used in Materials Studio is an S-shaped sigmoid function. This function is chosen because it is smooth and easily differentiable, which makes it easier to train the network. The S-sigmoid function is depicted by the following underlying equation.37 f(z) )
1 1 + e-z
(3)
The data set is divided into a training set (80%) and test set (20%), and cross-validation is performed by omitting each of three groups in turn. The total range of flash point values was divided into n smaller ranges. From each range a proportional number of solvents were used to form the test set. The box plots of the training set and test set for the entire data set are shown in Figure 4. The test set and training set are representative of the data set. Results and Discussion The calculated values for the flash points of solvents using multiple linear regression and back-propagation neural network
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009
7385
analysis are shown in Table 4. Upon using the entire data set for multiple linear regression analysis, poor accuracy was obtained (R2 ) 0.523, r ) 0.723). This is partially due to the large variability in chemical constitution and structure. Thus, the data set was divided into different classes of solvents (Figure 5 shows the distribution of flash point values for each class) which have been analyzed using MLR, and the results are as shown in Table 1, along with the correlation coefficient, r, R-squared value, R2, R2(CV) for cross-validation, and F value, being indicative of their predictive capability. An overall depiction of MLR results is shown in Figure 6 as a plot of calculated versus experimental values. A statistical evaluation of error and deviation in calculated versus experimental values is found using the following definitions for the entire set of solvents. average absolute deviation )
1 n
n
∑ |T
exp
average absolute relative deviation )
average percent bias )
1 n
n
∑ i)1
- Tcalc |
(4)
i)1
1 n
∑| n
i)1
Texp - Tcalc Texp
Texp - Tcalc × 100 Texp
|
(5) (6)
The average absolute deviation is 24.92 K, the average absolute relative deviation is 7.42%, and the average bias is -0.21% for the data set using MLR. From Table 1 monohydric alcohols and ethers show a consistent and higher accuracy in prediction (as seen from the results of the training set and test set), whereas hydrocarbons show higher accuracy for the training set and do not perform well for the test set. Models for polyhydric alcohols and amines do not perform better overall. Alcohols and amines have additional molecular phenomena that dictate the physical properties associated with the chemicals. Hydrogen bonding that occurs between molecules where a hydrogen atom is attached to one of the electronegative elementssfluorine, oxygen, or nitrogensis one of the contributing factors for such chemical types. Vapor pressure is an influencing factor for flash point determination,38 which in turn is governed by attractions among molecules, unevenly distributed electron densities, and bonded hydrogen atoms.39 Molecular connectivity indices χ show good correlation for most types of solvents as seen in Table 1. Previous work has also shown that this particular type of topological index has been used successfully to predict properties for normal and branched alkanes.27 The topological indices applied here have some physical meaning. The molecular connectivity index χ provides a quantitative assessment of the degree of branching of molecules, and the valence modified connectivity index χv provides information on the chemical nature of the atoms. The shape of a molecule, as determined by the different degrees and location of branching, is described by Kier’s shape indices κ. The subgraph count index (second order) is a measure of the number of pairs of connected edges (i.e., number of paths of length 2). These indices collectively can provide information on the size of a molecule (volume occupied by the molecule), and the shape of the molecule (distribution of the molecular volume in space).40 The topological indices are able to provide some information on the interactions among molecules, but do not give sufficient
Figure 7. Plot of calculated versus experimental values of flash point using ANN.
information on hydrogen bonding ability. As aforesaid for certain classes of compounds, the properties would depend on the patterns in intermolecular attractions. Thus, other types of molecular descriptors may perform better in predicting flash point for more-complex compounds. To enhance the predictive power, neural network analysis using topological indices was performed on the entire data set of 236 solvents. The training set consisted of 189 (∼80%) compounds and the test set consisted of 47 compounds. A 16: 5:1 network (consisting of 16 input nodes as given in Table 2, one output node, and one hidden layer with five nodes) gave higher accuracy for prediction of flash point as shown in Table 3. The input parameters are selected based on the best set of descriptors found in the MLR models. Neural networks are good at fitting functions, but they could occasionally result in overfitting. Thus, a test set is needed to verify the predictive power of the neural network, and as expected, the accuracy for the test set is lower than the training set. Figure 7 shows the plot of calculated values against experimental values of flash point. The average absolute relative deviation is 6.16%, the average absolute deviation is 20.44 K, and the average percent bias is 0.21% for the complete data set using ANN. The results from ANN show higher accuracy than MLR, although the complexity and nonlinearity of the ANN model makes it difficult to directly apply it to the CAMD problem (since CAMD requires simpler, preferably linear, relationships for an MINLP-based formulation). Thus, a suitable methodology that incorporates neural network models in the solution of an inverse problem would be a significant development in this approach. Overall, neural network analysis gives better prediction of flash points for solvents. Conclusions Computer aided molecular design is one means for finding inherently safer chemical substitutes for solvents. Supplementing the group contribution methods by newly developed methods is a promising venture for CAMD in the future. This paper discusses an approach to achieve this objective. Efficient QSPR approaches have become an attractive option
7386
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009
in recent years for property estimation in general. Thus, it is also important to identify its applicability in CAMD for solvent substitution. One of the hazardous properties, flash point, was evaluated using QSPR for different classes of solvents. Topological indices in particular have been used in this work to facilitate future application in computer aided molecular design as explained earlier. Although the results proved to be promising, some aspects can be explored further with regard to CAMD: (a) An ANN model for flash point gave higher accuracy for the entire data set than MLR, but the application of this complex model in CAMD has not yet been investigated. The nonlinear equation that defines a neural network will prove to be difficult to implement in CAMD. Previous studies have used linear relationships preferably, to reduce the CPU time. (b) Further justification for using QSPR instead of group contribution methods in CAMD is needed. The established group contribution methods are applicable to a wider range of chemical species, whereas QSPR methods are occasionally specific to a particular class of chemical compounds. In this paper, the flash point prediction proved to be of higher accuracy for certain classes of solvents (monohydric alcohols and ethers) compared to other classes (amines, polyhydric alcohols). In order to obtain higher accuracy in the QSPR model, subclasses for each of the classes investigated here need to be assessed. Subsequent subdivision defeats the purpose of CAMD to choose among a larger range of chemicals simultaneously. (c) More properties which need to be considered during selection of solvents such as solubility parameter, boiling point, surface tension, and vapor pressure for diverse chemical data sets need to be predicted and assessed using topological indices. Other safety-related properties that need to be evaluated are toxicity levels (e.g., LC50) and reactivityor stability-related properties. Thus, there is a pressing need to explore the applicability of QSPR (and topological indices) in future CAMD studies. QSPR has been successfully applied in quantifying certain biological responses and polymer behaviors in the past, but analyzing its suitability for CAMD remains to be accomplished. Acknowledgment This research was funded by the Mary Kay O’Connor Process Safety Center of the Department of Chemical Engineering at Texas A&M University. Computational results were obtained using Materials Studio 4.3 software program from Accelrys. Molecular descriptors were calculated using the QSAR module. We are grateful to the Laboratory for Molecular Simulation, Department of Chemistry, Texas A&M University, for providing the software. Literature Cited (1) CSB. Confined Vapor cloud explosion; No. 2007-03-I-MA; US Chemical Safety and Hazard Investigation Board, 2008. (2) CSB. Static sparks ignites flammable liquid during portable tank filling operation; No. 2008-02-I-IA; US Chemical Safety and Hazard Investigation Board, 2008. (3) CSB. Static spark ignites explosion inside flammable liquid storage tank; No. 2007-06-I-KS; US Chemical Safety and Hazard Investigation Board, 2007. (4) Kletz, T. A. Process PlantssA handbook for inherently safer design; Taylor & Francis: Philadelphia, 1998. (5) National Fire Protection Association. NFPA 704: Standard system for the identification of the hazards of materials for emergency response; 2007.
(6) Hshieh, F.-Y. Correlation of closed-cup flash points with normal boiling points for silicone and general organic compounds. Fire Mater. 1997, 21 (6), 277–282. (7) Satyanarayana, K.; Rao, P. G. Improved equation to estimate flash points of organic compounds. J. Hazard. Mater. 1992, 32, 81–85. (8) Prugh, R. W. Estimation of flash point temperature. J. Chem. Educ. 1973, 50 (2), 85-89. (9) Suzuki, T.; Ohtaguchi, K.; Koide, K. A method for estimating flash points of organic compounds from molecular structures. J. Chem. Eng. Jpn. 1991, 24 (2), 258-261. (10) Stefanis, E.; Constantinou, L.; Panayiotou, C. A Group-Contribution Method for Predicting Pure Component Properties of Biochemical and Safety Interest. Ind. Eng. Chem. Res. 2004, 43 (19), 6253–6261. (11) Katritzky, A. R.; Petrukhin, R.; Jain, R.; Karelson, M. QSPR Analysis of Flash Points. J. Chem. Inf. Comput. Sci. 2001, 41 (6), 1521– 1530. (12) Katritzky, A. R.; Stoyanova-Slavova, I. B.; Dobchev, D. A.; Karelson, M. QSPR modeling of flash points: An update. J. Mol. Graphics Modell. 2007, 26 (2), 529–536. (13) Karelson, M. Molecular Descriptors in QSAR/QSPR; WileyInterscience: New York, 2000. (14) Katritzky, A. R.; Maran, U.; Lobanov, V. S.; Karelson, M. Structurally diverse quantitative structure-property relationship correlations of technologically relevant physical properties. J. Chem. Inf. Comput. Sci. 2000, 40 (1), 1–18. (15) Godavarthy, S. S.; Robinson, R. L.; Gasem, K. A. M. An Improved Structure-Property Model for Predicting Melting-Point Temperatures. Ind. Eng. Chem. Res. 2006, 45 (14), 5117–5126. (16) Sola, D.; Ferri, A.; Banchero, M.; Manna, L.; Sicardi, S. QSPR prediction of N-boiling point and critical properties of organic compounds and comparison with a group-contribution method. Fluid Phase Equilib. 2008, 263 (1), 33–42. (17) Achenie, L. E. K.; Gani, R.; Venkatasubramanian, V. Computer Aided Molecular Design: Theory and practice; Elsevier: Amsterdam, 2003. (18) Odele, O.; Macchietto, S. Computer aided molecular design: A novel method for optimal solvent selection. Fluid Phase Equilib. 1993, 82, 47–54. (19) Wang, Y.; Achenie, L. E. K. A hybrid global optimization approach for solvent design. Comput. Chem. Eng. 2002, 26 (10), 1415–1425. (20) Karunanithi, A. T.; Achenie, L. E. K.; Gani, R. A New Decomposition-Based Computer-Aided Molecular/Mixture Design Methodology for the Design of Optimal Solvents and Solvent Mixtures. Ind. Eng. Chem. Res. 2005, 44 (13), 4785–4797. (21) Kazantzi, V.; Qin, X.; El-Halwagi, M.; Eljack, F.; Eden, M. Simultaneous Process and Molecular Design through Property Clustering Techniques: A Visualization Tool. Ind. Eng. Chem. Res. 2007, 46 (10), 3400–3409. (22) Visco, D. P.; Pophale, R. S.; Rintoul, M. D.; Faulon, J.-L. Developing a methodology for an inverse quantitative structure-activity relationship using the signature molecular descriptor. J. Mol. Graphics Modell. 2002, 20 (6), 429–438. (23) Brown, N.; McKay, B.; Gasteiger, J. A novel workflow for the inverse QSPR problem using multiobjective optimization. J. Comput.-Aided Mol. Des. 2006, 20, 333–341. (24) Kier, L. B.; Hall, L. H.; Frazer, J. W. Design of molecules from quantitative structure-activity relationship models. 1. Information transfer between path and vertex degree counts. J. Chem. Inf. Comput. Sci. 1993, 33 (1), 143–147. (25) Skvortsova, M. I.; Baskin, I. I.; Slovokhotova, O. L.; Palyulin, V. A.; Zefirov, N. S. Inverse problem in QSAR/QSPR studies for the case of topological indexes characterizing molecular shape (Kier indices). J. Chem. Inf. Comput. Sci. 1993, 33 (4), 630–634. (26) Skvortsova, M. I.; Fedyaev, K. S.; Palyulin, V. A.; Zefirov, N. S. Inverse structure-property relationship problem for the case of correlation equation containing the hosoya index. Dokl. Chem. 2001, 379 (1-3), 191– 195. (27) Needham, D. E.; Wei, I. C.; Seybold, P. G. Molecular modeling of the physical properties of alkanes. J. Am. Chem. Soc. 1988, 110 (13), 4186–4194. (28) Raman, V. S.; Maranas, C. D. Optimization in product design with properties correlated with topological indices. Comput. Chem. Eng. 1998, 22 (6), 747–763. (29) Acros Organics. Catalog of organic and fine chemicals; Fisher Scientific, Ed.; 2007/2008. (30) Aldrich Chemical Co. Catalog handbook of fine chemicals; 19961997. (31) Archer, W. L. Industrial SolVents Handbook; Marcel Dekker, Inc.: NewYork, 1996.
Ind. Eng. Chem. Res., Vol. 48, No. 15, 2009 (32) NCBI. PubChem Project. http://pubchem.ncbi.nlm.nih.gov/ (accessed May 2008). (33) Materials Studio Release Notes, Release 4.3; Accelrys Software Inc.: San Diego, 2006. (34) Shihab, K. A backpropagation neural network for computer network security. J. Comput. Sci. 2006, 2 (9), 710–715. (35) Nash, S. G.; Sofer, A. Linear and nonlinear programming; McGraw Hill: New York, 1996. (36) Himmelblau, D. M. Accounts of Experiences in the Application of Artificial Neural Networks in Chemical Engineering. Ind. Eng. Chem. Res. 2008, 47 (16), 5782–5796. (37) Duda, R. O.; Hart, P. E.; Stork, D. G. Pattern Classification, 2nd ed.; Wiley: New York, 2001. (38) Vidal, M.; Rogers, W. J.; Mannan, M. S. Prediction of Minimum Flash Point Behaviour for Binary Mixtures. Process Saf. EnViron. Prot. 2006, 84 (1), 1–9.
7387
(39) Katritzky, A. R.; Slavov, S. H.; Dobchev, D. A.; Karelson, M. Rapid QSPR model development technique for prediction of vapor pressure of organic compounds. Comput. Chem. Eng. 2007, 31 (9), 1123–1130. (40) Rouvray, D. H. The modeling of chemical phenomena using topological indices. J. Comput. Chem. 1987, 8 (4), 470–480. (41) Kier, L. B. A Shape Index from Molecular Graphs. Quant. Struct.Act. Relat. 1985, 4 (3), 109–116. (42) Kier, L. B.; Hall, L. H. Derivation and significance of valence molecular connectivity. J. Pharm. Sci. 1981, 70 (6), 583–589.
ReceiVed for reView January 17, 2009 ReVised manuscript receiVed March 29, 2009 Accepted June 3, 2009 IE9000794