J. Chem. In8 Comput. Sci. 1995, 35, 841-850
841
Prediction of Normal Boiling Points for a Diverse Set of Industrially Important Organic Compounds from Molecular Structure Matthew D. Wessel and Peter C. Jurs* Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, University Park, Pennsylvania 16802 Received March 17, 1995@ Models that accurately predict normal boiling points for organic compounds containing heteroatoms have been developed with regression and computational neural network methods. The structures of the compounds are represented by calculated structural descriptors. Two models are presented-one for a set of 277 compounds containing only 0, S , and halogens, and a second for a set of 104 compounds all containing N. Root-mean-square errors of about 9 K result. The accuracy of prediction of these models is compared to a widely used group contribution method for boiling point estimation. INTRODUCTION Quantitative structure-property relationships (QSPRs) attempt to quantitatively link attributes of molecular structure to physico-chemical properties. These mathematical links, or models, have a variety of important uses. The prediction of certain properties is an important function that QSPRs fulfill. There has been a substantial amount of interest in the prediction of normal boiling points.'-3 The normal boiling point is one of the simplest physical properties that can be used to identify an unknown compound, and boiling points can also be used to determine other physical properties, such as heats of vaporization4 and critical temperature^.^ Prediction methods have also been important and useful in managing and updating physical property database^.^,' These methods provide estimates of boiling points to fill in gaps in databases and can also be used for validation of the data. The Design Institute for Physical Property Data (DIPPR)8 has been interested in prediction methods for the purpose of compiling a high quality, accurate database of physical properties for several thousand compounds of industrial importance. In a previous paper a diverse set of 298 organic compounds was taken from the DIPPR Project 801 database9 and used to develop a boiling point model.' The model developed provided clues as to the direction that needed to be taken to develop better models with the DIPPR data. It was apparent that the compounds needed to be divided into subsets. More recently, a set of 356 hydrocarbons was taken from the DIPPR database, and a high quality model for boiling points was developed.I0 This model reinforced the idea that subsetting was the best approach to take in order to ensure that the models developed would be of high quality. Having built a model for predicting boiling points for hydrocarbons, the next step was to develop a model to estimate boiling points for heteroatom-containing compounds. This paper describes the work completed using the heteroatom-containing compounds taken from the DIPPR Project 801 database. The QSPR software system ADAPT," developed over the years in our laboratory, was used to create the models. We have consistently compared the results obtained with our methodology to the Joback group contri@Abstractpublished in Advance ACS Abstracts, September 1, 1995.
bution method,'* a widely used method for estimating boiling points. EXPERIMENTAL SECTION A DEC Alpha 3000 A X P Model 500 workstation running DEC OSF/l ver. 1.2 B field test software was used to perform all computations involving the ADAPT software system. The names and normal boiling points for all compounds used in this study were provided by DIPPR or taken from other published sources. Data Set. All organic compounds containing evaluated experimental normal boiling points and one or more heteroatoms were extracted from the current DIPPR database. The range of molecular weights was 27-346 amu, and the number of carbons per compound spanned the range 1- 15. The heteroatoms included N, 0,S, F, C1, Br, and I. This screeningprocedure produced a working data set of 327 compounds. Of these 327 compounds, 27 were chosen randomly and comprised an external prediction set, and the remaining 300 comprised the training set. The external prediction set was never used until after a model had been developed using the training set. Uncertainties. The experimental error for each compound in the data set was provided by DIPPR. However, the error codes provided only an upper limit of error for each compound. Therefore low and high error bounds were estimated for the data set. The mean absolute error (mae) was estimated at 2.3 K for the low error bound. This corresponded to an rms error of 5.4 K. The high error bound was estimated at 7.6 K mae or an rms error of 11.4 K for the data set. Therefore, in order to prevent overfitting of the data, the rms errors of the neural network models should fall within the approximate range of 5.4 to 11.4 K. Molecular Modeling. The 327 structures in the data set were modeled using the semiempiricalmolecular orbital routine MOPAC (vers. 6.0)13with the PM3 Hamiit~nian.'~ Since there are several molecular interactions crucial to boiling point that are dependent on the geometry of a given compound, molecular modeling was deemed necessary. Descriptor Generation and Analysis. Molecular descriptors fall into three general categories-topological,~s~zo geomet1ic,2'-*~
and electroni~.~~ Topological descriptors are numerical representations of the two-dimensional carbon (or heavy atom) skeleton of a molecule. Geometric descriptors rely on the actual threedimensional conformation of the molecule for the calculation of quantities such as volume, surface area, and moments of inertia. Electronic descriptors include partial atomic charges and numerical combinations of various partial charges in-a molecule. The three categories of descriptors are by no means mutually exclusive; for
0095-2338/95/1635-0841$09.00/0 0 1995 American Chemical Society
842 J. Chem. In5 Comput. Sci., Vol. 35, No. 5, 1995
WESSEL AND
JURS
Table 1. List of Compounds Taken from DIPPR Project 801 Database9Used To Develop the N-Excluded Data Subset Model no. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
compound bromotrichloromethane chlorotrifluoromethane carbon tetrachloride carbon tetrafluoride tribromomethane chlorodifluoromethane dichlorofluoromethaneb chloroform" trifluoromethane dibromomethaneb dichloromethane difluoromethane diiodomethane' formaldehyde formic acidb methyl bromide methyl chloride" methyl fluoride methyl iodide methanol methyl mercaptan
1,l-dichlorotetrafluoroethane hexafluoroethaneb 1,l-dichloro-2,2,2-trifluoroethane trichloroethylene trichloroacetaldehydeC pentachloroethane pentafluoroethane
1.1,2,2-tetrabromoethane 1,1,1,2-tetrachloroethane 1.1,2,2-tetrachloroethane 1,l-difluoroethylene
1,1,2,2-tetrafluoroethaneb vinyl bromide acetyl chloride 1,1,1-trichloroethane" 1,1,2-trichIoroethane vinyl fluoride l,l,l-trifluoroethane 1,l-dibromoethane 1,2-dibromoethane 1.1-dichloroethane 1,2-dichloroethane 1,l-difluoroethane 1,2-difluoroethane acetic acid" methyl formate bromoethane ethyl chloride ethyl fluoride ethyl iodide dimethyl ether ethanol ethylene glycol dimethyl sulfide ethyl mercaptan dimethyl disulfide" 1.2-ethanedithiol 2-chloropropene 1,2,3-trichloropropane 1,2-dichloropropane" acetone 1-propanal 1,2-propyleneoxide ethyl formate methyl acetate propionic acidb 1-bromopropane isopropyl chloride n-propyl chloride isopropyl iodide" n-propyl iodide isopropanol
exptl NBP (K) 378.0 191.7 349.8 145.1 422.3 232.3 282.0 334.3 191.0 370.1 312.9 221.5 455.2 254.0 373.7 276.7 248.9 194.8 315.6 337.8 279.1 276.2 194.9 301.0 360.1 370.8 433.0 225.1 516.7 403.7 418.3 187.5 250.1 288.9 323.9 347.2 387.0 200.9 225.8 381.1 404.5 330.4 356.6 247.4 283.6 391.1 304.9 311.5 285.4 235.4 345.4 248.3 35 1.4 470.5 310.5 308.2 382.9 419.2 295.8 430.0 369.5 329.4 321.1 307.6 327.5 330.1 414.3 344.1 308.8 319.7 362.6 375.6 355.4
calcd NBP
(Wd 385.9 190.8 360.0 161.7 415.0 220.6 278.5 341.8 180.7 369.3 311.9 201.6 267.5 402.5 290.8 252.0 204.6 312.4 338.5 282.4 293.1 195.6 301.7 345.5 429.2 219.2 520.0 391.9 417.3 210.2 253.2 299.7 318.7 359.3 373.1 226.8 213.9 386.7 407.8 330.3 349.0 236.2 268.5 409.4 307.3 318.5 283.8 242.7 338.0 250.8 347.8 478.7 311.2 305.6 399.5 418.2 297.2 425.6 370.3 321.0 320.7 306.1 334.3 321.6 413.4 344.0 308.9 313.4 364.3 366.4 356.9
exptl NBP no. 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
compound methyl ethyl ethe@ 1-propanol 1,2-propyleneglycol 1,3-propyleneglycol isopropyl mercaptanb n-propyl mercaptan furan thiophene 2,5-dihydrofuran methacrylic acid 1,2-dichlorobutane 2,3-dichlorobutane 1-butanalb methyl ethyl ketone 2-methylpropanal tetrahydrofuranb n-butyric acid" ethyl acetate isobutyric acid methyl propionate n-propyl formate tetrahy drothiophene" 1-bromobutane 2-bromobutane n-butyl chloride" sec-butyl chloride rert-butyl chloride isobutyl chloride 1-butanolu 2-butanol diethyl etheF 2-methyl-1-propanol 2-methyl-2-propanol methyl n-propyl ether" 1,3-butanediol 1,4-butanediolb 2.3-butanediol 2-methyl-1,3-propane n-butyl mercaptan sec-butyl mercaptana terr-butyl mercaptan isobutyl mercaptan methyl n-propyl sulfide 2-methylthiophene methyl isopropyl ketone 1-pentanal 2-pentanone 3-pentanone n-butyl formateb sec-butyl formate rert-butyl formate ethyl propionate isobutyl formate isopropyl acetate methyl n-butyrate isovaleric acid neopentanoic acid n-pentanoic acid n-propyl acetate 1 -chloropentane 2,2-dimethyl-1-propanol ethyl isopropyl ether ethyl propyl ether 2-methyl-1-butanol" 2-methyl-2-butanol 3-methyl- 1-butanol 3-methyl-2-butanol methyl n-butyl ether methyl sec-butyl ether methyl tert-butyl ether 1-pentanoP 2-pentanol" 3-pentanol
calcd NBP
(K)
(Kid
280.5 370.4 460.8 487.6 325.7 340.9 304.5 357.3 339.0 434.2 397.1 392.6 348.0 352.8 337.3 339.1 436.4 350.2 427.7 352.6 354.0 394.3 374.8 364.4 35 1.6 341.3 323.8 342.0 390.8 372.7 307.6 380.8 355.6 312.2 480.2 501.2 453.9 487.2 371.6 358.1 337.4 361.6 368.7 385.7 367.5 376.1 375.5 375.1 379.3 366.5 356.0 372.3 37 1.2 361.6 375.9 448.3 437.0 458.9 374.6 38 1.5 386.3 326.1 337.0 401.9 375.1 404.4 384.6 343.4 332.1 328.4 410.9 392.1 388.4
277.1 368.3 457.3 485.5 326.7 336.7 308.4 359.1 335.9 418.5 394.3 393.4 355.7 342.4 353.0 347.2 439.6 346.8 436.4 338.7 360.4 392.3 375.8 369.3 343.8 334.3 328.6 338.2 389.1 361.9 304.2 381.1 367.9 304.9 475.5 495.0 45 1.8 479.1 367.1 342.0 347.3 361.7 364.7 383.6 366.2 383.0 369.4 363.2 385.2 381.6 379.7 410.3 382.6 366.1 362.0 456.7 446.1 458.6 371.4 374.3 391.0 329.1 332.7 401.9 377.3 410.1 381.5 333.0 327.4 324.7 410.0 384.5 375.8
NORMAL BOILINGPOINTPREDICTION
J. Chem. In$ Comput. Sci., Vol. 35, No. 5, 1995 843
Table 1. (continued) exptl NBP no. 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 20 1 202 203 204 205 206 207 208 209 210 21 1 212
compound 1,5-pentanediol methyl n-butyl sulfide methyl tert-butyl sulfide n-pentyl mercaptan hexachlorobenzene hexafluorobenzene 1,2,4-trichlorobenzene" m-dichlorobenzene o-dichlorobenzeneb p-dichlorobenzene m-chlorophenol o-chlorophenol p-chlorophenolb phenol 1,2-benzenediol 1,3-benzenediol p-hydroquinone cyclohexanone ethyl acetoacetate diethyl oxalate" cyclohexanol 3,3-dimethyl-2-butanone ethyl isopropyl ketone 1-hexanal 2-hexanone 3-hexanone methyl isobutyl ketone 3-methyl-2-pentanone" n-butyl acetate sec-butyl acetate tert-butyl acetate" ethyl n-butyrate 2-ethyl butyric acid ethyl isobutyrate n-hexanoic acid isobutyl acetate n-pentyl formate n-propyl propionate n-butyl ethyl ether tert-butyl ethyl ether diisopropyl ether 2-ethyl- 1-butanol 1-hexanol 2-hexanol 2-methyl- I-pentanol" 4-methyl-2-pentanol methyl n-pentyl ether 1,6-hexanediol hexylene glycol di-n-propyl sulfideb ethyl tert-butyl sulfide" n-hexyl mercaptan di-n-propyl disulfide benzotrichlorideb benzyl dichloride" benzoic acid p-hydroxybenzaldehyde salicylaldehyde benzyl chloride benzyl alcoholb m-cresol o-cresol p-creso1 diethyl malonate diisopropyl ketone 1-heptanal
6) 512.2 396.6 372.0 399.8 582.6 353.4 486.2 446.2 453.6 447.2 487.0 447.5 493.1 455.0 518.7 549.7 558.2 428.9 454.0 458.9 434.0 379.4 386.5 401.5 400.9 396.6 389.6 390.6 399.1 385.1 369.1 394.6 466.9 383.0 478.8 389.8 405.5 395.6 365.3 346.0 341.5 419.7 430.6 413.0 421.2 404.9 372.0 516.2 470.6 416.0 393.6 425.8 469.0 486.7 487.0 522.4 583.2 469.7 452.6 477.9 475.4 464.2 475.1 472.0 397.6 426.0
calcd NBP (Kid 507.6 390.1 381.1 396.7 579.1 353.9 497.7 456.3 450.4 457.3 489.4 479.9 488.3 452.0 530.5 550.3 548.0 430.4 428.0 449.8 459.5 386.0 386.7 408.8 394.9 389.4 391.2 389.9 394.4 396.9 379.0 387.0 462.0 389.4 478.7 391.5 409.3 387.6 361.1 351.8 354.3 421.8 430.4 406.5 422.8 419.8 362.3 520.3 466.8 411.2 403.5 424.0 469.2 476.3 467.3 520.2 570.0 485.7 447.3 471.5 468.5 453.0 468.0 476.2 407.0 432.2
exptl NBP no. 213 214 215 216 217 218 219 220 22 1 222 223 224 225 226 227 228 229 230 23 1 232 233 234 235 236 237 238 239 240 24 1 242 243 244 245 246 247 248 249 250 25 1 252 253 254 255 256 257 258 259 260 26 1 262 263 264 265 266 267 268 269 270 27 1 272 273 274 275 276 277
compound 2-heptanone 4-heptanoneb 1-methylcyclohexanol 5-methyl-2-hexanone n-butyl propionate ethyl isovalerate n-heptanoic acid n-hexyl formate n-propyl n-butyrate I-heptanol 2-heptanol 5-methyl-1-hexanol n-heptyl mercaptan vanillin p-ethylphenol 2,3-xylenolb 2,4-xylenol 2,5-xylenol0 2,6-xylenol 3.4-xylenol 3.5-xylenol diethyl maleateb 1-OCtanal 2-octanone n-butyl n-butyrateb n-heptyl formate n-hexyl acetate isobutyl isobutyrate di-sec-butyl ether 2-ethyl- 1-hexanol 1-octanol" 2-octanol n-octyl mercaptan ethyl benzoateb benzyl ethyl ether isophorone diisobutyl ketone 2-nonanone 5-nonanone n-heptyl acetate n-nonanoic acid" n-octyl formate 2,6-dimethyl-4-heptanol 1-nonanol 2-nonanol n-nonyl mercaptan dimethyl phthalate anethole p-tert-butylphenol" n-decanoic acid isopentyl isovalerate n-octyl acetate 1-decanol n-decyl mercaptan 2-ethylhexyl acrylate n-undecanoic acid methyl decanoate 1-undecanol diethyl phthalate" n-decyl acetate n-dodecanoic acid di-n-hexyl ether n-tridecanoic acidb anthraquinone p-tert-octylphenol
(K) 424.0 417.2 441.2 418.0 419.8 407.5 496.2 428.6 416.5 449.5 432.4 445.2 450.1 558.0 491.1 490.1 484.1 484.3 474.2 500.2 494.9 498.2 447.2 445.8 438.2 451.3 444.7 420.6 394.2 457.8 468.3 453.0 472.2 486.6 458.1 488.4 441.4 467.5 46 1.6 465.6 528.8 472.0 45 1.O 486.3 471.7 493.0 556.8 508.5 512.9 543.2 467.2 484.5 504.1 512.3 489.2 557.4 505.0 518.2 567.2 517.2 571.9 498.9 585.3 653.1 563.6
calcd NBP (Kid 419.2 414.2 460.4 415.9 410.2 407.8 495.0 433.1 410.2 450.5 428.4 450.0 450.7 569.5 486.3 473.9 474.8 475.1 460.2 483.7 484.8 493.5 453.3 441.2 433.3 455.7 439.6 43 1.3 405.3 442.8 469.1 448.9 474.0 47 1.3 463.9 486.5 45 1.3 461.2 457.8 461.2 528.3 477.4 457.7 487.0 468.7 495.7 550.6 481.4 519.4 543.5 470.7 481.1 503.6 514.5 489.5 557.6 495.2 519.1 567.1 517.8 569.9 505.4 581.5 652.8 575.6
Compound in external prediction set. Compound in cross-validation set. Compound an outlier in N-excluded data subset model. Boiling point calculated using the final neural network model. example, the dipole moment, considered to be an electronic descriptor, is dependent on geometry. A fourth class of descriptors combines solvent-accessible surface area and atomic charge infor-
mation to produce charged-partial surface area (CPSA) descriptors.26 The same information used to develop CPSA descriptors is also used to calculate descriptors that encode hydrogen bonding, such
844 J. Chem. In$ Comput. Sci., Vol. 35,No. 5, 1995 as the surface area or charges on donatable hydrogen atoms or
acceptor N, 0, or F atoms.6 A total of 124 descriptors were generated for each compound in the data set. The descriptor pool was pruned using a series of objective methods. Any descriptor that contained more than 90% identical values was removed. Since the data set was structurally diverse, it was thought that some descriptors acting as indicator variables (Le., a high percentage of identical values) might be useful later in the study, hence the high cutoff percentage. The pool of descriptors was then checked for painvise correlations, and if two descriptors correlated with r 2 0.94 one of them was removed. The descriptor which was correlated with a smaller number of descriptors or was easiest to explain or calculate was retained. These methods were effective in removing 47 descriptors, leaving a reduced descriptor pool with 77 members. To avoid chance correlations, it is recommended that the number of descriptors submitted to regression analysis be less than 60% of the number of observations in the training set.27 Since there were 300 compounds in the training set, this condition was met, and regression analysis was carried out. Regression Analysis. Multiple linear regression analysis was used to develop linear models that linked the boiling points to a small subset of descriptors. Since the dependent variable was used to develop the models, this process can be thought of as subjective feature selection. Regression equations have the form
+
n
BPj = bo x ( b i X i i ) i= 1
where BP, is the boiling point for compound j , bo is the intercept term, and b, is the coefficient for descriptor X,. Two automated routines were used to perform subjective feature selection with regression analysis. The first routine used a genetic algorithm (GA) to choose high quality subsets of descriptors from the reduced pool. The G A routine used the root-mean-square(RMS) error to evaluate the quality of the models. Internal statistics, such as individual F values, were not utilized by the G A routine. A second routine employed simulated annealing (SA)30,31 to choose subsets of descriptors. As in the G A routine, the SA routine evaluated the quality of the models with the RMS error and internal statistics were not used. Since both methods discussed above utilized only RMS errors to evaluate quality, it was important to examine models suggested by these methods to determine their overall quality. An interactive regression analysis (IRA)routine allowed for this. A pool of descriptors that are of potentially high quality are entered into IRA. Standard regression analysis is used to develop models from the entered pool at the discretion of the user. The number of descriptors in a potential model is also controlled by the user. The routine also provides statistical guidelines, in the form of F-to-enter and T-values. As descriptors are added and removed, the changes in the statistics from model to model can be monitored. Therefore, a more statistically comprehensive evaluation of a given model and the pool of descriptors is achieved. Neural Networks. Computationalneural networks were effective in further minimizing the R M S errors associated with the linear models. Neural networks can provide nonlinear mappings of the descriptors to the boiling point values thus accounting for variability that a linear function can not encode. Their nonlinear nature, coupled with an increase in the number of adjustable parameters, allow neural networks to perform better than linear methods in our studies. A Broyden-Fletcher-Goldfarb-Shanno (BFGS)quasiNewton minimization algorithm was employed to train the computational neural network.32 The theories behind computational neural networks, and the BFGS algorithm have been discussed previously.33
WESSEL AND
JURS
RESULTS AND DISCUSSION Both GA and SA routines were effective in developing
an 11 descriptor model that fit the boiling points for the 298compound training set with an RMS error of 13.5 K. There were two outliers in the 300-compound training set, malononitrile and cyanogen, the only dinitriles present in the data set. The RMS error was relatively high, and examination of the data showed that nitrogen-containing compounds were mostly responsible for this. When the boiling points of the 48 nitrogen-containing compounds were calculated by the model, the RMS error found was 21.3 K. The remaining 250 compounds had an R M S enor of just 13.0 K. Thus, the nitrogen-containing compounds were seriously degrading the quality of the model. It is possible that the compounds containing nitrogen were not properly encoded with the 11 descriptors selected and thus were not modeled adequately. Subdividing the data set is an effective method that can be used to investigate this phenomenon. The 27compound external prediction set had an RMS error of 20.2 K, which provided further evidence that there was a flaw in the model. As a result of the previous observations, the approach was altered. Since the nitrogen-containing compounds were problematic, the original 327-compound data set was split into a set of 50 N-containing compounds and 277 nonnitrogen containing (N-excluded) compounds. Then models were developed for each subset. It should be noted that the 50-compound N-containing subset was augmented with an additional 54 compounds taken from the literature to bring the N-containing subset total to 104 compounds. This was done because it was necessary to have a data set of at least 100 compounds to insure that a statistically valid model could indeed be developed. Part I. N-Excluded Data Subset. The 277 compounds in the N-excluded data subset were split randomly into a training set of 250 compounds and an external prediction set of 27 compounds. Table 1 lists the compounds comprising the N-excluded subset. Since descriptors were already calculated for these 277 compounds, feature selection was immediately performed as described previously. A reduced descriptor pool with 77 members was obtained after objective feature selection was completed. The GA and SA descriptor selection routines each produced the same 10-member subset of descriptors. However, it was apparent that the model was not encoding the data set adequately. The residual values produced by this model were sorted in descending order. After examining the sorted residuals, it was clear that several trends existed that were not encoded. Structural features such as carbonyl groups and fluorine or sulfur atoms were inflating the FUvlS error, as structures with these features tended to exhibit large positive or negative residuals. To rectify the problem, seven new descriptors were specifically calculated to encode the structural moieties. Two descriptors, the number of sulfur and fluorine atoms, which were originally removed because they contained greater than 90% identical values, were also added because they can act as indicator variables. The new reduced pool was then submitted to regression analysis. After repeating the GA and SA feature selection routines, several high quality models were developed using the 250compound training set. IRA was used to determine which
J. Chem. In8 Comput. Sci., Vol. 35, No. 5, 1995 845
NORMAL BOILINGPOINTPREDICTION Table 2. Final Linear Regression Model for the Prediction of Normal Boiling Points for the N-Excluded Subset Taken from the DIPPR Project 801 Database9 coeff
sd of coeff
label
0.3009 -3.690 -51.78 9.515 19.21 554.7 -25.52 19.52 50.84 -135.0
0.01105 0.1656 6.394 0.3571 0.5395 19.06 0.9835 2.364 4.345 13.55
PPSA PNSA RPCG NRA SQMW SADH NF KETO SULF SNA
descriptor definition
partial positive surface areaa partial negative surface areao relative positive charge" number of ring atoms square root of molecular weight surface area of donatable hydrogensb number of fluorines' ketone indicator' number of sulfide groupsc number of sulfurshotal number of atomsC 59.86 5.411 Y-intercept R = 0.991; RMS elTor = 11.6 K; N = 248 compds; F = 1253.7
2ool/'
See ref 25. See text and ref 6. Descriptor added to pool after preliminary screening (see text). a
Table 3. Example Calculation of Normal Boiling Points for Compounds 81 and 138 Using the Linear Regression Model in Table 2 descriptor label
model coeff
Y-intercept 59.86 PPSA 0.3009 PNSA -3.690 RPCG -51.78 NRA 9.515 SQMW 19.21 NF -25.52 SADH 554.1 KETO 19.52 SULF 50.84 S/NA -135.0 calcd bp: 355.0, 370.5 K exptl bp: 357.3, 375.1 K
value for thiophene (81)
value for 2-methyl2-butanol (138)
1.o 121.3 -6.565 0.2517 5 .Ooo 9.165 0.0 0.0 0.0 1.o 0.2000
1.o 24.18 -12.52 0.331 1 0.0 9.381 0.0 0.05 154 0.0 0.0 0.0
model was of highest statistical quality. The best model developed was a 10-variable model with an R M S error of 12.3 K. Several validation methods were used to examine the stability of the model. The model was fnst checked for outlier values. The six statistical tests used were the residual, standardized residual, studentized residual, leverage, DFFITS statistic, and Cooks distance.34 If four or more of these six tests for a given compound exceeded the cutoff value for that test, the compound was considered an outlier. Only two compounds were flagged by the test, diiodomethane and trichloroacetaldehyde. When the model coefficients were recalculated, the RMS error was 11.6 K, a significant improvement. Therefore, the outliers were removed permanently, leaving 248 compounds in the training set. Table 2 shows the final 10-descriptor model and Table 3 provides an example calculation for the compounds thiophene and 2-methyl-2-butanol. To further validate the model several other tests were performed. The residual values were plotted against the calculated boiling point values, and there was no observable pattem. The largest pairwise correlation coefficient was 0.592 for the 10 descriptors. The variance inflation factor (VIF), defined as (1 - R2)-', was determined for each descriptor as well. A VIF larger than 10 is indicative of multicolinearity problems.35 The highest VIF was 2.28 for the model. These tests demonstrate the high intemal stability of the model. A plot of calculated vs observed boiling points,
100V' ' 100 200
R M S E m = 11.6 K
j"pdl.1 '
'
'
'
'
300 400 500 Obwrved BP (K)
I
' " 600 700
'
Figure 1. Plot of calculated vs observed boiling points for the training set using the N-excluded linear regression model. 7
0
600
I
0
1
.
i
'
i
.
i
1
.
I/
t
.
-
8 k 400
300
200
1oov
100
'
'
200
'
'
'
400
'
'
500 Obwrved BP (IC)
300
'
'
600
'
J
700
Figure 2. Plot of predicted vs observed boiling points for the extemal prediction set using the Nexcluded linear regression model.
given in Figure 1, showed no observable pattem or deviation from normal behavior. The final validation step for the model involved predicting the boiling points for the 27 compounds in the external prediction set. The RMS error was 10.5 K for the external prediction set, thus completing the final validation step. Figure 2 shows a plot of predicted vs observed boiling points for the prediction set. The descriptors in the final model cannot be directly linked to boiling point in a causal sense. They are numerical representations of molecular structure that collectively will encode the boiling point with a high degree of accuracy. The descriptors themselves do not cause the boiling point value to change. However, it is reasonable to use the descriptors as an aid in deciphering what structural features are likely to be important. To this end, examination of the descriptors clearly shows that for a diverse group of compounds, indicators of specific substructure features are important. For instance, the number of fluorines has a large negative coefficient, thus suggesting that fluorine atoms wiIl tend to
~
846 J. Chem. InJ: Comput. Sei., Vol. 35,No. 5, 1995
WESSEL AND
JURS
Table 4. List of Unique Structural Groups and Their Coefficients for the Three Group Contribution Approaches for the N-Excluded Data Subset group constant -CH3 -CH2-CH2>CH'CH'C' >C' =CH2 =CH=CH=C < =C
>
c-0 c=o
alcohol phenol nonring ring nonring ring
-HC-O -COOH
-coo-
ester
-SH -S-SSQMW
nonring ring
Joback
regression I
regression I1
198.18 23.58 22.88 27.15 21.74 21.78 18.25 21.32 18.18 24.96 26.73 24.14 31.01 -0.03 38.13 66.86 93.84 92.88 76.34 22.42 3 1.22 76.75 94.97 72.24 169.09 81.10 63.56 68.78 52.10
232.85 11.12 21.56 30.47 25.00 17.79 12.32 -34.36 -15.88 24.42 22.23 26.21 29.02 -13.24 31.73 57.11 80.00 79.25 84.69 20.80 -4.19 77.10 20.78 60.03 141.88 74.49 64.34 68.33 40.79
103.17 3.06 7.85 13.71 5.09 -3.15 -13.43 -32.51 -18.82 14.29 11.02 4.85 14.49 -23.54 6.66 -4.37 -31.27 74.15 73.10 11.73 6.75 54.49 45.71 46.40 112.28 37.96 44.63 41.31 20.38 21.05
lower the boiling point of a compound. In fact, the boiling point of fluoromethane is higher than the boiling point of tetrafluoromethane. This trend is not observed for the other halo-methane compounds. Most of the descriptors in the model are fairly straightforward. The three CPSA descriptors encode intermolecular interaction properties, such as dipoledipole interactions and London dispersion forces. Collectively, the descriptors accurately estimate boiling points for organic compounds that do not contain nitrogen. Once a final model was determined, it was compared to the Joback group contribution method. In group contribution methods, a compound is broken down into unique substructure groups, and the frequencies of those groups are regressed against the dependent variable of interest, in this case boiling point. Joback used a diverse set of 438 compounds to generate coefficients.'* In that data set, there were 43 unique substructure groups. A total of 28 groups were unique to the 248 member training set used in this study. Joback's original coefficients for those 28 groups, shown in Table 4, were used to predict the boiling points for the training set. The RMS error was 26.7 K for the training set. This rather large value is not surprising, since the original coefficients were derived from a different set of compounds. A new set of coefficients was calculated by regressing the group counts against the boiling points for the training set. These coefficients are listed in Table 4 under the regression I heading. The RMS error was 17.6 K, a significant improvement. In a previously published paper, we used the square root of the molecular weight in combination with Joback's groups and developed an excellent model for hydrocarbon boiling points.I0 Taking that approach here, new coefficients were calculated for the 28 groups plus the square root of molecular weight. Table 4 lists the coefficients
Training Set (223 cmpdr.) 600
100 100
C.V.Set (25 cmpdti.)
300 400 500 600 700 Observed BP (IC) Figure 3. Plot of calculated vs observed boiling pints for the N-excluded training and CV sets using the computational neural 200
network model. under the regression I1 heading. The RMS error was 14.7 K for the training set with this enhanced group contribution method. The three group contribution methods do not yield results that are comparable to the ADAPT methodology. Additionally, the best linear model from ADAPT has 11 adjustable coefficients, whereas the best group contribution model has 30 adjustable coefficients and performs worse. The ten descriptors from the final linear model were used to develop a computational neural network model. Neural networks take advantage of nonlinearity to improve the link between the descriptors and boiling points. The 248compound training set was split randomly into a new training set of 223 compounds and a cross-validation (CV) set of 25 compounds. The original 27-compound external prediction set was used to validate the best network model developed. Several trial runs with different network architectures were performed to determine the best setup using the program QNET, developed in our l a b ~ r a t o r y .A~ ~10:6:1 (10 input, 6 hidden, and 1 output neuron) network architecture was chosen. This architecture minimized the number of adjustable parameters without sacrificing network performance. As a general rule, the ratio of observations (223) to adjustable network parameters (73) should be no less than 2.0, and the ratio for this network is much greater.36 Once the network architecture was established, an automated version of QNET was used to find an optimal network model. The automated version conducted 500 individual training sessions and recorded the lowest RMS errors for the training and CV set. Since the training algorithm is very sensitive to the starting set of weights and biases, the automated version is used to find quality starting points that lead to low RMS errors. Sessions that provided low RMS errors were examined with more scrutiny. The session that provided the lowest and best behaved CV set RMS error was chosen as the best network model. This model had an RMS error of 9.1 K for the training set and 8.6 K for the CV set. A plot of calculated vs observed boiling points for the training and CV sets is shown in Figure 3. The extemal prediction set was used to validate the final network model. The prediction set RMS error was 9.0 K.
NORMALBOILING POINTPREDICTION
J. Chem. In& Comput. Sci., Vol. 35, No. 5, 1995 847
Table 5. List of Compounds in the N-Containing Data SubseP no. 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
compound hydrogen cyanide nitromethane methylamine acetonitrileb ethyleneimine nitroethane dimethylamine" ethylamine cyanogenc malononitrile' acrylonitrileb propionitrile 1-nitropropane 2-nitropropane n-propylamine trimethylamine cis-crotonitrile" n-but yronitrile isobutyronitrile pyrrolidine" n-butylamine tert-butylamine diethylamine" isobutylamine n-pentylamineb m-chloroaniline p-chloroanilineb aniline 3-methylpyridineb phenylhydrazine hexanenitrile di-n-propylamine 3-nitrobenzotrifluoride benzonitrile p-nitrotoluene n-heptylamine indole n-octylamine isoquinoline quinoline n-nonylamine tripropylamine" n-dec ylamine diamylamine undecylamine n-dodecylamine tri-n-butylamine acridine n-tetradecylamine" triamylamine' 2-butylamine"sd pyrroled
exptl NBP
calcd NBP
(K)
(KY
298.8 374.4 266.8 354.8 329.0 387.2 280.0 289.7 252.0 491.5 350.5 370.5 404.3 393.4 321.0 276.0 380.6 390.8 376.8 359.7 350.6 317.5 328.6 340.9 377.6 501.7 503.7 457.2 417.3 516.7 436.8 382.0 475.9 464.1 511.7 430.1 526.1 452.8 516.4 510.8 475.4 429.7 493.1 476.1 514.8 532.4 487.2 619.2 564.5 516.2 335.9 402.9
315.3 385.0 278.3 341.1 311.6 39 1.9 286.6 293.1 359.5 365.9 401.6 391.2 323.7 294.5 312.7 387.9 372.2 356.9 352.0 317.9 32 1.3 339.6 379.0 509.4 510.6 460.9 406.7 495.1 433.0 384.4 481.3 475.0 507.5 426.8 529.7 448.8 515.6 5 10.9 470.8 439.1 492.2 478.7 514.2 535.6 490.6 610.5 575.2 334.6 403.1
exptl NBP no. 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 13 14 75 76 71 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
compound l-methylpyrroled 2-methylpyrroled 3-methylpyrroled pyridined 2-methylpyridined
4-methy lpyridined piperidined 1-naphthylamine",d 2-naphthylaminebsd 2-propylamine"t' 2-methyl aniline' 3-methyl aniline' 4-methyl aniline' carbazoled 9-methyl carbazoled 1-(2-aminoethyl)piperizineb
1-(2-aminoethyl)piperidine tert-pentylamine o-methoxyaniline m-methoxyaniline p-methoxyaniline benzylamine N-methylbutylamine 2-methy lbutylamine cyclopentylamine p-nitrophenol allylamine N-allylanilineb nitrobutane nitrocyclopentane o-nitrotoheneb m-nitrotoluene N,N-diethylmethylamine triethylamine N-ethylbutylamine o-bromoaniline m-bromoaniline o-fluoroaniline m-nitroaniline p-fluorobenzylamine benzylamine" 3,3-dimethylpiperidine N,N-dimethyl- 1,3-diaminomethane 2,2-dimethyl-1,3-diaminomethane 1-ethylpiperidhe 2-ethylpiperidine 2-ethylpyridine N-methylcyclohexylamine 2-aminoheptane
N-tert-butylisopropylamine" n-hepty lamine N-methylhexylamineb
calcd NBP
(K)
(KY
386.0 420.7 416.0 388.4 402.5 418.5 379.4 573.8 579.3 304.9 473.5 476.5 473.6 627.8 616.8 493.2 459.2 350.2 498.2 524.2 5 14.7 457.7 364.2 368.7 380.2 552.2 326.2 492.2 425.7 453.2 498.2 503.7 337.2 362.0 381.2 502.2 524.2 455.1 551.2 456.2 433.2 410.2 418.2 426.2 404.2 416.2 422.2 422.2 416.2 37 1.2 428.2 414.2
377.8 420.0 422.8 378.8 406.7 409.4 385.3 574.8 571.3 314.4 47 1.4 479.0 478.6 626.3 618.2 488.7 469.0 341.4 486.7 514.6 503.9 441.4 365.8 37 1.4 386.1 541.6 317.2 509.4 418.4 437.7 492.8 507.6 337.4 358.3 382.0 492.2 512.5 486.3 560.9 487.9 448.1 413.6 410.3 425.6 405.1 417.9 434.8 42 1.9 414.0 384.6 426.2 418.0
Compound in extemal prediction set. Compound in cross-validation set. Compound an outlier. See ref 37. See ref 38. f Boiling point ~ and NBP predicted by final neural network model. g Compounds and NBP values 1-50 taken from DIPPR Project 801 D a t a b a ~ e .Compounds values 68-104 taken from 1994 Aldrich catalog.39
Figure 4 displays a plot of predicted vs observed boiling used to further reduce the descriptor pool to about 40 members. points for the prediction set. Part 11. N-ContainingData Subset. The 50 nitrogenBoth the GA and SA feature selection routines were used containing compounds from the DIPPR database were to find good models. A model with 10 descriptors was found augmented with 54 compounds from three new s o ~ r c e s . ~ ~ - and ~ ~ analyzed in depth with the IRA routine. The R M S error was 16.9 K for the original model. This was due to A listing of the 104 compounds is given in Table 5. The same methodology detailed in the N-excluded section was malononitrile and cyanogen. These two compounds were the only dinitriles, as mentioned previously. They also also used here. The data set was split randomly into a 93exhibited very large residuals, so they were removed. The compound training set and an 11-compound extemal prediction set. Descriptors (129) were generated for each commodel coefficients were recalculated, and the RMS error was pound. Objective descriptor screening reduced the pool to 11.O K, a significant improvement. 67 members. Since there were only 93 compounds in the The model was validated with the same methods used for training set, a vector-space descriptor analysis routine''' was the N-excluded data set. One outlier, triamylamine, was
848 J. Chem. In$ Comput. Sci., Vol. 35,No. 5, 1995
WESSELAND JURS
700 600
500
400
RMS Error = 9.0K
IOOV' 100
' 200
'
I
300
'
'
400
'
500
'
'
600
'
I
700
Figure 4. Plot of predicted vs observed boiling points for the N-excluded prediction set using the computational neural network model.
d
/
600 -
-
a
500
f
400 -
3
3
!3i
400
300 -
J'
2 0 0 V '
200
R M S Error 6 10.7 K
"
300
' ' 400
I
500 Obrved BP (K)
"
600
-
'*v,iI"-"""j, RMS Error = 9.3 K
Observed BP (K)
8
500 -
b:
300
700
8
"
700
Figure 5. Plot of calculated vs observed boiling points for the training set using the N-containing linear regression model.
removed. The coefficients were recalculated, and the R M S error was 10.7 K for the training set. Table 6 shows the final linear regression model and coefficients. A plot of residual vs calculated boiling points produced no observable
200 200
300
400 500 Observed BP (K)
600
700
Figure 6. Plot of predicted vs observed boiling points for the prediction set using the N-containing linear regression model.
patterns. The largest pairwise r value was 0.83, and the largest VIF was 9.7, which is lower than the limiting value. Figure 5 shows a plot of calculated vs observed boiling points for the training set. The plot looks smooth, and the data fit the ideal one to one correlation line quite well. The external prediction set R M S error was 9.3 K, thus providing the final validation for the linear model. A plot of predicted vs observed boiling points for the prediction set is depicted in Figure 6 . Examination of the descriptors can lend some insight into what structural features are important to boiling point. However, as mentioned before, the descriptors themselves are not the cause of the normal boiling point for any given compound. For this data set, there are several whole molecule descriptors in the final model. Individual substructure features are not as important. This is not surprising, since the data set is composed of N-containing compounds only. The dipole moment and CPSA descriptors are encoding different aspects of intermolecular interactions. The connectivity indices MOLC 4 and V3C are encoding the degree of branching in the structure. As the branching increases, the values of MOLC 4 and V3C both increase, but the boiling point will generally decrease in value. The connectivity descriptor WPT 3 is encoding the topological environment of heteroatoms in each structure. As the number of nitrogen atoms increases, the degree of hydrogen
Table 6. Final Linear Regression Model for the Prediction of Normal Boiling Points for the N-Containing Data Subset taken from the DIPPR Proiect 801 Database9 and Refs 37-39 coeff 16.49 0.3673 -346.4 4.226 14.20 45.63 -58.34 8.349 263.9 144.7 338.2
a
sd of coeff label descriptor definition 2.382 DPOL dipole momento 0.05 197 PNSA partial negative surface areab 24.23 RNCG relative negative charge* 0.3317 RNCS relative negative charged surface areab 0.4873 NAB number of aromatic bonds 2.502 MOLC 4 path 2 molecular connectivity index' 5.569 v3c cluster 3 valence connectivity index' 1.234 WTPT 3 sum of all path weights from heteroatomsd 27.76 SADH surface area of donatable hydrogens' 27.04 CHDH charge of donatable hydrogens' 11.78 Y-intercept R = 0.990; RMS error = 10.7 K; N = 90 compds; F = 397.8
See ref 25. See ref 26. See ref 19. See ref 18. e See text and ref 6.
NORMALBOILINGPOINTPREDICTION
J. Chem. In& Comput. Sei., Vol. 35, No. 5, 1995 849
Table 7. List of Unique Structural Groups and Their Coefficients for the Three Group Contribution Approaches for the N-Containing Data Subset group constant -CH3 -CH2-CHI'CH> CH'C < >C' =CH2 =CH=CH=C