Anal. Chem. 1987, 59, 2322-2327
Prediction of Olefin Boiling Points from Molecular Structure
Peter J. Hansen¹ and Peter C. Jurs*
Department of Chemistry, 152 Davey Laboratory, The Pennsylvania State University, University Park, Pennsylvania 16802
The normal boiling points of olefins are predicted by use of exclusively topological descriptors derived from molecular structure. Predictive equations having from one to eight independent variables were obtained by applying multiple linear regression analysis to a set of topological descriptors (independent variables) and the observed boiling points of 123 C2-C10 olefins (dependent variable). The best model found, which included eight descriptors, yielded a correlation coefficient of 0.999 and an estimated standard deviation of 1.78 °C.
Values for the fundamental properties of many chemicals are unavailable in the chemical literature, and their measurement can be a costly and time-consuming undertaking. For this reason a need exists for estimation methods which are both reliable and accessible. Books (1, 2) and numerous papers have been published devoted to estimation methods. The accessibility, convenience, and speed of microcomputers suggest the need for computer-based estimation methods whereby the user can simply enter a chemical structure into a computer, specify the property of interest, and within seconds obtain a reliable estimate of the property. Estimation methods which require no elaborate molecular modeling (often requiring at least some user intervention), no sophisticated electronic (molecular orbital) calculations, and no other physical constants have obvious advantages. For this reason the study reported here employed only topological descriptors which can be derived from chemical structures using rather simple algorithms.
This study focused on only one property, the normal boiling point. Reid and Prausnitz (1) have stated that “Methods for estimating boiling points are generally poor.” Lyman, Reehl, and Rosenblatt (2) have reviewed seven methods for boiling point estimation, dividing them into those with “general application” and those with “limited application”. The three general methods which they review are all group contribution methods and all suffer from an inability to distinguish between many isomers; for example, each would predict identical boiling points for 2,3- and 2,4-dimethylheptane, whose observed boiling points differ by almost eight degrees. Of the remaining four methods, one suffers from the same limitation as the general methods, and the others are applicable to only a very limited set of compounds (i.e., saturated aliphatic hydrocarbons, derivatives of normal hydrocarbons, or derivatives of small hydrocarbon radicals).
Aside from the simple correlations of boiling point with molecular weight or carbon number for homologous series of organic chemicals, Wiener was the first to correlate boiling point with structurally based topological descriptors (3). Of the aliphatic hydrocarbons, the alkanes have received considerable attention not only from Wiener but from others (4, 5). The olefins were chosen for this study since little if any work has been reported on the prediction of their properties using topological descriptors.
¹Present address: Department of Chemistry, Northwestern College, Orange City, IA 51041.
EXPERIMENTAL SECTION
This investigation consisted essentially of five stages: (1) assembly of the olefin data set, (2) entry of the molecular structures into the computer, (3) generation and selection of the molecular descriptors, (4) development of the models using multiple linear regression analysis, and (5) verification of the validity of the models. Except for the first stage, nearly all of the work was performed with the ADAPT software package (6, 7) implemented on a PRIME 750 computer.
Data Set. The quality of a model developed from experimental data is strongly dependent upon the quality of that data. To optimize the quality of the data used in this study, all of the boiling point data were obtained from a single source (8), and the training set consisted of only those 123 olefins whose boiling points were reported to the nearest 0.1 °C or better. These compounds and their boiling points are listed in Table I. An additional 69 olefins whose boiling points were reported to the nearest whole degree were included in the early stages of this study but were later dropped when it was feared that they may have been adversely affecting the quality of the models being produced. The reference cited stated that “a rough indication of the estimated uncertainty in the tabulated values is shown by the number of significant figures used to display them” (9).
Structure Entry. The olefin structures were entered into the ADAPT software system graphically with a Visual 500 graphics terminal. ADAPT stores structures as connection tables which include atom and bond types, in addition to atom adjacencies. As discussed above, three-dimensional molecular modeling was not required owing to the methods used in this study.
Descriptor Generation and Selection.
Molecular descriptors are numeric quantities related directly to molecular structure rather than to bulk properties of a compound. The descriptors employed in this study were all topological (more precisely, graph theoretical); that is, each molecule was treated as a set of points (atoms) and a set of lines of unspecified length (bonds) which joined pairs of points. Topological descriptors are defined in terms of simple concepts such as counts and atom valencies. Topological descriptors, in contrast to geometrical descriptors, are invariant to structural parameters such as bond lengths, bond angles, and torsional angles, and hence their calculation does not require accurate three-dimensional molecular models. In addition, no assumptions are made about the electronic nature of the atoms and bonds, and hence no information is required that relates to energy levels or charge distribution.
The descriptors fell into two broad categories: whole molecule descriptors and substructure descriptors. The former (with number generated) included fragment descriptors (5), path counts (19), molecular connectivity indexes (19), Balaban’s index (1), a topological symmetry descriptor (1), and the degree of olefinic substitution (1); the latter included substructure counts (6), substructure environment descriptors (6), and substructure path counts (12).
The fragment descriptors included the numbers of carbons, single bonds, rings, and ring atoms, as well as the nominal molecular weight. (Although arguable, molecular weight can be considered a “topological” descriptor if defined as a sum of atom labels, that is, atomic weights.)
A path count (more accurately a subgraph count) is simply the number of occurrences of a specified subgraph (path, cluster, or path-cluster; see Figure 1) within a molecule. Since subgraph size and connectivity can vary, many different path counts can be generated. Two final descriptors of this type consisted of the sum of all path counts and the quotient of this sum and the total number of carbons in the molecule (10).
The molecular connectivity index, first introduced by Randić (4), is a graph theory concept which quantifies the extent of branching within a molecule. The index depends on the number of occurrences of a specified subgraph within a molecule and the valencies of the atoms in the subgraph for each occurrence. The index was later modified to account for the presence of unsaturation and heteroatoms (5, 11). As with path counts, molecular connectivity indexes corresponding to many different subgraphs can be generated.

Figure 1. Paths of length one, two, three, and four; a three-bond cluster; and a four-bond path

ANALYTICAL CHEMISTRY, VOL. 59, NO. 19, OCTOBER 1, 1987. © 1987 American Chemical Society

Table I. Olefin Data Set: Observed Boiling Points (°C) and Predicted Boiling Points (°C) from the Best Eight-Variable Model
name                              bp(obsd)  bp(calcd)
ethene                            -103.71   -104.0
propene                           -47.70    -48.10
1-butene                          -6.26     -5.430
cis-2-butene                      3.720     2.113
trans-2-butene                    0.88      2.113
2-methylpropene                   -6.900    -7.639
1-pentene                         29.968    32.08
cis-2-pentene                     36.942    35.19
trans-2-pentene                   36.353    35.19
2-methyl-1-butene                 31.163    30.99
3-methyl-1-butene                 20.061    22.99
2-methyl-2-butene                 38.568    36.07
cyclopentene                      44.242    43.44
1-hexene                          63.485    65.48
cis-2-hexene                      68.891    68.19
trans-2-hexene                    67.884    68.19
cis-3-hexene                      66.450    65.76
trans-3-hexene                    67.088    65.76
2-methyl-1-pentene                62.113    62.85
3-methyl-1-pentene                54.178    56.76
4-methyl-1-pentene                53.865    55.58
2-methyl-2-pentene                67.308    65.61
3-methyl-cis-2-pentene            67.702    69.05
3-methyl-trans-2-pentene          70.438    69.05
4-methyl-cis-2-pentene            56.387    58.38
4-methyl-trans-2-pentene          58.612    58.38
2-ethyl-1-butene                  64.682    63.36
2,3-dimethyl-1-butene             55.616    56.34
3,3-dimethyl-1-butene             41.247    46.29
2,3-dimethyl-2-butene             73.205    68.71
cyclohexene                       82.979    81.79
1-methylcyclopentene              75.49     76.16
3-methylcyclopentene              64.91     67.06
4-methylcyclopentene              65.67     66.99
1-heptene                         93.643    95.21
cis-2-heptene                     98.41     98.08
trans-2-heptene                   97.95     98.08
cis-3-heptene                     95.75     95.26
trans-3-heptene                   95.67     95.26
2-methyl-1-hexene                 92.00     92.82
3-methyl-1-hexene                 83.90     85.32
4-methyl-1-hexene                 86.73     86.26
5-methyl-1-hexene                 85.31     85.79
2-methyl-2-hexene                 95.41     95.18
3-methyl-cis-2-hexene             97.26     97.50
3-methyl-trans-2-hexene           95.18     97.50
4-methyl-cis-2-hexene             86.31     88.74
4-methyl-trans-2-hexene           87.56     88.74
5-methyl-cis-2-hexene             89.5      88.24
5-methyl-trans-2-hexene           88.11     88.24
2-methyl-trans-3-hexene           85.90     85.54
3-methyl-cis-3-hexene             95.401    95.28
3-methyl-trans-3-hexene           93.542    95.28
2-ethyl-1-pentene                 94.0      91.95
3-ethyl-1-pentene                 84.11     85.15
2,3-dimethyl-1-pentene            84.28     85.77
2,4-dimethyl-1-pentene            81.610    82.32
3,3-dimethyl-1-pentene            77.48     78.73
3,4-dimethyl-1-pentene            80.80     78.57
4,4-dimethyl-1-pentene            72.518    75.20
3-ethyl-2-pentene                 96.01     96.89
2,3-dimethyl-2-pentene            97.40     97.48
2,4-dimethyl-2-pentene            83.300    84.89
3,4-dimethyl-cis-2-pentene        89.25     90.37
3,4-dimethyl-trans-2-pentene      91.50     90.37
4,4-dimethyl-cis-2-pentene        80.430    77.77
4,4-dimethyl-trans-2-pentene      76.740    77.77
2-ethyl-3-methyl-1-butene         86.365    84.33
2,3,3-trimethyl-1-butene          77.891    77.34
1-methylcyclohexene               110.296   110.9
3-methylcyclohexene               102.47    101.7
4-methylcyclohexene               102.74    102.0
1-ethylcyclopentene               106.33    104.9
3-ethylcyclopentene               97.77     96.28
4-ethylcyclopentene               98.2      96.64
1,2-dimethylcyclopentene          105.8     106.8
1,4-dimethylcyclopentene          93.2      95.57
1-octene                          121.280   122.2
cis-2-octene                      125.64    125.0
trans-2-octene                    125.0     125.0
cis-3-octene                      122.9     122.4
trans-3-octene                    123.3     122.4
cis-4-octene                      122.54    122.0
trans-4-octene                    122.25    122.0
2-methyl-1-heptene                119.3     119.8
4-methyl-1-heptene                112.8     111.9
5-methyl-1-heptene                113.3     113.6
6-methyl-1-heptene                113.2     112.7
2-methyl-2-heptene                122.6     122.3
3-ethyl-1-hexene                  110.3     111.2
2,3-dimethyl-1-hexene             110.5     111.6
2,4-dimethyl-1-hexene             111.2     110.3
2,5-dimethyl-1-hexene             111.6     110.4
4,4-dimethyl-1-hexene             107.2     105.2
5,5-dimethyl-1-hexene             102.5     102.9
2,3-dimethyl-2-hexene             121.77    123.2
2,4-dimethyl-2-hexene             110.6     112.5
2,5-dimethyl-2-hexene             112.2     112.5
5,5-dimethyl-cis-2-hexene         106.9     105.1
5,5-dimethyl-trans-2-hexene       104.1     105.1
2,2-dimethyl-cis-3-hexene         105.43    102.2
2,2-dimethyl-trans-3-hexene       100.85    102.2
2,4-dimethyl-cis-3-hexene         109.0     111.9
2,4-dimethyl-trans-3-hexene       107.6     111.9
2-n-propyl-1-pentene              117.7     117.8
3-methyl-2-ethyl-1-pentene        112.5     111.2
4-methyl-2-ethyl-1-pentene        110.3     108.8
4-methyl-3-ethyl-1-pentene        107.5     103.2
2,3,3-trimethyl-1-pentene         108.31    106.3
2,4,4-trimethyl-1-pentene         101.44    98.83
2-methyl-3-ethyl-2-pentene        117.0     121.8
4-methyl-3-ethyl-trans-2-pentene  114.3     114.6
2,3,4-trimethyl-2-pentene         116.2     115.7
2,4,4-trimethyl-2-pentene         104.91    101.2
1-ethylcyclohexene                136.992   136.7
3-ethylcyclohexene                131.6     128.1
1,2-dimethylcyclohexene           137.98    138.7
4,4-dimethylcyclohexene           117.24    120.6
1-n-propylcyclopentene            131.2     131.2
1-nonene                          146.868   146.9
2-methyl-1-octene                 144.65    144.5
1-decene                          170.570   169.7
2-methyl-1-nonene                 168.4     167.3
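The path counts and path-type molecular connectivity indexes described in the descriptor discussion can be illustrated with a minimal sketch. It is not the ADAPT implementation: the function names, the dictionary-based adjacency list, and the use of Kier-Hall style valence deltas (4 minus the number of attached hydrogens for carbon) are assumptions made for illustration only.

```python
from math import prod, sqrt


def paths_of_length(adj, m):
    """Return every simple path with m bonds, counted once regardless of direction."""
    found = set()

    def extend(path):
        if len(path) == m + 1:
            # A path and its reverse are the same subgraph: keep one canonical form.
            found.add(min(tuple(path), tuple(reversed(path))))
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return sorted(found)


def path_connectivity(adj, delta, m):
    """Path-m connectivity index: sum over paths of (product of atom deltas) ** -0.5."""
    return sum(1.0 / sqrt(prod(delta[a] for a in path))
               for path in paths_of_length(adj, m))


# 1-pentene as a hydrogen-suppressed graph: C1=C2-C3-C4-C5
pentene_adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
# Assumed valence deltas: 4 minus the number of hydrogens on each carbon
pentene_delta = {1: 2, 2: 3, 3: 2, 4: 2, 5: 1}

n_paths2 = len(paths_of_length(pentene_adj, 2))            # count of paths of length 2
chi3 = path_connectivity(pentene_adj, pentene_delta, 3)    # path-3 connectivity
```

Cluster and path-cluster subgraphs would need a different enumeration (branch points rather than linear walks) and are not covered by this sketch.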
Balaban’s index, also called the distance sum connectivity index, is similar to the path-one molecular connectivity index (12). It has been reported to be one of the most discriminating topological indexes (13, 14).
The topological symmetry descriptor is the quotient of the number of topologically nonequivalent atoms and the total number of atoms in a molecule (15, 16).
The degree of substitution is simply a count of the number of substituent alkyl groups attached to the two carbons joined by the double bond (17). Its value is hence limited to the integers zero through four for the monoolefins.
The substructure count is simply the number of occurrences of a substructure in a molecule. The following substructures were used in this study for the generation of all three types of substructure-based descriptors (7): the methyl, ethyl, propyl, and isopropyl groups and both the 2-alkyl substituted terminal double bond and a double bond without any valency restrictions on either carbon. The substructure environment descriptor is the path-one molecular connectivity of a pseudomolecule. This pseudomolecule consists of the specified substructure plus all atoms bonded directly to it. Two substructure path count descriptors were computed: (1) the total number of paths in a molecule originating from a pseudomolecule and (2) the quotient of this count and the number of occurrences of the substructure in the molecule; the latter is hence the average total path count per substructure occurrence.
Of the 70 descriptors generated, 26 were eliminated from the data set for the following reasons: one possessed a constant value for all but one member of the data set; three were identical with other descriptors; four correlated very highly with other descriptors (r greater than 0.98) but had low correlations with boiling point (r less than 0.40); and 18 (molecular connectivities and path counts) had fewer than 35% nonzero values.
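The four elimination rules listed above can be sketched as a simple screening pass. This is an illustrative reconstruction, not the ADAPT procedure: the function names, the order in which the rules are applied, and the toy descriptor values are all assumptions; only the thresholds (r > 0.98, r < 0.40, 35% nonzero) come from the text.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sqrt(sxx * syy)


def screen_descriptors(desc, bp, min_nonzero=0.35, r_pair=0.98, r_bp=0.40):
    """Apply the four elimination rules in turn; return (kept, reasons for dropping)."""
    kept, dropped = {}, {}
    for name, values in desc.items():
        n = len(values)
        if Counter(values).most_common(1)[0][1] >= n - 1:
            dropped[name] = "constant for all but at most one compound"
        elif sum(v != 0 for v in values) / n < min_nonzero:
            dropped[name] = "too few nonzero values"
        elif any(values == other for other in kept.values()):
            dropped[name] = "identical to another descriptor"
        elif abs(pearson(values, bp)) < r_bp and any(
                abs(pearson(values, other)) > r_pair for other in kept.values()):
            dropped[name] = "redundant and weakly related to boiling point"
        else:
            kept[name] = values
    return kept, dropped


# Toy data (hypothetical values, six compounds)
toy_bp = [1, 2, 3, 4, 5, 6]
toy_desc = {
    "n_carbon": [2, 3, 4, 5, 6, 7],
    "copy": [2, 3, 4, 5, 6, 7],
    "flat": [0, 0, 0, 0, 0, 1],
    "rare": [0, 0, 0, 0, 1, 2],
    "noise_a": [3, 1, 4, 1, 5, 2],
    "noise_b": [6, 2, 8, 2, 10, 5],
}
kept, dropped = screen_descriptors(toy_desc, toy_bp)
```

In this toy run only `n_carbon` and `noise_a` survive; each of the other four columns trips exactly one of the rules.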
The correlation coefficient of boiling point with each of four descriptors exceeded 0.95; these were number of carbons, number of single bonds, molecular weight, and path-one molecular connectivity. Scatter plots of each of these four vs. boiling point displayed a decided curvature. It was empirically shown that use of the square root of these descriptors removed the curvature. Fearing that anomalous behavior of the low molecular weight olefins and/or an unrepresentative sample of high molecular weight olefins may have been responsible for the curvature, the ten C2, C3, C4, C9, and C10 compounds were temporarily excluded from the training set. The correlations of the original descriptors with boiling point for this subset were then computed and compared to the corresponding values obtained by using their square roots. As evidenced by modest increases in the correlation coefficients, the descriptor square roots clearly yielded the better fit. The square roots of these four descriptors were hence accepted as additional descriptors.
It was also observed that when added to certain models, the square of the degree of olefinic substitution improved the goodness of fit more than the original descriptor. Since the degree of olefinic substitution is limited to the discrete values 0, 1, 2, 3, and 4, it was hypothesized that this variable was behaving as a qualitative variable rather than a quantitative variable and that these values simply corresponded to “labels” for five different classes of olefins (in which case the observation reported above was merely an accidental artifact). To test this hypothesis, five indicator (binary) variables, v0, v1, ..., v4, were generated (18). For any given structure, one of the indicator variables would equal one, and the others would equal zero. Hence, for cis-2-butene, v2 = 1, while v0 = v1 = v3 = v4 = 0.
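Both transformations discussed above can be tried in a few lines. The sketch below is illustrative only: it uses the nine straight-chain 1-alkenes (C2 through C10) with their observed boiling points from Table I rather than the subset the authors examined, and the helper names `pearson` and `indicator_variables` are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Observed boiling points (deg C) of ethene, propene, 1-butene, ..., 1-decene (Table I)
carbons = list(range(2, 11))
bp_obsd = [-103.71, -47.70, -6.26, 29.968, 63.485, 93.643, 121.280, 146.868, 170.570]

r_raw = pearson(carbons, bp_obsd)                      # carbon number as is
r_sqrt = pearson([sqrt(c) for c in carbons], bp_obsd)  # square-root transform


def indicator_variables(degree, n_classes=5):
    """Code a degree of olefinic substitution (0-4) as binary variables v0..v4."""
    return [1 if degree == k else 0 for k in range(n_classes)]


cis_2_butene = indicator_variables(2)  # two alkyl substituents on the double bond
```

For this homologous series the square-root transform gives the higher correlation, in line with the curvature described in the text. Regressing on `v0`..`v4` instead of the raw degree lets each class of olefin receive its own coefficient.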
Although the information encoded in the five indicator variables is identical with that encoded in the degree of olefinic substitution, regression analysis yields five coefficients rather than one. The relative magnitudes of these five coefficients provide one with an indication as to whether the original variable is metric or not. When added to several of the earlier models considered in this study, the coefficients of the five indicator variables more closely approximated the ratio of the squares of the original descriptor values. Hence the square of the degree of olefinic substitution was also accepted as an additional descriptor, resulting in a total of 49 descriptors.
Regression Analysis. Highly correlated independent variables are generally considered to be a serious problem in the development of multiple linear regression models. For this reason, investigators frequently prescreen their pool of descriptors to discard those which correlate highly with other members of the pool. In this study, however, the elimination of highly correlated variables was very minimal (as stated above, only four highly correlated descriptors were discarded), and furthermore no descriptors were discarded for reasons of high multicollinearity (i.e., high correlation between a descriptor and a linear combination of two or more other descriptors). The justification for this relates to the object of this study, which was the development of empirical models which accurately predict olefin boiling points. The use of highly correlated independent variables results in models with estimated regression coefficients having very high uncertainties, but it does not adversely affect the reliability of the predicted values for the dependent variable, provided the independent variables to which the model is applied possess the same patterns of multicollinearity as did the variables used to develop the models (19). (The latter could be expected to be true for a class of chemicals such as the monoolefins.)
Selection of the best regression equation from a set of 49 descriptors can be problematic. Clearly an “all possible regressions” approach involving the computation and comparison of 2^49 different models would be impractical. Three different ADAPT multiple linear regression modules were employed in selecting models in this study. These methods entailed (1) a standard stepwise regression procedure (20), (2) a “leaps and bounds” best subset regression procedure (21), and (3) an interactive regression analysis procedure in which the addition and elimination of descriptors to and from the model are totally under user control.
Model Validation. The initial selection of models resulted from the application of the three methods mentioned above to the entire data set of 123 olefins.
Although the F values obtained for these models are appropriate for the evaluation of the fit of the model to the data set, it was arguable as to whether they accurately reflected the validity of the models as predictive equations. (The authors’ suspicions were raised after the “leaps and bounds” procedure produced a model with 16 independent variables for which the smallest partial F value for any of the 16 variables was 4.88 (greater than the critical F value, F(1,106,0.95) = 3.9) and the overall F value was 5080 (greater than the critical F value, F(16,106,0.95) = 1.8), suggesting at least some measure of statistical significance (20)!) If the descriptor values for the data set possess a skewed distribution with respect to the corresponding values for the population (in this case the chemical class of monoolefins), the curve-fitting process can be expected to be biased by this skewed distribution. The data set of 123 olefins used in this study was in at least one sense unrepresentative of the population. Examination of the olefin structures (sorted in order of increasing boiling point) revealed that there were very few structures at both extremes of the boiling point range and that these structures contained little if any branching. At the low end of this range the data set included only six C2, C3, and C4 structures (for the simple reason that there exist only very few small monoolefins) and five of these were unbranched. At the high end of the range, the literature yielded accurate boiling points for only two nonenes and two decenes, in each case one straight-chained compound and one 2-methyl isomer, clearly not a random sample of nonenes and decenes. Given the highly leveraged nature of data points located at the extrema of a data set, one can reasonably expect that the models obtained would be biased and that F values (such as those given above) might provide overly optimistic characterizations of the resultant models.
Table II. Best Models with from One to Eight Descriptors
model no.  descriptor no.               multiple R  overall F(a)  std dev,(b) °C
1          1                            0.990       5860          5.49
2          1, 4                         0.996       6690          3.65
3          2, 6, 5                      0.997       6230          3.09
4          3, 6, 5, 7                   0.997       5640          2.82
5          2, 8, 9, 5, 6                0.998       5490          2.56
6          2, 8, 9, 5, 6, 7             0.998       6080          2.22
7          2, 8, 10, 5, 9, 6, 11        0.999       6830          1.94
8          2, 8, 13, 5, 9, 6, 12, 14    0.999       7120          1.78

descriptor no.  descriptor
1               square root of the path 1 molecular connectivity, ¹χ
2               square root of the molecular weight
3               square root of the carbon number
4               degree of alkene substitution squared
5               degree of alkene substitution
6               count of methyl groups
7               count of paths of length 4
8               count of ring atoms
9               count of paths of length 2
10              cluster 3 molecular connectivity, ³χc
11              path-cluster 4 molecular connectivity, ⁴χpc
12              path 3 molecular connectivity, ³χp
13              count of clusters of size 3
14              count of ethyl groups
(a) Overall F is F(p - 1, n - p), where n represents the number of observations and p the number of adjustable parameters in the model. (b) Standard deviation adjusted for number of degrees of freedom, n - p.

Figure 2. Observed boiling points vs. boiling points predicted by the best eight-variable model for the 123-member training set.

In summary, since the principal focus of this study related to the development of useful models for the prediction of boiling points, and not merely the fitting of a curve to a set of experimental data, the need for an alternative validation methodology was apparent. The cross validation method employed consisted of the following: (1) the 123 structures were divided randomly into a training set of 92 members and a prediction set of 31 members, and this process was repeated until 30 training set/prediction set pairs were obtained; (2) each “best” model (containing from one to nine variables) obtained from regression analyses using the entire data set was fit in turn to the data belonging to each of the 30 training sets; (3) each of the 30 equations obtained in this manner was then used to predict the boiling points of the compounds belonging to the corresponding prediction set; (4) an estimated standard deviation was computed for each of the 30 prediction sets; (5) after having completed steps 1-4 for each model, a one-sided, paired t test (22) was applied to the differences between the estimated standard deviations obtained from two successive models (i.e., models for which the number of descriptors differed by one) in order to deduce whether the added descriptor resulted in a statistically significant improvement. This procedure for model testing was considered a more valid measure of the quality of a predictive model.
By use of this method, pairs of successive models up to and including the seven- and eight-variable models yielded t values in the range of 3.11-11.24 for the mean value of the differences between the estimated standard deviations of paired prediction sets. All of these values exceeded the critical t value of 2.462 at the 1% significance level for 29 degrees of freedom and, hence, justified rejection of the null hypothesis, that is, that no improvement in fit occurred. The best nine-variable model when compared with the best eight-variable model yielded a t value which was in fact negative (-2.19), and the second-best nine-variable model yielded a t value of only 0.37; hence in neither case could the null hypothesis be rejected. (Both of these models did provide a better fit of the 123-member data set than did the best eight-variable model as measured by either the overall F value or the estimated standard deviation.)
RESULTS AND DISCUSSION
Listed in Table II are the best models developed in this study containing from one to eight descriptors. Each of these models includes one rather dominant descriptor (descriptor number 1, 2, or 3). In each of the eight models, this dominant descriptor was selected first in the stepwise regression procedure and also possessed a significantly larger partial F value than any other descriptor in the model. Furthermore, the application of the regression procedure to autoscaled data resulted in this dominant descriptor having, in each case, the largest regression coefficient in the model. These three descriptors are highly correlated (all three correlation coefficients exceeded 0.975) and hence it is perhaps not surprising that three different descriptors were observed as dominant rather than simply one.

Figure 3. Residuals vs. predicted boiling points for the best eight-variable model for the 123-member training set.

The best overall model developed in this study was the eight-variable model listed in Table II. This model was best in the sense that it yielded the smallest estimated standard deviation and the highest overall F value; in addition, it performed best as a predictive model in the cross validation procedure discussed earlier. The equation for model 8 is (with the Xn defined in Table II)

$$
\begin{aligned}
\mathrm{bp} ={}& (54.62 \pm 2.19)X_2 + (4.53 \pm 0.70)X_8 + (7.10 \pm 1.43)X_{13} + (8.98 \pm 0.82)X_5 \\
& - (12.33 \pm 1.77)X_9 - (5.71 \pm 1.09)X_6 + (7.84 \pm 1.69)X_{12} - (2.02 \pm 0.67)X_{14} - (393.04 \pm 13.35)
\end{aligned}
\tag{1}
$$

n = 123, s = 1.78, r = 0.999, F(8, 114) = 7120
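Equation 1 can be evaluated directly once the eight descriptor values are known. The sketch below is an illustration under stated assumptions: the descriptor values for ethene, propene, and 1-pentene were worked out by hand for this example (nominal atomic weights C = 12 and H = 1, and a valence-corrected path-3 connectivity), and the signs of the X8 (+) and X14 (-) terms are taken as the ones that reproduce the calculated values in Table I for these compounds.

```python
from math import sqrt

# Coefficients of eq 1 (central values only); signs of X8 and X14 are assumed as noted
COEF = {"X2": 54.62, "X8": 4.53, "X13": 7.10, "X5": 8.98,
        "X9": -12.33, "X6": -5.71, "X12": 7.84, "X14": -2.02}
INTERCEPT = -393.04


def predict_bp(descriptors):
    """Predicted normal boiling point (deg C); descriptors not given default to zero."""
    return INTERCEPT + sum(c * descriptors.get(name, 0.0) for name, c in COEF.items())


# Hand-derived descriptor values (assumptions for illustration)
ethene = {"X2": sqrt(28)}
propene = {"X2": sqrt(42), "X5": 1, "X9": 1, "X6": 1}
pentene_1 = {"X2": sqrt(70), "X5": 1, "X9": 3, "X6": 1,
             "X12": 1 / sqrt(24) + 1 / sqrt(12), "X14": 1}

bp_ethene = predict_bp(ethene)
bp_propene = predict_bp(propene)
bp_pentene = predict_bp(pentene_1)
```

With these inputs the three predictions agree with the bp(calcd) entries of Table I (-104.0, -48.10, and 32.08 °C) to within rounding of the published coefficients.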
The boiling points for the 123-member training set predicted by this model are shown in Table I and are plotted against the observed boiling points in Figure 2; the linearity is good and there are no obvious outliers. A residual plot is shown in Figure 3 which shows no apparent abnormalities. The estimated adjusted standard deviation was 1.78 °C, the multiple correlation coefficient was 0.999, and the overall F value was 7120. Examination of the predicted values revealed that the largest residual was -5.05 °C; this corresponds to a standardized residual slightly less than 3.
Each regression coefficient is expressed as a 95% confidence interval. The uncertainties in the regression coefficients are high (six exceed 10%, and the highest equals 33%). As discussed earlier the use of highly correlated independent variables will adversely affect the uncertainties in the estimated regression coefficients but need not jeopardize the predictive ability of the resulting regression equation (18).
The eight-variable model was examined further by dividing the 123-member training set into a subset of cyclic molecules (n = 18) and a subset of acyclic molecules (n = 105). The estimated standard deviations of the predicted boiling points for these two subsets were 1.68 and 1.73 °C, respectively. The slight difference is perhaps not significant, but the success of the model for both subsets is encouraging. (That both values are lower than the 1.78 °C reported above arises from the correction for degrees of freedom (n - p) for the latter, while for the subsets the correction was for (n - 1).) The training set was also divided into subsets consisting of cis isomers (n = 20), trans isomers (n = 22), and the remaining structures (n = 81). The corresponding estimated standard deviations were 1.79, 1.38, and 1.80 °C, respectively. Given the sample size for the two isomer subsets, care must be exercised in drawing any inferences from these results.
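The statements about coefficient uncertainty and the largest residual can be checked with simple arithmetic on the numbers in eq 1. The sketch below assumes that "uncertainty" means the 95% half-width divided by the coefficient magnitude, and that the standardized residual is the raw residual divided by the model standard deviation (no leverage correction); both readings are assumptions.

```python
# (magnitude, 95% half-width) pairs as printed in eq 1
eq1 = {
    "X2": (54.62, 2.19), "X8": (4.53, 0.70), "X13": (7.10, 1.43),
    "X5": (8.98, 0.82), "X9": (12.33, 1.77), "X6": (5.71, 1.09),
    "X12": (7.84, 1.69), "X14": (2.02, 0.67), "intercept": (393.04, 13.35),
}

# Relative uncertainty of each coefficient, in percent
relative = {name: 100.0 * half / value for name, (value, half) in eq1.items()}
over_ten = sorted(name for name, pct in relative.items() if pct > 10.0)
largest = max(relative, key=relative.get)

# Largest residual (5.05 deg C in magnitude) scaled by the model std dev (1.78 deg C)
standardized = 5.05 / 1.78
```

Six coefficients come out above 10%, the largest (for X14) is about 33%, and the scaled residual is about 2.8, consistent with the figures quoted in the text.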
No suitable explanation for the lower standard deviation of the subset of trans isomers was found, although it is reasonable to assume that (on average) trans isomers possess smaller permanent dipole moments than do cis isomers and the topological (graph theoretical) descriptors employed in this study would not encode information related to dipole moment. Future work devoted to the development of descriptors to satisfactorily encode information related to cis/trans isomerism is clearly justified.
Predicted boiling points for the 69 olefins which were excluded from the training set yielded an estimated standard deviation of 2.62 °C. It must be remembered, however, that the reported values for the boiling points of these compounds are themselves in question.
The information content of the eight descriptors in this model can be qualitatively characterized as follows: X2 relates to the molecular weight of the molecule; X8 relates to the presence or absence of a ring and, if present, the size of the ring; and X13, X5, X9, X6, X12, and X14 relate to the extent to which the molecule is branched, as well as, to a certain extent, the size of the molecule. Clearly both a compound’s molecular weight and the branching of its structure will affect its boiling point; it is also reasonable to assume that cyclic and acyclic compounds will have boiling points which differ systematically. To the extent that these qualitative generalities are true, the inclusion of these descriptors in this model is reasonable.
No outliers were excluded from the regression analysis that produced this model or any of the models listed in Table II. Some might argue that ethene (and perhaps even propene) represents a rather unique structure and should have been excluded from the data set. Undoubtedly the data set would have been easier to fit if this (these) compound(s) had been excluded, but from first principles, there appeared to be no a priori reason for their exclusion.
Examination of the residuals resulting from the application of the eight models listed in Table II to the training set of 123 olefins revealed that four compounds were consistently poorly fit. These compounds were cis-2-butene, 2,4-dimethyl-trans-3-hexene, 2,3-dimethyl-2-butene, and 2-methyl-3-ethyl-2-pentene. The last two of these compounds are among the most highly symmetrical in the 123-member data set; an attempt to identify a cause for their poor fit was complicated by the fact that the signs of their residuals were invariably opposite. Clearly the significant permanent dipole moment of cis-2-butene was not accounted for by the topological descriptors employed in this study. As perhaps expected, the models listed in Table II invariably predicted a boiling point for cis-2-butene which was too low. No unique feature of the hexene listed above was ever identified.
Those models in Table II having from three to six variables perhaps deserve special mention. For each of these models the dominant descriptor is the square root of either the molecular weight or the carbon number and all of the remaining descriptors are merely counts. These models are hence very simple.
Topliss and Edwards (23) demonstrated that there is “a risk of arriving at fortuitous correlations when too many variables are screened relative to the number of available observations.” In light of their work, one might conclude that the number of observations included in this study (123) does not justify the number of variables screened (49). They pointed out, however, that the collinearity of the screened variables used in an actual study will very likely exceed that possessed by the random-number simulated variables employed in their study, and therefore their work tends to overestimate the likelihood of chance correlation effects. Topliss and Edwards recommended that pairs of screened variables possessing a correlation coefficient of 0.8 or higher be counted as only one variable in order to arrive at an “effective number of independent variables screened.” Application of this rule-of-thumb to the pool of 49 screened variables resulted in a count of only 19 “independent” variables.
(Indeed, eigenanalysis showed that 95.13% of the variance was accounted for by the first six principal components of the 49 descriptor data set.) By extrapolating their results to this ratio (Le., 19:123), one can readily conclude that the likelihood of chance correlation in this study was minimal.
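The counting rule and the eigenanalysis described above can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than details taken from the paper: the greedy grouping order, the use of the absolute value of the Pearson correlation coefficient, and the use of the correlation matrix (autoscaled descriptors) for the eigenanalysis. The function names and the synthetic descriptor pool are hypothetical and stand in for the 49 descriptors actually screened.

```python
import numpy as np

def effective_variable_count(X, threshold=0.8):
    """Count descriptors, treating any descriptor whose |r| with an
    already-kept descriptor is >= threshold as a duplicate (greedy pass)."""
    r = np.abs(np.corrcoef(X, rowvar=False))  # pairwise |Pearson r| of columns
    kept = []
    for j in range(r.shape[0]):
        if all(r[j, k] < threshold for k in kept):
            kept.append(j)
    return len(kept), kept

def cumulative_variance(X):
    """Cumulative fraction of variance carried by successive principal
    components, from the eigenvalues of the correlation matrix."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic stand-in for a descriptor pool (NOT the paper's data):
# three independent columns, each paired with a noisy near-duplicate.
rng = np.random.default_rng(0)
base = rng.normal(size=(123, 3))
noisy = base + 0.1 * rng.normal(size=(123, 3))
X = np.column_stack([base[:, 0], noisy[:, 0],
                     base[:, 1], noisy[:, 1],
                     base[:, 2], noisy[:, 2]])

n_eff, kept = effective_variable_count(X)
print("effective number of variables:", n_eff, "kept columns:", kept)
print("cumulative variance by component:", cumulative_variance(X))
```

In this toy pool the six columns collapse to three effective variables, which mirrors the way the 49 screened descriptors reduce to a smaller effective count. A connected-components grouping would be an equally defensible reading of the rule and may give a different count on real data.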
LITERATURE CITED
(1) Reid, R. C.; Prausnitz, J. M.; Sherwood, T. K. Properties of Gases and Liquids, 3rd ed.; McGraw-Hill: New York, 1977.
(2) Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of Chemical Property Estimation Methods; McGraw-Hill: New York, 1982.
(3) Wiener, H. J. Am. Chem. Soc. 1947, 69, 17-20.
(4) Randić, M. J. Am. Chem. Soc. 1975, 97, 6609-6615.
(5) Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research; Academic: New York, 1976.
(6) Jurs, P. C.; Chou, J. T.; Yuan, M. In Computer-Assisted Drug Design; Olson, E. C., Christoffersen, R. E., Eds.; American Chemical Society: Washington, DC, 1979; pp 103-129.
(7) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. Computer Assisted Studies of Chemical Structure and Biological Function; Wiley-Interscience: New York, 1979.
(8) TRC Thermodynamic Tables - Hydrocarbons; Thermodynamics Research Center, Texas A&M University: College Station, TX, 1986; Vol. I, Part a.
(9) Reference 8, Introduction, p 4.
(10) Randić, M. Comput. Chem. 1979, 3, 5-13.
(11) Kier, L. B.; Hall, L. H. J. Pharm. Sci. 1981, 70, 583-589.
(12) Balaban, A. T. Chem. Phys. Lett. 1982, 89, 399-404.
(13) Razinger, M.; Chretien, J. R.; Dubois, J. E. J. Chem. Inf. Comput. Sci. 1985, 25, 23-27.
(14) Trinajstic, N. Chemical Graph Theory; CRC Press: Boca Raton, FL, 1983; Vol. 2.
(15) Randić, M.; Brissey, G. M.; Wilkins, C. L. J. Chem. Inf. Comput. Sci. 1981, 21, 52-59.
(16) Fujiwara, I.; Okuyama, T.; Yamasaki, T.; Abe, H.; Sasaki, S. Anal. Chim. Acta 1981, 133, 527-533.
(17) Pine, S. H.; Hendrickson, J. B.; Cram, D. J.; Hammond, G. S. Organic Chemistry, 4th ed.; McGraw-Hill: New York, 1980; p 460.
(18) Neter, J.; Wasserman, W.; Kutner, M. H. Applied Linear Statistical Methods, 2nd ed.; Richard D. Irwin: Homewood, IL, 1985; Chapter 11.
(19) Reference 18, Chapter 10.
(20) Draper, N. R.; Smith, H. Applied Regression Analysis, 2nd ed.; Wiley: New York, 1981.
(21) Furnival, G. M.; Wilson, R. W. Technometrics 1974, 16, 499-511.
(22) Ryan, B. F.; Joiner, B. L.; Ryan, T. A. MINITAB Handbook, 2nd ed.; Duxbury: Boston, MA, 1985; Chapter 8.
(23) Topliss, J. G.; Edwards, R. P. J. Med. Chem. 1979, 22, 1238-1244.

RECEIVED for review April 16, 1987. Accepted June 12, 1987. P.J.H. is indebted to the administration and Board of Trustees of Northwestern College, Orange City, IA, whose award of sabbatical leave made this work possible. This work was supported by the National Science Foundation under Grant CHE-8202620. The Prime 750 computer was purchased with partial financial support of the National Science Foundation.

Anal. Chem. 1987, 59, 2327-2330
Conductivity and Resistivity of Water from the Melting to Critical Points
Truman S. Light*
Corporate Research Center, The Foxboro Company, Foxboro, Massachusetts 02035
Stuart L. Licht
Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
An accurate knowledge of the theoretical conductivity and resistivity of water over a wide range of temperatures is necessary to facilitate the analysis for trace ionic impurities in water. In this paper, values for pH, conductivity, resistivity, and temperature coefficient of pure water over the range of 0-374 °C are determined by utilizing sources of fundamental data of better accuracy than those previously available. These calculations are based on improved values for the equivalent ionic conductances of hydrogen and hydroxyl ions, the ionization constant of water, and the density of water. The extent of improvement over earlier values is noted and comparison is made with experimental measurements. New values of the ionization constant, conductivity, and resistivity of water at the critical temperature, 374 °C, are given. The use of low-temperature resistivity measurements to increase the sensitivity for detection of ionic impurities to the fractional-parts-per-billion level is discussed.
The measurement of the resistivity of water has proved to be a sensitive, reliable, and low-maintenance method for monitoring water purity. The comparison of measured resistivity with theoretical resistivity of pure water permits evaluation of the ionic impurity level. The temperature at which the water may be used and measured varies considerably, but is not usually 25 °C. Therefore the resistivity, which theoretically is 18.2 MΩ cm only at 25 °C, may be automatically compensated to the standard temperature of 25 °C to permit ready judgment of the ionic impurity level. Previous studies have improved the knowledge of theoretical resistivity values for water (1, 2). In recent years, several papers have been published commenting on the theoretical values (3-8) and offering alternate analytical methods for measuring impurities (9, 10). Increased accuracy has been demanded of these theoretical values in order to compare them with measured values and thus monitor the extent of ionic impurities in the fractional-parts-per-billion range. By use of sources of fundamental data of better accuracy than those previously available, values for pH, conductivity, resistivity, and temperature coefficient of pure water over the range of 0-374 °C are determined.
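As a rough numerical check on the 18.2 MΩ cm figure, the Python sketch below evaluates the conductivity and resistivity relations given as eq 1 and 2 in the Theory section at 25 °C. The input values (density, limiting ionic conductances, and ion product) are illustrative handbook-style numbers assumed for this example; they are not the improved data developed in this paper. The function names are hypothetical, and the factor of 10⁻⁶ that converts Ω cm to MΩ cm is written out explicitly as an assumption about units.

```python
import math

def conductivity_S_per_cm(density_g_cm3, lambda_h, lambda_oh, kw_molal):
    """Eq 1: kappa = 1e-3 * d * (lambda_H + lambda_OH) * sqrt(Kw), in S/cm."""
    return 1e-3 * density_g_cm3 * (lambda_h + lambda_oh) * math.sqrt(kw_molal)

def resistivity_Mohm_cm(kappa_S_per_cm):
    """Eq 2 (rho = 1/kappa), scaled by 1e-6 to express ohm cm as megohm cm."""
    return 1e-6 / kappa_S_per_cm

# Assumed, illustrative inputs at 25 degC (not the paper's tabulated data)
D_25 = 0.99705         # density, g cm^-3
LAMBDA_H_25 = 349.8    # limiting conductance of H+, cm^2 ohm^-1 equiv^-1
LAMBDA_OH_25 = 198.3   # limiting conductance of OH-, cm^2 ohm^-1 equiv^-1
KW_25 = 1.008e-14      # ionization constant of water, molal units

kappa = conductivity_S_per_cm(D_25, LAMBDA_H_25, LAMBDA_OH_25, KW_25)
rho = resistivity_Mohm_cm(kappa)
ph = -0.5 * math.log10(KW_25)  # eq 3: pH = 0.5 * pKw

print(f"kappa = {kappa:.3e} S/cm, rho = {rho:.2f} Mohm cm, pH = {ph:.2f}")
```

With these assumed inputs the computed resistivity falls close to 18.2 MΩ cm. Because the result scales with the sum of the two ionic conductances and with the square root of the ionization constant, small errors in either propagate directly into the theoretical value.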
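THEORY
From the self-ionization of water, expressed by the general equilibrium

$$\mathrm{H_2O} \rightleftharpoons \mathrm{H^+} + \mathrm{OH^-}$$

the conductivity of water due to mobility of hydrogen and hydroxyl ions may be calculated. The pertinent equations have been discussed in an earlier paper (1) and are

$$\kappa(T) = 10^{-3}\, d\, \left(\lambda^0_{\mathrm{H^+}} + \lambda^0_{\mathrm{OH^-}}\right) (K_w)^{1/2} \tag{1}$$

$$\rho(T) = 1/\kappa(T) \tag{2}$$

$$\mathrm{pH} = -\log K_w^{1/2} = 0.5\,(\mathrm{p}K_w) \tag{3}$$

$$\rho(T) = e^{A + BT^{-1} + CT^{-2} + DT^{-3}} \tag{4}$$

$$\alpha(T) = 100\%\,(\mathrm{d}\rho/\mathrm{d}T)\,\rho^{-1} = -100\left(BT^{-2} + 2CT^{-3} + 3DT^{-4}\right) \tag{5}$$
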
where $\kappa(T)$ is the conductivity in siemens per centimeter, S cm$^{-1}$, at the temperature $T$ in kelvin, $d$ is the density in g cm$^{-3}$, $\lambda^0_{\mathrm{H^+}}$ and $\lambda^0_{\mathrm{OH^-}}$ are the limiting equivalent conductances of the hydrogen and hydroxyl ions in cm$^2$ Ω$^{-1}$ equiv$^{-1}$, $K_w$ is the ionization constant of water at $T$ in molal units, $\rho(T)$ is the resistivity of water in MΩ cm, $\alpha(T)$ is the temperature coefficient of the resistivity in percent per kelvin, and $A$, $B$, $C$, and $D$ are constants with units to give the resistivity in MΩ cm. As indicated in eq 1, $\kappa$ is strongly temperature dependent, reflecting the fact that density, equivalent conductance, and the ionization constant of water are each functions of temperature.

In the earlier paper (1), values were calculated for the pH, conductivity, resistivity, and its temperature coefficient over the range of 0-300 °C. These calculations were based on the best literature data available for the four variable functions of eq 1: p$K_w$, $\lambda^0_{\mathrm{H^+}}$, $\lambda^0_{\mathrm{OH^-}}$, and density. In that paper, the values for the ionization constant of water, p$K_w$, and the limiting ionic conductance for hydrogen ion, $\lambda^0_{\mathrm{H^+}}$, from 5 to 55 °C were taken from Harned and Owen (11) and had confidence values of better than ±1%. Similarly, the density of water over the range 0-300 °C, taken from handbook values (12), had confidence values better than ±0.1%. The values for the limiting ionic conductance of hydroxyl ion, $\lambda^0_{\mathrm{OH^-}}$, were the weakest link in the calculation chain. Harned and Owen reported the value only at 25 °C, and Iverson (13) discussed in some detail the best averages over the range of 5-55 °C. These approximations were used, resulting in resistivity uncertainties in the 1-5% range. From 70 to 90 °C, the Walden approximation, which states that the product of equivalent conductance and viscosity is a constant, was used for both
0003-2700/87/0359-2327$01.50/0 © 1987 American Chemical Society