Anal. Chem. 1992, 84, 3059-3063
a059
Quantitative Structure-Retention Relationship Studies of Sulfur Vesicants Thomas F. Woloszyn and Peter C. Jurs’ Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, University Park, Pennsylvania 16802
Ouantttatlvestructure-retentlonrelatknshlp (QSRR) methods were followed to relate the observed Kovbts retentlonlndlces of sulfur vedcants wlth theh molecularstructure. Topobglcal, electronlc, and geometrical descriptors were calculated to represent the molecular structures. Then multlpk llnear regresdon technlques were used to generate regrer#lon equationr to model the retentlonbehavlor on three statlonary phases. Modelswere generated( R = 0.998, relatlvestandard error 0.90 were discarded due to suspected multicollinearityproblems. The final pool of nine descriptors is listed in Table 11. These descriptors were submitted to multiple linear regression techniques. These methods include leap and bounds regression1*and forward stepwise regression analysis.14 Regression equations were developed in the form of eq 3. This
Zi= bo + b,D,
+ bzDz + ... + B,D, + ei
(3)
model relates the retention index of a particular compound on a given stationary phase to a linear combination of calculated structuraldescriptors (&I. The intercept (bo)and the regression coefficients (b,) are determined using multiple linear regression analysis. The residual (ei) is the difference between the experimentally determined and predicted values. While it is important that a particular model fit the data set, this alone does not ensure the predictive capabilities of the equation. Thus validation is a means to determine the stability and robustness of a model, key attributes which are necessary if the equation is to be applied to observations outside the data set used for the study. Two methods for validation are possible, internal and external validation. Internal validation uses compounds from the data set to determine the robustnessof a model. External validation requires that a portion of the data set be withheld from regression analysisand then uses these compounds for model validation. For data sets which are initially small, such as the one in this study, the option of a prediction set for external validation is not possible. Therefore internal validation is the method of choice to test the quality of a model. One such method is internal jackknifing. Here an observation is held out of the data set, and the model is recalculated. The resulting model is then used to predict the observation withheld. This process continues for each observation in the entire data set. The jackknifed predictions are then compared with the actual observations; large residuals indicate possible irregularities in the model. This method is also useful for showing how influential a particular observation is on the regression model. Furthermore, the coefficients of the recalculated regression equation are checked for large deviationsfrom the originalmodel when one observation is removed. This, too,is an indicator that one compound is having a strong effect on the model. Anomalies can also be noted by comparing the sum of the squared residuals of the original equation with the value obtained with jackknifing.
RESULTS AND DISCUSSION Numerous models were developed using automated leaps and bounds regression. Four-descriptor models were found to be optimum given the fact that additional descriptors, when added to a model, resulted in only small increases in the R coupled with little decrease in the standard deviation. This also was within reasonable limits to prevent overfitting of the data set, since a greater than 5-to-1 ratio of observations to parameters is considered to be statistically sound. Each of the stationary phases, or columns, was modeled separately. Four of the best models were selected from each column. Criteria by which this determinationwas made were (18) Furnival, G.
M.;Wilson, R.W. Technometrics
1974, 16, 499.
the multiple correlation coefficient(R),the standard deviation of the regression (SI, and the number of descriptors in the model. These models were then again tested for any possible multicollinearity problems which would reduce their predictive capability and render them unstable. This was done using three distinct diagnostic tests, namely the variance inflation factor, the condition index, and the variance decomposition proportions. These tests are fully described by Neter, Wasserman, and Kutner.‘* No systemic multicollinearity problems were identified at this stage. The next step involved testing these candidate models for outliers. Outliers are poorly fitted data points which cause models to deviate from the optimum in an attempt to accommodatethe outliers. Outliers can exert undue influence on the regression function, thus skewing the actual equation as it attempts to model the data. Outlier detection identifies data points which are poorly fitted by the model. T w o different methods for identifying outliers were used in this study, a data diagnostics generation routine and robust regression analysis. The criterion for flagging a compound as an outlier involved calculating several statistical indices including the residual, the standardized residual, the studentized residual, DFFITS, and Cook’s distance. An acceptable range of values was associated with each test. Any compound with an associated test value falling outside this range on three or more tests was flagged as an outlier. Robust regression analysis uses a least median squares method that contrasts with the least mean square approach. Robust regression analysis flags as outliers any compound that has a residual greater than 2.5 times the standard error. Because RRA uses the least median squares approach, it is better able to identify leverage problems within a data set. Leverage problems are caused by data points whose associated retention indices are at the extremes of the data set. Such data points are easily located on the calculated versus observed plots and were manually tested for undue leverage on the model. This was accomplished by removing these points from the data set and regenerating the model. Significant changes in the regression equation are an indicator of high leverage. Several of the models developed had a small number of outliers. These outliers tended to vary with the model: no outliers were consistently flagged by all models. Furthermore, at least one model for the DB-1 and DB-5 stationary phases displayed no outliers. Since these models differed only in the makeup of their descriptor sets and did not compromise the fit, they were selected as the best models for those particular columns. The DB-1701column had one compound flagged as an outlier, bis[(Zchloroethyl)thiolethyl sulfide. The molecular structure did not differ greatly from other compounds within the data set. This outlier did not appear to affect the model when included in the data set, and thus it was kept in the model. A summary of the best models calculated for each of the stationary phases, DB-1,DB-5, and DB-1701,is given in Table 111. The coefficients are reported with their 95% confidence intervals. The multiple correlation coefficient ( R ) is an indicator of the amount of variation in the dependent variable accounted for by the model. The F statistics show a high degree of statistical credibility and are indicative of an excellent fit of the models to the calculated values. Figures 1through 3 show the plots of the calculated versus observed retention indices. Several important factors can be noted about the descriptors selected for each model. Molecular weight is a bulk property descriptor which encodes information about a molecule’s size. The nature of the dispersive interactions within a chromatographic system are related to molecular polarizability, which in turn is related to the molecular size
3062
ANALYTICAL CHEMISTRY, VOL. 64, NO.
23, DECEMBER 1, 1992
Table 111. Summary of the Best Regmssion Models Found for the Data Set column
DB-1 6.3 f 0.2 126.0 f 13.3
descriptor mol wt no. of S atoms no. of C1 atoms
FPSA-1 FNSA-3 DPSA-1
DB-5 8.7 f 0.1 -156.8
1104.8f 87.3 -4647.3 407.2
f 9.9
-5200.5
f 437.1
219.5 f 34.5 -989.3 0.998 31.5 1806 31 2.1
9
F n % std error of the mean
DB-l/DB-5 6.4 f 0.2 130.0 f 10
-8980.4 f 586.3 2.1 f 0.2
-4930.5
-645.4
-55.8 f 8.5 -1009.6
0.998
0.998
1148.1f 66
*
ratio of the momenta of inertia indicator variable intercept R
DB-1701 6.2 f 0.2 173.3f 18.9
-600.9 0.998
33.1 1757 31
43.8 1270 30 2.5
2.1
f 30
34.7 2643 62 2.3
3400 2600
I a
3000
/ - I
1
18 2600
;
2200
8
j 2200
3
i
8 1800
.a
1800
11400
a1
R = 0.998
1400
N=31 s = 31.5
000 600 600
I
I
I
I
1000
I
1800 2200 2600 3000 Observed Retention Index Flgure 1. Calculated retention index versus the observed retentbn index for the model developed for the DE1 stationary phase. 1000
1400
2600
I a 2200
P
18 1800
11400
R = 0.998
a 1000
N=31 s = 33.1
I/
600 600
I
1000
I
I
I
I
1800 2200 2600 Observed Retention Index 1400
3000
Figure 2. Calculated retention Index versus the observed retentbn index for the model devekped for the D E 5 stationary phase.
of the solute. The counts of both sulfur and chlorine possibly account for the electronic effects these electronegative atoms have on the molecular environment. A geometric descriptor, the ratio of the first and second principal moments of inertia, has been found useful in previous studies.'*J3 This descriptor also encodes information about the size and shape of a molecule. CPSA descriptors are alsofound in all three models. The FPSA-1 and FNSA-3 are fractional charged partial surface area descriptors. The FPSA-1 descriptor is the charged positive surface area divided by the total molecular
600 600
1000 1400 1800 2200 2600 3000 3400 Observod Retention Index
Flgure 3. Calculated retentkm index versus the observed retention index for the model developed for the DE1701 stationary phase.
surface. Often it is reflective of the number of hydrogen a t o m attached to a molecule. The FNSA-3 descriptor is the sum of the partial surface area of the negatively charged atoms divided by their respective atomic charge. This summation is then divided by the total molecular surface area. The DPSA-1 descriptor is the difference between the valuer, of the positively and negatively charged partial surface areas of a molecule. These CPSA descriptors combine information with regard to the geometry of a molecule and electronic information. Therefore,the CPSA descriptors are well suited to provide information to a model concerning both the dispersive and polar molecular interactions. Thus the descriptors contained in the models are consistent with the understanding of what types of molecular features have important effects on retention behavior. An experimentwas conducted to determineif one regreasion equation could be developed to model the retention indices on a combination of the three stationary phases. First, retention data from the DB-1 and DB-5 stationary phases were combined into one dependent variable and modeled. Since the retention indices of these two phases were correlated with each other with R2= 0.999, the possibility of developing a common model waa likely. The same procedures as previously discussed were used to select descriptors and develop models. The best model that describedthe retention behavior on the two columna used descriptors included in the DB-1 model and also included one indicator variable. An indicator variable is binary-when one phase is modeled, this indicator descriptor was set a t a constant value of one. The other phase can be predicted for when the indicator variable in the same regression equation is set to zero. Without an
ANALYTICAL CHEMISTRY, VOL. 64, NO. 23, DECEMBER 1, 1992
9069
3400
the validity of the models. Jackknifing leaves one compound 4~
3000
-
2600
-
4 2200
-
bl
8
i 2?
1800
3 1400
a
-
1
600 600
I
I
I
I
I
I
II
1800 2200 2600 3000 3400 Obesrvod Retention Index Flguro 4. Calculated retention index versus the observed retentlon Index for the model developed for both the DE1 and DE5 stationary
1000 1400
phases.
indicator variable in the model, a standard error of s = 44.2 was realized. This improved to s = 34.7 and a multiple correlation coefficientof R = 0.998 with the indicator variable. The combined DB-l/DB-B model is shown in column 4 of Table 111. The calculated versus observed plot of this model is given in Figure 4. The more polar DB-1701 column could not be modeled with the other columns without a significant increase in the standard error. An indicator variable, when added to the regression model developed for the DB-1and DB-1701phases, decreased the resulting standard error by half. However, the resulting model remained unacceptable with numerous outliers and a standard error of over 73. Each of the retention index models presented contains one or more CPSA descriptors. To determine how sensitive the CPSA descriptors were to minor changes in the molecular geometry,the x , y,and z coordinatesof the three-dimensional molecular models were slightly altered in a random manner. To ensurereasonableconformationsat all times, the structures were visually checked and strain energy values monitored. The CPSA descriptors were then recalculated and the resulting values compared with the original values. The average percentage changes in the CPSA values were less than 4 % These altered CPSA descriptors were used in conjunction with the original values for the other descriptors, and predictions of retention indices were performed. The calculatedvalues were essentiallyunchanged from the original predictions, with differences less than 1%. Thus minor changes in the geometry had little effect on the descriptor values or the models or predictions developed from them. This result agrees with previous work on this topic.g Model Validation and Prediction. Although prediction is the most desirable form of validation, no prediction set was available due to the relatively small number of compounds in the data set. Thus internal jackknifii was used to test
.
out from the data set and regenerates the model coefficienta. The retention index of the molecule that was left out is then predided. The jackknifing resulta showed that no one compound exerted undue influence on any of the models. Another technique that was used to ensure the models were valid was the leave-n-out method. In this study, three randomly selected compounds were left out of the data set. The models were recomputed without the three observations and the resulting model Coefficients were then compared to those in the original model. Any changes greater than one standard deviationwould indicate a possible validity problem. Likewise, the new model is then used to predict the retention indices for those compounds left out. All recalculated coefficients were very close to those in the final model and were well within acceptable limits. Furthermore, the root mean square (rms)error of predictionwas used as an indicator of the stability of the equation. A stable equation generates an rms error that is reasonably close to that of the rms error of the model developed using the entire data set. The resulta of these two testa demonstrated the stability of the f i i models in the absence of a true prediction set. These tests substitute for the prediction of true unknowns in the absence of a prediction set. The results demonstrate that one would expect to get errors described by the standard errors of the equations themselves (Table 111) when predicting retention indices for compounds structurally similar to those in the data set.
CONCLUSIONS The retention indices for 31 sulfur vesicant compounds have been modeled with regression equations based on calculatedmolecular structure descriptors. Molecular weight, the number of sulfur and chlorine atoms, and several charged partial surface area descriptors were found to be important variables. I t waa shown that the CPSA descriptors used in this study were not significantly affected by minor changes in conformation. T w o of the columne could be modeled using the m e equation with the addition of an indicator variable, while themost polar columnhadtobemodeledwithaseparate equation. The equations were statistically stable and fit the data well. Since the retention indices on three different columna were modeled, the equations developed here are a favorablestartingpoint for further studiesof them compounds with larger data sets.
ACKNOWLEDGMENT This work was supported by the National Science Foundation and the Department of the Army. Larry Anker and Jon Ball are acknowledged for their assistance in the application of the ADAPT software package. Steve Dixon is ale0 acknowledged for his help.
RECEIVED for review June 14, 1992.
5, 1992. Accepted September