Prediction of gas chromatographic retention data for hydrocarbons

Anal. Chem. 1993, 65, 582-587


Prediction of Gas Chromatographic Retention Data for Hydrocarbons from Naphthas

Thomas F. Woloszyn and Peter C. Jurs*

Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, University Park, Pennsylvania 16802

Regression equations that model the gas chromatographic retention behavior of hydrocarbons found in complex petrochemical mixtures were developed for two different stationary phases, SE-30 and Carbowax 20M. The models had relative standard errors in the range 1-2%. This quantitative structure-retention relationship (QSRR) study focused on a relatively heterogeneous data set and resulted in the generation of several statistical models that related Kováts' retention index with descriptors that encode molecular structure. Also investigated was the addition of boiling point as a physicochemical descriptor. These models bore a significant improvement over the models containing only structural descriptors, with R values of 0.996.

INTRODUCTION

Pyrolysis naphtha (PN) and fluid catalytically cracked naphtha (FCCN) are complex hydrocarbon mixtures that are important to the petrochemical industry. PN is used in polymer synthesis; therefore it is necessary to conduct component analysis to identify aromatics containing unsaturated side chains. FCCN, a major component in automotive petroleum products, requires the identification of high-octane aromatics C6 through C10. Alkylbenzenes have also been a focus of attention because they are potential pollutants.

The analysis of these mixtures is not well documented in the chemical literature. Capillary gas chromatography, coupled with mass spectrometry, has been the preferred method of identification (ref 1). However, even though alkylbenzenes comprise the major component of these mixtures, there is a significant lack of data for higher alkylbenzenes over C9 (ref 2). The reproducibility of retention index data on a single column under isothermal conditions has been good, especially for nonpolar columns. Unfortunately, the reproducibility between columns and laboratories has been problematic, with variations as large as 8 index units (iu) reported. This has been attributed to column aging and differences in film thickness (refs 2, 3).

Alkylbenzenes have been studied previously in an attempt to predict chromatographic retention data (refs 4-6). These studies had in common the use of very small data sets that were typically congeneric in nature. Furthermore, all of the studies used physical property descriptors such as boiling point, molar refraction (Mr), vapor pressure, or molar volume (Vm). The previous studies also relied heavily on topological parameters such as connectivity indices. Molecular additivity coupled with molar mass was also used to calculate retention indices of these compounds (ref 7). All the models described above yielded high coefficients of multiple determination (R²) and relatively low standard errors. Yet these small data sets limit the ability to predict retention data for compounds which differ from those in the training set, and statistical irregularities may occur with small data sets.

The aim of the present research was to develop a general model capable of predicting the gas chromatographic retention indices of the compounds in this complex hydrocarbon mixture of PN and FCCN. The models developed in this study were derived from a more heterogeneous data set in an attempt to broaden the applicability of the model to a larger population of related compounds. In addition, physical descriptors such as boiling point and molar volume were not the focal point. The objective was to produce a regression equation that related only structural features with the observed property, Kováts' retention index.

Quantitative structure-retention relationships describe the field that relates molecular structure with retention behavior. Calculations generate numerical descriptors that encode structural information about the compounds in a data set. Multiple linear regression analysis then relates these descriptors to the Kováts' retention index. QSRR studies have two primary goals: the prediction of chromatographic retention and the explanation of the solute-stationary-phase retention behavior within a chromatographic column.

(1) Gallegos, E. J.; Whittemore, I. M.; Klaver, R. F. Anal. Chem. 1974, 46, 157-161.
(2) Matisová, E. J. Chromatogr. 1988, 438, 131-144.
(3) Héberger, K. Anal. Chim. Acta 1989, 223, 161-174.
(4) Héberger, K. Chromatographia 1990, 29, 375-384.
(5) Gerasimenko, V. A.; Nabivach, V. M. J. Chromatogr. 1990, 498, 357-366.
(6) Vernon, F.; Suratman, J. B. Chromatographia 1983, 17, 600-605.
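For readers who want the two quantities named above in symbols, the block below is a minimal sketch. The first line is the standard textbook definition of the isothermal Kováts retention index, not a formula quoted from this paper; the symbols t'_R (adjusted retention time), n (carbon number of the n-alkane eluting just before solute x), b_j (regression coefficients), and x_j (descriptors) are notation assumed here for illustration.

```latex
% Isothermal Kovats retention index of solute x, bracketed by the
% n-alkanes with n and n+1 carbon atoms (standard definition, assumed notation)
I_x = 100\left[\, n + \frac{\log t'_R(x) - \log t'_R(n)}{\log t'_R(n+1) - \log t'_R(n)} \,\right]

% Generic form of a multiple linear regression QSRR model with k descriptors
I = b_0 + \sum_{j=1}^{k} b_j\, x_j
```

Under this definition each n-alkane has an index of 100n on any stationary phase, which is why the same compound can carry very different index values on SE-30 and Carbowax 20M.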

EXPERIMENTAL SECTION

Data Set. The data used in this research were reported by Tóth (ref 8). The Kováts' retention indices were determined for 81 hydrocarbons using capillary gas chromatography on an SE-30 and a Carbowax 20M column. The data were obtained at 70 °C using nitrogen flowing at 11 cm/s as the carrier gas. The compounds ranged from benzene to C11 compounds such as 1,2-dimethyl-3-propylbenzene. The hydrocarbon mixture consisted of 51 alkylbenzenes, 14 styrene derivatives, and 16 associated aromatic and nonaromatic ringed compounds. The 81 compounds and their retention data are listed in Table I. The experimental error was not reported for the data set. However, ±3 iu is usually given as the experimental error of these measurements for hydrocarbon compounds such as alkylbenzenes. Complete experimental details can be found in Tóth's paper (ref 8).

This QSRR study was performed in four phases: (1) structure entry and molecular modeling, (2) generation of molecular descriptors, (3) multiple linear regression analysis, and (4) outlier analysis and model validation. The ADAPT software system provided a full range of programs that were used to carry out this study (ref 9).

Data Set Entry and Molecular Modeling. The structure and associated retention index values for each of the 81 hydrocarbons were entered into a Sun 4/110 workstation and stored in connection table form. Allinger's MM2 molecular

(7) Dimov, N.; Matisová, E. J. Chromatogr. 1991, 549, 325-333.
(8) Tóth, T. J. Chromatogr. 1983, 279, 157-165.
(9) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. Computer-Assisted Studies of Chemical Structure and Biological Function; Wiley-Interscience: New York, 1979.

© 1993 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 65, NO. 5, MARCH 1, 1993

Table I. Pyrolysis Naphtha and Fluid Catalytically Cracked Naphtha Retention Data

no.  compd name                        SE-30    Carbowax 20M
1    benzene                            657.1    954.5
2    naphthalene                       1145.0   1620.0
3    1,2,3,4-tetrahydronaphthalene     1129.8   1490.4
4    toluene                            759.9   1049.5
5    o-xylene                           882.2   1185.7
6    m-xylene                           859.6   1145.3
7    p-xylene                           860.8   1138.9
8    1,2,3-trimethylbenzene            1006.0   1325.4
9    1,2,4-trimethylbenzene             977.5   1277.6
10   1,3,5-trimethylbenzene             956.2   1242.2
11   1,2,3,4-tetramethylbenzene        1127.5   1461.6
12   1,2,4,5-tetramethylbenzene        1096.4   1406.8
13   1,2,3,5-tetramethylbenzene        1099.5   1416.5
14   ethylbenzene                       851.4   1131.9
15   1-methyl-2-ethylbenzene            964.6   1258.4
16   1-methyl-3-ethylbenzene            949.0   1224.9
17   1-methyl-4-ethylbenzene            950.6   1233.4
18   1,2-dimethyl-3-ethylbenzene       1083.5   1394.8
19   1,2-dimethyl-4-ethylbenzene       1066.1   1357.2
20   1,3-dimethyl-2-ethylbenzene       1069.6   1372.1
21   1,3-dimethyl-4-ethylbenzene       1061.7   1350.0
22   1,3-dimethyl-5-ethylbenzene       1042.6   1319.8
23   1,4-dimethyl-2-ethylbenzene       1059.0   1343.5
24   1,2-diethylbenzene                1043.5   1324.0
25   1,3-diethylbenzene                1034.8   1297.3
26   1,4-diethylbenzene                1040.1   1305.2
27   n-propylbenzene                    941.8   1210.1
28   isopropylbenzene                   911.3   1176.9
29   1-methyl-2-propylbenzene          1048.3   1327.7
30   1-methyl-3-propylbenzene          1034.8   1301.0
31   1-methyl-4-propylbenzene          1038.1   1301.0
32   1-methyl-2-isopropylbenzene       1020.6   1276.4
33   1-methyl-3-isopropylbenzene       1006.6   1266.5
34   1-methyl-4-isopropylbenzene       1009.5   1268.8
35   1,2-dimethyl-3-propylbenzene      1166.5   1458.6
36   1,2-dimethyl-4-propylbenzene      1149.5   1435.8
37   1,3-dimethyl-2-propylbenzene      1152.2   1451.6
38   1,3-dimethyl-4-propylbenzene      1143.7   1429.0
39   1,3-dimethyl-5-propylbenzene      1126.0   1406.2
40   1,4-dimethyl-2-propylbenzene      1140.0   1415.0

mechanics program was used to model the majority of the compounds in the data set (refs 9, 10). This program produces the three-dimensional coordinates of each atom, which represent the geometry of the molecules (ref 11). Not all of the structures could be modeled using MM2, since several compounds containing triple bonds could not be handled. Therefore MOPAC was used for these compounds (ref 12). This molecular modeling routine uses semiempirical molecular orbital techniques to determine the atomic coordinates.

Descriptor Generation. The structure of a compound can be represented by a set of calculated numerical descriptors. A total of 128 separate descriptors was calculated for the entire hydrocarbon data set. These descriptors were either topological, geometrical, electronic, or physicochemical in nature. Topological descriptors encode the information found within the molecular connection table. These descriptors include fragment and bond counts, atom types, molecular connectivity, and path descriptors. Two of the topological descriptors used were molecular weight and the simple path 3 connectivity index (³χp) developed by Kier and Hall (ref 13). These descriptors have proven important in previous QSRR studies (refs 14, 15). Another topological descriptor found useful was one which enumerates all path counts

(10) Allinger, N. L.; Yuh, Y. H. MM2/MMP2, 85-Force Field; QCPE Program No. 395; Quantum Chemistry Program Exchange: Bloomington, IN, 1985.
(11) Burkert, U.; Allinger, N. L. Molecular Mechanics; ACS Monograph 177; American Chemical Society: Washington, DC, 1982.
(12) MOPAC, ver 5.0; QCPE Program No. 445; Quantum Chemistry Program Exchange: Bloomington, IN.
(13) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Relationships; John Wiley & Sons, Inc.: New York, 1986.
(14) Kier, L. B. Quant. Struct.-Act. Relat. 1985, 4, 109-116.
(15) Needham, M. D.; Jurs, P. C. Anal. Chim. Acta 1992, 258, 183-198.

Table I (continued)

no.  compd name                        SE-30    Carbowax 20M
41   n-butylbenzene                    1039.6   1309.1
42   isobutylbenzene                    994.3   1241.3
43   sec-butylbenzene                   997.0   1248.1
44   n-pentylbenzene                   1140.5   1404.3
45   tert-butylbenzene                  980.0   1237.9
46   allylbenzene                       932.6   1263.2
47   1,3-divinylbenzene                1091.3   1541.0
48   1,4-divinylbenzene                1100.0   1554.2
49   ethynylbenzene                     862.1   1357.2
50   4-(methylethynyl)benzene           965.9   1454.3
51   3-(methylethynyl)benzene           960.5   1450.9
52   styrene                            877.8   1255.1
53   2-methylstyrene                    977.5   1342.2
54   3-methylstyrene                    977.0   1348.1
55   4-methylstyrene                    980.5   1348.6
56   2,4-dimethylstyrene               1075.4   1440.5
57   2,5-dimethylstyrene               1078.7   1432.2
58   2,3-dimethylstyrene               1100.3   1485.1
59   4-ethylstyrene                    1073.2   1431.2
60   3-ethylstyrene                    1065.8   1423.7
61   trans-β-methylstyrene             1009.9   1390.3
62   α-methylstyrene                    966.4   1320.9
63   cis-β-methylstyrene                975.3   1324.3
64   2-methyl-trans-β-methylstyrene    1098.3   1464.0
65   4-methyl-trans-β-methylstyrene    1109.0   1428.7
66   dipentene                         1019.8   1200.0
67   indan                             1015.7   1355.9
68   indene                            1023.3   1455.8
69   4-methylindan                     1120.6   1467.9
70   5-methylindan                     1112.2   1446.9
71   trans-hexahydroindan               949.8   1059.3
72   cis-hexahydroindan                 980.9   1107.3
73   2-ethylnorbornane                  919.9   1020.2
74   2-ethylnorbornane                  915.9   1015.1
75   5-ethylidene-2-norbornene          908.2   1106.8
76   5-vinyl-2-norbornene               877.9   1067.1
77   cis-decahydronaphthalene          1081.5   1223.0
78   trans-decahydronaphthalene        1041.7   1160.9
79   tetrahydrodicyclopentadiene       1077.6   1243.4
80   dicyclopentadiene                 1011.9   1247.1
81   dihydrodicyclopentadiene          1050.5   1252.7

for specified substructures embedded in a molecule. The substructure found helpful in this study was a three-carbon fragment of an aromatic ring.

Geometrical descriptors depend on the three-dimensional x, y, and z coordinates of the compounds. Two geometric descriptors were found useful in this study. First, the third major geometric axis, or the molecule's thickness, was found to be a significant descriptor in modeling retention. A charged partial surface area (CPSA) descriptor, the partial negative surface area (PNSA), was the other geometrical descriptor used in this study, particularly to model the polar Carbowax 20M column. This descriptor represents the sum of the negatively charged atoms' contribution to the surface area of the molecule (refs 17, 18).

Regression Analysis. Objective feature selection was used to shrink the large pool of descriptors prior to submitting them to regression techniques. This was necessary to reduce the possibility of chance correlations. Objective feature selection is a method in which descriptors are subjected to an analysis that eliminates redundant or less useful descriptors from the pool. It is performed without considering the dependent variable. Thus objective feature selection is an unbiased technique that reduces the descriptor pool to a statistically valid size. Objective feature selection was done with a descriptor analysis program that offers a set of options for analyzing the descriptor pool statistically. Descriptors that contained a high number of zero values (>30%) were discarded, since these descriptors do not contain enough information to be useful.

(16) Georgakopoulos, C. G.; Tsika, O. G.; Kiburis, J. C.; Jurs, P. C. Anal. Chem. 1991, 63, 2025-2028.
(17) Dixon, S. L.; Jurs, P. C. J. Comput. Chem. 1992, 13, 492-504.
(18) Stanton, D. T.; Jurs, P. C. Anal. Chem. 1990, 62, 2323-2329.


Table II. Reduced Descriptor Pool Used in Regression Analysis

SB        no. of single bonds
MW        mol wt
DPOL      electric dipole moment
MPOL      molecular polarizability
PPSA-1    partial positive charged surface area
PNSA-1    partial negative charged surface area
KAPPA-1   shape-related index based on graph theory
MOMI-1    first major moment of inertia
DMPATH    substructure all path count
DMGEO-3   third major geometric axis (width)
³χc       molecular connectivity (cluster 3)
⁴χc       molecular connectivity (cluster 4)
⁵χc       molecular connectivity (cluster 5)
⁶χpv      valence-corrected molecular connectivity (path 6)
³χp       molecular connectivity (path 3)
⁴χp       molecular connectivity (path 4)

Table III. Summary of Outliers Identified by Robust Regression Analysis (RRA) and Descriptor Diagnostics (DDG)

compd                                  SE-30 column   Carbowax 20M column
cis-decahydronaphthalene (78)          RRA, DDG       RRA, DDG
trans-β-methylstyrene (61)             RRA, DDG       RRA
4-methyl-trans-β-methylstyrene (65)    RRA
cis-hexahydroindan (72)                RRA
5-vinyl-2-norbornene (endo) (76)                      RRA
dipentene (66)                                        RRA, DDG
dihydrodicyclopentadiene (80)                         RRA, DDG
allylbenzene (46)                                     RRA

Descriptors with a large number of identical values were removed from consideration, since they too have little discriminatory power within the data set. Descriptors with low standard deviations were removed from the group of possible descriptors for the same reason. Descriptors that encode similar information, indicated by high (>0.90) pairwise correlations, were also removed from the descriptor list. Finally, the descriptor set was analyzed for multicollinearity among the members in the pool. Multicollinearity is the presence of relationships among the independent variables of a data set such that some of the variables can be computed from subsets of other descriptors; e.g., molecular weight can be computed from atom counts. Since the structures in the data set were similar in many respects, many of the descriptors generated were correlated. Over 100 descriptors were deleted using objective feature selection. The final list of descriptors remaining after objective feature selection is given in Table II.

The 81 compounds were randomly divided into two sets. The training set consisted of the 71 compounds used in model development. Ten compounds were set aside as a prediction set, which was used later to assess the validity and predictive ability of the regression models.

The training set compounds were then used to develop regression models. Several multiple linear regression techniques were used, such as regression by the leaps and bounds technique (ref 19) and forward stepwise regression (ref 20), to determine the optimal model for each chromatography column. The best models were selected on the basis of the multiple correlation coefficient (R), the standard error (s), the overall F value of the model, and the predictive ability of the model. There is a maximum number of descriptors that can be included in a model while still retaining statistical validity.
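The objective feature selection filters described in this section can be sketched in a few lines. This is an illustrative sketch only, not the ADAPT descriptor analysis program: the function name `objective_feature_selection` and the cutoffs for identical values and low standard deviation are assumptions (the text gives only the 30% zero-value and 0.90 correlation thresholds), and the final multicollinearity check is not included.

```python
import numpy as np

def objective_feature_selection(X, names, max_zero_frac=0.30,
                                max_identical_frac=0.90, min_std=1e-6,
                                max_abs_corr=0.90):
    """Filter descriptor columns without looking at the retention index.

    max_zero_frac and max_abs_corr follow the text; max_identical_frac
    and min_std are assumed values chosen for illustration.
    """
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # Discard descriptors that are zero for more than 30% of compounds.
        if np.mean(col == 0.0) > max_zero_frac:
            continue
        # Discard descriptors dominated by a single repeated value.
        _, counts = np.unique(col, return_counts=True)
        if counts.max() / col.size > max_identical_frac:
            continue
        # Discard descriptors with a very small standard deviation.
        if np.std(col) < min_std:
            continue
        # Discard descriptors highly correlated with one already kept.
        if any(abs(np.corrcoef(col, X[:, k])[0, 1]) > max_abs_corr
               for k in keep):
            continue
        keep.append(j)
    return [names[j] for j in keep]
```

Because the dependent variable never enters the function, the filter cannot favor descriptors that happen to correlate with the retention index by chance, which is the sense in which the text calls the procedure unbiased. When two descriptors are highly correlated, this sketch keeps whichever appears first; the original program may have used a different tie-breaking rule.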
The usual rule of thumb is to limit a regression equation to one independent variable, or descriptor, for every five observations in the data set (ref 21). All of the models evaluated in this study were well within this limit. Regression by leaps and bounds produced several models, which were tested for multicollinearity. Possible problems with multicollinearity were identified with all of the regression models that were developed by leaps and bounds regression and which contained more than five descriptors. This placed a practical limit on the number of descriptors that could be included in the models of this study. Statistically, a maximum of 14 descriptors could be included without violating the five to one limit. Of course this limit assumes no multicollinearities among the descriptors in the model.

Interactive regression analysis uses forward stepwise regression and was employed to further develop the models which showed promise with leaps and bounds regression. With this approach, each descriptor is entered individually into a model and the statistical effects, along with the resulting model, can be observed. Thus descriptors could be placed into or withdrawn from a particular model interactively and the effects on the model statistics monitored.

An attempt was made to improve the models through transformations of the descriptors and the dependent variable. Transformations are typically used to correct for problems with nonlinearity. Nonlinearity is a condition which exists when a plot of the calculated versus observed values displays a curvilinear relationship. Transformations alter a descriptor using a mathematical expression, such as XT = X^n, XT = 1/X, XT = log X, or XT = exp(-X). If transformations of the independent variables do not correct these nonlinearities, another feasible solution is to transform the dependent variable. This is a less favorable solution, since transformations of the dependent variable may bring about problems with nonconstant variance of the error terms (ref 20). Several of the transformations listed above were attempted. Models were developed that implemented the transformations and were tested for any signs of multicollinearity between the descriptors. Although several models were developed that fit the data well, none of these models improved to the extent that they justified the inclusion of the transformed descriptors in the final model.

Outlier Analysis. Each model was tested for outliers using both outlier diagnostic programs and robust regression analysis (ref 22). Outliers are poorly fitted data points which cannot be accounted for by the model without compromising model validity. A battery of tests, such as DFFITS, Cook's distance, leverage values, and studentized residuals, is used to detect outliers within a data set for a given regression equation. Data points failing any combination of three of the six tests are flagged as outliers.

(19) Furnival, G. M.; Wilson, R. W., Jr. Technometrics 1974, 16, 499-504.
(20) Neter, J.; Wasserman, W.; Kutner, M. H. Applied Linear Statistical Models, 3rd ed.; Irwin: Boston, MA, 1990.
(21) Topliss, J. G.; Edwards, R. P. J. Med. Chem. 1979, 22, 1238-1244.
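The four diagnostics named above can be computed directly from an ordinary least-squares fit. The sketch below is an assumption-laden illustration rather than the diagnostics program used in the study: the function name `regression_diagnostics` is hypothetical, the cutoffs are common textbook conventions, and only the four tests named in the text are implemented (the remaining two of the six are not identified there).

```python
import numpy as np

def regression_diagnostics(X, y):
    """Leverage, studentized residuals, Cook's distance, and DFFITS for an
    ordinary least-squares fit of y on the columns of X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept
    p = A.shape[1]                              # number of fitted parameters
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    # Diagonal of the hat matrix A (A^T A)^-1 A^T, i.e. the leverage values.
    h = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    s2 = resid @ resid / (n - p)                # residual variance
    # Residual variance re-estimated with observation i deleted.
    s2_del = ((n - p) * s2 - resid**2 / (1.0 - h)) / (n - p - 1)
    t = resid / np.sqrt(s2_del * (1.0 - h))     # externally studentized residuals
    cooks = (resid**2 / (p * s2)) * h / (1.0 - h) ** 2
    dffits = t * np.sqrt(h / (1.0 - h))
    # Conventional cutoffs (assumed; the paper does not list its own).
    n_failed = ((h > 2.0 * p / n).astype(int)
                + (np.abs(t) > 2.0).astype(int)
                + (cooks > 4.0 / n).astype(int)
                + (np.abs(dffits) > 2.0 * np.sqrt(p / n)).astype(int))
    return {"leverage": h, "t": t, "cooks": cooks, "dffits": dffits,
            "n_failed": n_failed}
```

A compound could then be flagged when `n_failed` reaches a chosen count, mirroring the "three of the six tests" rule in spirit. The leverage values sum to the number of fitted parameters, which is a quick sanity check on the computation.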
Robust regression analysis uses a least median of squares error criterion to identify simultaneously all outliers and points of high leverage. Both methods identified several outliers in the data set. Although this was not a homogeneous data set, two groups of compounds composed a large percentage of it: alkylbenzenes and styrenes. It would be expected that the models would account mainly for the structural features of these groups and potentially not account for members of the data set which differed significantly from them. This proved to be true, and some common structural traits were found among the outliers. Both models had difficulty discriminating between cis and trans isomers. Likewise, cyclic structures which did not contain aromatic bonds displayed high residuals and were difficult to predict. This can be explained because one of the descriptors specifically encodes path information related to a structure's aromatic bonds. Once identified as outliers, compounds were checked again to ensure that molecular modeling had resulted in reasonable conformations. All of the compounds flagged as outliers displayed reasonable conformations and were not structurally unique to the data set.

Table III summarizes the results of the outlier analysis using both RRA (robust regression analysis) and DDG (the data diagnostics program). RRA and DDG identified cis-decahydronaphthalene (78) as an outlier on both columns. This compound was nonaromatic, and its retention index differed from that of the trans isomer by over 40 retention units (ru) on the SE-30 column and by 60 ru on the Carbowax 20M column. DDG and RRA both identified trans-β-methylstyrene (61) as an outlier on the SE-30 column. RRA flagged this same compound on the Carbowax 20M column. Another styrene, 4-methyl-trans-β-methylstyrene (65), was also identified as an outlier by RRA on the SE-30 column. It should be noted that this compound differs from compound 61 only by a methyl group in the para position. Likewise, RRA flagged cis-hexahydroindan (72) on the SE-30 column. 5-Vinyl-2-norbornene (endo) (76) was flagged as an outlier on the Carbowax 20M column by RRA. Both programs identified dipentene (66) and dicyclopentadiene (80) as outlier compounds on the Carbowax 20M column. All of these compounds were without aromatic bonds. Finally, allylbenzene (46) was identified by RRA as an outlier on the polar Carbowax 20M column. Of the four structures with triple bonds, allylbenzene had the triple bond the furthest from the aromatic ring. All of these compounds were removed from consideration, leaving 67 compounds modeled on the SE-30 column and 65 on the Carbowax 20M column.

Model Validation and Prediction. The final step in this study was to test the models for statistical validity and to determine their utility for predicting compounds outside of the training set. Model validation was achieved using both internal and external validation. Internal validation uses observations from within the data set to test the quality of a model. External validation involves the use of a separate prediction set of observations that is set aside during development of the models. The models developed with the training set compounds are later used to predict the retention values for the prediction set compounds. Typical prediction set sizes range from 10% to 20% of the data available.

(22) Rousseeuw, P. J. J. Am. Stat. Assoc. 1984, 79, 871-880.
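The least median of squares criterion used in the robust regression step of the outlier analysis can be approximated by fitting many exact models through small random subsets and keeping the one with the smallest median squared residual. The sketch below follows that general resampling idea; the function name `lms_outliers`, the number of trials, the 2.5 cutoff, and the finite-sample scale correction are assumptions for illustration and are not taken from the program used in the study.

```python
import numpy as np

def lms_outliers(X, y, n_trials=2000, cutoff=2.5, seed=0):
    """Approximate least-median-of-squares fit by random elemental subsets.

    Returns (coef, flags): coef[0] is the intercept and flags marks points
    whose robustly scaled residual exceeds the cutoff.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    A = np.column_stack([np.ones(n), X])
    p = A.shape[1]
    best_med, best_coef = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            coef = np.linalg.solve(A[idx], y[idx])   # exact fit through p points
        except np.linalg.LinAlgError:
            continue                                  # degenerate subset, skip
        med = np.median((y - A @ coef) ** 2)
        if med < best_med:
            best_med, best_coef = med, coef
    # Robust scale estimate from the best median (assumed correction factor).
    scale = 1.4826 * (1.0 + 5.0 / (n - p)) * np.sqrt(best_med)
    flags = np.abs(y - A @ best_coef) > cutoff * scale
    return best_coef, flags
```

Because the median ignores the worst half of the residuals, a few badly fitted compounds cannot pull the fit toward themselves, which is why this criterion exposes outliers and high-leverage points at the same time.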
One can be confident that models performing well on the external prediction set can be used for predictive analysis of new observations, provided that the prediction set compounds are structurally similar to the compounds used to develop the model.

Internal jackknifing was initially performed to assess the quality of the models. This technique removes one observation and recalculates a new model, which the program then uses to generate a predicted value for the missing observation. This is repeated for the entire data set. The jackknifed estimates gave no indication of validation problems for the models. All of the jackknifed estimates were very close to the calculated values, indicative of a robust model.

External validation is the preferred validation technique. Nine compounds (8, 15, 32, 36, 47, 59, 63, 70, 81) were randomly selected and withheld from the model-building process. These compounds formed the basis for external validation. The retention values calculated for these compounds correlated highly with the observed values. For the SE-30 column, the predicted values correlated with the observed values with R = 0.955. The Carbowax 20M prediction set yielded an R value of 0.958. This is indicative of a stable model of value for predictive purposes. The results of model validation suggest that the models developed here are indeed stable and are able to predict retention for hydrocarbons similar in nature to those in the training set.
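The jackknife procedure described above amounts to leave-one-out refitting. The following is a minimal sketch assuming an ordinary least-squares model with an intercept; the helper names `fit_mlr`, `predict_mlr`, and `jackknife_predictions` are hypothetical and do not correspond to ADAPT routines.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares coefficients; element 0 is the intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predicted values for the rows of a 2-D descriptor array X."""
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]

def jackknife_predictions(X, y):
    """Predict each observation from a model fitted without it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    preds = np.empty_like(y)
    for i in range(y.size):
        mask = np.arange(y.size) != i          # leave observation i out
        coef = fit_mlr(X[mask], y[mask])
        preds[i] = predict_mlr(coef, X[i:i + 1])[0]
    return preds
```

The same `fit_mlr` and `predict_mlr` helpers cover external validation: fit on the training set, predict the withheld compounds, and compare with `np.corrcoef(observed, predicted)[0, 1]`, which is the kind of R value quoted for the prediction sets.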

RESULTS AND DISCUSSION

Several high-quality models were developed using interactive regression analysis. The best equation found that describes the SE-30 column for 67 hydrocarbons is given in Table IV. The scatter plot of the calculated versus observed retention values is shown for the SE-30 column in Figure 1. The descriptors contained in this model include molecular weight, third-order connectivity index, third major geometric moment, and the substructure path count. Chemically, these variables are significant to the retention problem. All implicitly relate the structure to the topological and geometric factors that govern chromatographic retention behavior. The substructure path count and the molecular connectivity both encode information about the side groups and the degree of branching within a molecule. The third geometric moment is a measure of the width across a molecule's major axis. This,


Table IV. Regression Model Developed for the 67-Hydrocarbon Mixture Separated on the SE-30 Column

reg coeff   std dev of the reg coeff   descriptor
    4.7     ±0.3                       mol wt
    4.6     ±0.3                       substructure path count
   68.2     ±7.2                       third-order path
 -142.8     ±19.3                      third major geometric axis
  161.2                                intercept

R = 0.983   s = 18.6   n = 67   F = 454
std error of the mean: 1.8%

Figure 1. Plot of the calculated versus observed retention indices for the 67 hydrocarbons separated on the SE-30 column (R = 0.983, N = 67, s = 18.6).

Table V. Model Developed for the 65-Hydrocarbon Complex Separated on the Carbowax 20M Column

reg coeff   std dev of the reg coeff   descriptor
    3.2     ±0.2                       PNSA-1
  156.1     ±9.5                       third-order path
    3.4     ±0.4                       mol wt
 -207.2     ±25.5                      third major geometric axis
    2.8     ±0.4                       substructure path count
  106.1                                intercept

R = 0.987   s = 23.3   n = 65   F = 432
std error of the mean: 1.93%
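The R, n, and F values reported with each regression table are linked by a standard identity for a linear model with an intercept, which gives a quick consistency check on the tables. The sketch below is an illustration under that assumption; `f_from_r` is a hypothetical helper, and the paper does not state how its F values were computed.

```python
def f_from_r(r, n, p):
    """Overall F statistic for a linear model with p descriptors plus an
    intercept, computed from the multiple correlation coefficient r and
    the number of observations n."""
    r2 = r * r
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# SE-30 model (Table IV): R = 0.983, n = 67, four descriptors.
f_se30 = f_from_r(0.983, 67, 4)
# Carbowax 20M model (Table V): R = 0.987, n = 65, five descriptors.
f_carbowax = f_from_r(0.987, 65, 5)
print(f_se30, f_carbowax)
```

Both results land in the mid-400s, close to the reported values of 454 and 432. Because F is very sensitive to R when R is near 1, the remaining gaps appear consistent with R having been rounded to three decimal places.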

coupled with the molecular weight, are bulk property descriptors. These two descriptors are indicators of molecular size, which is related to the molecular polarizability of a compound. The solute-stationary-phase interactions within a column are primarily dispersive, with the energy of the interactions dependent on the electronic polarizability.

Table V lists the best model developed for 65 hydrocarbons on the Carbowax 20M column. The corresponding calculated versus observed plot is presented in Figure 2. The only difference in the descriptors used in this model compared to the SE-30 model is the addition of one CPSA descriptor. This permits the model to account for the increase in polarity of the Carbowax 20M column and the increased importance of the polar interactions. Historically, retention behavior on polar columns has been more difficult to model, due to the more polar nature of the solute-stationary-phase interactions. This is readily observed when comparing the SE-30 model with the Carbowax 20M model. It appears that no one descriptor can fully encode the structural features responsible for these interactions.

Compounds located at the extremes of the retention range of the data set have the potential to exert leverage on the model. All of these points were tested to


Figure 2. Plot of the calculated versus observed retention indices for the 65 hydrocarbons separated on the Carbowax 20M column (R = 0.987, N = 65, s = 23.3).

determine if leverage existed by removing each point and rebuilding a new model. If undue leverage was present, the coefficients of the model would change significantly. The tests showed that undue leverage was not a factor in either of the models.

As explained before, descriptors derived from the data sets often exhibited problems with multicollinearity. Multicollinearity can be related to the similarities among descriptors which encode similar information about the compounds in the data set. A contributing factor to multicollinearity can be the amount of homogeneity among the structures in the study. Because of this, an experiment was conducted to determine if principal component analysis (PCA) could be used to generate useful descriptors, since PCA has also proved helpful with the display of multidimensional data (ref 23). Principal component analysis is an algebraic technique that reduces the dimensionality of a data set. Several descriptors can therefore be reduced to a few fundamental variables, or principal components. The principal components are linear combinations of the original descriptors which explain as much of the total variance of those descriptors as possible (ref 24).

The results of PCA as a method to generate the descriptors for these data were unsatisfactory. The four descriptors generated with PCA represented the first four principal components and accounted for 96% of the variance of the original variables. When the four PCA descriptors were used in regression analysis, however, the resulting model was poor. An attempt was made to add each of the PCA descriptors to an existing model in hopes of obtaining better results. This too was unsuccessful, with the best model having a multiple correlation coefficient R = 0.810.

Another experiment was intended to compare this study with previous alkylbenzene QSRR studies that relied on the boiling point as a physicochemical descriptor (ref 4). Since the boiling point has been shown to be correlated with retention properties, the use of this descriptor should result in excellent models. Forty alkylbenzene compounds from the main data set were placed into a separate training set. Boiling points were obtained from published data (ref 26). Models were developed

Figure 3. Plot of the calculated versus observed retention indices for the 40 alkylbenzenes separated on the SE-30 column.

Table VI. Regression Model Developed for the 40 Alkylbenzenes Separated on the SE-30 Column Using the Boiling Point Descriptor

reg coeff   std dev of the reg coeff   descriptor
    3.7     ±0.1                       boiling point
    4.6     ±1.1                       no. of single bonds
  340.1                                intercept

R = 0.998   s = 6.4   n = 40   F = 4994
std error of the mean: 0.6%

Table VII. Model Developed for the 40 Alkylbenzenes Separated on the Carbowax 20M Column Using the Boiling Point Descriptor

reg coeff   std dev of the reg coeff   descriptor
    6.6     ±0.2                       boiling point
  -27.0     ±3.8                       molecular polarizability
  -20.6     ±3.1                       no. of single bonds
  660.7                                intercept

R = 0.994   s = 13.0   n = 40   F = 1056
std error of the mean: 1.1%

for both chromatographic columns using regression analysis which utilized boiling point as one of the descriptors. Figure 3 shows the calculated versus observed plot for the SE-30 model using the boiling point descriptor. The model itself is listed in Table VI. This model used only two descriptors, boiling point and the number of single bonds. The boiling point is the most significant descriptor in the model. The number of single bonds is a topological fragment descriptor that accounts for the size of a molecule.

An excellent model was also developed for the more polar Carbowax 20M column, as given in Table VII. The Carbowax 20M model includes the boiling point as a descriptor along with molecular polarizability and the number of single bonds. Again, the boiling point is the most significant descriptor in the model. The molecular polarizability descriptor permits the model to account for a molecule's retention behavior on the more polar Carbowax column. Molecular polarizability is also related to the molar refractivity (Mr) by eq 1, where

$$ M_r = \frac{4}{3}\,\pi N_A\,(\text{molecular polarizability}) \qquad (1) $$

N_A is Avogadro's number. Molar refractivity has been found to be an important descriptor in previous QSRR work studying alkylbenzenes.27

(23) Massart, D. L.; Kaufman, L. The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis; Wiley-Interscience: New York, 1983.
(24) Pietrogrande, M. C.; Dondi, F.; Borea, P. A.; Bighi, C. Chemom. Intell. Lab. Syst. 1989, 5, 257-262.
(25) Wold, S.; Esbensen, K.; Geladi, P. Chemom. Intell. Lab. Syst. 1987, 2, 37-52.
(26) Kumar, B.; Kuchhal, R. K.; Kumar, P.; Gupta, P. L. J. Chromatogr. Sci. 1986, 24, 99-108.

The calculated versus observed

ANALYTICAL CHEMISTRY, VOL. 65, NO. 5, MARCH 1, 1993

plot for the Carbowax 20M column is illustrated in Figure 4. Both of these models indicate excellent fits, which was to be expected using the physicochemical descriptor and the smaller, more homogeneous data set.

Figure 4. Plot of the calculated versus observed retention indices for the 40 alkylbenzenes separated on the Carbowax 20M column (R = 0.994, s = 13.0, N = 40).

These regression models were also compared with models that used the alkylbenzene training set which did not contain the boiling point descriptor but instead used descriptors calculated in the descriptor generation phase of this study. This was done to determine more accurately the effect of using the boiling point descriptor in the model. Because of the congeneric data set, the resulting SE-30 model was quite good (R = 0.9910, s = 14.8) and better than the model developed using the entire data set. The Carbowax 20M model (R = 0.9826, s = 21.3) developed with the 40 alkylbenzenes had a lower standard error than the corresponding model which used the entire data set. As expected, neither of the regression models generated using the alkylbenzene data set without boiling points performed as well as the models which used the boiling points.

Table VIII compares the models developed in this study with those previously published in the literature.4 Although this study relied on more descriptors with generally less favorable results, the previous studies used boiling points and physicochemical properties that were highly correlated with the retention index as descriptors. Physicochemical properties must also be measured, whereas the models developed in this study used descriptors which could be readily calculated given the molecular structure. In addition, the previous studies used small data sets of very homogeneous compounds, reducing their ability to generalize for similar compounds.

Table VIII. Comparison of Models from This and Previous Studies

stationary phase    descriptors                                                R²       std error   N
SE-30               MW, third geometric axis, substr path, 3χp                 0.9670   18.6        67
Carbowax 20M        MW, third geometric axis, substr path, 3χp, PNSA           0.9741   23.3        65
SE-30               bp, no. of single bonds                                    0.9960   6.4         40
Carbowax 20M        bp, no. of single bonds, molecular polarizability          0.9881   13.0        40
SE-30               MW, third geometric axis, substr path, 3χp                 0.9821   14.8        40
Carbowax 20M        kappa-1, substr path, third geometric axis, PNSA, 3χp      0.9655   21.3        40
SE-30 (a)           bp, 1/χ                                                    0.9956   7.5         19
Carbowax 20M (a)    bp, 1/Vm                                                   0.9934   17.7        18
Carbowax 20M (a)    bp, (molar refractivity)^-1                                0.9928   9.6         19
Carbowax 20M (a)    bp, (molar refractivity)^-1                                0.9998   2.8         6

(a) Reference 4.

(27) Kaliszan, R. Quantitative Structure-Chromatographic Retention Relationships; John Wiley & Sons: New York, 1987.
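The discussion quotes correlation coefficients as R, whereas Table VIII lists R². A minimal sketch of the conversion is given below for the four alkylbenzene models of this study; the labels and the helper name are illustrative, and the small residual differences are assumed to arise from rounding of the reported R values.

```python
# (label, R quoted in the text or in Tables VI/VII, R^2 listed in Table VIII)
reported = [
    ("SE-30, boiling point model", 0.998, 0.9960),
    ("Carbowax 20M, boiling point model", 0.994, 0.9881),
    ("SE-30, calculated descriptors only", 0.9910, 0.9821),
    ("Carbowax 20M, calculated descriptors only", 0.9826, 0.9655),
]


def r_squared_gap(r: float, r2_table: float) -> float:
    """Absolute difference between R squared and the tabulated R^2."""
    return abs(r * r - r2_table)


gaps = {label: r_squared_gap(r, r2) for label, r, r2 in reported}
for label, gap in gaps.items():
    print(f"{label}: |R*R - tabulated R^2| = {gap:.5f}")
```

Under this reading, each squared R agrees with the tabulated R² to within the precision to which R is quoted.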

CONCLUSIONS Past studies have shown that QSRR is useful in developing models for homologous hydrocarbon data sets using experimentally obtained chemical properties such as boiling points as descriptors. The results of this study demonstrate that QSRR can also generate high-quality models for data sets of a more diverse nature, using only calculated descriptors. The models developed to represent retention behavior were statistically valid and correlated highly with the observed data. Both models passed both external and internal validation. The good predictive ability of these models should allow for estimation of retention indices for similar compounds in cases where retention values are not readily available.
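The statistics reported with the boiling point models (R, n, and F in Tables VI and VII) can be checked against one another. The sketch below assumes an ordinary least-squares fit with an intercept, for which the overall F statistic follows from R², the number of observations n, and the number of descriptors p. Whether the reported F values were computed in exactly this way is an assumption, and the function names are hypothetical. Because R is quoted to three decimals, the sketch brackets F over the rounding interval of R rather than expecting an exact match.

```python
def f_from_r(r: float, n: int, p: int) -> float:
    """Overall F statistic of a least-squares fit with intercept and p descriptors."""
    r2 = r * r
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


def f_range(r_rounded: float, n: int, p: int, half_step: float = 0.0005):
    """F values implied by the lower and upper ends of the rounding interval of R."""
    return (f_from_r(r_rounded - half_step, n, p),
            f_from_r(r_rounded + half_step, n, p))


# R, n, number of descriptors, and F as reported in Tables VI and VII.
models = {
    "SE-30 (Table VI)": {"r": 0.998, "n": 40, "p": 2, "f_reported": 4994},
    "Carbowax 20M (Table VII)": {"r": 0.994, "n": 40, "p": 3, "f_reported": 1056},
}

bounds = {name: f_range(m["r"], m["n"], m["p"]) for name, m in models.items()}
for name, (low, high) in bounds.items():
    reported_f = models[name]["f_reported"]
    print(f"{name}: F between {low:.0f} and {high:.0f}; reported {reported_f}")
```

The wide bracket shows how sensitive F is to the third decimal of R when R is close to 1, so a reported F is better compared with a range than with a single recomputed value.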

ACKNOWLEDGMENT This work was supported in part by the National Science Foundation and the Department of the Army. Larry Anker and John Ball are acknowledged for their assistance in the application of the ADAPT software package.

RECEIVED for review June 8, 1992. Accepted November 19, 1992.