Prediction of Reduced Ion Mobility Constants of Organic Compounds

Quantitative structure-property relationships (QSPRs) are used to develop mathematical models that accurately predict the reduced ion mobility constan...
0 downloads 0 Views 136KB Size
Anal. Chem. 1996, 68, 4237-4243

Prediction of Reduced Ion Mobility Constants of Organic Compounds from Molecular Structure Matthew D. Wessel, Jon M. Sutter, and Peter C. Jurs*

152 Davey Laboratory, Department of Chemistry, The Pennsylvania State University, University Park, Pennsylvania 16802

Quantitative structure-property relationships (QSPRs) are used to develop mathematical models that accurately predict the reduced ion mobility constants (K0) for a set of 168 organic compounds directly from molecular structure. The K0 values are taken from an unpublished database collected by G. A. Eiceman, Chemistry Department, New Mexico State University. The data were collected using a Graseby Ionics environmental vapour monitor (EVM) gas chromatography/ion mobility spectrometer. Standardized conditions with controlled temperature, pressure, and humidity were used, and 2,4lutidine was used as an internal standard. K0 values were measured for all monomer peaks. The best model was found with a feature selection routine which couples the genetic algorithm with multiple linear regression analysis. The set of six descriptors was also analyzed with a fully connected, feed-forward neural network. The model contains six molecular structure descriptors and has a root-mean-square error of about 0.04 K0 unit. The descriptors in the model lend insight into some of the important molecular features that influence ion mobility. The model can be utilized for prediction of K0 values of compounds for which there are no empirical K0 data. Ion mobility spectrometry (IMS) is a useful analytical technique for the detection of trace analytes in gas phase sample matrices. Some advantages of IMS include that it is relatively inexpensive and lightweight and requires no pumping. Ion mobility spectrometers have been developed as small, portable, stand-alone monitors for use in field situations.1-5 Whether as a stand-alone detector or coupled to another analytical technique, such as gas chromatography, IMS is an economical and reliable detection technique. IMS is based on the concept that an ion will drift at a constant velocity when exposed to an electric field at ambient pressure. In reality, the ion accelerates in the electric field before colliding with a neutral molecule and stopping.6 The continuous series of accelerations and collisions leads to the apparent constant velocity. The ratio of the drift velocity of a given ion to the electric field gives rise to the ion mobility value, typically denoted as K. In (1) Eiceman, G. A.; Karpas, Z. Ion Mobility Spectrometry; CRC Press, Inc.: Boca Raton, FL, 1994. (2) Snyder, A. P.; Harden, C. S.; Brittain, A. H.; Kim, M.; Arnold, N. S.; Meuzalaar, H. L. C. Anal. Chem. 1993, 65, 299. (3) Eiceman, G. A. Crit. Rev. Anal. Chem. 1991, 22, 471. (4) Turner, R. B.; Brokenshire, J. L. Trends Anal. Chem. 1994, 13, 275. (5) McClennan, W. H.; Arnold, N. S.; Meuzalaar, H. L. C. Trends Anal. Chem. 1994, 13, 286. (6) Hill, H. H., Jr.; Siems, W. F.; St. Louis, R. H.; McMinn, D. G. Anal. Chem. 1990, 62, 1201A. S0003-2700(96)00466-0 CCC: $12.00

© 1996 American Chemical Society

most IMS instruments, the value of K can be derived from the following equation:

K ) L2/Vt

(1)

where L is the length of the drift tube region of the spectrometer, V is the applied voltage, and t is the drift time of the ion. For purposes of standardization, the reduced mobility constant, K0, is typically reported for any particular analyte ion. K0 values are calculated using the following equation:

K0 ) K(273/T)(P/760)

(2)

where T is the operating temperature of the instrument and P is the current ambient pressure. The literature provides reports dealing with the theory of IMS in greater detail.1,6-8 IMS has found several important uses in recent years. There have been advances in the interpretation of ion mobility spectra, many of these through statistical evaluation of data.9-14 Other advances in instrument design have also increased the sensitivity and selectivity of IMS.15,16 Several portable IMS instruments have also been developed. In particular, the United States Army is finding small, portable IMS units to be valuable devices for the detection of chemical warfare agents and other toxic entities. The portability of these IMS units allows individual soldiers to actively monitor their surroundings in real time. Portable IMS units are also useful in other areas, such as in the detection of drugs, explosives, and pollutants.6,17,18 Of particular interest is the development of mathematical models that accurately link K0 to molecular structure. A method (7) Revercomb, H. E.; Mason, E. A. Anal. Chem. 1975, 47, 970. (8) Mason, E. A. In Plasma Chromatography; Carr, T. W., Ed.; Plenum Press: New York, 1984; Chapter 2. (9) Snyder, A. P.; Waleed, M. M.; Eiceman, G. A.; Wang, Y.; Bell, S. E. Anal. Chim. Acta 1995, 316, 1. (10) Boger, Z.; Karpas, Z. Anal. Chim. Acta 1994, 292, 243. (11) Davis, D. M.; Harden, C. S.; Shoff, D. B.; Bell, S. E.; Eiceman, G. A.; Ewing, R. G. Anal. Chim. Acta 1994, 263. (12) Siems, W. F.; Wu, C.; Tarver, E. E.; Hill, H. H., Jr.; Larsen, P. R.; McMinn, D. G. Anal. Chem. 1994, 66, 4195. (13) Bell, S. E.; Mead, W. C. In Computer Enhanced Anaylitcal Spectroscopy, Vol. 4; Wilkins, C. L., Ed.; Plenum Press: New York, 1993; Chapter 12. (14) Bell, S. E.; Mead, W. C.; Jones, R. D.; Eiceman, G. A.; Ewing, R. G. J. Chem. Inf. Comput. Sci. 1993, 33, 609. (15) Wittmer, D.; Chen, Y. H.; Luckenbill, B. K.; Hill, H. H., Jr. Anal. Chem. 1994, 66, 2348. (16) Dworzanski, J. P.; Kim, M.; Snyder, A. P.; Arnold, N. S.; Meuzelaar, H. L. C. Anal. Chim. Acta 1994, 293, 219. (17) Eiceman, G. A.; Sowa, S.; Lin, S.; Bell, S. E. J. Hazard. Mater. 1995, 43, 13. (18) Eiceman, G. A.; Wang, Y.; Garcia-Gonzalez, L.; Harden, C. S.; Shoff, D. B. Anal. Chim. Acta 1995, 306, 21.

Analytical Chemistry, Vol. 68, No. 23, December 1, 1996 4237

Table 1. List of Compounds (Taken from Eiceman et al.) K0

K0

no.

compound

exptl

calcdc

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75

pentane 2,2,3,3-tetramethylbutaneb n-hexane 3-methylhexane 2-methylhexane n-heptane 2,2,4-trimethylpentane 2,2-dimethylhexanea 2,5-dimethylhexane 4-methylheptane 2-methylheptane n-octane n-nonane n-decane cyclohexene 2,5-dimethyl-1,5-hexadiene tert-butyl acetate o-toluidine 2-ethyl-1-hexene 3,5,5-trimethyl-1-hexenea dicyclopentadiene cycloheptane methylcyclohexane ethylcyclopentanea 1,3-dimethylcyclohexane isobutylbenzene 1,2-dimethylcyclohexane 1,4-dimethylcyclohexanea ethylcyclohexaneb isopropylcyclohexane propylcyclohexane cyclodecane butylcyclohexane isopropyl acetateb sec-butyl acetate isobutyl acetate methyl anteisovalerate methyl isovalerate methy isobutyrate methyl tert-valerate anteisoamyl acetate sec-amyl acetate 3-amyl acetate isoamyl acetate 2-ethylbutanoic acid methyl ester methyl isocaproatea 2-methylpentanoic acid methyl ester glutaraldehyde 2,4-lutidine 2,4,6-trimethylpyridine 2-(2-hydroxyethyl)pyridine aniline N,N-dimethylanilinea N,N-diethylaniline benzene toluene o-xylene m-xyleneb ethylbenzene β-methylstyrene R-metasytrene 1,2,4-trimethylbenzene mesitylene propylbenzene cumene 1,2,4,5-tetramethylbenzene butylbenzene sec-butylbenzenea p-cymene p-diisopropylbenzene acrolein propanal butanal 2-methylpropanal 3-methylbutanala

2.02 2.02 2.00 1.91 1.90 1.90 1.82 1.82 1.81 1.81 1.80 1.80 1.72 1.63 1.83 1.78 1.75 1.80 1.84 1.77 1.79 1.96 1.95 1.94 1.87 1.75 1.87 1.87 1.86 1.78 1.77 1.72 1.68 1.83 1.75 1.72 1.80 1.76 1.87 1.82 1.64 1.66 1.68 1.64 1.74 1.67 1.72 1.97 1.95 1.84 1.86 1.86 1.81 1.70 1.94 1.87 1.96 1.96 1.97 1.91 1.89 1.87 1.86 1.85 1.86 1.98 1.89 1.75 1.76 1.59 1.98 1.93 1.83 1.93 1.81

2.05 2.03 1.95 1.91 1.88 1.86 1.85 1.82 1.82 1.82 1.80 1.78 1.71 1.64 1.91 1.87 1.75 1.80 1.83 1.78 1.77 1.95 1.94 1.94 1.88 1.81 1.89 1.88 1.86 1.81 1.78 1.69 1.69 1.83 1.77 1.73 1.80 1.77 1.90 1.84 1.69 1.68 1.71 1.65 1.73 1.69 1.72 1.87 1.95 1.87 1.87 1.85 1.86 1.70 1.95 1.93 1.91 1.90 1.91 1.89 1.90 1.88 1.88 1.86 1.88 1.85 1.78 1.80 1.83 1.66 1.94 1.97 1.91 1.92 1.83

4238 Analytical Chemistry, Vol. 68, No. 23, December 1, 1996

no.

compound

exptl

calcdc

76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150

2-methylbutanal pentanal 2-methylpentanal hexanal benzaldehyde heptanal octanala nonanal tert-butylamine n-butylamine isoamylamine diisopropylamine di-n-propylamine n-hexylamineb triethylaminea benzylamine n-heptylaminea diisobutylamine di-n-butylamine tri-n-propylamine tri-n-butylamine ethanol propanol 2-propanol butanol 2-methyl-1-propanol 2-methyl-2-propanol 2-butanol 2-methyl-1-butanolb amyl alcohola isoamyl alcohol tert-amyl alcohol 3-pentanol hexanol 2-hexanolb 2-ethyl-1-butanol heptanol octanol nonanol decanolb acetic acida propionic acid isobutyric acid ethyl ether 2-butanoneb 2-pentanone 3-pentanone cyclohexanone 2-hexanone 3-hexanone 3-methylcyclohexanoneb 4-methylcyclohexanoneb 2-methyl-3-hexanonea 5-methyl-2-hexanone 3-methyl-5-hexanone 2-heptanone 3-heptanone 4-heptanone acetophenone 2-octanone 3-octanone ethyl acetate propyl methanoate methyl butanoate ethyl propanoate butyl methanoateb propyl ethanoate ethyl butanoate propyl propanoatea butyl ethanoateb butyl propanoate propyl butanoate methyl hexanoate 2-ethoxyethyl acetate ethyl hexanoateb

1.81 1.80 1.73 1.69 1.85 1.60 1.51 1.44 2.00 1.92 1.84 1.90 1.86 1.68 1.95 1.74 1.58 1.70 1.67 1.69 1.46 2.06 1.93 1.98 1.82 1.84 1.83 1.85 1.72 1.70 1.72 1.67 1.78 1.62 1.69 1.66 1.54 1.47 1.40 1.34 2.03 1.94 1.86 1.98 2.00 1.88 1.88 1.84 1.79 1.83 1.76 1.76 1.79 1.70 1.72 1.69 1.75 1.76 1.83 1.60 1.70 1.93 1.87 1.86 1.86 1.75 1.81 1.79 1.77 1.72 1.68 1.71 1.68 1.86 1.63

1.88 1.81 1.80 1.73 1.88 1.64 1.57 1.49 2.00 1.92 1.86 1.88 1.85 1.74 2.01 1.79 1.61 1.71 1.64 1.69 1.45 1.97 1.85 1.93 1.80 1.81 1.91 1.81 1.78 1.71 1.73 1.74 1.77 1.64 1.65 1.78 1.55 1.47 1.41 1.36 2.09 1.96 1.90 1.97 1.93 1.84 1.89 1.86 1.75 1.81 1.79 1.79 1.76 1.68 1.73 1.66 1.71 1.72 1.81 1.58 1.63 1.92 1.89 1.86 1.86 1.79 1.82 1.76 1.76 1.71 1.68 1.68 1.67 1.86 1.60

Table 1 Continued K0

a

K0

no.

compound

exptl

calcdc

151 152 153 154 155 156 157 158 159

methyl heptanoate ethyl octanoate naphthalene biphenylene acenaphthylene fluorene phenanthrene anthracene dimethyl phthalate

1.60 1.47 1.86 1.72 1.74 1.66 1.63 1.63 1.64

1.60 1.47 1.86 1.73 1.69 1.66 1.64 1.64 1.61

no.

compound

exptl

calcdc

160 161 162 163 164 165 166 167 168

diethyl phthalate decanala n-amylamineb cyclohexylamine 3-hexanola 3,3-dimethyl-2-butanol 2,4-dimethyl-2-pentanola acetone 2-methylcyclohexanone

1.49 1.38 1.79 1.82 1.71 1.84 1.67 2.13 1.82

1.50 1.42 1.84 1.79 1.69 1.76 1.60 2.06 1.80

Compound in external prediction set. b Compound in cross-validation set. c As calculated by the best neural network model.

for calculating K0 directly from structure has been reported.19 Prior to that report, there was no known way to accurately predict K0 values for diverse sets of organic compounds.3 Several efforts to link K0 to ion mass have been reported. The mass/inverse mobility correlation has been applied to homogeneous sets of compounds with some success but has been unsuccessful for diverse sets of compounds.20-23 The ion mobility, K (and K0), can be related to molecular structure through the following equation:

K)

3q 2π 16N kT

1/2 1

+M ( ) (mmM ) 1/2



(3)

where q is the ion charge, N is the drift gas density, k is Boltzmann’s constant, T is temperature (K), m is the mass of the ion, M is the drift gas mass, and Ω is the collision cross section. The Ω term in eq 3 provides the necessary link between structure and K0.7,19 To link structure to K0, it is necessary to numerically encode the structure. This is done by calculating numerical values, called descriptors, that describe the structure of the molecule. The method of numerically encoding molecular structure with descriptors and then mapping these descriptors to the K0 values is one example of quantitative structure-property relationships, or QSPRs. This method has proven to be effective in predicting K0 values.19 In this paper, a study using a standardized data set with higher quality and a greater degree of structural heterogeneity than that used in ref 19 is presented. The monomer peaks from each spectrum were used to calculate K0 values for the monomers. Monomers ions are typically either MH+, (M - 1)+, or (M - 3)+, depending on the compound. EXPERIMENTAL SECTION The computations in this work were done with a DEC 3000 AXP Model 500 workstation, except for those calculations involving HYPERCHEM (Hypercube, Inc., Waterloo, ON, Canada), which were performed on a PC. The ADAPT (Automated Data Analysis and Pattern Recognition Toolkit)24,25 software system was used for all other computations except those involving computa(19) Wessel, M. D.; Jurs, P. C. Anal. Chem. 1994, 66, 2480. (20) Griffin, G. W.; Dzidic, I.; Carroll, D. I.; Stillwell, R. N.; Horning, E. C. Anal. Chem. 1973, 45, 1204. (21) Lin, S. N.; Griffin, G. W.; Horning, E. C.; Wentworth, W. E. J. Chem. Phys. 1974, 60, 4994. (22) Benezra, S. A. J. Chromatogr. Sci. 1976, 14, 122. (23) Karasek, F. W.; Kim, S. H.; Rokushika, S. Anal. Chem. 1978, 50, 2013. (24) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. Computer-Assisted Studies of Chemical Structure and Biological Function; Wiley-Interscience: New York, 1979.

tional neural networks. The neural network software was developed independently in our laboratory.26 Data Set. The set of data used in this study were provided by G. A. Eiceman of New Mexico State University. The data were collected using a Graseby Ionics environmental vapour monitor (EVM) gas chromatography/ion mobility spectrometer with a 63Ni ionization source. Standardized conditions with controlled temperature, pressure, and humidity were used, and 2,4-lutidine was used as an internal standard.9 The drift gas was dry air, since this imitates the actual environment in which the EVM usually operates. Water content was held low (below 100 ppb) by using dry air and molecular sieves. K0 values were measured for all monomer peaks. One of the key advantages of standardized conditions is that structural differences between ions should be the primary influence on individual K0 values. The database was screened visually and statistically. To maintain consistency within the database, 2,4-lutidine was assigned a K0 value of 1.95 and used as an internal standard. K0 values for the remaining compounds were calculated using the following equation:

(K0)p ) [(td)s/(td)p](K0)s

(4)

where (K0)p is the reduced mobility for the analyte of interest, (td)s is the drift time of the internal standard (4822.93 µs), (td)p is the drift time of the analyte, and (K0)s is the reduced mobility of the internal standard (1.95). A final data set of 168 compounds was chosen, and they are listed in Table 1. The error associated with the measurements was about 0.04 K0 unit. As a result, this value was used as an approximate lower error bound for our models in order to prevent overfitting of the data. The data set was split randomly into a training set of 150 compounds and an external prediction set of 18 compounds. The training set was used to perform descriptor analysis and to build regression models. The external prediction set was never used in any intermediate steps and was only used as a final validation for the linear model once it was developed and had passed preliminary validation tests, which will be discussed later. Molecular Modeling. All 168 members of the data set were sketched and entered into the computer using HYPERCHEM. (25) Jurs, P. C.; Chou, J. T.; Yuan, M. In Computer-Assisted Drug Design; Olson, E. C., Christoffersen, R. E., Eds.; American Chemical Society: Washington, D.C., 1979; pp 103-129. (26) Xu, L.; Ball, J. W.; Dixon, S. L.; Jurs, P. C. Environ. Toxicol. Chem. 1994, 13, 841.

Analytical Chemistry, Vol. 68, No. 23, December 1, 1996

4239

After the compounds were entered, the semiempirical molecular orbital modeling package EHNDO (developed in our laboratory) was used to optimize the geometries of the structures.27 Accurate three-dimensional structures were necessary for several of the geometric descriptors that were calculated. Descriptor Generation and Analysis. A total of 158 descriptors were calculated. Of these, 83 were topological, 23 were geometric, four were electronic, and 48 were combinations of electronic and surface area information that can be thought of as charged partial surface area (CPSA) descriptors. Topological descriptors were derived from the knowledge of atom types, bond types, and connections. Topological descriptors included bond counts, atom counts, and connectivity indexes.28-32 The electronic descriptors calculated were partial atomic charges and the dipole moment.33-35 Examples of geometric descriptors include moments of inertia, molecular volume, and molecular surface area.36-38 The CPSA descriptors encoded intermolecular interactions such as electrostatic interactions.39 A special type of CPSA descriptor was calculated for the sole purpose of encoding hydrogen bonding information. Since the IMS measurements were carried out in air, the assumption was made that hydrogen bonding could occur only when a structure had a donatable hydrogen atom. Because the water content was held low during the IMS measurements, clustering of water around the analytes was thought to be negligible. After calculation of the 158 descriptors, an automated descriptor screening routine was used to reduce the number of descriptors. If a descriptor had >90% identical values, it was removed from further consideration. When two descriptors were pairwisecorrelated with each other at or above r ) 0.93, one of them was removed. This procedure was effective in eliminating 80 descriptors, leaving a reduced descriptor pool with 78 members. Based on the guidelines put forth by Topliss and Edwards,40 the number of independent variables should be no more than 60% the number of observations, and this condition was met. Multiple Linear Regression Analysis. Multiple linear regression models that linked K0 values to the structures were developed using subsets of descriptors selected from the reduced descriptor pool. In matrix notation, the coefficients for a given model can be calculated using the following equation:

β ) (XTX)-1XTY

(5)

where β is the coefficient vector, X is the descriptor subset matrix, and Y is the vector of dependent variables. (27) Dixon, S. L.; Jurs, P. C. J. Comput. Chem. 1994, 15, 733. (28) Kier, L. B. Quant. Struct.-Act. Relat. Pharmacol., Chem. Biol. 1986, 5, 1. (29) Randic´, M.; Brissey, G. M.; Spencer, R. B.; Wilkins, C. L. Comput. Chem. 1979, 3, 5. (30) Wiener, H. J. Am. Chem. Soc. 1947, 69, 17. (31) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Research Studies: Hertfordshire, England, 1986. (32) Randic´, M. J. Chem. Inf. Comput. Sci. 1984, 24, 164. (33) Miller, K. J.; Savchik, J. A. J. Am. Chem. Soc. 1979, 101, 7206. (34) Abraham, R. J.; Smith, P. E. J. Comput. Chem. 1988, 9, 288. (35) Dixon, S. L.; Jurs, P. C. J. Comput. Chem. 1992, 13, 492. (36) Goldstein, H. Classical Mechanics; Addison-Wesley: Reading, MA, 1950; pp 144-156. (37) Bondi, A. J. Phys. Chem. 1964, 68, 441. (38) Pearlman, R. S. In Physical Chemistry Properties of Drugs; Sinkula, A. A., Valvani, S. C., Eds.; Quantum Chemistry Program Exchange No. 413; Marcel Dekker, Inc.: New York, 1980; Chapter 10. (39) Stanton, D. T.; Jurs, P. C. Anal. Chem. 1990, 62, 2322. (40) Topliss, J. G.; Edwards, R. P. J. Med. Chem. 1979, 22, 1238.

4240

Analytical Chemistry, Vol. 68, No. 23, December 1, 1996

A simulated annealing (SA) feature selection routine and a genetic algorithm (GA) feature selection routine were used to find good descriptor subsets.41-43 Each method performed a directed search of the descriptor space in order to determine an optimal subset of descriptors to be used in a linear regression model. The driving force behind both algorithms is the continuous reduction of the root-mean-square (rms) errors from subset to subset. Subsets of descriptors that give lower rms errors are favored over subsets that produce inflated rms errors. Computational Neural Networks. Neural networks were originally designed to mimic the activity of a system of neurons. There are many different types of neural networks, but the type that has been most useful for structure-property relationships is the fully connected, feed-forward neural network. This network consists of a multilayer system of neurons, with each neuron in a given layer fully connected to all neurons in the two adjacent levels. The objective of a neural network is to map a set of input data to a particular set of output data. In this case, molecular structure descriptors served as input data, and the K0 values served as output data. The connections between neurons are known as weights. A neural network is trained to map a set of input data to a corresponding set of output data by iterative adjustment of the weights. In this study, our networks were trained using a quasi-Newton optimization algorithm. Detailed discussions of the type of neural network and the training algorithm used in this study have been published previously.19,26 RESULTS AND DISCUSSION Regression Analysis. Both the GA and SA feature selection routines were effective in finding a high-quality, six-descriptor model with an rms error of 0.047 K0 unit. This preliminary model was then subjected to a series of tests to determine the overall quality of the model. The model was first examined for any possible outliers among the calculated K0 values. It was determined that there were no outliers using the following six statistical tests: residual, standardized residual, studentized residual, leverage, DFFITS statistic, and Cooks distance.44 If four of the six tests exceeded their cutoff values for a given K0 value, the corresponding compound was considered a potential outlier. One compound was flagged as a potential outlier, 2-ethoxyethyl acetate. However, after this compound was removed from the training set and the model coefficients were recalculated, it was clear that the compound had negligible influence on the model coefficients and the rms error. Therefore, the compound was left in the training set. Several additional validation steps were performed. The six descriptors in the final model were checked to ensure that their variance inflation factors (VIF) were low. The VIF is calculated from 1/(1 - R2), where R is the multiple correlation coefficient generated by regressing each descriptor against the others. VIF values over 10 may be an indication of multicollinearity problems.45 For this model, the highest VIF was 9.66, and the average VIF for all six descriptors was 4.24. A plot of residuals vs calculated (41) Bohachevsky, I. O.; Johnson, M. E.; Stein, M. L. Technometrics 1986, 28, 209. (42) Sutter, J. M.; Dixon, S. L.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 1995, 35, 77. (43) Leardi, R.; Boggia, R.; Terrile, M. J. Chemom. 1992, 6, 267. (44) Belsley, D. A.; Kuh, E.; Welsch, R. E. Regression Diagnostics; Wiley: New York, 1980. (45) Stanton, D. T.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 1991, 31, 301.

Table 2. Final Model for the Prediction of Reduced Mobility Constants, K0a coefficient

SE of coeff.

0.810

5.24 ×

-1.55 × -0.131 -9.67 × 10-2 4.76 × 10-2 10-2

10-2

2.25 × 5.81 × 10-3 1.18 × 10-2 5.63 × 10-3 10-3

label QNEG KAPA 3 V1 NO WTPT 3

-1.39 × 10-2 2.21 × 10-3 2SP2 2.48 a

2.38 × 10-2 Y-intercept

descriptor definition charge on most negative atom Kier shape index path 1 valence connectivity number of oxygens sum of path weights from heteroatoms number of secondary sp2 carbon atoms

RMS error ) 0.047 K0 units ; N ) 150 compounds; F ) 174.5.

Figure 1. Plot of calculated K0 vs observed K0 values for the 150member linear regression training set.

Figure 3. Plot of calculated K0 vs observed K0 values for the 135member neural network training set and the 15-member neural network cross-validation set.

Figure 2. Plot of predicted K0 vs observed K0 values for the 18member linear regression external prediction set.

K0 values produced no observable patterns. A plot of calculated K0 vs observed K0 values is shown in Figure 1. The plot has no visual deficiencies. The final validation step was to use the model to predict the K0 values for the 18 compounds of the external prediction set. The rms error was 0.040 K0 unit for the prediction set compounds, a solid validation of the linear model. Figure 2 displays a plot of predicted K0 vs observed K0 values for the prediction set. Table 2 shows the descriptors and coefficients of the linear model. Interestingly, none of the six descriptors in this model are dependent on geometry. This was found to be true in the previous study as well.19 The fact that no geometric descriptors are present in this model is by no means discouraging. The advantage of having a model with no geometric descriptors is that the model can be used for prediction without the necessity of generating three-dimensional structures for the compounds being predicted. Neural Network Analysis. The six descriptors in the linear model were then used to develop a neural network model. The

150-member linear training set was further split randomly into a 15-member cross-validation (CV) set and a 135-member network training set. The same 18-member external prediction set used in the linear portion of the study was also used in this part of the study to validate the neural network model. Network architectures ranging from 6:2:1 (six input neurons, two hidden neurons, and one output neuron) to 6:5:1 were investigated. It was determined that a 6:4:1 network (33 adjustable parameters) minimized the total number of adjustable parameters without sacrificing network quality. The ratio of observations to total adjustable parameters should be at or above 2.0 to avoid chance effects in the network.46 For this study, the ratio of observations to adjustable parameters was 135/33, or about 4.1, well above the recommended minimum value. Once the network architecture was established, a generalized simulated annealing (GSA)47 routine was used to determine an optimal starting point for the neural network. The error surface of a neural network is an ndimensional error surface (where n is the number of adjustable parameters, here 33) with an unknown functional form. There are expected to be several local minimum on the error surface. (46) Livingstone, D. J.; Manallack, D. T. J. Med. Chem. 1993, 36, 1295. (47) Sutter, J. M.; Jurs P. C. In Data Handling in Science and Technology (Vol. 15). Adaption of Simulated Annealing to Chemical Optimization Problems; Kalivas, J. H., Ed.; Elsevier Science B. V.: Amsterdam, 1995; Chapter 5.

Analytical Chemistry, Vol. 68, No. 23, December 1, 1996

4241

Figure 4. Plot of predicted K0 vs observed K0 values for the 18member neural network external prediction set.

The hope is to find a starting point on this error surface that leads to a local minima which produces a well-behaved network with a low rms error. Since it is virtually impossible to locate the global minimum of the complex network error surface, the GSA routine searches the error surface, looking for starting points near good local minima. The GSA routine was effective in finding a good starting point for the neural network. A full training from the starting point was performed, and the cross-validation set rms error was monitored during training to determine an optimal stopping point. This occurred when the cross-validation set rms error no longer continued to decrease in value and began to inflate. Beyond this point, it was assumed that the network was mapping the noise from the training set and was losing its ability to generalize. Therefore, training was terminated in the neighborhood of the lowest rms error for the cross-validation set. The weights and biases of the neural network at the point of termination were then tested to determine their overall quality. The training set rms error was 0.039 K0 units, which was a 17% improvement over the linear model. For the cross-validation set, the rms error was 0.040 K0 unit. A plot of calculated K0 vs observed K0 values is shown in Figure 3. The 18-member external prediction set was then used to validate the neural network model. The prediction set rms error was 0.038 K0 unit. Figure 4 shows a plot of predicted K0 vs observed K0 values for the external prediction set. The neural network model clearly performs better than the linear regression model. This is primarily due to the network’s larger number of adjustable parameters and the ability of the network to link the descriptors to the K0 values in a nonlinear manner. Descriptor Interpretation. The six descriptors in the model can be helpful in advancing the understanding of IMS. However, it should be noted that the descriptors in the model do not represent a causal relationship between K0 and structure; rather, they represent a correlation. It is also important to realize that the descriptors are not working independently of each other to link K0 to structure. Although it is statistically desirable for there to be little or no correlation between the descriptors in the model, 4242 Analytical Chemistry, Vol. 68, No. 23, December 1, 1996

this is not easy to achieve, even for linear models, and becomes increasingly difficult when using neural networks. Therefore, the descriptors are working in concert to a certain degree, and this further complicates the interpretation of the results. Five of the six descriptors in the linear and neural network models were topological, and the remaining descriptor was electronic. The identities of the descriptors are shown in Table 2. The charge on the most negative atom (QNEG), the sole electronic descriptor, the sum of the path weights from heteroatoms (WTPT 3), and the number of oxygens (NO) are all acting as functional group indicator variables. The most simple is, of course, the count of oxygens. The other two descriptors (QNEG and WTPT 3) systematically classify the compounds by functional groups, but do they do so in a smoother transitory manner than does the number of oxygens. Although the calculation of WTPT 3 is somewhat complicated,32 it is still acting as an indicator of heteroatoms, since its value would be zero for any hydrocarbon. The three remaining topological descriptors encode molecular size and shape. The descriptor labeled 2SP2 is simply the total number of sp2-hybridized carbon atoms that are attached to two other carbons and one hydrogen in the structure. The number of secondary sp2 carbons is probably encoding aromatic or double bond information about the structures from a simple presence or absence perspective. Protonation sites on the structures may also be encoded with this descriptor, although the three heteroatom indicator variables described previously could also account for this. KAPA-3, a Kier shape index based on the three-bond paths in the molecule,28 and V1, the path-1 valence connectivity, are indexes that encode molecular size and shape. The collisional cross section value Ω (eq 3), from fundamental ion mobility theory, is dependent on the shape and size of the ion.7,19 These two descriptors are likely related to the Ω value. The compounds in this study were all modeled as neutral molecules. The routine used to model the compounds, EHNDO, does have the ability to perform an open-shell geometry optimization.27 Each compound was assigned a charge of 1+ (i.e., an electron was removed), and the geometry of the resulting ion was optimized. Once the geometries of the ions were optimized, the same procedure described in the Experimental Section was performed using the charged species of the same data set. Indeed, the results obtained were far worse than the models described above. In fact, the best neural network model with charged species was only as good as the linear model for neutral species, with an rms error of about 0.047 K0 unit. The actual structure of ions inside the drift tube is not clear. Those compounds with high proton affinities, such as alcohols or amines, are likely protonated (MH+).48 However, alkanes may actually undergo a hydride abstraction reaction to form (M - 1)+ or (M - 3)+ species.49 Regardless of this uncertainty, it is difficult to model ions accurately, and this undoubtedly contributes to poor model performance. CONCLUSIONS A neural network model that accurately predicts the K0 values for a set of 168 diverse organic compounds has been developed. Information concerning those structural features that influence (48) Karpas, Z.; Bell, S. E.; Wang, Y.; Walsh, M.; Eiceman, G. A. Struct. Chem. 1994, 5, 135. (49) Bell, S. E.; Ewing, R. G.; Eiceman, G. A.; Karpas, Z. J. Am. Soc. Mass Spectrom. 1994, 5, 177.

K0 is also presented. This information tends to reinforce the observations found in ref 19 concerning K0 and structure. The data set used in this study is a more thorough representation of the diversity of common organic structure. Also, the quality of the data is high, since all K0 values were measured under standardized operating conditions. The model has six descriptors, and all six are independent of geometry. As a result, only knowledge of the two-dimensional structure of a compound is necessary in order to accurately predict a K0 value for the compound. Another benefit of the model is that knowledge of complex ion structure is not necessary for K0 prediction, since

neutral molecules were used in developing the six-descriptor model. ACKNOWLEDGMENT This work was supported by U.S. Army Research Office Grant No. DAAH04-95-1-0597. Received for review May 9, 1996. Accepted September 6, 1996.X AC960466T X

Abstract published in Advance ACS Abstracts, October 15, 1996.

Analytical Chemistry, Vol. 68, No. 23, December 1, 1996

4243