Anal. Chem. 1994, 66, 2480-2487

Prediction of Reduced Ion Mobility Constants from Structural Information Using Multiple Linear Regression Analysis and Computational Neural Networks

Matthew D. Wessel and Peter C. Jurs*

Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, University Park, Pennsylvania 16802

Multiple linear regression analysis and computational neural networks are used to develop models that predict reduced ion mobility constants (K0) from quantitative structural information encoded as descriptors. The errors associated with the models are similar to the calculated experimental error of ~0.040 K0 units. The best regression model contains five descriptors and has a multiple correlation coefficient (R) value of 0.991 and a standard deviation of 0.0469 K0 units. The neural network model utilizes the same five descriptors and has a root mean square (RMS) error of 0.0393 K0 units. The descriptors encode molecular size, weight, functional group, and structural classifications.

Ion mobility spectrometry (IMS) involves the study of the characteristic behavior of a gas-phase ion in an applied electric field at ambient pressure. A sensitive and selective instrumental technique, IMS is applicable to many analytical problems.1,2 The detection of drugs, chemical warfare agents, explosives, and environmental pollutants using IMS has been reported.3 IMS may be used as a stand-alone monitor/detector or as a detector for gas chromatography (GC) and other chromatographic techniques.4,5 Portable GC/IMS units have been developed for use in the field.1,6 Ion mobility K is the ratio of the ion drift velocity u to the applied electric field E:

K = u/E    (1)

The ion will accelerate in the direction of the electric field. However, the drift velocity of the ion is held constant due to collisions with a neutral drift gas, such as nitrogen. Since IMS is carried out in an enclosed region of fixed length, the above relation may be rewritten as

K = L^2/(Vt)    (2)

where L is the ion drift length, V is the drop in voltage across L, and t is the drift time.

(1) Eiceman, G. A. Crit. Rev. Anal. Chem. 1991, 22, 471. (2) St. Louis, R. H.; Hill, H. H., Jr. Crit. Rev. Anal. Chem. 1990, 21, 321. (3) Hill, H. H., Jr.; Siems, W. F.; St. Louis, R. H.; McMinn, D. G. Anal. Chem. 1990, 62, 1201A. (4) Karpas, Z.; Pollevoy, Y.; Melloul, S. Anal. Chim. Acta 1991, 249, 503. (5) Karasek, F. W.; Hill, H. H., Jr.; Kim, S. H.; Rokushika, S. J. Chromatogr. 1977, 135, 329. (6) Snyder, A. P.; Harden, C. S.; Brittain, A. H.; Kim, M.; Arnold, N. S.; Meuzelaar, H. L. C. Anal. Chem. 1993, 65, 299.


Revercomb and Mason review the fundamental theory that describes ion mobility on a molecular scale.7 At the molecular level, K is dependent on several factors:

K = (3q/16N)[2π(m + M)/(mMkT)]^(1/2)(1/Ω)    (3)

In this equation, q is the charge on the ion, N is the number density of the drift gas, k is the Boltzmann constant, T is temperature (in Kelvin), m is the ion mass, M is the mass of the drift gas, and Ω is the ion collision cross section. When only singly charged positive molecular ions and constant operating conditions are considered, K is only dependent on ion mass and collision cross section.7 Since Ω is determined by the size, shape, and polarizability of an ion, it follows that a direct connection between structure and K exists and can be investigated.

Since the early 1970s, efforts have been made to understand the physical and chemical processes that occur in IMS.3 There have been several investigations on the effects of ion mass and structure on mobility. Correlations between mobility and mass are fairly accurate for homologous series of compounds. IMS studies of ketones and amines have shown this to be true.10,11 However, when mass is correlated to mobility for nonhomologous compounds, other structural features become important, and the simple correlation breaks down. It is possible to gather some structural information by examining the ion mobility spectrum.12-16 Because of its complex dependence on structure, accurate predictions of K have yet to be reported.1 Through methods that describe the structural features of ions quantitatively, it may be possible to accurately predict ion mobilities.

The ion mobility K is typically reported as a reduced ion mobility constant K0. The relation between K and K0 is

K0 = K(273/T)(P/760)    (4)

where T is temperature and P is pressure.
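As a numerical illustration of eqs 2 and 4, the short Python sketch below converts a drift time measured in a tube of known length and voltage drop into a mobility and then into a reduced mobility; the drift length, voltage, drift time, temperature, and pressure shown are arbitrary example values, not measurements from this work.

def mobility_from_drift(L_cm, V_volts, t_s):
    # Eq 2: K = L^2 / (V t), in cm^2 V^-1 s^-1
    return L_cm ** 2 / (V_volts * t_s)

def reduced_mobility(K, T_kelvin, P_torr):
    # Eq 4: K0 = K (273/T)(P/760), normalized to 273 K and 760 Torr
    return K * (273.0 / T_kelvin) * (P_torr / 760.0)

# Arbitrary example: 8 cm drift region, 2000 V drop, 20 ms drift time,
# measured at 423 K (150 degrees C) and 700 Torr
K = mobility_from_drift(8.0, 2000.0, 0.020)
K0 = reduced_mobility(K, 423.0, 700.0)
print(round(K, 3), round(K0, 3))   # K = 1.6; K0 is about 0.95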

(7) Revercomb, H. E.; Mason, E. A. Anal. Chem. 1975, 47, 970. (8) Griffin, G. W.; Dzidic, I.; Carroll, D. I.; Stillwell, R. N.; Horning, E. C. Anal. Chem. 1973, 45, 1204. (9) Lin, S. N.; Griffin, G. W.; Horning, E. C.; Wentworth, W. E. J. Chem. Phys. 1974, 60, 4994. (10) Benezra, S. A. J. Chromatogr. Sci. 1976, 14, 122. (11) Karasek, F. W.; Kim, S. H.; Rokushika, S. Anal. Chem. 1978, 50, 2013. (12) Hagen, D. F. Anal. Chem. 1979, 51, 872. (13) Karpas, Z.; Stimac, R. M.; Rappoport, Z. Int. J. Mass Spectrom. Ion Processes 1988, 83, 163. (14) Karpas, Z. Anal. Chem. 1989, 61, 684. (15) Karpas, Z. Int. J. Mass Spectrom. Ion Processes 1991, 107, 435. (16) Karpas, Z.; Pollevoy, Y. Anal. Chim. Acta 1992, 259, 333.


Figure 1. Architecture and function of a fully connected, feed-forward computational neural network.

The direct comparison of K0 values measured under different conditions is made possible by using this normalization.

Presented here is a computer-assisted study of the relation between K0 and structure. Quantitative structure-property relationship (QSPR) methods including multiple linear regression analysis and computational neural networks are utilized. The structural features of a given compound are encoded numerically in what are called descriptors. The descriptors are then used to build models that accurately predict K0 for various organic compounds. Real structural parameters that influence K0 can be linked to the descriptors. These structural attributes may be considered to encode the elusive collision cross section element Ω of fundamental ion mobility theory.

THEORY

Computational neural networks implement complex nonlinear mathematical functions and were originally designed to mimic the activity of a system of neurons.17 The operation of a neural network is straightforward. A set of input data, in this case molecular descriptor values, is fed to the network. The network processes this information and produces a set of output values which is compared to an associated set of target values. For this study the target values are the experimental K0 values for the compounds studied. The goal in any QSPR study is to train the neural network to generate output values that fit the target values with a high degree of accuracy. Since neural networks implement nonlinear functions, the process of training them is akin to nonlinear regression. Therefore, the models built are nonlinear regression models and shall henceforth be referred to as neural network models.

Figure 1 illustrates the architecture of a fully connected, feed-forward, three-layer neural network. In this figure, the circles represent individual neurons and the lines connecting them are adjustable weights. The weight w_ij connects neuron i in the previous layer to neuron j in the current layer.

(17) Zupan, J.; Gasteiger, J. Anal. Chim. Acta 1991, 248, 1.

This particular network has one input, one hidden, and one output layer, but there may be any number of hidden layers. In this study, the input layer utilizes the descriptor values associated with the best model from multiple linear regression analysis. The values are restricted to the interval (0,1) by means of a linear transformation. The input layer does nothing more than supply the network with the transformed descriptor values, as shown in Figure 1. In each hidden layer neuron, a weighted sum of outputs from the previous layer is combined with an adjustable bias term θ_j to generate Net_j. A nonlinear activation function operates on Net_j and computes the output value, Out_j, for that neuron. This procedure is demonstrated in the enlarged neuron on the right side of Figure 1. An output layer neuron works in exactly the same way as a hidden layer neuron, and it generates the output value. The value computed in the output layer is then transformed back to the appropriate range and compared to the target value. An optimization algorithm is employed to adjust the network weights and biases to produce a final set of output values that match the target values as closely as possible for all compounds. This training procedure associates the input data (descriptors) with the target values (K0) for a set of compounds. The network's ability to predict K0 during training is measured by the sum squared error over each output-target value pair:

E = Σ_p (t_p - o_p)^2

where t_p is the target value (K0) for the pth compound and o_p is the current estimate of the target value for the pth compound. By reduction of the sum squared error, the network can be trained to increase its predictive ability. The method most commonly used to train neural networks is the back-propagation (BP) algorithm.18,19

(18) Jansson, P. A. Anal. Chem. 1991, 63, 357A. (19) Xu, L.; Ball, J. W.; Dixon, S. L.; Jurs, P. C. Environ. Toxicol. Chem., in press.
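As a concrete sketch of the forward pass and error just described (not the QNET code itself), the Python fragment below evaluates a fully connected 5-3-1 network with sigmoid activations and computes the sum squared error; the weights, biases, descriptor values, and target are random placeholders rather than fitted parameters from this study.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(net):
    # nonlinear activation applied to Net_j of each hidden and output neuron
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W1, b1, W2, b2):
    hidden = sigmoid(W1 @ x + b1)      # weighted sum plus bias theta_j, then activation
    return sigmoid(W2 @ hidden + b2)   # the output neuron works the same way

def sum_squared_error(targets, outputs):
    # E = sum over compounds p of (t_p - o_p)^2
    return float(np.sum((targets - outputs) ** 2))

x = rng.uniform(0.1, 0.9, size=5)                      # five descriptors scaled to (0, 1)
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)   # input-to-hidden weights and biases
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # hidden-to-output weights and bias
out = forward(x, W1, b1, W2, b2)
print(out, sum_squared_error(np.array([0.5]), out))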


However, BP is essentially a steepest descent technique and is therefore very inefficient compared to other gradient-based methods. As a result, the technique used in this study to train the network is the Broyden, Fletcher, Goldfarb, Shanno (BFGS) quasi-Newton training algorithm.20-24 In quasi-Newton methods, the error E_{k+1} and gradient g_{k+1} at cycle k + 1 are assumed to be expressible as truncated Taylor series in the parameters x:

E_{k+1} ≈ E_k + g_k^T Δx_k + (1/2) Δx_k^T H_k Δx_k    (6)

g_{k+1} ≈ g_k + H_k Δx_k    (7)

where g_k^T is the transpose of g_k, H_k is the Hessian matrix, and Δx_k = x_{k+1} - x_k for the k to k+1 cycles of the optimization. The parameters x in a neural network are the weights and biases, and the gradient vector g contains the partial derivatives of E with respect to each weight and bias. The point at which a minimum occurs is characterized by g_{k+1} = 0. Substituting this into eq 7 gives the Newton step formula:

Δx_k = -H_k^{-1} g_k    (8)

Since it is not practical to calculate the inverse Hessian matrix, H_k^{-1}, analytically at each step in the optimization, an approximation of H_k^{-1}, denoted G_k, is made and iteratively updated. With the estimate G_k of the inverse Hessian matrix, a search direction d_k is generated. A line minimization is then performed to find an optimal step size α_k in the direction d_k. The value of α_k may be estimated by assuming that E depends parabolically on α along d_k:

E(x_k + α d_k) ≈ E(x_k) + aα + bα^2    (9)

The parameters a and b may be determined from the slope at α = 0, which is given by d_k^T g_k, and from the error E(x_k + s d_k), where s is an appropriately sized step, usually α_{k-1}. The minimum along d_k is then given by α_k = -a/(2b). The steps taken in a BFGS quasi-Newton optimization algorithm are summarized below:

1. Choose the search direction d_k using d_k = -G_k g_k.
2. Estimate the step size α_k to minimize E(x_k + α d_k).
3. Using α_k, set x_{k+1} = x_k + α_k d_k.
4. Determine g_{k+1} for the new parameters x_{k+1}.
5. Update the inverse Hessian estimate G with the BFGS formula, where Δg_k = g_{k+1} - g_k:

G_{k+1} = G_k - (G_k Δg_k Δx_k^T + Δx_k Δg_k^T G_k)/(Δx_k^T Δg_k) + [1 + (Δg_k^T G_k Δg_k)/(Δx_k^T Δg_k)](Δx_k Δx_k^T)/(Δx_k^T Δg_k)

6. Iterate.
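The loop below is a minimal Python sketch of the six steps above, applied to a generic quadratic error function with numerical gradients instead of an actual neural network; the error surface, initial step size, and stopping tolerance are illustrative choices only, not those used with QNET.

import numpy as np

def error(x):
    # placeholder error surface standing in for the network's sum squared error
    return float(np.sum((x - np.array([1.0, -2.0, 0.5])) ** 2) + 0.1 * np.sum(x ** 2))

def gradient(f, x, h=1e-6):
    # central-difference estimate of dE/dx for each adjustable parameter
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def bfgs(f, x, cycles=50):
    G = np.eye(x.size)                   # initial estimate of the inverse Hessian
    g = gradient(f, x)
    alpha = 0.1                          # starting trial step for the line search
    for _ in range(cycles):
        d = -G @ g                       # step 1: search direction
        slope = d @ g                    # dE/d(alpha) at alpha = 0
        trial = f(x + alpha * d)         # step 2: fit E ~ E0 + a*alpha + b*alpha^2
        a, b = slope, (trial - f(x) - slope * alpha) / alpha ** 2
        alpha = -a / (2 * b) if b > 0 else alpha
        x_new = x + alpha * d            # step 3: take the step
        g_new = gradient(f, x_new)       # step 4: new gradient
        dx, dg = x_new - x, g_new - g
        denom = dx @ dg
        if abs(denom) > 1e-12:           # step 5: BFGS update of the inverse Hessian
            G = (G
                 - (np.outer(G @ dg, dx) + np.outer(dx, dg @ G)) / denom
                 + (1 + (dg @ G @ dg) / denom) * np.outer(dx, dx) / denom)
        x, g = x_new, g_new              # step 6: iterate
        if np.linalg.norm(g) < 1e-8:
            break
    return x

print(bfgs(error, np.zeros(3)))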

(20) Broyden, C. G. J. Inst. Math. Its Appl. 1970, 6, 76-90, 222-231. (21) Fletcher, R. Comput. J. 1970, 13, 317. (22) Goldfarb, D. Math. Comput. 1970, 24, 23. (23) Shanno, D. F. Math. Comput. 1970, 24, 647. (24) Fletcher, R. Practical Methods of Optimization; Wiley: New York, 1980; Vol. 1.


The set of compounds that the neural network uses for training is denoted the training set. The training set provides the network with descriptor values and associated K0 values for a large portion of the entire set of compounds studied. While learning, the network will predict K0 values for the training set with a continual decrease in the sum squared error. However, there is a point during any training session at which the network will begin to estimate target values for compounds not contained in the training set with increasing error. At this point, the network has overtrained.25 Although the error in predicting the K0 values for the training set will continue to decrease with each cycle, the network is at a point where it has learned the specifics of the training set and cannot relate a new set of descriptor values to the corresponding target value with any acceptable degree of accuracy. The method of cross validation (CV) is used to prevent this problem. Once a training set is determined, a small fraction (~10-20%) of compounds is held out as a cross validation set. The CV set is used to monitor the progress of training. At regular intervals in the training procedure, the network's ability to predict the K0 values for the CV set is checked. Since the CV set is not utilized in training, it can effectively monitor the predictive capacity of the network and provides a means for terminating network training at the optimum cycle. An external prediction set (usually 10% of the entire data set) is also used as a final validation step for the network weights and biases as selected by the lowest CV set error.

EXPERIMENTAL SECTION

The procedure followed for this study is outlined in the flow chart in Figure 2. Molecular modeling, descriptor generation and analysis, and statistical analysis were performed using the ADAPT software system on a Sun 4/110 workstation.26,27 All neural network computations were carried out on a DEC 3000 AXP Model 500 workstation.

Data Set. For this study, 77 organic compounds were used (Table 1). The reduced ion mobility constant data were taken from Shumate, St. Louis, and Hill and references contained therein.28 Only singly charged positive molecular ions were considered, and the drift gas was restricted to nitrogen. A reported temperature window of 140-150 °C was chosen, and the ionization source was NF3 for all compounds in the data set. Since compounds were selected on the basis of similar operating conditions, the only variables that affect K0 for this data set are mass and collision cross section.2 For the purpose of model validation, the 77 compounds in the data set were randomly split into a training set of 70 compounds and an external prediction set of 7 compounds.

Uncertainties. A possible problem with the methodology used in this study concerns the overfitting of data.

(25) Hecht-Nielsen, R. Neurocomputing; Addison-Wesley: Reading, MA, 1990; p 115. (26) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. Computer-Assisted Studies of Chemical Structure and Biological Function; Wiley-Interscience: New York, 1979. (27) Jurs, P. C.; Chou, J. T.; Yuan, M. In Computer-Assisted Drug Design; Olson, E. C., Christoffersen, R. E., Eds.; The American Chemical Society: Washington, DC, 1979; pp 103-129. (28) Shumate, C.; St. Louis, R. H.; Hill, H. H., Jr. J. Chromatogr. 1986, 373, 141.

Figure 2. Procedural flow diagram for an ADAPT QSPR study.

To account for this, the references for each compound (given by Shumate, St. Louis, and Hill28) were examined to determine the experimental error associated with the K0 value given for each compound. In some cases, standard propagation of error techniques were necessary. The errors were combined and averaged. The mean error for all compounds was approximately ±0.040 K0 units. Therefore, 0.040 was used as a guide for the lower error bound in order to prevent overfitting of the data.

Structure Entry and Modeling. A graphics program was used to sketch in the structure of each member in the data set. The structures were stored in the form of connection tables representing atom and bond types. In theory, K0 depends on a number of parameters, including size and shape. As a result, accurate 3D representations of the compounds in the data set were thought to be necessary. Two molecular modeling programs were used to arrive at a minimum energy conformation for each structure in the data set. MM2, which uses a molecular mechanics force field method, was used to model 63 of the 77 compounds.29 However, MM2 lacked the necessary parameters to model the remaining 14 compounds. MOPAC, which uses a semiempirical molecular orbital method, was used to model these compounds.30 While it is true that the lowest energy conformations arrived at by the modeling algorithms were for neutral molecules rather than positive ions, investigations into the structure of ions and neutral molecules showed that the differences were indeed small. MOPAC was used to compare the optimized geometry for a few compounds as ions and as neutrals. Since a majority of the compounds were modeled with MM2, this comparison is not possible, or even necessary, because MM2 is not capable of distinguishing between charged and neutral molecules.

(29) Burkert, U.; Allinger, N. L. Molecular Mechanics; ACS Monograph Series; American Chemical Society: Washington, DC, 1982. (30) Dewar, M. J. S.; Thiel, W. Quantum Chemistry Program Exchange No. 455 (Version 6.0). J. Am. Chem. Soc. 1977, 99, 4899.

Descriptor Generation. The descriptors calculated can be grouped into the categories of topological, geometric, electronic, and charged-partial surface area (cpsa). Topological descriptors encode structural features such as branching, path counts, atom counts, and weighted path counts.31-35 Geometric descriptors include molecular volume, moments of inertia, and solvent-accessible surface area.36-38 The electronic descriptors provide partial charges of atoms in the structure, dipole moments, and polarizability.39-41 Cpsa-type descriptors are combinations of the solvent-accessible surface area and the electronic descriptors.42 Of the 126 total descriptors that were generated, 79 were topological, 17 were geometric, 5 were electronic, and the remaining 25 were cpsa descriptors.

Descriptor Analysis. The calculated descriptors of the 70-member training set were analyzed for their information content. The methods used are known as objective feature selection because the dependent variable is not used in the analysis. They are designed to remove descriptors from an original descriptor pool on the basis of lack of information or similarity among descriptors. Any descriptor with greater than 70% zero or identical values was eliminated. Such descriptors probably encode little or no useful information. The pairwise correlations of descriptors were also checked. If two descriptors were highly correlated (r > 0.92), one of them was removed from the pool. Simplicity and ease of interpretation were used to form a basis for choosing which descriptor to retain between the two. These methods were effective in removing 75 descriptors. Vector space descriptor analysis (vsda) was employed to further reduce the descriptor pool.43 A Gram-Schmidt orthogonalization procedure was used to rank descriptors on the basis of mutual orthogonality.44 Through this method, only information-rich descriptors were retained. In vsda, the descriptors were treated as vectors. A descriptor highly correlated with the dependent variable was chosen as the initial basis vector. The vector with the largest orthogonal component to the basis vector was then added to the basis. The routine proceeded in a stepwise fashion. The most orthogonal descriptor was added at each step, and a new basis of higher dimension was created. The process was terminated when the projection angle between the descriptor to be added at any step and the current descriptor space was less than 1°. This procedure was repeated several times using different descriptors as the initial basis vector. The same set of roughly 40 descriptors was found after each attempt.

(31) Kier, L. B. Quant. Struct.-Act. Relat. Pharmacol., Chem. Biol. 1986, 5, 1. (32) Randić, M.; Brissey, G. M.; Spencer, R. B.; Wilkins, C. L. Comput. Chem. 1979, 3, 5. (33) Wiener, H. J. Am. Chem. Soc. 1947, 69, 17. (34) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Research Studies: Hertfordshire, England, 1986. (35) Randić, M. J. Chem. Inf. Comput. Sci. 1984, 24, 164. (36) Goldstein, H. Classical Mechanics; Addison-Wesley: Reading, MA, 1950; pp 146-156. (37) Bondi, A. J. Phys. Chem. 1964, 68, 441. (38) Pearlman, R. S. In Physical Chemical Properties of Drugs; Sinkula, A. A., Valvani, S. C., Eds.; Quantum Chemistry Program Exchange No. 413; Marcel Dekker, Inc.: New York, 1980; Chapter 10. (39) Miller, K. J.; Savchik, J. A. J. Am. Chem. Soc. 1979, 101, 7206. (40) Abraham, R. J.; Smith, P. E. J. Comput. Chem. 1988, 9, 288. (41) Dixon, S. L.; Jurs, P. C. J. Comput. Chem. 1992, 13, 492. (42) Stanton, D. T.; Jurs, P. C. Anal. Chem. 1990, 62, 2322. (43) Topliss, J. G.; Edwards, R. P. J. Med. Chem. 1979, 22, 1238. (44) Bradley, G. L. A Primer of Linear Algebra; Prentice-Hall, Inc.: Old Tappan, NJ, 1975.
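A minimal Python sketch of the two objective feature-selection filters described above (mostly constant values and high pairwise correlation) is given below; the descriptor matrix is a random placeholder, and when two descriptors are highly correlated the sketch simply keeps the one encountered first, whereas the actual choice here was based on simplicity and ease of interpretation.

import numpy as np

def objective_feature_selection(X, identical_frac=0.70, r_cut=0.92):
    # X has one row per compound and one column per descriptor
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # drop descriptors whose most common value covers more than 70% of compounds
        _, counts = np.unique(col, return_counts=True)
        if counts.max() / col.size > identical_frac:
            continue
        # drop one member of any pair with |r| > 0.92 (here, simply the later one)
        if any(abs(np.corrcoef(col, X[:, k])[0, 1]) > r_cut for k in keep):
            continue
        keep.append(j)
    return keep

# placeholder matrix: 70 compounds x 6 descriptors
rng = np.random.default_rng(1)
X = rng.normal(size=(70, 6))
X[:, 1] = 0.0                                           # mostly-constant descriptor
X[:, 2] = 2.0 * X[:, 0] + 0.01 * rng.normal(size=70)    # nearly collinear with column 0
print(objective_feature_selection(X))                   # columns kept after both filters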

Table 1. Compounds and Associated K0 Values Used in This Study.a

no.  compound                  K0     no.  compound             K0
1    didodecylamine            0.89   40   N-methylaniline      1.77
2    methyl stearate           1.02   41   propiophenone        1.74
3    tri-n-hexylamine          1.11   42   benzylamine          1.78
4    methyl myristate          1.15   43   aniline              1.83
5    methylamine               2.53   44   naphthalene          1.80
6    tribenzylamine            1.20   45   acetophenone         1.82
7    dimethylamine             2.40   46   azulene              1.82
8    methyl laurate            1.24   47   n-pentylamine        1.82
9    n-dodecylamine            1.26   48   2,4,6-collidine      1.84
10   ethylamine                2.39   49   indane               1.84
11   di-n-hexylamine           1.34   50   indene               1.93
12   decylamine                1.35   51   indoline             1.87
13   methyl caprate            1.36   52   m-cresol             1.95
14   4-chlorobenzophenone      1.37   53   mesitylene           1.86
15   lutidine                  1.91   54   di-n-propylamine     1.88
16   3,4-dimethylbenzophenone  1.41   55   m-toluidine          1.92
17   tri-n-butylamine          1.42   56   m-xylene             1.98
18   dibenzylamine             1.43   57   diisopropylamine     1.91
19   4-aminobenzophenone       1.44   58   methyl butyrate      1.93
20   dicyclohexylamine         1.44   59   tert-pentylamine     1.95
21   n-octylamine              1.50   60   phenol               1.95
22   benzophenone              1.51   61   triethylamine        1.95
23   1-bromo-2-chlorobenzene   1.54   62   m-chlorotoluene      2.01
24   n-heptylamine             1.59   63   p-chlorotoluene      1.96
25   fluorene                  1.68   64   n-butylamine         1.97
26   n-hexyl acetate           1.70   65   o-xylene             1.98
27   isobutyrophenone          1.65   66   p-xylene             1.98
28   di-n-butylamine           1.66   67   n-butyl acetate      2.00
29   tri-n-propylamine         1.66   68   tert-butylamine      2.03
30   2,4,5-trichlorobenzene    1.67   69   sec-butylamine       2.05
31   N,N-diethylaniline        1.71   70   toluene              2.10
32   diisobutylamine           1.68   71   benzene              2.24
33   2-methylnaphthalene       1.70   72   cyclohexane          2.12
34   anthracene                1.75   73   diethylamine         2.12
35   pyridine                  2.06   74   n-propylamine        2.13
36   4-tert-butylpyridine      1.70   75   isopropylamine       2.19
37   n-hexylamine              1.70   76   pyrrole              2.27
38   methyl caproate           1.71   77   trimethylamine       2.34
39   quinoline                 1.78

a K0 data taken from Shumate, St. Louis, and Hill.28

Regression Analysis. The reduced descriptor pool was then examined for the purpose of developing regression equations that relate K0 to the descriptors. Regression equations are of the form

K0,j = b_0 + Σ_{i=1}^{n} b_i X_{ij}

where K0,j is the reduced ion mobility of the jth compound, b_0 is the y intercept, b_i is the coefficient of descriptor X_{ij} for compound j, and n is the number of descriptors in the regression model. Two regression analysis techniques were used in this study. Regression by leaps and bounds uses the R^2 criterion to search the pool for subsets of descriptors that produce high-quality models.45 Since it is limited to a maximum of 24 descriptors, the best subsets identified were then combined with the remaining descriptors in the pool to search for other possible subsets. An interactive regression analysis technique was used to manually revise models that were suggested by leaps and bounds. This stepwise approach allows the user to watch how models develop as new descriptors are added and old ones removed. When a sound model was found, several methods were used to test the validity of the model. The results of these tests are discussed later.

(45) Furnival, G. M.; Wilson, R. W., Jr. Technometrics 1974, 16, 499.
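For illustration, the Python fragment below fits a regression equation of the form given above by ordinary least squares and reports R and the standard deviation of regression; it is not the leaps-and-bounds or interactive ADAPT routine, and the descriptor matrix and K0 values are synthetic placeholders.

import numpy as np

def fit_mlr(X, y):
    # ordinary least-squares fit of y = b0 + sum_i b_i * X_i
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # prepend the intercept column
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    ss_res = float(resid @ resid)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    R = np.sqrt(1.0 - ss_res / ss_tot)            # multiple correlation coefficient
    s = np.sqrt(ss_res / (n - p - 1))             # SD of regression, n - p - 1 degrees of freedom
    return b, R, s

# placeholder data: 67 compounds, 5 descriptors, synthetic K0 values
rng = np.random.default_rng(2)
X = rng.normal(size=(67, 5))
y = 1.8 + X @ np.array([0.2, -0.1, 0.05, 0.15, -0.25]) + rng.normal(scale=0.05, size=67)
b, R, s = fit_mlr(X, y)
print(np.round(b, 3), round(float(R), 3), round(float(s), 4))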

Neural Networks. Once a final regression model was chosen and validated, the descriptors in this model were used to develop a computational neural network model with the program QNET (developed and used in our laboratory19) that effectively calculated K0 with a low root mean square (RMS) error. In this case, the RMS error of the neural network model was compared to the standard deviation of regression.

RESULTS AND DISCUSSION

Multiple Linear Regression Analysis. Through the use of regression analysis, a number of high-quality models were found. The quality of a model was determined by examining R values, standard deviations of regression, and the number of variables in the model. The best model found was a five-descriptor model. There were four topological descriptors and one electronic descriptor in this model. The primary five-variable model had an R value of 0.982 and standard deviation (SD) of regression of 0.0647. These values were essentially the same as the R and SD values of other models found containing a larger number of variables.

In an effort to improve model performance, the theoretical ion mobility equation was examined. One of the terms in this equation is the square root of the inverse of the reduced mass. Since the molecular weight (MW) was one of the descriptors in the primary model, the square root of the molecular weight,

Table 2. Final Model for the Prediction of Reduced Mobility Constant K0 from Molecular Structure.a

coeff         SD of coeff    label      descriptor definition
3.59          0.0511         Y-int.     y intercept
0.720         0.0625         QNEG       charge on most negative atom
-0.0155       2.24 × 10^-3   KAPA 3-A   Kier path-3 shape index corrected for heteroatoms
2.48 × 10^-4  2.82 × 10^-5   ALLP 5     Wiener number
0.0360        5.21 × 10^-3   WTPT 3     sum of all path weights starting from heteroatoms
-0.149        4.53 × 10^-3   SQMW       square root of molecular weight

a R = 0.991; SD = 0.0469; N = 67 compounds.

Figure 3. Plot of calculated K0 vs observed K0 for the training set using multiple regression analysis (R = 0.991, SD = 0.0469, N = 67).

Figure 4. Plot of predicted K0 vs observed K0 for the external prediction set using multiple regression analysis (RMS error = 0.0475).

labeled SQMW, was calculated. When this descriptor was added to the pool, the best model found was equivalent to dropping MW and adding SQMW to the above model. The R value was 0.986, and the SD was 0.0572 for this revised model, considerably better than the results obtained for the primary model. The revised model was then examined for any outliers. Six standard statistical tests were used for this purpose. They are the residual, standardized residual, studentized residual, leverage, DFFITS statistic, and Cook's distance.46 When three of the six tests flagged an estimate as an outlier, the corresponding compound was removed from the model. Three of the 70 compounds in the training set, anthracene, m-toluidine, and n-butyl acetate, were outliers. These three compounds all exhibited large positive residuals. The data set used in this study was diverse, and it is possible that the outliers were not encoded adequately by the five descriptors in the model. Since the outliers exhibited a significant effect on the model, they were removed. The model coefficients were recalculated with the revised training set of 67 compounds. A final model was generated and is shown in Table 2. An R value of 0.991 and a SD of 0.0469 were given by the final model. There were no outliers detected for this model. A plot of calculated K0 vs observed K0 is provided in Figure 3.

(46) Belsley, D. A.; Kuh, E.; Welsch, R. E. Regression Diagnostics; Wiley: New York, 1980.
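The outlier diagnostics named above can be computed directly from an ordinary least-squares fit. The Python sketch below uses one common textbook formulation of leverage, internally studentized residuals, DFFITS, and Cook's distance, with rule-of-thumb cutoffs; the exact formulations and thresholds used by ADAPT may differ, and the data here are random placeholders.

import numpy as np

def outlier_diagnostics(X, y):
    # OLS fit with intercept; one common formulation of the diagnostics
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T          # hat matrix
    h = np.diag(H)                                # leverage of each compound
    e = y - H @ y                                 # residuals
    k = p + 1                                     # fitted parameters incl. intercept
    s2 = e @ e / (n - k)
    student = e / np.sqrt(s2 * (1 - h))           # internally studentized residuals
    cooks = student ** 2 * h / (k * (1 - h))      # Cook's distance
    dffits = student * np.sqrt(h / (1 - h))       # DFFITS (internally studentized form)
    return h, e, student, dffits, cooks

rng = np.random.default_rng(3)
X = rng.normal(size=(70, 5))
y = 1.8 + X @ rng.normal(scale=0.2, size=5) + rng.normal(scale=0.05, size=70)
h, e, student, dffits, cooks = outlier_diagnostics(X, y)
# heuristic cutoffs (|studentized| > 2, leverage > 2k/n, Cook's D > 4/n), not the paper's rules
flags = (np.abs(student) > 2).astype(int) + (h > 2 * 6 / 70) + (cooks > 4 / 70)
print(np.where(flags >= 2)[0])                    # compounds flagged by several tests at once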

To further validate the final model, four techniques were used. First, all pairwise correlation coefficients (r) of the five descriptors in the model were checked. A cutoff value of r > 0.92 was used. No pairwise correlation coefficients were above this value. A test for multicollinearities was then performed. Each descriptor was regressed against the other four. The multiple correlation coefficient R was used to calculate a variance inflation factor (VIF) for each descriptor. The VIF is given by (1 - R^2)^-1. A value larger than 10 for the VIF is indicative of potential multicollinearity problems.47 With this approach, the average of the five VIF values was 3.00 and the highest individual value was 4.03, well below the cutoff. A third test was performed using the residual values. The residual values were plotted against the calculated K0 values. Since no distinct pattern was found, it was concluded that no relationship existed between the residual values and the calculated K0 values. As a final validation test, the model was used to estimate the K0 values for all seven compounds held aside in the external prediction set. The model performed well, giving an RMS value of 0.0475, very comparable to the SD of the final model. A plot of predicted K0 vs observed K0 is given in Figure 4.

Computational Neural Networks. Once model validation was completed, the five descriptors in the final model were utilized to develop a neural network model linking K0 and structure. The revised training set of 67 compounds used to build the regression model was randomly split into a new training set of 57 compounds and a CV set of 10 compounds. Several network architectures were examined, all containing five input neurons and one output neuron. A final architecture of 5-3-1 was chosen because the number of adjustable parameters was low and network performance was not compromised. When the network architecture had been established, training was conducted with an automated version of the neural network program QNET. The automated version conducted a user-defined number of training sessions (in this case 500) and reported the lowest CV set error found for each session. The session that provided the lowest CV set error was then repeated to determine at which cycle to terminate training. This cycle represented the point of lowest CV set error and therefore the optimum set of weights and biases.25

(47) Stanton, D. T.; Jurs, P. C.; Hicks, M. G. J. Chem. Inf. Comput. Sci. 1991, 31, 301.
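The multicollinearity check described above is easy to reproduce: regress each descriptor on the other four and form (1 - R^2)^-1. The Python sketch below does this for a placeholder descriptor matrix; values above about 10 would signal a problem.

import numpy as np

def variance_inflation_factors(X):
    # VIF_i = (1 - R_i^2)^-1, with R_i from regressing descriptor i on the others
    n, p = X.shape
    vifs = []
    for i in range(p):
        yi = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, yi, rcond=None)
        resid = yi - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((yi - yi.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# placeholder 67 x 5 descriptor matrix with mild correlation between two columns
rng = np.random.default_rng(4)
X = rng.normal(size=(67, 5))
X[:, 4] = 0.6 * X[:, 0] + 0.8 * rng.normal(size=67)
vifs = variance_inflation_factors(X)
print(np.round(vifs, 2), round(float(vifs.mean()), 2))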


Figure 5. Training set and CV set error as a function of number of training cycles. The arrow indicates the point at which training should be terminated.

Figure 6. Plot of calculated K0 vs observed K0 for the training and CV sets using computational neural networks.

The behavioral patterns of the training set and CV set errors while the network was learning are represented in Figure 5. The arrow indicates the point where training was stopped, i.e., the lowest point in the CV set error curve. For this session, the training set and CV set RMS errors were 0.0414 and 0.0390, respectively. The RMS errors are significantly better than the SD of the final regression model and are approximately equal to the experimental errors. The calculated K0 values correlate well with the observed K0 values, as Figure 6 demonstrates. The set of weights and biases providing the lowest error was then tested by external prediction. The original external prediction set from regression analysis was used for this purpose. The RMS error for the prediction set was 0.0393, a considerable improvement over the regression model. A plot of the predicted K0 vs the observed K0 is shown in Figure 7. It is clear that computational neural networks enhance the ability to predict K0 values.

Examination of the individual descriptors allows for some conclusions to be drawn concerning the structural features that influence K0. It should be noted that statistical correlations such as these do not lead to direct cause-effect relationships. However, it is possible to use the descriptors and the information they encode as a starting point for a more detailed investigation of the physical processes that occur in IMS experiments. Kappa indices, as presented by Kier, are topological descriptors that describe the molecular shape.31

Figure 7. Plot of predicted K0 vs observed K0 for the external prediction set using computational neural networks.

In this model, the kappa descriptor used (KAPA 3-A) is the path-3 kappa index corrected for heteroatoms. Examination of the data shows that KAPA 3-A is encoding the shape of a molecule and its degree of branching. The Wiener number (ALLP 5) of a molecule is a quantitative description of the compactness of a given molecule.33 For a series of isomeric compounds, ALLP 5 decreases as the degree of branching, or compactness, of the molecule increases. In terms of ion mobility, ALLP 5 may be encoding a component of the average collision cross section. As ALLP 5 decreases, the molecular compactness increases, thus giving rise to a decrease in the number of collisions with the neutral drift gas and an increase in K0 value. It should be noted that ALLP 5 and KAPA 3-A in this model are encoding similar information, but from a different point of view. There is some correlation between the two, but the r value (r = 0.78) is not too large. KAPA 3-A and ALLP 5 appear to describe the shape of the molecules in a manner similar to Ω, the collision cross section. There is an inverse relation between Ω and K0, and the same behavior is observed when KAPA 3-A and ALLP 5 are plotted against K0.

QNEG is an electronic descriptor that gives the partial electronic charge on the most negative atom in the neutral structure.40 The data set shows that this descriptor is encoding information about the type of functional groups present as well as the primary, secondary, or tertiary nature of the amines. Homologous subsets of compounds within the data set have similar QNEG values. Despite the fact that QNEG is an electronic descriptor, it is apparently acting as a topological descriptor and helping to identify the functional groups and their classifications.

The sum of the path weights for all paths starting from heteroatoms (WTPT 3) is essentially a heteroatom count descriptor. Esters, amines, and halogenated compounds are grouped together. Hydrocarbons are also grouped because they give a value of zero for this descriptor. WTPT 3 and QNEG are acting as molecular classifiers by grouping molecules into categories on the basis of similar functionalities or other structural features.

The square root of the molecular weight (SQMW), which is self-explanatory, was found to be useful because of its appearance in the theoretical equation that describes ion mobility. SQMW was found to have the highest correlation

with K0 (r = 0.943) of all five descriptors. This was expected, given the number of mass to mobility correlations reported in the literature.

None of the five descriptors in the final model were geometric, meaning that modeling of the molecules was not necessary after all. Perhaps geometric descriptors would have been incorporated into the final model if the modeling routines were parameterized for ions as well as neutrals. However, the models performed very well in spite of this, and the simplicity of the descriptors eases the interpretation of the models.

CONCLUSIONS

Multiple linear regression analysis and computational neural network techniques have been useful in developing models that predict reduced mobility constants for a diverse set of compounds from information describing the molecular structure. The RMS error of the final neural network model was comparable to the experimental error for the compounds studied. The descriptors in the final model encode molecular size and shape, classification, and molecular weight. Through the use of computer-assisted methods, the understanding of ion mobility has been improved. Better results may be possible, given a more complete understanding of the true structural form of ions in the gas phase.

ACKNOWLEDGMENT

The DEC AXP workstation was provided by the Digital Equipment Corp. as part of their Loan of Development Products Program.

Received for review January 14, 1994. Accepted May 17, 1994. Abstract published in Advance ACS Abstracts, June 15, 1994.
