Interactive computer system for the simulation of carbon-13 nuclear magnetic resonance spectra

Anal. Chem. 1983, 55, 1121-1127

III summarizes results for these searches; the strong selection of functionality demonstrated in intralibrary searches is retained. The searches and target transforms indicate that the 110 eigenvectors not only span the original library but also span enough IR-MS information to describe many other compounds. A practical benefit of this is that the library may be expanded by projection of new infrared-mass spectra into the factor space without loss of discriminatory capability. The correlation step of the factor analysis required approximately 22 h of CPU time for completion, making the process of factor analysis costly; however, once completed, it is unnecessary to repeat it as long as any new compounds added can be adequately discriminated by using the factors kept for the compressed library.

CONCLUSION

Factor analysis successfully combines the information present in a library of infrared and mass spectra while eliminating redundant information. The vector space produced can be used to contain the information of spectra not included in the factor analysis, but the accuracy of the information is dependent on the similarity of the infrared and mass spectra to the factor analyzed spectra. The spectra used in the factor analysis were of high quality (i.e., good signal-to-noise ratios), so poor S/N in either the infrared or mass spectrum could result in poor search performance. This problem is more likely to occur with infrared spectra than with mass spectra because of the relative insensitivity of GC/IR vs. GC/MS; however, a large number of practical analyses are well within the sensitivity range of both GC/IR and GC/MS (11). Continued advances in GC/FTIR should also help to alleviate this potential problem with combined searching. Search time (projection, dot product calculation, and compilation of the top ten hits) on the VAX 11/780 was approximately 5 s. Because of the large number of floating point calculations, search times could be lengthened significantly when using minicomputers; however, the dot product metric is applied successfully in many mini-/midicomputer searches. The 904-compound library used in this factor analysis contained compounds commonly found in GC/IR and GC/MS analyses. As a result, the factor space is general enough to describe most GC/IR and GC/MS analytes. More detailed discrimination between compounds may be possible through the use of factor analyses of small libraries of specific compound classes, an area of analysis where it has been suggested that GC/IR/MS can be useful.

LITERATURE CITED

(1) Rasmussen, G. T.; Isenhour, T. L. J. Chem. Inf. Comput. Sci. 1979, 19, 179.
(2) Martinsen, D. P. Appl. Spectrosc. 1981, 35, 255.
(3) Wilkins, C. L.; Giss, G. N.; Brissey, G. M.; Steiner, S. Anal. Chem. 1981, 53, 113.
(4) Crawford, R. W.; Hirschfeld, T.; Sanborn, R. H.; Wong, C. M. Anal. Chem. 1982, 54, 817.
(5) Shafer, K. M.; Cooke, M.; DeRoos, F.; Jakobsen, R. J.; et al. Appl. Spectrosc. 1981, 35, 469.
(6) Hangac, G.; Wieboldt, R. C.; Lam, R. B.; Isenhour, T. L. Appl. Spectrosc. 1982, 36, 40.
(7) Malinowski, E. R.; Howery, D. G. "Factor Analysis in Chemistry"; Wiley-Interscience: New York, 1980.
(8) Kaiser, H. F. Educ. Psychol. Meas. 1980, 20, 141.
(9) Stenhagen, E.; Abrahamson, S.; McLafferty, F. W. "Registry of Mass Spectral Data"; Wiley-Interscience: New York, 1980.
(10) Williams, S. S.; Isenhour, T. L., unpublished work, 1982, Chapel Hill, NC.
(11) Jakobsen, R. J.; Brasch, J. W.; Schafer, K. H. FACSS 9th Annual Meeting, Philadelphia, PA, Sept 19-24, 1982; paper 446.

RECEIVED for review August 20, 1982. Resubmitted and accepted February 3, 1983. The financial support of the National Science Foundation, Grant No. Che-78-00632, is gratefully acknowledged.

Interactive Computer System for the Simulation of Carbon-13 Nuclear Magnetic Resonance Spectra

Gary W. Small and Peter C. Jurs*

Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, University Park, Pennsylvania 16802

An interactive computer system for the simulation of carbon-13 NMR spectra is described. This system provides unique capabilities for developing and storing linear models relating observed chemical shifts to calculated structural descriptors. Once developed, these models can be used to predict unknown spectra. The entire simulation procedure is computer based, thus allowing the calculation of both simple and sophisticated descriptors. For the first time, both electronic and geometrical structural features can be encoded and included in the models. This significantly increases the number of chemical systems that can be studied by using this methodology. The organization and specific capabilities of the simulation system are described in this paper. In the following paper, an example study that uses the system is presented.

The development of Fourier transform methodology as well as the increased power of laboratory minicomputers have made carbon-13 nuclear magnetic resonance spectrometry (CNMR) a widely available tool for chemical analysis. The technique is particularly useful in the solution of structure-elucidation problems because of the direct relationship that exists between the structural environment of a carbon atom and its observed chemical shift. When broad-band decoupling procedures are used, the observed CNMR spectrum normally contains one resonance for each carbon atom that exists in a unique structural environment. Since modern NMR spectrometers are computer controlled, the data are collected in digital form. This makes CNMR data extremely attractive for computer-aided interpretation. Work in this area has focused on three categories of analysis methods: (1) pattern recognition techniques, (2) library searches, and (3) spectral simulations. Pattern recognition methods have been used to classify spectra as belonging to various structural categories (1-3). These techniques are useful when screening a large number

of unknown spectra for certain structural features, but they give a limited amount of information when a decision has to be made as to the exact structure represented by the unknown spectrum.

0003-2700/83/0355-1121$01.50/0 © 1983 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 55, NO. 7, JUNE 1983

For specific identifications, the library search scheme has much to offer. This procedure, which involves numerically comparing an unknown spectrum with each member of a library of reference spectra, has been very successful when applied to infrared and mass spectrometric data (4, 5). With a large number of spectra now available, library searches of CNMR data have also produced good results (6, 7). The major limitation of the library search technique, however, centers on the library itself. If the unknown is not present, the user must assume that the nearest matches represent structurally similar compounds. In this case, the search results may be misleading or entirely wrong. Moreover, it is unattractive for the average CNMR user to have to perform the amount of work required to assemble and maintain his own library. The acquisition of a commercial library is often precluded by the large costs involved.

The third common analysis approach, spectral simulation, is outlined in Figure 1. This method involves the proposal of candidate structures for the unknown, followed by simulation of the spectra of the candidates. The simulated spectra can then be compared to the unknown spectrum. The candidate list can then be modified and the process repeated. The chemist may use his own training and intuition to develop the initial candidates, or the results of a library search may be used. The spectral simulation technique is not limited by the size or quality of a spectral library but by the accuracy of the simulation procedure employed. Three principal approaches have been taken to CNMR spectral simulations: (1) quantum mechanical methods, (2) library shift retrieval, and (3) parametric techniques.

Figure 1. Spectrum simulation. The use of a library search for the suggestion of initial candidate structures is optional.

Quantum mechanical calculations have been performed in an attempt to estimate directly CNMR chemical shifts in simple molecules (8). While these methods may hold promise for the future, they are not currently versatile enough to be of general utility.

Much work has been performed in the area of library shift retrievals (9-12). These techniques have both the advantages and limitations of classical library searches. Searches are performed on a library of coded structural environments. Each environment has an associated chemical shift or range of shifts. A spectrum of a candidate structure is simulated by perceiving the atoms in the structure which will give rise to lines in the spectrum, coding each of these atoms in terms of its structural environment, and searching the library for the closest matches of the environments. The chemical shifts associated with the nearest matches form the simulated spectrum. This approach has the virtue of being completely general in concept. It is limited, of course, by the size and quality of the library and the versatility of the coding algorithm. The major drawbacks in terms of practicality for the average user have been that the available systems either have been designed for use with large data bases or have been implemented by using specialized computer languages. Both of these factors pose limitations to making the methodology available to the average CNMR user.

The most widely used method for predicting CNMR chemical shifts is based on the construction of linear mathematical models relating shifts to calculated structural descriptors. These models have the form

S = b(0) + b(1)X(1) + b(2)X(2) + ... + b(p)X(p)    (1)

where S is the predicted chemical shift, the X(i) values are descriptors which encode structural features of the chemical environment of the atom, the b(i) values are coefficients determined from a multiple linear regression analysis of a set of observed chemical shifts, and p denotes the number of descriptors in the model. The models are formed by using the spectra of a set of reference compounds. In order to calculate accurate models, the reference chemical shifts must be correctly and unambiguously assigned to individual carbon atoms. Once formed, the models can be used to predict the spectra of candidate structures not included in the original reference set.

This approach was first applied to the study of linear and branched alkanes by Grant and Paul (13) and by Lindeman and Adams (14). The alkane models have been extended in an attempt to account for heteroatoms and functional groups (15, 16). A limitation of the approach is that the calculated models are highly specific to the types of compounds in the reference set. Therefore, many models would be needed to perform complete spectral simulations. A logical extension is the development of computer programs to apply existing models. These have proven useful, but they are limited to specific models previously determined (17-19).

In the broad sense, this simulation approach is limited by the structural descriptors currently used. In the great majority of cases, the models are based on counts of the number of atoms or functional groups of a certain kind that are located at various distances from the carbon center whose shift is being predicted. These descriptors are designed to be easily calculable by hand. This restriction to hand-calculable descriptors places severe limitations on the ultimate power and versatility of the models. In cases in which stereochemistry is important, there can be no encoding of geometrical information with easily calculable descriptors.
Similarly, no serious attempt can be made to encode electronic structural information. A logical extension of this simulation approach is to use computer-based methods to calculate more sophisticated descriptors. This has been attempted in a trial study, and excellent results were reported (20). When computer methods are used, sophisticated descriptors are as easy to calculate as simple ones. The scheme only becomes practical for the average user, however, if a computer system can be designed to incorporate all the tasks related to model construction and spectrum prediction. Such a system, if designed to be hardware and software compatible with the average laboratory minicomputer, could be implemented by virtually any CNMR user. In this paper, we describe a minicomputer-based interactive computer software system with complete capabilities for the development of new models, using either simple or sophisticated structural descriptors. Once the models are developed, the system can be used to simulate unknown spectra. The organization and capabilities of this system are described in this paper. In the following paper, the use of the system is demonstrated in an example study.
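The linear model of eq 1 can be illustrated with a short Python sketch. It only evaluates an already fitted model for one carbon atom. The function name `predict_shift` and the coefficient values are hypothetical and chosen for illustration; they are not taken from any published model or from the FORTRAN system described here.

```python
def predict_shift(coefficients, descriptors):
    """Evaluate S = b(0) + b(1)X(1) + ... + b(p)X(p) for one carbon atom.

    coefficients: [b(0), b(1), ..., b(p)], intercept first.
    descriptors:  [X(1), ..., X(p)] calculated for the atom of interest.
    """
    intercept, *slopes = coefficients
    if len(slopes) != len(descriptors):
        raise ValueError("need one coefficient per descriptor plus an intercept")
    return intercept + sum(b * x for b, x in zip(slopes, descriptors))


# Hypothetical two-descriptor model; the numbers are illustrative only.
model = [10.0, 9.0, -2.5]
print(predict_shift(model, [2, 1]))  # 25.5
```

A simulated spectrum is then simply the list of predicted shifts, one for each structurally distinct carbon atom in the candidate structure.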

EXPERIMENTAL SECTION

The computer software system described in this paper consists of more than 30 individual programs written in FORTRAN IV and implemented on a PRIME 750 computer operating in the Department of Chemistry at the Pennsylvania State University. Each program is fully interactive. The user inputs commands to the programs in order to perform the various tasks. Communication among programs is accomplished through the use of 25 established random-access disk files. Some programs employ subroutines from the SSP (21) and IMSL (22) packages. All graphics capabilities are implemented by using Tektronix PLOT-10 software. The minimum hardware requirements considered necessary for the installation of the complete simulation system are a minicomputer equipped with at least 64K words of memory, a graphics terminal with capabilities equivalent to the Tektronix 4012, a digital plotter with capabilities equivalent to the Tektronix 4662, and approximately 4 megabytes of external direct-access storage. Some computations would be prohibitively slow if the CPU were not equipped with floating point hardware. The authors will be pleased to provide interested parties with more detailed information regarding the simulation system.

RESULTS AND DISCUSSION

A computer system for both model construction and prediction of unknown spectra must be able to perform several kinds of tasks. The overall organization of the simulation system is presented in Figure 2. Each block in the figure represents the combined contributions of several programs. The various tasks to be performed are handled by individual programs in order to minimize the amount of memory required. Compatibility with the memory capacity of the average minicomputer is therefore maintained. Each group of tasks is described below.

Structure and Spectral Entry. An external data entry system is required for the input of the chemical structural information needed for descriptor generation. If the host computer is not interfaced to the spectrometer, a means must be devised for entering or transferring spectral data. The performance of these tasks is outlined in Figure 3. Chemical structures are entered by using the techniques developed for the ADAPT software system (23, 24). A graphics terminal with controllable cursors is used to "sketch" the two-dimensional structure in much the same manner as a chemist would draw it on paper. The cursors define the atom positions, with atom and bond types being indicated with specific terminal keys. The controlling program perceives the connection characteristics of the structure and uses the relative atom positions on the screen to assign initial two-dimensional coordinates. No hydrogen atoms are explicitly attached to the structure at this time. The structure is assigned an access number and stored on disk for future use. For the compounds comprising the reference set, the corresponding spectra are linked with the structures through the structure access numbers. A maximum of 1000 structures may be stored simultaneously.

Figure 3. The entry of spectral and structural information. For spectra that are to be simulated from existing models, only structural entry is required.

The set of reference compounds to be used in model formation is defined as a list of structure access numbers, called the “worklist”. A set of compounds whose spectra are to be simulated by using existing models is specified similarly. The appropriate worklist of structures is created and stored for future reference. Since the spectral resonances represent the contributions of individual atoms, the next step in the analysis is the definition of the atoms within the structures that will be either predicted or included in the formation of the model. To a first approximation, each structurally distinct atom should be treated. Methods have been reported for the perception of structurally distinct atoms (25,26). The total list of atoms to be studied is formed from the structure access numbers and the numbers of distinct atoms within the structures. For the reference set, this list is used to extract the corresponding chemical shifts from the stored spectra. These shifts comprise the dependent variable which will be used in the construction of the simulation model. The atom list and assembled dependent variable, if any, are stored for later use. Three-Dimensional Molecular Modeling. The topology of each chemical structure is defined upon entry. This information (atom types, bond types, connections) is sufficient for many studies, and the user can often proceed directly to the descriptor generation step. For studies in which stereochemical considerations are important, however, geometrical structural information is required. An illustrative example occurs in steroid molecules in which topologically equivalent carbons can produce chemical shifts differing by 5-8 ppm, due entirely to geometrical effects. If these atoms are to be studied, descriptors must be devised for encoding geometrical information. 
An effective solution to this problem involves the use of molecular mechanics modeling procedures to approximate the three-dimensional coordinates of each atom as it exists in the lowest energy conformation of the structure. With these coordinates, various descriptors can be calculated to encode the geometrical environments of carbon atoms. Molecular modeling is most commonly performed through the use of empirical force-field calculations (27). Several classical mechanical potential functions are defined which describe bond lengths, bond angles, torsional angles, nonbonded interactions, etc. Deviations from the optimum values for these functions (e.g., 109.5° for a bond angle around an sp3 hybridized carbon) result in a progressive energy penalty. The sum of the contributions of each potential function defines an overall intramolecular strain energy for the molecule. By use of a numerical optimization algorithm, atoms are moved to minimize the strain energy. The new atomic coordinates arising from this procedure should approximate the lowest energy conformation of the molecule.

The difficulty of the problem in the current application is increased due to the initial two-dimensional form of the structures. Relatively large movements of the atoms are required in order to obtain the proper three-dimensional geometries. For this reason, a two-stage process is used to model the structures. This process is outlined in Figure 4. The initial two-dimensional structures are modeled to approximate three-dimensional form through the use of a highly interactive molecular mechanics program (28). This program ignores hydrogen atoms and heavy atom-hydrogen interactions in an attempt to simplify the initial modeling process. Some structures exhibit a tendency to become locked in conformations other than the overall minimum. In these cases, the interactive design of the program allows the user to specify the movement of certain atoms or to adjust individual bond lengths, bond angles, or torsional angles. The progress of the minimization can be monitored closely in this manner. At the completion of this initial modeling step, hydrogen atoms are attached at the proper positions around each heavy atom. Many structures are adequately modeled at this point. For more precise modeling, the capability exists for using MM1, the molecular mechanics program developed by Allinger and co-workers (29). The MM1 program directly models hydrogen atoms and uses more sophisticated potential functions than are used in the interactive modeler. MM1 was designed by Allinger to be run as a batch program with card input and output. We have interfaced it to the CNMR simulation system through interactive preprocessing and postprocessing programs which handle the input of the structures and the replacement of the atomic coordinates.

Figure 4. The three-dimensional molecular modeling process. The use of the MM1 section is optional.

Descriptor Generation. Given the worklist of stored structures, along with the atom list of carbon centers to be studied, structural descriptors can be calculated. This procedure is outlined in Figure 5. Descriptor generation is handled by a set of independent programs, each calculating one or more related descriptors. A calculated descriptor consists of one value for each carbon center in the atom list.

Table I. Simple Topological Descriptors

class | type (t) | function (f) | descriptor
nearest neighbor count | atom type | 1-5 bonds | number of atoms of type t located f bonds from center
valency count | substitution | 1-5 bonds | number of atoms of substitution t located f bonds from center
connectivity index (CI) | individual | 1-5 bonds | CI for the bonds at distance f from the carbon center
connectivity index (CI) | average | 1-5 bonds | individual CI divided by number of bonds at distance f
connectivity index (CI) | total | 1-5 bonds | sum of individual CI for distances 1-f bonds from center
corrected CI | individual | 1-5 bonds | corrected CI for the bonds at distance f
corrected CI | average | 1-5 bonds | average corrected CI for the bonds at distance f
corrected CI | total | 1-5 bonds | total corrected CI over the bonds at distances 1-f

Table II. Topological Electronic Descriptors

class | type (t) | function (f) | descriptor
carbon charge | σ charge | 0 bonds | partial σ charge on the carbon center
step charge | average | 1-5 bonds | average σ charge for atoms f bonds from the carbon center
step charge | net | 1-5 bonds | sum of σ charges for atoms f bonds from the carbon center
step charge | total | 1-5 bonds | sum of absolute values of σ charges for atoms f bonds away
step charge | most positive | 1-5 bonds | most positive charge among the atoms f bonds from the center
step charge | most negative | 1-5 bonds | most negative charge among the atoms f bonds from the center
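As an illustration of the simplest descriptor class in Table I, the nearest neighbor count can be sketched in Python. This is a minimal sketch under stated assumptions: the structure is held as an adjacency dictionary of heavy atoms, and "f bonds from center" is taken to mean the shortest path length in bonds. The function names, the data layout, and the atom numbering of the 2-methylbutane example are hypothetical and are not part of the FORTRAN system described here.

```python
from collections import deque


def bond_distances(adjacency, center):
    """Breadth-first search: number of bonds from `center` to each reachable atom."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def neighbor_count(adjacency, elements, center, atom_type, f):
    """Number of atoms of type `atom_type` located exactly `f` bonds from `center`."""
    dist = bond_distances(adjacency, center)
    return sum(1 for atom, d in dist.items()
               if d == f and elements[atom] == atom_type)


# 2-Methylbutane heavy-atom skeleton (assumed numbering): C1-C2(-C5)-C3-C4
adjacency = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
elements = {atom: "C" for atom in adjacency}

# One descriptor value per bond distance for carbon center 1
print([neighbor_count(adjacency, elements, 1, "C", f) for f in range(1, 6)])
# [1, 2, 1, 0, 0]
```

Repeating the call for every carbon center in the atom list yields one descriptor, that is, one value per carbon center.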

These collected values are given a library access number and stored on disk. The descriptor storage area on disk is divided into two sections. The main area holds descriptors which have been calculated for the purpose of model formation. The prediction area holds descriptors which are to be used with existing models. Each area can store a maximum of 200 descriptors. Each descriptor can have up to 500 values. This defines the maximum number of atoms that can be treated at one time. Two programs exist for managing and inspecting the stored descriptors. The descriptor management program handles the deletion of descriptors, the review of currently stored descriptors, and other file management tasks. The descriptor analysis program calculates basic statistics, prints correlation matrices, performs principal components analysis, and provides other functions that allow the user to perform an initial screening of the calculated descriptors. In this manner, descriptors exhibiting unacceptably high correlations or those with too few nonzero values can be identified and removed from further consideration.

Figure 5. The descriptor generation procedure. Three groups of descriptors may be calculated.

The descriptors available for calculation can be divided into three groups: (1) simple topological descriptors, (2) topological electronic descriptors, and (3) geometrical descriptors. Each group of descriptors can be divided into several classes, each of which is generated by an individual program. Each class consists of one or more specific descriptors, identified by a type designation and a function designation. The descriptor groups are summarized in Tables I-III. In each table, the type and function designations for each descriptor are indicated by "t" and "f", respectively. The three descriptor groups are discussed below.

Simple Topological Descriptors. The simple topological descriptors are summarized in Table I. The nearest neighbor counts and valency counts comprise the types of simple descriptors used previously in simulation studies. They allow the formation of Lindeman and Adams type models (14). The connectivity index descriptors encode the degree of branching that exists in the region surrounding the carbon center. The original development of this index made no provision for distinguishing atom types (30). The corrected connectivity index was introduced to distinguish among heteroatoms (31). Both indexes are computed with only topological information.

Topological Electronic Descriptors. This descriptor group utilizes a simple quantum mechanical calculation developed by Del Re (32). The atomic σ charge arises from a simple MO-LCAO calculation which attempts to characterize the charge distribution on atoms in saturated molecules through the use of inductive effects. For the purposes of modeling CNMR chemical shifts, these charges are useful in distinguishing between atoms in a way that also encodes information about inductive effects. They are based on topological information only, requiring no molecular modeling to be performed. The specific descriptors in this group are summarized in Table II.

Geometrical Descriptors. The descriptors which use the modeled three-dimensional coordinates are summarized in Table III. These descriptors attempt to describe the geometrical environments of carbon atoms. They are usually employed only in situations in which locked conformations are being studied. At room temperature, most molecules are able to adopt several possible conformations. The resulting CNMR spectrum represents a weighted average over the various conformations. The calculated models represent the lowest energy conformations of the structures. An increased prediction error is expected if descriptors based on these models are used with conformationally averaged spectra. This problem will be addressed in more detail in the second paper.

Table III. Geometrical Descriptors

class | type (t) | function (f) | descriptor
shell count | heavy atom | shells 1-4 (a) | number of heavy atoms in shell f
hydrogen shell count | hydrogen | shells 1-5 (b) | number of hydrogens in shell f
radial distance | 1-3 power | 1-5 bonds | sum of inverse through-space distance to power t for heavy atoms f bonds from the carbon center
hydrogen radial distance | 1-3 power | 1-5 bonds | sum of inverse through-space distance to power t for hydrogens attached to atoms f bonds away
heteroatom distance | atom type | 1-3 power | inverse through-space distance to power f to closest heteroatom of type t
torsional interaction | angle counts | 0, 60, 120, 180 | number of torsional angles of value f involving carbon center
van der Waals energy (VDW) | H-H energy | | VDW energy due to all H-H interactions for H atoms attached to the carbon center
van der Waals energy (VDW) | H-X energy | | VDW energy due to all H-heavy atom interactions for H atoms attached to the carbon center
van der Waals energy (VDW) | total H | | total VDW energy of H atoms attached to carbon center
van der Waals energy (VDW) | C-X energy | | VDW energy of carbon center interacting with other heavy atoms
van der Waals energy (VDW) | C-H energy | | VDW energy of carbon center interacting with all hydrogen atoms
van der Waals energy (VDW) | total | | total VDW energy of carbon center

(a) Shell boundaries: 2.7-3.4, 3.4-4.1, 4.1-4.8, 4.8-5.4 Å.
(b) Shell boundaries: 0.0-1.5, 1.5-2.4, 2.4-3.2, 3.2-3.6, 3.6-5.0 Å.
The shell count descriptors are counts of the number of atoms found in the region between two spherical shells centered on the carbon being studied. The shell sizes and locations were determined empirically from histograms of interatomic distances. They represent natural clusters of atoms in different bonding geometries. The three classes of distance descriptors are implemented as inverse distances exponentiated to a selected power. This allows the descriptor value to approach zero at large distances.
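The shell count and inverse-distance descriptors just described can be sketched in Python as follows. This is a hedged illustration, not the authors' FORTRAN code: the function names are hypothetical, the four-shell boundary set is the one listed in the Table III footnotes, and treating each shell as a half-open interval (lower bound included, upper bound excluded) is an assumption made here so that no atom is counted twice.

```python
import math

# Four-shell boundary set from the Table III footnotes, in angstroms.
SHELLS = [(2.7, 3.4), (3.4, 4.1), (4.1, 4.8), (4.8, 5.4)]


def shell_counts(center, others, shells=SHELLS):
    """Count atoms whose distance from `center` falls inside each spherical shell."""
    counts = [0] * len(shells)
    for xyz in others:
        r = math.dist(center, xyz)
        for i, (inner, outer) in enumerate(shells):
            if inner <= r < outer:  # half-open interval: an assumption
                counts[i] += 1
                break
    return counts


def inverse_distance_sum(center, others, power):
    """Sum of 1 / r**power over the given atoms; tends to zero at large r."""
    return sum(1.0 / math.dist(center, xyz) ** power for xyz in others)


carbon = (0.0, 0.0, 0.0)
neighbors = [(3.0, 0.0, 0.0), (0.0, 3.5, 0.0), (0.0, 0.0, 5.0), (1.5, 0.0, 0.0)]
print(shell_counts(carbon, neighbors))  # [1, 1, 0, 1]
print(inverse_distance_sum(carbon, [(2.0, 0.0, 0.0), (0.0, 4.0, 0.0)], 1))  # 0.75
```

The atom at 1.5 Å lies inside the innermost boundary and is therefore not counted in any shell. Raising the power from 1 to 3 makes the descriptor fall off more steeply with distance.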

Figure 6. The model formation procedure. The use of atom subsets is an optional procedure. Several steps may be required in the descriptor selection process before an optimum model is found.

The torsional interaction descriptors count the number of torsional angles of a given value involving the carbon center of interest. For example, the descriptor for 60° angles is a count of the number of gauche interactions involving the carbon center. The van der Waals energy descriptors represent an attempt to encode the effects of nonbonded interactions involving the carbon center or its attached hydrogens. The calculation mimics that used by Allinger in the MM1 program (33). Only 1-4 interactions or greater are included in the evaluation. Analysis and Prediction. The final step in the simulation procedure involves the formation of new models or the use of existing models to predict unknown spectra. Each of these steps is discussed below. Model Formation. An outline of the model formation step is presented in Figure 6. Before model formation can proceed, there must be a dependent variable of observed chemical shifts and a set of structural descriptors which has passed the initial screening procedure for the presence of too few nonzero values and the presence of high correlations (collinearities) with other descriptors. These descriptors are collated with the dependent variable to form a data matrix and are output to a disk file that is read by the analysis programs. The data matrix is a representation of all atoms that are to be studied involving structures in the current worklist. In many cases, some initial grouping of the data is desirable in order to increase the accuracy of the calculated models. The widely used Lindeman and Adams models involve a separation of the data into primary, secondary, tertiary, and quaternary carbon centers, with each group being modeled separately (14). Operationally, this involves the definition of atom subsets within the data matrix, with separate analyses being performed on each subset. 
The atom perception program used previously provides capabilities for defining and storing these atom subsets for access by the analysis programs. The analysis of the stored data matrix or the specified submatrix involves the search for the best subset of the current descriptors for predicting the chemical shifts comprising the dependent variable. This search procedure uses the techniques of multiple linear regression analysis (34). It is impractical to test every possible subset of descriptors. For this reason, stepwise regression procedures are employed. This involves constructing the model in a stepwise manner, adding at each step the independent variable (descriptor) that explains the greatest fraction of the remaining error between the predicted and observed chemical shifts.
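The stepwise idea described above can be sketched in pure Python. This is a simplified forward-selection illustration, not the regression program of the system: at each step it adds the descriptor that gives the lowest residual sum of squares, which is one way to read "explains the greatest fraction of the remaining error". The stopping rule (`max_terms`, `min_gain`) is an assumption introduced here; production stepwise procedures normally use F-to-enter and F-to-remove tests. All function names are hypothetical, and the descriptors are assumed to have already been screened for collinearity, since the normal equations become singular otherwise.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(columns, y):
    """Least-squares fit with intercept; returns (coefficients, residual sum of squares)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    coef = solve(xtx, xty)
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, yi in zip(rows, y))
    return coef, rss


def forward_stepwise(descriptors, y, max_terms=3, min_gain=0.01):
    """descriptors: dict of name -> list of values. Returns (chosen names, coefficients)."""
    chosen = []
    coef, rss = fit([], y)  # start from the intercept-only model
    while len(chosen) < max_terms:
        best = None
        for name in descriptors:
            if name in chosen:
                continue
            trial = fit([descriptors[n] for n in chosen + [name]], y)
            if best is None or trial[1] < best[2]:
                best = (name, trial[0], trial[1])
        # Stop when no candidate removes a worthwhile share of the remaining error.
        if best is None or rss - best[2] <= min_gain * rss + 1e-12:
            break
        chosen.append(best[0])
        coef, rss = best[1], best[2]
    return chosen, coef


# Synthetic data: y = 1 + 2*x1 + 3*x2 exactly; x3 is an irrelevant descriptor.
data = {"x1": [0, 1, 2, 3, 4, 5], "x2": [1, 0, 1, 0, 1, 0], "x3": [2, -1, 0, 1, -2, 0]}
shifts = [4, 3, 8, 7, 12, 11]
names, coefficients = forward_stepwise(data, shifts)
print(names)  # ['x1', 'x2']
```

Because a single pass of this kind returns only one model, the choice of which descriptors are offered to it matters a great deal.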

A stepwise regression of a dependent variable on a given set of descriptors produces one model. Often, this is not the best overall model. The key to finding the most accurate model, therefore, lies in altering the descriptor set that is presented to the stepwise regression procedure. A multistep process has been found most useful in obtaining the best model. A study is begun using only the simple topological descriptors (group 1). Many of these tend to be collinear, resulting in a screened descriptor set that is relatively small. These descriptors are submitted to the regression analysis program. A deletion procedure is employed in which each variable in the current best model is withheld in turn, with the stepwise procedure being performed on the remaining variables. This has the effect of uncovering potentially superior models. The interactive nature of the program makes this task convenient to perform. From this deletion procedure, a series of models is found. The observed tendency is that certain descriptors appear in numerous models, while others rarely appear. This provides a means by which the descriptor set can be trimmed. The trimmed set is then augmented with the topological electronic descriptors (group 2) and the process is repeated. If the study demands it, geometrical descriptors (group 3) are added with the same procedure being used. Through this process, a descriptor set is formed containing a high percentage of variables that can enter into models for predicting the observed chemical shifts. At the end of this step, several models will have proven superior in terms of multiple correlation coefficient, standard error of estimate, and F value. These models are assigned a model access number and are stored on disk for future use. The management of stored models is handled by a program similar to that used for the stored descriptors. A maximum of 25 models may be stored at one time.
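The three statistics used to rank the candidate models follow from standard regression formulas, sketched below in Python. The function name `model_statistics` and the example numbers are assumptions made for illustration. The formulas assume a least-squares model with an intercept and `n_descriptors` independent variables, so that the residual degrees of freedom are n - p - 1.

```python
from math import sqrt


def model_statistics(observed, predicted, n_descriptors):
    """Return (R, s, F) for a fitted linear model.

    R - multiple correlation coefficient, sqrt(1 - RSS/TSS)
    s - standard error of estimate, sqrt(RSS / (n - p - 1))
    F - overall F value, ((TSS - RSS) / p) / (RSS / (n - p - 1))
    """
    n = len(observed)
    dof = n - n_descriptors - 1
    if dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    mean = sum(observed) / n
    tss = sum((o - mean) ** 2 for o in observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    r = sqrt(max(0.0, 1.0 - rss / tss))
    s = sqrt(rss / dof)
    f = ((tss - rss) / n_descriptors) / (rss / dof)
    return r, s, f


# Made-up shifts (ppm) purely to exercise the formulas.
observed = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [11.0, 19.0, 30.0, 41.0, 49.0]
r, s, f = model_statistics(observed, predicted, n_descriptors=1)
print(round(r, 4), round(s, 4), round(f, 1))
```

A higher R and F and a lower s indicate a better fit to the reference data, but none of the three measures how well a model predicts atoms outside that set.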
In order to choose the overall best model, a further set of analysis steps is applied to the stored models. A set of regression diagnostics is computed to analyze the residuals between the predicted and observed chemical shifts (35). These diagnostics make possible the detection of outliers in the data. This provides an evaluation tool, as some models will be less susceptible to the effects of outliers than others. In addition, a set of statistics is calculated which attempts to quantitate the degree of collinearity that exists among the descriptors in each model. Even though the largest pairwise correlations have been removed previously, some models may have serious multicollinearities involving several variables. The condition number and the variance decomposition proportions are indicators of the effects of any existing collinearities (35). The best test of a simulation model lies in its ability to predict chemical shifts of atoms outside the original reference set. For evaluating several models, however, it is often more convenient to use internal validation procedures. These involve leaving out some atoms from the analysis, recalculating the models with the remaining atoms, and predicting the chemical shifts of those atoms withheld. Two procedures for performing internal validation are discussed below. One method involves sequentially leaving out one observation, forming the model from the remaining observations, and predicting the one left out. This procedure is repeated such that each observation is predicted once. The overall residual sum of squares for the predicted vs. observed values is a measure for evaluating the performance of the model. This technique, commonly called jackknifing, was first described by Allen (36). The second procedure involves splitting the data in half, forming the model based on half and predicting the other half. The data are split by using an algorithm that attempts to form



science, and statistics. The determination of the value of this effort lies ultimately in the results obtainable and the chemical systems that can now be studied that were impractical or impossible to study previously. It is to this evaluation that we turn in the second paper.

LITERATURE CITED
(1) Brunner, T. R.; Wilkins, C. L.; Lam, T. F.; Soltzberg, L. J.; Kaberline, S. L. Anal. Chem. 1976, 48, 1146-1150.
(2) Wilkins, C. L.; Brunner, T. R. Anal. Chem. 1977, 49, 2136-2141.
(3) Sjostrom, M.; Edlund, U. J. Magn. Reson. 1977, 25, 285-297.
(4) Rasmussen, G. T.; Isenhour, T. L. J. Chem. Inf. Comput. Sci. 1979, 19, 179-186.
(5) Powell, L. A.; Hieftje, G. M. Anal. Chim. Acta 1978, 100, 313-327.
(6) Schwarzenbach, R.; Meili, J.; Konitzer, H.; Clerc, J. T. Org. Magn. Reson. 1976, 8, 11-16.
(7) Jezl, B. A.; Dalrymple, D. L. Anal. Chem. 1975, 47, 203-207.
(8) Ando, I.; Nishioka, A.; Kondo, M. Bull. Chem. Soc. Jpn. 1974, 47, 1097-1104.
(9) Mitchell, T. M.; Schwenzer, G. M. Org. Magn. Reson. 1978, 11, 378-384.

Figure 7. The prediction of unknown spectra. If atom subsets are used, several models may be required to predict each complete spectrum.

the two data groups such that they possess similar correlation structures. This method was developed by Snee in his program DUPLEX (37). The results of these analyses usually allow the selection of one or two best models which can be used with external data in a final test. This is performed as a spectral prediction and will be discussed below.

Prediction of Unknown Spectra. Each stored model has an associated set of descriptors. Given the stored model and the calculated descriptors, predictions may be performed. This is outlined in Figure 7. If atom subsets are used, the appropriate model and descriptors are used for each subset. Each set of predicted chemical shifts is stored. A prediction analysis program uses the stored predicted values and the atom subsets to form the complete spectrum for each structure. If the observed spectra exist, comparisons can be performed.
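The assembly of a complete spectrum from subset models can be sketched in Python as follows. This is an illustrative outline under assumptions, not the prediction analysis program itself: the data layout, the function `predict_spectra`, the descriptor name `beta_count`, and all coefficients are hypothetical.

```python
def predict_spectra(atoms, models):
    """Apply the model for each atom's subset and collect shifts per structure.

    atoms  - list of dicts with hypothetical keys 'structure', 'atom',
             'subset', and 'descriptors' (descriptor name -> value)
    models - dict mapping subset name to a linear model of the form
             {'intercept': float, 'coefficients': {descriptor name: float}}
    Returns a dict: structure -> list of (atom number, predicted shift).
    """
    spectra = {}
    for a in atoms:
        model = models[a["subset"]]
        shift = model["intercept"] + sum(
            coef * a["descriptors"][name]
            for name, coef in model["coefficients"].items()
        )
        spectra.setdefault(a["structure"], []).append((a["atom"], round(shift, 2)))
    for peaks in spectra.values():
        peaks.sort()                 # order each spectrum by atom number
    return spectra


# Made-up models and atoms; the coefficients are not fitted values.
models = {
    "primary": {"intercept": 10.0, "coefficients": {"beta_count": 5.0}},
    "secondary": {"intercept": 20.0, "coefficients": {"beta_count": 4.0}},
}
atoms = [
    {"structure": "S1", "atom": 2, "subset": "secondary", "descriptors": {"beta_count": 1}},
    {"structure": "S1", "atom": 1, "subset": "primary", "descriptors": {"beta_count": 1}},
    {"structure": "S1", "atom": 4, "subset": "primary", "descriptors": {"beta_count": 1}},
    {"structure": "S1", "atom": 3, "subset": "secondary", "descriptors": {"beta_count": 1}},
]
spectra = predict_spectra(atoms, models)
print(spectra["S1"])
```

When observed spectra are available, the predicted and observed shifts for each atom can be compared directly from this per-structure list.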

CONCLUSIONS This paper has presented the techniques that we have recently implemented for performing CNMR spectral simulations by use of linear models. The success of such models lies heavily in the design of useful structural descriptors. One of the design features of the present system is that the descriptor calculation programs are modular, thus allowing new descriptors to be designed and implemented with a minimum of additional effort. Each new class of descriptors is simply one additional program added to the system. This allows the system to expand easily as new compounds demand new descriptors. By way of example, no descriptors are currently implemented for encoding unsaturations or aromaticity. These represent possible areas for expansion. The design of this simulation system has required the amalgamation of techniques from analytical chemistry, computer

(10) Zupan, J.; Heller, S. R.; Milne, G. W. A.; Miller, J. A. Anal. Chim. Acta 1978, 103, 141-149.
(11) Milne, G. W. A.; Zupan, J.; Heller, S. R.; Miller, J. A. Org. Magn. Reson. 1979, 12, 289-296.
(12) Gray, N. A. B.; Crandell, C. W.; Nourse, J. G.; Smith, D. H.; Dageforde, M. L.; Djerassi, C. J. Org. Chem. 1981, 46, 703-715.
(13) Grant, D. M.; Paul, E. G. J. Am. Chem. Soc. 1964, 86, 2984-2990.
(14) Lindeman, L. P.; Adams, J. Q. Anal. Chem. 1971, 43, 1245-1252.
(15) Ejchart, A. Org. Magn. Reson. 1980, 13, 368-371.
(16) Ejchart, A. Org. Magn. Reson. 1981, 15, 22-24.
(17) Surprenant, H. L.; Reilley, C. N. Anal. Chem. 1977, 49, 1134-1139.
(18) Clerc, J. T.; Sommerauer, H. Anal. Chim. Acta 1977, 95, 33-40.
(19) Frahm, A. W.; Hambloch, H. Org. Magn. Reson. 1982, 19, 43-48.
(20) Smith, D. H.; Jurs, P. C. J. Am. Chem. Soc. 1978, 100, 3316-3321.
(21) "Scientific Subroutine Package", Version 3.1; International Business Machines Corporation: White Plains, NY, 1970.
(22) "International Mathematical and Statistical Library", 8th ed.; IMSL: Houston, TX, 1980.
(23) Brugger, W. E.; Jurs, P. C. Anal. Chem. 1975, 47, 781-784.
(24) Stuper, A. J.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 1976, 16, 99-105.
(25) Randic, M.; Brissey, G. M.; Wilkins, C. L. J. Chem. Inf. Comput. Sci. 1981, 21, 52-59.
(26) Fujiwara, I.; Okuyama, T.; Yamasaki, T.; Abe, H.; Sasaki, S. Anal. Chim. Acta 1981, 133, 527-533.
(27) Engler, E. M.; Andose, J. D.; Schleyer, P. v. R. J. Am. Chem. Soc. 1973, 95, 8005-8025.
(28) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. "Computer Assisted Studies of Chemical Structure and Biological Function"; Wiley-Interscience: New York, 1979; pp 63-90.
(29) Allinger, N. L.; Tribble, M. T.; Miller, M. A.; Wertz, D. H. J. Am. Chem. Soc. 1971, 93, 1637-1648.
(30) Randic, M. J. Am. Chem. Soc. 1975, 97, 6609-6615.
(31) Kier, L. B.; Hall, L. H. J. Pharm. Sci. 1976, 65, 1806-1809.
(32) DelRe, G. J. Chem. Soc. 1958, 4031-4040.
(33) Wertz, D. H.; Allinger, N. L. Tetrahedron 1974, 30, 1579-1586.
(34) Draper, N. R.; Smith, H. "Applied Regression Analysis", 2nd ed.; Wiley-Interscience: New York, 1981.
(35) Belsley, D. A.; Kuh, E.; Welsch, R. E. "Regression Diagnostics: Identifying Influential Data and Sources of Collinearity"; Wiley-Interscience: New York, 1980.
(36) Allen, D. M. Department of Statistics, University of Kentucky, 1971, Technical Report 23.
(37) Snee, R. D. Technometrics 1977, 19, 415-428.

RECEIVED for review December 15, 1982. Accepted February 7, 1983. This work was supported by the National Science Foundation, under Grant CHE-8202620. The PRIME 750 computer used in this research was purchased with partial financial support of the National Science Foundation. Portions of this paper were presented at the 9th Annual Meeting, Federation of Analytical Chemistry and Spectroscopy Societies, Philadelphia, PA, Sept 1982.