Spectral interpretation for organic analysis using an expert system

Anal. Chem. 1987, 59, 1207-1212


Serban Moldoveanu*,1 and Carl A. Rapson

Dowell Schlumberger Inc., P.O. Box 2710, Tulsa, Oklahoma 74101

The computerized interpretation of infrared spectral, carbon-13 nuclear magnetic resonance spectral, and mass spectral information can be accomplished with the help of a small expert system whose data base is limited to the information contained in common correlation charts. By combination of the data from these three different spectral techniques and use of some inference rules, it is possible to eliminate many interferences in spectra interpretation. Consequently, very good results have been obtained for standard compounds. The method is able to predict molecules not existent in the data base, indicating that this system can be used for the identification of new compounds as well as well-known compounds. The data base can also easily be extended, such that different users can have modified data bases that are tailored to meet their specific needs.

The computerized processing of spectral information has become routine for many laboratories in the last several years. Programs were developed for searching and viewing spectra from a previously recorded data base, as well as for comparing them with experimental spectra (1-3). This comparison led to the development of a first group of methods for computerized spectra interpretation, based on the recognition of the degree of similarity between an unknown spectrum and a spectrum from the data base. In this procedure, certain molecules are recognized as possibly generating the unknown spectrum. Much effort has been directed toward developing an efficient and rapid search, the choice of representation of spectra in the data base (4), and the choice of an appropriate (unique or multiple) similarity measure (5, 6). Several search strategies (7) and interpretative procedures (8, 9) have been used in connection with this group of methods, which have proven very efficient for several types of spectral interpretation (10-12). The main deficiencies of the approach are the need for a large data base and a powerful computing system. Also, the procedure cannot generally be applied to the search for a compound whose spectrum is not stored in the data base.

A different type of approach in spectra interpretation is based on the determination of functionalities that may be present in an unknown compound. This approach has been developed mainly in connection with vibrational spectra interpretation (13, 14). Several computer programs have been written in order to mimic the human process of interpretation of an infrared (IR) spectrum for an unknown compound (PAIRS, MISIP) (15, 16). The limitations of these programs are mainly related to the limitations of IR spectra themselves for structure elucidation.
A similar problem has been encountered when trying to use only mass spectral information, which generates a set of substructures eventually assembled into a molecule (DENDRAL technique) (17, 18).

1 Present address: Brown & Williamson Tobacco Co., 1600 W. Hill St., Louisville, KY 40232.

The use of artificial intelligence to combine spectral information for compound identification is a topic receiving increased attention (19, 20). Several previous attempts gave promising results (18, 21, 22). In the CHEMICS system, for example (21, 22), an examination of mass spectral data, including a detailed analysis of the molecular ion region, determines likely molecular weights for the compound and, from those, a set of likely molecular formulas. This set is then narrowed by ranking molecules according to the agreement of the experimental with the predicted 13C nuclear magnetic resonance (NMR), proton NMR, and IR spectra of the molecule.

The proposed approach for interpreting spectral data using an expert system is to combine rough "matches" for different spectral methods (IR, 13C NMR, and mass spectra (MS)) in order to select certain organic groups that may be present in the unknown molecule. By addition of a set of compatibility rules (such as the compatibility of the number of 13C NMR peaks with the number of C atoms for a given substance), the organic groups are combined into a possible molecule. The method has been successfully used to identify a wide variety of organic molecules. The efficiency of the method is based on the fact that results from different spectral techniques are likely to coincide only for the correct structure. This concept can be visualized in a simplified form by using the Venn diagram shown in Figure 1, where zone I represents the possibilities offered by IR, zone II the possibilities offered by NMR, and zone III the possibilities offered by MS. The intersection of the three zones yields the desired organic groups. In fact, the program is more complex than the diagram: the resulting identification message provides not only the set of organic groups but also the number of each group in the molecule and the possible position of substitution of these groups in the unknown compound.
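The intersection idea of Figure 1 can be sketched in a few lines of Python. This is only an illustration: the candidate lists are hypothetical (the group names follow Table I, but which technique proposes which group is invented here), and the function name `common_groups` is not taken from the original FORTRAN program.

```python
def common_groups(*candidate_sets):
    """Return the groups proposed by every technique (intersection of the zones)."""
    if not candidate_sets:
        return set()
    result = set(candidate_sets[0])
    for candidates in candidate_sets[1:]:
        result &= set(candidates)
    return result


# Hypothetical candidates from each technique, for illustration only.
ir_candidates = {"METHYL CH3", "ALKANE GROUP CH", "ALDEHYDE, AROMATIC",
                 "PHOSPHATE, ALIPHATIC"}
nmr_candidates = {"METHYL CH3", "ALKANE GROUP CH", "ALDEHYDE, AROMATIC"}
ms_candidates = {"METHYL CH3", "ALDEHYDE, AROMATIC"}

print(sorted(common_groups(ir_candidates, nmr_candidates, ms_candidates)))
# prints ['ALDEHYDE, AROMATIC', 'METHYL CH3']
```

In this toy case the phosphate group, a false IR recognition, drops out because neither NMR nor MS supports it.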
The diagram given in Figure 1 also does not account for the quaternary carbon group, which is identified only by its 13C NMR signal, or for the possibility of including external information in the system. The matches in the proposed system are based only on common correlation charts. A rather complete correlation chart for IR spectra contains about 100 organic groups, each with an average of about five absorbance regions. Such a chart, including peak intensity values, can be stored in a 100 × 5 × 3 matrix. 13C NMR spectral information can likewise be stored in a 100 × 5 matrix. Including memory for group names, masses, some characteristic molecular fragments, free valence, and number of carbon atoms for each group, a complete data base can be contained in approximately 2500 words of computer memory. This amount of memory can be allocated even on a small computer system, without the use of any slow storage device. This system is designed to be used interactively. When queried, the system outputs the answer, which may be only one set of possible organic groups (with their number and position of substitution) or may indicate several sets of organic groups fitting the same requirements. By adding more information or removing some constraints, the answer can be refined. The program also permits online

0003-2700/87/0359-1207$01.50/0 © 1987 American Chemical Society


ANALYTICAL CHEMISTRY, VOL. 59, NO. 8, APRIL 15, 1987

modification of the data base, using information input by the user to extend the data base; as a result, the user may customize the data base. The system represents a compromise between being thoroughly knowledgeable in a narrow range and vague over a wide range. Another feature of the program is the ability to predict a new molecule from fragments, not being limited to the information about that molecule previously introduced into the data base. A feedback section of the system indicates whether or not a proposed structure completely fits the experimental spectral characteristics.

Figure 1. Intersection of IR, 13C NMR, and MS zones of information.

Table I. Organic Groups Used for Molecule Construction

METHYL CH3
METHYL (C=O NEIGH.)
METHYL (*CH3- Aryl or C=)
METHYLENE -CH2
METHYLENE, -(CH2)n- (n=1 for MS search)
METHYLENE, (C=O OR C=-N NEIGH.)
ALKANE GROUP CH
C Quaternary (sp3)
ETHYL
t-BUTYL
Iso-PROPYL
ALKENE; VINYL
ALKENE; -CH=CH- trans
ALKENE; CIS -CH=CH-
ALKENE; C=CH2
ALKENE; >C=C<
ALKYNE; C=-CH
AROMATIC; MONOSUBST. BENZENE
AROMATIC; ORTHO DISUBST.
AROMATIC; META DISUBST.
AROMATIC; PARA DISUBST.
AROMATIC; TRISUBST. (1,2,3)
AROMATIC; TRISUBST. (1,2,4)
AROMATIC; TRISUBST. (1,3,5)
NAPHTHALENE; ALPHA MONOSUBSTITUTED
NAPHTHALENE; BETA MONOSUBSTITUTED
NAPHTHALENE; DISUBSTITUTED (1,2 OR 1,4)
NAPHTHALENE; DISUBSTITUTED (1,3 OR 2,3)
NAPHTHALENE; DISUBSTITUTED (1,5)
NAPHTHALENE; DISUBSTITUTED (2,6)
PYRRYL
FURYL
PYRIDYNE, MONOSUBSTITUTED
INDOLE, MONOSUBSTITUTED
ETHER; ALIPHATIC
ETHER; AROMATIC
ACETAL OR KETAL
MERCAPTAN
ALCOHOL; PRIMARY, BONDED
ALCOHOL AROMATIC
ALCOHOL; PRIMARY, DILUTE SOLUTION
ALCOHOL; SECONDARY (BONDED)
ALCOHOL; SECONDARY, DILUTE SOLN
ALCOHOL; TERTIARY (BONDED)
ALCOHOL; TERTIARY, DILUTE SOLUTION
PHENOLIC GROUP OH
SUGAR (Glucose unit used for MS)
CARBOXILIC ACID, DIMER
IONIZED CARBOXYL
...
IMIDE, CYCLIC
CARBAMATE
AMINE PRIMARY, ALIPHATIC
AMINE PRIMARY, AROMATIC
AMINE SECONDARY, ALIPHATIC
AMINO-ACID (HOOC-CH< NH2)
AMINO-ACID, AROMATIC
...
SILANE, AROMATIC
IMINE, ALIPHATIC
...
NITRATE, ALIPHATIC (-O-NO2)
NITRO, AROMATIC
SULFONIC ACID, ALIPHATIC
PHOSPHATE, ALIPHATIC

EQUIPMENT The system described here was written in FORTRAN and initially run on a VAX 11/785 minicomputer under the VMS operating system. The program and data files together occupy 75 kbytes of memory, and the system could also be implemented on IBM PC-XT and IBM PC-AT computers. The difference in the working time required by the VAX and by the IBM PC (XT or AT) has not been found critical for most cases.

APPROACH The processing of information input during a query is performed in this system by a set of subroutines written in FORTRAN. There are three major sections of the program, namely, the section processing the IR information, the section processing the 13C NMR information, and the section that processes the MS information and combines certain organic groups into a molecule in a manner consistent with the information input. Each section can give independent information, but the system has been written to use these three sections in conjunction.

The experimental data used by the system can also be classified into three types: main information, auxiliary information, and external information. The main information is needed for finding a self-consistent set of organic groups and consists of the peak positions in the IR spectrum and their relative intensities (weak, medium, or strong), the peak positions in the 13C NMR spectrum, and the molecular mass of the compound. The auxiliary information helps in finding the correct set and number of organic groups but is not compulsory. It consists of the parity of the number of protons bonded to each carbon, the fragments in the mass spectrum, and the ratio of intensities in the mass spectrum for the molecular ion peak M and the peaks (M + 1) and (M + 2). The external information is added by the user as supplemental organic groups that were not identified by the system in the IR or NMR spectrum but are supposed (or known) to be present in the molecule. This external information is also not required by the system.

The IR spectra interpretation is based on a search for the number of matches of the peaks from the experimental spectrum with specific peak intervals for the organic groups stored in the data base. The organic groups chosen to be used in building a molecule are given in Table I. Each group is regarded independently by the system. For all these groups (with the exception of quaternary carbon), the frequency range within which an absorption band is expected coincides with the respective correlation chart. There are several redundant groups such as "Methyl CH3" and "Methyl C=O neighbor". Because of the influence of the C=O group on the band position of the methyl group, the two groups are considered different by the system. In contrast to other search systems, the search does not follow a hierarchical tree pattern (15, 23). The decision regarding the presence of an organic group is made depending on the probability of matching maxima in the experimental spectrum with the characteristic absorption bands in the correlation chart. Because this probability is different for different organic groups, depending on the number of characteristic absorption bands as well as their widths, the reliability of this type of decision is defined. That is, in order to improve the reliability of the decision in cases with more than five characteristic bands in the IR spectrum, a positive decision is taken for at least 5 matches from 6 or 7 bands, 6 from 8, and 7 from 9 or 10. Those cases with only one or two characteristic absorption bands are supplemented in the data base with the bands of associated organic groups (for example, alkyl for aliphatic ketones), and a peak intensity match is also required. As a general rule, however, the peak intensities in the spectrum are not considered in the search. The reason for omitting the peak intensity in making the decision is that peak intensity is more variable than peak position, depending on the neighboring groups, and that experimental conditions may greatly influence peak intensity.

The IR data search has been adjusted to almost always detect an organic group if it exists in the molecule and in the data base. The data base, as written, does not contain certain groups that are not frequently encountered in common molecules (e.g., the data base contains only a few heterocycles, no spiranes, and no metal-organic groups). It is impossible to avoid some false recognitions, so the number of organic groups suggested will almost always be more than the actual number of groups in the substance. This computerized IR spectral interpreter is consequently less accurate than some other similar programs (such as PAIRS) but, being much simpler, has the advantage of being very fast. Also, because the analysis of peak positions in an IR spectrum can seldom give a unique answer regarding the molecular structure, it was preferred to have only an "overlapping" decision regarding the presence of certain organic groups. As far as peak recognition in the spectrum is concerned, the IR data search in this system is done on peak position only and does not process the entire digitized spectrum. A peak-picker routine can be found in many computer-assisted IR instruments, and the values selected by such a routine can provide the necessary input data.

The NMR information is used in two different sections for making decisions in this system. The first section works similarly to the previously described IR section.
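The correlation-chart storage and the IR decision rule described above can be sketched as follows; the same interval matching would apply to the 13C NMR section. This is a minimal sketch, not the original FORTRAN code: the class and function names are hypothetical, the demonstration band limits are invented placeholders, and the behavior for groups with five or fewer bands (every band must match) is an assumption, since the text gives thresholds only for 6 to 10 bands.

```python
from dataclasses import dataclass, field


@dataclass
class GroupEntry:
    """One organic group of the correlation chart (field names are illustrative)."""
    name: str
    ir_bands: list                                      # [(low, high), ...] in cm^-1
    nmr_intervals: list = field(default_factory=list)   # [(low, high), ...] in ppm
    mass: float = 0.0                                   # rounded group mass
    n_carbons: int = 0


def required_matches(n_bands):
    """Minimum number of matched bands needed for a positive decision."""
    if n_bands < 1:
        raise ValueError("a group needs at least one characteristic band")
    if n_bands <= 5:
        # ASSUMPTION: the text gives no threshold here, so every band must match.
        return n_bands
    thresholds = {6: 5, 7: 5, 8: 6, 9: 7, 10: 7}   # values quoted in the text
    if n_bands not in thresholds:
        raise ValueError("no rule is given for more than 10 bands")
    return thresholds[n_bands]


def count_matched_intervals(peaks, intervals):
    """Count the intervals that contain at least one experimental peak."""
    return sum(1 for low, high in intervals
               if any(low <= p <= high for p in peaks))


def ir_group_present(group, ir_peaks):
    """Apply the k-of-n decision to one group, using peak positions only."""
    matched = count_matched_intervals(ir_peaks, group.ir_bands)
    return matched >= required_matches(len(group.ir_bands))


# Invented six-band group; the band limits are placeholders, not chart values.
demo = GroupEntry(
    name="DEMO GROUP",
    ir_bands=[(690, 710), (730, 770), (1000, 1050),
              (1450, 1500), (1580, 1610), (3000, 3100)],
)
print(ir_group_present(demo, [700.0, 750.0, 1460.0, 1600.0, 3050.0]))  # 5 of 6 bands
print(ir_group_present(demo, [700.0, 750.0, 1460.0, 1600.0]))          # 4 of 6 bands
```

With five of the six bands matched the group is accepted; with four it is rejected. Peak intensities are deliberately ignored, in line with the general rule stated in the text.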
The task of this part is to find matches between the expected intervals for certain organic groups and the peak positions in the experimental spectrum. The search can start by considering only the groups detected by the IR section of the system (with the possible addition of the quaternary carbon) or can consider all organic groups in the data base. No intensity interpretation is involved in this first part, primarily because of the Overhauser effect, which makes it difficult to interpret the peak intensities in pulse NMR. The second section where the 13C NMR information is used is the part that assembles the fragments into a molecule. An optional part of the 13C NMR section considers the parity of the number of protons bonded to a carbon atom for a certain organic group. On the basis of off-resonance proton decoupling (24) or the INEPT technique (25), it is possible to differentiate the groups C or CH2 from the groups CH or CH3.

After the IR and 13C NMR data search, an attempt is made to combine the organic groups either suggested by the IR and/or NMR routines or input as supplemental information. The molecular ion mass is especially useful in the determination of the mass of a compound, which is an important piece of information when assembling an unknown molecule. False recognitions are frequently eliminated because the different methods do not have the same types of interferences. By means of a combination routine, several of these groups are put together to fit a specific molecular (or molecular fragment) mass. This mass can be found by using the mass spectrum, preferably obtained with chemical ionization techniques so that the molecular ion peak is identifiable. Compared with programs such as STIRS (8) or DENDRAL (17), the system does not rely heavily on the fragments in the mass spectrum. The MS data can be limited to the mass of the molecule.
The reason for this choice is that the present system uses the molecular mass as an important input. Because fragmentation is much reduced under chemical ionization, separate mass spectra obtained under different conditions would be needed in order to have both the fragments and the molecular ion mass. This would require more experimental work to provide the proper input values for the system. Also, in order to keep the system as simple as possible, a detailed MS fragment interpretation, which requires a more sophisticated program, has been avoided. The information from the fragmentation in the mass spectrum is only optionally used in this system. There are four possibilities in considering the


proposed groups to be combined in a molecule, namely, (1) absent (even if it is suggested by IR and NMR), (2) presence unknown, (3) present, but in an unknown number, and (4) present in a certain number. The choice must be made by the user. However, the system gives a suggested response (1, 2, or 3) for each group proposed, on the basis of molecular fragments in the mass spectrum. The system answer is obtained from the number of fits between expected fragments for a given organic group in the data base and the experimental fragments in the mass spectrum. If no data regarding fragments are provided, no suggestion is given and the user must decide which option to choose (the most common one being "presence unknown"). Because the program has been written to be used in conjunction with a quadrupole mass spectrometer, the masses of molecules and fragments are expected to be rounded to whole or half integers. Also, the masses of the organic groups in the data base are rounded in the same way. The mass of the organic group -(CH2)n- is considered in the MS search with n = 1 (M = 14).

The subroutine package for combining groups in a molecule contains a number generator, providing the number of times each organic group will appear in the proposed molecule. This combination of organic groups is done by using several inference rules. The program considers the free valence of each group (index of hydrogen deficiency) and calculates the number of resulting double bonds and/or cycles for the molecule. The index of hydrogen deficiency can be calculated for a compound containing carbon, hydrogen, nitrogen, halogen, oxygen, and sulfur from the known formula:

$$\text{index} = n_{\text{carbon}} - \frac{n_{\text{hydrogen}}}{2} - \frac{n_{\text{halogen}}}{2} + \frac{n_{\text{nitrogen}}}{2} + 1$$
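The index of hydrogen deficiency formula above can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and the molecular formulas used (C12H12, C9H11NO, C11H13N3O) are the standard formulas of the three example compounds, which the text itself does not write out. The results agree with the numbers of double bonds or cycles printed in the sample outputs of Tables II-IV.

```python
def hydrogen_deficiency_index(carbon, hydrogen, nitrogen=0, halogen=0):
    """Number of double bonds and/or rings implied by a molecular formula.

    Oxygen and sulfur do not enter the formula, so they are not arguments.
    """
    return carbon - hydrogen / 2 - halogen / 2 + nitrogen / 2 + 1


print(hydrogen_deficiency_index(12, 12))              # 2,6-dimethylnaphthalene, C12H12 -> 7.0
print(hydrogen_deficiency_index(9, 11, nitrogen=1))   # p-(dimethylamino)benzaldehyde, C9H11NO -> 5.0
print(hydrogen_deficiency_index(11, 13, nitrogen=3))  # 4-aminoantipyrine, C11H13N3O -> 7.0
```

A non-integer result would signal an inconsistent formula, which is why the first inference rule requires the index to be an integer.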

A first rule requires that this number must be an integer larger than or equal to the number of double bonds brought by each included organic group. Based on the value of the hydrogen deficiency index and the number of double bonds existent in the fragments used, the system indicates the number of new double bonds or cycles created when combining groups into a molecule. Other inference rules are activated if the 13C NMR information is available. The number and types of carbon atoms in each possibility offered by the combination routine are correlated with the number and position of the NMR peaks, following the rules of interpreting a 13C NMR spectrum. The number of peaks in the 13C NMR spectrum is expected to be less than or equal to the number of carbon atoms in the molecule. The program will also suggest the need to check certain NMR peak intensities if a higher number of carbon atoms than NMR peaks is proposed for a molecule. An optional subroutine processes the ratio of intensities in the mass spectrum, for the molecular ion peak M and the peaks (M + 1) and (M + 2), in order to suggest the ratio of C vs. N and O atoms. The nitrogen rule is also considered by the system.

A feedback subroutine checks the agreement between the experimental IR spectrum and the possible spectrum for the new possible molecule. An attempt is made to overlap all the experimental peaks with expected absorption intervals for the given molecule, and a negative decision is considered if major peaks remain unexplained in the experimental IR spectrum. The result is displayed in the form of proposed organic groups, possible positions of substitution, and, in a few cases, the neighboring groups (such as a C=O group near CH3). If the MS part of the program is used, the number of each type of group is given to fit a certain molecular weight. The resulting set of groups suggested may be unique or there may be several possibilities.
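A rough sketch of the combination step is given below, under stated assumptions. It enumerates how many times each candidate group can occur so that the group masses add up to the molecular mass, and then applies two of the rules mentioned above: the number of 13C NMR peaks must not exceed the number of carbon atoms, and the nitrogen rule. The function names are hypothetical, and the group masses and atom counts are computed here from the group formulas with integer nominal masses; the actual data base entries are not given in the text. Free valence, positions of substitution, and the IR feedback check are omitted, so the sketch returns more candidates than the real system would.

```python
# ASSUMPTION: masses and atom counts derived from the group formulas,
# not copied from the original data base.
GROUPS = {
    "METHYL CH3":                {"mass": 15, "carbons": 1, "nitrogens": 0},
    "ALKANE GROUP CH":           {"mass": 13, "carbons": 1, "nitrogens": 0},
    "C Quaternary (sp3)":        {"mass": 12, "carbons": 1, "nitrogens": 0},
    "AROMATIC; PARA DISUBST.":   {"mass": 76, "carbons": 6, "nitrogens": 0},
    "ALDEHYDE, AROMATIC":        {"mass": 29, "carbons": 1, "nitrogens": 0},
    "AMINE TERTIARY, ALIPHATIC": {"mass": 14, "carbons": 0, "nitrogens": 1},
}


def fit_mass(groups, target_mass):
    """Yield {group name: count} dicts whose masses add up to target_mass."""
    names = list(groups)

    def recurse(index, remaining, counts):
        if index == len(names):
            if remaining == 0:
                yield dict(zip(names, counts))
            return
        mass = groups[names[index]]["mass"]
        for n in range(remaining // mass + 1):
            yield from recurse(index + 1, remaining - n * mass, counts + [n])

    yield from recurse(0, target_mass, [])


def passes_rules(counts, groups, target_mass, n_nmr_peaks):
    """Apply two of the inference rules mentioned in the text."""
    carbons = sum(groups[g]["carbons"] * n for g, n in counts.items())
    nitrogens = sum(groups[g]["nitrogens"] * n for g, n in counts.items())
    if n_nmr_peaks > carbons:               # peaks must not exceed carbon atoms
        return False
    if nitrogens % 2 != target_mass % 2:    # nitrogen rule on the nominal mass
        return False
    return True


# Input values of the p-(dimethylamino)benzaldehyde example: mass 149, 6 NMR peaks.
candidates = [c for c in fit_mass(GROUPS, 149)
              if passes_rules(c, GROUPS, 149, 6)]
print(len(candidates), "combinations survive the two rules")
```

The set reported in Table III (two methyl groups, one para-disubstituted ring, one aromatic aldehyde, one tertiary amine) is among the surviving candidates; the remaining rules of the program are what reduce the list further.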
It is easy to modify the correlation data in the data base interactively, allowing the user to fine tune the system. This modification can be done either by writing new values in the data base or by asking the system to recognize a certain organic group as present, considering the provided spectral information. Once recorded in the data base, the information is used for all other molecules. On-line data acquisition and transfer is feasible, but no attempt has been made as yet to implement a direct connection to instruments. Mixtures, even of only two compounds, are difficult to identify positively with this system, and no self-consistent set of organic groups can be identified in this case. However, the system has

proven to be useful in identifying the functional organic groups that are present in mixtures and in reducing the number of their possible combinations.

Table II. Sample Output for the Interpretation of Spectral Data of 2,6-Dimethylnaphthalene

THIS SECTION HELPS TO INTERPRET IR SPECTRA
WAVENUMBERS IN THE SPECTRUM ARE:
INTENSITIES:
POSSIBLE ORGANIC GROUPS
  METHYL CH3 (*CH3- Aryl or C=)
  METHYLENE -CH2
  ALKANE GROUP CH
  AROMATIC; PARA DISUBST.
  AROMATIC; TRISUBST. (1,3,5)
  NAPHTHALENE; DISUBSTITUTED (2,6)
  AMINE TERTIARY, AROMATIC
  BROMINATED ALIPHATIC HYDROCARBON
  PHOSPHATE, ALIPHATIC
THIS SECTION HELPS TO INTERPRET NMR SPECTRA
PEAK POSITION IN PPM IS: 21.5 126.6 127.1 128.0 132.1 134.3
PARITY: 0 0 0 0 0 0
POSSIBLE ORGANIC GROUPS
  METHYL CH3 (*CH3- Aryl or C=)
  METHYLENE -CH2
  AROMATIC; PARA DISUBST.
  AROMATIC; TRISUBST. (1,3,5)
  NAPHTHALENE; DISUBSTITUTED (2,6)
  AMINE TERTIARY, AROMATIC
  BROMINATED ALIPHATIC HYDROCARBON
THIS SECTION HELPS TO COMBINE GROUPS IN A MOLECULE
COMPOUND MASS = 156.0000
MOLECULE LOOKING FOR:
POSSIBLE SUBSTANCE 1 HAS 7 DOUBLE BONDS OR CYCLES
  2 METHYL CH3 (*CH3- Aryl or C=)
  1 NAPHTHALENE; DISUBSTITUTED (2,6)
NO. OF CARBON ATOMS= 12; NO. OF NMR PEAKS IN YOUR SPECTRUM= 6
GROUP TYPE: METHYL CH3 (*CH3- Aryl or C=)
  NO. OF CARBON ATOMS= 2; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 1
  INTERVALS: 2.0 -- 70.0
  PEAKS: 21.5
  MOLECULAR SYMMETRY EXPECTED, PLEASE CHECK PEAK HEIGHT IN NMR
GROUP TYPE: NAPHTHALENE; DISUBSTITUTED (2,6)
  NO. OF CARBON ATOMS= 10; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 5
  INTERVALS: 96.0 -- 158.0
  PEAKS: 126.6 127.1 128.0 132.1 134.3
  MOLECULAR SYMMETRY EXPECTED, PLEASE CHECK PEAK HEIGHT IN NMR
NO MORE REASONABLE FITS

Table III. Sample Output for the Interpretation of Spectral Data of p-(Dimethylamino)benzaldehyde

THIS SECTION HELPS TO INTERPRET IR SPECTRA
WAVENUMBERS IN THE SPECTRUM ARE:
  632.0 728.0 812.7 824.2 1066.8 1171.2
  1231.6 1375.4 1447.6 1456.6 1537.3 1554.3
  1613.8 1660.3 2713.7 2796.7 2823.9 2905.4
INTENSITIES:
  1 1 3 2 2 2 3 3 1 1 2 1 3 3 1 2 2 2
POSSIBLE ORGANIC GROUPS
  METHYL CH3
  ALKANE GROUP CH
  AROMATIC; PARA DISUBST.
  ALDEHYDE, AROMATIC
  AMINE TERTIARY, ALIPHATIC
THIS SECTION HELPS TO INTERPRET NMR SPECTRA
PEAK POSITION IN PPM IS: 41.2 112.3 126.5 133.2 155.7 191.5
PARITY: 0 0 0 0 0 0
POSSIBLE ORGANIC GROUPS
  METHYL CH3
  ALKANE GROUP CH
  C Quaternary (sp3)
  AROMATIC; PARA DISUBST.
  ALDEHYDE, AROMATIC
  AMINE TERTIARY, ALIPHATIC
THIS SECTION HELPS TO COMBINE GROUPS IN A MOLECULE
COMPOUND MASS = 149.0000
MOLECULE LOOKING FOR:
  METHYL CH3
  ALKANE GROUP CH
  C Quaternary (sp3)
  AROMATIC; PARA DISUBST.
  ALDEHYDE, AROMATIC
  AMINE TERTIARY, ALIPHATIC
POSSIBLE SUBSTANCE 1 HAS 5 DOUBLE BONDS OR CYCLES
  2 METHYL CH3
  1 AROMATIC; PARA DISUBST.
  1 ALDEHYDE, AROMATIC
  1 AMINE TERTIARY, ALIPHATIC
NO. OF CARBON ATOMS= 9; NO. OF NMR PEAKS IN YOUR SPECTRUM= 6
GROUP TYPE: METHYL CH3
  NO. OF CARBON ATOMS= 2; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 1
  INTERVALS: 2.0 -- 70.0
  PEAKS: 41.2
GROUP TYPE: AROMATIC; PARA DISUBST.
  NO. OF CARBON ATOMS= 6; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 4
  INTERVALS: 96.0 -- 158.0
  PEAKS: 112.3 126.5 133.2 155.7
GROUP TYPE: ALDEHYDE, AROMATIC
  NO. OF CARBON ATOMS= 1; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 1
  INTERVALS: 185.0 -- 210.0
  PEAKS: 191.5
NO MORE REASONABLE FITS

RESULTS AND DISCUSSION When this system was applied to a series of known molecules, very good results were obtained by using the IR, NMR, and MS segments of the program in conjunction. The program has been applied to the identification of molecules having up to 10 different organic groups (given in Table I). More than one possibility has always been indicated for substances with many structural isomers, such as those containing longer alkane chains or multiple functional groups, such as -OH and -O- or -NH2 and -NH-. When the 13C NMR information regarding the odd or even number of protons attached to each carbon atom is used, the number of possible isomers for a compound is reduced and the result indicates a lower number of alternatives. When the IR, NMR, and MS segments of the program are used in conjunction, a correct result depends heavily on the recognition in the IR section of the organic groups present. Besides the quaternary carbon, supplementary organic groups can be added only by including external information. Also, the agreement between the number of NMR peaks and the number of carbon atoms in a molecule has a key role. For that reason the system does not give proper results when the

experimental molecule exhibits tautomerism. Even for a simple molecule such as 2,4-pentanedione, which shows keto-enol tautomerism, the system indicates that no structure is possible. The same experimental data can be analyzed under different constraints, which can be specified dynamically. This allows selective use of the experimental information available. The easiest way to modify the search conditions for possible molecules is to change how the organic groups selected by the IR and NMR sections are used, by means of the four options (group absent, presence unknown, present in an unknown number, or present in a specified number). In response to the number of possible molecules indicated by the system (a large number or none), more constraints can be added or released.

Three complete examples showing the system's use are described below and illustrated in Tables II-IV. The first example is 2,6-dimethylnaphthalene. Both the peak wavenumbers for IR and the peak positions for NMR were obtained experimentally, but the same values can be found in published spectral collections (e.g., ref 26 or 27). The IR detection identified nine different functional groups on the basis of peak position (wavenumbers). Several false recognitions appeared in this selection, as shown in Table II. The NMR section removed the phosphate group and the CH group. The section regarding the MS spectra interpretation used as an input only the mass of the substance, no information regarding fragments being provided. All the organic groups selected after the IR and NMR sections were considered with the option "presence unknown" in assembling a molecule, supposing that no preliminary knowledge about the analyzed substance is available. The result indicates only one possible answer, the correct molecule and also the expected symmetry being recognized.

Table IV. Sample Output for the Interpretation of Spectral Data of 4-Aminoantipyrine

THIS SECTION HELPS TO INTERPRET IR SPECTRA
WAVENUMBERS IN THE SPECTRUM ARE:
POSSIBLE ORGANIC GROUPS:
  METHYL CH3
  METHYLENE -CH2
  METHYLENE, (C=O OR C=-N NEIGH.)
  ALKANE GROUP CH
  ALKENE; >C=C<
  AROMATIC; MONOSUBST. BENZENE
  KETONE, AROMATIC
  AMIDE, DISUBSTITUTED
  AMINE PRIMARY, ALIPHATIC
  AMINE PRIMARY, AROMATIC
  AMINE SECONDARY, ALIPHATIC
  AMINE SECONDARY, AROMATIC
  AMINE TERTIARY, ALIPHATIC
  BROMINATED ALIPHATIC HYDROCARBON
THIS SECTION HELPS TO INTERPRET NMR SPECTRA
PEAK POSITION IN PPM IS: 11.4 39.2 120.5 124.1 127.1 130.3 136.9 139.6 163.5
PARITY: 1 1 2 1 1 1 2 2 2
POSSIBLE ORGANIC GROUPS:
THIS SECTION HELPS TO COMBINE GROUPS IN A MOLECULE
COMPOUND MASS = 203.0000
MOLECULE LOOKING FOR:
  METHYL CH3
  ALKANE GROUP CH
  ALKENE; >C=C<
  AROMATIC; MONOSUBST. BENZENE
  AMIDE, DISUBSTITUTED
  AMINE PRIMARY, AROMATIC
  AMINE TERTIARY, ALIPHATIC
  BROMINATED ALIPHATIC HYDROCARBON

POSSIBLE SUBSTANCE 1 HAS 7 DOUBLE BONDS OR CYCLES
1 NEW DOUBLE BOND(S) OR CYCLE(S)
  1 METHYL CH3
  1 ALKANE GROUP CH
  1 ALKENE; >C=C<
  1 AROMATIC; MONOSUBST. BENZENE
  1 AMIDE, DISUBSTITUTED
  2 AMINE PRIMARY, AROMATIC
NO. OF CARBON ATOMS= 11; NO. OF NMR PEAKS IN YOUR SPECTRUM= 9
GROUP TYPE: METHYL CH3
  NO. OF CARBON ATOMS= 1; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 2
  INTERVALS: 2.0 -- 70.0
  PEAKS: 11.4 39.2
GROUP TYPE: ALKANE GROUP CH
  NO. OF CARBON ATOMS= 1; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 1
  INTERVALS: 26.0 -- 90.0
  PEAKS: 39.2
GROUP TYPE: ALKENE; >C=C<
  NO. OF CARBON ATOMS= 2; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 6
  INTERVALS: 102.0 -- 140.0
  PEAKS: 120.5 124.1 127.1 130.3 136.9 139.6
GROUP TYPE: AROMATIC; MONOSUBST. BENZENE
  NO. OF CARBON ATOMS= 6; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 6
  INTERVALS: 96.0 -- 158.0
  PEAKS: 120.5 124.1 127.1 130.3 136.9 139.6
GROUP TYPE: AMIDE, DISUBSTITUTED
  NO. OF CARBON ATOMS= 1; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 1
  INTERVALS: 150.0 -- 180.0
  PEAKS: 163.5

POSSIBLE SUBSTANCE 2 HAS 7 DOUBLE BONDS OR CYCLES
2 NEW DOUBLE BOND(S) OR CYCLE(S)
  1 METHYL CH3
  3 ALKANE GROUP CH
  1 AROMATIC; MONOSUBST. BENZENE
  1 AMIDE, DISUBSTITUTED
  1 AMINE PRIMARY, AROMATIC
  1 AMINE TERTIARY, ALIPHATIC
NO. OF CARBON ATOMS= 11; NO. OF NMR PEAKS IN YOUR SPECTRUM= 9
GROUP TYPE: METHYL CH3
  NO. OF CARBON ATOMS= 1; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 2
  INTERVALS: 2.0 -- 70.0
  PEAKS: 11.4 39.2
GROUP TYPE: ALKANE GROUP CH
  NO. OF CARBON ATOMS= 3; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 1
  INTERVALS: 26.0 -- 90.0
  PEAKS: 39.2
GROUP TYPE: AROMATIC; MONOSUBST. BENZENE
  NO. OF CARBON ATOMS= 6; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 6
  INTERVALS: 96.0 -- 158.0
  PEAKS: 120.5 124.1 127.1 130.3 136.9 139.6
GROUP TYPE: AMIDE, DISUBSTITUTED
  NO. OF CARBON ATOMS= 1; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 1
  INTERVALS: 150.0 -- 180.0
  PEAKS: 163.5

POSSIBLE SUBSTANCE 3 HAS 7 DOUBLE BONDS OR CYCLES
1 NEW DOUBLE BOND(S) OR CYCLE(S)
  2 METHYL CH3
  ...
... PEAKS IN YOUR SPECTRUM= 9
  ... OF PEAKS FOR THIS GROUP= 2
  ... OF PEAKS FOR THIS GROUP= 1
  PEAKS: 120.5
GROUP TYPE: AROMATIC; MONOSUBST. BENZENE
  NO. OF CARBON ATOMS= 6; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 6
  INTERVALS: 96.0 -- 158.0
  PEAKS: 120.5 124.1 127.1 130.3 136.9 139.6
GROUP TYPE: AMIDE, DISUBSTITUTED
  NO. OF CARBON ATOMS= 1; POSSIBLE NO. OF PEAKS FOR THIS GROUP= 1
  INTERVALS: 150.0 -- 180.0
  PEAKS: 163.5
NO MORE REASONABLE FITS

The second example is p-(dimethylamino)benzaldehyde. The IR package identified five different organic groups. Besides the actual groups present (methyl, para-disubstituted aromatic, aldehyde, and tertiary amine), the group CH was suggested by the system. The NMR package added a new group to the list, namely, a quaternary carbon group. These possible groups must fit the molecular weight of the substance, given by the molecular ion peak in the mass spectrum. When the option "presence unknown" was taken for all possible organic groups and no molecular fragments were considered from the mass spectrum, the system proposed just one set of possibilities: two groups methyl, one group aromatic para disubstituted, one aldehyde aromatic, one amine tertiary

aliphatic (Table 111). This offers only one possible arrangement, namely, p-(dimethylamino) benzaldehyde. the CPU time used by this example on the VAX computer was 9 s and on the IBM PC-AT was approximately 2 min. The third example was 4-aminoantipyrine (l-phenyl-2,3dimethyl-4-amino-5-pyrazolone).Neither pyrazole nor pyrazolone were contained in the data base, and the substance itself can have a phenole-betaine structure. The IR package of the program offered 14 different organic groups. The NMR package used also the parity of number of protons attached to each carbon in the molecule. The NMR spectrum is shown in Figure 2. Six of the organic groups suggested by IR were eliminated. When only the molecular mass was used as the input for the MS section, the system offered three possible results (Table IV). All the proposed groups were used in finding a set fitting all the requirements of the molecule by taking the option "presence unknown". The first possibility indicated by the system and shown in Table IV contains two groups of primary aromatic amines. No tertiary amine recognized by the IR section is selected in this set by the com-

1212

ANALYTICAL CHEMISTRY, VOL. 59, NO. 8, APRIL 15, 1987

L 00

18C

163

141

12r

:oc

Figure 2. Experimental 13C NMR spectrum of 4-aminoantipyrine.

bination routine. I t is however difficult t o reject this case because the C-N stretch bands appear in amides in similar positions to those seen in amines and a false recognition of the tertiary amine can be suspected. The NMR spectrum gives no direct information for that group. The presence of the proper heterocycle in the data base would certainly have improved the situation. The second possibility offered by the system can be rejected by inspecting the NMR spectrum. Considering the monosubstituted benzene group present, there are two more unexplained NMR peaks in the region 110-140 ppm, showing carbons with an even number of hydrogens attached. The system indicates six possible NMR peaks for the benzene ring but the part considering the hydrogen number parity is not enough developed to be able to reject the case. The rejection would have happened, for example, for a case with seven peaks in the region. The remaining (third) possibility corresponds to the true substance, and no conflict with the NMR spectrum is noticed by the system. Assembling groups in the indicated numbers and positions remains to be done, but the system has reduced the range of search considerably. The VAX CPU time required by this example was less than 20 s. The same example run on the IBM PC-AT required less than 5 min. CONCLUSIONS The described expert system is an original program designed to assist in the interpretation of IR, 13C NMR, and mass spectra, which we frequently used in chemical organic analysis. Very good results have been obtained in the interpretation of standard spectra. The application of this system to unknown materials has proven very useful. The system can be used successfully even by relatively inexperienced personnel or by specialists in order to narrow the area of search in identification of organic compounds. 
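How a mass constraint narrows the set of candidate groups can be illustrated with the second example, p-(dimethylamino)benzaldehyde. The sketch below is only an illustration under stated assumptions, not the system's combination routine: the group list, the function name `mass_fits`, and the bound of three copies per group are hypothetical, the group masses are standard nominal values, and the target mass 149 is computed here from the formula C9H11NO rather than quoted from the paper.

```python
# Illustrative sketch only: the group list, names, and search bound are
# assumptions of this example; masses are nominal integer group masses.
GROUP_MASSES = {
    "methyl CH3": 15,
    "CH": 13,
    "quaternary C": 12,
    "aromatic para-disubstituted C6H4": 76,
    "aldehyde CHO": 29,
    "amine tertiary N": 14,
}


def mass_fits(masses, target, max_count=3):
    """Yield every dict of group counts (0..max_count) whose mass is target."""
    names = list(masses)

    def extend(index, remaining, counts):
        # All groups considered: accept only an exact match of the mass.
        if index == len(names):
            if remaining == 0:
                yield dict(zip(names, counts))
            return
        mass = masses[names[index]]
        for n in range(max_count + 1):
            if n * mass > remaining:
                break  # adding more copies can only overshoot the target
            yield from extend(index + 1, remaining - n * mass, counts + [n])

    yield from extend(0, target, [])


if __name__ == "__main__":
    fits = list(mass_fits(GROUP_MASSES, 149))
    print(len(fits), "group combinations match a molecular mass of 149")
```

Under these assumptions the set reported in the text (two methyl, one para-disubstituted aromatic, one aldehyde, one tertiary amine) is among the fits, but it is not the only one. Mass alone is a weak filter; in the system described here the further inference rules, such as the count of double bonds or cycles and the NMR peak intervals, are what reduce the candidates to a single set.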
The main features of the system are the reduced data base, the three combined sources of information connected by logical inference rules, the ability to generate any molecule from fragments even if no previous data are available about the molecule, and the fine-tuning capability provided by the possibility to modify the initial data base so as to better fit a specific need.

LITERATURE CITED

(1) Razinger, M.; Penca, M.; Zupan, J. Anal. Chem. 1981, 53, 1107-1110.
(2) Razinger, M.; Penca, M.; Zupan, J.; Janezic, M. Fresenius' Z. Anal. Chem. 1982, 313, 496-499.
(3) Novic, M.; Zupan, J. Anal. Chim. Acta 1983, 151, 419-424.
(4) McDonald, R. S. Anal. Chem. 1982, 54, 1250-1259.
(5) Damen, H.; Henneberg, D.; Weimann, B. Anal. Chim. Acta 1978, 103, 289-302.
(6) Xu Yu Xin; Moldoveanu, S.; Lepadatu, C. I. Rev. Roum. Chim. 1983, 28, 83-89.
(7) Lebedev, K. S.; Tormyshev, V. M.; Derendyaev, B. G.; Koptyug, V. A. Anal. Chim. Acta 1981, 133, 517-525.
(8) Kwok, K. S.; Venkataraghavan, R.; McLafferty, F. W. J. Am. Chem. Soc. 1973, 95, 4185-4194.
(9) Duffield, A. M.; Robertson, A. V.; Djerassi, C.; Buchanan, B. G.; Sutherland, C. L.; Feigenbaum, E. A.; Lederberg, J. J. Am. Chem. Soc. 1080, 97, 2679-2683.
(10) Martinsen, D. P.; Song, Ban-Huat Mass Spectrom. Rev. 1985, 4, 461-490.
(11) Gray, N. A. B.; Nourse, J. G.; Crandell, C. W.; Smith, D. H.; Djerassi, C. Org. Magn. Reson. 1981, 15, 375-379.
(12) Novic, M.; Zupan, J. Anal. Chim. Acta 1985, 177, 23-33.
(13) Woodruff, H. B. TrAC, Trends Anal. Chem. (Pers. Ed.) 1984, 3, 72-75.
(14) Woodruff, H. B.; Smith, G. M. Anal. Chem. 1980, 52, 2321-2327.
(15) Woodruff, H. B.; Smith, G. M. Anal. Chim. Acta 1981, 133, 545-553.
(16) Tomellini, S. A.; Hartwick, R. A.; Stevenson, J. M.; Woodruff, H. B. Anal. Chim. Acta 1984, 162, 227-240.
(17) Gray, N. A. B.; Buchs, A.; Smith, D. H.; Djerassi, C. Helv. Chim. Acta 1981, 64, 458-470.
(18) Harmon, P.; King, D. Expert Systems; Wiley: New York, 1985.
(19) Dessy, R. E. Anal. Chem. 1984, 56, 1200A-1211A.
(20) Dessy, R. E. Anal. Chem. 1984, 56, 1312A-1332A.
(21) Sasaki, S.; Fujiwara, I.; Yamasaki, T. Anal. Chim. Acta 1980, 122, 87-91.
(22) Fujiwara, I.; Okuyama, T.; Yamasaki, T.; Abe, H.; Sasaki, S. Anal. Chim. Acta 1981, 133, 527-534.
(23) Zupan, J.; Munk, M. E. Anal. Chem. 1985, 57, 1609-1616.
(24) Pachler, K. G. R. J. Magn. Reson. 1972, 7, 442-448.
(25) Morris, G. A.; Freeman, R. J. Am. Chem. Soc. 1979, 101, 760-762.
(26) Pouchert, C. J. The Aldrich Library of Infrared Spectra; Aldrich Chem. Co.: Milwaukee, WI, 1975.
(27) Sadtler Standard Carbon-13 NMR; Sadtler Res. Lab.: Philadelphia, PA, 1978.


RECEIVED for review April 9, 1986. Resubmitted December 10, 1986. Accepted December 10, 1986.