

Anal. Chem. 1992, 64, 656-663

Expert System Based on Principal Components Analysis for the Identification of Molecular Structures from Vapor-Phase Infrared Spectra. 2. Identification of Carbonyl-Containing Functionalities

Erik J. Hasenoehrl, Jonathan H. Perkins,† and Peter R. Griffiths*

Department of Chemistry and Idaho Center for Hazardous Waste Remediation Research, University of Idaho, Moscow, Idaho 83843

Principal components analysis (PCA) has been used to develop a set of rules for an expert system that can distinguish compounds containing at least one carbonyl functionality from other compounds, based on the compound's vapor-phase infrared spectrum. Further rules have been developed that can subsequently be used to subclassify the carbonyl compound as an acid, ester, aldehyde, or ketone. Trials using the first rule show that the expert system correctly determines whether or not a compound contains a carbonyl functionality 98% of the time. Once a compound has been determined to contain a carbonyl group, the expert system correctly subclassifies the compound as a carboxylic acid, ester, aldehyde, or ketone in 96% of our trials.

*Author to whom correspondence should be sent.
†Present address: Mobil Research and Development Corporation, Paulsboro Research Laboratory, Paulsboro, NJ 08066.

INTRODUCTION

The application of artificial intelligence or expert systems to qualitative chemical analysis is an area of continuing importance. There are many situations where the automated interpretation of infrared spectra would be particularly useful. These situations share the need to interpret large collections of spectra. For example, routine laboratory analysis of environmental samples by GC/FT-IR can produce spectra at the rate of thousands per day. Manual interpretation of each of these spectra would add a significant burden to the laboratory's cost of operation. The current method of choice for automated spectral interpretation is library searching,1 where the unknown sample's spectrum is compared quantitatively with each spectrum in the library. The identity of the library-entry spectrum that is most similar to that of the unknown is assigned to the unknown. This approach yields the incorrect answer when the unknown is not represented in the library, although the closest "hit" resulting from the library search may be structurally similar to the unknown. Certainly a database of this type is of little use to a synthetic organic chemist investigating novel structures. Furthermore, a library user is somewhat limited by the range of conditions under which the spectra were collected and by the signal-to-noise ratio of the spectra. A more robust approach to automated interpretation is an expert system. An expert system is a set of rules that codify a human expert's knowledge. An expert system should be able to recognize and accommodate the conditions under which the spectrum was acquired and should be able to identify, or at least to characterize, novel compounds. Early approaches toward the development of an expert system for IR spectra

were based on binary spectral interpretation techniques developed by Isenhour and co-workers.2 The program for the analysis of infrared spectra (PAIRS)3 has also been used for the interpretation of spectra despite the need to input parameters defining each band in the spectrum. Problems arising from variations in the assignment of band parameters of IR spectra by different human operators can be circumvented by curve-fitting. However, accurate curve-fitting of spectral data can take a great deal of computer time and is particularly difficult for vapor-phase spectra, where bandshapes are not well defined.4 Van der Maas and co-workers have investigated the use of artificial intelligence for automated spectral interpretation.5 They have devised a system that correlates infrared spectra with the structural units or "superatoms" present in a molecule. Each superatom is assigned a unique symbol (similar to the Wiswesser line notation). A rule generator then combines superatoms based on spectral data until the spectral region of interest has been explained. All of these techniques rely on the development of rules for spectral interpretation. An alternative approach involves the application of principal components analysis (PCA) as the basis of an expert system for the analysis of spectroscopic data. Mass spectral data have been evaluated using PCA.6,7 PCA has also been used as a prefilter in a library searching routine for GC-IR data.8 PCA also provides an efficient means to quantitatively describe a rule that can distinguish the presence of a functional group in a compound from its mid-infrared spectrum, as described in the first paper in this series.9 The protocol to develop a rule is quite simple. A training set is selected that is composed of the IR spectra of two classes of compounds: those containing the functional group of interest and those without it. After the judicious use of autoscaling and feature weighting of the spectra, PCA is performed.
The sample scores in the first and second (and sometimes third) principal components should show a distinct separation of the two classes of compounds. A line that defines the separating rule can usually be determined visually. A large number of spectra (~1000) is projected onto the scores plot to determine the best separation line. It is also possible to define a bracket about the decision line that defines a region where the presence or absence of the functional group cannot be determined unequivocally. Each rule developed in this way answers whether or not a certain functional group is present in the compound. This task is simpler than determining the complete structure. Supplementary structural questions, such as the arrangement of the carbon skeleton, may be answered better by mass spectrometry. The expert system described in this series of papers produces a list of functional groups deemed to be present in the unknown compound. In general the presence of functional groups is not exclusionary, e.g., a molecule can contain both an alcohol
© 1992 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 64, NO. 6, MARCH 15, 1992

and a carbonyl group. As such, the general goal of using PCA-based rules is simply to answer a list of questions. However, there are circumstances where the questions are interrelated (e.g., the absence of a carbonyl group necessitates the absence of an ester). Furthermore, there are efficiency considerations; for example, the presence of aromaticity should be tested before any questions regarding aromatic substitution patterns. Thus in a working expert system there will be an organized tree structure that involves a logical progression of questions. The arrangement of rules in the expert system is not trivial. This paper describes several ways in which the individual rules can be combined.


THEORY

The six-step protocol for generating rules for the expert system developed in our previous paper includes (1) selection of a suitable small training set from a large library, (2) autoscaling the training set, (3) calculating the feature weight spectrum of the training set, and (4) performing the PCA. (5) The first 1000 spectra from a library of vapor-phase IR spectra are then projected onto two principal components, and (6) a classification line is determined. Unless otherwise noted, this protocol is used for all classifications. The spectra used for the small training set (step 1) were selected from the Sadtler (Philadelphia, PA) vapor-phase IR spectral library. Twenty-five spectra of compounds containing the functional group of interest (category I) and 25 spectra of compounds without that functional group (category II) comprise the training set. These spectra were picked with the intent of constructing a robust training set. The spectra are autoscaled (step 2) by subtracting the mean spectrum of the training set from each sample spectrum, and each measurement is then divided by the square root of the sum of squares (SQSS) at the corresponding wavelength. Feature weighting (step 3) involves scaling the data such that measurements that provide the greatest variance between the two categories are enhanced. A feature weight, w_k(I,II), is calculated for each wavelength as follows:10
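The displayed equation did not survive in this copy. A plausible reconstruction, assuming the variance-weight form widely used in chemometrics, is given here as a sketch rather than as the authors' exact expression. It uses the symbols defined in the text that follows, plus the class means (written here with overbars, a notation introduced for this sketch), and it reduces to w = 1.0 when the two classes have identical distributions at wavenumber k:

```latex
w_k(\mathrm{I},\mathrm{II}) =
\frac{\dfrac{\sum x_{\mathrm{I}}^{2}}{N_{\mathrm{I}}}
    + \dfrac{\sum x_{\mathrm{II}}^{2}}{N_{\mathrm{II}}}
    - \dfrac{2 \sum x_{\mathrm{I}} \sum x_{\mathrm{II}}}{N_{\mathrm{I}} N_{\mathrm{II}}}}
     {\dfrac{\sum \left(x_{\mathrm{I}} - \bar{x}_{\mathrm{I}}\right)^{2}}{N_{\mathrm{I}}}
    + \dfrac{\sum \left(x_{\mathrm{II}} - \bar{x}_{\mathrm{II}}\right)^{2}}{N_{\mathrm{II}}}}
```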

where x_I and x_II are measurements at wavenumber k for the samples of class I and II, and N_I and N_II are the number of samples in the training set of class I and II, respectively. This parameter will subsequently be referred to simply as w. The greater the discrimination ability of a measurement at a particular wavenumber, the greater the feature weight; if a measurement has no discriminating power, w = 1.0. Three functions of the calculated feature weights have been investigated in this study: w, w², and (w - 1)². A PCA is then performed on each weighted training set. Scatter plots of pairs of principal components 1 through 4 are plotted, and the graph showing the best separation is used for the classification of the unknown spectra. Usually the categories within the small training sets are well separated, and any classification line would be somewhat arbitrary because many separating lines can be drawn while still maintaining accurate classification. To enhance the accuracy of the classifications, a much larger set of spectra, in this case the first 1000 spectra in the Sadtler vapor-phase IR library, is projected onto the two principal components deemed best for classification purposes. The projection is accomplished by autoscaling and weighting the new spectra using the training set mean, SQSS, and w spectra. The dot product of the scaled unknown spectrum and the loading vector associated with the PC used in the classification is then calculated and divided by the singular value of the principal component. This value is the projected score of the spectrum. A classification line is then drawn visually to

Figure 1. Three expert system trees for the classification of the infrared spectra of carbonyl-containing compounds: (a, left) a line-by-line tree, (b, center) a two-tiered tree, (c, right) a branched tree.

provide the optimal separation between the two classes. To summarize the results of these classifications we have used a 2 × 2 classification matrix. All correct classifications appear on the primary diagonal, and the sum of this diagonal divided by the sum of all elements gives the total correct rate for a given classification. The off-diagonal elements represent the number of spectra that were misclassified. Two types of structures for such an expert system based on PCA can be conceived: a list structure (or line-by-line) and a tree structure. For this classification, a list structure and two variations of the tree structure, one that separates into similar structural groups (two-tiered tree) and one that separates one class at a time (branched tree), were tested (see Figure 1). The list structure for an expert system is possible when the answers covered by the system are not mutually exclusive. In a classification scheme based on a list structure, the unknown's spectrum is tested against each rule, and the results from the prior rules do not affect the progression through the expert system. An example of such a structure for the classification of carbonyl compounds is shown in Figure 1a. In this case, the unknown's spectrum is first tested with a rule developed for esters. The rule will determine whether or not an ester functionality is present in the unknown compound. Regardless of the answer, the spectrum is then tested with the carboxylic acid rule, and so on. The final result of such a list-structured expert system would be a list of the functionalities determined to be present in the compound. The distinct advantage of the list approach is that it implicitly recognizes that functionalities are not exclusive, e.g., a compound can contain both an ester and a ketone functionality. A compound that contains two similar functionalities (e.g., ester and ketone) would, in theory, give a positive response for both rules.
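The list structure can be pictured with a short Python sketch. It is only an illustration under stated assumptions: each rule is modeled as a threshold on a precomputed score, and the names `classify_list` and `toy_rules`, the score keys, and the zero thresholds are all hypothetical stand-ins for the PCA rules and decision lines used in the paper.

```python
from typing import Callable, Dict, List

# A rule answers one question: is this functional group judged present?
Rule = Callable[[dict], bool]


def classify_list(spectrum: dict, rules: Dict[str, Rule]) -> List[str]:
    """Apply every rule, ignoring earlier answers, and collect the positives."""
    return [name for name, rule in rules.items() if rule(spectrum)]


# Hypothetical rules: each "spectrum" is a dict of projected scores and each
# rule is a threshold at 0.0 (invented value, not a decision line from the paper).
toy_rules: Dict[str, Rule] = {
    "ester": lambda s: s["ester_score"] > 0.0,
    "carboxylic acid": lambda s: s["acid_score"] > 0.0,
    "aldehyde": lambda s: s["aldehyde_score"] > 0.0,
    "ketone": lambda s: s["ketone_score"] > 0.0,
}

# A made-up keto-ester: both the ester rule and the ketone rule respond.
keto_ester = {
    "ester_score": 0.10,
    "acid_score": -0.10,
    "aldehyde_score": -0.20,
    "ketone_score": 0.30,
}
found = classify_list(keto_ester, toy_rules)
print(found)
```

Because every rule is always evaluated, the cost grows with the number of rules, and two rules for similar groups can both respond to the same spectrum.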
There are, however, two disadvantages of this approach that can be illustrated by a consideration of the spectra of compounds containing the C=O functionality. Firstly, the differences between the IR spectrum of a ketone and an ester are much smaller than the differences between the IR spectrum of either a ketone or an ester and that of a compound that does not contain a carbonyl group. It then becomes possible that an IR spectrum of a ketone could be classified both as an ester and a ketone because the spectra are quite similar. Secondly, the testing of every spectrum with each rule is much more time-consuming than either the two-tiered tree or branched tree structure discussed below. In a two-tiered tree structure (Figure 1b), the first step involves testing an IR spectrum for the presence of any C=O functionality. If an unknown is shown to contain a C=O group, the question as to whether it contains an R'COOR structural subunit (ester or carboxylic acid) is posed. If the


response is positive for R'COOR, then the question as to whether the unknown is a carboxylic acid (R' = alkyl or aryl, R = H) or an ester (R' and R = alkyl or aryl) is asked. A negative response for R'COOR leads to the question as to whether the unknown is a ketone or an aldehyde. The disadvantage of the two-tiered tree is that this structure assumes that functionalities are mutually exclusive (e.g., if a compound contains an ester functionality this structure assumes it therefore cannot contain an acid functionality). For example, the IR spectrum of a keto-ester might give a positive response for the ester but not the ketone because once a positive response for the ester has been established no more rules are tested. The branched tree (Figure 1c) is similar to the two-tiered tree, but branches into simpler structural units. The first question in both trees is the same: does the infrared spectrum indicate the presence of the carbonyl functionality? Once it has been established that a compound contains the C=O group, the branched tree determines whether the IR spectrum is that of a carboxylic acid. If the response is positive the expert system classifies the unknown spectrum as a carboxylic acid. If the response is negative, the next question along the tree (i.e., is it an ester?) is tested. A positive response classifies the unknown IR spectrum as an ester, while a negative response leads to the next question, and so on until a positive response occurs. If the unknown spectrum is classified as a spectrum of an ester, it can then be subclassified as to the type of ester functionality it contains: RCO2R, RCO2Ar, ArCO2Ar, or ArCO2R, where R = alkyl and Ar = aryl. The disadvantage of the branched tree is similar to that of the two-tiered tree, namely that a compound containing both -COOH and -COR (R = H, OR', or R'') structural subunits will only give the correct answer for the COOH subunit because it is first in the tree.
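The two tree structures can be sketched in Python as follows. This is a toy model, not the authors' program: a "compound" is represented by a set of labels, each rule simply looks a label up (the hypothetical helper `has`), and the order of the aldehyde and ketone questions in the branched tree is an assumption, since the text only says "and so on".

```python
from typing import Callable, FrozenSet, List, Tuple

Rule = Callable[[FrozenSet[str]], bool]


def has(group: str) -> Rule:
    """Toy rule: 'detects' a group by looking it up in a set of labels."""
    return lambda compound: group in compound


def classify_two_tiered(compound, carbonyl: Rule, rcoor: Rule,
                        acid: Rule, aldehyde: Rule) -> str:
    """Carbonyl? then R'COOR? then acid vs ester, or aldehyde vs ketone."""
    if not carbonyl(compound):
        return "non-carbonyl"
    if rcoor(compound):
        return "carboxylic acid" if acid(compound) else "ester"
    return "aldehyde" if aldehyde(compound) else "ketone"


def classify_branched(compound, carbonyl: Rule,
                      ordered_rules: List[Tuple[str, Rule]]) -> str:
    """Carbonyl? then one class at a time until the first positive answer."""
    if not carbonyl(compound):
        return "non-carbonyl"
    for label, rule in ordered_rules:
        if rule(compound):
            return label
    return "carbonyl, type undetermined"


branched_order = [
    ("carboxylic acid", has("acid")),
    ("ester", has("ester")),
    ("aldehyde", has("aldehyde")),  # assumed position in the tree
    ("ketone", has("ketone")),      # assumed position in the tree
]

keto_ester = frozenset({"carbonyl", "RCOOR", "ester", "ketone"})
keto_acid = frozenset({"carbonyl", "RCOOR", "acid", "ketone"})
alcohol = frozenset({"alcohol"})

two_tiered_result = classify_two_tiered(
    keto_ester, has("carbonyl"), has("RCOOR"), has("acid"), has("aldehyde"))
branched_result = classify_branched(keto_acid, has("carbonyl"), branched_order)
print(two_tiered_result, "|", branched_result)
```

In both toy calls the ketone group goes unreported, which is the mutual-exclusivity limitation described above.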
Once a compound has been classified as containing a carbonyl group, simply changing the order in which each rule in the branched tree structure is applied will allow compounds with two different carbonyl functional groups to be recognized. The distinct advantage of the latter two tree structures (Figure 1b,c) is that a much lower error rate is possible.

EXPERIMENTAL SECTION

The spectral library used in this work was leased from the Sadtler Research Division of Bio-Rad Laboratories (Philadelphia, PA). The full library consists of 9200 vapor-phase infrared spectra that had been measured from 4000 to 449 cm⁻¹ at a resolution of 4 cm⁻¹. The first 2000 spectra from the library were used in this analysis. These spectra were converted from hexadecimal to ASCII format. Each spectrum was visually inspected, and its name and structure were confirmed. Base-line correction of the reference infrared spectra was performed if necessary. The PCA was performed on training sets consisting of the spectra of 25 compounds of interest (category I) and 25 counterexample compounds (category II). The spectra were selected for the training set based on their Wiswesser line notation. Each training set was selected with the intent of spanning a broad range of compounds. Each spectrum was "deresolved" (reduced in size) by removing three out of every four data points. After deresolution the first 458 data points (4000-470 cm⁻¹) were used to build the training set. All training sets were autoscaled and feature weighted with a w, w², or (w - 1)² function. The PCA was then performed on each scaled data set. Only the data set yielding the best separation is shown in this paper and used in the expert system. All computer programs were run on a 25-MHz 386 computer equipped with a math coprocessor.
The program used to perform the PCA was written using Turbo Pascal software (Borland Int., Scotts Valley, CA) and the non-iterative partial least squares (NIPLS) algorithm.11-13 Programs used to project the validation sets onto the training sets were written in Turbo C (Borland Int.). The resulting data were analyzed using Lotus 1-2-3 (Lotus Development Corp., Cambridge, MA).
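The preprocessing and projection steps can be sketched with NumPy. This is a minimal illustration under several assumptions, not the authors' Turbo Pascal or Turbo C code: the PCA is done with a singular value decomposition in place of their algorithm, the feature weight uses the common variance-weight form (the paper's own equation is not reproduced in this copy), the data are random numbers, and all function names are hypothetical.

```python
import numpy as np


def deresolve(spectra, keep_every=4, n_points=458):
    """Keep one point in four, then the first n_points of each spectrum."""
    return spectra[:, ::keep_every][:, :n_points]


def autoscale(X):
    """Mean-center each channel and divide by its root sum of squares (SQSS)."""
    mean = X.mean(axis=0)
    centered = X - mean
    sqss = np.sqrt((centered ** 2).sum(axis=0))
    sqss[sqss == 0] = 1.0  # guard against flat channels
    return centered / sqss, mean, sqss


def variance_weights(X1, X2):
    """Assumed variance-weight form of w; equals 1 when the classes coincide."""
    numerator = ((X1 ** 2).mean(axis=0) + (X2 ** 2).mean(axis=0)
                 - 2.0 * X1.mean(axis=0) * X2.mean(axis=0))
    denominator = X1.var(axis=0) + X2.var(axis=0)
    return numerator / denominator


def project(spectra, mean, sqss, weights, loadings, singular_values):
    """Scale new spectra with training statistics, then dot with the loadings
    and divide by the singular values to obtain projected scores."""
    scaled = (spectra - mean) / sqss * weights
    return scaled @ loadings.T / singular_values


# Toy data standing in for 25 + 25 library spectra (random values).
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 1842))
raw[:25, 1000] += 5.0                 # class I carries one extra "band"

X = deresolve(raw)                    # raw channel 1000 becomes channel 250
scaled, mean, sqss = autoscale(X)
w = variance_weights(scaled[:25], scaled[25:])
weighted = scaled * w ** 2            # the w^2 weighting function
U, S, Vt = np.linalg.svd(weighted, full_matrices=False)

# Projecting the training spectra themselves reproduces their PCA scores.
scores = project(X, mean, sqss, w ** 2, Vt[:2], S[:2])
print(X.shape, int(np.argmax(w)))
```

With this convention the scores are the left singular vectors, which is why the dot product with a loading vector is divided by the corresponding singular value.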

Figure 2. Scatter plot of the 1st and 2nd principal components of the training set spectra after PCA. Esters are designated with 1's and non-esters are designated with 2's. The 90% Bayesian volume lines are also plotted. Some of the outliers are labeled. (x-axis: Principal Component 1.)

Table I. Classification Matrix for the Separation of Esters and Non-Esters Using the First 1000 Library Spectra (the Separation Was Identified by the Linear Discrimination Line Determined Visually from the 1000 Spectra)

                          authentic ester    authentic non-ester
classified ester                170                  47
classified non-ester             44                 739

total correct rate: 91%

RESULTS

(a) Line-by-Line Classification. Initially, it was hoped that the best separations would be obtained by directly separating the reference spectra of carbonyl compounds into their specific class (ester, carboxylic acid, ketone, etc.) via the list structure (Figure 1a). A training set containing 25 IR spectra of esters and 25 IR spectra of compounds not containing an ester group was autoscaled, feature weighted with w, w², and (w - 1)², and reduced to its principal components (PCs). The best separation was found with w²; Figure 2 shows the scatter plot of this analysis, in which PC 1 is plotted against PC 2. It can be seen that most of the separation is achieved by PC 1. The scatter plot shows that for the training set there is some overlap between the IR spectra of esters and the IR spectra of other compounds. These outlier IR spectra of non-esters are primarily the spectra of other carbonyl-containing compounds. It is not surprising that this approach appears to have difficulty in discriminating between the infrared spectra of esters and carboxylic acids in the vapor phase, where hydrogen bonding is minimal. The first 1000 entries in the Sadtler library were then projected onto principal components 1 and 2 to validate the training set. A linear discriminating line was estimated by eye to give the best separation. This line was drawn parallel to the y-axis and intersected the x-axis at 0.020. Note that this separation uses only PC 1. The results of the projection are shown in Table I. The associated probabilities yield the following information about the classification.

I. If an unknown is classified as a non-ester there is a 94% chance that it is a non-ester.
II. If an unknown is classified as an ester there is a 78% chance that it is an ester.
III. There is a 94% chance that a non-ester will be classified as a non-ester.
IV. There is a 79% chance that an ester will be classified as an ester.
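The four probabilities and the total correct rate follow directly from the counts in Table I. The short Python check below assumes the table is read with rows as the assigned class and columns as the authentic class, an arrangement chosen here because it reproduces the quoted percentages; the variable names are illustrative only.

```python
# Table I counts: (classified as, authentic class) -> number of spectra.
counts = {
    ("ester", "ester"): 170,
    ("ester", "non-ester"): 47,
    ("non-ester", "ester"): 44,
    ("non-ester", "non-ester"): 739,
}

total = sum(counts.values())
classified_ester = counts[("ester", "ester")] + counts[("ester", "non-ester")]
classified_non = counts[("non-ester", "ester")] + counts[("non-ester", "non-ester")]
authentic_ester = counts[("ester", "ester")] + counts[("non-ester", "ester")]
authentic_non = counts[("ester", "non-ester")] + counts[("non-ester", "non-ester")]

# I.   classified non-ester -> really a non-ester
p_i = counts[("non-ester", "non-ester")] / classified_non
# II.  classified ester -> really an ester
p_ii = counts[("ester", "ester")] / classified_ester
# III. authentic non-ester -> classified non-ester
p_iii = counts[("non-ester", "non-ester")] / authentic_non
# IV.  authentic ester -> classified ester
p_iv = counts[("ester", "ester")] / authentic_ester

# Total correct rate: sum of the primary diagonal over the sum of all elements.
total_correct = (counts[("ester", "ester")] + counts[("non-ester", "non-ester")]) / total

for label, p in [("I", p_i), ("II", p_ii), ("III", p_iii), ("IV", p_iv),
                 ("total correct", total_correct)]:
    print(f"{label}: {100 * p:.0f}%")
```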
These probabilities (particularly II and IV) are not high enough for this separation to be used as the basis of a rule in an expert system. Of the false positives associated with this separation, the overwhelming majority are carboxylic acids, aldehydes, and ketones because of the similarity of the


Figure 3. Three feature weight spectra plotted against wavenumber for the ester/non-ester separation. Spectrum (A) is for a randomly picked training set, spectrum (B) is for a training set with a small bias toward other carbonyl-containing compounds, and spectrum (C) is the feature weight spectrum for a training set heavily biased toward other carbonyl-containing compounds. (x-axis: Wavenumber (cm⁻¹), 4000 to 500.)

intense ν(C=O) frequency. The C=O stretching mode of an ester in the vapor phase absorbs in the region between about 1720 and 1765 cm⁻¹, while the ν(C=O) of carboxylic acids, aldehydes, and ketones typically lies in the regions 1755-1780 cm⁻¹, 1710-1750 cm⁻¹, and 1690-1740 cm⁻¹, respectively. The significant overlap of the carbonyl stretching modes of all carbonyl compounds in the vapor phase makes it difficult to achieve a low error rate for the direct classification of compounds in each class. The selection of compounds used for the training set is critical when the list method is applied. Figure 3 shows the feature weight spectra generated from three different training sets for the separation of esters from other compounds. The top feature weight spectrum (Figure 3A) was calculated when the 25 IR spectra that form counterexamples of esters were chosen randomly from the non-ester entries in the Sadtler library. This training set included three carbonyl compounds that are not esters. In this case, the carbonyl stretching vibration is the largest feature in the feature weight spectrum, which is intuitively obvious because it is typically the largest feature in the IR spectrum of an ester of low molecular weight. The C-O stretch at 1150 cm⁻¹ is also strongly evident in the feature weight spectrum. If the 25 counterexamples are selected in a nonrandom manner such that there is a small bias in the counterexamples toward ketones, aldehydes, and carboxylic acids, the intensity of the feature weight in the C=O stretching region becomes much smaller and that of the C-O stretch increases. Such a training set is represented by the feature weight spectrum shown in Figure 3B. The effect of this bias in the training set is to increase the capability of the PCA to distinguish between esters and ketones, aldehydes, and carboxylic acids.
However, since the intensity of the feature weight spectrum increases around 1150 cm⁻¹, bands in the IR spectrum of all compounds that absorb strongly in this region (e.g., ethers and alcohols) will also be weighted. In this case the ability of the PCA to separate esters from compounds that do not contain C=O functionalities but still have absorption bands in the C-O stretching region is reduced. The lower feature weight spectrum (Figure 3C) was generated using 25 counterexamples that were heavily weighted toward ketones, aldehydes, and carboxylic acids. The effect of this training set on the feature weight spectrum is to decrease the intensity of the C=O stretching band and to further increase the C-O stretching intensity, thus giving even better discrimination between the IR spectra of esters and the spectra

Figure 4. Scatter plot showing the 3rd and 1st PCs for the spectra of carbonyl compounds (1's) and counter-examples (2's). The 90% Bayesian volume lines are also plotted. (x-axis: Principal Component 3.)

of other compounds containing C=O groups than the prior example. However, discrimination between esters and non-carbonyl compounds that have absorption bands in the C-O stretching region (e.g., ethers and alcohols) was diminished. This result was true no matter whether the separation was attempted using the w, w², or (w - 1)² feature weight spectrum.

(b) Classification of Carbonyl-Containing Compounds. The difficulty with the list approach is that the counterexample set contains spectra that are both very different from and very similar to the category I set, so that it is difficult to separate esters from non-carbonyls and from non-ester carbonyls simultaneously. Since this method was not as successful as first hoped, we believed that a better procedure would be to separate all of the IR spectra of carbonyl-containing compounds from the set of all other IR spectra and then to subclassify the carbonyls into carboxylic acids, esters, ketones, and aldehydes by means of a tree-structured expert system. Once an analyte had been classified as a carbonyl or non-carbonyl, further rules could be developed to identify the type of carbonyl functionality each compound contained more accurately and precisely. The training set constructed to separate the IR spectra of carbonyl-containing compounds from the IR spectra of non-carbonyl compounds consisted of the IR spectra of 50 compounds, 25 of which contained a C=O group while the other 25 did not. Each spectrum was autoscaled, feature weighted with a w² function, and then separated into its principal components. Excellent separation of carbonyls and non-carbonyls occurred using only one principal component, PC 3. The plot of PC 3 against PC 1 is shown in Figure 4, in which 1's designate compounds containing a carbonyl functionality and 2's designate compounds containing no carbonyl functionality. The large ellipses are set to encompass 90% of the volume of the bivariate normal distributions fit to each class.
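One way such a 90% ellipse can be computed is sketched below with NumPy. The sketch assumes the ellipse is the contour of a bivariate normal fitted to a class's two scores. For two dimensions, the squared Mahalanobis distance that encloses a fraction p of the distribution is -2 ln(1 - p). The function names and the toy scores are hypothetical and do not come from the paper.

```python
import numpy as np


def mahalanobis_sq(points, mean, cov):
    """Squared Mahalanobis distance of each 2-D point from the class center."""
    diff = np.atleast_2d(points) - mean
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)


def coverage_threshold(coverage=0.90):
    """Squared radius enclosing `coverage` of a bivariate normal
    (chi-square quantile with two degrees of freedom)."""
    return -2.0 * np.log(1.0 - coverage)


def inside_ellipse(points, class_scores, coverage=0.90):
    """True for points inside the coverage ellipse fitted to class_scores."""
    mean = class_scores.mean(axis=0)
    cov = np.cov(class_scores, rowvar=False)
    return mahalanobis_sq(points, mean, cov) <= coverage_threshold(coverage)


# Toy scores for one class (invented mean and covariance).
rng = np.random.default_rng(1)
toy_scores = rng.multivariate_normal(
    mean=[0.10, -0.05],
    cov=[[0.010, 0.004], [0.004, 0.020]],
    size=20000,
)
fraction_inside = inside_ellipse(toy_scores, toy_scores).mean()
print(f"threshold = {coverage_threshold():.3f}, inside = {fraction_inside:.3f}")
```

A point falling outside a class's ellipse, such as the outlier discussed next, lies farther from the class center than 90% of a fitted normal population would.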
There is one outlier carbonyl amongst the non-carbonyls. This compound is 2'-hydroxybutyrophenone, which forms an intramolecular hydrogen bond in the vapor phase that shifts the carbonyl peak to lower wavenumber (1653 cm⁻¹), well below the region where most C=O bonds of compounds in the vapor phase absorb. Interestingly, the best separation of carbonyls and non-carbonyls occurred using the third principal component. Usually, when using feature weights, the separation is expected to fall into the first and (occasionally) the second principal component. Nevertheless, it is easy to see from the line drawn in Figure 4 at PC 3 = 0 that all of the separation is occurring in this third principal component. One possible explanation for this occurrence is that the intragroup variance (the variance among carbonyls) is greater than the intergroup variance (the variance between carbonyls and non-carbonyls). The w² function does not increase the intergroup variance significantly


Table II. Names and ν(C=O) Wavenumber for the Carbonyl-Containing Compounds in the Training Set Used To Separate Carbonyls from Non-Carbonyls

name                                                  C=O stretch (cm⁻¹)
α,α'-diethyl-4,4'-stilbenediol diacetate              1784
2,4-dimethyl-3-pentanone                              1726
acetic acid, isobutyl ester                           1762
2-thiophenecarboxaldehyde                             1705
(2-hydroxyethyl)formamide                             1734
butyric acid, methyl ester                            1761
2,6,8-trimethyl-4-nonanone                            1724
enanthic acid                                         1780
3-acetylmorpholine                                    1690
3-pentanone                                           1730
2-ethylbutanal                                        1743
α-pentylcinnamaldehyde                                1707
2'-hydroxybutyrophenone                               1653
citral                                                1699
(2,6,6-trimethyl-2-cyclohexen-1-yl)-3-buten-2-one     1695
1-monoacetin                                          1769
myristic acid, isopropyl ester                        1747
α-chloroacrolein diacetate                            1790
ethylene glycol diacetate                             1769
carbonic acid, diethyl ester                          1767
linalool acetate                                      1755
N-methylformanilide                                   1713
1-naphthaldehyde                                      1711
phenol propionate                                     1782
coramine                                              1666

when compared to the intragroup variance. This is more easily seen when the molecular structures of the compounds in the training set are examined. The names and C=O stretching frequencies of the carbonyls that were used in the training set to distinguish between carbonyls and non-carbonyls are listed in Table II. This training set is composed of the infrared spectra of compounds that contain acid, aldehyde, ketone, and ester functionalities. The inclusion of the spectra of many types of carbonyl-containing compounds in the training set was designed to improve the overall robustness of the classification. The C=O stretching vibrations in these spectra vary in wavenumber by over 135 cm⁻¹. Esters and carboxylic acids both have characteristic C-O stretching vibrations while aldehydes and ketones do not. Although the spectra of carboxylic acids always have a strong O-H stretching vibration, the region around 3600 cm⁻¹ rarely appears strongly in the feature weight spectrum, as its presence could lead to the selection of alcohols as carbonyl compounds. The first 1000 spectra in the Sadtler library were projected onto PC 3 of the training set and subsequently used to train the classification system. A discriminating point (at PC 3 = 0.02) was determined by eye to separate these spectra. The first 1000 spectra contain 662 non-carbonyl compounds and 337 carbonyls. (One spectrum in this database had to be rejected because of an error in the spectral intensities.) Seventeen outlier carbonyls were classified as non-carbonyls, giving a chance of approximately 95% that if a compound is a carbonyl it will be classified as such. Of the 662 non-carbonyls, 55 were classified as carbonyls. Surprisingly, 34 of the misclassified compounds were halogenated compounds. The feature weight spectrum generated from the training set selected to classify carbonyls and non-carbonyls is shown in Figure 5.
The largest feature (centered at approximately 1750 cm⁻¹) can be assigned to the carbonyl stretch, the weaker feature centered at 1200 cm⁻¹ is due to the C-O stretch of esters and acids, and the feature centered at 710 cm⁻¹ is assigned to the C=O bend. This C=O bending feature can also weight a halogenated compound if its carbon-halogen stretching band overlaps the peak in the feature weight spectrum. For example, a C-F stretch that occurs near the

Figure 5. Feature weight spectrum for the training set used in the PCA shown in Figure 4. The maxima near 1750, 1350, and 710 cm⁻¹ can be attributed to the spectral features due to carbonyl compounds (C=O stretch, C-O stretch of esters and carboxylic acids, and C=O bend, respectively). (x-axis: Wavenumber (cm⁻¹).)

C-O stretch (≈1200 cm⁻¹) will be weighted. The carbon-halogen stretching mode of most molecules is very intense due to the strongly polar nature of the C-X bond. If the carbon-halogen stretch were significantly weaker, the feature weight would not affect the intensity enough to cause the unknown halogenated compound to be classified as a carbonyl. Similarly, an alcohol is not classified as a carbonyl even though it contains a C-O stretch because of the relative weakness of this mode. Even though several non-carbonyl outliers were encountered, the predictive capability is reasonable when the very large range of structures containing the C=O moiety is considered (see Table II). Of the 589 spectra classified as non-carbonyls, 9 were authentic carbonyls (false negatives), and of the 411 spectra classified as carbonyls, 83 were authentic non-carbonyls (false positives). The false positive rate was deemed too high for use in the expert system. We investigated the use of principal components 1 through 4 in our classification scheme (PCs higher than 4 represent mostly noise when the w² function is used). The evaluation of the increased number of PCs did not substantially improve the classification rates, however. Since the majority of the false positive compounds were halogenated, it was reasoned that it should be possible to distinguish the false positives from true positives by a second PCA rule. From among the samples identified as carbonyls (true and false positives), a second training set of 50 spectra was selected. In this training set 25 were true carbonyls and 25 were false positives. The spectra were autoscaled and feature weighted using (w - 1)². The first and second principal components showed a good separation between the classes of false and true positives.
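The counts quoted above for the first rule can be turned into rates with a few lines of Python, which makes it easier to see why the false positives were the concern. The helper `two_stage` is a hypothetical illustration of how a second rule could be chained behind the first so that it is consulted only for positives; it is not taken from the authors' program.

```python
# Counts quoted in the text for the first carbonyl rule.
classified_non_carbonyl, false_negatives = 589, 9
classified_carbonyl, false_positives = 411, 83

# Share of each assigned class that is wrong.
false_positive_share = false_positives / classified_carbonyl
false_negative_share = false_negatives / classified_non_carbonyl

correct = ((classified_carbonyl - false_positives)
           + (classified_non_carbonyl - false_negatives))
overall_correct = correct / (classified_carbonyl + classified_non_carbonyl)


def two_stage(first_rule_says_carbonyl: bool,
              second_rule_says_carbonyl: bool) -> str:
    """Hypothetical cascade: the second rule only re-examines positives."""
    if not first_rule_says_carbonyl:
        return "non-carbonyl"
    return "carbonyl" if second_rule_says_carbonyl else "non-carbonyl"


print(f"false positives among assigned carbonyls: {100 * false_positive_share:.1f}%")
print(f"false negatives among assigned non-carbonyls: {100 * false_negative_share:.1f}%")
print(f"overall correct: {100 * overall_correct:.1f}%")
```

About one in five spectra assigned to the carbonyl class by the first rule is a non-carbonyl, whereas the non-carbonyl assignments are almost always right, so a second check is worth applying only to the positives.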
It is interesting to note that these classifications are only loosely based on molecular structures, as the only link between the members of the false positive set is that they were all incorrectly identified as carbonyls by the first rule. All the samples identified as carbonyls were projected onto the scores plot (Figure 6). The samples falling to the left of line A are identified as carbonyls and those falling to the right of line A are classified as non-carbonyls that were misclassified by the first rule. For this validation data set (329 carbonyls and 83 non-carbonyls), 322 carbonyls and 79 non-carbonyls were correctly identified. Thus the first and second rules taken in conjunction operate as follows: If the first rule indicates that an unknown is a non-carbonyl, that classification stands. If the first rule claims that an unknown is a carbonyl, the second rule is invoked to double-check the result. If the second rule confirms that the unknown is a carbonyl, then the classification stands. If the second rule claims that

ANALYTICAL CHEMISTRY, VOL. 64, NO. 6, MARCH 15, 1992

Table III. Summary of Classifications of the IR Spectra of Carbonyl-Containing Compounds Using PCA for the Branched-Tree Structure (All Classifications Were Done Using the Linear Discriminating Line Determined Visually)

classification       set          classes (1 / 2)          correctly classified (1 / 2)   misclassified   total correct rate, %   comments
(a) carbonyl         -            carbonyl / noncarbonyl   322 / 659                      15, 3           98
                     validation   carbonyl / noncarbonyl   335 / 655                      5, 5            99
(b) carboxylic acid  projection   acid / non-acid          19 / 301                       2, 0            99
                     validation   acid / non-acid          23 / 263                       3, 4            98
(c) ester            projection   ester / non-ester        210 / 75                       9, 20           91
                     validation   ester / non-ester        201 / 88                       18, 6           92
(d) RCOOR' ester     projection   RCOOR' / R1COOR2         158 / 39                       1, 6            97                      RCOOR': R, R' = alkyl; R1COOR2: R1 and/or R2 ≠ alkyl (a)
(e) RCOOAr ester     training     RCOOAr / ArCOOR          25 / 25                        0, 0            100                     Ar = aryl
(f) aldehyde ketone  projection   aldehyde / ketone        16 / 71                        0, 2            98
                     validation   aldehyde / ketone        13 / 90                        2, 1            97
(g) aldehyde         training     RCHO / ArCHO             10 / 9                         0, 1            95                      R = alkyl; Ar = aryl (a)

a There are not enough spectra of this type to complete a similar validation to those described above.

Table IV. Classification Matrix for the Separation of RCOOR and RCOR Using a Linear Rule Determined by Eye, Given That the Unknown Has Been Previously Classified as Carbonyl by the Carbonyl Rule

                   classified RCOR   classified RCOOR
authentic RCOR     68                15
authentic RCOOR    31                221

total correct rate 86%

Figure 6. Scatter plot of the projected scores for the validation set used in the classification of outliers in the determination of carbonyl compounds (abscissa: principal component 1). Carbonyls are designated with 1's and noncarbonyls are designated with 0's. Line A is the separating line; compounds on the left of the line are classified as carbonyls while those to the right are classified as non-carbonyls.

the unknown is a non-carbonyl (opposite to the first conclusion), then the second rule takes precedence. The results of using these rules on the first 999 spectra are summarized in Table III, part a. An overall correct rate of 98% was achieved. These two rules were validated by the second 1000 spectra in the library. The results are shown in Table III, part b. An overall correct rate of 97% was confirmed with the following breakdown of the probabilities: I. If an unknown is classified as a non-carbonyl, there is a 96% chance that it is a non-carbonyl. II. If an unknown is classified as a carbonyl, there is a 98% chance that it is a carbonyl. III. There is a 99% chance that a non-carbonyl will be classified as a non-carbonyl. IV. There is a 92% chance that a carbonyl will be classified as a carbonyl.

(c) Two-Tiered-Tree Classification. The two-tiered tree was then used to separate the IR spectra of compounds that were classified as containing carbonyl functionalities into R'COOR (acids and esters) and R'COR (ketones and aldehydes), where R' is alkyl or aryl and R is any group, including H. Classification based on this tree was selected due to the structural similarities between acids and esters and between aldehydes and ketones. Twenty-five spectra of R'COOR-containing compounds and 25 IR spectra of R'COR-containing compounds were chosen for the training set. The six-step protocol summarized in the Theory section was followed, with PC's 1 and 2 showing the best separation. The classification matrix for this separation is shown in Table IV. A total correct rate of 86% was achieved, which is not acceptable for this expert system. This separation illustrates that structural similarity between compounds such as acids and esters does not necessarily imply spectroscopic similarity.

(d) Branched-Tree Classification. In an attempt to improve on this result, the branched-tree approach (Figure 1c) was then tested. To discriminate carboxylic acids from other carbonyls, the spectra of 25 carboxylic acids and 25 carbonyl compounds were chosen from the Sadtler library to build a training set. Again, PC 1 and PC 2 provided the best separation. The 322 authentic carbonyls that were classified as carbonyls after the initial step were then projected onto principal components 1 and 2. Table III, part b, shows the results of the classification matrix for this separation. The corresponding probabilities are as follows: I. If an unknown


Table V. C=O Stretching Regions for Various Ester Functionalities

ester functionality               ν(C=O) (cm⁻¹)
RCOOR', R and R' = alkyl          1750-1760
RCOOAr, R = alkyl; Ar = aryl      1750-1763
ArCOOR, Ar = aryl; R = alkyl      1740-1753
ArCOOAr', Ar and Ar' = aryl       ~1760
acetates                          1755-1810
lactones                          1750-1851

is classified as a non-acid carbonyl, there is a 99% chance that it is a non-acid carbonyl. II. If an unknown is classified as a carboxylic acid, there is a 100% chance that it is a carboxylic acid. III. There is a 100% chance that a non-acid carbonyl will be classified as a non-acid carbonyl. IV. There is a 90% chance that an authentic carboxylic acid will be classified as a carboxylic acid. An overall correct classification rate of 99% is obtained.

To validate the classification line determined by eye, the carbonyl-containing compounds in the second 1000 vapor-phase spectra (as determined by the first rules) were projected onto the principal components and plotted. Table III, part b, summarizes the outcome of this validation. An overall correct rate of 98% was calculated. The corresponding probabilities are as follows: I. If an unknown is classified as a non-acid carbonyl, there is a 99% chance that it is a non-acid carbonyl. II. If an unknown is classified as a carboxylic acid, there is an 88% chance that it is a carboxylic acid. III. There is a 99% chance that a non-acid carbonyl will be classified as a non-acid carbonyl. IV. There is an 85% chance that a carboxylic acid will be classified as a carboxylic acid.

For the classification of ester and ketone/aldehyde moieties, a training set was selected and the PCA was performed following our standard protocol. The best separation was obtained when the data set was scaled with the (w − 1)² function and when principal components 1 and 2 were used. The carbonyl spectra that were not classified as carboxylic acids by the above rule were then projected onto principal components 1 and 2. A separating line was determined visually. The results of this classification are shown in Table III, part c. An overall correct rate of 91% was achieved.
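The four probabilities quoted for each rule (I to IV) and the overall correct rate all follow from a 2 × 2 classification matrix. The short sketch below shows the arithmetic for the carboxylic acid projection. The function name is hypothetical, and the assignment of the four counts to cells (19 acids classified as acids, 2 acids classified as non-acids, 0 non-acids classified as acids, 301 non-acids classified as non-acids) is an assumption read from Table III, part b, chosen because it reproduces the percentages given in the text.

```python
def rule_probabilities(tp, fp, fn, tn):
    """Percentages I-IV and the overall correct rate from a 2 x 2 matrix.

    tp: positives classified as positive    fp: negatives classified as positive
    fn: positives classified as negative    tn: negatives classified as negative
    """
    return {
        "I: classified negative is negative": 100 * tn / (tn + fn),
        "II: classified positive is positive": 100 * tp / (tp + fp),
        "III: negative classified as negative": 100 * tn / (tn + fp),
        "IV: positive classified as positive": 100 * tp / (tp + fn),
        "overall correct rate": 100 * (tp + tn) / (tp + fp + fn + tn),
    }


# Carboxylic acid projection (cell assignment assumed, see the note above).
acid_projection = rule_probabilities(tp=19, fp=0, fn=2, tn=301)
rounded = {name: round(value) for name, value in acid_projection.items()}
```

Probabilities I and II depend on how many spectra of each class are present in the set, whereas III and IV depend only on how well each class is recognized; this is why a class with few members can show a low value of II even when misclassifications are rare.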
To validate the classification, the spectra in the second 1000 entries in the Sadtler library that were not classified by the previously described rules were projected onto PC 1 and 2 defined by the ester analysis. The pertinent probabilities are as follows: I. If an unknown is classified as a non-ester, there is an 83% chance that it is a non-ester. II. If an unknown is classified as an ester, there is a 97% chance that it is an ester. III. There is a 94% chance that a non-ester will be classified as a non-ester. IV. There is a 92% chance that an ester will be classified as an ester. A total correct rate of 92% was calculated from the validation of the classification of compounds that contain the ester functionality. The correct rate is relatively low due to the overlap of ν(C=O) of ketones and α-aryl esters.
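Each rule above assigns a projected spectrum to a class according to the side of a visually determined separating line on which its scores fall. A minimal sketch of that test is given below. The function names, the two points defining the line, and the example scores are all hypothetical; the actual lines used in this work were drawn by eye on the score plots and their coordinates are not reported here.

```python
def side_of_line(point, a, b):
    """Signed-area test: positive if `point` lies to the left of the directed line a -> b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def classify(score, a, b, left_label, right_label):
    """Label a (PC 1, PC 2) score pair by its side of the separating line.

    Points lying exactly on the line are given the right-hand label.
    """
    return left_label if side_of_line(score, a, b) > 0 else right_label


# Hypothetical vertical separating line at PC 1 = 0, directed upward.
line_start, line_end = (0.0, -1.0), (0.0, 1.0)
first = classify((-2.0, 0.3), line_start, line_end, "ester", "non-ester")
second = classify((1.0, 0.0), line_start, line_end, "ester", "non-ester")
```

Storing the line as two points keeps the rule independent of its slope, so the same test works for a vertical line and for an oblique one.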

It is well known that the C=O stretching vibration of esters is strongly dependent on the nature of the substituents. Nyquist reported that the ν(C=O) of esters in the vapor phase varies by over 100 cm⁻¹ (see Table V).14 Thus, if an unknown spectrum can be classified as an ester, it should be amenable to further classification based on the class of ester functionality it contains. Further classification of esters was, therefore, pursued. For the classification of RCOOR' where R and R' are both aliphatic, 25 reference spectra of these compounds and 25 spectra of other esters were selected for the training set. The training set was autoscaled and feature weighted with the three feature-weight functions. A PCA was performed on the training sets, and PC's 1 and 3 of the training set scaled

Figure 7. Scatter plot of PC1 and PC2 of the training set used for the classification of RCOOAr from ArCOOR' and ArCOOAr' esters (abscissa: principal component 1). RCOOAr esters are designated with 2's while ArCOOR' and ArCOOAr' esters are designated with 1's. Validating spectra are designated with X's if they belong to class 1, and Y's if they belong to class 2.

with (w − 1)² were determined to yield the best classification of the spectra contained within the training set. Spectra that had been classified as esters by the first PCA were then projected onto PC's 1 and 3. The optimal separating line was determined visually. The results of this classification are shown in Table III, part d.

Unknown esters that were classified as nonaliphatic were then further tested. The training set consisting of 25 RCOOAr and 25 ArCOOAr' esters was assembled and analyzed according to the standard protocol. The scatter plot of PC 1 vs PC 2 of the training set with (w − 1)² scaling is shown in Figure 7. The two ellipses are the Bayesian normal-distribution 90%-volume ellipses. The 1's represent the scores of compounds that contain either the ArCOOR' or the ArCOOAr' functionality, while 2's represent the scores of compounds that contain the RCOOAr functionality. The X's and Y's represent projected ester spectra used to validate the training set. Compounds designated with X's and Y's contain functionalities corresponding to the classes of 1's and 2's, respectively. All X's and Y's are correctly classified.

Ketones and aldehydes were then separated by another PCA. The best separation was obtained by plotting principal components 1 and 2 of the (w − 1)² weighted training set. The remaining spectra (those which are either ketones or aldehydes) were projected onto principal components 1 and 2, and a separating line was determined visually. The resulting classification matrix is shown in Table III, part f. A total correct rate of 98% is calculated from these data. Those entries in the second 1000 spectra that were classified as aldehydes and ketones were projected onto PC 1 and 2 to validate the visual classifying line. The probabilities corresponding to the validation set are as follows: I. If an unknown is classified as a ketone, there is a 98% chance that it is a ketone. II. If an unknown is classified as an aldehyde, there is an 87% chance that it is an aldehyde. III. There is a 93% chance that an aldehyde will be classified as an aldehyde. IV. There is a 99% chance that a ketone will be classified as a ketone. The total correct rate for the validation set is 97%. Probability II is significantly lower than the other probabilities. This can be related to the limited number of spectra of aldehydes in the second 1000 entries in the library. There are significantly more ketones than aldehydes in this part of the library and, even though ketones are correctly classified 99% of the time, the 1% error among so many ketones leads to a significant decrease in probability II.

Aldehydes can be further classified into two groups, RCHO and ArCHO. Because of the limited number of aldehyde


Table VI. Summary of the Outcome of the Final Tree, Shown in Figure 8, for the Classification of Carbonyl-Containing Compounds

authentic class     number classified correctly
non-carbonyl        655
carboxylic acid     23
ester               199
ketone              73
aldehyde            13

total correct rate = 96%

spectra in the database, the training set for this classification consisted of the IR spectra of 10 alkyl aldehydes and 10 aryl aldehydes. The entries in this training set were autoscaled, feature weighted, and then reduced to their principal components. The best result was obtained after scaling with w²; these results are tabulated in Table III, part g. A validation set consisting of the IR spectra of 10 aldehydes was projected onto PC 1 and PC 2, and all 10 were correctly classified.

As the classifications become more specific, there are fewer spectra that can be used both in training and validation sets. This poses the problem of determining the validity of the expert system rules. A test needed to be developed to determine whether or not a classification was significant. For our purposes, the data from each classification matrix were used to test whether the classification was significantly better than a classification based solely on chance. This was checked by applying Huberty's one-tailed z statistic:15

$$z = \frac{(o - e)\sqrt{N}}{\sqrt{e\,(N - e)}}$$

where o is the number of spectra that were correctly classified, e is the expected number of spectra correctly classified based on chance, and N is the total number of spectra classified. The value for e is calculated by summing the squared numbers of spectra in each group and dividing by N. For all of the classifications in Table IV, the z-test was passed at a confidence level of 99%.

A total correct rate for the entire branched-tree validation set can be calculated by employing a 5 × 5 classification matrix. The results of the branched tree are summarized in Table VI. The total correct rate for the branched tree is 96%. This table shows that acids and aldehydes are very easy to classify, while esters are more difficult. Most of the misclassified esters were classified as ketones, while 13 ketones were classified as esters. Four misclassified non-carbonyls were classified as ketones. This is intuitively obvious based on the structure of the tree: acids are first separated from the larger group of carbonyls, esters are then separated from this group, and finally the aldehydes are separated, leaving ketones and all other unclassified spectra.

In summary, additional rules for a PCA-based expert system to determine molecular structure from vapor-phase IR spectra have been developed. These rules are used to classify compounds containing a wide variety of carbonyl functionalities. In our trials, the rules correctly determine from vapor-phase IR spectra whether or not a compound contains a carbonyl functionality better than 98% of the time. If it is determined that a compound contains a carbonyl functionality, the expert system then correctly subclassifies the compound as a carboxylic acid, ester, aldehyde, or ketone in 96% of our trials.
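The significance test described above can be illustrated with a short calculation. The sketch assumes that Huberty's statistic takes the form z = (o − e)/[e(N − e)/N]^(1/2), with e computed as the sum of the squared group sizes divided by N, as stated in the text. The function name is hypothetical, and the example uses the counts of Table IV under the assumption that its rows are the authentic groups (83 RCOR and 252 RCOOR spectra, 289 of 335 classified correctly).

```python
import math


def huberty_z(group_sizes, n_correct):
    """Return (e, z) for a classification matrix.

    group_sizes: number of authentic spectra in each group
    n_correct:   o, the number of spectra classified correctly
    """
    n_total = sum(group_sizes)
    # e: number of correct classifications expected by chance alone.
    expected = sum(n * n for n in group_sizes) / n_total
    z = (n_correct - expected) / math.sqrt(expected * (n_total - expected) / n_total)
    return expected, z


# Table IV (row assignment assumed): 68 + 15 authentic RCOR, 31 + 221 authentic RCOOR.
expected, z = huberty_z([83, 252], n_correct=68 + 221)
passes_99 = z > 2.326  # one-tailed critical value of the standard normal at 99%
```

Under these assumptions even the 86% separation of Table IV lies far above the chance level, which shows that passing the z-test is a necessary check on small data sets rather than a sufficient criterion for accepting a rule.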

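Figure 8. Final tree for the classification of carbonyl-containing compounds (decision nodes legible in the extracted figure: Carbonyl?, Acid?, Ketone?; one branch is labeled RCOOR', R and R' aliphatic).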

63, 1738-1747.
(10) Sharaf, M. A.; Illman, D. L.; Kowalski, B. R. Chemometrics; John Wiley & Sons: New York, 1986.
(11) Wold, H. Multivariate Analysis; Krishnaiah, P. R., Ed.; Academic Press: New York, 1966; pp 391-420.
(12) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta 1986, 185, 1-17.
(13) Donahue, S. M.; Brown, C. W. Anal. Chem. 1991, 63, 980-985.
(14) Nyquist, R. A. The Interpretation of Vapor-Phase Infrared Spectra: Group Frequency Data; Sadtler Research Laboratories: Philadelphia, PA, 1984.
(15) Huberty, C. J. Psychol. Bull. 1984, 95, 156-171.

RECEIVED for review June 18, 1991. Revised manuscript received November 22, 1991. Accepted November 27, 1991.