27

An Expert System for Organic Structure Determination

Bo Curry

Downloaded by UNIV LAVAL on July 13, 2016 | http://pubs.acs.org Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch027

Chemical Systems Department, Hewlett-Packard Laboratories, Palo Alto, CA 94304-1209

We are developing an expert system which interprets low-resolution mass spectra, infrared spectra, and other user-supplied information and produces a list of functional groups present in an unknown organic compound. The input data are interpreted as evidence supporting the presence or absence of each of the over 900 functional groups and organic substructures represented in the knowledge base. This evidence is then combined by an "inference engine" to determine the probability that the group is present. Each type of input spectrum is interpreted by a separate module, which has private internal data structures; these modules can use different techniques and even be written in different computer languages. The modular architecture was designed to allow new modules interpreting different types of spectra to be easily incorporated into the system. A major goal has been the reduction of the number of false positive assertions.

An analyst attempting to identify an unknown compound from spectral data begins by searching libraries of spectra of known compounds (Figure 1). Programs which rapidly and reliably search spectral libraries are widely available.(1-2) However, although these libraries continue to grow, it will remain true that the majority of compounds encountered in real samples are not represented in the libraries. These compounds can at present be identified only through a laborious manual process requiring considerable expertise.

Interpretation of molecular spectra involves four basic steps. First, major skeletal and functional group components of the molecule are identified, either from assumptions about the compound origin or from features of the spectra. Second, non-localized molecular properties such as the molecular weight, elemental composition, and chromatographic behavior are considered.
These global constraints can be used to eliminate unlikely functional groups, deduce the presence of groups and skeletal units which have no distinctive features in the spectra, and detect multiple occurrences of functional groups. Complete candidate structures are then generated by assembling the functional groups subject to the global constraints. More data may be collected to narrow down the number of candidates. Finally, the candidate structures are tested for compatibility with all original data. Final confirmation is obtained by synthesis of the candidate compound and comparison with the unknown.

0097-6156/86/0306-0350$06.00/0 © 1986 American Chemical Society

Pierce and Hohne; Artificial Intelligence Applications in Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1986.

We are developing an expert system to automate the first step of this process, the interpretation of molecular spectra and identification of substructures present in the molecule. The automatic interpretation of spectra would by itself provide a useful tool for an organic chemist who may not be an expert spectroscopist. Also, reported algorithms for the assembly of candidate structures from known substructures, such as the GENOA program,(3-6) rely on the input of accurate and specific substructures in order to function correctly and efficiently. Identification of substructures is thus a logical starting point.

Information about substructures present in an unknown can be obtained from a wide variety of sources, and one of our major objectives has been to allow all available data to be used by the program. Programs have been described in the literature which interpret C-13 and 1-H NMR spectra,(7-13) low and high-resolution mass spectra,(14-15) infrared spectra,(16-23) MS-MS spectra,(24) and 2D-NMR spectra.(25) The methods employed may be generally classified as rule-based methods or pattern-matching methods.
Rule-based methods apply interpretation rules to discrete features of the spectra.(26) These rules are usually empirical correlations having physical significance, expressed in a form similar to that used by human interpreters. Rule-based systems maintain a relatively detailed internal representation of their knowledge, and can explain their conclusions in a language intelligible to the user.

Pattern-matching methods attempt to classify the spectrum based on some global measure of "spectral distance" from spectra of known compounds.(27) Any physical knowledge used by the algorithm is embodied in its distance measure, which may be a complicated function of many features of the spectra. The classification decision is made from a statistical analysis of the distance from representative members of the classes being distinguished. Explanations of the system's conclusions are usually limited to reporting the computed spectral distances.

Whichever method is employed, the output is in the form of a list of suggested substructures, chosen from a predefined set, with confidence factors variously computed.

The choice between rule-based and pattern-matching approaches depends not only on the predilection of the experimenters, but also on the nature of the data being interpreted. The reported NMR interpreters all use rule-based methods. The pattern-matching algorithm used in the STIRS program (14) appears to be the most successful at interpreting low-resolution mass spectra of general organic compounds. Both rule-based and pattern-matching techniques have been applied to the interpretation of infrared spectra.
The rule-based methods seem to be the most successful.(16-23) We have therefore designed our program to allow each type of spectrum to be interpreted by the most efficient method; different methods can even be simultaneously applied to the same spectrum.

When the unknown is present in sub-microgram amounts, as is often the case when it has been isolated chromatographically, the primary structural techniques are mass spectrometry, infrared spectroscopy, and various methods of determining elemental composition. We have therefore concentrated our initial efforts on interpreting these types of data, while recognizing the need to be able to use data from other sources, such as NMR, when they are available. A skilled chemist can often correctly identify an unknown of moderate size (molecular weight < 200) using only the IR spectrum, the low-resolution mass spectrum, and some knowledge of the sample origin. Even when a precise identification is not possible, a generic classification of the compound type is useful and often sufficient. A program which interprets IR and mass spectra is therefore a useful analytical tool in its own right, and provides the basis for development of more comprehensive capabilities in the future.

In our present system, infrared spectra are interpreted using a rule-based approach, while mass spectra are interpreted by the STIRS algorithm. The ability to use different techniques for different types of data implies a modular architecture, in which the "expert" responsible for the interpretation of each spectrum maintains its own rules and data structures (Figure 2). It is important, however, that the interpretation of the various spectra be mutually consistent. Information obtained from the mass spectrum, for example, should affect the way the infrared spectrum is assigned. Conversely, the interpretation of mass spectral lines must be consistent with the presence of functional groups known to be present from other sources.
This requires a means of communication among the parts of the program responsible for the interpretation of different types of data. Consistency also requires a means of combining evidence from different sources. When data from different sources contradict each other, the individual modules should be able to reinterpret their data so as to resolve the contradiction.

As in any classification problem, there is a tradeoff between the rate of recall, or proportion of correct substructures detected, and the reliability, or avoidance of false positive assertions. It is rather the exception than the rule for an observation to have a single, unequivocal explanation. When reasonable alternative interpretations are possible, a decision must be made about what to report. At one extreme, all possibilities could be asserted, ensuring 100% recall (i.e. no substructure which is actually present will fail to be detected) at the cost of a high rate of false positives. At the other extreme, ambiguous data could be ignored, which guarantees no false positives, although many substructures which are present will be missed. We have taken a middle road between these extremes by developing a measure of the "best" or most probable interpretation, taking into account all of the data available. When the best choice is not clearcut, the disjunction of the competing alternatives is explicitly asserted. The goal has been to minimize the rate of false positives, while reporting the most specific possible interpretation of the data.

An important feature of expert systems is the accessibility to the user of the knowledge base and the reasoning process.
Both the terminology used by the program and its interpretation of data have chemical significance. Each conclusion reached by the program can be traced by the user to the original data. When alternative explanations for an observation are possible, the choice is visible to the user. If the program has made an error, the user can correct it, thereby modifying the original conclusions.

Figure 1. Flow chart for identification of an organic compound. (Box labels recoverable from the figure: Identify Subunits; Specify Global Constraints.)

Figure 2. Schematic drawing of the interpreter. The program is represented by the area inside the solid rectangle. Program modules are drawn as circles, and their associated databases as rectangles. All of the modules have read access to the Chemical Classes database.

Program Description

The architecture of our current system is shown schematically in Figure 2. The design is modular, with a Controller module, a Reasoner module, a database of over 900 organic substructures, and a separate "Expert" module assigned to each kind of input data.

The Controller module controls the progress of the calculation by considering each of the substructures which has not yet been eliminated, beginning with the most general. It requests each of the Expert modules to supply it with evidence supporting or denying the presence of the substructure currently being considered. This evidence is collected and passed to the Reasoner. When no more evidence can be collected, the analysis is finished.

The Reasoner combines evidence from all sources and makes deductions from this evidence. The combination of evidence results in a single "confidence level" for each substructure. These confidence levels designate the degree to which the evidence supports the presence of the substructure in the unknown compound. They range from -100% (substructure definitely absent), through 0% (no information), to +100% (substructure definitely present). The confidence levels are ultimately derived from statistical analysis of representative spectral libraries. Details of the generation and propagation of confidence levels will be described in a separate report.(28)

Each Expert module is permitted to use any convenient method to carry out its mission of interpreting its assigned data. The Experts use private rules and data structures, and communicate with the Controller module both by suggesting the presence of substructures, and by evaluating the likelihood of substructures under consideration.
Each Expert can read the current confidence level associated with each substructure, and thus has access to information generated by other Experts or deduced by the Reasoner.

Communication among these modules is accomplished in two ways. First, the chemical database, besides storing the chemical knowledge of the program, serves as a "blackboard" on which the progress of the computation is recorded.(29) Only the Controller and Reasoner modules are allowed to write on the blackboard, but all modules can read it. In this way the conclusions of each Expert module are available to all the others to guide their interpretation. Second, the Controller module controls the overall path of the analysis by sending messages to the individual Experts. The only requirement of a new Expert module being added to the system is that it be able to respond appropriately to these messages.

The current prototype system includes three Expert modules, the IR Expert, the STIRS Expert, and the Human. All modules are written in Lisp. The IR Expert is a rule-based infrared interpreter which we have developed. The STIRS Expert is an interface to the STIRS program, a pattern-matching mass spectrum interpreter developed by McLafferty and coworkers at Cornell University, which is written in Fortran.(14) The interface translates the output of STIRS into a form palatable to our program, and handles the message-passing protocol required by the Controller. The Human module controls communication with the user. It allows user-supplied elemental or substructure information to influence the course of the analysis. The power of the modular approach is shown by our ability to integrate the results of three interpretation methods which differ profoundly in their internal details.

The Chemical Database. The chemical knowledge of the system is embodied in a database of over 900 organic substructures, arranged in a hierarchy (Figure 3). With each of these substructures is associated a connection table, stability information, and a probability of occurrence denoting how common the group is. This information may be used by the Expert modules when deciding among possible interpretations.

As the analysis progresses, evidence is accumulated supporting the presence or absence of defined substructures. The evidence is combined by the Reasoner module to form a belief function, which describes the degree to which each substructure is currently believed. This information is stored in the chemical database, where it is available to the Expert modules and to the Controller as it decides the course of the analysis. As the belief function evolves, the current state is displayed graphically to the user, who may halt the analysis, query the current state, and redirect the course of the analysis by supplying evidence for or against a substructure.

IR Expert Module. The IR Expert's rule base consists of over 1000 correlations between observed infrared bands and vibrational modes of specific substructures. Associated with each rule is a wavenumber range, an intensity range, and two confidence levels. Four intensity levels are allowed. The intensity levels are defined on an approximate semilog scale, relative to the most intense peak in the spectrum: WEAK = 2-5%, MEDIUM = 5-15%, STRONG = 15-40%, VSTRONG = 40-100%.
The program does not attempt to assign bands weaker than 2% of the strongest band. Each IR rule is equivalent to the pair of propositions:

a) IF a band of intensity I appears in the region x1 - x2 cm-1, THEN it is due to the vibrational mode M of substructure S, AND

b) IF no band of intensity I appears in the region x1 - x2 cm-1, THEN the substructure S is not present in the unknown.
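The rule structure described above can be sketched in a few lines of Python. This is only an illustration, not the authors' Lisp implementation: the names `IRRule`, `classify_intensity`, and `apply_rule` are hypothetical, the treatment of the bin boundaries as half-open intervals is an assumption, and the two confidence values in the example rule are invented for demonstration. Only the four intensity classes, the 2% cutoff, and the paired propositions a) and b) come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Intensity classes from the text, in percent of the strongest band.
# Treating each bin as [lo, hi) is an assumption; the text gives only the ranges.
INTENSITY_BINS = [
    ("WEAK", 2.0, 5.0),
    ("MEDIUM", 5.0, 15.0),
    ("STRONG", 15.0, 40.0),
    ("VSTRONG", 40.0, 100.0),
]


def classify_intensity(rel_percent: float) -> Optional[str]:
    """Map a relative intensity to a class; bands below 2% are not assigned."""
    if rel_percent < 2.0:
        return None
    for name, lo, hi in INTENSITY_BINS:
        if lo <= rel_percent < hi:
            return name
    return "VSTRONG"  # the strongest band itself (100%)


@dataclass(frozen=True)
class IRRule:
    substructure: str
    mode: str
    lo_cm1: float                  # lower end of the wavenumber range
    hi_cm1: float                  # upper end of the wavenumber range
    intensities: Tuple[str, ...]   # allowed intensity classes
    cl_present: float              # confidence in proposition a)
    cl_absent: float               # confidence in proposition b)


def apply_rule(rule: IRRule, bands: List[Tuple[float, float]]):
    """Return (substructure, signed confidence, matched wavenumber or None).

    bands is a list of (wavenumber in cm-1, relative intensity in percent).
    A matching band yields proposition a); no match yields proposition b).
    """
    for wavenumber, rel_percent in bands:
        in_range = rule.lo_cm1 <= wavenumber <= rule.hi_cm1
        if in_range and classify_intensity(rel_percent) in rule.intensities:
            return (rule.substructure, +rule.cl_present, wavenumber)
    return (rule.substructure, -rule.cl_absent, None)


# Example rule: a medium-intensity band in 2860-2883 cm-1 for methyl-benzene.
# The confidence values (10 and 20) are hypothetical.
METHYL_BENZENE_CH = IRRule(
    "methyl-benzene", "C-H sym stretch", 2860.0, 2883.0, ("MEDIUM",), 10.0, 20.0
)
```

Applied to a spectrum with no medium band in the stated range, `apply_rule` returns negative evidence for the substructure; how such evidence is then combined is the Reasoner's task and is not modeled here.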

About 800 of these rules were chosen by testing all the IR correlations we could find in the literature,(30-32) mostly for condensed phases, against the EPA gas-phase library of 2300 compounds.(33-34) About 30% of the literature correlations were not generally satisfied by the library spectra, and were discarded. Another 200 rules were discovered by searching for patterns in compound classes in the library which could reasonably be attributed to expected vibrational modes of those classes. Statistics were generated for the probability that each of the IR rules would be satisfied for compounds which contained, or did not contain, the substructure specified by the rule. These statistics were used to compute two confidence levels for each rule, corresponding to the confidence in the two propositions a) and b) implied by the rule.

Messages. As noted above, the expert modules communicate their results to the user and to the Controller by responding to messages sent by the Controller. There are six messages to which each Expert module is required to respond:

The ALIVE? message asks the Expert if it is available for consultation in this analysis. The receiving Expert resets its internal state, and responds TRUE if it has data, FALSE if it doesn't.

The SUGGESTIONS message asks the Expert to report any substructures it believes, on its own, to be present or absent. The report takes the form of a list of items of evidence, each supporting the presence or absence of a particular chemical group.

The SPECIALIZE message asserts the hypothetical presence of a chemical group, and asks the Expert which subgroups may be present. For example, the message "SPECIALIZE carbonyl" would cause the receiving Expert to return evidence for or against the presence of ketone, aldehyde, ester, amide, and other specific types of carbonyl, under the assumption (for the moment) that the compound does in fact contain a carbonyl group.

The TEST message asks the Expert to return any evidence it may have against the presence of the group being tested.

The REEVALUATE message is sent when a piece of evidence supplied by an Expert has been contradicted. It asks the Expert to modify or retract the evidence, if possible. Many infrared correlations have known exceptions in specific cases. For example, a nitro group on a benzene ring raises the expected frequency ranges of the hydrogen wags. If the presence of a nitro group is known or suspected, the aromatic wag assignments must be reevaluated.

The EXPOUND message asks the Expert to print out, for the user's benefit, the reasons supporting a piece of evidence. Each piece of evidence originated initially in some feature of the data. The degree of detail supplied in response to this message depends on the individual Expert.
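The six-message protocol can be pictured as an abstract interface that every Expert must implement. The sketch below is an assumption-laden illustration in Python rather than the authors' Lisp code: the class names `Evidence`, `Expert`, and `TableExpert`, the method names, and the `SUBGROUPS` table are all hypothetical, and the toy expert simply looks up fixed confidence values instead of interpreting a spectrum.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Evidence:
    substructure: str
    confidence: float   # -100 (definitely absent) .. +100 (definitely present)
    source: str
    reason: str = ""


class Expert(ABC):
    """One method per message named in the text."""

    @abstractmethod
    def alive(self) -> bool: ...                                    # ALIVE?
    @abstractmethod
    def suggestions(self) -> List[Evidence]: ...                    # SUGGESTIONS
    @abstractmethod
    def specialize(self, group: str) -> List[Evidence]: ...         # SPECIALIZE
    @abstractmethod
    def test(self, group: str) -> List[Evidence]: ...               # TEST
    @abstractmethod
    def reevaluate(self, ev: Evidence) -> Optional[Evidence]: ...   # REEVALUATE
    @abstractmethod
    def expound(self, ev: Evidence) -> str: ...                     # EXPOUND


# Hypothetical fragment of the substructure hierarchy, after the carbonyl example.
SUBGROUPS: Dict[str, List[str]] = {
    "carbonyl": ["ketone", "aldehyde", "ester", "amide"],
}


class TableExpert(Expert):
    """Toy expert whose 'data' is a fixed table of confidence levels."""

    def __init__(self, name: str, beliefs: Dict[str, float]):
        self.name = name
        self._initial = dict(beliefs)
        self._beliefs: Dict[str, float] = {}

    def alive(self) -> bool:
        self._beliefs = dict(self._initial)   # reset internal state
        return bool(self._beliefs)            # TRUE only if it has data

    def _evidence(self, group: str) -> Evidence:
        return Evidence(group, self._beliefs[group], self.name, "table lookup")

    def suggestions(self) -> List[Evidence]:
        return [self._evidence(g) for g in self._beliefs]

    def specialize(self, group: str) -> List[Evidence]:
        return [self._evidence(g) for g in SUBGROUPS.get(group, [])
                if g in self._beliefs]

    def test(self, group: str) -> List[Evidence]:
        # Only evidence against the group is returned.
        return [self._evidence(group)] if self._beliefs.get(group, 0.0) < 0 else []

    def reevaluate(self, ev: Evidence) -> Optional[Evidence]:
        return None   # the toy expert can only retract

    def expound(self, ev: Evidence) -> str:
        return (f"{ev.confidence:+.0f}% for {ev.substructure} "
                f"from {ev.source}: {ev.reason}")
```

Under this reading, adding a new kind of spectrum means writing one more subclass; the Controller needs no knowledge of how the subclass reaches its conclusions.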
The IR Expert, for example, can report the infrared bands which were assigned to a particular vibrational mode of a substructure, as well as possible alternative assignments. The STIRS Expert reports the incidence of the substructure among the best hits in different STIRS data classes.

Example: 4-phenyl-2-butanone

The results of the interpretation of the gas phase IR and low-resolution mass spectra of 4-phenyl-2-butanone are given in Figure 4. This compound, with a molecular weight of 148, is typical of the size and complexity of compounds which our program handles well. The IR spectrum was taken from the EPA gas-phase IR library, and the mass spectrum from the Registry of Mass Spectral Data.(35) The program was run three times: first with only the STIRS results, second with only the results of the IR interpretation, and finally with both spectra together. All functional groups reported by the program with confidence levels > 10% are listed. In addition, STIRS correctly determined the molecular weight.

The most specific defined functional groups actually present in the unknown are benzyl, monosubstituted-benzene, X-CH2CH2-X (where the "X" represents any group other than -H or -CH2-), and methylketone. That is, the program would have achieved a perfect score had it reported these substructures and no others. In fact, the program was unable to determine the correct environments of the ketone and -CH2- groups, although it reported only one incorrect substructure.

These results are consistent with our goal of reducing the rate of false positives, at the cost of failing to report the most specific possible substructures which are actually present. If the low-confidence report of the presence of benzyl and X=C-CH3 groups is accepted (Figure 4), the reported results suffice to uniquely determine the complete structure.

The effects of the low-level combination of evidence are illustrated by two features of the output. First, the confidence level for the ketone group increases from 19% for the IR-only interpretation to 30% for the combined interpretation, despite the fact that STIRS had nothing to say about the presence of a ketone or even of a carbonyl. This is explained by the increased confidence in monosubstituted-benzene derived from the combined spectra, which causes a fingerprint line tentatively assigned to an ester C-O stretch to be reassigned to a phenyl vibration. Reducing the likelihood of an ester group increases the likelihood that the C-O stretch is due to a ketone group. Secondly, the contradiction between STIRS' assertion of methyl-benzene and the IR denial not only reduces the belief in methyl-benzene, but also allows the assertion of benzyl and unsaturated-CH3 (X=C-CH3). These substructures were not suggested by either spectrum taken alone.

A slightly abridged explanation offered by the program for its belief in methyl-benzene is shown in Figure 5. There is both positive and negative evidence. The positive evidence comes primarily from STIRS, and the negative evidence results from the failure to observe a medium intensity C-H stretching band expected for methyl-benzene. A small amount of positive support for methyl-benzene is also supplied by the IR Expert, showing that conflicts can occur between different features of a single spectrum. The degree to which each piece of evidence is in conflict with other evidence is noted.

The explanation facility traces the final belief back to primitive pieces of evidence supplied by the Expert modules. The Experts are then responsible for explaining how the evidence depends on the observed spectrum. STIRS is unable to do more than report which of its data classes supported the substructure and with what probability. The IR Expert module, on the other hand, can give a richly detailed description of the assignment of the spectrum.

Results

We have evaluated our prototype system at several levels. Each Expert module has been tested individually. Detailed results of tests of the STIRS program have been published by McLafferty et al.(36) The IR Expert module was tested extensively against the EPA library. The effects of competition among the IR rules were explored by using the complete system, with the STIRS module disabled, to interpret the spectra of 1807 compounds from the library. For the test, we selected 500 of the 900 chemical substructures which both are chemically interesting and display at least one distinctive infrared band. Some of the selected substructures were subsets of others: for example, alcohol, phenol, and primary alcohol were all in the test set.

As expected, some functional groups displaying very distinctive infrared bands were detected much more reliably than others. Figure 6 shows the reliability, false positive and recall rates for a few selected functional groups.

Figure 3. A subset of the chemical substructures database, showing the hierarchical ordering.

Figure 4. Substructures reported for 4-phenyl-2-butanone at > 10% confidence, for three runs of the interpreter using different data sets. (Columns in the original figure: Class; MS Only; IR Only; MS & IR.)

Why 36% methyl-benzene?

POSITIVE:
    41% from STIRS (conflict 27%)
    8% from IR band at 2933 cm-1 assuming unsaturated-C-CH3 (37%) (conflict 27%)

11%

NEGATIVE:
    23% because of failure to satisfy C-Hsym-methyl-benzene-1 IR band 2860-2883 m (conflict 45%)

Figure 5. Sample of the explanations provided by the program for its conclusions. More detail about the source of the reported conflict, the assignments of IR bands, or the data classes responsible for the STIRS evidence can also be provided.

Figure 6. Statistics for 5 selected substructures of the 500 tested on the EPA IR database. Values of the Reliability, False Positives, and Recall (see text) are compared at the 45% confidence level. The number of compounds in the database containing each substructure is given beneath the substructure name. Note the expanded scale used to plot the False Positive measure. (Chart title in the original figure: IR results for 1807 compounds, > 45% confidence; plotted series: Reliability, False positives, Recall.)

The "recall" is the probability that a substructure present in the unknown will be reported, while the "reliability" is the probability that a reported substructure is actually present.(36) These functions are defined as:

$$\mathrm{Recall}(S) = \frac{\mathrm{Number\_correctly\_reported}(S)}{\mathrm{Total\_number\_present}(S)}$$

$$\mathrm{Reliability}(S) = \frac{\mathrm{Number\_correctly\_reported}(S)}{\mathrm{Total\_number\_reported}(S)}$$

for all compounds in the database containing substructure S. Both measures are functions of the confidence level (CL) threshold above which we count a substructure as "reported". All substructures are reported at CL > -100%, while none are reported at CL > +100%. We have arbitrarily chosen CL > 45% as a threshold in Figure 6.

An alternative measure of reliability often used is the "false positive" rate, defined as:

$$\mathrm{FP}(S) = \frac{\mathrm{Number\_falsely\_reported}(S)}{\mathrm{Total\_number\_present}(\mathrm{NOT}\ S)}$$

which is related to the recall and reliability measures by:

$$\mathrm{FP}(S) = \frac{\mathrm{Total\_number\_present}(S) \times \mathrm{Recall}(S) \times (1 - \mathrm{Reliability}(S))}{\mathrm{Total\_number\_present}(\mathrm{NOT}\ S) \times \mathrm{Reliability}(S)}$$

This is the probability that a compound which does not contain substructure S will be incorrectly reported to contain it. For substructures which occur rarely in the database, the (1 - FP) rate will be considerably greater than the reliability, and may be misleading. For example, for the SO2 group (1% of the database), the FP rate was < 8%, although the reliability was only 25% (Figure 6). That is, although the program falsely asserted the presence of an SO2 group (with > 45% CL) less than 8% of the time, 3/4 of the assertions of SO2 were incorrect. The latter statistic is probably of more interest to an analyst trying to evaluate the program's reports. On the other hand, the FP is a better measure of the raw discriminating power of the program, since it would presumably be unchanged by changing the proportion of the target substructure in the database. The two measures serve different functions, and should both be reported.

The tradeoff between reliability and recall can be adjusted for individual functional groups by changing the frequency ranges allowed for the IR correlations. For some of the functional groups which are well represented in the EPA library (e.g. esters, alcohols) we have manually optimized the rule ranges to maximize (3 * Reliability + Recall). Since the library is known to contain errors, and is skewed towards the smallest (often anomalous) members of homologous series, we have not tried to do this for all groups (e.g. SO2).
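The three measures and the tuning objective can be illustrated with a minimal Python sketch. It takes reliability to be the fraction of reports that are correct, following the prose definition (the probability that a reported substructure is actually present). The `Counts` class, the function names, and the example tallies are hypothetical and are not taken from the program or from the EPA database results; the sketch also assumes that at least one compound contains S, at least one lacks it, and at least one correct report is made, so that no division by zero occurs.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Tallies for one substructure S at a fixed confidence threshold."""
    present: int             # compounds that contain S
    absent: int              # compounds that do not contain S
    correctly_reported: int  # S reported and actually present
    falsely_reported: int    # S reported but actually absent


def recall(c: Counts) -> float:
    """Fraction of compounds containing S for which S is reported."""
    return c.correctly_reported / c.present


def reliability(c: Counts) -> float:
    """Fraction of all reports of S that are correct."""
    total_reported = c.correctly_reported + c.falsely_reported
    return c.correctly_reported / total_reported


def false_positive_rate(c: Counts) -> float:
    """Fraction of compounds lacking S for which S is reported anyway."""
    return c.falsely_reported / c.absent


def fp_from_recall_and_reliability(c: Counts) -> float:
    """FP rate recomputed from recall and reliability (consistency check)."""
    r, rel = recall(c), reliability(c)
    return c.present * r * (1.0 - rel) / (c.absent * rel)


def tuning_score(c: Counts) -> float:
    """Objective quoted in the text for manual rule tuning."""
    return 3.0 * reliability(c) + recall(c)


# Hypothetical tallies for a rare substructure (illustrative only).
rare = Counts(present=20, absent=1780, correctly_reported=12, falsely_reported=36)

print(recall(rare), reliability(rare), false_positive_rate(rare))
print(fp_from_recall_and_reliability(rare), tuning_score(rare))
```

With these made-up tallies the recall is 0.60 and the reliability 0.25, while the FP rate is only about 0.02: a rare substructure can have a small false positive rate even though most of its reports are wrong, which is the effect described for SO2.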
Further testing on larger libraries will allow further refinements of the IR rules.

Many of the errors observed result from the consistent confusion of two particular functional groups. For example, although the presence of a methyl group was erroneously reported (at > 45% confidence) for 30% of the 400 compounds which lack methyl groups, a methyl group was reported for only 1 of the 33 compounds lacking both CH3 and CH2 groups. Conversely, the presence of a methylene group was never incorrectly asserted for compounds which lack methyl groups. Examination of the reasons for the confusion confirms that the C-H stretching and HCH deformation vibrations, whose frequency and intensity ranges are similar for methyl and methylene, are often misassigned. Such consistent confusion between similar substructures can be dealt with by assigning the bands to a generic -CH2X group, and deciding between methyl and methylene only after the nearby environment has been determined.

Average results for 500 IR-active substructures are shown in Figure 7 at four different confidence levels. The average compound in the database contains 8.1 of the 500 substructures. At a confidence level of > 45%, only 1.4 (of 492) incorrect substructures are reported, while 4.6 of 8.1 substructures actually present are reported. In other words, a "typical" analysis will report 6.0 substructures at > 45% confidence, of which 4.6 are correct. The remaining 3.5 substructures actually present in the compound will fail to be reported. In an actual analysis, infrared data is combined with other types of data, so that many of the substructures undetected by infrared would be found by other techniques.

We have analyzed over 100 unknown compounds using both the mass spectrum and the IR spectrum in combination. The combination of the two techniques gives substantially better results than does either technique alone. As expected, many functional groups are preferentially detected by one technique or the other. For example, ketone groups are rarely detected in the mass spectrum, but are usually correctly interpreted from the infrared. Chlorine and bromine, on the other hand, are easily detected in the mass spectrum but often missed by the infrared interpreter.
Also, because of the interaction between the two interpretation methods, substructures are frequently detected by the combined techniques which are not found by either technique alone. This can occur as a result of resolving a contradiction between the two Experts, as in the example above, or because one Expert is able to further specialize a result suggested by the other. For example, in the interpretation of bis-2-chloro-ethyl ether, the IR Expert alone fails to detect the presence of chlorine. When chlorine is suggested by the STIRS Expert, however, the IR Expert correctly reports the -CH2Cl group. A few substructures, such as non-terminal olefins, are not reliably detected in either mass or infrared spectra. For such groups, other techniques (NMR, UV absorption, Raman) are necessary.

In many cases, the results of the IR and mass spectrum interpretation are sufficient to allow a complete molecular structure to be deduced. In preliminary tests on 12 unknown compounds of molecular weight 100-200, the author, using the results reported by the program but without access to the original spectra, was able to correctly identify 9 of the unknowns. These results are encouraging, and suggest that our system in substantially its present form could serve as a useful tool for an analytical chemist, as well as eventually providing a framework for completely automated identification of organic compounds.

Conclusions

We have developed an expert system which can interpret various kinds of data and report functional groups present in an unknown organic


[Figure 7 plot: Average Correct and Incorrect Assertions for 1807 Compounds, Infrared Only; axis: Confidence Level (%)]

Figure 7. Average number of substructures reported correctly (solid color) and incorrectly (hatched) at four different confidence levels, for IR data only. A total of 500 substructures were considered, of which an average of 8.1 were present in each compound.


compound. The program employs a modular construction, which allows each type of data to be interpreted in the most efficient way. The conclusions derived by different modules are able to influence each other at a low level. The program knows the chemical relationships between functional groups, and can use this knowledge in its reasoning process. The reasoning process is accessible to the user, so that each conclusion can be traced back to the original data responsible for it. Choices made by the program can be isolated and overridden by a knowledgeable user. Contradictions arising among evidence from different sources are resolved in a natural way, using knowledge about the effects of perturbations and common interferences on the spectra.

A rule-based infrared spectra interpreter has been developed as a major module of the program. This module has been tested as a stand-alone system, and in conjunction with STIRS. The low rate of false positive assertions is encouraging, and work continues to reduce this rate still further by incremental refinement of the knowledge base.

In its present form, our system can provide significant assistance to a chemist trying to identify an unknown organic compound. Research is in progress to extend the capabilities of the program both by expanding the number of different data sources it can handle (NMR, UV/visible absorption spectra) and by incorporating a "molecule builder" which assembles complete candidate structures, where possible, from the suggested substructures.

Acknowledgments

I would like to thank Reed Letsinger and others in the Expert Systems Department at HP Labs for helpful discussions and technical assistance.

Literature Cited

1. Hippe, Z.; Hippe, R. Appl. Spectrosc. Reviews 1980, 16, 135-186.
2. Bally, R. W.; van Krumpen, D.; Cleij, P.; van't Klooster, H. A. Anal. Chim. Acta 1984, 157, 227-243.
3. Masinter, L. M.; Sridharan, N. S.; Lederberg, J.; Smith, D. H. J. Am. Chem. Soc. 1974, 96, 7702-7723.
4. Carhart, R. E.; Smith, D. H.; Gray, N. A. B.; Nourse, J. G.; Djerassi, C. J. Org. Chem. 1981, 46, 1708-1718.
5. Nelson, D. B.; Munk, M. E.; Gash, K. B.; Herald, D. L. J. Org. Chem. 1969, 34, 3800.
6. Shelley, C. A.; Hays, T. R.; Munk, M. E. Anal. Chim. Acta Computer Techniques and Optimization 1978, 103, 121-132.
7. Fujiwara, I.; Okuyama, T.; Yamasaki, T.; Abe, H.; Sasaki, S. ibid 1981, 133, 527-533.
8. Szalontai, G.; Simon, Z.; Csapo, Z.; Farkas, M.; Pfeifer, Gy. ibid 1981, 133, 527-533.
9. Debska, B.; Duliban, J.; Guzowska-Swider, B.; Hippe, Z. ibid 1981, 133, 303-318.
10. Dubois, J.-E.; Carabedian, M.; Dagane, I. Anal. Chim. Acta 1984, 158, 217-233.
11. Gribov, L. A.; Elyashberg, M. E.; Koldashov, V. N.; Plentnjov, I. V. ibid 1983, 148, 159-170.

12. Small, G. W.; Jurs, P. C. Anal. Chem. 1984, 56, 1314-1323.
13. Gray, N. A. B. Artificial Intelligence 1984, 22, 1-21.
14. Haraki, K. S.; Venkataraghavan, R.; McLafferty, F. W. Anal. Chem. 1981, 53, 386-392.
15. Buchs, A.; Schroll, G.; Duffield, A. M.; Djerassi, C.; Delfino, A. B.; Buchanan, B. G.; Sutherland, G. L.; Feigenbaum, E. A.; Lederberg, J. J. Am. Chem. Soc. 1970, 92, 6831.
16. Ishida Y.; Sasaki, S. Computer Enhanced Spectrosc. 1983, 1, 173-184.
17. Varmuza, K. Anal. Chim. Acta 1980, 122, 227-240.
18. Zupan, J. ibid 1978, 103, 273-288.
19. Visser, T.; van der Maas, J. H. ibid 1980, 122, 363-372.
20. Smith, G.; Woodruff, H. B. J. Chem. Inf. Comp. Sci. 1984, 24, 33.
21. Gray, N. A. B. Anal. Chem. 1975, 47, 2426.
22. Delaney, M. F.; Denzer, P. C.; Barnes, R. M.; Uden, P. C. Anal. Lett. 1979, 12, 963-978.
23. Bink, W. G.; van 't Klooster, H. A. Anal. Chim. Acta 1983, 150, 53-59.
24. Cross, K. P.; Giordani, A. B.; Gregg, H. R.; Hoffman, P. A.; Beckner, C. F.; Enke, C. G. "An Automated Structure Determination System for MS/MS Data", 190th ACS National Meeting, Chicago, IL (1985).
25. Christie, B. D.; Munk, M. E. "Computer-assisted Structure Elucidation Using 2-Dimensional NMR Data", 190th ACS National Meeting, Chicago, IL (1985).
26. Buchanan, B. G.; Shortliffe, E. H. "Rule-based Expert Systems"; Addison-Wesley: Menlo Park, CA, 1984.
27. Jurs, P. C.; Isenhour, T. L. "Chemical Applications of Pattern Recognition"; Wiley: New York, NY, 1975.
28. Curry, B., manuscript in preparation.
29. Charniak, E.; McDermott, D. "Introduction to Artificial Intelligence"; Addison-Wesley: Menlo Park, CA, 1985.
30. Bellamy, L. J. "The Infrared Spectra of Complex Molecules"; Chapman and Hall: London, 1975.
31. Nyquist, R. A. "The Interpretation of Vapor-Phase Infrared Spectra", vol. 1; Sadtler Research Labs: Philadelphia, PA, 1984.
32. Socrates, G. "Infrared Characteristic Group Frequencies"; John Wiley and Sons, Ltd.: New York, NY, 1980.
33. Griffiths; et al., GC-IR Subcommittee of the Coblenz Society Evaluation Committee, Appl. Spectrosc. 1979, 33, 543.
34. de Haseth, J., Chemistry Dept., Univ. of Georgia, Athens, GA, personal communication.
35. "Registry of Mass Spectral Data"; Electronic Data Division, Wiley: 605 Third Ave., New York, NY 10158.
36. Dayringer, H. E.; McLafferty, F. W. Org. Mass Spectrosc. 1976, 11, 543-551.

RECEIVED December 17, 1985
