Acquisition and Representation of Knowledge for Expert Systems in

Dec 17, 1985 - Thus, they have gained importance as a knowledge base for assisting the chemist in solving his problems. Clearly, the construction of...
21

Acquisition and Representation of Knowledge for Expert Systems in Organic Chemistry

J. Gasteiger, M. G. Hutchings¹, P. Löw, and H. Saller

Downloaded by UNIV OF AUCKLAND on May 3, 2015 | http://pubs.acs.org Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch021

Institute of Organic Chemistry, Technical University Munich, D-8046 Garching, West Germany

Many of the models used by the organic chemist to explain his observations provide a good basis for representing chemical knowledge in an expert system. Such knowledge can be acquired by developing algorithms for these models and parameterizing them with the aid of physical or chemical data. This is demonstrated for concepts such as electronegativity, polarizability, or the inductive and resonance effects. Combination of these models permits construction of systems which make predictions worthy of an experienced chemist. This is exemplified by EROS, a system that can predict the course of chemical reactions or can design organic syntheses.

Chemistry - as a scientific and technological discipline - has some unique characteristics. In contrast to physics, where most of the underlying laws can be given in explicit and sometimes simple mathematical form, many of the laws governing chemical phenomena are either not explicitly known, or else have a mathematical form that still eludes an exact solution. Still, chemistry does provide - and rests on - quantitative data of physical or chemical properties of high numerical precision. A search for quantitative relationships is thus suggested, despite the lack of a tractable theoretical basis.

Chemists have accumulated over the last two centuries an enormous amount of information on compounds and reactions. However, this information appears largely as a collection of individual facts devoid of any comprehensive structure or organization. This is most painfully felt by the novice studying chemistry. However, the more he progresses in his scientific discipline, the more concepts and rules emerge that allow him to bring order into his knowledge. These concepts include partial atomic charges, electronegativity, inductive, resonance, or steric effects, which have all been coined by the

¹Current address: Organics Division, Imperial Chemical Industries plc, Blackley, Manchester M9 3DA, England

0097-6156/86/0306-0258$06.00/0 © 1986 American Chemical Society

In Artificial Intelligence Applications in Chemistry; Pierce, T., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1986.


chemist to derive models for the principles governing chemical observations. The design of these models has involved the reduction of collections of individual observations to general principles.

Throughout this paper we use the term model. It will refer to concepts of varying degrees of sophistication and specification. A model can be a notion developed by the chemist to classify an observation, it can be an explicit procedure for the calculation of a value for a physico-chemical concept, or it can refer to a mathematical equation for the prediction of an observation. We intentionally do not distinguish between these different uses in order to stress the point that the development of a model to further understanding is quite a common approach in science.

The huge amount of information available in chemistry early on invited the use of the computer for storing and retrieving information. Documentation systems have been developed, and are being maintained, that contain a sizeable amount of the known chemical information. Thus, they have gained importance as a knowledge base for assisting the chemist in solving his problems. Clearly, the construction of a large chemical information retrieval system is an enormous endeavor. Furthermore, the work will never be complete as new information is constantly being gathered and should be incorporated into the system. Beyond that, pure retrieval can only give access to known information. Without appropriate structuring of information no predictions can be made of new information. Thus, some of the most important and interesting problems of a chemist could not be tackled. These are:

1. What will be the properties of an unknown compound?
2. What is the structure of a new compound?
3. How can a compound with a new structure be synthesized?

These questions fall into the domains of structure-activity relationships, structure elucidation, and synthesis design, respectively. They all ask for new information not yet known explicitly. That is, they require predictions.

It would be highly desirable to reduce the individual facts in an information retrieval system to general principles just as the chemist has done in devising his empirical concepts mentioned previously. Such a reduction of information to its essential contents asks for insights, to transform information to knowledge.

We have not attempted to make the computer do the job of automatically finding the fundamental laws of chemistry from a compilation of individual facts. Rather, we have explicitly built into the computer specific models that we believe can represent the structure of chemical information. We were guided in this endeavor by concepts derived by the chemist and have tried to develop models and procedures that quantify these concepts. In doing so we have put more emphasis on the acquisition and representation of knowledge than on problem-solving techniques. In any expert system the quality of the knowledge base is of primary and decisive importance.

We are mainly concerned with the development of EROS (Elaboration of Reactions for Organic Synthesis), a program system for the prediction of chemical reactions and the design of organic syntheses (1-3). This system does not rely on a database of known reactions. Instead, reactions are generated in a formal manner by breaking and


making bonds and shifting electrons. In Figure 1 one of those reaction schemes contained in the program is shown. This scheme, breaking two bonds and making two new ones, is quite important; many rather diverse reactions follow that scheme. Such a scheme can be applied in a forward search (reaction prediction; 1a and 1b) as well as in a retrosynthetic search (synthesis design; 1c). Clearly, not all reactions obtained by such a formal scheme can be realistic ones. In fact, many have no chemical reality (cf. 1d). A major task in program development is therefore to find ways of automatically extracting the chemically feasible reactions from amongst the formally conceivable ones. To this end a modelling of chemical reactivity seems indispensable.
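The formal generation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EROS implementation: the function name `formal_reactions`, the atom labels, and the choice to list both possible reconnections of the four atoms are hypothetical and introduced here for clarity.

```python
from itertools import combinations


def formal_reactions(bonds):
    """Enumerate formal reactions that break two bonds and make two new ones.

    `bonds` is a list of (atom, atom) pairs with unique atom labels.
    Each result is (broken_bonds, made_bonds). No chemistry is checked here.
    """
    results = []
    for (i, j), (k, l) in combinations(bonds, 2):
        if len({i, j, k, l}) < 4:
            continue  # the two bonds share an atom; outside this simple sketch
        # Two ways to reconnect the four atoms into two new bonds.
        for made in (((i, k), (j, l)), ((i, l), (j, k))):
            results.append((((i, j), (k, l)), made))
    return results


# Methyl bromide plus water, reduced to the two bonds of interest.
reactions = formal_reactions([("C", "Br"), ("H", "O")])
for broken, made in reactions:
    print("break", broken, "-> make", made)
```

Of the two outcomes, one forms C-O and Br-H bonds (the hydrolysis products), while the other forms C-H and Br-O bonds and is formally valid but chemically implausible. Ranking such candidates is the job of the reactivity models discussed in the text.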


Finding the Pieces

The high quality numerical data on physical and chemical properties of atoms, molecules, and compounds present a good starting point for the development of a knowledge base. The task is to condense the information contained in a series of individual data into a quantitative parametric model which will reproduce the primary data with a certain accuracy. If this is successful it can be used to predict new, as yet unknown data for which the same kind of accuracy can be expected. Furthermore, the parameters could also be of use in other models which in turn give new types of data.

In developing models for treating chemical reactivity we have been guided by the concepts used by the organic chemist in discussing the causes of organic reactions and their mechanisms. Examples of the more prominent effects are shown in Figure 2. Our intention has been to derive models that can quantify these various effects and thereby build a basis for a quantitative treatment of chemical reactivity. The following simple models that enable calculations to be performed rapidly on large molecules and big data sets have been developed.

Heats of Reaction and Bond Dissociation Energies. The simplest form of a model is an additivity scheme that derives a molecular property through summation over increments assigned to atoms, bonds or groups (4). We have explored such an approach by assuming that heats of formation can be estimated from values assigned to direct (1,2) and next nearest (1,3) atom-atom interactions (5). Values for these parameters have been derived from experimental heats of formation through multi-linear regression analyses (6). As an example, the heats of formation of 49 alkanes have been condensed into four fundamental parameters that reproduce the data with a standard error of 0.87 kcal/mol (6). This amounts to a sizeable reduction of the information that has to be stored, while conserving a rather good accuracy in the data. With these four parameters unknown heats of formation of alkanes can be estimated by the additivity scheme with a similarly high accuracy. This approach has been extended to other series of compounds.

Using these parameters for the estimation of the heats of formation of starting materials and products of a reaction and then taking the difference in these two numbers provides values for reaction enthalpies. Only parameters of those substructures that are changed in a reaction need be considered.
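The additivity idea, and the observation that unchanged substructures cancel in a reaction enthalpy, can be illustrated with a short Python sketch. It is a simplification made for this example: only bond (1,2) terms are used, and the increment values in `INCREMENTS` are hypothetical placeholders, not the fitted parameters of refs. 5 and 6. The function names are likewise invented here.

```python
from collections import Counter

# Hypothetical increments in kcal/mol, chosen only so the example runs.
# They are NOT the regression parameters reported in the cited work.
INCREMENTS = {
    "C-H": -3.8, "C-Br": 1.5, "O-H": -27.0, "C-O": -12.0, "H-Br": -8.7,
}


def heat_of_formation(counts, increments=INCREMENTS):
    """Additivity estimate: sum of increment times occurrence count."""
    return sum(increments[name] * n for name, n in counts.items())


def net_change(reactants, products):
    """Substructure counts of products minus reactants, zeros dropped."""
    change = Counter()
    for molecule in products:
        change.update(molecule)
    for molecule in reactants:
        change.subtract(molecule)
    return {name: n for name, n in change.items() if n != 0}


def reaction_enthalpy(reactants, products, increments=INCREMENTS):
    """Reaction enthalpy from the changed substructures only."""
    changed = net_change(reactants, products)
    return sum(increments[name] * n for name, n in changed.items())


# CH3Br + H2O -> CH3OH + HBr, described by bond counts.
ch3br = {"C-H": 3, "C-Br": 1}
h2o = {"O-H": 2}
ch3oh = {"C-H": 3, "C-O": 1, "O-H": 1}
hbr = {"H-Br": 1}

changed = net_change([ch3br, h2o], [ch3oh, hbr])
dh = reaction_enthalpy([ch3br, h2o], [ch3oh, hbr])
print(changed, round(dh, 2))
```

The three C-H bonds appear on both sides and drop out of `changed`, so the result equals the difference of the full heat-of-formation estimates while touching only four increments.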


Figure 1. Formal reaction scheme with examples (a to d).


Furthermore, the effects of strained rings and of aromatic compounds must be considered (7), and algorithms that perform these tasks have been developed (8,9). Values for bond dissociation energies can be calculated by extending the parametrization to radicals (10). Table I gives results obtained for methyl propionate; experimental values are from compounds containing similar structural situations around the bond being considered (11).
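Once heats of formation are available for radicals as well as for the parent molecule, a bond dissociation energy follows from the standard thermochemical relation BDE(A-B) = ΔHf(A·) + ΔHf(B·) - ΔHf(A-B). The sketch below shows only that arithmetic. The function name is hypothetical, and the input values are rounded textbook numbers for methane assumed for illustration; they are not taken from this chapter or produced by its parametrization.

```python
def bond_dissociation_energy(hf_molecule, hf_fragment_a, hf_fragment_b):
    """BDE(A-B) = Hf(A radical) + Hf(B radical) - Hf(A-B), same units throughout."""
    return hf_fragment_a + hf_fragment_b - hf_molecule


# Rounded literature heats of formation at 298 K in kcal/mol (assumed values).
HF_METHANE = -17.9
HF_METHYL_RADICAL = 35.0
HF_H_ATOM = 52.1

bde_ch = bond_dissociation_energy(HF_METHANE, HF_METHYL_RADICAL, HF_H_ATOM)
print(f"BDE(CH3-H) is about {bde_ch:.1f} kcal/mol")
```

In the approach described in the text, the three heats of formation would themselves come from the additivity scheme rather than from tabulated experimental values.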

Table I. Comparison between calculated and experimental bond dissociation energies in methyl propionate (in kcal/mol)

Atom numbering: C(1)H3-C(2)H2-C(3)(=O(4))-O(5)-C(6)H3

bond      BDE (calc)   BDE (exp.) (ref. 11)
C1-H         98.7         98.2
C2-H         93.4         92.3 ± 1.4
C6-H         93.8         94
C1-C2        85.0         86.4 ± 1
C2-C3        83.6         81.2 ± 1
C3-O4       123.4
C3-O5        96.9         95.5 ± 1.5
C6-O5        86.4         83.6 ± 1.5
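The agreement shown in Table I can be summarized numerically. The short Python sketch below computes the mean absolute deviation between calculated and experimental values for the seven bonds that have an experimental entry. The bond labels follow the atom numbering read from the table, and the variable names are introduced here for illustration only.

```python
# (bond, calculated BDE, experimental BDE) in kcal/mol, transcribed from Table I.
# The C3-O4 bond is left out because no experimental value is listed for it.
TABLE_I = [
    ("C1-H", 98.7, 98.2),
    ("C2-H", 93.4, 92.3),
    ("C6-H", 93.8, 94.0),
    ("C1-C2", 85.0, 86.4),
    ("C2-C3", 83.6, 81.2),
    ("C3-O5", 96.9, 95.5),
    ("C6-O5", 86.4, 83.6),
]

# Signed deviation (calculated minus experimental) for each bond.
deviations = {bond: calc - exp for bond, calc, exp in TABLE_I}

mean_abs_dev = sum(abs(d) for d in deviations.values()) / len(deviations)
worst_bond = max(deviations, key=lambda bond: abs(deviations[bond]))

print(f"mean absolute deviation: {mean_abs_dev:.2f} kcal/mol")
print(f"largest deviation: {worst_bond} ({deviations[worst_bond]:+.1f} kcal/mol)")
```

A deviation of this size is comparable to the experimental uncertainties quoted in the table, which is the point the comparison is meant to make.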

An additivity scheme is a rather simple model, but despite this, such schemes can be applied to a variety of physical data of molecules. Benson and Buss have classified additivity rules into successive approximations and have given examples of their applicability (4). According to their terminology the zero-order approximation of a molecular property is given by additivity of atomic properties, first-order approximation by additivity of bond properties, and second-order approximation by additivity of group properties. More recent widespread use of additivity schemes is found in methods for estimating spectroscopic data, in particular those for deriving ¹H or ¹³C NMR chemical shifts of organic molecules.

Polarizability Effects. The next model demonstrates that an additivity scheme can be combined with other forms of mathematical relations to extract the fundamental parameters of a model from primary information. Furthermore, it shows that an additivity scheme useful for the estimation of a global molecular property can be modified to obtain a local, site-specific property. Miller and Savchik (12) have given Equation 1 for estimating the mean polarizability, $\bar{\alpha}$, of a molecule, where $N$ is the total number of electrons in the molecule, and $\tau_i$ is a polarizability contribution for each atom $i$, characteristic of the atom type and its hybridization state.


$$\bar{\alpha} = \frac{4}{N}\left(\sum_i \tau_i\right)^2 \qquad (1)$$
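Equation 1 is simple enough to evaluate directly. The following Python sketch does so for methane, which has 10 electrons. The function name and the entries of `TAU` are assumptions made for this example: the numbers are rounded placeholders, not the published atomic parameters of ref. 12.

```python
def mean_polarizability(tau_values, n_electrons):
    """Equation 1: alpha = (4 / N) * (sum of tau_i) ** 2."""
    if n_electrons <= 0:
        raise ValueError("n_electrons must be positive")
    return 4.0 / n_electrons * sum(tau_values) ** 2


# Placeholder atomic contributions (hypothetical, for illustration only).
TAU = {"C_sp3": 1.3, "H": 0.3}

# Methane: one sp3 carbon, four hydrogens, N = 6 + 4 * 1 = 10 electrons.
methane_taus = [TAU["C_sp3"]] + [TAU["H"]] * 4
alpha_methane = mean_polarizability(methane_taus, n_electrons=10)
print(f"estimated mean polarizability: {alpha_methane:.2f}")
```

Because the sum is squared and divided by the electron count, the atomic contributions are not simply additive in the polarizability itself, which is what distinguishes this model from the plain additivity schemes described earlier.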

Mean molecular polarizability can be calculated through the Lorentz-Lorenz equation from the refractive index, $n$, molecular weight, MW, and density, $d$, of a compound, demonstrating that the parameters $\tau_i$ can be derived from these elementary molecular properties (Figure 3).

Polarizability is a measure of the relative ease of distortion of a dipolar system when exposed to an external field. The stabilization energy due to the interaction between an external charge and the induced dipole is highly distance-dependent and can be calculated through classical electrostatics. The situation is, however, less clearly defined when the charge resides within the molecule that is being polarized. To model the stabilization resulting from polarizability in these situations, we have modified Equation 1 by introducing a damping factor $d^{\,n_i-1}$, where $0 < d < 1$ and $n_i$ gives the smallest number of bonds between an atom $i$ and the charge center (Equation 2) (13).

$$\alpha_d = \frac{4}{N}\left(\sum_i d^{\,n_i-1}\,\tau_i\right)^2 \qquad (2)$$
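The damped sum of Equation 2 needs, for every atom, the smallest number of bonds to the charge center, which a breadth-first search over the molecular graph provides. The Python sketch below combines the two steps for ethane. Several things in it are assumptions made for illustration: the function names, the placeholder values in `TAU`, the default `d = 0.75` (an arbitrary choice within 0 < d < 1), and the decision to give the charge center itself (n_i = 0) the same weight as its direct neighbors.

```python
from collections import deque


def bond_distances(adjacency, start):
    """Breadth-first search: smallest number of bonds from `start` to each atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def effective_polarizability(adjacency, tau, n_electrons, center, d=0.75):
    """Equation 2 with damping d ** (n_i - 1) per atom.

    Assumption: the exponent is clamped at zero, so the charge center
    (n_i = 0) is weighted like a directly bonded atom (n_i = 1).
    """
    dist = bond_distances(adjacency, center)
    total = sum(d ** max(n - 1, 0) * tau[atom] for atom, n in dist.items())
    return 4.0 / n_electrons * total ** 2


# Ethane as an adjacency list; atom labels are arbitrary.
ETHANE = {
    "C1": ["C2", "H1", "H2", "H3"],
    "C2": ["C1", "H4", "H5", "H6"],
    "H1": ["C1"], "H2": ["C1"], "H3": ["C1"],
    "H4": ["C2"], "H5": ["C2"], "H6": ["C2"],
}
# Placeholder contributions (hypothetical, not the parameters of ref. 12).
TAU = {atom: (1.3 if atom.startswith("C") else 0.3) for atom in ETHANE}
N_ELECTRONS = 18  # 2 * 6 + 6 * 1

alpha_d = effective_polarizability(ETHANE, TAU, N_ELECTRONS, center="C1")
alpha_undamped = effective_polarizability(ETHANE, TAU, N_ELECTRONS, "C1", d=1.0)
print(f"damped: {alpha_d:.3f}, undamped: {alpha_undamped:.3f}")
```

Setting `d = 1.0` removes the damping and recovers Equation 1, while any smaller value reduces the contribution of atoms further from the charge center, giving a site-specific rather than a global quantity.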