Expert-System Rules for Diels-Alder Reactions - ACS Symposium

19 Expert-System Rules for Diels-Alder Reactions

C. Warren Moseley, William D. LaRoe, and Charles T. Hemphill

Downloaded by UNIV OF AUCKLAND on May 3, 2015 | http://pubs.acs.org Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch019

Texas Instruments Inc., Dallas, TX 75265

0097-6156/86/0306-0231 $06.00/0 © 1986 American Chemical Society

In Artificial Intelligence Applications in Chemistry; Pierce, T., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1986.

Expert systems of today are powerful when used in the proper domains. Unfortunately, the most difficult part of applying these systems is structuring knowledge into rule format. This paper describes methods that capture Diels-Alder reaction knowledge in a simple and elegant expert-system rule format. Essential components of the system include a grammar for matching the input molecular structure expressed in Wiswesser Line Notation (WLN), the unification of many reactions into a single generalized mechanism using synthon template patterns, the use of WLN rules to produce valid synthons, and the use of frontier molecular orbital (FMO) theory to verify the disconnection. The system is implemented in Prolog, whose natural backtracking and generation capabilities easily express and produce the many structural combinations possible.

There have been attempts to apply formal methods to the representation of organic compounds [1], [2], some attempts to apply artificial intelligence to organic synthesis [3], [4], and numerous attempts to use molecular orbital calculations to verify the validity of compounds in the synthesis route. This effort was a modest attempt to examine the representation issues involved in writing production rules for Diels-Alder disconnections.

The disconnection approach [5] is adopted in this work because it is amenable to backward chaining systems. The starting point is the target compound, which is, in this case, a Diels-Alder product. The target compound is broken, or disconnected, into two distinct parts called synthons. The synthons are the ideal representations of the actual reactants used to produce the target compound. Synthons embody the physical properties of the actual compounds they represent.

As an initial implementation approach, rules could consist of specific targets and a list of their synthons. No one uses this method, because the naive approach of expressing every possible chemical disconnection is impracticable: the number of rules needed to express even trivial synthetic routes grows exponentially. Any expert system solution to the synthesis problem must attack two fundamental problems: the variety of functional groups which may participate in a given reaction and the symmetry involved between functional groups in a reactant (intra-synthon and inter-synthon functional group interaction, respectively). The thrust of this research has been to capture the reaction routes for a chemical disconnection in a clear, symbolic notation which accommodates qualitative reasoning with functional groups and which comprehends the symmetry of this problem.

Ideally, an implementation language would support symbolic and linguistic approaches to representation and manipulation, a qualitative approach to verification, and a deductive approach to disconnection. Prolog [6] is a symbolic language which directly supports backward chaining deduction. Viewed as a declarative language, it naturally supports elegant grammar formalisms, and its procedural aspects support qualitative reasoning. For these reasons, Prolog was chosen as the implementation language for this project.

In summary, the following research goals are addressed in this effort:

1. A linguistic approach to the representation of chemical information.
2. Use of molecular orbital theory to qualitatively validate derived synthons.
3. Unification of synthetic disconnections into a general form.
4. Use of symbolic structure rearrangement in WLN.

2 Grammar Rules for Structure Recognition

The Definite Clause Grammar (DCG) formalism [7] is utilized throughout this project. Grammar rules are used in the expert system rules to recognize the general class of the parent molecule in the disconnection (e.g., cyclohexene). The class determines the patterns used to construct the resultant synthons (discussed in Section 4).

2.1 Background for WLN and DCG

Many researchers have recognized the importance of having an unambiguous grammar for chemical notation, but they have mainly applied WLN [8] to on-line compound search [9] and structural summary (identification of common structural features) [10]. Johns and Clare point out that WLN is a linguistic rather than merely a symbolic notation. This means that the symbols are represented and manipulated in well-defined structures. This section relies on the unambiguousness of WLN to recognize parent molecules, while Section 5 relies on the WLN rules to actually manipulate symbol structures.

The DCG formalism is based on first-order predicate logic and provides a clear and powerful method for describing languages. The formalism generalizes the Context Free Grammar (CFG) formalism, and DCG grammars may be efficiently executed. DCG is most often implemented through a translation process from the DCG notation to a top-down, left-to-right, backtracking Prolog program. This program becomes a parser for the language specified by the DCG.

The required amount of work at each step in a backtracking parser is exponential in the number of constituents already found, just for recognition. This occurs because intermediate effort, which could become useful later, is not saved. Of course, classes of grammars exist for which this behavior does not occur. Most programming language grammars are carefully written to avoid exponential behavior. However, parsing algorithms exist (e.g., the active chart parser [11]) where the worst case parsing time is $O(n^3)$ for any CFG grammar and $O(n^2)$ when the grammar is unambiguous ($n$ is the sentence length). Nevertheless, Prolog provides an adequate DCG grammar parsing mechanism for the purposes of this work.

2.2 Grammar for Diels-Alder Reactions

This section examines grammars used to recognize parent molecules (carbocyclic rings, for example). The following regular expression [12] recognizes cyclohexene:

$$\mathtt{L6UTJ}\ [\mathtt{A}\sigma_A]\ [\mathtt{B}\sigma_B]\ [\mathtt{C}\sigma_C]\ [\mathtt{D}\sigma_D]\ [\mathtt{E}\sigma_E]\ [\mathtt{F}\sigma_F]$$

where, if $r$ is any regular expression, $[r]$ is an abbreviation for $(\epsilon + r)$ (in other words, $r$ is optional), $\epsilon$ is a regular expression that denotes the empty string, and $+$ is the union operator for the languages represented by the regular expression arguments. The symbol $\sigma$ represents an arbitrary substituent, with the subscripts indicating to which ring locant the substituent belongs.

Using DCG, the more general class of carbocyclic rings can be recognized. The grammar rule

    carbocyclic(Substituents, Number) -->
        "L", number(Number), "U", "T", "J",
        substituents(Substituents, Number).

achieves the desired result. Within this rule the logical variables are denoted by a leading capital letter. This declaratively states that carbocyclic rewrites into the letter "L", followed by a number (which in turn is recognized by DCG grammar rules), followed by the letters "UTJ", followed by the substituents. The substituents rule recognizes the Substituents at each ring locant and uses the instantiated value for Number to verify that the ring locant values are within the proper range. Subsequent steps in the disconnection process utilize the variables mentioned in the head of the rule.

Finally, using the grammar rule described above (and related rules not presented), the goal

    carbocyclic(S, N, "L6UTJ A1 BNW F3", [])

rewrites the string "L6UTJ A1 BNW F3" into the empty list [] (meaning that the entire string is recognized) and produces the result

    S = [[A,1],[B,N,W],[F,3]], N = 6.

S is a list of ring locants and the corresponding substituents used in subsequent disconnection stages. N represents the number of ring locants.

2.3 Application to Other Reactions

The general grammars and the mixture of declarative and procedural Prolog code allow easy grammar rule writing for other reactions. As an additional example, consider heterocyclic rings. The grammar rule

    heterocyclic(Substituents, Number, Heteroatom) -->
        "T", number(Number), heteroatom(Heteroatom), "J",
        substituents(Substituents, Number).

recognizes this class of molecules.


The following grammar rule recognizes the heteroatom:

    heteroatom(Heteroatom) --> [Heteroatom], {member(Heteroatom, "NOS")}.

Curly braces allow direct inclusion of Prolog terms within DCG grammars (the terms are not translated). In this case, the member predicate tests the value of the Heteroatom variable for membership in a list of heteroatoms.
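The recognition step described in this section can be sketched outside Prolog as well. The block below is a minimal hand-written recognizer in Python, not the authors' DCG: the function name `parse_ring`, the returned dictionary, and the assumption that valid locants are the first `Number` capital letters are all illustrative choices. It accepts only the two ring patterns given by the carbocyclic and heterocyclic rules.

```python
import string

HETEROATOMS = "NOS"  # same membership list as the heteroatom rule


def parse_ring(wln):
    """Recognize 'L<n>UTJ ...' (carbocyclic) or 'T<n><hetero>J ...' (heterocyclic).

    Returns a dict with the substituent lists, the ring size and the
    heteroatom (None for carbocycles), or None if the string is rejected.
    """
    head, *groups = wln.split(" ")
    kind = head[:1]
    # Read the ring size that follows the leading ring symbol.
    i = 1
    digits = ""
    while i < len(head) and head[i].isdigit():
        digits += head[i]
        i += 1
    if not digits:
        return None
    number = int(digits)
    rest = head[i:]
    if kind == "L" and rest == "UTJ":
        heteroatom = None
    elif kind == "T" and len(rest) == 2 and rest[0] in HETEROATOMS and rest[1] == "J":
        heteroatom = rest[0]
    else:
        return None
    # Assumption: locants run over the first `number` capital letters.
    allowed = string.ascii_uppercase[:number]
    substituents = []
    for group in groups:
        if len(group) < 2 or group[0] not in allowed:
            return None
        substituents.append(list(group))
    return {"substituents": substituents, "number": number, "heteroatom": heteroatom}
```

Calling `parse_ring("L6UTJ A1 BNW F3")` yields the same locant and substituent grouping as the Prolog goal shown earlier, with the symbols as Python strings rather than Prolog terms.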


3 The Reaction Check

This system covers concerted reactions of the π electron systems on two reactants to form new σ bonds, yielding carbocyclic rings with a single unsaturation. If the reaction follows the rule of maximum orbital overlap, then it is a suprafacial, suprafacial process and is termed a $[{}_{\pi}4_s + {}_{\pi}2_s]$ reaction. By the Woodward-Hoffmann rules this is a symmetry-allowed thermal reaction [13].

The theoretical underpinnings used in this program are derived from those used by Jorgensen et al. in the CAMEO system [14], [15], with the exception that our system works backwards: it goes from a product either to the reactants which form it or to a statement informing the user that a disconnection is not possible.

3.1 Basic Frontier Molecular Orbital Theory

It is known from molecular orbital theory that molecules possess sets of individual molecular orbitals (as long as the molecules are sufficiently far apart from each other). These are the basic unperturbed molecular orbitals used in the evaluation of the reaction. As the molecules move closer together, their orbitals begin to overlap. This interaction between the orbitals on the different molecules results in the mixing of the orbitals on each molecule [13].

According to frontier molecular orbital theory, the strongest interactions are between those orbitals that have coefficients with similar magnitudes relative to the unperturbed molecules, i.e., the interaction is between the small coefficient on the dienophile and the small coefficient on the diene [16], [17]. If both of the molecular orbitals involved in the bonding are filled, the resulting orbital is not significantly reduced in energy [18]. The greatest reduction in energy arises in the interaction between a filled molecular orbital and an empty one. Since the interaction is strongest between orbitals of like energy, the ideal combination of orbitals is between the highest occupied molecular orbital (HOMO) on one molecule and the lowest unoccupied molecular orbital (LUMO) on the other.

Although Diels-Alder reactions can occur in the unsubstituted case, the reaction is most successful when the diene and the dienophile contain substituents which exert a favorable electronic influence [19]. In the normal electron demand case, the most favorable interactions are between dienes with electron-donating groups and dienophiles with electron-withdrawing groups. Cases have been reported in which inverse electron demand occurs and the electronic natures of the diene and dienophile are reversed [20], [21], [22]. This case of inverse electron demand is accounted for in the system.
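One way to picture the choice between normal and inverse electron demand is to compare the two possible HOMO-LUMO gaps and keep the smaller one. The Python sketch below does exactly that. The function name, the `(homo, lumo)` tuple layout, and the rule of picking the smaller gap are assumptions made for illustration; the paper states only that like-energy orbitals interact most strongly and that both cases are handled.

```python
def electron_demand(diene, dienophile):
    """Pick the dominant frontier-orbital pairing.

    Each argument is a (homo, lumo) pair of orbital energies in eV.
    Returns the pairing label and its energy gap.
    """
    # Normal demand: HOMO of the diene with LUMO of the dienophile.
    normal_gap = dienophile[1] - diene[0]
    # Inverse demand: HOMO of the dienophile with LUMO of the diene.
    inverse_gap = diene[1] - dienophile[0]
    if normal_gap <= inverse_gap:
        return "normal", normal_gap
    return "inverse", inverse_gap
```

For example, with hypothetical energies, `electron_demand((-8.5, 1.0), (-11.0, 0.0))` selects the normal pairing because its gap (8.5 eV) is smaller than the inverse gap (12.0 eV).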

3.2 Structural Constraints on Reactants

It became necessary early on in the project to develop a method for quickly checking the reactants for structural features which would make them unsuitable for the Diels-Alder reaction. The constraints are integrated into the notation package, since they are most easily recognized in terms of the notation patterns resulting from the disconnection. The synthons produced by a Diels-Alder disconnection are checked for proper configuration. All synthons are checked before the FMO algorithm begins; a synthon that violates a constraint causes program execution to fail and a "no" to be returned, indicating no reaction. This assures that synthons produced by the rules are actually reactive. The following structural features of diene-synthons are considered unreactive in $[{}_{\pi}4_s + {}_{\pi}2_s]$ cycloadditions:

1. Any diene-synthon unable to have an s-cis conformation.
2. Diene-synthons in which an exocyclic double bond is conjugated to a double bond in the ring (e.g., a double bonded substituent on the diene).
3. Diene-synthons in large (greater than 7-membered) rings.
4. Acyclic compounds that have bulky substituents at the central positions on the diene-synthon. The substituents at these positions are relatively close to each other, and bulk leads to steric hindrance.
5. Substitution at both terminal diene-synthon positions is allowed only if the substituent is a primary atom or a triply bonded functional group (such as a cyano group).

All double bonds are perceived as possible dienophile synthons by the notation package. The screening involves only the elimination of all double bonds in aromatic compounds (WLN symbol "R").
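The five diene-synthon constraints above amount to a checklist applied before any orbital calculation. The Python sketch below shows one possible shape for such a screen. The dictionary keys (`can_be_s_cis`, `exocyclic_conjugated`, `ring_size`, `bulky_central`, `terminal_kinds`) and the kind labels are hypothetical; the paper derives these facts from WLN patterns, which is not reproduced here.

```python
# Substituent kinds permitted when both terminal positions are substituted (rule 5).
ALLOWED_TERMINAL_KINDS = {"primary_atom", "triple_bond"}


def diene_rejections(d):
    """Return the list of violated constraints for a diene-synthon description.

    `d` is a plain dict with illustrative keys, not the paper's data model.
    An empty list means the synthon passes the structural screen.
    """
    reasons = []
    if not d.get("can_be_s_cis", True):
        reasons.append("no s-cis conformation")                      # rule 1
    if d.get("exocyclic_conjugated", False):
        reasons.append("exocyclic double bond conjugated to ring")   # rule 2
    ring_size = d.get("ring_size")  # None means acyclic
    if ring_size is not None and ring_size > 7:
        reasons.append("ring larger than 7 members")                 # rule 3
    if ring_size is None and d.get("bulky_central", False):
        reasons.append("bulky central substituents")                 # rule 4
    terminals = d.get("terminal_kinds", (None, None))
    both_substituted = all(kind is not None for kind in terminals)
    if both_substituted and any(k not in ALLOWED_TERMINAL_KINDS for k in terminals):
        reasons.append("disallowed substitution at both terminals")  # rule 5
    return reasons
```

Returning the reasons rather than a bare boolean is a design choice for readability; a Prolog version would simply fail, matching the "no" answer described in the text.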

3.3 Basic HOMO-LUMO Calculations

From work performed in 1983 by Burnier and Jorgensen [15], the following ab initio calculations for the HOMO and LUMO energies of the synthons were developed. The function n(x, parent) returns the number of atoms of type x in the parent. This function is abbreviated below as simply n(x) where the parent is understood. The symbols UU, O, N, S represent triple bonds, oxygen, nitrogen, and sulfur, respectively. The subscripts 'c' and 't' denote central and terminal locations, respectively, in the parent for the elements which they modify. For brevity, the terms diene-synthon and dienophile-synthon will be replaced with diene and dienophile, respectively.

For Dienes:

$$E_{\mathrm{HOMO}} = -2n(\mathrm{O}) - n(\mathrm{UU}) - 0.2n(\mathrm{N}_c) - 0.5n(\mathrm{S}_t) - n(\mathrm{S}_c) - 9.0 \tag{1}$$

$$E_{\mathrm{LUMO}} = -n(\mathrm{O}) - 0.5n(\mathrm{N}) - 2n(\mathrm{S}_t) + 1.5n(\mathrm{S}_c) + 0.6 \tag{2}$$

For Dienophiles:

$$E_{\mathrm{HOMO}} = -n(\mathrm{UU}) - 4n(\mathrm{O}) - 2n(\mathrm{N}) - n(\mathrm{S}) - 10.5 \tag{3}$$

$$E_{\mathrm{LUMO}} = n(\mathrm{UU}) - n(\mathrm{O}) - 0.5n(\mathrm{N}) - 4n(\mathrm{S}) + 1.8 \tag{4}$$
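Equations (1) through (4) translate directly into two small functions. The Python sketch below is illustrative only: the function and parameter names are invented, the energies are presumably in eV, and the placement of the central and terminal subscripts on N and S in the diene formulas is read from a damaged scan and should be treated as an assumption.

```python
def diene_parent_energies(n_O=0, n_UU=0, n_N=0, n_N_c=0, n_S_t=0, n_S_c=0):
    """Return (E_HOMO, E_LUMO) for an unsubstituted diene parent, eqs. (1)-(2).

    n_N_c counts central nitrogens; n_S_t / n_S_c count terminal / central sulfurs.
    """
    homo = -2 * n_O - n_UU - 0.2 * n_N_c - 0.5 * n_S_t - n_S_c - 9.0
    lumo = -n_O - 0.5 * n_N - 2 * n_S_t + 1.5 * n_S_c + 0.6
    return homo, lumo


def dienophile_parent_energies(n_UU=0, n_O=0, n_N=0, n_S=0):
    """Return (E_HOMO, E_LUMO) for an unsubstituted dienophile parent, eqs. (3)-(4)."""
    homo = -n_UU - 4 * n_O - 2 * n_N - n_S - 10.5
    lumo = n_UU - n_O - 0.5 * n_N - 4 * n_S + 1.8
    return homo, lumo
```

With all counts at zero the functions return only the trailing constants, which is the all-carbon default.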


In the carbocyclic ring case, the HOMO-LUMO values default to the constants at the end of the equations. The formulas above are used to compute the orbital energies (both HOMO and LUMO) of the unsubstituted parent compounds. In the case of substituted compounds, additional formulas account for the electronic effects of the substituents.

The explanation of the regiospecificity of Diels-Alder reactions requires knowledge of the effect of substituents on the coefficients of the HOMO and LUMO orbitals. In the case of normal electron demand, the important orbitals are the HOMO on the diene and the LUMO on the dienophile. It has been shown that the reaction occurs in a way which bonds together the terminal atoms with the coefficients of greatest magnitude and those with the coefficients of smaller magnitude [18]. The additions are almost exclusively cis and, with only a few exceptions, the relative configurations of substituents in the components are kept in the products [19].

It is known that the effects of substituent groups on a diene or dienophile vary between different types of parents [23]. A function, τ(Y), has been determined for several functional groups, with Y corresponding to their electron donating or withdrawing capability, such that a reasonable estimate of the HOMO energy can be obtained by use of the equation [15]:

$$E_{\mathrm{HOMO}} = \gamma(P)\,\tau(Y) + E_{\mathrm{HOMO}}(P) \tag{5}$$

This equation yields a value for the substituted molecule, where γ(P) is the sensitivity of the parent P. Some initial values, called τ values, which describe the electronic effects of functional groups have been found and developed by Jorgensen et al. Hydrogen was assigned a τ of 0.0 eV, so that electron withdrawing substituents have negative τ values and electron donating groups have positive τ values. The values for τ were chosen so that a 0.5 eV change in the substituent gives a change of 10 in the τ value. This algorithm, when combined with the notation rules, yields useful results for many functional groups and gives reasonable estimates of the values for those not known. The factor γ(P) for an ethylene analog is given by:

$$\gamma(P) = 0.01\,n(\mathrm{UU}) + 0.06\,n(\mathrm{O}) + 0.03\,n(\mathrm{N}) + 0.03\,n(\mathrm{S}) + 0.05 \tag{6}$$
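As a rough illustration of equations (5) and (6), the Python sketch below computes the parent sensitivity for an ethylene analog and applies it to a τ value. It assumes the sensitivity multiplies τ, which matches the statement that a change of 10 in τ corresponds to 0.5 eV when the sensitivity is 0.05. The function names are hypothetical.

```python
def gamma_ethylene_analog(n_UU=0, n_O=0, n_N=0, n_S=0):
    """Sensitivity of an ethylene-analog parent to substitution, eq. (6)."""
    return 0.01 * n_UU + 0.06 * n_O + 0.03 * n_N + 0.03 * n_S + 0.05


def substituted_homo(parent_homo, gamma, tau):
    """HOMO energy of the substituted molecule, eq. (5).

    Assumption: gamma scales tau, so tau = 10 with gamma = 0.05 shifts
    the parent energy by 0.5 eV.
    """
    return gamma * tau + parent_homo
```

For plain ethylene, `substituted_homo(-10.5, gamma_ethylene_analog(), 10)` gives a HOMO energy 0.5 eV above the parent value.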

For any given diene the value for γ(P) can be adequately represented by the value 0.03 eV. This provides the proper value for correction in the calculation due to the sensitivity of the parent compound towards different types of functionality.

3.4 Determination of Substituent Effects

To determine substituent effects, substituent groups are built from primary recognized atoms and functional groups. A functional group is scanned one Wiswesser symbol at a time. A Wiswesser symbol can represent either an individual atom (e.g., "G" for chlorine) or a functional group (e.g., "Z" for the amino group). This allows us to adapt the "layer" method of Jorgensen to the scanning of the functional groups on the rings. These groups are provided as Prolog sublists as outlined in the previous section. Once the functional group elements have been compared with the known values, τ is calculated by the following formula:

$$\tau_{\mathrm{total}} = \tau_{\max} + \frac{\tau_{\mathrm{sum}}}{1 + N_{FG}} \tag{7}$$


Table 1. Example τ Table Entries

    name                          WLN       τ
    tau_entry(p-methoxyaryl,      "R DO1",  51).
    tau_entry(trimethylamino,     "N1&1",   44).
    tau_entry(aryl,               "R",      42).
    tau_entry('methyl sulfate',   "S1",     38).
    tau_entry(amino,              "Z",      36).
    tau_entry(olefinic,           "1U2",    36).
    tau_entry(sulfate,            "SH",     32).

The legend for equation (7) is:

• $\tau_{\max}$ - the largest calculated reference value of τ in either the positive or negative direction.
• $\tau_{\mathrm{sum}}$ - the sum of the remaining τ values in the functional group.
• $N_{FG}$ - the number of functional groups attached to the parent system.

The above is based on the calculation of a collective τ for the whole molecule. This value changes the HOMO of either the diene or dienophile, as is necessary. This equation is accurate to about 0.5 eV on either side of the "known" values [15]. The value of $\tau_{\mathrm{total}}$ is inserted into the HOMO-LUMO calculation as the parameter τ(Y). Note that in its pure form, this equation only yields values for the HOMO orbitals. Corrections are used for the calculation of the LUMO values. Table 1 contains examples of the Wiswesser Line Notation and the raw τ values used in the computation of orbital energies.
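The collective τ described by the legend can be sketched in a few lines of Python. This is an interpretation, not the authors' code: it assumes the largest-magnitude τ is kept whole while the sum of the remaining values is damped by 1 + N_FG, and it assumes an empty list behaves like hydrogen (τ = 0). The small lookup table reuses three entries from Table 1; the names `TAU` and `collective_tau` are invented.

```python
# A few raw tau values keyed by WLN symbol, copied from Table 1.
TAU = {"R": 42, "Z": 36, "1U2": 36}


def collective_tau(taus, n_fg):
    """Combine the tau values of a functional group into one collective tau.

    taus -- raw tau values found while scanning the group
    n_fg -- number of functional groups attached to the parent system
    """
    if not taus:
        return 0.0  # assumption: no recognized symbols behaves like hydrogen
    # tau_max is the largest value in either direction, so rank by magnitude.
    ordered = sorted(taus, key=abs, reverse=True)
    tau_max, remaining = ordered[0], ordered[1:]
    return tau_max + sum(remaining) / (1 + n_fg)
```

For instance, `collective_tau([TAU["Z"]], 1)` returns 36.0 because there are no remaining values to damp.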

3.5 Determination of Permutated LUMO Coefficients

The following rules were used for the determination of the LUMO orbital coefficients from the values determined for the HOMO coefficients [15].

1. An electron donating functional group raises the energy of the HOMO orbital of a system about twice as much as it raises the LUMO.
2. In contrast, an electron withdrawing functional group lowers the HOMO energy about one third as much as it lowers the LUMO.
3. Groups which add conjugation, such as olefinic, acetylenic, and aromatic groups, lower the LUMO orbital energy by one third to one half as much as they raise the HOMO energy.

The same equations are used to determine both the HOMO and LUMO values. This is consistent with the fact that the HOMO and LUMO orbitals are calculated from the same parent system, and that the difference between the orbital energies can be adequately covered by the two parameters: γ(P), which represents the sensitivity of the parent to substitution, and τ(Y), which represents the electronic effect exerted by the functional group acting as a substituent.


To implement the rules mentioned above, only the τ(Y) values for the functional groups are changed. Thus, the τ(Y) values for the calculation of the LUMO orbitals on both the diene and dienophile are changed following these rules:

1. Positive τ values, except those for conjugated hydrocarbons, are divided by a factor of 2.
2. Negative τ values are multiplied by 3.
3. τ values for conjugated hydrocarbons are divided by a factor of 3 and their signs are reversed.

This method covers many combinations of functional groups that influence the orbital energies. A feature of this method is that it uses the same functional group τ values as in the HOMO energy calculation.

The algorithm described above is used for the calculation of both the HOMO and LUMO atomic coefficients. The τ values of the substituents are permutated to give the proper values for the LUMO orbitals. The following steps are required:

1. τ values on terminal positions are taken from the list previously described.
2. Resultant τ values on the central diene positions are divided by a factor of two to accommodate the fact that the orbital coefficients at these positions are very small.

3.6 Algorithm for Regiochemical Selection

Any functional group attached to a terminal carbon on either a diene or dienophile increases the magnitude of the coefficient on the opposite terminal. Any functional group attached to a central position on the diene (there is no analogous case for the dienophile) increases the magnitude of the coefficient on the terminal farthest from the substituted position. For cyclohexene, the central locants are the A and B positions on the Diels-Alder adduct. Thus, if a functional group is on position A, the magnitude of the coefficient at terminal C increases.

One of the remarkable aspects of the Diels-Alder reaction is the specificity of the bonding between the carbon atoms [13]. The orientation of the addition can be accurately predicted by an extended form of the frontier molecular orbital theory as developed by Fukui and Fujimoto et al. [16].

For dienes the coefficients are determined as follows: if the sum of the absolute values of τ on positions F and B is greater than the corresponding sum on positions A and C, then the coefficient on position C has the greater magnitude; otherwise the coefficient on position F has the greater magnitude. On dienophiles, if the sum of the absolute values of τ is greater on position D than on position E, then E has the greater magnitude.
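The τ adjustments of Section 3.5 and the comparison stated in the closing sentences above can be sketched together in Python. The block follows the explicit sum comparison as worded at the end of this section. The function names, the dictionary keyed by ring locant, and the fallback to "D" for the dienophile when the test fails are assumptions, since the text does not spell them out.

```python
def lumo_tau(tau, conjugated_hydrocarbon=False):
    """Convert a HOMO tau value into the value used for the LUMO (Section 3.5)."""
    if conjugated_hydrocarbon:
        return -tau / 3   # divide by 3 and reverse the sign
    if tau > 0:
        return tau / 2    # positive (donor) values are halved
    return tau * 3        # negative (acceptor) values are tripled


def positional_tau(tau, central=False):
    """Halve tau on the central diene positions; terminal values pass through."""
    return tau / 2 if central else tau


def diene_larger_terminal(tau_by_locant):
    """Return 'C' or 'F', the diene terminal with the larger coefficient.

    tau_by_locant maps adduct locants ('A', 'B', 'C', 'F') to tau values;
    missing locants count as unsubstituted (tau = 0).
    """
    def mag(locant):
        return abs(tau_by_locant.get(locant, 0.0))
    return "C" if mag("F") + mag("B") > mag("A") + mag("C") else "F"


def dienophile_larger_terminal(tau_by_locant):
    """Return 'E' if |tau| on D exceeds |tau| on E, else 'D' (assumed default)."""
    d = abs(tau_by_locant.get("D", 0.0))
    e = abs(tau_by_locant.get("E", 0.0))
    return "E" if d > e else "D"
```

As a usage example, `diene_larger_terminal({"F": 36})` returns "C", in line with the rule that a terminal substituent enlarges the coefficient on the opposite terminal.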

4 Reaction Unification Using a General Form

This section examines the notion of a general form for representing the possible synthons in a reaction. Derivation of this form is illustrated and examples of the general form are presented. Symmetry and the encoding of optional notation are discussed, and some examples of the naive approach are presented.

Table 2. The Naive Approach

    Parent                          Synthons
    discon("L6UTJ A1 B1",           ["1UY1&Y1&U1", "1U1"]).
    discon("L6UTJ D1Q",             ["1U2U1", "Q2U1"]).
    discon("L6UTJ A1 B1 DOV1",      ["1UY1&Y1&U1", "1VO1U1"]).
    discon("L6UTJ A1 B1 D1 ENW",    ["1UY1&Y1&U1", "WN1U2"]).

4.1 Motivation: the Naive Approach

In the naive approach, disconnections are simply listed as facts with the molecule to disconnect as the first argument and a list of the synthons as the second. Table 2 contains some examples. This approach suffers in many ways; primarily, the number of rules would become unmanageable (quite huge even for cyclohexene), slowing the inferencing speed of the expert system. A sample inference mechanism using these facts (given the natural backward chaining of Prolog) might be

    % A parent with a known disconnection: disconnect its synthons in turn.
    disconnect(Parent, Given_Synthons) :-
        discon(Parent, Synthons),
        disconnect(Synthons, Given_Synthons).
    % An available (given) compound needs no further disconnection.
    disconnect(Parent, [Parent]) :-
        given(Parent).
    % A list of synthons: disconnect the first, then the rest.
    disconnect([First|Rest], [First_Disc|Rest_Disc]) :-
        disconnect(First, First_Disc),
        disconnect(Rest, Rest_Disc).
    disconnect([], []).

This procedure recursively disconnects synthons until the final synthons for the original parent are all available (or given) compounds. Upon successful completion, the variable Given_Synthons contains a tree (in list notation) which denotes the synthon combination order to reproduce the parent compound.

4.2 Derivation of the General Form

Consider the domain of a six-membered ring with single unsaturation. Table 3 expresses the synthetic route with one substituent. Again, the symbol σ represents an arbitrary substituent. Square brackets surrounding a set of symbols indicate optionality of those symbols (as in regular expression notation). For example, the string σ[&] may reduce to the string σ or σ& depending on whether the substituent represented by σ ends in a terminal symbol or not (following the rules of WLN).

Symmetry in the patterns, however, hides many details in the diene and dienophile patterns. Table 4, with combinations of symmetric substituents, reveals more of the details. The order of the symmetric substituents may be chosen arbitrarily. Alphabetical ordering was chosen here for consistency. Finally, for a full cyclohexene molecule, the patterns become