6. Rapid Generation of Reactants in Organic Synthesis Programs
MALCOLM BERSOHN
Dept. of Chemistry, University of Toronto, Toronto, Canada M5S 1A1
Downloaded by CORNELL UNIV on August 27, 2016 | http://pubs.acs.org Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0061.ch006
The question of efficiency of reactant generation has not received primary attention, and this is as it should be, since in a problem requiring more than about four steps it is more important to develop better heuristics to restrain the generation of reactants than it is to find ways to generate the latter more rapidly. Furthermore, the program of the future may well spend as much of its time searching an external data base as it does generating reactants. The external data base would be, in essence, the synthetic part of Chemical Abstracts. It would contain the solutions of standard problems, so that whenever the molecule at hand is recognized to be similar to a standard problem the stored solution, a sequence of reactions, is retrieved. In this future situation the representation of molecular structure used internally by the synthesis program may have to be the same as that of the external data base. Hence the molecular structure representation might have to be chosen primarily from this point of view rather than to optimize the speed of reactant generation.

All this being said, reactants still have to be generated, and the rapid generation of reactants is an economic benefit. Speeding up the generation of reactants means speeding up the component routines.
Of these the most time consuming are (1) canonicalization of the molecular structure representation, (2) finding the rings, (3) finding the functional groups and (4) retrieving the reactions and performing the tests to decide whether the particular product molecule at hand is suitable as a product of the reaction. Comparatively speaking, the actual generation of the structure of the reactant molecule(s) is a brief operation. In my programs the most time consuming single routine is the canonicalization of the molecular structure representation, and therefore approaches to the acceleration of this routine will be discussed first.

Wipke and Howe; Computer-Assisted Organic Synthesis ACS Symposium Series; American Chemical Society: Washington, DC, 1977.

I. Canonicalization of the Molecular Structure Representation

Basically, canonicalization consists of numbering the non-hydrogen atoms of the molecule according to a set of rules. This means that a molecule can be represented in only one way. Having numbered the atoms according to the rules we are rewarded with several advantages, namely:

1. We can speedily recognize whether the molecule at hand is the same as a previously generated reactant molecule or the same as a molecule known in the program as being available. Without canonicalization we would be forced to do some kind of atom by atom matching (1) in order to determine if two structures are the same.

2. In the course of deciding the precedence of the atoms we necessarily have to discover any equivalence between atoms that exists in the molecule. Atoms are said to be equivalent if they are carried into each other's position by global or local symmetry operations of the molecule. Global symmetry operations are rotations or reflections or combinations of these with respect to infinitely long axes or infinite planes passing through the center of the molecule. Local symmetry operations are rotations about bounded axes and reflections in bounded planes. In the figure below we see a molecule
1,1-Dimethyl-3-trichloromethylcyclohexane, which has local symmetry. A local C3 axis terminates in atom 1 and a local C2 axis terminates in atom 2. (The point group of the molecule is the direct product group C3L ⊗ C2L, where the suffix L means local.) The equivalence of the chlorines and the two methyl groups can be discovered by canonicalization without having to build a model and actually perform a local reflection or rotation to see if the result is the same molecule. Similarly, with molecules that have global symmetry, such as methylcyclohexane, the presence of two pairs of equivalent methylene carbon atoms can be found without performing a global reflection. Knowing which atoms are equivalent to each other is necessary for concluding that chirality is absent in the common ligand of equivalent atoms. Under various conditions, many reactions produce two products in significant yield and are therefore not to be recommended when a certain pair of reacting atoms are not equivalent. When this pair is equivalent the reaction produces one product and the reaction is to be recommended. It is therefore absolutely necessary for a synthesis program to know which atoms in the molecule at hand are equivalent. The reactions include electrocyclic reactions, reactions involving the carbon atom alpha to a ketone when both alpha atoms have the same number of attached hydrogen atoms, Wittig type reactions etc.

3. Having numbered the atoms canonically, these numbers provide a global ordering of the atoms which can be used locally at a chiral center to determine whether the center should be called R or S.
Thus, if the ligand atoms of a chiral atom are canonically numbered 3, 7, 9 and 14 and atoms 3, 7, 9 are arranged in a counterclockwise fashion when viewed from the side opposite atom number 14, then we can mark the atom with an S. This internal R,S notation may differ for some centers from that provided by the Cahn-Ingold-Prelog procedure (2), but there is no difficulty in translating between the systems since the absolute configuration is embodied in the molecular representation. (Normally this will not be necessary as most reactants never see the light of day: in my noninteractive synthesis programs, tens of thousands of molecular structures are often generated before an acceptable synthetic pathway is produced.) Thus we are spared the trouble of determining the sequence of the four ligands in the Cahn-Ingold-Prelog sense.

4. If the atoms are numbered, other things being equal, in order of their atomic number and thereafter in order of their degree of unsaturation, then choosing the descending orders, oxygen atoms precede nitrogen atoms which precede carbon atoms, unsaturated oxygen atoms precede saturated oxygen atoms etc. Now if we further order the list of the ligands of each atom in ascending order of the numbers of the atoms then it is often possible for a subprogram to know "in advance" which ligands are which. For example, if a program is examining the description
of the nitrogen of an amine oxide, then the first number encountered in the list of ligands of the nitrogen is that of the oxygen atom and the next number is that of the carbon atom ligand. Again, if the program is examining an ester carbonyl carbon, the first ligand encountered in the list will automatically be the unsaturated carbonyl oxygen, the next ligand will be the saturated ether oxygen and the last ligand will be the alpha carbon atom. The program can pick up these numbers and use them in other contexts without examination of the data about the atoms to which the numbers refer.

Accepting that the application of a set of rules for numbering the atoms of each molecule considered by the program is necessary, one might ask why not use the book full of IUPAC rules (3)? The problem here is that since the IUPAC numbering rules depend upon the ring system being considered, the programming of this book full of special cases is an enormous task, not worth while. It is much easier to have briefly stated rules.

Computerized Procedures for Numbering the Atoms of a Molecule
W.T. Wipke (4) first achieved the canonicalization of a molecular structure representation in a computer program that includes all aspects of stereochemistry. In some other schemes, stereochemistry is not invoked to aid in the numbering of the atoms. Such schemes are incomplete. If we relegate stereochemistry to a footnote and the numbering of the atoms and the connection tables of two isomers can be the same, then we lose most of the above-stated advantages of canonicalizing. H. Gelernter's program (5) is unique in that the sorting is done via the Wiswesser line notation. The Wiswesser scheme (6) sorts the groups, and the numbering of the atoms follows from their positions in the groups and the order in which the groups are given in the Wiswesser symbols that convey the structure of the given molecule. All other schemes reported in the literature (7, 8, 9, 10) for canonicalizing a molecular structure representation require a direct ordering of the non-hydrogenic atoms of the molecule. In what follows we will omit the modifier "non-hydrogenic" and ask the reader to note that the hydrogen atoms are not numbered but are considered to be properties of their ligands. We can divide the methods already in use into tree algorithms and sum algorithms, depending on how the external environment of each atom is represented.

The atoms of a molecule can be partitioned into equivalence classes on the basis of a single property or a set of properties. The value of each class for the property(s) will be called the initial canonical value.
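As an illustration of such a partition, the following minimal Python sketch groups arbitrarily numbered atoms by an initial canonical value built from a chosen set of properties. The `Atom` record, the function `initial_classes`, the numeric unsaturation code and the acetone example are assumptions introduced here for illustration; they are not taken from any of the programs discussed.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Atom(NamedTuple):
    atomic_number: int
    unsaturation: int  # assumed code: 0 = saturated, 2 = doubly bonded
    hydrogens: int     # number of attached hydrogen atoms


def initial_classes(atoms: Dict[int, Atom],
                    properties: Sequence[str]) -> Dict[Tuple[int, ...], List[int]]:
    """Group atom numbers by the tuple of the selected property values."""
    classes: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for number in sorted(atoms):
        key = tuple(getattr(atoms[number], name) for name in properties)
        classes[key].append(number)
    return dict(classes)


# Acetone, CH3-C(=O)-CH3, with an arbitrary numbering of the non-hydrogen atoms.
acetone = {
    1: Atom(6, 0, 3),  # methyl carbon
    2: Atom(6, 2, 0),  # carbonyl carbon
    3: Atom(8, 2, 0),  # carbonyl oxygen
    4: Atom(6, 0, 3),  # methyl carbon
}

by_element = initial_classes(acetone, ["atomic_number"])
by_element_and_unsaturation = initial_classes(
    acetone, ["atomic_number", "unsaturation"])
print(by_element)
print(by_element_and_unsaturation)
```

With the atomic number alone the three carbon atoms fall into one class. Adding the unsaturation separates the carbonyl carbon, while the two methyl carbons stay together and must be resolved, or shown to be equivalent, by the later stages of the procedure.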
The set of properties could include the atomic number, predominant in the Cahn-Ingold-Prelog system if this is used to number the atoms of the molecule, or the number of ligand atoms, which is the property
of the greatest use in the Morgan algorithm. We can also include the degree of unsaturation, the number of attached hydrogen atoms, the charge and information about the size of the ring(s) of which the atom is a member. It is evident that the more of these pieces of information that comprise the initial canonical value, the more equivalence classes we will have. Thus if we use only the atomic number we have a large equivalence class with the value 6. If we add to this a number representing the degree of unsaturation, the number of equivalence classes is increased and the size of each class is reduced. Still there may be many atoms with the initial canonical value 6.0, i.e. saturated carbon atoms. There are many others, perhaps with the value of 6.1, aromatic carbon atoms, or 6.2, doubly bonded carbon atoms, etc. If we further include the number of attached hydrogen atoms we can have the equivalence classes of 6.0.2 and 6.0.1, referring to saturated methylene and methinyl carbon atoms respectively. Adding the ring size we then distinguish the class with the initial canonical value 6.0.2.5 from the class with the initial canonical value 6.0.2.6. Different schemes of canonicalization can be distinguished by the properties selected to establish the initial canonical value and by whether the ligands of each atom are characterized by a tree of such initial canonical values or by the sum of such values obtained by a successive summation process to be detailed later.

Tree Algorithms and Sum Algorithms
We illustrate the tree approach with an example, using the molecule of Figure 1. In this figure the atoms are numbered arbitrarily,

Figure 1. A molecular structure with arbitrarily numbered atoms
for reference in the discussion, not canonically. We will take as the set of canonical properties the single property of atomic number, and treat double bonds as meaning the double occurrence of the atom concerned, both as in the Cahn-Ingold-Prelog system. We consider atoms 2 and 6. The initial canonical values for these are both 6. Hence in the effort to distinguish them, we walk out in all directions and compare the canonical values along the paths from both the atoms. The paths of length one, i.e. taking account only of the ligands, give strings of 6.6 for both atoms. The paths of length two give strings of 66.66.66 for both atoms. The paths of length three give strings of 668.668.666.666.666 for atom 2 and 668.668.668.666.666 for atom 6. Hence we can conclude that atom 6 outranks atom 2. The tree algorithm terminates under the following various conditions:

1. No two atoms have identical trees.

2. The pairs of trees describing the environment of all pairs of atoms whose equivalence is in doubt either converge to a common atom or else involve every other atom of the molecule.

In a real situation, at this point a tree of stereochemical values has to be built if there are equivalent atoms in the molecule.

Now let us examine the behaviour of the corresponding sum algorithm. Here we will use the same initial canonical value. The second canonical values are the sums of the ligands' initial canonical values. In general the ith canonical value for an atom is the sum of the (i-1)th canonical values of its ligands. In this way we obtain a second canonical value of 12 for both atoms 2 and 6 of Figure 1.
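The successive summation just defined can be sketched in a few lines of Python. Figure 1 is not reproduced here, so the bond list below is an assumption: a connectivity chosen to agree with the path strings and canonical values quoted in the text (a six-membered carbon ring carrying an acetyl group on atom 3 and a methyl ester group on atom 5). The names `BONDS`, `ATOMIC_NUMBER` and `canonical_values` are likewise hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed connectivity (atom_a, atom_b, bond_order); inferred, not given.
BONDS: List[Tuple[int, int, int]] = [
    (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 1), (5, 6, 1), (6, 1, 1),  # ring
    (3, 7, 1), (7, 8, 2), (7, 9, 1),                     # acetyl on atom 3
    (5, 10, 1), (10, 11, 2), (10, 12, 1), (12, 13, 1),   # ester on atom 5
]
ATOMIC_NUMBER: Dict[int, int] = {n: 6 for n in range(1, 14)}
ATOMIC_NUMBER.update({8: 8, 11: 8, 12: 8})  # the three oxygen atoms


def canonical_values(atomic_number: Dict[int, int],
                     bonds: List[Tuple[int, int, int]],
                     count: int) -> Dict[int, List[int]]:
    """Return {atom: [1st, 2nd, ..., count-th canonical value]}."""
    ligands: Dict[int, List[int]] = {atom: [] for atom in atomic_number}
    for a, b, order in bonds:
        # A double bond counts the partner atom twice.
        ligands[a].extend([b] * order)
        ligands[b].extend([a] * order)
    current = dict(atomic_number)
    history = {atom: [value] for atom, value in current.items()}
    for _ in range(count - 1):
        # Each new value is the sum of the previous values of the ligands.
        current = {atom: sum(current[lig] for lig in ligands[atom])
                   for atom in ligands}
        for atom, value in current.items():
            history[atom].append(value)
    return history


values = canonical_values(ATOMIC_NUMBER, BONDS, 4)
print(values[2])  # [6, 12, 30, 76]
print(values[6])  # [6, 12, 30, 78]
```

Under this assumed connectivity the sketch reproduces the values quoted for atoms 2 and 6. It gives 106 and 120 as the fourth values of atoms 4 and 5, where the printed table lists 108 and 102; since the connectivity is only inferred, the difference is noted here rather than resolved.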
The third canonical value is 30 for both atoms. The fourth canonical value is 76 for atom 2 and 78 for atom 6. Thus it is on the third iteration of the summing process that the canonical value for atom 6 finally receives the information that atom 12 is an oxygen atom. This information is not conveyed directly but it is mixed into the sum along with the properties of other atoms.

              Canonical Value Number
Row number      1      2      3      4
     1          6     12     24     60
     2          6     12     30     76
     3          6     18     52    114
     4          6     12     36    108
     5          6     18     54    102
     6          6     12     30     78
     7          6     28     48    192
     8          8     12     56     96
     9          6      6     28     48
    10          6     30     54    212
    11          8     12     60    108
    12          8     12     38     66
    13          6      8     12     38

The sum method terminates when the number of equivalent atoms cannot be reduced between two
successive iterations (method of refs. 9, 10) or when the atom of highest connectivity has been found (Morgan's algorithm). In a real case, if there are equivalent atoms in the molecule then sums involving stereochemistry have to be computed. The advantage of the sum method is ease of computation. It is easier to compare simple numbers than long strings of numbers. It is also faster to generate the sums.

The canonicalization procedure of Professor Ugi's group is a tree algorithm, in which the initial canonical value is composed of the atomic number and the coordination number of the atom. (If one of these two numbers makes the atom unique in the molecule then only the one number is used as the initial canonical value.) Consider the molecule partly depicted below, in Figure 2.

Figure 2
By the dotted lines we will mean any string of atoms without side chains which are equivalent with respect to atoms 1 and 2. It is clear that atoms 1 and 2 are nonequivalent. But they will send out sums of various categories which are the same. The sums of the unsaturations are all zero, all the atoms concerned are acyclic, and the sums of the atomic numbers are the same, as are the sums of the numbers of attached hydrogen atoms.

Morgan's algorithm on encountering a "tie" like this would examine the ligands to decide which atom should be chosen, according to its rules, as the atom to receive the lowest number. The beginning atom is the only one which is numbered because of the precedence of its final canonical value. Other atoms are numbered according to their relation to it. The initial canonical value is the number of ligands. The summing part of the algorithm terminates when we can no longer increase the number of equivalence classes. There is no attempt to use the canonical values themselves as the basis for ordering all the atoms. In my algorithm the final canonical values are used as the criterion for determining the precedence of atoms with identical initial canonical values.
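The summing part of a Morgan-type procedure, as characterized above, might be sketched as follows. This is only an illustration under stated assumptions: the initial value is the number of ligands, summation continues while the number of distinct values increases, and the tie-breaking and subsequent numbering of the remaining atoms are omitted. The function name `morgan_summation` and the n-pentane example are hypothetical.

```python
from typing import Dict, List


def morgan_summation(neighbors: Dict[int, List[int]]) -> Dict[int, int]:
    """Repeat the ligand summation while the number of distinct values grows."""
    # Initial canonical value: the number of ligands of each atom.
    values = {atom: len(ligands) for atom, ligands in neighbors.items()}
    class_count = len(set(values.values()))
    while True:
        summed = {atom: sum(values[lig] for lig in neighbors[atom])
                  for atom in neighbors}
        summed_count = len(set(summed.values()))
        if summed_count <= class_count:
            return values  # no further increase: keep the previous values
        values, class_count = summed, summed_count


# Carbon skeleton of n-pentane, C1-C2-C3-C4-C5, numbered arbitrarily.
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
final_values = morgan_summation(pentane)
start_atom = max(final_values, key=final_values.get)
print(final_values, start_atom)
```

In this example one round of summation raises the number of classes from two to three and a second round does not, so the values of the first round are kept. The central atom has the highest value and would serve as the beginning atom, while the two pairs of equivalent atoms retain equal values.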
Other Algorithms Not Classifiable as Tree or Sum Algorithms

1. The Repeated Renumbering Algorithm
It is possible to achieve the purpose of a tree algorithm without directly comparing strings. We number the atoms according to their initial canonical values, using an arbitrary numbering for atoms with the same initial canonical value but making sure that if i
2. If the atoms are chiral the program must determine the chirality, and if they are both R or both S they are actually not equivalent (cf. reference 9). If, let us say, atoms 9 and 10 are equivalent, then which should be numbered 9 and which should be numbered 10? How, for
example, should we number the atoms of cyclohexane? The answer is that if i