J. Chem. Inf. Comput. Sci. 1987, 27, 63-67


Molecular Complexity: A Simplified Formula Adapted to Individual Atoms

JAMES B. HENDRICKSON,* PING HUANG, and A. GLENN TOCZKO
Edison Chemical Laboratories, Brandeis University, Waltham, Massachusetts 02254

Received March 21, 1985

The Bertz formula for calculating molecular complexity is a sum of bond connectivities. This is converted to a simpler form based on the number of hydrogens attached to each atom. Also, the calculation of symmetry terms is derived from simple equations applied to atoms of the same equivalence class. We outline here a simple program (CPXCAL) in FORTRAN for a microcomputer which yields the same values as Bertz's formulas. It is applied to a number of examples as well as to all four-, five-, and six-carbon skeletons to evaluate its validity. The formulas are well adapted for use in locating synthesis pathways with least molecular complexity in our synthesis program.

The concept of molecular complexity has been advanced by Bertz¹ and shown to be relevant in the design of syntheses through minimizing the sum of molecular complexities of the synthetic intermediates. His formula is derived from information theory and is based on the sum of bond connectivities on non-hydrogen atoms as well as on the variety of kinds of these atoms. The concept represents a useful test of the relative simplicity of different synthetic pathways to a target molecule and, as such, interests us for use in ranking the different syntheses produced by our computer program (SYNGEN)² for generating syntheses.

Bertz's measure of molecular complexity ($C$) is a sum of two parts (eq 1): the first and major term, $C_\eta$, measures skeletal complexity as a function of bond connectivities ($\eta$); the second term, $C_E$, is a function of the diversity of elements, or kinds of atoms, present. Each of these terms also is composed of two parts: first, an overall complexity term; and second, a symmetry term subtracted from it so as to reduce the complexity to the extent that symmetry is present. The formulas are shown as eq 1-3 (lg is used for log₂).

$$C = C_\eta + C_E \tag{1}$$

$$C_\eta = 2\eta \lg \eta - \sum_i \eta_i \lg \eta_i \tag{2}$$

$$C_E = E \lg E - \sum_j E_j \lg E_j \tag{3}$$

In the elements term (eq 3), $E$ is the total number of non-hydrogen atoms and $E_j$ is the number of type $j$. Thus, the second term represents a concept of the "symmetry" of like atoms: if all atoms are the same kind, $C_E = 0$; if there are atoms of many kinds, the first term for $C_E$ will be much larger than the second and the overall complexity will incorporate this measure of atomic diversity.

The central feature of the complexity calculation, in eq 2, is the measure $\eta$, which represents the sum of all bond connectivities, i.e., the number of pairs of bonds connected to each other. In chemical terms, it is the number of ways a linear three-atom (e.g., propane) skeleton³ can be extracted from the whole molecular skeleton. In the second, or symmetry, term $\eta_i$ represents the number of symmetrically identical bond pairs of type $i$.

It is both visually simpler and more adaptable to our system of molecular description to compute the values of $\eta$ from the characteristics of individual atoms instead of connected bond pairs. Any pair of connected bonds intersects at a particular atom. In Figure 1 the bond connectivities ($\eta$) around both saturated and unsaturated atoms are illustrated; the only bonds counted are those to non-hydrogen atoms. For the saturated atoms, it will be observed that $\eta$ at that atom is simply a function of the number of hydrogens ($h$) on that atom,⁴ as in eq 4.

$$\eta = \tfrac{1}{2}(4 - h)(3 - h) \tag{4}$$
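As a quick check of eq 4, the possible hydrogen counts on a saturated atom can simply be substituted. This is only an illustrative tabulation of the formula, not an additional result:

```latex
\begin{aligned}
h = 0 &: \quad \eta = \tfrac{1}{2}(4)(3) = 6 \\
h = 1 &: \quad \eta = \tfrac{1}{2}(3)(2) = 3 \\
h = 2 &: \quad \eta = \tfrac{1}{2}(2)(1) = 1 \\
h = 3 &: \quad \eta = \tfrac{1}{2}(1)(0) = 0
\end{aligned}
```

An atom with four attached non-hydrogen atoms thus carries six bond pairs, while a terminal atom carries none.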

For unsaturated atoms the formula is the same, i.e., the sum of $\eta$ so calculated at each atom, except that the π-bond is counted twice and so must be separately discounted. Hence, the formula for $\eta$ for a whole molecule is given in eq 5, in which $D$ is the number of double bonds, $T$ is the number of triple bonds, and $i$ refers to the $i$th atom.

$$\eta = \tfrac{1}{2}\sum_i (4 - h_i)(3 - h_i) - D - 3T \tag{5}$$

This equation represents a much simpler way of calculating $\eta$ for a molecule by just counting the hydrogens (and unshared electron pairs⁴) on each atom and the number of double and triple bonds in the molecule. These values for $\eta$ are easily summed by hand from eq 5 as illustrated in the three examples of eq 6.

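$$\eta = 19 - 2 = 17 \qquad \eta = 32 - 1 - 3(1) = 28 \qquad \eta = 43 - 2 - 3(2) = 35 \tag{6}$$

(Eq 6 shows three structural examples; the drawings are omitted and only the sums are given.)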
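The hand summation of eq 5 can be sketched in a few lines of Python. This is a minimal illustration and not the CPXCAL program; the function name `bond_connectivity` and its parameters are hypothetical. The first input reproduces the hydrogen counts of one of the Figure 3 examples, and the dihydroanthracene atom counts (8 CH, 4 C, 2 CH₂) are assumed from its formula.

```python
def bond_connectivity(hydrogens, double_bonds=0, triple_bonds=0):
    """Return eta for a molecule by eq 5.

    hydrogens: one h value per non-hydrogen atom (attached hydrogens plus
    unshared electron pairs, as in note 4).
    """
    # (4 - h)(3 - h) is a product of consecutive integers, so it is always
    # even and the halving below is exact.
    pair_sum = sum((4 - h) * (3 - h) for h in hydrogens) // 2
    # Each double bond is counted at both of its atoms (subtract 1);
    # each triple bond over-counts three bond pairs (subtract 3).
    return pair_sum - double_bonds - 3 * triple_bonds


# A Figure 3 example: four atoms with h = 0, two with h = 1,
# four with h = 2, and four double bonds.
example = [0] * 4 + [1] * 2 + [2] * 4
print(bond_connectivity(example, double_bonds=4))  # 30

# Dihydroanthracene (eq 13): 8 CH, 4 C, 2 CH2, six double bonds.
dihydroanthracene = [1] * 8 + [0] * 4 + [2] * 2
print(bond_connectivity(dihydroanthracene, double_bonds=6))  # 44
```

Only per-atom hydrogen counts and the bond-order totals are needed, which is the point of eq 5: no enumeration of bond pairs is required.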

In a similar fashion, the second, or symmetry, term of eq 2 may be calculated from the symmetrically equivalent atoms, i.e., those atoms which are automorphic or of the same equivalence class.⁵ In the third example of eq 6, there is a symmetry element (dotted line) through the molecule, creating five pairs of equivalent atoms with $\eta$ of 3, 3, 6, 6, and 3 each, respectively. We can generate the second term of eq 2 in a general way by collecting (as in Figure 1) all possible symmetry types and deriving the appropriate symmetry equation for each type, so that symmetrical atoms can then be recognized by type and their symmetry terms summed. Thus, any bond-pair connectivity centers on an atom A, which in turn is connected to atoms B, C, D, etc. If there are $N$ such symmetrically equivalent atoms A, the attached atoms must also be the same (B, C, D, ...) on each other equivalent atom A. The possible combinations of such equivalent atom sets are shown in Figure 2, and the symmetry term ($S_k$) is shown for each, a function of the number ($N$) of such equivalent sets in the molecule.

0095-2338/87/1627-0063$01.50/0 © 1987 American Chemical Society

Figure 1. Bond connectivities at saturated and unsaturated atoms. (Structural drawings omitted.)

Table I. Numerical Values of N lg N (lg = log₂)

     N   N lg N      N   N lg N      N   N lg N
     1     0        18    75.1      35   179.5
     2     2        19    80.7      36   186.1
     3     4.8      20    86.4      37   192.7
     4     8        21    92.2      38   199.4
     5    11.6      22    98.1      39   206.1
     6    15.5      23   104.0      40   212.9
     7    19.7      24   110.0      41   219.7
     8    24        25   116.1      42   226.5
     9    28.5      26   122.2      43   233.3
    10    33.2      27   128.4      44   240.2
    11    38.1      28   134.6      45   247.1
    12    43.0      29   140.9      46   254.1
    13    48.1      30   147.2      47   261.1
    14    53.3      31   153.6      48   268.1
    15    58.6      32   160        49   275.1
    16    64        33   166.5      50   282.2
    17    69.5      34   173.0

These symmetry terms are derived from the duplication of bond connectivities in the atom assemblies shown, understood as themselves duplicated $N$ times in the molecule. These symmetry terms ($S_k$) are then added for all such equivalent sets and subtracted, as the second term in eq 2, from the overall complexity (first term in eq 2); in other words, eq 2 becomes eq 7.

$$C_\eta = 2\eta \lg \eta - \sum_i (S_k)_i \tag{7}$$

The value of the symmetry type, $k$, for a given set of $N$ equivalent atoms A can then also be derived from a simple formula (eq 8) based on the characteristics of atom A.

$$k = (3 - h)(2 - h) + R \tag{8}$$

Such an equation is needed for the computer to identify the symmetry terms. Here the term $R$ designates the number of repetitions of bonds to identical attached atoms on atom A, either σ- or π-bonds. Thus, $R = 0$ if all bonds from A are to different atoms, $R = 1$ if two bonds go to the same or to two identical attached atoms (B=ACD or AB₂CD), $R = 2$ if there are two repetitions (AB₂C₂ or B=AC₂), etc. Values of $R$ are shown for each symmetry type in Figure 2. The formula (eq 8) is simply an empirical derivation designed to yield an increasing series for increasing symmetry. It does not serve for atoms A with only one attached other atom, for which there is no symmetry term: $k = 0$ and $S_0 = 0$. For atoms A with

Table II. Numerical Values of Symmetry Terms ($S_k$) (lg = log₂)

    S_k    formula                 N = 1     2      3      4      5      6      7      8
    S1     N lg N                    0       2     4.8     8     11.6   15.5   19.7    24
    S2     3N lg N                   0       6    14.4    24     34.8   46.5   59.1    72
    S3     2N lg 2N + N lg N         2      10    20.3    32     44.8   58.5   73      88
    S5     3N lg 3N                  4.8    15.5  28.5    43     58.6   75.1   92.2   110
    S6     6N lg N                   0      12    28.8    48     69.6   93    118.2   144
    S7     4N lg 2N + 2N lg N        4      20    40.6    64     89.6  117    146     176
    S8     4N lg 4N + 2N lg N        8      28    52.6    80    109.6  141    174     208
    S9     6N lg 3N                  9.6    31    57      86    117.2  150.2  184.4   220
    S10    6N lg 6N                 15.5    43    75.1   110    147.2  186.1  226.5   268.1
    S10A   4N lg 4N + 2N lg 2N      10      32    58.5    88    119.6  153    187.9   224
    S10B   6N lg 2N                  6      24    46.5    72     99.6  129    159.9   192

Figure 2. Symmetry classes at central atom A. (Molecule contains N such symmetry-equivalent units; lg = log₂. Structural drawings omitted.)

only two attached atoms and no π-bonds (ABC or AB₂; alternatively, $h = 2$), we designate $k = 1$ as shown for Figure 2 (the repetition in AB₂ is ignored). For convenience in simple hand calculations, values of $N \lg N$ are listed in Table I and values for the symmetry terms, $S_k$, are listed in Table II.

Although the symmetry values are described for $N$ equivalent atoms A, in many cases there are equivalent bond connectivities even for $N = 1$. This can be seen, for example, in $S_3$ (for which there are two equivalent B-A-C connectivities: a symmetry term of $2 \lg 2$), which is satisfied by the formula $S_3 = 2N \lg 2N + N \lg N$ when $N = 1$. In fact, only $S_1$, $S_2$, and $S_6$ give $S_k = 0$ for $N = 1$. Since even single atoms ($N = 1$) can thus exhibit duplicated bond connectivities for the symmetry term, the computer simply calculates the appropriate $S_k$ for every equivalence class of atom in the molecule, i.e., for all atoms. Those with no symmetry ($S_1$, $S_2$, and $S_6$ with $N = 1$) then contribute nothing to the subtracted symmetry term of eq 7.

Although they are very rare in common molecules, there are a few bond-symmetry situations that further expand on these terms when symmetrical bridges link the four attached atoms on tetravalent atoms A. The program must then go further to identify ring symmetry in these cases. Briefly, symmetrical bridges between B and C in $S_8$ (9) become $S_7$, although B-B and/or C-C bridges alone remain $S_8$. In the case of $S_{10}$, the formula is correct for no bridges or for six equivalent bridges connecting atoms B all ways (10). Lesser symmetry characterizes other bridging. Of the six possible B-B links in $S_{10}$, if two are of one kind and four another, the formula ($S_{10A}$) is less (11), and if three kinds of pairs exist, the formula is $S_{10B}$ (12). Allenes of the same type (B=A=B) also require term $S_{10A}$. In virtually all real cases, these do not appear, but in cases identified as $S_{10}$, the computer must examine the equivalency of rings about atom A as well.

(The structural drawings for 9-12 are omitted; the symmetry terms attached to them are as follows.)

$$S_7 = 4N \lg 2N + 2N \lg N \tag{9}$$

$$S_{10} = 6N \lg 6N \tag{10}$$

$$S_{10A} = 4N \lg 4N + 2N \lg 2N \tag{11}$$

$$S_{10B} = 6N \lg 2N \tag{12}$$

In the process of converting from a basis of bond connectivities to one of atom attributes, we find that π-bonds are counted twice, once for each connected atom. This is easily corrected with the subtractive terms $D$ and $T$ in eq 5 for $\eta$ itself, but the same duplication must be corrected in the symmetry terms which involve multiple bonds. These terms for double bonds are all shown in Figure 2; from all these symmetry terms may be separated an $N \lg N$ term that contains the duplication.
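The symmetry terms of Table II can be regenerated directly from their formulas, which is a convenient check on hand lookups. The sketch below is illustrative only: the names `term`, `SYMMETRY_TERMS`, and `symmetry_term` are hypothetical, and the label S5 for the $3N \lg 3N$ term is an assumption inferred from eq 8, since the index is illegible in the printed table. Printed table entries are rounded, so small differences in the last digit can occur.

```python
from math import log2


def term(a, b, n):
    """Return a*N * lg(b*N), the building block of every S_k formula."""
    return a * n * log2(b * n)


# Formulas as listed in Table II, keyed by symmetry type.
# "S5" is an assumed label for 3N lg 3N (index unreadable in the source).
SYMMETRY_TERMS = {
    "S1": lambda n: term(1, 1, n),
    "S2": lambda n: term(3, 1, n),
    "S3": lambda n: term(2, 2, n) + term(1, 1, n),
    "S5": lambda n: term(3, 3, n),
    "S6": lambda n: term(6, 1, n),
    "S7": lambda n: term(4, 2, n) + term(2, 1, n),
    "S8": lambda n: term(4, 4, n) + term(2, 1, n),
    "S9": lambda n: term(6, 3, n),
    "S10": lambda n: term(6, 6, n),
    "S10A": lambda n: term(4, 4, n) + term(2, 2, n),
    "S10B": lambda n: term(6, 2, n),
}


def symmetry_term(kind, n):
    """Symmetry term S_k for n equivalent atoms of the given type."""
    return SYMMETRY_TERMS[kind](n)


for kind in ("S1", "S3", "S7", "S10A"):
    print(kind, [round(symmetry_term(kind, n), 1) for n in (1, 2, 3)])
```

For $N = 1$ the terms S1, S2, and S6 evaluate to zero, in agreement with the statement in the text that only these three types carry no symmetry contribution for a single atom.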

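Figure 3. Examples of molecular complexity calculation. (π: π-duplication terms. Structural drawings omitted; the calculations shown with each structure are listed below.)

Example with $\eta = 30$:
$\eta = \tfrac{1}{2}[4(4-0)(3-0) + 2(4-1)(3-1) + 4(4-2)(3-2)] - 4 = 30$
$S_7 = 4 \cdot 2 \lg(2 \cdot 2) + 2 \cdot 2 \lg 2 = 20$ (two classes)
$S_3 = 2 \cdot 2 \lg(2 \cdot 2) + 2 \lg 2 = 10$
$S_1 = 2 \lg 2 = 2$
π: $\tfrac{1}{2} \cdot 2 \lg 2 + \tfrac{1}{2} \cdot 2 \lg 2 = 2$
$C = C_\eta = 2 \cdot 30 \lg 30 - (20 + 20 + 10 + 2 - 2) = 294.4 - 50 = 244.4$

Example with $\eta = 24$:
$S_2 = 3 \cdot 3 \lg 3 = 14.3$
$S_3 = 2 \cdot 3 \lg(2 \cdot 3) + 3 \lg 3 = 20.3$
$S_1 = 3 \lg 3 = 4.8$ (two classes)
$C = C_\eta = 2 \cdot 24 \lg 24 - (14.3 + 20.3 + 4.8 + 4.8) = 220.0 - 44.2 = 175.8$

Example with $\eta = 29$:
$\eta = \tfrac{1}{2}[4(4-0)(3-0) + 2(4-1)(3-1) + 2(4-2)(3-2)] - 3 = 29$
$S_9 = 6 \cdot 2 \lg(3 \cdot 2) = 31.0$
$S_7 = 4 \cdot 2 \lg(2 \cdot 2) + 2 \cdot 2 \lg 2 = 20$
$S_3 = 2 \cdot 2 \lg(2 \cdot 2) + 2 \lg 2 = 10$
π: $2 \lg 2 - \tfrac{2}{2} \lg \tfrac{2}{2} = 2$
$C_\eta = 2 \cdot 29 \lg 29 - (31 + 20 + 10 - 2)$
$C_E = 14 \lg 14 - (12 \lg 12 + 2 \lg 2)$
$C = 281.8 - 59 + 53.3 - (43 + 2) = 231.1$

Example with $\eta = 27$:
$\eta = \tfrac{1}{2}[2(4-0)(3-0) + 3(4-1)(3-1) + 6(4-2)(3-2)] = 27$
$S_9 = 6 \cdot 2 \lg(3 \cdot 2) = 31.0$
$S_3 = 2 \cdot 3 \lg(2 \cdot 3) + 3 \lg 3 = 20.3$
$S_1 = 6 \lg 6 = 15.5$
$C_\eta = 2 \cdot 27 \lg 27 - (31.0 + 20.3 + 15.5)$
$C_E = 14 \lg 14 - (11 \lg 11 + 3 \lg 3)$
$C = 256.8 - 66.8 + 53.3 - (38.0 + 4.8) = 200.5$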
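To show how eq 1, 3, and 7 combine, the Figure 3 example with $\eta = 27$ can be recomputed as follows. This is a hedged sketch rather than the authors' FORTRAN code: `x_lg_x` and `total_complexity` are hypothetical names, and the symmetry terms are passed in already evaluated from the Table II formulas.

```python
from math import log2


def x_lg_x(x):
    """Return x * lg(x), with the usual convention that 0 lg 0 = 0."""
    return x * log2(x) if x > 0 else 0.0


def total_complexity(eta, symmetry_terms, element_counts):
    """Total complexity C = C_eta + C_E (eq 1)."""
    # Skeletal term, eq 7: overall term minus the summed symmetry terms.
    c_eta = 2 * x_lg_x(eta) - sum(symmetry_terms)
    # Elements term, eq 3: E lg E minus the sum over each kind of atom.
    e_total = sum(element_counts)
    c_e = x_lg_x(e_total) - sum(x_lg_x(e) for e in element_counts)
    return c_eta + c_e


# Symmetry terms for the example with eta = 27 (14 atoms, 11 of one
# element and 3 of another).
symmetry = [
    6 * 2 * log2(3 * 2),                  # S9 with N = 2
    2 * 3 * log2(2 * 3) + 3 * log2(3),    # S3 with N = 3
    6 * log2(6),                          # S1 with N = 6
]
c = total_complexity(27, symmetry, [11, 3])
print(round(c, 1))  # 200.5
```

Keeping the symmetry terms as an explicit list mirrors the hand calculation, where each equivalence class contributes one $S_k$ value taken from Table II.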

Hence, for cases of double bonds A=B, subtraction of $\tfrac{1}{2}N \lg N$ for each atom, A and B, serves to remove the duplication. This is only so if both atoms have $S_k \neq 0$. Thus in cases of >A=BH₂, since the B atom contributes no symmetry term ($S_k = 0$), no subtraction of $\tfrac{1}{2}N \lg N$ is necessary. For triple bonds −A≡B−, the "triplication" term subtracted is $\tfrac{3}{2}N \lg 3N$ (unless again one atom is ≡BH). In cases of A=A, double bonds with both atoms equivalent, the subtracted term will be ($N \lg N - \tfrac{N}{2} \lg \tfrac{N}{2}$), and for triple bonds, A≡A, it is ($3N \lg 3N - \tfrac{3}{2}N \lg \tfrac{3}{2}N$). For unsymmetrical allenes, C=A=B, one subtracts two $\tfrac{1}{2}N \lg N$ terms for A and one each for B and C, but in symmetrical ones, B=A=B, one subtracts for atom A half the second term in $S_{10A}$ (eq 11), i.e., $-\tfrac{1}{2}(2N \lg 2N)$.

Aromatic molecules are, however, much more simply handled by atom attributes rather than bond connectivities since the program treats each π-bonded atom the same way without distinguishing its partner. In this way, all Kekulé forms give the same result. This may be illustrated with the two forms of dihydroanthracene (13), which would give different symmetry terms on strict bond connectivity treatment but which give the same result in the present atom treatment since it registers the presence of a π-bond on each atom without noting the atom to which it is bonded.

$$\eta = 50 - 6 = 44 \tag{13}$$

(Eq 13 shows the two Kekulé forms of dihydroanthracene; the drawings are omitted.)

The calculation of dihydroanthracene is delineated in (14), which shows four equivalence classes of atoms (A-D), of which all but B bear double bonds. The four $S_k$ equations and their double-bond corrections for eq 7 are all shown, making it clear that the particular Kekulé forms are not perceived. Other

$$\begin{aligned}
\text{A:}\quad S_7 &= 4 \cdot 4 \lg(2 \cdot 4) + 2 \cdot 4 \lg 4 - (4 \lg 4 - 2 \lg 2) \\
\text{B:}\quad S_1 &= 2 \lg 2 - 0 \\
\text{C:}\quad S_3 &= 2 \cdot 4 \lg(2 \cdot 4) + 4 \lg 4 - 2 \lg 4 \\
\text{D:}\quad S_3 &= 2 \cdot 4 \lg(2 \cdot 4) + 4 \lg 4 - 2 \lg 4 \\
C_\eta &= 2 \cdot 44 \lg 44 - [32 \lg 8 + 8 \lg 4 + 4 \lg 2] = 480 - 116 = 364
\end{aligned} \tag{14}$$

examples are treated in Figure 3; the symmetry terms for each atom (or equivalence class of $N$ atoms) are calculated from the $S_k$ equations in Figure 2.

There is one other subtle (and very rare) instance in which the treatment by atoms creates an error. These are cases in which all atoms in a set are equivalent but their connecting bonds are not. This is exemplified by prismane. Here, a treatment by atoms shows all atoms to be identical, but the bond connectivity approach would recognize the distinction between the six 3-ring bonds and the three 4-ring bonds. The maximization program⁵ that we use to define equivalence classes of atoms sees all six atoms as equivalent, but if one is fixed, the others are no longer equivalent. This secondary equivalence class now allows the differences to be observed in the $S_k$ terms for symmetry and thus affords the same overall value of complexity as that derived by the Bertz method.

The concept of molecular complexity has heretofore been an intangible one, seldom addressed in any quantitative terms by chemists. A molecule can be treated as a graph expressing atom connectivity, but even in the quantitative treatments of graph theory, there has not been a satisfactory index of complexity. Bertz makes a strong case¹ for the validity of his choice of mathematical expression in that (a) it is parallel to the similar analysis of information theory, (b) it affords different values for different molecules, and (c) it correlates well with various kinds of "simplicity" as expressed by overall yields in synthesis.¹ᵈ

We have written a simple FORTRAN program, named CPXCAL, to calculate the complexity of any molecule using eq 1, 3, 5, and 7 as well as the numerical values from Tables I and II. This allows a facile examination of many molecules and so a broad, systematic examination of complexities in a variety of molecular families, or indeed, of graphs generally. Thus, the family of all possible saturated molecules of 4-6 carbons consists of 78 graphs on 6 points (atoms) with maximum degree (valence) of four, 21 graphs on 5 points, and 6 graphs on 4 points. Examination of this family of 105 molecules, or graphs, shows a rather small incidence of duplication. There are only four pairs of exact duplicates in the 105 cases, shown in Figure 4. There are several other pairings with fortuitously identical values of $C$ which, however, differ in number of atoms or rings or in the value of $\eta$ or $C_\eta$. These are also appended in Figure 4. A summary of the ranges of complexities for the 105 graphs is appended in Table III.

It can also be seen that the use of $\eta$ alone is a relatively indiscriminate measure of molecular complexity. In each of

Figure 4. Duplicate complexities. (Structural drawings omitted.)

Table III. Summary of Molecular Complexities for n = 4-6 Atoms (b = number of bonds; r = number of rings; b = n + r - 1)

     b     r    no. of cases    $\eta$     $\sum S_k$    $C_\eta$

    n = 4 atoms
     6     3         1          12         43         43
     5     2         1           8         12         36
     4     1         2          4-5        4-8        8-19
     3     0         2          2-3        2-5        2-5
                  tot = 6

    n = 5 atoms
    10     6         1          30        147        147
     9     5         1          24         68        152
     8     4         2         18-19      30-42     108-131
     7     3         4         13-15      12-36      76-95
     6     2         5          9-11       8-20      37-68
     5     1         5          5-8        4-12      12-38
     4     0         3          3-6        2-16       8-16
                  tot = 21

    n = 6 atoms
    12     7         1          36        186        186
    11     6         3         30-31      50-84     220-244
    10     5         8         24-27      20-66     156-212
     9     4        14         18-22       0-75      75-178
     8     3        18         14-18       0-51      73-139
     7     2        17         10-14       0-22      48-96
     6     1        12          6-10       0-20      16-62
     5     0         5          4-7        2-10      12-30
                  tot = 78

the three groups the maximal, fully connected case has the full symmetry term equal to $\eta \lg \eta$, as do the simple symmetrical cycloalkanes in each group, which accordingly also have the least $C_\eta$ for any of their isomeric monocycles. As Bertz points out, this fact leads to the factor of 2 used in the total complexity term, $2\eta \lg \eta$, in order that $C_\eta$ shall not be zero. As a result, the total complexity term in general considerably outweighs the subtracted symmetry terms ($\sum S_k$). If it is deemed more appropriate to allow the symmetry terms more relative weight, the factor in the total complexity could be lessened, i.e., $Q\eta \lg \eta$, with $2 > Q > 1$.

The application of complexity calculations to assess synthesis efficiency has been illustrated by Bertz, who has used the sum of complexities of intermediates as a comparative measure for the efficiency of synthesis.¹ In comparisons of syntheses for the same target, however, this will generally simply prefer the synthesis with the fewest steps, hence fewest intermediates to sum. However, in a comparison of two syntheses with the same starting fragments and the same number of steps, this measure should show a preference for one. This amounts to two different orders of assembly of the same units and is characteristic of the difference between a convergent and a linear synthesis.⁶ If we apply this method to the assembly of a linear target skeleton of 16 atoms from eight 2-carbon starting units, the results for the two orders are shown in Figure 5 (functional groups and the minor symmetry terms are not included). The results show a clear preference for the convergent order, as expected from other considerations.⁶

This procedure points up another interesting conclusion about synthesis. In a real convergent synthesis with added refunctionalization reactions, the total complexity of intermediates will be much less if these added reactions precede the final coupling of the two skeletal halves. The conclusion for synthesis planning is clear: the final coupling of intermediates should come near the end rather than the beginning of a convergent synthetic route. In more general terms, any step that exhibits a large increase in complexity is more efficiently positioned near the end rather than the beginning of a synthetic sequence.

The calculation of molecular complexity is rendered much easier to carry out by hand when the variations in eq 1, 3, 5, and 7 are used and the values from Tables I and II are applied. The method presented here, used by hand or computer, yields the same complexity values as the Bertz method. Also, it is less liable to error than identifying and counting bond connectivities, when used by hand, and much more amenable to computerization in this way as well. The program CPXCAL⁷ is available to anyone with an interest in comparing molecular complexities in any molecular families of interest.

ACKNOWLEDGMENT

We are grateful to the National Science Foundation for a grant (CHE-9 102972) that has generously supported this work.

Figure 5. Complexity comparison of linear and convergent syntheses. Intermediates are linear skeletons of n atoms. (Drawings omitted.) Linear order: n = 4, 6, 8, 10, 12, 14 with $\eta$ = 2, 4, 6, 8, 10, 12 and $C$ = 4, 16, 31, 48, 66, 86; $\sum C$ = 251. Convergent order: $\sum C$ = 78.
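The comparison of linear and convergent assembly of a 16-atom chain from eight 2-carbon units can be sketched numerically. The following is a minimal illustration under stated assumptions: symmetry and functional-group terms are ignored (as in Figure 5), each intermediate is an unbranched chain with $\eta = n - 2$, each value is rounded to a whole number before summing, and the convergent route is assumed to pass through four 4-atom and two 8-atom intermediates. The names `chain_complexity` and `route_total` are hypothetical.

```python
from math import log2


def chain_complexity(n_atoms):
    """C = 2*eta*lg(eta) for an unbranched chain, ignoring symmetry terms."""
    eta = n_atoms - 2  # one bond pair at each interior atom of the chain
    return 2 * eta * log2(eta) if eta > 0 else 0.0


def route_total(intermediate_sizes):
    """Sum the complexities of all intermediates, rounded individually."""
    return sum(round(chain_complexity(n)) for n in intermediate_sizes)


linear = [4, 6, 8, 10, 12, 14]     # one 2-carbon unit added per step
convergent = [4, 4, 4, 4, 8, 8]    # units paired, then the halves paired
print(route_total(linear), route_total(convergent))  # 251 78
```

Both routes have six intermediates before the final coupling to the 16-atom target, so the difference in the sums reflects only the order of assembly.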

REFERENCES AND NOTES

(1) (a) Bertz, S. H. Chem. Commun. 1981, 818. (b) Bertz, S. H. J. Am. Chem. Soc. 1981, 103, 3599; 1982, 104, 5801. (c) Bertz, S. H. Bull. Math. Biol. 1983, 45, 849. (d) Bertz, S. H. In Chemical Applications of Topology and Graph Theory; King, R. B., Ed.; Elsevier: New York, 1983; p 206.
(2) Hendrickson, J. B.; Grier, D. L.; Toczko, A. G. J. Am. Chem. Soc. 1985, 107, 5228.
(3) The three-atom skeleton extracted may be of any three linked, non-hydrogen atoms, i.e., C-C-C, C-O-C, C-N-S, etc.
(4) In eq 4, h is the number of attached hydrogens plus the number of unshared electron pairs, regarded as the conjugate bases of potentially attached hydrogens.
(5) Hendrickson, J. B.; Toczko, A. G. J. Chem. Inf. Comput. Sci. 1983, 23, 171.
(6) Hendrickson, J. B. J. Am. Chem. Soc. 1977, 99, 5439.
(7) CPXCAL is written for a DEC VAX computer with an input module for direct graphic input from a Tektronix graphics terminal.
