The Atom Connectivity Matrix (ACM) and Its Characteristic Polynomial


The Atom Connectivity Matrix (ACM) and Its Characteristic Polynomial (ACMCP)*

By LEONARD SPIALTER
Chemistry Research Laboratory, ARL, Wright-Patterson Air Force Base, Ohio
Received February 24, 1964

INTRODUCTION

Chemical pictographs, or molecular diagrams, are the natural language in which chemists describe molecules and their reactions. Unfortunately, such pictographs do not lend themselves to direct filing and comparison, primarily because dimensions, orientation, and angles are not important fundamentals in the usual case. The essential components are the constituent atoms and the bonds between them (in the current view of molecular structure). The need to develop a reference capability for pictographs has led to many schemes, and manipulations thereof, based on some sort of accounting or atom-bonding balance sheet.

One family of such schemes includes those which consider molecules as being connected collections of "readily recognizable" local assemblies of atoms such as chains, n-membered rings, functional groups (carbonyl, carboxylic acid, etc.), and the like. To such local assemblies are assigned characteristic symbols (digits, letters, or marks) so that the complete molecule is then simply represented by a collection of such symbols with suitable additional symbolism to describe their interconnections. To this particular family belong all the linear cipher notations such as those of Wiswesser,¹ Dyson,² IUPAC,³ Bouman,⁴ Gruber,⁵ Gordon, et al.,⁶ Silk,⁷ and the like.

The advantages of such systems include their compactness, ready correlation with molecular structure and current naming systems, and apparent ease of fragment searching or handling of partial indeterminates. On the debit side are the requirements for services of a trained intermediary for "molecular dissection," multiplicity of rules as to seniority or hierarchy of presentation and local conventions, necessity for a glossary or dictionary to define symbols, use of arbitrary nonchemically significant symbols and marks which are not meaningful or recognizable in all languages and which may conflict with standard chemical symbols, non-universality in that

* Portions of this paper were presented before the Division of Chemical Literature, 145th National Meeting of the American Chemical Society, New York, N. Y., Sept. 11, 1963, and before the Chemical Information and Data Workshop, U. S. Army Missile Command, Huntsville, Ala.
(1) W. J. Wiswesser, "A Line-Formula Chemical Notation," T. Y. Crowell Co., New York, N. Y., 1954.
(2) G. M. Dyson, "A New Notation and Enumeration System for Organic Compounds," 2nd Ed., Longmans, Green and Co., New York, N. Y., 1958.
(3) International Union of Pure and Applied Chemistry, "Rules for I.U.P.A.C. Notation for Organic Compounds," John Wiley and Sons, Inc., New York, N. Y., 1961.
(4) H. Bouman, J. Chem. Doc., 3, 92 (1963).
(5) M. Gruber, Angew. Chem., 61, 429 (1950); "Die Genfer Nomenklatur in Chiffren," Beihefte Angew. Chem. Chem.-Ing.-Tech., No. 23, 1950.
(6) M. Gordon, C. E. Kendall, and W. H. T. Davidson, "Chemical Ciphering," Royal Institute of Chemistry Monograph, 1948.
(7) J. A. Silk, J. Chem. Doc., 3, 189 (1963).

only currently known structures have defined rules and new types of structures require supplementation of the rules, searching difficulties (ameliorated only partially by the permuted line entries concept⁸), and lack of good ordering conventions for a list of notations for different molecules. Another drawback is the lack of compatibility with the most widely used, chemically significant, and fundamental indexing parameter even for compounds of unknown structure, the molecular formula (although a suggested means of deriving this from the subject notations is implicit in a recent thesis and publication by Garfield⁹).

It must be recognized that all of the above schemes are simply extrapolations of the usual verbalizable chemical nomenclature, with essentially all of the advantages and limitations thereof. Such nomenclature is good for relatively rapid communication between chemists knowledgeable in the rules, but fails when language barriers must be crossed or when one universally clear, unique, and acceptable name is to be placed in an ordered file.

Alternate and, possibly, increasingly popular methods of describing molecular pictographs are those based on some sort of topological mapping of the individual atoms and the bonds between them. Into this category fall the "accounting" techniques of Ballard and Neeland¹⁰ and of Gluck and Rasmussen,¹¹ the Polish tree-based method of Hiz,¹² and the matrix-based concepts of Ray and Kirsch¹³ and of Sussenguth.¹⁴ It appears likely that such a representation based only on atoms and interconnecting bonds is the simplest mathematical transformation of a chemical pictograph. By "simplest" is meant that a minimal number of rules is required, essentially no more additional symbols than the usual chemical ones are utilized, and a clear-cut one-to-one correspondence with the chemical pictograph is obtained.

The present paper is devoted to a concept which belongs to the latter class of representations. It is initially based on a matrix construct which is then converted by standard algebraic techniques to a derivative form which has

(8) P. F. Sorter, C. E. Granito, J. C. Gilmer, A. Gelberg, and E. A. Metcalf, ibid., 4, 56 (1964); 145th National Meeting of the American Chemical Society, New York, N. Y., Sept. 8-13, 1963, Abstracts 12G.
(9) E. Garfield, "An Algorithm for Translating Chemical Names to Molecular Formulas," Institute for Scientific Information, Philadelphia, Pa., 1961.
(10) D. L. Ballard and F. Neeland, J. Chem. Doc., 3, 196 (1963); Chem. Eng. News, 41, 129 (April 15, 1963); 144th National Meeting of the American Chemical Society, Los Angeles, Calif., March 31-April 5, 1963, Abstracts 9F.
(11) Chem. Eng. News, 41, 35 (Dec. 9, 1963).
Dr. Fred A. Tate, Chemical Abstracts Service, has estimated that 60% of the chemical literature can be indexed under names of chemical compounds.


diagonal elements,²⁴ may possess for a given exponential degree a variety of terms. Note, for example, the ACMCP in Table I for ketene, wherein the third-degree terms are $-4H^2O - 4CH^2 - 2CHO$. It is characteristic of the ACMCP, as will be shown later,²⁵ that the nonnumerical (i.e., lettered) factors of terms beyond the first are derived chiefly from combinations of atoms which have few connectivities. Hence, monovalent or singly bonded atoms such as hydrogen and the halogens show up in profusion among these terms. Since hydrogen is not a molecular skeletal binding component, except in cases of hydrogen bonding, it can, as was mentioned earlier, be omitted from the ACM. This elimination diminishes the total number of terms and simplifies their nature in the ACMCP. However, as a result of dropping or "suppressing" the hydrogen atoms, some information is lost and must be compensated for in other ways. The informational loss consists of the following:

(a) The hydrogen count, which entered into the first term, among others, and made this term identical with the molecular formula, is no longer carried implicitly through the computation from the ACM. To maintain this useful attribute, it is necessary and expeditious to introduce externally into the first term the requisite number of hydrogens (as a factor in the symbol H, for hydrogen, raised to the appropriate power). This added factor, in powers of H, appears only in the first term in this modification to the "hydrogen-suppressed" ACMCP.

(b) A more fundamental and serious loss is that of information concerning the total valency of each atom in a given molecule. For example, with a few exceptions such as isonitriles, carbon monoxide, metal carbonyls, and radicals or molecular fragments, carbon exhibits a valence of four in organic compounds. When the hydrogen is suppressed, however, a pictograph may show carbon symbols, C, with apparent valences varying from one up to four. This may be illustrated by the structure (HS, or hydrogen-suppressed) for 4-methyl-2-pentene (I)

                   C6
                   |
    C1 - C2 = C3 - C4 - C5        (I)

wherein carbons numbered 1, 5, and 6 appear to be univalent, and carbons numbered 2, 3, and 4 are tervalent. The consequence of losing this valency data by treating all C symbols as equivalent in valency, despite the acknowledged falsity of this assumption, and multiplying them together to give terms such as $C^6$, is the development of serious ambiguities and degeneracies in the ACMCP. For example, HS 2-methyl-1-pentene (II)

        C
        |
    C = C - C - C - C             (II)

yields the same HS-ACMCP as does its isomer (I): $C^6 - 8C^4 + 11C^2$. Other similar examples can be found, as well as some artificial mathematical constructs which are subject to accidental peculiarities in number theory, such as the three-atom chains X-X-X with connectivities (1, 7) and (5, 5), both of whose ACMCP's equal²⁶ $X^3 - 50X$, partly because $1^2 + 7^2 = 5^2 + 5^2 = 50$. Again, in this latter case, there is also the basic defect of allowing a given symbol, X, to vary its valency over values of 1, 5, 7, 8, and 10 in the two pictographs, yet considering the symbols to be algebraically, or computationally, equivalent.

Several ways to resolve these hydrogen-suppressed ACMCP degeneracies present themselves:

(a) One could ignore the ambiguities, since empirical checks suggest that in physically real molecules, where atoms possess only small apparent valency magnitudes (almost always less than 5) and these in only limited number, such cases occur only a small fraction of the time (possibly less than 0.1%). The duplications could then be sorted out at information retrieval time by manual examination. Such a crude solution is subject to error and is not philosophically or conceptually satisfying. It must be rejected.

(b) Certain of the degeneracies, but not all, can be broken by inserting the number of hydrogens suppressed into the first term. However, isomeric cases such as those typified by compounds I and II are not resolvable by this means.

(c) One could retain all valency information by not adopting the hydrogen-suppression technique but by including all atoms in the ACM from which the ACMCP is derived. While it is conjectured²⁷ that this practice will eliminate the cited problems, it nevertheless will result in exceedingly large and cumbersome ACMCP's. This is highly undesirable for both computational and storage reasons, although it may eventually prove to be necessary.

(d) Retaining consecutively numbered subscripts on atoms of the same symbol breaks the degeneracy but leads to ACMCP's which are both cumbersome and no longer invariant to numbering permutations.

(e) Another means of retaining the lost valency information is to assign different symbols for atoms of the same kind which, however, exhibit different apparent valences in the HS case. For example, in compound I, C1, C5, and C6, being univalent, can be called C(1); tervalent C2, C3, and C4 can be called C(3). The resulting HS-ACMCP is $C(1)^3C(3)^3 - [3C(1)^2C(3)^2 + 5C(1)^3C(3)] + [9C(1)^2 + 2C(1)C(3)]$. By an analogous convention, compound II yields as its HS-ACMCP $C(1)^2C(2)^3C(4) - [5C(1)^2C(2)^2 + C(1)C(2)^3 + C(1)^2C(2)C(4) + C(1)C(2)^2C(4)] + [4C(1)^2 + 6C(1)C(2) + C(2)^2]$. Although this method can be used, the enlargement of the ACMCP in this manner is neither desirable nor convenient.

(f) A technique which appears to resolve all the ambiguities found to date, yet retains the simplicity of ordinary symbolism, is one which appends to the HS-ACMCP a table identifying types and numbers of atoms of the different apparent valencies. For example, the one quadrivalent, no tervalent, three bivalent, and two univalent carbon atoms in compound II can be assembled into a "valency count" as C(1, 0, 3, 2). The comparable valency count for compound I is C(0, 3, 0, 3). By inclusion of this simple concept, all degeneracies observed to date in HS-ACMCP's have been broken. This artifice now appears to retain the virtues of convenience and reliability.²⁸

It is now proposed that, for the hydrogen-suppressed cases, the appropriate polynomial to use as a computer-oriented name is the resultant ACMCP with the inclusion of hydrogen count in the first term and an appended valency count (or VC) collection. To this assemblage has been given the name "modified hydrogen-suppressed ACMCP," or MHS-ACMCP. A list of some MHS-ACMCP's for various hexadecenes is presented in Table II. The conjecture is that if two pictographs have identical MHS-ACMCP's, they represent identical molecules, at least to the extent to which information has been entered into the ACM,²⁹ which may include ignoring stereoisomeric data. It is interesting and useful to point out that, as can be seen from Table III, the polynomial portion of the MHS-ACMCP can be derived from the complete ACMCP by dividing all terms beyond the first by H raised to the hydrogen count (exponent) and eliminating all those resulting terms which contain a negative power of H as one of their factors.

Other ACMCP Approximations.-There are two other simple ACMCP approximation schemes which, while not uniquely sufficient in identifying a molecular compound or system, are necessary to be satisfied. As such, they are adaptable to small or personal data collecting and referral or to very rapid screening processes. Particularly suitable for the latter purpose is the "condensed" ACMCP, wherein the first term (or molecular formula) is first written down and is then followed by the algebraic sum of the coefficients of the remaining terms. By this procedure, condensation of the appropriate complete ACMCP's gives $C^2H^2O - 2$ for ketene and $C^5H^6 - 9$ for cyclopentadiene (refer to Table III). The modified hydrogen-suppressed ACMCP's yield, respectively, $C^2H^2O - 8$ and $C^5H^6 + 23$ by compression. The two systems give differing results and, hence, operational conventions must be adopted.

Another method of approximation, equally adaptable to

(24) This is readily proven by the fact that an nth order matrix can yield at most an nth degree polynomial containing n + 1 coefficients when the terms are grouped by total exponential degree. However, since the ACM diagonal elements contain no numerical additive component other than the constituent atoms (groups or particles), the "trace" (sum of such numerical additive components) of the matrix is zero, and the coefficient of the (n - 1)th degree term (determined by the trace) in the ACMCP is also zero. Hence, the ACMCP contains only n differently ordered terms, with the first term being of the nth degree and the next one of the (n - 2)th degree.
(25) L. Spialter, J. Chem. Doc., 4, 269 (1964).
(26) This simple case was pointed out to the author by Dr. Sylvan Ekman, Pitman-Dunn Institute for Research, U. S. Army, Frankford Arsenal, Pa.
(27) No mathematical proof for the uniqueness of the transformation from the total ACMCP to the initial pictograph (where equivalent symbolism carries with it equivalent valency) has yet been obtained. However, since the ACMCP is invariant with respect to the defined permutations of the ACM (which leave only atomic symbols in the principal diagonal), the mapping from pictograph to ACMCP is single-valued for a given type of symbolism and quantitative standard for the pictured connectivities.

Table II. Modified Hydrogen-Suppressed ACMCP's for Various Isomeric Hexadecenes(a)

                                          ACMCP coefficients(b)
Name                    C^16H^32  C^14   C^12   C^10   C^8    C^6     C^4    C^2    C^0   VC(c)
2-Propyl-1-tridecene       1      -18    +126   -444   +842   -840    +388   -56     0    C(1,0,13,2)
2-Pentyl-1-undecene        1      -18    +126   -444   +842   -844    +404   -68     0    C(1,0,13,2)
2-Heptyl-1-nonene          1      -18    +126   -444   +842   -844    +404   -72     0    C(1,0,13,2)
2-Hexyl-1-decene           1      -18    +126   -444   +842   -844    +404   -72    +4    C(1,0,13,2)
2-Butyl-1-dodecene         1      -18    +126   -444   +842   -844    +404   -80    +4    C(1,0,13,2)
2-Hexadecene               1      -18    +127   -451   +855   -840    +378   -57    +1    C(0,2,12,2)
4-Hexadecene               1      -18    +127   -454   +879   -903    +438   -72    +1    C(0,2,12,2)
6-Hexadecene               1      -18    +127   -454   +879   -906    +450   -81    +1    C(0,2,12,2)
8-Hexadecene               1      -18    +127   -454   +879   -906    +450   -84    +1    C(0,2,12,2)
7-Hexadecene               1      -18    +127   -454   +879   -906    +450   -84    +4    C(0,2,12,2)
5-Hexadecene               1      -18    +127   -454   +879   -906    +453   -90    +4    C(0,2,12,2)
3-Hexadecene               1      -18    +127   -454   +882   -924    +483   -102   +4    C(0,2,12,2)
1-Hexadecene               1      -18    +130   -484   +990   -1092   +588   -120   +4    C(0,1,14,1)

(a) Cis-trans isomerism is ignored. (b) All terms of odd degree have zero coefficients. (c) Valency count (see text).

Table III. Comparison of ACMCP's for Various Molecules

Ketene
    Complete ACMCP:  $C^2H^2O - (4CH^2 + 4H^2O + 2CHO) + 8H$
    MHS-ACMCP(a):    $C^2H^2O - (4C + 4O)$; C(1,0,1,0)

3-Iodocyclopropene
    Complete ACMCP:  $C^3H^3I - (C^2H^3 + 3C^2H^2I + 6CH^3I) + 4H^3I + (2CH^2 + 3CHI + 4H^3 + 6H^2I) - (H + I)$
    MHS-ACMCP(a):    $C^3H^3I - (C^2 + 6CI) + 4I + 4$; C(0,3,0,0)

Cyclopentadiene
    Complete ACMCP:  $C^5H^6 - (6C^4H^5 + 11C^3H^6) + (14C^3H^4 + 42C^2H^5 + 26CH^6) + 8H^6 - (18C^2H^3 + 49CH^4 + 42H^5) + (9CH^2 + 20H^3) - 2H$
    MHS-ACMCP(a):    $C^5H^6 - 11C^3 + 26C + 8$; C(0,4,1,0)

Benzene (Kekule)
    Complete ACMCP:  $C^6H^6 - (6C^5H^5 + 15C^4H^6) + (15C^4H^4 + 60C^3H^5 + 63C^2H^6) - (20C^3H^3 + 90C^2H^4 + 126CH^5 + 81H^6) + (15C^2H^2 + 60CH^3 + 63H^4) - (6CH + 15H^2) + 1$
    MHS-ACMCP(a):    $C^6H^6 - 15C^4 + 63C^2 - 81$; C(0,6,0,0)

(a) Modified hydrogen-suppressed ACMCP (see text).

(28) Although only comparison of valency count for carbon was empirically found to be sufficient for this purpose, it is probably safest to list the valency count for all atoms wherein the "apparent valency" differs from the actual value in the molecule because of hydrogen suppression. Such VC data are available from the ACM because they involve simply a tabulation of the sum of the row (or column) connectivity entries for each atom in the diagonal.
(29) This concept suggests the direction in which to seek a mathematical proof for the uniqueness of the MHS-ACMCP. It should be necessary and sufficient to establish that, given the set of atoms or points defined by a given valency count, no two pictographs or linear graphs can be constructed therefrom which yield the same ACMCP. Assistance in this proof may be rendered by utilization of the physico-geometric significance of the ACMCP which is presented elsewhere.
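The ketene entries of Table III, and the degeneracy of the hydrogen-suppressed compounds I and II, can be checked with a short sketch. It assumes that the ACMCP is the expansion of the determinant of the ACM, with the atomic symbols kept as indeterminates on the diagonal and the numerical connectivities off the diagonal. The function names (`acmcp`, `condensed`, `valency_count`), the 0-based atom numbering, and the monomial encoding are illustrative choices, not part of the original method.

```python
from collections import Counter
from itertools import permutations


def perm_sign(perm):
    """Return +1 or -1, the sign of a permutation given as a tuple of images."""
    sign, seen = 1, set()
    for start in range(len(perm)):
        length, j = 0, start
        while j not in seen:
            seen.add(j)
            j = perm[j]
            length += 1
        if length and length % 2 == 0:  # each even-length cycle flips the sign
            sign = -sign
    return sign


def connectivity(n, bonds):
    """Build the symmetric off-diagonal part of the ACM from {(i, j): order}."""
    conn = [[0] * n for _ in range(n)]
    for (i, j), order in bonds.items():
        conn[i][j] = conn[j][i] = order
    return conn


def acmcp(symbols, bonds):
    """Expand det(ACM) over all permutations (assumed definition of the ACMCP).

    Returns {monomial: coefficient}; a monomial is a sorted tuple of
    (symbol, exponent) pairs such as (("C", 2), ("H", 2), ("O", 1)).
    """
    n = len(symbols)
    conn = connectivity(n, bonds)
    poly = Counter()
    for perm in permutations(range(n)):
        coeff, atoms = perm_sign(perm), Counter()
        for i, j in enumerate(perm):
            if i == j:
                atoms[symbols[i]] += 1   # diagonal entry: atomic symbol
            else:
                coeff *= conn[i][j]      # off-diagonal entry: connectivity
        if coeff:
            poly[tuple(sorted(atoms.items()))] += coeff
    return {m: c for m, c in poly.items() if c}


def condensed(poly, n_atoms):
    """Sum the coefficients of all terms below the leading (n-th degree) term."""
    return sum(c for m, c in poly.items() if sum(e for _, e in m) < n_atoms)


def valency_count(symbols, bonds, symbol="C"):
    """Tally atoms of one symbol by apparent valency: (quadri, ter, bi, uni)."""
    conn = connectivity(len(symbols), bonds)
    tally = Counter(sum(row) for s, row in zip(symbols, conn) if s == symbol)
    return tuple(tally[v] for v in (4, 3, 2, 1))


# Ketene, H2C=C=O, hydrogens included (atoms 0..4 = C, C, O, H, H).
ketene = acmcp(["C", "C", "O", "H", "H"],
               {(0, 1): 2, (1, 2): 2, (0, 3): 1, (0, 4): 1})

# Hydrogen-suppressed compounds I and II (six carbons each).
bonds_I = {(0, 1): 1, (1, 2): 2, (2, 3): 1, (3, 4): 1, (3, 5): 1}
bonds_II = {(0, 1): 2, (1, 2): 1, (2, 3): 1, (3, 4): 1, (1, 5): 1}
hs_I = acmcp(["C"] * 6, bonds_I)
hs_II = acmcp(["C"] * 6, bonds_II)

print(ketene, condensed(ketene, 5))
print(hs_I == hs_II, valency_count(["C"] * 6, bonds_I),
      valency_count(["C"] * 6, bonds_II))
```

The permutation expansion grows factorially with the number of atoms, so it is practical only for very small pictographs; it was chosen here because it keeps the symbolic diagonal explicit without any external algebra package.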

fast computer scanning, is the "truncated" ACMCP. This is obtained by considering only the first three or four terms of the ACMCP for indexing and then hand-sorting for further selection. This approximation method is particularly convenient because, with techniques described elsewhere, it is possible to compute the value of the first few coefficients by simple inspection of the molecular pictograph without the necessity of constructing the ACM and reducing it to the polynomial. This visual computation technique is rapid, screens a large quantity of data efficiently, and yet maintains a close discernible rapport between the chemical and the mathematical concepts of the ACM-ACMCP nomenclature. As such it lends itself to personal filing systems, commercial chemical catalog listing, and the like.

It must be emphasized that most of the specific details of formulation and procedure discussed above need not be the concern of the average chemist. He is free to converse with his colleagues by means of molecular pictographs, as he now does, even in abstracts, or by compound names, systematic or trivial, but he leaves most of the literature searching and data retrieval, particularly that related to specific compounds, to computer-based systems. His link to and from the computer is the pictograph, but the ACM-ACMCP concept is suggested as providing the working language within the storage-retrieval system.

Acknowledgments.-Acknowledgment is gratefully made to the Applied Mathematics Laboratory, ARL, for assistance in the more sophisticated development of some material in this paper. In particular should be mentioned the computer programming help given by Henry E. Fettis and James C. Caslin, the location of many hydrogen-suppressed ACMCP redundancies by Lt. Donald L. Barnett, and the excellent sounding board qualities of Major John V. Armitage, all of the Laboratory.

The Atom Connectivity Matrix Characteristic Polynomial (ACMCP) and Its Physico-Geometric (Topological) Significance

By LEONARD SPIALTER
Chemistry Research Laboratory, ARL, Wright-Patterson Air Force Base, Ohio
Received February 24, 1964

INTRODUCTION

The concept of the Atom Connectivity Matrix (ACM) and its characteristic polynomial (ACMCP) as a computer-oriented nomenclature for chemical pictographs has been described elsewhere. The ACM is an array in which symbols for the constituent atoms or vertices of a pictograph (chemical or otherwise) are placed as mathematical elements in the main diagonal, and values or symbols for the respective connectivities for pairs of atoms are entered at appropriate off-diagonal column-row intersections. The ACMCP is derivable from the ACM by any of a number of algebraic processes.³ The ACM is clearly directly relatable, datum-by-datum, to its pictographic progenitor. However, it appeared initially that, in the process of transforming the ACM into the ACMCP, any physical significance would be lost and the ACMCP would be simply a collection of mathematical artifacts, exceedingly useful but physically and chemically meaningless. Such proved not to be the case.

The Homoatomic ACMCP.-If one takes a homoatomic⁴ ACM of completely general connectivity, such as the fifth-order symmetric ACM⁵

$$
\begin{pmatrix}
X & p & q & r & s \\
p & X & t & u & v \\
q & t & X & w & y \\
r & u & w & X & z \\
s & v & y & z & X
\end{pmatrix}
$$

(where X represents the homoatom; i and j are running indexes (from 1 to 5) to identify row and column, respectively; and p, q, r, ..., w, y, z represent the appropriate connectivity between the $X_i$ and $X_j$ atoms) and determines the coefficients of each degree in X in the related ACMCP, the following results are obtained.

Coeff. of $X^5$: $1$

Coeff. of $X^4$: $0$

Coeff. of $X^3$: $-(p^2 + q^2 + r^2 + s^2 + t^2 + u^2 + v^2 + w^2 + y^2 + z^2)$

Coeff. of $X^2$: $+2(pqt + pru + psv + qrw + qsy + rsz + tuw + tvy + uvz + wyz)$

Coeff. of $X$: $-[2(pquw + pqvy + prtw + prvz + psty + psuz + qrtu + qryz + qstv + qswz + rsuv + rswy + tuyz + tvwz + uvwy) - (p^2w^2 + p^2y^2 + p^2z^2 + q^2u^2 + q^2v^2 + q^2z^2 + r^2t^2 + r^2v^2 + r^2y^2 + s^2t^2 + s^2u^2 + s^2w^2 + t^2z^2 + u^2y^2 + v^2w^2)]$

Constant term: $+2[(pquyz + pqvwz + prtyz + prvwy + pstwz + psuwy + qrtvz + qruvy + qstuz + qsuvw + rstuy + rstvw) - (p^2wyz + q^2uvz + r^2tvy + s^2tuw + t^2rsz + u^2qsy + v^2qrw + w^2psv + y^2pru + z^2pqt)]$
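The coefficients listed above can be checked numerically with a small sketch. It assumes that the homoatomic ACMCP is the determinant of X times the identity plus the zero-diagonal connectivity matrix, and that the three lowest-order patterns correspond to bonded pairs, triangles, and four-atom subsets; the integer values given to p through z and the function name `homoatomic_coefficients` are arbitrary illustrations.

```python
from itertools import combinations, permutations


def perm_sign(perm):
    """Return +1 or -1, the sign of a permutation given as a tuple of images."""
    sign, seen = 1, set()
    for start in range(len(perm)):
        length, j = 0, start
        while j not in seen:
            seen.add(j)
            j = perm[j]
            length += 1
        if length and length % 2 == 0:  # each even-length cycle flips the sign
            sign = -sign
    return sign


def homoatomic_coefficients(conn):
    """Return c where c[k] is the coefficient of X**k in det(X*I + conn)."""
    n = len(conn)
    coeffs = [0] * (n + 1)
    for perm in permutations(range(n)):
        term, fixed = perm_sign(perm), 0
        for i, j in enumerate(perm):
            if i == j:
                fixed += 1            # a diagonal X is picked up
            else:
                term *= conn[i][j]    # an off-diagonal connectivity
        coeffs[fixed] += term
    return coeffs


# Arbitrary illustrative connectivities for the fifth-order symmetric ACM.
p, q, r, s, t, u, v, w, y, z = 1, 2, 3, 1, 2, 1, 3, 2, 1, 2
conn = [
    [0, p, q, r, s],
    [p, 0, t, u, v],
    [q, t, 0, w, y],
    [r, u, w, 0, z],
    [s, v, y, z, 0],
]
c = homoatomic_coefficients(conn)

# Sum of squared connectivities over all atom pairs.
squares = sum(conn[i][j] ** 2 for i, j in combinations(range(5), 2))
# Sum of connectivity products around every three-atom ring.
triangles = sum(conn[i][j] * conn[j][k] * conn[i][k]
                for i, j, k in combinations(range(5), 3))
# For every four-atom subset: disjoint bond pairs and the three 4-rings.
pairs = four_rings = 0
for a, b, d, e in combinations(range(5), 4):
    pairs += (conn[a][b] * conn[d][e]) ** 2
    pairs += (conn[a][d] * conn[b][e]) ** 2
    pairs += (conn[a][e] * conn[b][d]) ** 2
    four_rings += conn[a][b] * conn[b][d] * conn[d][e] * conn[e][a]
    four_rings += conn[a][b] * conn[b][e] * conn[e][d] * conn[d][a]
    four_rings += conn[a][d] * conn[d][b] * conn[b][e] * conn[e][a]

assert c[5] == 1 and c[4] == 0
assert c[3] == -squares
assert c[2] == 2 * triangles
assert c[1] == pairs - 2 * four_rings
```

Read this way, each coefficient collects one kind of sub-pictograph: isolated bonds for the cubic term, three-membered rings for the quadratic term, and pairs of separated bonds together with four-membered rings for the linear term.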

(3) Methods and mathematical nomenclature herein used are to be found in any text or reference on matrix algebra, matrix calculus, or determinants. For the sake of brevity and concentration on the newer concepts introduced in this work, standard operations will not be detailed.
(4) For the sake of simplicity in understanding the initial development, it is preferable that all atoms in the diagonal be made equivalent. Extension to heteroatomic cases will be presented later.
(5) The symmetric case is representative of a molecule or transition state