Economical procedure of coding molecular structures for the computer

Mar 1, 1973 - Economical procedure of coding molecular structures for the computer-handling of spectra collections. J. Franzen, H. Hillig, W. Riepe, a...
Economical Procedure of Coding Molecular Structures for the Computer-Handling of Spectra Collections

J. Franzen, H. Hillig, W. Riepe, and S. Stavridou
Institut für Spektrochemie und Angewandte Spektroskopie, Dortmund, West Germany

A method for the coding of structures of organic molecules in a computer-compatible form is described. The coding is based upon the topological description of the structures, and is developed further by the introduction of a small number of abbreviations for structural sub-units which can be deliberately modified during use. The method represents a favorable compromise with respect to short coding time, simple coding procedure, punched card economy, and error detection capability.

Figure 1. Structure of 3-aminobenzaldehyde. Numbers in circles mark the enumeration chosen for the coding list of Table I.

EFFICIENT USE of computer-stored spectra collections necessitates the storage of molecular structures in a computer-treatable form. Only then is the collection suitable for searching for spectra of compounds with given structural details, and for the automatic investigation of correlations between structural and spectral features by statistical or “learning” methods.

Several computer-suitable coding methods for molecular structures have been developed in the field of literature documentation. There are essentially three different ways to code molecular structures: line notation codes [Wiswesser (1, 2), IUPAC (3)], topological codes (4-6), and fragment codes, e.g., Gremas (7, 8) or DMS (9). The fragment codes are not considered here because they do not describe the full structure. Line notations have the advantage of an extremely short coding form that is easy for the chemist to read after a short period of training. It is, however, difficult to convert structures into this code correctly, and the notations do not support the computer in finding coding errors.

The most favorable representation of structures for searching purposes with computers is given by “connection tables” which contain descriptors of all non-hydrogen atoms and their mutual bonds. Such a representation is called a “topological code.” There is, however, no agreement on the most suitable form of external coding by the chemist, or on the most useful input format for this code. For the purpose of easy data conversion, optical readers for graphical molecular structures have been built (10). These machines, however, are practical only in the daily routine of documentation centers, and a simple do-it-yourself method for use in everyone’s laboratory is still needed.

(1) E. G. Smith, “The Wiswesser Line-Formula Chemical Notation,” McGraw Hill, New York, N.Y., 1968.
(2) W. J. Wiswesser, J. Chem. Doc., 8, 146 (1968).
(3) IUPAC Commission of Codification, Ciphering and Punched Card Techniques, “Rules for IUPAC Notation for Organic Compounds,” Longmans, Green and Co., London, 1961.
(4) C. N. Mooers, Zator Tech. Bull., 59, 1 (1951).
(5) E. Meyer, Angew. Chem., 15, 605 (1970); Angew. Chem. Int. Ed. Engl., 9, No. 8 (1970).
(6) M. F. Lynch, J. M. Harrison, W. G. Town, and J. E. Ash, “Computer Handling of Chemical Structure Information,” Macdonald, London, and American Elsevier Inc., New York, N.Y., 1971.
(7) R. Fugmann, W. Braun, and W. Vaupel, Nachr. Dok., 14 (4), 179 (1963).
(8) S. Rossler and A. Kolb, J. Chem. Doc., 10 (2), 128 (1970).
(9) G. Bergmann and G. Kresze, Angew. Chem., 67, 680 (1955).
(10) E. Meyer, Nachr. Dok., 13, 144 (1962).

The full notation for a topological description of a structure is much longer than a line notation, but it can be written quickly after a short training. Since all bonds are coded twice, automatic error checking becomes possible. The introduction of abbreviations for frequently appearing structural details is the usual means to shorten topological coding. Such lists of abbreviations, however, tend to grow bigger and bigger, and the advantage of easy learnability vanishes. Therefore, we developed a flexible coding system in which a moderate number of universal abbreviations is provided but can be adapted to special cases of application.

This coding method combines some advantages of a topological code with some of the line notations. There are, however, differences with respect to the ranges of application: whereas a line notation is easy to read but difficult to code correctly, our system provides easy coding, but the coded words are more difficult to read. It is therefore a typical computer-input code, whereas the line notations are better computer-output codes in cases where no graphical structure formulas on oscilloscope screens or plotter paper are available. The coding method described subsequently was applied to a collection of mass spectra with respect to learning programs.

Basic Aspects of the Topological Code. The topological code describes only the non-hydrogen atoms and their mutual bonds. An example is given by the coding of 3-aminobenzaldehyde (Figure 1). The coding starts with an enumeration of all non-hydrogen atoms of the molecule. The sequence is arbitrary. Then a list is set up which describes in each row the kind of one non-hydrogen atom and its bonds to the other atoms in the list. The types of bonds are characterized by numerals. In Table I, the coding list of 3-aminobenzaldehyde is presented. The data of Table I are fed into the computer.
The program knows the symbols and valences of all elements and the valences of the different types of bonds. For elements which can have different valences, several different symbols are introduced. From the knowledge of the valences, the program can easily calculate the numbers and positions of hydrogen atoms. Thus, the complete structure including hydrogen atoms is known to the computer. The computer converts the input data into a “connection table” which is an equivalent representation of the topological coding list of Table I. In the connection table of 3-aminobenzaldehyde (Table II), position (3,3) denotes a carbon atom, which is connected by a single bond (type 1) to the carbon atom in position (2,2) and by aromatic bonds (type 5) to the carbon atoms in positions (4,4) and (8,8).
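The conversion described above can be illustrated with a minimal Python sketch. It is not the authors' FORTRAN program. The data are the rows of the coding list of Table I; the function names, the use of element symbols on the main diagonal, and the valence values for C, N, and O are assumptions made for this illustration.

```python
# Coding list of Table I: atom number -> (symbol, [(bond type, partner), ...]).
CODING_LIST = {
    1: ("O", [(2, 2)]),
    2: ("C", [(2, 1), (1, 3)]),
    3: ("C", [(1, 2), (5, 4), (5, 8)]),
    4: ("C", [(5, 3), (5, 5)]),
    5: ("C", [(5, 4), (5, 6), (1, 9)]),
    6: ("C", [(5, 5), (5, 7)]),
    7: ("C", [(5, 6), (5, 8)]),
    8: ("C", [(5, 7), (5, 3)]),
    9: ("N", [(1, 5)]),
}

# Valence consumed by each bond type used in Table I
# (1 = single, 2 = double, 5 = aromatic).
BOND_VALENCE = {1: 1.0, 2: 2.0, 5: 1.5}

# Assumed normal valences of the elements in this example.
ELEMENT_VALENCE = {"C": 4, "N": 3, "O": 2}


def build_connection_table(coding_list):
    """Return a square table: atoms on the diagonal, bond types elsewhere."""
    size = len(coding_list)
    table = [[0] * size for _ in range(size)]
    for number, (symbol, bonds) in coding_list.items():
        table[number - 1][number - 1] = symbol
        for bond_type, partner in bonds:
            table[number - 1][partner - 1] = bond_type
    return table


def hydrogen_counts(coding_list):
    """Derive the number of hydrogens on each atom from the free valences."""
    counts = {}
    for number, (symbol, bonds) in coding_list.items():
        used = sum(BOND_VALENCE[bond_type] for bond_type, _ in bonds)
        counts[number] = round(ELEMENT_VALENCE[symbol] - used)
    return counts


if __name__ == "__main__":
    for row in build_connection_table(CODING_LIST):
        print(" ".join(str(cell) for cell in row))
    print(hydrogen_counts(CODING_LIST))
```

Because every bond appears in the rows of both partners, the resulting table is symmetric about the main diagonal, and the hydrogen counts add up to the seven hydrogens of C7H7NO.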

ANALYTICAL CHEMISTRY, VOL. 45, NO. 3, MARCH 1973


Table I. Topological Coding List of 3-Aminobenzaldehyde

              Bond 1 (a)        Bond 2            Bond 3
No.  Symbol   Type  with No.    Type  with No.    Type  with No.
1    O        2     2
2    C        2     1           1     3
3    C        1     2           5     4           5     8
4    C        5     3           5     5
5    C        5     4           5     6           1     9
6    C        5     5           5     7
7    C        5     6           5     8
8    C        5     7           5     3
9    N        1     5

(a) Type of bonds: 1 denotes a single bond, 2 a double bond, and 5 an aromatic bond.

Table II. Connection Table of 3-Aminobenzaldehyde (a)

[Matrix body not recoverable from the source; rows and columns are numbered 1 to 9.]

(a) The figures in the main diagonal list the non-hydrogen atoms cited by their chemical numbers. The numbers in the other positions of the connection table describe the types of bonding with the same numerals as in Table I.

Table III. List of Abbreviations of Type 2

[Excerpt; only the following entries are legible in the source.]

Symbol   Structural unit
R1       -CH=CH2
R2       -CH2-CH=CH2
Q1       -C(=O)-
Q2       -CN
N8       -NO2

Abbreviations. To shorten the topological coding list, we introduced four kinds of abbreviations which are denoted by symbols similar to those of the elements.

The first type of abbreviation describes elements which can appear in different states of valence, and some special isotopes, each by a separate symbol.

The second type embraces about 30 frequently appearing fixed structural units with either one or two open bonds. Structures with two open bonds have to be symmetrical. The abbreviations shown in Table III were chosen with respect to their frequency of occurrence in the molecular structures of the MSDC spectra.

The third type (Table IV) represents unbranched saturated carbon chains in end-positions, in mid-positions, or in rings. For these abbreviations, the number of C atoms is variable.

The fourth type of abbreviation (Table V) is the most flexible one. It describes rings and systems of condensed rings. Only 10 symbols represent some basic skeleton structures consisting of rings with 5 and 6 atoms. The same symbol can be used to describe either the saturated or the aromatic form of a basic skeleton. The positions of substituents can easily be indicated, the type of special skeleton bonds can be altered, and hetero atoms can be introduced in exchange for C atoms. Finally, it is possible to condense several skeleton structures to give a more complex molecule.


Table IV. Symbols of Third Type Abbreviations (a)

A1-A9   aliphatic chains in end-positions
C1-C9   aliphatic chains in rings
D1-D9   aliphatic chains in mid-positions

(a) The letters indicate the type of C-chains, and the numbers their length.
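As an illustration of how the symbols of Table IV can be interpreted, the following Python sketch splits a third-type symbol into chain type and length and derives the hydrogen count of the saturated chain. The function name and the number of open bonds assigned to each chain type (one for end-positions, two for mid-positions and rings) are assumptions of this sketch, not statements from the paper.

```python
# Chain type and assumed number of open bonds for each letter of Table IV.
CHAIN_KINDS = {
    "A": ("end-position", 1),
    "C": ("ring", 2),
    "D": ("mid-position", 2),
}


def parse_chain(symbol):
    """Return (chain type, number of C atoms, number of H atoms)."""
    if len(symbol) != 2 or symbol[0] not in CHAIN_KINDS or symbol[1] not in "123456789":
        raise ValueError(f"not a third-type abbreviation: {symbol!r}")
    kind, open_bonds = CHAIN_KINDS[symbol[0]]
    carbons = int(symbol[1])
    # A saturated unbranched chain CnH(2n+2) loses one hydrogen per open bond.
    hydrogens = 2 * carbons + 2 - open_bonds
    return kind, carbons, hydrogens


if __name__ == "__main__":
    for symbol in ("A1", "A2", "D2"):
        print(symbol, parse_chain(symbol))
```

Under these assumptions A2 stands for an ethyl group (C2H5) and D2 for a -CH2-CH2- bridge.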

The use of these abbreviations and the specification of their exact form are described in the section “Coding Procedure.”

Bonds. Bonds are characterized by eight figures as shown in Table VI. Seven of them mark single, double, and triple bonds, each in open chains and in rings, and the aromatic bond. The last figure characterizes bonds between complete substructures of the molecule such as ionic bonds or bonds in complexes.

Punched Card Format. The topological coding list is fed into the computer via punched cards. For reasons of card and punch time economy, we developed two options of card formats.

Option 1 is limited to a maximum of 35 non-hydrogen atoms in one molecule. This option proved to be sufficient in our cases. The limitation chosen allows the addresses of the non-hydrogen atoms (which hitherto have been represented by numerals) to be expressed by the alphameric characters A-Z and 1-9. These addresses, therefore, take only one column of the punched cards.
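The limit of 35 atoms follows directly from the size of the address alphabet: 26 letters plus 9 digits. A small Python sketch of the mapping between one-column addresses and running atom numbers is given here. The ordering (A-Z first, then 1-9) and the function names are assumptions for illustration.

```python
import string

# One-column address alphabet of option 1: A-Z followed by 1-9 (assumed order).
ADDRESS_ALPHABET = string.ascii_uppercase + "123456789"


def address_to_number(address):
    """Map a one-column address to a running atom number (1-35)."""
    position = ADDRESS_ALPHABET.find(address)
    if len(address) != 1 or position < 0:
        raise ValueError(f"invalid address: {address!r}")
    return position + 1


def number_to_address(number):
    """Map a running atom number (1-35) to its one-column address."""
    if not 1 <= number <= len(ADDRESS_ALPHABET):
        raise ValueError("option 1 supports at most 35 non-hydrogen atoms")
    return ADDRESS_ALPHABET[number - 1]


if __name__ == "__main__":
    print(len(ADDRESS_ALPHABET))
    print(address_to_number("E"), number_to_address(27))
```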

Table V. Symbol List of Fourth Type Abbreviations (a)

[Structure drawings with position counts are not reproduced. Skeletons legible in the source: benzene, cyclopentadiene, indane, indene, naphthalene, fluorene.]

(a) These structure units can be changed by hydrogen substitution, by carbon substitution with the heteroatoms N, O, and S, and by alteration of bonds within the structure. Finally, several of these structure units can be condensed to give a larger unit. The numbers represent the position count within these structure units in the case of aromatic systems. If the symbols are used to describe saturated units, the position count starts at the same atoms, but includes all C atoms which can have hydrogen substitution.

Table VI. Types of Bonds and Their Symbols (a)

Type   Description of bond                              Number of valences
0      ionic bond, bond between complexes, and          0
       other bonds with valence zero
1      single bond in an open chain                     1
2      double bond in an open chain                     2
3      triple bond in an open chain                     3
4      single bond in a ring                            1
5      aromatic bond                                    1.5
6      double bond in a ring                            2
7      triple bond in a ring                            3

(a) For the description of stereoisomeric compounds, additional types of “position descriptors” may be introduced.

Table VII. Arrangement of the Description Field (a)

Column   Content
1 (a)    Address count (A-Z, 1-9) of either the non-hydrogen atom, or the first non-hydrogen atom of a structural sub-unit, described in the following ten columns
2, 3     Symbol of the non-hydrogen atom, or of the abbreviated structural sub-unit
4 (n)    Type (0-7) of the first bond
5 (a)    Address count (A-Z, 1-9) of the bound atom
6, 7     Type and address of second bond
8, 9     Type and address of third bond
10, 11   Type and address of fourth bond
12       Space

(a) Twelve columns (including space) of the punched card code contain the full description of a non-hydrogen atom including its bonds, or of a structural sub-unit. Six of these fields are placed into the first 72 columns of a punched card, a continuation mark into column 73, and the substance identification number into columns 74-80. All characters of the code are chosen in such a way that for each column the keyboard of the card punch machine is by preference either in the lower case shift (a = alphabetic) or in the higher case shift (n = numeric).
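The card layout given for option 1 (twelve-column description fields, six fields in columns 1-72, a continuation mark in column 73, and the substance number in columns 74-80) can be sketched in Python as follows. This is only an illustration under stated assumptions: symbols are taken to be left-justified, any non-blank character in column 73 is treated as a continuation mark, only plain bond pairs (type digit plus address) are handled, and the three-atom example molecule and the substance number are invented for the demonstration.

```python
FIELD_WIDTH = 12        # eleven code columns plus one space
FIELDS_PER_CARD = 6     # six description fields fill columns 1-72


def make_field(address, symbol, bonds):
    """Lay out one description field: address, symbol, up to four bonds, space."""
    if len(bonds) > 4:
        raise ValueError("a description field holds at most four bonds")
    bond_text = "".join(f"{bond_type}{partner}" for bond_type, partner in bonds)
    return address + symbol.ljust(2) + bond_text.ljust(8) + " "


def parse_field(field):
    """Split one description field into address, symbol, and bond pairs."""
    field = field.ljust(FIELD_WIDTH)
    address = field[0]
    symbol = field[1:3].strip()
    bonds = []
    for start in range(3, 11, 2):          # columns 4-11, two columns per bond
        pair = field[start:start + 2]
        if pair.strip():
            bonds.append((int(pair[0]), pair[1]))
    return address, symbol, bonds


def parse_card(card):
    """Return (description fields, continuation flag, substance number)."""
    card = card.ljust(80)
    end = FIELD_WIDTH * FIELDS_PER_CARD
    fields = [card[i:i + FIELD_WIDTH] for i in range(0, end, FIELD_WIDTH)]
    records = [parse_field(field) for field in fields if field.strip()]
    continued = card[72] != " "            # column 73
    substance_id = card[73:80].strip()     # columns 74-80
    return records, continued, substance_id


if __name__ == "__main__":
    # Hypothetical three-atom chain C-C-O coded without abbreviations.
    fields = [
        make_field("A", "C", [(1, "B")]),
        make_field("B", "C", [(1, "A"), (1, "C")]),
        make_field("C", "O", [(1, "B")]),
    ]
    card = "".join(fields).ljust(72) + " " + "0000123"
    print(repr(card))
    print(parse_card(card))
```

Modification codes for fourth-type abbreviations (asterisks, minus signs, "$H") occupy the same bond columns and would need additional handling that this sketch leaves out.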

The punched card format for the molecular structure is divided into several “description fields,” each of which holds one line of the topological coding list. Each description field has to contain the address of the described non-hydrogen atom (1 column), its symbol (2 columns), and several fields for the listing of connections. For each connection, the type of bond (1 column) and the address of the connected atom (1 column) have to be noted. With respect to the four valences of carbon, we decided to place four bond fields into one description field, which in this way comprises 11 columns. If atoms or substructures with more than four bonds to non-hydrogen atoms have to be described, a second description field may be used. Table VII presents the details of the description fields and their arrangement on the punched card.

Option 2 provides two columns for each address. The use of numbers for the enumeration of the non-hydrogen atoms extends their upper limit to 99, and the addition of alphabetic characters allows even for the coding of molecular structures with more than a thousand non-hydrogen atoms. Of course, enough storage capacity has to be provided for the connection tables within the computer programs. In the case of option 2, the punched card can take up only four description fields instead of six. The following description of the coding procedure is based upon option 1 only, but can easily be extended to option 2.

Table VIII. Coding of 3-Aminobenzaldehyde (a)

A  Q1  1B
B  B1  1*  3*  1A  1E
E  N   1D

(a) Q1 is the abbreviation of the carbonyl group.

Coding Procedure. The coding starts with an enumeration of all non-hydrogen atoms of the molecule, using the letters A-Z and the figures 1-9. In the following, these enumeration symbols are called “address counts.” The address counts are used to describe the other partners of bonds, and are assigned to a distinct non-hydrogen atom by the listing in columns 1-3 of the description field. If only abbreviations up to type 3 are used, the sequence of the enumeration is arbitrary. Within an abbreviated structural sub-unit of type 4, the enumeration must follow the sequence given by the “position count” (see Table V), but can be broken off once the position of the last substituent has been numbered.

Normal coding now can be done without further information using Tables III to VII. The use of abbreviations of type 4, however, allows several modifications. These modifications are all performed by use of the “position count” shown in Table V, not by the address count. Code symbols of the modifications are placed into columns 4-11 (bond fields) of the description field. If this space is insufficient, it is possible to add more description fields for the same atom. The following modifications of type 4 abbreviations are possible:

(1) Hydrogen substitution. The positions of substitutions are given by a sequence of position counts marked by asterisks, followed by a description of the bonds in the same sequence. For example, “1* 3*” means two substitutions in positions 1 and 3. The application is shown in

Table VIII, which demonstrates the coding of the structure of 3-aminobenzaldehyde.

(2) Alterations of the basic structure of abbreviated sub-units are enclosed in two pairs of “minus” signs, e.g., “-- (sequence of alterations) --”. Three types of alterations are provided:

(a) Carbon substitutions by the hetero atoms N, O, or S are simply described by the position count of the carbon atom, followed by the hetero atom element symbol, e.g., “7N” means substitution of the carbon in position 7 by a nitrogen atom.

(b) Saturation of the complete structural sub-unit is marked by “$H.” In this case, the position count changes as described in Table V.

(c) Introduction of a double bond for a single bond is described by “n/m@” where n and m are the position counts of the C atoms which are doubly bound. If m = n + 1, the second part “m@” can be omitted.

(3) The condensation of several aromatic sub-units into larger molecules requires one complete description field with eleven columns, as described in Table IX.

Table IX. Condensation of Structural Sub-units (a)

Column   Content
1        * (asterisk)
2, 3     (blank)
4        position count of first common atom in unit 1
5        address count of structure unit 1
6        position count of first common atom in unit 2
7        address count of structure unit 2
8        position count of last common atom in unit 1
9        address count of structure unit 1
10       position count of last common atom in unit 2
11       address count of structure unit 2

(a) Description field (11 columns) for the condensation of two structural sub-units 1 and 2. Columns 4 and 5 mark an atom of unit 1 which is identical to the atom of unit 2 marked in columns 6 and 7. Columns 8 to 11 mark another atom common to both structural sub-units. All atoms which lie between these two common atoms (shortest way) in both sub-units are considered as common atoms, too.

Table X. Structure and Coding List of 1-Hydroxy-2-diethylaminoethane (a)

[Structure drawing not reproduced.]

A  A2  1C
B  A2  1C
C  N   1A  1B  1D
D  D2  1C  1E
E  O   1D

(a) The letters in circles mark the address counts chosen in this case. Both ethyl substituents (symbol A2) enumerated by address counts A and B are connected to the nitrogen (symbol N) numbered by address count C. The third bond of the nitrogen atom is connected to one end of the alkane chain (symbol D2) marked by address count D. The hydroxy group (symbol O, address count E) saturates the other end of the alkane chain (symbol D2, address count D). Note that all connections are doubly coded.

Table XI. Structure and Coding List of 2-Methoxy-5-allylphenol (a)

[Structure drawing not reproduced.]

A  B1  1*  3*  4*
A  B1  1E  1F  1G
E  R2  1A
F  O   1C
G  O1  1D

(a) The first line denotes the positions of substituents of the aromatic ring (symbol B1), and the second line lists the corresponding types of bonds and address counts (1E, 1F, 1G). The last three lines describe the types of the substituents.

Table XII. Structure and Coding List of 2,5-Dimethylmorpholine (a)

[Structure drawing not reproduced.]

A  B1  --  $H  2N  5O
A  B1  --  1*  4*
A  B1  1E  1F
E  A1  1A
F  A1  1D

(a) Within the first two lines, the notation $H 2N 5O in between the four minus signs changes the meaning of the symbol B1, normally used for benzene. $H saturates the aromatic ring to cyclohexane, and 2N 5O replaces the second and fifth carbon atoms by N and O, respectively. The 1* 4* 1E 1F, finally, characterize positions, types of bonds, and address counts of the two methyl substituents (symbol A1) marked by address counts E and F.

Table XIII. Structure and Coding List of 2,3-Dimethoxy-5,6,8,8a,9,10,13,13a-octahydro-11H-dibenzo(a,g)quinolizin-11-one (a)

[Structure drawing not reproduced.]

A  B1  1*  2*  1E  1F
C  B3  --  $H  7N  E/
C  B3  --  2*  2G
*      4A  BC  5A  AC
E  O1  1A
F  O1  1B
G  O   2D

(a) The fourth line describes the condensation of the two substructure units B1 and B3, marked by address counts A and C, respectively. Position 4 of substructure A is identical with position B of substructure C, and position 5 of substructure A matches position A of substructure C.
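The hydrogen-substitution rule (1) can be illustrated with a short Python sketch that pairs the asterisk-marked position counts with the bond descriptions that follow them. It assumes that position p of a sub-unit whose first atom has address X receives the p-th address counted from X in the sequence A-Z, 1-9, and that position counts are plain decimal numbers; both points and the function name are assumptions of this sketch. The input is the benzene ring of 3-aminobenzaldehyde (symbol B1 at address B with the fields 1* 3* 1A 1E).

```python
import string

# One-column address alphabet of option 1 (assumed order: A-Z, then 1-9).
ADDRESSES = string.ascii_uppercase + "123456789"


def decode_substitutions(unit_address, fields):
    """Pair position counts (marked '*') with the bond fields that follow.

    Returns a list of (address of ring atom, bond type, address of substituent).
    """
    positions = [int(field[:-1]) for field in fields if field.endswith("*")]
    bonds = [field for field in fields if not field.endswith("*")]
    if len(positions) != len(bonds):
        raise ValueError("each marked position needs exactly one bond field")
    base = ADDRESSES.index(unit_address)
    decoded = []
    for position, bond in zip(positions, bonds):
        ring_atom = ADDRESSES[base + position - 1]
        decoded.append((ring_atom, int(bond[0]), bond[1]))
    return decoded


if __name__ == "__main__":
    # Ring of 3-aminobenzaldehyde: B1 at address B, substituted in 1 and 3.
    print(decode_substitutions("B", ["1*", "3*", "1A", "1E"]))
```

With this reading, the ring atom in position 3 carries address D, which agrees with the entry “1D” that the amino nitrogen (address E) lists for its own bond, so the connection is again coded from both sides.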

Coding Examples. Table X demonstrates the coding of a simple alkyl structure by use of abbreviations of type 3. Tables XI to XIII illustrate the possibilities of varying the basic structure of type 4 abbreviations. The examples are arranged in order of increasing complexity, showing the arrangement of hydrogen substitution in Table XI, complete saturation of the aromatic system and carbon substitution by hetero atoms in Table XII, and condensation of ring structures in Table XIII.

Programs for Input and Checking. During the input of the punched cards, the program builds up the complete connection tables, replacing the abbreviations by the full structure. Thus the content of the connection table is independent of the individual choice of abbreviations. The connection tables, therefore, allow the computer to search for any chosen structural detail, using published principles (5, 6).

While building up the connection tables, the program performs several checks. Since all bonds between two non-hydrogen atoms are coded twice, coding errors can easily be detected (compare Table II). In addition, the molecular weight is calculated from the structure and compared with its expected correct value, which is independently fed into the computer. A check against the sum formula is possible, too.

To facilitate the searching operations, it is useful to arrange the non-hydrogen atoms of the connection table in an unequivocal sequence. Methods to order the atoms in an unambiguous sequence are known in the literature (6) but are, for the moment, not incorporated in our program system.
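The two checks mentioned above, the test for doubly coded bonds and the comparison of molecular weights, can be sketched in Python as follows. This is an illustration on the coding list of Table I, not the authors' program; the function names, the rounded atomic weights, and the element valences are assumptions of the sketch.

```python
# Coding list of Table I: atom number -> (symbol, [(bond type, partner), ...]).
TABLE_I = {
    1: ("O", [(2, 2)]),
    2: ("C", [(2, 1), (1, 3)]),
    3: ("C", [(1, 2), (5, 4), (5, 8)]),
    4: ("C", [(5, 3), (5, 5)]),
    5: ("C", [(5, 4), (5, 6), (1, 9)]),
    6: ("C", [(5, 5), (5, 7)]),
    7: ("C", [(5, 6), (5, 8)]),
    8: ("C", [(5, 7), (5, 3)]),
    9: ("N", [(1, 5)]),
}

BOND_VALENCE = {1: 1.0, 2: 2.0, 5: 1.5}           # bond types used in Table I
ELEMENT_VALENCE = {"C": 4, "N": 3, "O": 2}        # assumed normal valences
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def unmatched_bonds(coding_list):
    """List every bond (atom, type, partner) that the partner does not repeat."""
    errors = []
    for atom, (_, bonds) in coding_list.items():
        for bond_type, partner in bonds:
            reverse = coding_list.get(partner, ("", []))[1]
            if (bond_type, atom) not in reverse:
                errors.append((atom, bond_type, partner))
    return errors


def molecular_weight(coding_list):
    """Sum the weights of all coded atoms plus the hydrogens implied by valence."""
    total = 0.0
    for symbol, bonds in coding_list.values():
        used = sum(BOND_VALENCE[bond_type] for bond_type, _ in bonds)
        hydrogens = round(ELEMENT_VALENCE[symbol] - used)
        total += ATOMIC_WEIGHT[symbol] + hydrogens * ATOMIC_WEIGHT["H"]
    return total


if __name__ == "__main__":
    print(unmatched_bonds(TABLE_I))
    print(round(molecular_weight(TABLE_I), 3))
```

Dropping one of the two entries of a bond, for example the bond to atom 3 in the row of atom 8, makes `unmatched_bonds` report the surviving half, which is the kind of coding error the double listing is meant to expose.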

The input and checking programs are written for an IBM 1130 computer. Most of the programming was done in FORTRAN IV; some input and output routines, however, were written in 1130-ASSEMBLER to achieve overlapped execution. Program listings may be obtained upon request.

Condensed Storage of Connection Tables. To reduce the need for total storage capacity, the connection tables are compressed to about 6 bytes per non-hydrogen atom, i.e., 80-120 bytes are necessary for storage of each structure. During the reading of these short-format connection tables from external storage devices, the connection table is simultaneously filled up to its full format.

PRESENT EXPERIENCE AND ASSESSMENT

The coding method described here was applied to the collection of mass spectra of the Mass Spectrometry Data Centre (MSDC), Aldermaston. About 8000 structures have been coded. After a few hours of training, a coding speed of about 50 structures per hour was achieved. Using an IBM 1130 computer, input and checking of a structure takes about 1 second.

In our experience, the coding system described here seems to represent an optimum procedure with respect to short coding time, required previous knowledge, ease of training, punched card economy, computer time, and error detecting capability. The coding, however, is based upon existing graphical structure formulas. Producing a graphical picture of the structure from the official chemical names as the only basis requires a highly skilled chemist with special knowledge and takes far longer than the coding itself.

The principle of the procedure presented here is by no means new. However, this special composition of different steps, which leads to simple and time-saving operation, may find some interest.

RECEIVED for review July 13, 1972. Accepted October 19, 1972.

Computerized Quantitation of Drugs by Gas Chromatography-Mass Spectrometry

Lubomir Baczynskyj, David J. Duchamp, John F. Zieserl, Jr., and Udo Axen
Physical and Analytical Chemistry Research, The Upjohn Company, Kalamazoo, Mich. 49001

A fully computerized method for quantitative determination of picogram amounts of drugs using a GC-MS-computer system is described. In this method, the magnetic field is scanned repetitively over a mass range and the intensities of selected ions of a drug and its deuterated analog are measured by the computer. In the final report, the ratio of protium to deuterium form of the drug is obtained. The lower limit of detection of the drugs investigated so far is 200-300 picograms. The reproducibility of the individual measurements is good, as shown by the reported results.

THE COMBINATION of gas chromatography (GC) with mass spectrometry (MS) has given the analyst an extremely versatile and sensitive tool. Although the use of the mass spectrometer for quantitative determinations of components of mixtures has been known for many years in the petroleum industry (1), the recent emphasis on the qualitative information contained in a mass spectrum has overshadowed its quantitative aspect. In the present communication, we report the use of a computerized GC-MS system for quantitative determinations of very small amounts of drugs.

It had been shown previously (2) that it is possible to determine the composition of an unresolved mixture in the gas chromatographic effluents by utilizing the mass spectrometer as the detector for the gas chromatograph. In this method, the intensities of two preselected ions, each characteristic of one component of the mixture, are continuously recorded by using an accelerating voltage alternator (AVA). The heights of the two traces are measured, and their ratio yields the composition of the mixture. This technique applied to isotopically labeled mixtures such as glucose and glucose-d7 (2) gave isotopic abundances on as little as 0.55 pg with an average deviation of 5.5%. More recent applications of this method to such drugs as chlorpromazine and its metabolites (3), nortriptyline (4), and prostaglandin E1 (PGE1) (5) have proved its usefulness and general applicability to the field of drug analysis. The extension of this approach to PGE2 and PGF2α has allowed the determination of picomole quantities of these prostaglandins by using the corresponding (3,3,4,4-d4) compounds as carriers (6).

Contrary to the AVA approach (2), in our method the accelerating voltage is maintained constant, while the magnetic field is repetitively scanned at a slow rate over a narrow mass range (20-30 amu). This is similar to the system recently described by Hites and Biemann (7) in which the whole mass range is continuously scanned during a GC run and certain ions diagnostic of particular structural features are plotted as a function of the scan number to yield a “mass chromatogram.” Such an approach gives very useful qualitative information, but previously has not been widely used for quantitative measurements.

(1) J. H. Beynon, “Mass Spectrometry and Its Applications to Organic Chemistry,” Elsevier Publishing Co., Amsterdam, 1960, p 424.
(2) C. C. Sweeley, W. H. Elliott, I. Fries, and R. Ryhage, Anal. Chem., 38, 1549 (1966).
(3) C. G. Hammar, B. Holmstedt, and R. Ryhage, Anal. Biochem., 25, 532 (1968).
(4) T. E. Gaffney, C. G. Hammar, B. Holmstedt, and R. E. McMahon, Anal. Chem., 43, 307 (1971).
(5) B. Samuelsson, M. Hamberg, and C. C. Sweeley, Anal. Biochem., 38, 301 (1970).
(6) U. Axen, K. Green, D. Horlin, and B. Samuelsson, Biochem. Biophys. Res. Commun., 45, 519 (1971).
(7) R. A. Hites and K. Biemann, Anal. Chem., 42, 855 (1970).