244
J . Chem. InJ Comput. Sci. 1992, 32, 244-255
The r values are 2.9, -6.8, -18.9, and 54.5, respectively for the coefficients; 9= 0.9872, s = 4.89 OC, F = 1749, and n = 72. Calculated values and residuals for the latter correlation are presented in Table 111. The two outliers with absolute residuals larger than 12 are the same as the most serious ones in correlation 1 above. For 44 Sulfides (with mean BP = 150.7): BP = 20.01(*7.49)
+ 43.93(*0.83)'~- 6.34(*1.20)Jhct (5)
The t values are 2.7, 52.7, and -5.3, respectively; 9 = 0,989; = 4.28 OC, and F = 1807. BP = 50.59(*23.95) - 32.91(*13.45)ESS 9.86(*0.68)NMc 24.10(*0.54)K~tl (6)
s
+
The t values are 2.1, -2.4, -14.5, and 44.3, respectively; r-2 = 0,989; s = 4.3 OC; and n = 44. Although in this case both types of correlations yield practically the same statistical results, we reproduce in Table IV only the second batch of data. In both correlations there is just one outlier with a boiling point whose experimental accuracy is subject to doubt (ethyl heptyl sulfide). One may conclude this section by stating that for ethers the best correlation a ords a fit error of 4.2 OC and for sulfides a fit error of 2.8 C. In conclusion, we have found that four parameters (the molecular connectivity ' x , the topological index J modified for the presence of heteroatoms, the electrotopological state S of the heteroatoms, and the number N, of sulfur atoms) give a good correlation with BPs for unknown compounds belonging to these classes of compounds and this range of carbon atoms and heteroatoms.
sf
~
ACKNOWLEDGMENT A.T.B. thanks Sterling Drug Inc. for support. The assistance of Mr. B. Brown for computer programming was much appreciated. REFERENCES AND NOTES (1) Balaban, A. T.; Joshi, N.; Kier, L. B.; Hall, L. H. J . Chem. Inf. Compur. Sci. 1992, 32 (preceding paper in this issue). (2) Stanton, D. T.; Jurs, P. C.; Hicks, M. G. J . Am. Chem. SOC.1991, 31, 301. (3) Bordwell, F. G.; Anderson, H. M.; Pitt, B. M. J. Am. Chem. SOC.1954, 76, 1082. (4) Jurecek, M.; Vecera, M. Chem. Lisry 1954, 48, 542. (5) SAS Institute Inc. SAS User's Guide: Statistics, Version 5; SAS: Cary, NC; 1985. (6) The programs MOLCONN and MOLCONN~can be obtained from Prof. L. H. Hall, Hall Associates Consulting, 2 Davis Street, Quincy, MA 02170. (7) RandiE, M. J . Am. Chem. SOC.1975, 97, 6609. (8) Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research; Academic Press: New York, 1976. (9) Kier, L. B.; Hall, L. H. 'Molecular Connectiuify in SrrucrureAcriuiry Analysis" Research Studies Press and Wiley: New York, 1986. 10) Balaban, A. T. Chem. Phys. Lerr. 1982, 80, 399. 11) Balaban, A. T. Pure Appl. Chem. 1983, 55, 199. 12) Balaban, A. T. Marh. Chem. 1986, 21, 115. 13) Kier, L. B. Quanr. Srruct. Act. Relar. 1985, 4, 109. 14) Kier, L. B. Med. Res. Reu. 1987, 7, 417. 15) Kier, L. B. Quanr. Srrucr.-Acr. Relat. 1986, 5 , 7. 16) Kier, L. B.; Hall, L. H. Pharm. Res. 1990, 7, 801. 17) Hall, L. H.; Mohney, B.; Kier, L. B. J . Chem. InJ Compur. Sci. 1991, 31, 76. (18) White, P. T.; Barnard-Smith, D. G.; Fidler, F. A. Ind. Eng. Chem. 1952,44, 1430. (19) CRC Handbook of Physics and Chemistry, 68rh ed.; CRC: Bwa Raton, FL. (20) Dictionary of Organic Compounds, 5th ed.;Chapman and Hall: New York, 1982.
Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited ARTHUR DALBY, JAMES G. NOURSE,* W. DOUGLAS HOUNSHELL, ANN K. I. GUSHURST, DAVID L. GRIER, BURTON A. LELAND, and JOHN LAUFER Molecular Design Limited, 21 32 Farallon Drive, San Leandro, California 94577 Received January 23, 1992 Y
A series of file formats used for storing and transferring chemical structure information that have evolved over several years at Molecular Design Limited are described. These files are built using one or more connection table (Ctab) blocks. The Ctab block format is described in detail.
The file formats described are the MOLfile for a single (multifragment) molecule, the RGfile for a generic query, the SDfile for multiple structures and data, the RXNfile for a single reaction, and the RDfile for multiple reactions and data. The relationships of these files are given as well as examples. 1. INTRODUCTION
This paper describes the chemical table file (CTfile) formats currently used in a wide variety of chemical structure-manipulating computer programs. These file formats were developed by a large number of people at Molecular Design Limited (MDL) over the past 13 years.' While the formats were developed for use with the various MDL programs, their use has gone well beyond this role. The evolution of the CTfile formats did not proceed by a well-defined plan. Changes were frequently made over the years to accommodate new program features or application needs. While the changes often added new data fields or appendices, they were done such that older files would remain valid and older programs could read the portion of new files
they could interpret. Because of this policy, as well as the widespread use of programs which read and write CTfiles, there is probably more valid chemical structure information in existence in these formats than in any other format, current or proposed. While the evolution of these CTfile formats was closely tied to the development of the various MDL computer programs, the purpose of this paper is to describe just the file formats and not the various computer programs and their features. In general, references to the various programs which use these file formats will be made only when necessary to describe the purpose of a particular part of a file format. Literature references will be given in place of detailed descriptions of various program features.
0095-233819211632-0244$03.00/00 1992 American Chemical Society
MDL COMPUTER CHEMICAL STRUCTURE FILEFORMATS I
RGm
I
1
RXMlld
I
'JC
"
T.MC I. Roprrtics and Identifying lmna Applicable to Various CTfilc Typcs CTfilc type icon property MOLfilc RGfilc SDfilc RXNfile RDfilc generic + IC] + Isrl SgrOUP + Wsl Rgmw 1301 3D [CPl CPSS IRxl reaction IQI w r y
+ + +
+
+ + + + +
+
+ +
CMPN
9
Figure 1. CTfiles: SINC~UR and interrelationships.
+ + + +
J. Chem. In/: Comput. Sei.. Vol. 32*No. 3, 1992 US
+
+ +
The file formats described and their interrelationships are indicated in Figure I. The key piece describing a chemical structure is the connection table (Ctab) block indicated in all the files illustrated in Figure I. The Ctab block is not a file by itself but is used as a building block for the various files. The Ctab block is dcscribed in Section 2. The simplest complete file is the MOLfile, which contains just one Ctab block. This is described in Section 3. The RGfile (for Rgroup file) is a special query file format for use in generic searching? It is described in Section 4. The SDfile (for Structure-Datafde) can contain many MOLfdes combined with data for each. It is intended as a format for moving large numbers of chemical structures and associated data between databases. The SDfde is described in Section 5. The RXNfile (for Reaction file) contains the reactants and products of a single chemical reaction. The RXNfile is described in Section 6. The RDfile (for Reaction-Data file) is a file format that contains multiple RXNfiles and associated data. Since the RDfile is more general in that it can have MOLfiles in place of RXNfiles, it is illustrated twice in Figure 1. The RDfile is described in Section 7. Because of the very large number of chemical structure properties, it is convenient to group them by class. The classcf are. symbolized by icons, shown in Table I, which also indicates which of the various CTfile types can currently contain structures with these properties. The Generic class includes basic chemical structure properties that can be registered or used as queries. These can appear in any of the CTfiles. The Sgroup class includes properties that are associated with identified substructures of a chemical structure. These are used for a variety of purposes, often involving more complex chemical substances, and have been described in detail else where.' The Rgroup class includes properties that can only be used in Markush-like queries? The 3D class includes properties associated with a three-dimensional chemical structure. These can be query or registerable properties and have been described in detail elsewhere.' The CPSS class includes same fairly routine structural properties that are interpreted only by the CPSS (Chemist's Personal Software Series) programs. These are generally duplicated elsewhere in other CTfile locations. The Reaction class includes query and registerable properties that are associated with chemical
mn
Figure 2. Connection table (Ctab) organization illustrated using
alanine.
reactions. The more elaborate ones have bccn described elsewhere? The query class includes properties that can only be used in search queries. 2. THE CONNECTION TABLE [CTAB]
A connection table (Ctab) contains information d&bing the structural relationships and properties of a collection of atoms. The atoms may be wholly or partially connected by bands. Such collections may, for example, d&be molecules, molecular fragments, substructures, substituent groups, polymers, alloys, formulations, mixtures, and unconnected atoms. The connection table is fundamental to all of the CTfde formats. Figurea 2.5, and 6 show the connection tables of a simple molecule (alanine), an Sgroup structure (polymer), and a 3D query, respectively. The various data blocks in each Ctab are identified. The atom numbers on the structures in Figures 2 and 5 correspond to atom numbers in the Ctabs. An atom number is assigned according to the order of the atom in the Atom Block (see Section 2.2.). The format for a Ctab block is Counts line Important specifications here relate to the number of atoms, bands, and atom lists, the chiral flag setting and the Ctab version Atom block Specifies the atomic symbol and any mass difference,charge, stermchemistry, and associated hydrogens for each atom Bond block Specifies the two atoms connected by the bond, the bond type, and any bond stereochemistry and topology (chain or ring properties) for each bond Atom list block Identifies the atom (number) of the list and the atoms in the list Stext block Structural text descriptor block used by CPSS programs Properties block Provides for future expandability of Ctah features, while maintaining compatibility with earlier Ctab configurations The detailed format for each block outlined above follows. Note: A blank numerical entry of any line should be read as "0" (zero). Spaces are significant and correspond to one or more of the following: Absence of an entry Empty character positions within an entry Spaces between entries; single unless spocircaUy noted otherwise 2.1. The Counts Lim. aaabbblllfffcccsssxr~p~iimmm~
246 J . Chem. In$ Comput. Sci., Vol. 32, No. 3, 1992 Fleld
Meaning
x y z
atomcoordinates
aaa
atom svmbel
dd
ccc
values
entry in periodic table or L lor atom list. A, 0, * lor unspecifiedatom. and LP for lone pair, or R# tor Rgroup label
mass difference
-
-
i
sss
-
0 unchargedor value other lhan these. 1 +3, 2 = r2,3 = d 4 -doublet 5 -1, 6 = -2. 7 -3
-
atom stereo parny
Notes
Fleld
Maanlnrr
[GI [GI
111
first atom number
222
second atom number
ttt
bond type
1 = Single, 2 IDouble. 3 = Triple. 4 =Aromatic. 5 = Single or DouMe, 6 = Singie or Aromatic. 7 = Double or Aromatic, 8 =Any
[O] Values 4 through 8 are for SSS queries only.
sss
bond stereo
Singie bonds: 0 =not stereo, 1 = Up, 4 = Either, 6 Down Double bonds: 0 = Uw x-. y , z-coofds from atom block to determine cis or trans, 3 =Cis or trans (either)
[GI The small (pointed) end of the stereo bond is at the first atom (Field 111 above)
[Ql
PI
Pol tRO1
-3,-2,-1.0,1. 2 . 3 . 4 (0 ifvalue beyond these limils)
charge
[GI Dinereme from mass in periodlc table. Wider range of values allowed by M IS0 line, below. Retained lor compatibility wdh older Ctabs, M IS0 takes precedeme.
[GI Wider ranqe 01 values in M CHG and M RAD lines below
(*I Retainedlor wmpatibiihy with
-
0 not stereo, 1 -odd, 2 = even, 3 either or
older Ctabs, M CHG and M RAD lines lake precedence. Ignored when read. See Secton 2.8 (Stereo Notes)
[GI
unmarked stereo centef hhh
hydrogen count + 1
1 =HO, 2 -
H1. 3 H2, 4 - H 3 . 5 - H4 i
-
bbb
Stereo care bex
o gnore stereo w n t g of this atom, 1 = stereo w n f g of atom must match
""V
valence
0 -no marking (defaun) (1 to 14) (1 10 14)
-
15 I Zero valence
[Q] HO means rm H atoms
Notes
[GI [GI
-
nofused
rrr
bond topology
0 = either,l = Ring, 2 = Chain
ccc
reacting center status
0 = unmarked. 1 a center, -1 = not a center. Additional: 2 I no change, 4 = bond madebroken. 8 = bond order changes 12 418 (both madebroken and changes); 5 = (4 t 1). 9 (8 t 1). and 13 (12 t 1) are also possible
-
-
-
[QI
-
[GI
Shows number of bonds to this atom, imluding bonds to implied H'S.
Figure 4. Meaning of values in the bond block.
ICPl
list block. The atom list block is retained for compatibility, but information in an M ALS item supersedes atom list block information. Made up of atom list lines, one line per list, with the following format: aaa kSSSSn 111 222 333 444 555
HO designator
rrr
readon component type
reactant = 1, plOduct = 2, intermediate 3
[CPI
111
readion component number
OtO(IF-1)
[CPI
mmm
atom-atom mapping number
1-255
[Rxl
nnn
inversionfretention tlag
1 configurationis inveflec, 2 = configuration is relained 0 = propeny not applied
-
[Rxl
eee
exact change flag
1 =change on atom must be exactly as shown 0 = properly not applied
-
Values
xxx
allowed unless explicitly drawn Hn means atom rrust have nor more H's in excess 01 expllcit H s
HXH
where aaa
= number of atom (L) where list is attached
Figure 3. Meaning of values in the atom block.
where aaa bbb 111 fff ccc
DALBYET AL.
= number of atoms (current max 255) [C] = number of bonds (current max 255) [C] = number of atoms lists (max 30) [Q]
= (obsolete) = chiral flag; 0 = not chiral, 1 = chiral [C] sss = number of stext entries [CP] xxx = number of reaction components 1 [CP] rrr = number of reactants [CP] PPP = number of products [CP] iii = number of intermediates [CP] mmm = number of lines of additional properties, including the M END line [C] vyww = current Ctab version: ' V2000' [C] For example, the counts line in the Ctab shown in Figure 2 shows six atoms, five bonds, the CHIRAL flag on,and three lines in the properties block: 6 5 0 0 1 0 3 v2000 2.2. The Atom Block. Made up of atom lines, one line per atom with the following format: xxxxx.xxxxyyyyy.yyyyzzzzz.zzzz aaaddcccssshhhbbbvw HHHrrriiimmmnnneee where the values are described in Figure 3. Note: With Ctab version V2000, the dd and ccc fields have been superseded by the M 1s0,M CHG, and M RAD lines in the properties block, described below. For compatibility, newer programs write appropriate values in both places if the values are in the old range and read the atom block fields if there are no M ISO, M CHG,or M RAD lines in the properties block. 2.3. The Bond Block. Made up of bond lines, one line per bond, with the following format: 1 1I222tttsssxxxrrrccc where the values are described in Figure 4. 2.4. The Atom List Block [Q].Note: Newer programs use the M ALS item in the properties block in place of the atom
+
k n
= T = [NOT] list, F = normal list = number of entries in list; maximum
111...555
= atomic number of each atom on the
is 5 list = space 2.5. The Stext Block [CP]. Made up of two-line entries.6 2.6. The Properties Block. Made up of ppp lines of additional properties, where ppp is the number in the counts line described above. If a version stamp is present, ppp is ignored and the file is read until an M END line is encountered. Most lines in the properties block are identified by a prefm of the form M xxx where two spaces separate the M and xxx. Exceptions are G xxx (two-line entry), A xxx (two-line entry), and v xxx which indicate CPSS properties (group abbreviation, atom alias, and atom value, respectively! [CP] s s m n n which causes the next nnn lines to be ignored. The prefm M END terminat@ the properties block. All lines that are not understood by the program are ignored. The descriptions below use the following conventions for values in field widths of 3: n15 number of entries on line; value = 1-15 number of entries on line; value = 1-8 nn8 number of entries on line; value = 1-6 nn6 nn4 number of entries on line; value = 1-4 nn2 number of entries on line; value = 1 or 2 nn 1 number of entries on line; value = 1 atom number; value = (1 to number of ataaa oms) The format for the properties included in this block follows. The format shows one entry; ellipses (...) indicate additional entries. Chorge [GI M cHGnn8 aaa vw ...
S
vw
-15 to +15, Default of 0 = uncharged atom. When present, this property super-
MDL COMPUTER CHEMICAL STRUCTURE FILEFORMATS sedes all charge and radical values in the atom block (Section 2.2), forcing a 0 charge on all atoms not listed in an M CHG or M RAD line.
Radical [GI M RADnn8 aaa vw ... w Default of 0 = no radical; 1 = singlet (:); 2 = doublet ( A ) ;3 = triplet ( A A ) . When present, this property supersedes all charge and radical values in the atom block (section 2.2), forcing a 0 (zero) charge and radical on all atoms not listed in an M CHG or M RAD line. Isotope [GI M 1sonn8 aaa vw ... vw Absolute mass differing from natural abundance within the range -18 to +12. When preswrt, this property supersedes all isotope values in the atom block. Default (no entry) is natural abundance. Ring Bond Count [Q] M RBDnn8 aaa wv ... Number of ring bonds allowed: default of vw 0 = off; -1 = no ring bonds (rO); -2 = as drawn (r*); 2 = (r2); 3 = (r3); 4 or more = (r4). Substitution Count [Q] M susnn8 aaa vw ... vw Number of substitutions allowed: default of 0 = off; -1 = no substitution (SO); -2 = as drawn (s*); 1,2, 3, 4, 5 = (sl) through (s5); 6 or more = (s6). Unsaturated Atom [Q] M u ~ s n n 8aaa vw vw At least one multiple bond: default of 0 = off; 1 = on. Link Atom [Q] M ~ 1 ~ n aaa n 4 vw bbb ccc ... vw,bbb,ccc Link atom (aaa) and its substituents, other than bbb and ccc, may be repeated 1 to vw times, (wv 2 2). Atom List [Q] M ~uaaann5 e 11112222333344445555 aaa Atom number; value = (1 to #atoms). nn5 Number of entries on line. Exclusion, value is T if a ‘NOT’ list, F if a e normal list. 111l... Atom symbol of list entry in field of width 4. Note: This line contains the atom symbol rather than the atom number used in the atom list block. Any information found in this item supersedes information from the atom list block. Attachment Point [Rg] M ~ p o n n 2aaa vw ...
...
vw
Indicates whether atom aaa of the Rgroup member is the first attachment point (wv = l), second attachment point (vw = 2), both attachment points (vw = 3); default of 0 = no attachment. Rgroup Label Location [Rg]
J . Chem. InJ Comput. Sci., Vol. 32, No. 3, 1992 247 M
RGPnn8 aaa rrr
,,.
rrr
Rgroup number, value from 1 to 32, labels position of Rgroup on root. Rgroup Logic, Unsatisfied Sites, Range of Occurrence [Rg] M Lwnnl rrr iii hhh ooo rrr iii hhh
OOO
Rgroup number, value from 1 to 32. Number of another Rgroup which must only be satisfied if rrr is satisfied (IF rrr THEN iii). RestH property of Rgroup rrr; default is 0 = off, 1 = on. If this property is applied (on), sites labeled with Rgroup rrr may only be substituted with a member of the Rgroup or with H. Range of Rgroup Occurrence required: n = exactly n; n-m = n through m;>n = greater than n; 0. Any noncontradictory combination of the preceding values is also allowed; for example: 1, 3-7, 9, >11.
Sgroup Type3 [Ssl M s ~ y n n 8SSS ttt
...
Sgroup number. SUP = superatom; MUL = multiple group; SRU = SRU type; MON = monomer; MER = mer ty-pe; COP = copolymer; CRO = crosslink; MOD = modification; GRA = graft; COM = component; MIX = mixture; FOR = formulation; DAT = data Sgroup; ANY = anypolymer; GEN = generic. Note: For a given Sgroup, an STY line giving its type must appear before any other line that supplies information about it. For a data Sgroup, an SDT line must describe the data field before the SCD and SED lines that contain the data (see Data Sgroup Data below). When a data Sgroup is linked to another Sgroup, the Sgroup must already have been defined. (See Figure 5 . ) Sgroup Subtype [Sg] M s s ~ n n 8sss ttt ...
sss ttt
ttt
Polymer Sgroup subtypes: ALT = alternating; RAN = random, BLO = block.
Sgroup Labels [Sg] M s ~ s n n 8SSS VW -.. vw Unique Sgroup identifier (integer label from 1 to 512). Sgroup Connectiuity [Sg] M sc~nn8 sss ttt ... ttt H H = head-tehead; HT = head-to-tail; EU = either unknown. Left justified. Sgroup Expansion [Sg] M SDS EXPnl5 SSS ...
sss Sgroup index of expanded superatoms. Sgroup Atom List [Sg] M SAL sssnl5 aaa ... aaa Atoms in Sgroup sss. Sgroup Bond List [Sg] M SBL sssnl5 bbb ... bbb Bonds in Sgroup sss (For data Sgroups, bbb’s are the containment bonds, for all
DALBYET AL.
248 J. Chem. InJ Comput. Sci., Vol. 32, No. 3, 1992
h... ii
Jran
I GI
GIIUfCS-IIlOl791104120 1 7
6
0
0
2.9463 1,6126 0.2119 0.21B9 -1.0548 -2.3815
-3.9246 1 2 1 0 2 3 1 0 3 4 1 0 3 6 1 0
0
0
0.3489 1.1189 0.3489 -1.1911 1.1190 0.3490 1.1470 0 0 0 0 0 0 0 0 IN
0.00374
0.00100
16 V2000 0.0000 0 0 0.0000 C 0 0 0.0000 C 0 0 0.0000 C1 0 0 0.0000 C 0 0 0.0000 C 0 0 0.0000 * 0 0
0 0 3 0
0 0 0
0 0 0 0 0
0 0
0
0 0 0 0 0 0 0
-
.
0
0 0
0 0 0 0
I
u a t a r of e n t r i e s on l i n e
Figure 5. Connection table organization of an Sgroup structure.
other Sgroup types, bbb's are crossing bonds'.) Multiple Group Parent Atom List [Sg] M SPA sssnl5 aaa ,.. Atoms in paradigmatic repeating unit of multiple group sss. Sgroup Subscript [Sg] M SMT sss m ... aaa
Text of subscript for Sgroup sss. For multiple groups, m ,.. is the text representation of the multiple group multiplier. For superatoms, m ... is the text of the superatom label. Sgroup Correspondence [Sg] M CRS sssnn6 bbl bb2 bb3 m...
bbl, bb2
Crossing bonds that share a common bracket. bb3 Crossing bond in repeating unit that connects to bond bbl, Sgroup Display Information [Sg] M SDI sssnn4 xl yl x2 y2 xl ,yl,x2,y2 Coordinates of bracket endpoints (FORTRAN format 4F10.4). Superatom Bond and Vector Information [Sg] M SBV sss bbl xl yl Bond connecting to contracted superatom. Vector for bond bbl connecting to contracted superatom sss (FORTRAN format 2F10.4). Data Sgroup Field Description [Sg] M SDT sss fff ...fffgghhh...hhhiijjj ... bb 1 xl, yl
sss
f...
88
Index of data Sgroup. 30 character field name (in MACCS-I1 no blanks, commas, or hyphens). Field type (in MACCS-11: F = formatted, N = numeric, T = text).
20 character field units or format. Nonblank if data line is a query rather than Sgroup data: MQ = MACCS-I1 query, IQ = ISIS query, PQ = program name code query. Data relation operator (blank for MACCSj... 11). Data Sgroup Display Information [Sg] M SDD sss xxxxxxxxxxyyyyyyyyyy eeefgh i jjj kkk 11 m noo sss Index of data Sgroup. x, Y Coordinates (2F 10.4). eee (Reserved for future use.) f Data display: A = attached, D = detached. Absolute, relative placement: A = absolute, g R = relative. h Display units: blank = no units displayed, U = display units. 1 (Reserved for future use.) Number of characters to display (1 ...999 or jjj ALL). Number of lines to display (unused, always kkk 1). 11 (Reserved for future use.) m Tag character for tagged detached display (if nonblank). n Data display DASP position (1 ...9). 00 (Reserved for future use.) Data Sgroup Data [Sg] M SCD sss d.. M SED sss d.. d... Line of data for data Sgroup sss (69 chars/line, columns 12-80) Note: A line of data is entered as text in 69-character substrings. Each SCD line adds 69 characters to a text buffer (starting with successive SCDs at character positions 1, 70, and 139). Following zero or more SCDs must be an SED, which may supply a final 69 characters. The SED initiates processing of the buffered line of text: trailing blanks are removed and right truncation to 200 characters is performed, numeric and formatted data are validated, and the line of data is added to data Sgroup sss. Left justification is not performed. A data Sgroup may have more than one line of data, so more than one set of SCD and SED lines can be present for the same data Sgroup. The lines are added in the same order in which they are encountered. If 69 or fewer characters are to be entered on a line, they may be entered with a single SED not preceded by an SCD. On the other hand, if desired, a line may be entered on up to 3 SCDs followed by a blank SED that terminates the line. The set of SCD and SED lines describing one line of data for a given data Sgroup must appear together, with no intervening lines for other data Sgroups' data. Sgroup Hierarchy Information [Sg] M s p u d ccc ppp ... CCC Sgroup index of the child Sgroup. Sgroup index of the parent Sgroup (ccc and PPP ppp must already be defined via an STY line
J . Chem. In5 Comput. Sci., Vol. 32, No. 3, 1992 249
MDL COMPUTER CHEMICAL STRUCTURE FILEFORMATS
The Identification Line. The 3D feature identification line has the following format: M $3~tttcccaaa...aaa ttt. ..ttt
30 0u.y w)rCCS-II10179l09553D 1
e
7 o o o o 1.0~52 0.2892 -0.4562 0.6578 -1.4ei3 0.3687 -1.0252 -0.2e92 0.4562 -0.6578 1.4813 -0.3687 -0.1989 4.1401 4.6453 0.5081 1 2 1 0 0 0 2 3 2 0 0 0 3 4
5 6 7 W
W
n I4
n M
n
n n
4 5
1 2
0 0
0 0
-e
$10 9 n 530 -3 W 530 9 W $30-16 W 130 12 W $30-12 I! $30 1 2 W $30 -9 W 130 1
1.3456 N 1.7417 C
] Excbsionrpher8
3 1
-12 -13
0.00000
l e vzooo 1.1122 c o o 1.3156 c o o 0.2033 c o o -1.1122 c o o -1.3156 c o o -0.2033 c o o 0 0
0 0
0
1--1
Counts line
o o o
o o o o o o o o o o o o o o o o o o o o o 0 0
0 0
0 0
0 0
J 1
1
0 0
6 1 0 0 0 1 2 0 0 0 e 1 0 0 0 $30 7 6 $30 -7 130 3 530 6 4 2 s30 -5 13 $30 6 0.0000 S3D 1 2 3 4 5 130
1.00000
where the variables represent ttt 3D feature type ccc Color number (an internal MDL number which is terminal dependent) 3D feature name (up to 32 characters) aaa...aaa Text comments (up to 32 characters) used ttt ...ttt by MDL programs (see section 2.7.3 below) The 3D feature t w e identifiers are -1 Point def&ed by 2 points and a distance in angstroms (A) -2 Point defined by 2 points and a percentage -3 Point defined by a point, a normal line, and a distance -4 Best fit line defined by 2 or more points Plane defined by 3 or more points. (A best fit plane -5 if more than three points) Plane defined by a point and a line -6 Centroid defined by points -7 Normal line defined by a point and a plane -8 Distance defined by 2 points and a range (A) -9 Distance defined by a point, a line, and a range (A) -10 -1 1 Distance defined by a point, a plane, and a range
*"Bordbhxk
Cnb bloc*
6
i
1
10 6
I1 12 1 IO 9
-2.0000 0
1.5000
7
75.0000
105.0000
3
9
4.4000
5.7000
Point
Dis'ame
J Figure 6. Connection table organization of a 3D query. n END
prior to encountering this line). Sgroup Component Numbers [Sg] M sNcnn8 sss ooo ... sss Index of component Sgroup. Integer component order (1...256). 000 3 0 Feature Properties [3D] M $3~nnn M $3D... See below for information on the properties block of a 3D MOLfile. These lines must all be contiguous. End of Block. M END
This entry goes at the end of the properties block and is required for MOLfiles which contain a version stamp in the counts line. 2.7. The Properties Block for 3D Features [3D].For each 3D feature, the properties block includes one 3D features count line one or more 3D features detail lines The characters M $ 3 appear ~ at the beginning of each line describing a 3D feature. The information for 3D features starts in column 7. 2.7.1. 3D Features Count Line. The first line in the properties block is the 3D features count line and has the following format: M $3~nnn where nnn is the number of 3D features on a model. 2.7.2. 3D Features Detail Lines. The lines following the 3D features count line describe each 3D feature on a model. Each 3D feature description consists of an identification line and one or more data lines. The identification line is the first line and contains the 3D feature's type identifier, color, and name. Each data line describes the construction of the 3D feature.
-14 -1 5 -16 -17
(A)
Angle defined by 3 points and a range (in deg) Angle defined by 2 intersecting lines and a range (in deg) Angle defined by 2 intersecting planes and a range (in deg) Dihedral angle defined by 4 points and a range (in deg) Exclusion sphere defined by a point and a distance
(A)
Fixed atoms in the model Positive integer indicates atom or atom-pair data constraints The Data Line. The 3D feature defines the data lime format. Each 3D object is treated as a pseudoatom and identified in the connection table by a number. The 3D object numbers are assigned sequentially, starting with the next number greater than the number of atoms. The data line formats for the 3D feature types are type description of data line -1 The data line for a point defined by 2 points and a distance (A) has the following format:
nnn
M
-2
$3~iiijjjddddd.dddd
where ... the variables represent: 111 ID number of a point ID number of a second point jjj ddddd.dddd Distance from first point in direction of second point (A), 0 if not used. The following example shows POINT-1 created from the atoms 1 and 3 with a constraint distance of 2 A. The first line is the identification line. The second line is the data line. M $3D-1 4POINT-1 M $3D 1 3 2.0000 The data line for a point defined by two points and a percentage has the format: M
$3~iiijjjddddd.ddddd
DALBY ET AL.
250 J . Chem. Inf. Comput. Sci., Vol. 32, No. 3, 1992
-3
where the variables represent: iii ID number of a point jjj ID number of a second point ddddd-dddd Distance (fractional) relative to distance between first and second points, 0 if not used. The data for a point defined by a point, a normal line, and a distance (A) has the format:
-6
M $3~iiilll
where the variables represent: iii ID number of a point 111 ID number of a line The following plane is defined by the point 1 and the plane 16. The first line is the identification line. The second line is the data line. M $3D -6 3 PLANE-1
M $3~iiilllddddd.dddd
-4
where ... the variables represent: ID number of a point 111 ID number of a normal line 111 Distance (A), 0 if not used. ddddd.dddd Note: For chiral models, the distance value is signed to specify the same or opposite direction of the normal. The data line for a best fit line defined by 2 or more points have the following format:
M
-7
-5
$3DPPP
...
M $3~iiijjj zzz
... where the variables represent: Number of points defining the PPP centroid ... 111 Each iii, jjj, and zzz is the ID number of an item in the jjj model that defines the ... centroid (to maximum of 20 zzz items per data line). The following centroid, ARO-CENTER, is defined by three items 6, 8, and 10. The first line is the identification line. The second and third lines are the data lines. M $31, -7 1 ARO-CENTER $3D 3 M $3D 6 8 1 0 M
-8
The data line for a normal line defined by a point and a plane have the following format: M $3~iiijjj
where ... the variables represent: 111 ID number of a point ID number of a plane jjj The following normal line, ARO-NORMAL, is defined by the point 1 4 and the plane 15. The first line is the identification line. The second line is the data line. M $3D -8 1 ARO-NORMAL
... where the variables represent: Number of points defining the PPP plane Deviation (A), 0 if not used. ttttt.tttt Each iii, jjj, and zzz is the ID iii number of an item in the jjj model that defines the ... plane (to maximum of 20 zzz items per data line). The following plane is defined by three points. The first line is the identification line. The second and third lines are the data lines. M $3D-5 4 PLANE-2 M $3D 3 M $3D 1 5 1 4
$3D 1 1 6
The data lines of a centroid defined by points have the following format: M
... where the variables represent: Number of points defining the PPP line Deviation (A), 0 if not used. ttttt.tttt ... Each iii, jjj, and zzz is the ID 111 number of an item in the jjj model that defines the line ... (to maximum of 20 items per zzz data line). The following line is defined by the four points 1, 14, 15, and 1 9 and has a deviation of 1.2 A. The first line is the identification line. The second and third lines are the data lines. M $3D-4 2N-TO-AROM M $3D 4 1.2000 M $3D 1 1 4 1 5 1 9 The data lines for a plane defined by three or more points (a best fit plane if more than three points) have the following format:
The data line for a plane defined by a point and a line has the following format:
M
-9
$3D 1 4 15
The data line for a distance defined by two points and a range (A) has the following format:
$3~iiijjjddddd.ddddzzzzz.zzzz where the variables represent: iii ID number of a point jjj ID number of a second point ddddd.dddd Minimum distance (A) zzzzz.zzzz Maximum distance (A) The following distance, L,is between items 1 and 14 and has a minimum distance of 4.9 A and a maximum distance of 6.0 A. The first line is the identification line. The second line is the data line. M
MDL COMPUTER CHEMICAL STRUCTURE FILEFORMATS M $3D-9
-10
$3~iiilllddddd.ddddzzzzz.zzzz
$3~iiijjjddddd.ddddzzzzz.zzzz
where the variables represent: iii ID number of a point ID number of a plane jjj ddddd.dddd Minimum distance (A) zzzzz.zzzz Maximum distance (A) The data line for an angle defined by three points and a range (in deg) has the following format: M
-1 3
M
where ... the variables represent: ID number of a point 111 ID number of a line 111 ddddd.dddd Minimum distance (A) zzzzz.zzzz Maximum distance (A) The data line for a distance defined by a point, a plane, and a range (A) has the format: M
-12
6L
M $3D 1 14 4.9000 6.0000 The data line for a distance defined by a point, a line, and a range (A) has the format: M
-11
J. Chem. In$ Comput. Sci., Vol. 32, No. 3, 1992 251
-16
$3~iiiuuuaaaddddd,dddd M S3~aaabbb zzz
...
where ... the variables represent: 111 ID number of the center of the sphere uuu 1 or 0. 1 means unconnected atoms are ignored within the exclusion sphere during a search; 0 otherwise. aaa Number of allowed atoms ddddd.dddd Radius of sphere (A) iii Each iii, jjj, and zzz is an jjj ID number of an allowed atom (to maximum of 20 zzz items per data line). The following exclusion sphere is centered on atom 24, has a radius of 5, and allows atom 9 within the sphere. The first line is the identification line. The second and third lines are the data lines. M $ 3 ~ - 1 6 7 EXCLSPHERE
where ... the variables represent: 111 ID number of a point jjj ID number of a second point kkk ID number of a third point ddddd.dddd Minimum degrees zzzzz.zzzz Maximum degrees The following angle, THETA^, is defined by the three points 5, 17, and 16. The minimum angle is 80° and the maximum is 105O. The first line is the identification line. The second line is the data line. M $ 3 ~ 1 25 THETA1 M $31, 5 17 16 80.0000 105.0000 The data line for an angle defined by two lines and a range (in deg) has the following format:
..I
M
-1 4
S3~iiijjjddddd.ddddzzzzz.zzzz where the variables represent: iii ID number of a plane ID numbers of a second plane jjj ddddd.dddd Minimum degrees zzzzz.zzzz Maximum degrees The data line for a dihedral angle defined by four points and a range (in deg) has the following format: M
-1 5
where ... the variables represent: 111 ID number of a point jjj ID number of a second point kkk ID number of a third point 111 ID number of a fourth point ddddd.dddd Minimum degrees zzzzz.zzzz Maximum degrees D I H E D ~is defined by the items 7,6,4, and 8 with minimum and maximum angles of 45O and 80°, respectively. The first line is the identification line. The second line is the data line. M $3~-15 5 DIHEDl M $3D 7 6 4 8 -45.0000 80.0000 The data lines for an exclusion sphere defined by a point and a distance (A) have the following format: M
$3~iiijjjkkkddddd.ddddzzzzz.zzzz
$3~lllmmmddddd.ddddzzzzz.zzzz where the variables represent: 111 ID number of a line ID number of a second line mmm ddddd-dddd Minimum degrees zzzzz.zzzz Maximum degrees THETA^ is defined by the lines 27 and 26 with maximum and minimum angles of 45O and 80°. The first line is the identification line. The second line is the data line. M $ 3 ~ 1 35THETA2 M $3D 27 26 45.0000 80.0000 The data line for an angle defined by two planes and a range (in deg) has the following format:
$3~iiijjjkkklllddddd.ddddzzzzz.zzzz
$3D 24 0 1 5.0000 9 The data lines of the fmd atoms have the following format: M
M $3D
-1 7
M $3DPPP
...
M $3~iiijjj zzz
... where the variables represent: PPP Number of fixed points iii Each iii, jjj, and zzz is an jjj ID number of a fixed atom ... (to maximum of 20 items per zzz data line) The following example shows four fixed atoms. The first line is the identification line. The second and third lines are the data lines. M $3~17 M $3D 4 M $3D 3 7 12 29
DALBY ET AL.
252 J. Chem. In& Comput. Sci., Vol. 32, No. 3, 1992
~
GS~CCS-I110169115J(20 1 6
1
0
0
1
-0.6622 0.6220 -0,7207 -1,8622
0.5342
-0.3000 2.OBll -0.3695
-1.6037
0.6220
0.4144
1.9464
U
n u 2
1
Figure 7. Illustration of parity.
2.7.3. 3D Data Constraints [3D][Q]. A positive integer is used as a type identifier to indicate an atom or atom-pair data constraint. Two lines are used to describe a data constraint. The lines have the following format: M $3~nnncccaaa ...aaabbbbbbbbpppppppppsss...sss M $3~iiiJjjddd ...ddd where the variables represent: nnn Database-field number ccc Color Database-field name (up to 30 charaaa...aaa acters) /BOX = box-number (source of data) bbbbbbbb (up to 8 characters) /DASP = nl,n2 where nl and n2 are ppppppppp digits from 1 to 9 (data size and position) (up to 9 characters) /DISP = 3DN (name), 3DV (value), sss...sss 3DQ (query), NOT (no text). First three in any combination to maximum total of 15 characters ... ID number of an atom 111 ID number of a second atom for atomjjj pair data, 0 if data is atom data Data constraint (based on format from ddd ...ddd database) (up to 64 characters) The following example shows a numeric data constraint for the field CNDO.CHARGE on atom 12. The first line is the identification line. The second line is the data line. M $3D 7 0CNDO.CHARGE M $3D 12 0 -0.3300 -0.1300 The following example shows a numeric data constraint for the field BOND.LENGTH on the atom pair 1 and 4. The first line is the identification line. The second line is the data line. M $3D 9 0BOND.LENGTH M $3D 1 4 2.0500 1.8200 The following example shows a data constraint allowing any charge value for the field CHARGE on all the atoms. The first line is the identification line. The second line is the data line. M %3D 12 0 CHARGE M $3D999 0 @ 2.8. Stereo Notes. Parity is illustrated in Figure 7. A bond attached at a stereo center is marked up or down to define the configuration. Number the atoms surrounding the stereo center with 1, 2, 3, and 4 in order of increasing atom number (position in the atom block) (a hydrogen atom should be made number 4). View the center from a position such that the bond
l 1
Z 3
I
,
l 0 0 0 1 1 0 0 1 . 0. 0. 0
2
5
2
0
2 6 CHG 150 END
1 2
0
1
0 I 3
0.00366
1
I-block
&itno /~tw
0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0
.;-.JL”
0 1
0
Bond&&
0
0
0.00000
3 v2000 0.0000 C 0 0 2 0.0000 C 0 0 0 1 0 0 0.0000 C 0.0000 N 0 3 0 0 0 0 0.0000 0 0 5 0 0.0000 0
0
6
-1
13
P “ s a M-
Figure 8. MOLfile organization of alanine.
connecting the highest-numbered atom (4) projects behind the plane formed by atoms 1,2, and 3. Note: In Figure 7, atoms 1,2, and 4 are all in the plane of the paper and atom 3 is above the plane. Sighting toward atom number 4 through the plane (1,2,3), you see that the three remaining atoms can be arranged in either a clockwise or counterclockwise direction in ascending numerical order. The Ctab lists a parity value of 1 for a clockwise arrangement at the stereo center and 2 for counterclockwise. A center with an Either bond has a parity value of 3. An unmarked stereocenter is also assigned a value of 3. The example at the top of Figure 7 has a parity value of 2. 3. MOLECULE FILES [MOLFILES]
A MOLfile consists of a header block and a connection table. Figure 8 shows the structure and corresponding MOLfile of alanine. The format for a MOLfile is Header block This identifies the MOLfile with the molecular name, user’s name, program, date, and other miscellaneous information and comments Ctab block (described in Section 2) The detailed format for the header block follows. 3.1. The Header Block. Molecule name. This line is unformatted, but like Line I all other lines in a MOLfile may not extend beyond column 80. User’s first and last initials (I), program name Line 2 (P),date/time (M/D/Y,H:m), dimensional d e s (d), scaling factors (S,s), energy (E) if modeling program input, internal registry number (R) if input through MDL form. This line has the format: IIPPPPPPPPHHDDrYH~dSS~~~aa~~~~~EECEEEECEEEERR~ 1FORTm:
A2
A0
AZIZ
F10.5
112.5
I6
1
A blank line can be substituted for line 2.
Line 3
A line for comments. If no comment is entered, a blank line must be present. 4. RGROUP QUERY FILES [RGFILES]
The format of an RGfile (Rgroup query file) is shown in Figure 9. Lines beginning with $ define the overall structure of the Rgroup query; the MOLfile header block is embedded in the Rgroup header block. In addition to the primary connection table (Ctab block) for the root structure, a Ctab block defines each member within each Rgroup. Figure 10 shows the structure and corresponding RGfile of an Rgroup query.
J . Ckem. Inf. Comput. Sci., Vol. 32, No. 3, 1992 253
MDL COMPUTER CHEMICAL STRUCTURE FILEFORMATS S W L REV 1 d a t e / t i m $MOL SHDR [HOLfal. H e a d e r Block ..sI
S.CtIoI)
$END HDR SCTIB [Ctab Block ..sI SEND C I A B
~ r r 1rh.r.
S.ctlOn
rrr
-
DI-ODOI R u % L
1.2 CYCU-Cf
-
2)
3)
-
~ - I 1 0 6 0 3 9 0 1 ~ 2 ~12 D 0.0033) 0 0 1 -0.0238 -0.7702
12 I2
n a m , pgm I n f o , sonunnti
1 2.6974 2 1 0'
0 0.7634 0
710
0
0
0
0.0000 0
0
0
0
0' 0
count + atoms + bonds + IlSfS + proprl
I
O
0
F w 9. Format of an RGfile. The Rgroup format (*r) is repeated up to a maximum of 32. The Rgroup member format (*m) is repeated up to a maximum of 255 total atoms and bonds in the Rgroup.
- 183.0
> 2 1 01-10-80
I
> 25 > 55
(W-089741
CBOIL1NG.POINTD
DT12
> D T ~ Z 55 >
(W-00941
fROM M C H I V E S
A [Data] value may extend over multiple lines containing up to 200 characters each. A blank line terminates each data item. A line containing four dollar signs ($$$$) terminates each complete data block describing a compound. A DATfile (data file) is an SDfile with no [MULfile] descriptions or $$$$ delimiters. The [Data Header] in a DATfile must include either an external or internal registry number in addition to a field name or number. 6. REACTION FILES (RXNFILES]
RXNfiles contain structural data for the reactants and products of a reaction. Figure 13 shows the format for a RXNfile. Figure 14 shows a simple reaction and the corresponding RXNfile. 6.1. Header Block. Line 1 $RXN in the first position on the line identifies the file as a reaction file. Line 2 A line which is always blank.
DALBYET AL.
254 J . Chem. InJ Comput. Sci., Vol. 32, No. 3, 1992
IRDfi 1e Header] IMoleculo or R e a c t i o n I d o n t a f i e r l [ D d t a - f i e l d Idmnti fierl [Data]
0
E
0
11
CHI
l
*r
/c,
Figure 15. Format of an RDfile. The molecule or reaction format (*r) is repeated for each molecule or reaction. The data item format (*d) is repeated for each data item in a molecule or reaction.
CI
l C 3 3 C c 3.3213 -3.2318
3 3 0 3.3000 c
3
c
2
-1.3146
-3.9623
C.0003 C
3
C.3231 :.613i : 2 . c
1.4181
O.OCO0 0 3 C C O P C:
0
C C C
C P
3 3 3
3 3 3
3 3 3
P 0
-:0 3 > 8
0
3
0
3
0
1
3
:
2
>
3 3 0
C
3
:
3
c
P
1 3
0
MOL!* fiat
0 0
3
2
3
0
3
S
r o ~ u ~ t
lor REACCSB:
1
REACCS81:0:191:3412D 1
4 L
6
3
C
3
1.3335
:
2
:
:
;
2
3
C
P
c
7689
-1,5415
C.7689
3
c
3
c
3.0100
c
3 3 0
0
0
0
0
0
C
5
P
0
o
o
o
o
o
o
6
o
3
3
3
3
0
0
0
0
3 3 C
3 0
0 P
0
c
3 3
3
D
3 3 1 c
7 0 8 0 9 0 10
0 0
0
P P
0 0
P
3
0
3
o
o
P 0
0
3 3
C
0
u x I * lor
5
1 4 SMOL
:
3
0
0
c
0
3
0
0
0
0
0.00000
315
0.00000
8
I
i
6
6
0
3
0
0
0
5
6
2
0
0
0
2
.
.
.
0
0
REACCSE1101791104120 1
9 3 3 3 C P -0,531: -0.138, 3.6321 -1 8 6 2 6
-:6 8 3 1
-J.2278
6
9
2
2
3
:
3
O
C
9
O
C.0OPC 0.0000
c 0 : 3
0
0
0 0
: C
C
5 6
3
0
3 2
0
3
0 0
3
P
0
C
3.3CCC
C
0
3
3
3
C
C
C
C
P
10
3
0
3
0
I
0
9
.
0
0
.
0
0
.
0
0.00314
0
0
0
0,00000
I
J
;
The program name and version (P), date/time (M/D/Y,H:m), and reaction registry number (R). This line has the format:
255
4
3
0
0
0
0
0
1
4
l
0
0
0
0
.
.
.
0
0.00311 0
0
i
1 1 1
Zur Rura Fir-Chlm,
SDTYPE RXN: VAUIATION (1I :CATALYST (11 :REGNO SDATUM 5WMl SHIRES 688 REACCS8110119110412D 1
I
second reactant
MOLfile for Pdu@
0
6 9 2 0 0 0 2 SDTYPE RXN :VAUIATION 11I : RXNTEXT i l l SDATUM C r C 1 3 SDTYPE RXN:VARIATION 111 :LITTEXTill SDATUM A G Repln. Y Y N . s k a x O v - Z . d y m s k i i , 44, p.2360, 1974
Figure 14. Simple reaction and its corresponding RXNfile.
Line 3
I denlilier
0.00371
SMOL
9
MOL,,^^^^^
0
4
REACCSBll01191~0412D 1
2
c
3
P
3.CCOO C 3.CCOP C
-0.7689
:
3
3.0030
1.5411
:
3
3.7000 : 3.oono c
-P.7689
1.3335 c.oc:3 3 3003 -:3 3 3 1 -: 3 1 3 1
3.00380
E
3.33100
3.CO33:
1439
1017911041
2 1 SMCL
2
REliCI8::3l’9:131:1D
] RDfile header
SRDFILZ SOATM 10/11/91 10.41 SRFHT S R I E G 1439 S U N
0.00000
Data block lor reaction
0
0
5DTYPE RXN:VI\RlATIONI1l:PRODUCTill:YlELD SDATOM 1 0 . 0
.
.
4
SWMl SRIREG 841C SRXN
REACCS81
1011911041
Stan of next record
8410
PPPPPPPPMHDDYYHH“RRRRRRRR
IFCRlRRN:
A14
I8
I
A blank line can be substituted for line 3. A line for comments. If no comment is entered, a blank line must be present. 6.2. Reactants/Products. A line identifying the number of reactants and products in that order. The format is
Line 4
rrrPPP where the variables represent: rrr Number of reactants PPP Number of products 6.3. MOLfde Blocks. A series of blocks, each starting with $MOL as a delimiter, giving the MOLfde for each reactant and product in turn. The MOLfile blocks are always in the same order as the molecules in the reaction; reactants first and products second.
7. REACTION-DATA FILES [RDFILES] An RDfile (reaction-data file) consists of a set of records. Each record defines a molecule or reaction, and its associated data. Figure 15 shows the format for an RDfile. An example RDfile incorporating the RXNfile described in Section 6 is shown in Figure 16. Each logical line in an RDfile starts with a keyword in column 1 of a physical line. One or more blanks separate the first argument (if any) from the keyword. The blanks are ignored when the line is read. After the first argument, blanks are significant. An argument longer than 80 characters breaks at column 80 and continues in column 1 of the next line. (The argument may continue on additional lines up to the physical limits on
I
2 1 SMOL
Figure 16. Example of a reaction RDfile incorporating the RXNfile described in Section 6.
text length imposed by the database.) The RDfile must not contain any blank lines except as part of embedded MOLfiles, RXNfiles, or data. An identifier separates records. 7.1. RMile Header. Line 1 $RDFILE 1: The [Rdfile Header] must occur at the beginning of the physical file and identifies the file as an RDfile. The version stamp “1” is intended for future expansion of the format. Line 2 $DATM: Date/time (M/D/Y, H:m) stamp. This line is treated as a comment and ignored when the program is read. 7.2. Molecule and Reaction Identifiers. A [Molecule or Reaction Identifier] defines the start of each complete record in an RDfile. The form of a molecule identifier must be one of the following: $MFMT [$MIREG internal-regno [ $MEREG external-regno]] embedded MOLfile $MIREG internal-regno $MEREG
external-regno
where $MFMT defines a molecule by specifying its connection table as a MOLfile $MIREG internal-regno is the internal registry number (sequence number in the database) of the molecule
255
J . Chem. If. Comput. Sci. 1992,32, 255-260 $MEREG external-regno is the external registry number of the molecule (any uniquely identifying character string known to the database, for example, CAS number) Square brackets ([ 1 ) enclose optional parameters. An embedded MOLfile (see Section 3) follows immediately after the $MFMT line. The forms of a reaction identifier closely parallel that of a molecule: $RFMT [$RIREG internal-regno [$REREG external-regno]] embedded RXNfile $PCRXN [$RIREG internal-regno [$REREG external-regno]] embedded
CPSS
RXNfile [CP]
For fields whose data type is "molecule regno*, the datum must specify a molecule and, with the exception noted below, use one of the formats defined above for a molecular identifier. For example $DATUM $MFMT embedded MOLfle $DATUM $MEREG external-regno $DATUM $MIREG internal-regno In addition, the following special format is accepted $DATUM molecule-identifier Here, molecule-identifier acts in the same way as externalregno in that it can be any text string known to the database that uniquely identifies a molecule. (It is usually associated with a data field different from the external-regno.)
$RIREG
internal-regno
8. CONCLUSION
$REREG
external-regno
A series of chemical structure file formats built up from one or more connection table blocks have been described. These formats allow for the storage and transfer of chemical structure information used typically for search queries, individual structures, or entire databases. It is hoped that these file formats will see even wider use.6
where
SRFMTdefines a reaction by specifying its description as a RXNfile and $PCRXN [CP] defines a reaction by specifying its description as a CPSS-style RXNfile $RIREG internal-regnois the internal registry number (sequence number in the database) of the reaction $REREG external-regnois the external registry number of the reaction (any uniquely identifying character string known to the database) Square brackets ([ 1 ) enclose optional parameters An embedded RXNfile (see Section 6) follows immediately after the $RFMT line, and an embedded CPSS-style RXNfile follows immediately after the $PCRXN [CP] line 7.3. Data-Field Identifier. The [Data-field Identifier] specifies the name of a data field in the database. The format is $DTYPE field name 7.4. Data. Data associated with a field follows the field name on the next line and has the form $DATUM datum
The format of datum depends upon the data type of the field as defined in the database. For example: integer, real number, real range, text, molecule regno.
REFERENCES AND NOTES ( I ) The various CTfile formats have been programmed, tested, and documented by a large number of people at MDL over the years. Besides the authors of this paper, these include S. Anderson, J. Barstow, R. Blackadar, T. A. Blackadar, R. Briggs, R. E. Carhart, B. D. Christie, J. D. Dill, G.Freitas, R. J. Greenberg, A. J. Gushurst, D. Henry, R. Hofmann, D. Horner, A. Hui, T. E. Moock, D. G. Raich, J. Steele, W. T. Wipke, and K. Wiseman-Sleeter. (2) Wipke, W. T.; Nourse, J. G.; Moock, T. Generic Queries in the MACCS System. In Computer Handling of Generic Chemical Structures; Barnard, J. M., Ed.; Gower: Hampshire, 1984; pp 167-178. (3) Gushurst, A. J.; Nourse, J. G.;Hounshell, W. D.; Leland, B. A.; Raich, D. G. The Substance Module: The Representation, Storage, and Searching of Complex Structures. J . Chem. InJ Comput. Sci. 1991, 31, 447-454. (4) Moock, T. E.; Christie, B.; Henry, D. MACCS-3D: A New Database System for Three-Dimensional Molecular Models in Chemical Information Systems. In Beyond the Structure Diagram; Bowden, D., Mitchell, E. M., Eds.; Ellis Howard: New York, 1990; pp 42-49. ( 5 ) Moock, T. E.; Nourse, J. G.;Grier, D.; Hounshell, W. D. The Implementation of Atom-Atom Mapping and Related Features in the Reaction Access System (REACCS). In Chemical Structures: The International Language of Chemistry; Warr, W. A,, Ed.; SpringerVerlag: New York, 1988; pp 303-313. (6) For more details and information on future changes, contact Affinity, Molecular Design Limited, 2132 Farallon Drive, San Leandro, CA 94517.
Computer-Aided Molecular Formula Determination from Mass, 'H, and 13C NMR Spectra B. G.DERENDJAEV,* S. A. NEKHOROSHEV, K. S.LEBEDEV, and S.P. KIRSHANSKY Novosibirsk Institute of Organic Chemistry, USSR Academy of Sciences, Novosibirsk, USSR Received March 5, 1991 A computer-aided technique for the determination of the molecular formula of a compound by its mass, I3C,and 'HNMR spectra is suggested. Efficiency of the method has been verified on 81 uunknownsn. It has been shown that in 89% of instances the requested formula is found among the top three candidates of a computer answer, and in 45% of instances the computer
suggests a single formula.
The use of computer systems for structure elucidation of organic compounds from a spectral data set is generally based on a known or assumed molecular This information was obtained by additional experiments (high-resolution mass spectrometry, CHN analysis, chemical analysis, etc.) 0095-2338/92/1632-0255%03.00/0
or is postulated by the researcher from the background of the sample. In the context of our work on a spectral data analysis system,k10we have developed software to determine molecular formulas directly from analysis of the most simple and ac0 1992 American Chemical Society