J. Chem. Inf. Comput. Sci. 1993, 33, 604-608
New Computer Representation for Chemical Structures: Two-Level Compact Connectivity Tables†
Cheng Wentang, Zhang Ying, and Yu Feibai
School of Chemical Engineering, Dalian University of Technology, Dalian 116 012, China
Received January 19, 1993

A new computer representation for organic structures, two-level compact connectivity tables (TLCCT), is presented. A set of algorithms and programs has been developed, comprising an interactive graphics system for entering structure diagrams, the automatic generation of the two-level compact connectivity tables (a basic node connectivity table and an expanded node connectivity table), the canonicalization of the expanded node connectivity table, and an interactive structure search system.

INTRODUCTION

The computer representation of chemical structures has been investigated for a long time and is an important subject in computer chemistry. Many different methods and systems have been proposed; some of them have been accepted by chemists and are widely used in various fields (refs 1-10). In applying computers to chemistry, however, there is still a demand for representation methods that are unique and unambiguous, readily understandable, easy to handle, as compact as possible to minimize computer work space and time consumption, and readily accessible for handling chemical structure information. A good notation system would support research on structure/property correlation, computer-aided molecular design, and so on, and scientists are still working toward newer and better notation systems. After a survey of methods in the literature, a new system, two-level compact connectivity tables (TLCCT), is presented. It is a connection table system combined with linear notation.
The well-known symbols of elements and atomic groups, together with a few specially designed linear codes for ring skeletons, are defined as the basic nodes of the connectivity table. A graphics input system for chemical structures has been developed. From the entered diagram, a basic node connectivity table is generated, which is then processed into a more compact form, the expanded node connectivity table. Five rules and an algorithm for canonicalization are presented to rearrange the expanded nodes, so that a unique and unambiguous internal representation of a chemical structure is obtained.

† This project was supported by the National Science Foundation of China.

DESCRIPTION OF THE NOTATION SYSTEM

(A) Basic Node and Basic Node Connectivity Table. A compact connectivity table system was proposed (ref 11) in order to save memory space and input time while keeping the substructure information accessible for chemical structure manipulation, as follows.

(1) Basic Nodes. Three types of structure fragments commonly used in writing chemical structure formulas (atoms, groups, and ring skeletons) are defined as the basic nodes in the connectivity table.

(a) Atom Nodes. The symbols of the elements in the periodic table are used directly as basic nodes (see Figure 2).
(b) Linear Code Nodes. Chemical groups, such as OH, NO2, COOH, SO3H, CO, SO2, NH, NH2, PO, etc., are used directly as basic nodes (see Figure 3).

Unbranched hydrocarbons are denoted by the general formula "Cn=d1,d2,...#t1,t2,...", where n is the number of carbon atoms and d1,d2,... and t1,t2,... are the location numbers of the double and triple bonds, respectively. For example, 1,3-butadiene is described by "C4=1,3".

Branched hydrocarbons are split into two kinds of fragments: single carbon atoms with three or four branches, and straight carbon chains. For example, 5-methyl-4-propylnonane is described by six nodes as follows:

    C3-C---C-C4
       |   |
       C3  C
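As an illustration of the chain notation above, the following minimal Python sketch parses a code of the form "Cn=d1,d2,...#t1,t2,..." into its carbon count and bond locants. The names `ChainCode` and `parse_chain_code` are hypothetical, and two points are assumptions that the paper does not state: a bare "C" is read as a single carbon atom, and a locant d is taken to mean a multiple bond between carbons d and d+1.

```python
import re
from dataclasses import dataclass, field

# Pattern for "Cn=d1,d2,...#t1,t2,..."; the "=" and "#" parts are optional.
_CHAIN_RE = re.compile(r"^C(\d*)(?:=(\d+(?:,\d+)*))?(?:#(\d+(?:,\d+)*))?$")


@dataclass
class ChainCode:
    n_carbons: int
    double_bonds: list = field(default_factory=list)
    triple_bonds: list = field(default_factory=list)


def _locants(text):
    """Turn "1,3" into [1, 3]; a missing part gives an empty list."""
    return [int(x) for x in text.split(",")] if text else []


def parse_chain_code(code):
    """Parse an unbranched-hydrocarbon node string such as "C4=1,3"."""
    match = _CHAIN_RE.match(code.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a chain code: {code!r}")
    # Assumption: a bare "C" stands for one carbon atom.
    n = int(match.group(1)) if match.group(1) else 1
    doubles = _locants(match.group(2))
    triples = _locants(match.group(3))
    # Assumption: locant d denotes the bond between carbons d and d + 1.
    for locant in doubles + triples:
        if not 1 <= locant < n:
            raise ValueError(f"locant {locant} outside a chain of {n} carbons")
    return ChainCode(n, doubles, triples)


butadiene = parse_chain_code("C4=1,3")
print(butadiene)
```

Keeping the locants as integers rather than as the raw string makes later steps, such as comparing two chain nodes or renumbering from the other end, simple list operations.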
Bridged compounds are denoted by the general formula "Ln=d1,d2,...[b1,b2,...]", where n is the total number of ring members and d1,d2,... indicate the locations of double bonds. The numbers of bridge atoms are given in the square brackets [b1,b2,...], much as in the IUPAC nomenclature for bridged rings. When carbon atoms are replaced by heteroatoms, a notation "ZXx1,x2,..." is inserted into the above formula, where X is the chemical symbol of the heteroatom and x1,x2,... indicate its locations in the ring. For example, bicyclo[2.2.1]hept-2-ene and 2-oxa-7-thia-5-azabicyclo[2.2.2]octane are described by "L7=2[2,2,1]" and "L8ZO2S7N5[2,2,2]", respectively. Ring positions are numbered according to the IUPAC nomenclature for bridged ring compounds. A program was written in C for generating the graph of the bridged ring structure. The difficult task of encoding spiro compounds was tackled by splitting one or more spiro rings into individual atoms according to specially defined rules.

(c) Graphic Nodes. The ring skeletons are encoded by a few linear symbols (some of which are shown in Figure 4). The user does not need to encode or memorize these ring codes; they are generated automatically by the program from the graphic input of the chemical structure.

(2) Basic Node Connectivity Table. A chemical structure can be represented by a basic node connectivity table (BNC table), which consists of four one-dimensional arrays: one stores the basic node strings, and the other three store the connection relationships between basic nodes. For example, the structure of a reactive
dye, as shown in Figure 1a, can be represented by the BNC table in Table I. The BNC table is generated automatically from the chemical structure input.

(B) Expanded Node and Expanded Node Connectivity Table. An expanded node is a complex node encoded by a basic node with two or more branches together with its first layer of adjacent basic nodes. A fragment of the reactive dye, shown in Figure 1b, can be represented by the expanded node A2;1-OH,2-N:N,3-SO3H,6-NH, in which A2 (before the symbol ";") is the basic node representing the naphthalene ring. Behind the symbol ";" there are four adjacent basic nodes, each written in three parts: (1) connecting position number; (2) bond type symbol; (3) adjacent basic node string. An expanded node is thus a kind of substructure that clearly shows the connectivity of a central atom (or group or ring skeleton) with its neighbors. For a ring structure in particular, it denotes not only the kind of ring but also the substitution pattern on the ring.

A chemical structure is then represented by an expanded node connectivity table (ENC table), which consists of four one-dimensional arrays: an expanded node array and three others representing the connectivity of the expanded nodes. The dye structure in Figure 1 is represented by the ENC table shown in Table II. The ENC table is generated automatically by the program from the basic node connectivity table.

0095-2338/93/1633-0604$04.00/0 © 1993 American Chemical Society

Figure 1. (a, top) Reactive dye structure. The position numbers are generated through arbitrary graphic input by the user. (b, bottom) Fragment of the structure of a. [Structure diagrams not reproduced.]

Table I. BNC Table of the Reactive Dye in Figure 1^a

no.   node      C1    C2   CN
1     A         1.2   2    1
2     N:N       1.1   8    1
3     A2        1.4   9    1
4     NH        3.7   2    1
5     AZN246    3.3   4    1
6     NH        3.6   10   1
7     A         3.8   11   1
8     SO3H      5.1   4    1
9     SO3H      5.3   6    1
10    SO3H      5.5   12   1
11    OH        7.1   6    1
12    F

^a Symbols are defined as follows: no., ordinal number of basic nodes (arranged in graphic input order); C1, connecting position array (the digit before "." is the ordinal number of the basic node located at the first end of a bond; the digit after "." is a position number in the ring); C2, connecting position array (the digit before "." is the ordinal number of the basic node located at the second end of a bond; the digit after "." is a position number in the ring); CN, bond type array (the integers 1, 2, 3, 4, 5, 6 denote single, double, triple, ionic, coordinate, and aromatic bonds, respectively).

Table II. ENC Table of the Reactive Dye in Figure 1^a

no.   enode                        N   EC1   EC2   ECN
1     A;1-SO3H,2-N:N,4-SO3H        1   1     2     1
2     N:N;1-A,1-A2                 2   2     3     1
3     A2;3-NH,6-SO3H,7-N:N,8-OH    3   3     4     1
4     NH;1-A2,1-AZN246             4   4     5     1
5     AZN246;1-NH,3-NH,5-F         5   5     6     1
6     NH;1-A,1-AZN246

^a Symbols are defined as follows: no., ordinal number of expanded nodes; enode, expanded node string array; EC1, EC2, two connecting position arrays which store the ordinal numbers of the expanded nodes located at the first and second ends of bonds, respectively; ECN, bond type array.

(C) Canonicalization of ENC Table. The connectivity table generated through arbitrary graphic input of the chemical structure by the user is not unique. An algorithm and
program have been developed to convert the ENC table into a unique and unambiguous internal representation for chemical structures. In order to rearrange the nodes (expanded nodes) in an ENC table and reduce the generation of redundant matrixes, five attribute parameters are assigned to characterize each node and its adjacencies.

(1) Node Code (NC). Integer numbers are assigned to the nodes by the program according to their alphabetic order.

(2) Branch Number (BN). This is the number of branches of each node.

(3) Bond's Comprehensive Code (BCC). It characterizes all bonds connected to each node and is defined by eq 1, where $N_1, N_2, N_3, N_4, N_5, N_6$ represent the numbers of single, double, triple, ionic, coordinate, and aromatic bonds, respectively.

$$\mathrm{BCC} = 0.1N_1 + N_2 + 2.5N_3 + 3N_4 + 5N_5 + 8N_6 \qquad (1)$$

(4) Adjacency Node Code (ANC). This code describes the attributes of the adjacent layer of each node; it is defined by eq 2, where $NC_i$ is the node code of the i-th adjacent node and $NC_1 > NC_2 > NC_3 > \cdots$.

$$\mathrm{ANC} = NC_1 + NC_2/10^{2} + NC_3/10^{4} + \cdots + NC_i/10^{2(i-1)} \qquad (2)$$

(5) Adjacency Branch Code (ABC). This indicates the branch status of the adjacent layer of every node, as given by eq 3, where $BN_i$ is the branch number of the i-th node in the first adjacent layer and $BN_1 > BN_2 > BN_3 > \cdots$.

$$\mathrm{ABC} = BN_1 + BN_2/10 + BN_3/10^{2} + \cdots + BN_i/10^{i-1} \qquad (3)$$
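To make eqs 1-3 concrete, the sketch below evaluates NC, BN, BCC, ANC, and ABC for the basic nodes of the dye, taking the four arrays of Table I as input. It is an illustration under stated assumptions rather than the authors' program: the array contents are a transcription of Table I, the node codes in `node_code` are hypothetical (the paper says only that they follow alphabetic order; here the alphabetically first label receives the largest integer), and exact fractions are used instead of floating-point numbers so that equal codes compare as equal.

```python
from fractions import Fraction

# Arrays transcribed from Table I (12 basic nodes, 11 bonds); an assumption
# of this sketch, not data produced by the authors' program.
NODES = ["A", "N:N", "A2", "NH", "AZN246", "NH", "A",
         "SO3H", "SO3H", "SO3H", "OH", "F"]
C1 = ["1.2", "1.1", "1.4", "3.7", "3.3", "3.6", "3.8", "5.1", "5.3", "5.5", "7.1"]
C2 = ["2", "8", "9", "2", "4", "10", "11", "4", "6", "12", "6"]
CN = [1] * 11  # every bond in Table I is single

# Weights of eq 1, keyed by bond type 1..6.
BOND_WEIGHT = {1: Fraction(1, 10), 2: Fraction(1), 3: Fraction(5, 2),
               4: Fraction(3), 5: Fraction(5), 6: Fraction(8)}


def neighbours(c1, c2, cn):
    """Map node number -> list of (neighbour number, bond type)."""
    adj = {}
    for a, b, kind in zip(c1, c2, cn):
        # The part before "." is the node number; the rest is a ring position.
        i, j = int(a.split(".")[0]), int(b.split(".")[0])
        adj.setdefault(i, []).append((j, kind))
        adj.setdefault(j, []).append((i, kind))
    return adj


def layered_code(values, base):
    """v1 + v2/base + v3/base**2 + ... with values sorted descending (eqs 2, 3)."""
    ordered = sorted(values, reverse=True)
    return sum((Fraction(v, base ** k) for k, v in enumerate(ordered)), Fraction(0))


def attributes(nodes, c1, c2, cn, node_code):
    """Return node number -> dict with NC, BN, BCC, ANC, ABC."""
    adj = neighbours(c1, c2, cn)
    bn = {i: len(adj.get(i, [])) for i in range(1, len(nodes) + 1)}
    table = {}
    for i, label in enumerate(nodes, start=1):
        nb = adj.get(i, [])
        table[i] = {
            "NC": node_code[label],
            "BN": bn[i],
            "BCC": sum((BOND_WEIGHT[k] for _, k in nb), Fraction(0)),
            "ANC": layered_code([node_code[nodes[j - 1]] for j, _ in nb], 100),
            "ABC": layered_code([bn[j] for j, _ in nb], 10),
        }
    return table


# Hypothetical node codes: the alphabetically first label gets the largest integer.
labels = sorted(set(NODES))
node_code = {label: len(labels) - rank for rank, label in enumerate(labels)}

attrs = attributes(NODES, C1, C2, CN, node_code)
print(attrs[3])  # attributes of the naphthalene node A2
```

For the naphthalene node A2, which carries four single bonds, eq 1 gives BCC = 4 x 0.1 = 0.4, and its neighbours have branch numbers 2, 2, 1, 1, so eq 3 gives ABC = 2 + 0.2 + 0.01 + 0.001 = 2.211.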
The canonicalization of the ENC table is carried out according to the following procedure:
(1) Search the expanded node string array for the head node by applying the following rules in sequence:
(a) maximal BN value;
(b) maximal NC value;
(c) maximal BCC value;
(d) maximal ANC value;
(e) maximal ABC value.
(2) Select the next node from the first adjacent layer by the same rules.
(3) Continue in the same way for the remaining nodes.
The canonical ENC table of the dye structure in Figure 1 is shown in Table III.
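The procedure above can be sketched as a breadth-first walk in which the head node is the one with the largest attribute tuple and, at each step, the unvisited neighbours are taken in decreasing order of the same tuple. This is one possible reading of steps (1)-(3), not the authors' code. The function name `canonical_order`, the node identifiers, and the NC values are hypothetical; only BN and NC are supplied in the example because they already decide every choice for this molecule, and ties that survive all five rules are simply left in input order.

```python
from collections import deque


def canonical_order(adjacency, rank):
    """Return node ids in canonical order.

    adjacency: node id -> list of adjacent expanded-node ids
    rank: node id -> tuple (BN, NC, BCC, ANC, ABC) or a prefix of it;
          tuples compare element by element, mirroring rules (a)-(e).
    """
    # Step (1): the head node has the largest attribute tuple.
    head = max(adjacency, key=lambda n: rank[n])
    order, seen, queue = [head], {head}, deque([head])
    # Steps (2)-(3): visit each adjacent layer, best-ranked node first.
    while queue:
        current = queue.popleft()
        layer = [n for n in adjacency[current] if n not in seen]
        for node in sorted(layer, key=lambda n: rank[n], reverse=True):
            seen.add(node)
            order.append(node)
            queue.append(node)
    return order


# Expanded nodes of the dye and their mutual bonds (EC1/EC2 columns of Table II).
# "NH_a" links A2 to the triazine ring "AZN"; "NH_b" is the other NH node.
adjacency = {
    "A": ["N:N"],
    "N:N": ["A", "A2"],
    "A2": ["N:N", "NH_a"],
    "NH_a": ["A2", "AZN"],
    "AZN": ["NH_a", "NH_b"],
    "NH_b": ["AZN"],
}
# (BN, NC): BN counts the branches in each expanded node string; the NC
# values are hypothetical (alphabetically first label = largest integer).
rank = {
    "A": (3, 5), "A2": (4, 4), "AZN": (3, 3),
    "N:N": (2, 2), "NH_a": (2, 1), "NH_b": (2, 1),
}

print(canonical_order(adjacency, rank))
```

With these inputs the walk visits A2, N:N, NH, A, the triazine node, and the second NH, which is the node sequence listed in Table III.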
Table III. Canonical ENC Table of the Reactive Dye in Figure 1

no.   enode                        N   EC1   EC2   ECN
1     A2;1-OH,2-N:N,3-SO3H,6-NH    1   1     2     1
2     N:N;1-A,1-A2                 2   1     3     1
3     NH;1-A2,1-AZN135             3   2     4     1
4     A;1-SO3H,2-N:N,4-SO3H        4   3     5     1
5     AZN135;2-NH,4-NH,6-F         5   5     6     1
6     NH;1-A,1-AZN135

Table IV. Conversion of Entered Code to Normalized Code by Programming

entered code                   output code
A2;3-NH,6-SO3H,7-N:N,8-OH      A2;1-OH,2-N:N,3-SO3H,6-NH
AZN246;1-NH,3-NH,5-F           AZN135;2-NH,4-NH,6-F

[The diagrams column of Table IV is not reproduced.]

Figure 2. Atomic menu.
Figure 3. Atomic group menu.
Figure 4. Graphic node menu.
Figure 5. Bond type menu.
[Menu images not reproduced.]
The ring position numbers also need to be canonicalized. An algorithm for normalizing the position numbers is based on the following rules, applied in sequence: (a) minimal position numbers of heteroatoms, taken in the order O, S, N, ...; (b) minimal position numbers of double bonds; (c) minimal position numbers of substituents. A special program was written for converting the input position numbers into normalized numbers. Two examples are shown in Table IV.

INTERACTIVE GRAPHIC INPUT SYSTEM
In this paper, an interactive molecular structure input system (MSIS), written in Turbo C on an IBM PC/286, is developed. Using a mouse, the user draws a structure diagram on the screen by selecting atoms, atomic groups, chemical rings, and bond symbols from a series of menus. MSIS offers three types of menus: node menus (see Figures 2-4), bond type menus (see Figure 5), and command menus (see Figure 6).

Ring skeletons (aromatic rings such as benzene, naphthalene, anthracene, and anthraquinone, aliphatic rings of various sizes, and some fused rings) were drawn as subgraphs beforehand. Beneath every subgraph there is a linear code specially designed by the authors (ref 11). Together they constitute the graphic node menu (see Figure 4). Heterocyclic rings are generated by replacing C atoms in the ring by heteroatoms.

The bond type menu contains six bond symbols, representing single, double, triple, ionic, coordinate, and aromatic bonds (see Figure 5).

The command menus are organized in two levels: the first is a main menu and the second is a drawing menu, listed at the top and at the left of the drawing window, respectively (see Figure 6). The commands are described as follows.

commands in main menu:
Load            load a graph
Save            save a graph
Edit            edit the graph
Exit            exit from the MSIS to DOS

commands in drawing menu, which is under Edit:

DefGr, DefAt    input nodes
DrawS, DefBd    draw a bond between two nodes
CalSS           input subgraphs
DelAt, DelBd    modify a graph
MoveS           move the position of a graph
ZoomS           enlarge or reduce a graph
Clear           clear the screen
Exit            return to main menu
Chemical structures are assembled from various nodes and bonds by selecting a sequence of commands from the menus. The system is easy to learn: even for a beginner in chemistry, two or three hours are enough to learn how to draw a structure formula on the screen. The entered structure diagram is converted to two computer-readable files. One contains the structure drawing information, including drawing coordinates, nodes, and bonds or ring skeleton codes, which can be used to restore the structure diagram if necessary. The other stores the basic node connectivity table.

Figure 6. Drawing window with command menu. [Screen image not reproduced. Legible menu entries: Load, Save, Edit, Exit in the main menu; DrawS, CalSS, DefBd, DelBd, DefAt, DefGr, DelAt, MoveS, RotaS, ZoomS, MoveP, Clear, RefRe, Exit in the drawing menu; F1-help, F2-save.]

Table V. Substructures and Their Expanded Nodes
Columns: substructure diagram; number of expanded node(s); expanded nodes. [Diagrams and row layout not recoverable.] Expanded nodes listed in the table:
A;1-SO3H,2-N:N,4-SO3H
N:N;1-A,1-A2
A2;1-OH,2-N:N,3-SO3H,6-NH

INTERACTIVE SUBSTRUCTURE SEARCH SYSTEM

A substructure search tries to find all compounds in a chemical structure database that contain a given substructure and then retrieves the structure information and other required knowledge of those compounds. A substructure is a fragment of a chemical structure, usually constructed from several connected nodes. The selection of substructures depends on the search purpose and the structure notation system. In the present system, the expanded node, which provides chemically important structure information, is specified as the smallest substructure unit. A fragment which contains one or more expanded nodes may be defined as a substructure. The fragments shown in Table V could be used for searching for the compound in Figure 1.

A search of a database of ENC tables for a specific structure is performed with the ENC table of a substructure as the query. First the expanded nodes are matched, and then the node connectivity. The main steps are as follows:

(1) Enter the query structure diagram with the interactive graphic input program to generate the ENC table of the query structure.

(2) Scan the database to find all compounds whose ENC tables contain the expanded nodes of the query structure; these are the candidate compounds.
(3) If the query structure contains only one expanded node, the candidates are the compounds sought.

(4) If the query structure contains two or more expanded nodes, connectivity matching is needed in addition to expanded node matching. It is performed as follows. First, find the connection relationships of these expanded nodes in the ENC table of a candidate and canonicalize them to obtain a new ENC table of these nodes. Then compare it with that of the query structure. If the two ENC tables are identical, the candidate is a compound sought.

(5) If no matching compounds are found, report the failure; otherwise, report success, display the structure diagrams, and print other required information.

A full-structure search can be performed by the system in the same way.
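Steps (2)-(4) can be sketched as follows. The sketch is a simplified stand-in, not the authors' program: instead of canonicalizing the sub-table of a candidate and comparing whole ENC tables, it compares the bonds between matched expanded nodes as pairs of node strings, which is equivalent only when each expanded node string occurs once per compound. `EncTable`, `bond_pairs`, `search`, and the database entry named "decoy" are hypothetical; the "dye" entry is the canonical ENC table of Table III.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EncTable:
    enodes: tuple  # expanded node strings; node k is enodes[k - 1]
    bonds: tuple   # (EC1, EC2, ECN) triples with 1-based node numbers


def bond_pairs(table, wanted):
    """Count bonds whose two end nodes both belong to `wanted`."""
    pairs = Counter()
    for first, second, kind in table.bonds:
        a, b = table.enodes[first - 1], table.enodes[second - 1]
        if a in wanted and b in wanted:
            # Sort the two strings so the bond direction does not matter.
            pairs[(min(a, b), max(a, b), kind)] += 1
    return pairs


def search(database, query):
    """Return the names of the compounds that contain the query."""
    needed = Counter(query.enodes)
    query_pairs = bond_pairs(query, set(needed))
    hits = []
    for name, table in database.items():
        present = Counter(table.enodes)
        # Step (2): every expanded node of the query must occur in the candidate.
        if any(present[e] < n for e, n in needed.items()):
            continue
        # Steps (3)-(4): connectivity only matters for multi-node queries.
        if len(query.enodes) > 1:
            found = bond_pairs(table, set(needed))
            if any(found[p] < n for p, n in query_pairs.items()):
                continue
        hits.append(name)
    return hits


database = {
    "dye": EncTable(
        enodes=("A2;1-OH,2-N:N,3-SO3H,6-NH", "N:N;1-A,1-A2", "NH;1-A2,1-AZN135",
                "A;1-SO3H,2-N:N,4-SO3H", "AZN135;2-NH,4-NH,6-F", "NH;1-A,1-AZN135"),
        bonds=((1, 2, 1), (1, 3, 1), (2, 4, 1), (3, 5, 1), (5, 6, 1)),
    ),
    # Hypothetical second entry, present only to show a non-matching compound.
    "decoy": EncTable(enodes=("AZN135;2-NH,4-NH,6-F",), bonds=()),
}
query = EncTable(enodes=("A;1-SO3H,2-N:N,4-SO3H", "N:N;1-A,1-A2"),
                 bonds=((1, 2, 1),))
print(search(database, query))
```

The query here is built from two expanded nodes that appear in Table V. Screening on node strings first keeps the more expensive connectivity comparison for the few candidates that survive step (2).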
CONCLUSION AND APPLICATION

The features of this system are as follows:

(1) It is more compact than atom-by-atom systems and is superior in execution speed and work space.

(2) The important substructures in the molecule, such as substituents, functional groups, and ring structures, are expressed integrally.

(3) The more difficult problem of coding cyclic compounds is tackled. Ring structures, such as aromatic and aliphatic carbon rings and heteroatom-containing rings, are encoded by a few linear symbols.

(4) It is a highly automatic system which only requires the user to draw chemical structures on the screen. The drawing process is very easy to carry out.

This system has been applied to set up a commercial dye database for chemical structure search and retrieval (ref 12). The basic node connectivity table is very convenient for structure analysis; it has been used to predict physical properties by group contribution methods on an IBM PC (ref 13). Practice confirms that TLCCT is readily understandable, easy to handle, suitable for computerization, and applicable to many tasks involving the structure information of complicated chemical compounds, especially pesticides, dyes, and their intermediates.

REFERENCES AND NOTES

(1) Smith, E. G. The Wiswesser Line-Formula Chemical Notation; McGraw-Hill: New York, 1968.
(2) Morgan, H. L. The Generation of a Unique Machine Description for Chemical Structures - A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107-113.
(3) Dubois, J. E. French National Policy for Chemical Information and the DARC System as a Potential Tool of This Policy. J. Chem. Doc. 1973, 13, 8-13.
(4) Walentowski, R. Unique Unambiguous Representation of Chemical Structures by Computerization of a Simple Notation. J. Chem. Inf. Comput. Sci. 1980, 20, 181-192.
(5) Hendrickson, J. B.; Toczko, A. G. Unique Numbering and Cataloguing of Molecular Structures. J. Chem. Inf. Comput. Sci. 1983, 23, 171-177.
(6) Abe, H.; Kudo, Y.; Sasaki, S.-I. A Convenient Notation System for Organic Structure on the Basis of Connectivity Stack. J. Chem. Inf. Comput. Sci. 1984, 24, 212-216.
(7) Read, R. C. A New System for the Designation of Chemical Compounds. 1. Theoretical Preliminaries and the Coding of Acyclic Compounds. J. Chem. Inf. Comput. Sci. 1983, 23, 135-149.
(8) Read, R. C. A New System for the Designation of Chemical Compounds. 2. Coding of Cyclic Compounds. J. Chem. Inf. Comput. Sci. 1985, 25, 116-128.
(9) Gottlieb, O. R.; Auxiliadora, M.; Kaplan, C. Replacement-Nodal-Subtractive Nomenclature and Codes of Chemical Compounds. J. Chem. Inf. Comput. Sci. 1986, 26, 1-3.
(10) Randic, M. Compact Molecular Codes. J. Chem. Inf. Comput. Sci. 1986, 26, 136-148.
(11) Feibai, Y.; Xiaochun, Y.; Wentang, C.; Yong, Q. Representation of Chemical Structure by Compact Connectivity Table System. Proceedings of the 4th Asian Chemistry Congress, Beijing, China, Aug 1991; Computer Chemistry Monograph Series 11, Selected Papers of Computers and Applied Chemistry; Science Press: Beijing, China, 1991; pp 46-59.
(12) Feibai, Y.; Wentang, C.; Ying, Z.; Yong, Q. A Database of Structure Information of Commercial Dyes. Proceedings of the 13th International CODATA Conference, Oct 1992, Beijing, China; CODATA Bulletin ISSN 0366-757X, p 125.
(13) Wentang, C.; Feibai, Y.; Ying, Z. An Intelligent System for Predicting Physical Properties by Use of Organic Structure Diagrams. Proceedings of the 13th International CODATA Conference, Oct 1992, Beijing, China; CODATA Bulletin ISSN 0366-757X, p 85.