J. Chem. Inf. Comput. Sci. 1995, 35, 939-944
A Compact Code for Chemical Structure Storage and Retrieval

Igor Strokov

Novosibirsk Institute of Organic Chemistry, Siberian Division of the Russian Academy of Sciences, Lavrentiev Avenue 9, Novosibirsk 90, Russia

Received January 11, 1995

Two modifications of a linear tree-like code for efficient structure handling have been proposed. Each symbol of the code corresponds to one edge of a graph traversed in depth-first (“deep”) or breadth-first (“wide”) order. The code facilitates compact storage of structural formulas, fast substructure search, and similarity search. The use of the code is illustrated on a sample collection of 50 000 structures.

INTRODUCTION

Almost every chemical information system relies on chemical structure handling. Exploring large databases of chemical properties requires compact storage and fast searching of chemical structure information. Methods of saving memory in the storage of chemical structures include the use of large structural blocks,¹,² covering a chemical graph with primitive basic graphs,³,⁴ and joining collections of graphs into hyperstructures.⁵,⁶ At the same time, the method of storage substantially influences the access methods or, conversely, is determined by the requirements of the search procedure.

Two kinds of search procedure are considered here. In the first (substructure search), the features that the selected structures must possess are specified. These can be structure fragments, generic structures, types of cycles in the required structures, etc. The methods of substructure search have been thoroughly investigated⁷ and are widely applied in practice. The second kind of access (often called similarity search) implies the selection of structures most similar to some query structure when one has no a priori information about structural features common to the query and the structures to be selected.
The similarity search is usually based on a set of predefined structural fragments,⁸ formal reaction products,⁹ or even on the direct application of the maximal common subgraph algorithm.¹⁰

A method of storing chemical structures that provides fast access by both kinds of search procedure within moderate memory requirements is the subject of this article.

Let us consider structural formulas with covalent bonds only, using the traditional representation of a chemical structure as a graph with labeled vertices and, sometimes, labeled edges: a so-called molecular or chemical graph. In a previous article¹¹ we proposed describing the topology (i.e., the connectivity of vertices) of a molecular graph by an “incomplete covering by chains”. This term means the selection of a spanning subgraph (i.e., a subgraph containing all vertices of the graph) represented by one chain or a set of chains. Evidently, the length of a chain is sufficient for a complete description of the chain topology. This fact was used to construct a short code for a structure (cf. ref 3).

Let us now consider spanning subgraphs belonging to the more general class of trees. It is well-known that one can describe a tree by enumerating the degrees of its vertices using at least two orders of traversal: breadth-first and depth-first. This feature of trees, which affords an unambiguous representation of acyclic graphs by symbol strings, has long been widely used for the description of both molecular structures¹² and their fragments, e.g., for the construction of correlation tables.¹³ At the same time, the HORD code used for the same purpose includes the description of cycles.

There are some approaches that extend the advantages of tree-like codes to arbitrary structures that may contain cycles. Noticing that the tree-like part of a molecular skeleton can be expressed by a list of vertex degrees (T-list), Hendrickson, Grier, and Toczko⁴ proposed representing the remaining part of the graph (excluding the tree edges) as a similar ring list (R-list). It should be noted, however, that there exist some strongly connected graphs in which cycles remain even after the deletion of the edges belonging to the spanning tree (e.g., the tetrahedron graph). Another approach consists in representing cyclic graphs as trees with repeating dummy nodes and is applied, for instance, in an algorithm for the generation of molecular graphs.¹⁴ Our method (described below) seems to be closer to the latter.

Abstract published in Advance ACS Abstracts, September 1, 1995.
0095-2338/95/1635-0939$09.00/0

A DEEP TREE-LIKE CODE OF A CHEMICAL STRUCTURE

As usual, let the non-hydrogen atoms of a molecule correspond to the vertices, and the chemical bonds to the edges, of a graph. Instead of representing π-bonds by multiple or labeled edges, let us specify the hybridization states of atoms. A hybridization state S is stored in the vertex label L along with an element code E and a possible number of neighbor vertices N:
$$L = (E \times 16) + (S \times 4) + N - 1 \qquad (1)$$
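Equation 1 packs the three fields into a single small integer, so a label can be built and taken apart with ordinary integer arithmetic. The following Python fragment is a minimal sketch of that packing, not the author's implementation. The function names `encode_label` and `decode_label` are hypothetical, and the ranges S = 0..3 and N = 1..4 are assumptions inferred from the multipliers 4 and 16 rather than stated in the text. The element numbering (C = 1, N = 2, O = 3, ..., X = 15) follows the list given in the text.

```python
# Element codes as listed in the text: serial numbers starting at 1.
# Positions 11-14 are unused; 15 is the wildcard symbol X.
ELEMENTS = ["C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I"]
ELEMENT_CODE = {symbol: i + 1 for i, symbol in enumerate(ELEMENTS)}
ELEMENT_CODE["X"] = 15
CODE_ELEMENT = {code: symbol for symbol, code in ELEMENT_CODE.items()}


def encode_label(element, s, n):
    """Pack (element, S, N) into one integer following eq 1."""
    # Any label outside the list maps to X, as the text describes.
    e = ELEMENT_CODE.get(element, 15)
    # Assumed ranges (not stated in the text): S and N - 1 each fit in 2 bits.
    if not 0 <= s <= 3:
        raise ValueError("S is assumed to lie in 0..3")
    if not 1 <= n <= 4:
        raise ValueError("N is assumed to lie in 1..4")
    return e * 16 + s * 4 + n - 1


def decode_label(label):
    """Invert eq 1: recover (element symbol, S, N) from a label."""
    e, rest = divmod(label, 16)
    s, n_minus_1 = divmod(rest, 4)
    # Codes with no assigned element (0 and 11-14) are reported as "?".
    return CODE_ELEMENT.get(e, "?"), s, n_minus_1 + 1


print(encode_label("C", 0, 4))   # 1*16 + 0*4 + 4 - 1 = 19
print(decode_label(19))          # ('C', 0, 4)
```

Under these assumed ranges the largest label is 15*16 + 3*4 + 3 = 255, so every label fits in one byte.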
The element code is a serial number (starting with 1) in the following list: C, N, O, F, Si, P, S, Cl, Br, I, ..., X. Positions 11-14 are not used; position 15 corresponds to the symbol X, denoting any vertex with a label not included in the above scheme. It may be any metal atom, a special general label for specifying a generic structure, etc. The hybridization state is set equal to the number of nonhybridized p-electrons capable of forming multiple π-bonds. This information allows π-bond distributions to be reconstructed in any traditional form (the localized or “aromatic”
© 1995 American Chemical Society
Table 1. Numerical Representation of the Vertex Labels binary -CH1 -CH>-CH