Fuzzy Definition of Molecular Fragments in Chemical Structures

Figure 3 Definition and coding of a −OH($) fuzzy substructure: (a) atoms' numeration; (b) fuzzy subgraph; (c) subgraph of −OH(CH2) substructure; (...
J. Chem. Inf. Comput. Sci. 2000, 40, 325-329


Barbara Dębska* and Barbara Guzowska-Świder
Department of Computer Chemistry, Rzeszów University of Technology, 35-041 Rzeszów, Poland
Received August 3, 1999

This paper presents a methodology for seeking the relationships between chemical substructures (molecular fragments) and spectral parameters using a computer collection of molecular spectra. To establish the spectrum-structure correlations, the program has to search the chemical structure database to find compounds whose molecules contain a given molecular fragment. There is no single definition of a substructure, as it always depends on the type of problem dealt with. In the problem of structural identification, fuzzy definitions of substructures are applied, and their forms are imposed by the spectral methods used.

1. INTRODUCTION

The origin of chemical information is the chemical compound, and the notion of its structure is fundamental in most application fields:1-7 structure-activity relationships, reaction indexing, reaction synthesis, molecular modeling, substructure search, structure elucidation, etc. A number of computer programs have been described that support the structure elucidation process.8-29 A possible approach to spectra interpretation can be based on the determination of functional groups that may be present in an unknown compound. Several computer programs have been written to mimic the human process of interpreting the molecular spectra of an unknown compound. To recognize the structure of a given compound, these programs8-22 use correlation tables, i.e., lists of structural fragments (functional groups, molecular fragments) together with their spectral characteristics. Usually, these characteristics are taken from the literature,8-17 but they can also be generated from a computer library of molecular spectra.18-22 A number of programs have been developed that recognize the presence or absence of substructures in chemical molecules without extracting information about spectrum-structure correlations, e.g., as interpretation rules. These programs mainly apply neural network methods.23-29

The first step in developing programs for the structure elucidation process is to elaborate a list of substructures that can be recognized by a computer and to define a numerical representation of chemical structures and molecular fragments. In our approach, complete chemical structures and substructures are coded in the form of molecular graphs.

2. DEFINITION OF MOLECULAR GRAPH

The structure of each molecule can be defined as a molecular graph, whose nodes describe the atoms and whose edges represent the chemical bonds linking the atoms. The definition of graphs applied in our research was presented by Fic.30 The nodes were defined by three attributes describing chemical features of the subsequent atoms: A, the type of an atom; V, its value; and E, the number of free electrons. The edges of molecular graphs were coded by the attribute B (bond), whose value was 1 for a single chemical bond, 2 for a double bond, and 3 for a triple bond. Figures 1 and 2 depict molecular graphs of a linear molecule (pentan-1-ol) and of an aromatic one (propiophenone).

Every chemical structure can be considered as a combination of smaller fragments, called substructures. The program for structure elucidation requires searching through a structure database to select compounds whose molecules contain the substructures sought, and for this reason the substructures have to be defined as molecular subgraphs.

* Corresponding author. E-mail: [email protected].

3. FUZZY DEFINITION OF MOLECULAR SUBSTRUCTURE
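As a point of reference for the fuzzy definitions that follow, the precise molecular graph of Section 2 can be sketched as a small data structure. The sketch below is only an illustration, not the representation used by the authors or by Fic: the class name `Node`, the helper functions, the atom numbering, and the choice of a hydrogen-suppressed graph for pentan-1-ol are all assumptions. In particular, the attribute V is assumed here to hold the valence of the atom, and E is assumed to count lone-pair electrons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    A: str  # type of the atom (element symbol)
    V: int  # the "V" attribute; ASSUMED here to be the atom's valence
    E: int  # number of free electrons; ASSUMED to count lone-pair electrons


# Hydrogen-suppressed graph of pentan-1-ol (an assumption; the paper's
# Figure 1 may number the atoms differently or include hydrogens).
# Hypothetical numbering: atoms 1..5 are the carbon chain, atom 6 is oxygen.
nodes = {
    1: Node("C", 4, 0),
    2: Node("C", 4, 0),
    3: Node("C", 4, 0),
    4: Node("C", 4, 0),
    5: Node("C", 4, 0),
    6: Node("O", 2, 4),
}

# Each edge maps an unordered pair of node numbers to the attribute B:
# 1 = single bond, 2 = double bond, 3 = triple bond.
edges = {
    frozenset({1, 2}): 1,
    frozenset({2, 3}): 1,
    frozenset({3, 4}): 1,
    frozenset({4, 5}): 1,
    frozenset({1, 6}): 1,
}


def neighbors(n):
    """Return the sorted numbers of the nodes bonded to node n."""
    return sorted(m for pair in edges if n in pair for m in pair if m != n)


def implicit_hydrogens(n):
    """Valence left over after summing the bond orders at node n."""
    bonded = sum(b for pair, b in edges.items() if n in pair)
    return nodes[n].V - bonded


print(neighbors(1))
print(sum(implicit_hydrogens(n) for n in nodes))
```

Under these assumptions the carbon bearing the hydroxyl group has neighbors 2 and 6, and the implicit hydrogens add up to 12, consistent with the formula C5H12O of pentan-1-ol. Storing edges as unordered pairs keeps the graph undirected without duplicating each bond.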

The number of molecular substructures is practically unlimited; therefore they are often divided into characteristic classes to simplify the work on them. In the course of our research, the concept of a fuzzy substructure was introduced. A fuzzy substructure stands for a characteristic class of substructures and is defined in the form of a fuzzy molecular subgraph.31 The characteristic features of such subgraphs are multivalued attributes describing the nodes and the edges of a graph, which means that each fuzzy subgraph represents more than one substructure. Each substructure is defined by the following:

The substructure vector, describing the global features of a substructure, such as the number of atoms (graph nodes), the number of bonds, the number of single, double, triple, or aromatic bonds, the number of atoms covered by ring systems, and the minimum chemical formula of the substructure (containing information about the types and quantity of atoms C, N, O, ..., excluding H atoms).

The set of configuration vectors, coding for each atom the type of the atom, the number of free electrons, the location of the atom in the molecule (1 = not in the ring, 2 = in the ring, and 0 = any), the number of neighbors, and the number of single, double, triple, and aromatic bonds that the given atom takes part in.

The list of bonds, representing the way the atoms are connected and the types of the bonds.

The substructure vector is used as a screening filter for fast and efficient checking whether the subsequent molecules

10.1021/ci9902705 CCC: $19.00 © 2000 American Chemical Society Published on Web 03/03/2000


graphs. The definitions of subgraphs recognized by the identification system are stored in the computer database. The subgraph database of our system contains data on 222 substructures (62 recognized only by IR spectroscopy, 132 by 1H NMR, and 28 identified by both spectral methods). The substructure database contains precise definitions of groups containing the C, N, O, and S elements and the halogens. The basic set of substructures is connected with various types of organic compounds, such as ketones -CO($,$), aldehydes -CHO($), carboxylic acids -COOH($), amines -NH-($,$), alcohols -OH($), aromatic benzene ring ($), heterocyclic rings, pyridine ring ($), etc. Each subset (ketones, aldehydes, etc.) includes substructures together with their precisely or fuzzily defined surroundings. The subset of a given substructure can contain precisely defined molecular fragments -OH(-CH2-), -OH(>CH-), -OH(>CC