Designing Molecules with Specific Properties from Intercommunicating

The concept that computers can be creative in designing new molecules is now being actively explored. Our work has been aimed at developing an ...
0 downloads 0 Views 164KB Size
J. Chem. Inf. Comput. Sci. 1996, 36, 1061-1066

1061

Designing Molecules with Specific Properties from Intercommunicating Hybrid Systems† James Devillers CTIS, 21 rue de la Bannie`re, 69003 Lyon, France Received February 22, 1996X

The concept that computers can be creative in designing new molecules is now being actively explored. Our work has been aimed at developing an intercommunicating hybrid system using a genetic algorithm and a backpropagation neural network model for solving the general problem of designing molecules with specific properties. A case study dealing with biodegradation modeling is presented. The usefulness of the intercommunicating hybrid systems in QSAR and drug design is emphasized. 1. INTRODUCTION

Humans are hybrid information processing devices. Indeed, our behaviors and actions are governed by a combination of genetic information encoded in our chromosomes and knowledge acquired through a complex learning process. This type of hybrid information processing is now tentatively replicated in computers. Thus, for example, artificial neural networks (ANNs) are inspired by the functionality of biological neurons in the brain.1 Genetic algorithms2 (GAs) are also biologically inspired and based on the principles of the Darwinian evolution. These techniques constitute valuable tools for solving specific tasks in drug design and in (Quantitative) Structure-Activity Relationship ((Q)SAR) and Quantitative Structure-Property Relationship (QSPR) studies.3,4 However, it is now well admitted that certain complex problems in molecular modeling cannot be solved by a single technique alone. Each approach presents particular properties that make them suited for specific applications and not for others. Thus, for example, ANNs are powerful tools for recognizing patterns, but they present some difficulties for explaining how they reach their decisions. Fuzzy logic systems,5,6 which exhibit a wide flexibility for handling imprecise “linguistic” concepts, are good at explaining their decisions, but they cannot automatically acquire the rules they use to make those decisions.7 GAs are very efficient for solving complex optimization problems; however, there are many parameters to be tuned and often considerable experimentation is needed before obtaining the optimal solution(s). The strengths and weaknesses of these different intelligent8 techniques have been the central driving force behind the creation of hybrid systems where two or more techniques are combined in order to overcome the limitations of each approach and create a synergy between them in order to provide valuable modeling tools. According to Goonatilake and Khebbal,7 hybrid systems can be differentiated on the basis of their functionality. In function-replacing hybrids, a principal function of a given technique is replaced by another intelligent technique. Here, the aim is to take an approach presenting weaknesses in a particular property and combine it with a technique that has strengths in the same property. Thus, for example, GAs have † Invited paper presented in the Symposium “Frontiers in Mathematical Chemistry. Novel Directions in QSAR” (M. Randic´ and P. G. Mezey, Presiding) at PACIFICHEM 1995 (December 17-22, 1995, Honolulu, Hawaii). X Abstract published in AdVance ACS Abstracts, October 1, 1996.

S0095-2338(96)00022-4 CCC: $12.00

been shown to be very effective at function optimization, efficiently searching large and complex spaces to find nearly global optima. Under these conditions, GAs have been largely employed for designing and/or training ANNs.9-14 The second class of hybrid systems corresponds to the polymorphic hybrids which use a single processing architecture to achieve the functionality of different intelligent processing techniques. The cellular encoding methodology15 which allows synthesis of ANNs belongs to this category of hybrid systems. The last class of hybrid systems corresponds to the intercommunicating hybrids which are independent, self-contained, processing modules that exchange information and perform separate functions in order to generate optimal solutions.7,16 Thus, if a problem can be subdivided into distinct processing tasks, then different independent modules can be used to solve the parts of the problem at which they are the best. They work in synergy for performing a particular task. The different modules can process subtasks either in a sequential manner or in a competitive/cooperative framework. Intercommunicating hybrids, presenting different levels of complexity, are penetrating the field of QSAR and drug design.17-19 Therefore, the aim of this paper is to present the usefulness of this type of hybrid system through a practical example dealing with the design of molecules with a specific biodegradability. 2. AN INTERCOMMUNICATING HYBRID SYSTEM FOR MODELING BIODEGRADATION

2.1. Description. Sometimes, a slight modification in the structure of an organic compound can considerably change its biological activity. It is particularly true for complex biological activities such as biodegradation. The knowledge of these (Quantitative) Structure-Biodegradability Relationships ((Q)SBRs) is economically very important for optimizing the design and synthesis of new molecules. Recently, the backpropagation neural network (BNN) has been presented as a powerful nonlinear statistical tool for modeling biodegradation due to its ability to learn and generalize from complex and noisy biodegradation data.20,21 The powerful modeling potency of a BNN can be used in conjunction with the optimization ability of a GA in order to produce a hybrid system able to automatically propose molecular structures for the design of molecules presenting various degrees of biodegradability. In this context, an intercommunicating hybrid system was designed for modeling biodegradation. It is basically constituted of © 1996 American Chemical Society

1062 J. Chem. Inf. Comput. Sci., Vol. 36, No. 6, 1996

DEVILLERS

Figure 1. Intercommunicating hybrid system for designing biodegradable molecules.

a GA for searching combinations of structural fragments among a set of descriptors in order to facilitate the design of biodegradable molecules. The biodegradability of the candidate molecules is estimated from a three-layer BNN model. The elaboration of this system has been described in a recent paper.17 The hybrid system is summarized in Figure 1, and the key steps in the design of the BNN model and GA are recalled below. The BNN model was derived from a training set of 38 molecules presenting a high degree of structural heterogeneity. The aerobic ultimate degradation in receiving waters (AERUD) of these molecules was used as endpoint to estimate their biodegradability.22 The generalization ability of the BNN was estimated from a testing set of 49 molecules. Examples of molecules belonging to the training and testing sets are displayed in Figure 2. The molecules were described by means of 13 structural descriptors (Table 1) known to influence the environmental fate of chemicals.20,21,23-28 They were Boolean descriptors. It is obvious that other descriptors could have been selected since numerous structural features influencing the biodegradation of organic molecules have been identified in the literature.23,24 However, a trial and error procedure revealed that this set of descriptors (Table 1) offered the best compromise between an optimal description of our training set and valuable generalization perspectives with the BNN model. Last, they were sufficiently general to describe a wide variety of structures as required for GA experiments. In addition to the 13 Boolean descriptors, four classes of molecular weight were defined (Table 1) not only to account for size effects but also to allow the description of all the molecules. Molecular descriptors (except descriptor no. 10) and the AERUD values were scaled by means of eq (1).

x′i ) [a(xi - xmin)/(xmax - xmin)] + b

(1)

In eq (1), x′i and xi were the scaled and original values, respectively. For the molecular descriptors, xmin and xmax were the minimum and maximum values found in the different columns, respectively. For descriptor no. 10, the four classes were encoded 0.2, 0.4, 0.6, and 0.8, respectively. For the outputs, xmin and xmax corresponded to the minimum and maximum AERUD values (i.e., 1 and 4) which could be calculated from the procedure proposed by Boethling et

Figure 2. Structure of some molecules included in the training (no. 1-7) and the testing (no. 8-14) sets. Table 1. Molecular Descriptors no.

descriptor

no.

descriptor

1 2 3 4 5 6 7 8 9

heterocycle N CdO >2 chlorine atoms fused rings only C, H, O, N NO2 g2 cycles epoxide primary or aromatic OH

10

molecular weight (g‚mol-1) #5 ) 0 and if #5 ) 1 )> #3 ) 0. -When a molecule contains fused rings (descriptor no. 4), it obligatorily contains at least two cycles (descriptor no. 7). Thus, if #4 ) 1 )> #7 ) 1 (the converse is not true). The GA analysis was performed with the STATQSAR package.29 The GA configuration was obtained by a trial

and error procedure. We found that using a population of 100 was a reasonable compromise between having enough sampling over combinatorial space and not taking too much computer time. The crossover (Pcross) and mutation (Pmut) probabilities equaled 0.8 and 0.1, respectively. A 50% population permutation was selected (elitist strategy). Selection was optimized from the prescale, scale, and scalepop routines of Goldberg2 (p 79). Last, the GA was stopped after 300 generations. 2.2. Analysis of the Results. With our intercommunicating hybrid system (Figure 1), it is possible to propose candidate molecules having low AERUD values. These molecules present particular structures (Figure 3). Thus, on the basis of numerous runs, among the best 50 candidates, it is interesting to note, for example, that biodegradable molecules preferentially contain a CdO group (descriptor no. 2) and/or a primary or aromatic OH (descriptor no. 9). At the opposite, these molecules do not include a nitrogen heterocycle (descriptor no. 1) and/or more than two chlorine atoms (descriptor no. 3) (Figure 3). It must be pointed out that these results are not surprising. Indeed, for example, it is well-known that the presence of OH, CHO, and COOH groups on an aromatic ring facilitates biodegradation, as compared with the NO2 group and halogens. In addition, the rate of biodegradation usually decreases with an increasing number of substituents on the aromatic ring.30 Our results clearly show that the proposed intercommunicating hybrid system (Figure 1) can be used for designing molecules with a particular biodegradability since it always proposes logical solutions. It is interesting to note that during the simulations, the hybrid system regularly proposes candidate molecules having the same coding as biodegradable molecules belonging to the training and testing sets. In the chemical industry, the synthesis of new molecules is often under the dependence of structural constraints. Therefore, it appeared interesting to test the limits of the intercommunicating hybrid system (Figure 1) when structural conditions are voluntarily introduced. In practice, this type of constraints requires to firstly modify the initialization process of the population since the selected descriptor(s) is

1064 J. Chem. Inf. Comput. Sci., Vol. 36, No. 6, 1996 Table 2. Number of Biodegradable Candidates Proposed by the GA under Constraints constrained descriptor

no. of candidates with AERUD