Neural Network Prediction of Secondary Structure in Crystals: Hydrogen-Bond Systems in Pyrazole Derivatives
Jose Fayos,* Lourdes Infantes, and F. H. Cano
Departamento de Cristalografia, Instituto Rocasolano, CSIC, Serrano 119, Madrid-28006, Spain
Crystal Growth & Design 2005, Vol. 5, No. 1, 191-200
Received March 16, 2004; Revised Manuscript Received May 7, 2004
ABSTRACT: With the purpose of using neural networks to predict some structural properties of crystals, in particular the types of secondary structure built by hydrogen bonds, 46 molecules containing the pyrazole ring have been codified as vectors of equal dimension. Looking for an unbiased codification, we selected the components of these vectors from the one-dimensional Fourier transform of the corresponding three-dimensional molecular charge distribution. Similarity matrices and the similarity maps of trained Kohonen’s networks have allowed classification of the molecules, as a step prior to the prediction of their hydrogen-bond system. Thus, we have worked under the hypothesis that this molecular codification contains information relevant to the structural level in crystals. The classes obtained show a correlation with the previously known secondary structure of the corresponding crystals. Then, by training a neural network with some molecular vectors supervised by their coded secondary structure, we have achieved a significant prediction of the type of secondary structure for the rest of the molecules. This molecular codification also seems to account for other noncovalent molecular interactions involved in the packing.

Introduction

It seems important to predict the type of secondary structure that hydrogen-bonded molecules form, all the more so because this structure will probably guide the packing mode of the crystal. Although there are important advances in predicting the crystal structure of the most stable polymorph for simple molecules by intermolecular potential calculations,1-5 the energies of the preferred polymorphs are quite similar. Therefore, by using these calculations we generally arrive at several possible, and apparently different, solutions. That is probably due to the still poor models we have for the complex and cooperative phenomenon of molecular self-assembly.
A different approach to predicting the secondary structure of 1H-pyrazole derivatives has been taken through empirical rules.6-8 On the other hand, although some molecules have different crystalline polymorphs, each molecule is usually associated more closely with just one polymorph, which would mean that some molecular properties guide its most probable packing to form the crystal. With this hypothesis, it seems that the database of about a quarter of a million known molecule/crystal structure pairs available in the Cambridge Structural Database9 might contain enough information to predict statistically the type of molecular self-association in the crystal. Under this hypothesis, we have recently published a method for the prediction of molecular crystal packing by neural networks,10 assuming that the three-dimensional distribution of molecular charges conditions the self-association. In that work, we codified 31 molecules of different types into vectors by means of the one-dimensional Fourier transform (FT) of their three-dimensional distributions of charges;11,12 next, these vectors were grouped by similarity by projecting them onto a 2D Kohonen’s map.10,13,14 Thus, the molecules could be separated into classes corresponding to different packing types; finally, this classification allowed prediction of the FT of the packing for new molecules.

In the present work, we use this method to predict the secondary structure of 47 molecules, more similar among themselves and all having a pyrazole ring in common and substitutions at R3, R4, and R5, as shown in Figure 1. As in the previous work, to analyze the molecular similarity we have codified the molecules as vectors of equal dimension by the one-dimensional FT of their three-dimensional charge distribution,11 an approach that we assume includes enough information on the secondary structure in the crystal. Because we work with a nonlinear statistical estimator, as a neural network is, one part of this work deals with the development of methods for analyzing the molecular input data, the resulting Kohonen’s map, and the derived molecular classifications.

* To whom correspondence should be addressed. Tel: 34-915619400; fax: 34-915642431.

Analysis of Molecular and Secondary Structure Cluster Data Sets

In our codification scheme, we have a total of four sets: the sets of FT(s) discrete vectors of dimension 32 in si, namely, the molecular set 47M and the set 49S for the secondary clusters of molecules, plus the random set 47R and the 1/0 secondary structure set 49B (see Appendix I). First, we classify those vectors by their similarity inside each set. Next, we analyze the possible correlation of any of these classes with the secondary structures observed in their crystals (49B). Before trying to separate classes of vectors by similarity, we see how the vectors of each base fill the 32-dimensional space, that is, their linear independence. With this purpose, we have calculated for each set two
10.1021/cg049903k CCC: $30.25 © 2005 American Chemical Society Published on Web 09/15/2004
Figure 1. The 47 molecules selected for the present work, with the identification numbering as the CSD (6) codes. Jumps in the numbering arise because the molecules were selected from a larger group. The molecules are listed in such a way that, up to number 32, the secondary structure in the crystal is in dimers (D type, 11 molecules); up to number 50, in trimers (T, 9); up to 58, in tetramers (A, 7); up to 83, in chains (C, 15, including NIBFIN); and to the end, in layers (L, 5). The secondary structural type has been assigned from the system of hydrogen bonds, as described in the corresponding publication.

Table 1. Analysis of Vector Bases 47R, 47M, and 49S(a)

base      aσTF   taα     maα    Maα   mα     Mα    tad   mad   Mad   md    Md
47R       0.58   -0.004  -0.05  0.04  -0.53  0.55  4.71  4.21  5.21  3.10  6.20
47M       0.27    0.05   -0.08  0.19  -0.88  0.98  2.43  2.11  3.37  0.42  4.68
49S       0.26    0.31   -0.14  0.45  -0.75  0.97  2.24  1.82  3.08  0.26  3.71
47M·47S           0.11                -0.29  0.46
47M·47R           0.00                -0.39  0.26

(a) Each base has N (47, 47, or 49) vectors TF(s) of 32 components. The first column is the average of the standard deviations of the 32 components n of the average vector 〈TFn(s)〉N among the N vectors, aσTF = 〈σ〈TFn(s)〉N〉32. The next columns are estimators of the similarity matrices A(i,j) and D(i,j) among the TFi(s) and TFj(s) vectors of each base: the total average values, taα = 〈A(i,j)〉 and tad = 〈D(i,j)〉; the minimum and maximum values among the averages for each vector i, maα = min(〈A(i,j)〉i)j, Maα = max(〈A(i,j)〉i)j, mad = min(〈D(i,j)〉i)j, and Mad = max(〈D(i,j)〉i)j; and the total minimum and maximum, mα = min(A(i,j)), Mα = max(A(i,j)), md = min(D(i,j)), and Md = max(D(i,j)). The bottom two rows give the values of taα, mα, and Mα for the scalar products between vectors of different bases corresponding to the same molecule.
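The estimators defined in the footnote to Table 1 can be sketched in a few lines. This is a minimal illustration and not the authors' program: the function names are hypothetical, the averages are taken over off-diagonal elements only (an assumption, since the text does not say whether the diagonal is included), and the random set is generated here with components uniform between -1 and +1.

```python
import numpy as np

def similarity_matrices(vectors):
    """Return A(i,j) = cos(alpha_ij) and the Euclidean distances D(i,j)."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1)
    A = (v @ v.T) / np.outer(norms, norms)
    D = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=2)
    return A, D

def estimators(S):
    """Total average, min/max of the per-vector averages, and global min/max.

    Assumption: only off-diagonal elements of S are used.
    """
    n = S.shape[0]
    off = ~np.eye(n, dtype=bool)
    per_vector = np.array([S[i, off[i]].mean() for i in range(n)])
    values = S[off]
    return {
        "total_avg": values.mean(),
        "min_avg": per_vector.min(),
        "max_avg": per_vector.max(),
        "min": values.min(),
        "max": values.max(),
    }

# Reference set: 47 random vectors of dimension 32 (hypothetical stand-in for 47R).
rng = np.random.default_rng(0)
random_set = rng.uniform(-1.0, 1.0, size=(47, 32))
A_rand, D_rand = similarity_matrices(random_set)
est_A = estimators(A_rand)
est_D = estimators(D_rand)

# Orthonormal base of binary vectors: all cosines zero, all distances identical.
A_eye, D_eye = similarity_matrices(np.eye(4))
```

For a random set the total average of A(i,j) comes out close to zero and the distances cluster around a common value, which is the behavior the text uses as its reference for "no classes".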
similarity matrices, one with elements A(i,j) = cos αij between each pair of FT(s) vectors, and another with elements D(i,j), the distances between each pair of vectors. Table 1 shows some significant values of these two
matrices for each data set. The 47 random vectors seem to be distributed homogeneously in the 32-dimensional space, with high deviations from the average, average values of A(i,j) of almost zero, and D(i,j) values close to
Figure 2. Map of the Euclidean distances, D(i,j), among the 47 random vectors generated to establish a reference set. The map is, of course, symmetric. The values of D(i,j) have been scaled so that the global maximum is unity; the relative values are represented by the gray scale indicated on the right.
each other (in an orthonormal base of binary vectors (10...0) ... (0...01), all A(i,j) are zero and all D(i,j) are identical); obviously, the random vectors could not form classes. The FT(s) vectors of the observed structures, 47M and 49S, on the contrary, seem to be distributed quite differently, indicating that they could form classes. In Table 1, the scalar products between corresponding vectors of sets 47M and 49S show the independence existing between the vectors codified for the isolated molecules and those corresponding to their clusters of secondary structure. As a reference for total independence between sets, we see that the average of the scalar products between vectors of sets 47M and 47R is zero. This linear independence ensures that there is no linear bias at the start.

Apart from the global estimators of Table 1, it is interesting to examine both matrices A(i,j) and D(i,j) directly, as they actually are vector similarity maps and so show how the vectors could be classified. Both matrices give similar information, so we show here only D(i,j), as these values are the ones involved in the Kohonen’s neural network. Figures 2 and 3a give details of the global data of Table 1; they show the different textures of the similarity matrices D(i,j) for 47R and 47M, the latter presenting much lower distances and a nonrandom distribution of values. The same D(i,j) values are represented in Figure 3b, but with the vectors reordered on both axes to visualize some areas of similarity. The five molecular classes of similarity are shown in the table below, with their distribution among the observed types of secondary structure (numbers indicate molecules as in Figure 1, and D1...D5 stand for the classes in the similarity map; molecules 52, 53, 54, 13, 44, and 45 are not classified, as they are similar neither among themselves nor to the rest).

class           molecules                             D  T  A  C  L
D1(12)          18 81 89 2 23 31 37 39 57 75 90 91    4  2  1  2  3
D2(10)          36 78 80 35 14 92 56 74 79 83         1  2  1  5  1
D3(6)           66 67 69 58 68 93                     0  0  1  4  1
D4(10)          72 8 27 28 29 32 38 42 50 51          5  3  1  1  0
D5(3)           70 71 73                              0  0  0  3  0
not classified  52 53 54 13 44 45                     1  2  3  0  0
The above table shows the distribution between both classifications, indicating that the molecular codification through the Fourier transform FT(s) contains some
Figure 3. (a) Map of the Euclidean distances among the 47 molecular FT(k) vectors, scaled and colored as in Figure 2. The molecules are ordered on both axes as in Figure 1, and the four types of secondary structure have been delimited. Its character as a similarity map seems apparent, with the criterion of short intervector distances for similar molecules. (b) Same as panel a, but with the molecular sequence reordered so as to group vectors with short distances among them and long distances to the others, that is, forming the classes of similarity. Although some overlap occurs owing to classification ambiguity, an indication of five classes can be put forward. In an ideal case of well-resolved classes, the map would appear structured as diagonal boxes, one for each class. (c) Same as panel b, but with the molecular sequence reordered by the position of the dominant charge on the substituents of the pyrazole ring, according to Figure 1: on R3 (19 molecules), on R4 (9 molecules), with no dominant charged site, SQ (14 molecules), on R5 (2 molecules), and with equivalent charges on R3 and R5 (2 molecules). The molecules of each type can be seen in Table 2.
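The reordering used for panels b and c of Figure 3 amounts to applying the same permutation to the rows and columns of D(i,j). The following toy sketch, with four hypothetical one-dimensional "vectors" rather than the molecular data, shows how a suitable order brings out the diagonal boxes mentioned in the caption.

```python
import numpy as np

def reorder(D, order):
    """Apply the same permutation to rows and columns of a distance matrix."""
    order = np.asarray(order)
    return D[np.ix_(order, order)]

# Toy data (assumption): two tight pairs, interleaved in the original order.
points = np.array([0.0, 5.0, 0.1, 5.1])
D_toy = np.abs(points[:, None] - points[None, :])

# Group items 0 and 2 together, then items 1 and 3.
D_blocks = reorder(D_toy, [0, 2, 1, 3])
```

After reordering, the two 2 × 2 diagonal boxes hold the short distances and the off-diagonal boxes hold the long ones.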
Figure 4. Kohonen’s map of molecular similarity after 300 epochs. Numbers indicate the molecules as in Figure 1. The background, in gray scale, shows the activation level (Appendix 2), or reliability of the process, in that zone of the map. The more activated, darker areas correspond to sets of input vectors that are more similar among themselves, and the lighter areas to the contrary. The more populated types of secondary structure, dimers, chains, and trimers, are marked as framed bold, bold, and underlined cursive, respectively. Lines are drawn on the map to separate some secondary structure clusters. It should be pointed out that, owing to the 2D periodicity of the map, molecule 44, for example, is near molecule 39.
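For readers unfamiliar with Kohonen's maps, a generic self-organizing map with the 2D periodicity noted in the caption of Figure 4 can be sketched as follows. This is not the network of Appendix 2: the learning-rate and radius schedules, the Gaussian neighborhood, and the function names are all assumptions chosen for brevity, and the demonstration uses a small map with synthetic data so that it runs quickly.

```python
import numpy as np

def toroidal_offsets(size, bi, bj):
    """Row and column offsets from cell (bi, bj) on a periodic size x size grid."""
    rows, cols = np.indices((size, size))
    dr = np.abs(rows - bi)
    dc = np.abs(cols - bj)
    return np.minimum(dr, size - dr), np.minimum(dc, size - dc)

def best_cell(W, x):
    """Cell whose weight vector W(I,J,k) is closest to the input vector x."""
    dist = np.linalg.norm(W - x, axis=2)
    return np.unravel_index(np.argmin(dist), dist.shape)

def train_som(data, size=20, epochs=300, seed=0):
    """Train a periodic Kohonen map; schedules below are illustrative choices."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    W = rng.uniform(data.min(), data.max(), size=(size, size, data.shape[1]))
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac) + 0.01            # decaying learning rate
        radius = max(0.5 * size * (1.0 - frac), 1.0)  # shrinking neighborhood
        for idx in rng.permutation(len(data)):
            x = data[idx]
            bi, bj = best_cell(W, x)
            dr, dc = toroidal_offsets(size, bi, bj)
            h = np.exp(-(dr ** 2 + dc ** 2) / (2.0 * radius ** 2))
            W += rate * h[:, :, None] * (x - W)
    return W

# Small demonstration with two synthetic clusters (not the 47M set).
rng = np.random.default_rng(1)
demo = np.vstack([rng.normal(0.0, 0.1, (10, 8)), rng.normal(5.0, 0.1, (10, 8))])
W_demo = train_som(demo, size=6, epochs=30)
cells = [best_cell(W_demo, x) for x in demo]
mean_error = np.mean([np.linalg.norm(x - W_demo[c]) for x, c in zip(demo, cells)])
```

The quantity `mean_error` plays the role of the average 〈|FT(k) - W(I,J,k)|〉 distance quoted in the text as a convergence indicator. Because the grid is periodic, two cells on opposite edges of the map are neighbors.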
information on their secondary structure; the question now is whether this information is enough to predict it. The reordering of molecules to form the similarity map of Figure 3b was done following just the criterion of the D(i,j) values. A different way, with more chemical sense, would be to order the molecules in groups by the position of the dominant charge in the substituents of the pyrazole ring, since the FT(s) have been weighted by the charges and these define the hot/cold areas in the Kohonen’s maps; that dominant charge may be at R3, at R4, at R5, at both R3 and R5 (R35, symmetrical), or absent (no dominating charge); Table 2 of the next section shows the molecules of each class. In this way, we obtained Figure 3c. As expected, a classification is possible with this criterion, although some molecules, namely, 2, 66, 67, 74, 92, and 18, seem to be better classified with the symmetrical ones.

Kohonen’s Maps of the Molecular FT(s)

A somewhat more direct and objective way of classifying the 47 FT(s) molecular vectors is through the Kohonen’s neural network. This procedure projects, by a similarity criterion, these 32D FT(s) vectors onto a two-dimensional Kohonen’s map (see Appendix 2).10,13,14 Figure 4 shows the projection of the 47 molecules codified in the set 47M on the 20 × 20 Kohonen’s map after 300 epochs. The process converged, giving an average 〈|FT(k) - W(I,J,k)|〉 distance of 0.0016; this value has to be compared with an average distance of 0.0832 obtained in a similar learning run with the 47 random vectors of the set 47R, in which the W(i,j,k) matrix did not converge. If the FT(s) molecular codification had clearly discriminated the type of secondary structure to be formed in the crystal, the molecules of each type would have formed separate clusters in highly activated
(or hot) areas of the Kohonen’s map; however, Figure 4 shows only an approximation to that ideal situation, and it seems that it is the position of the charges that defines the hot/cold areas. Although they are not well separated, we do observe clusters: four dimers in a cold (low activated) area, six trimers in a hot area (the map is 2D periodic), and three groups of six, four, and four chains in hot areas; thus, this approximation, fuzzy as it appears, would indicate that the molecular FT(s) codification contains some information on the secondary structure type.

The correlation of the clusters on the map with the types of secondary structure given in the literature for these molecules can be estimated by calculating a neighborhood matrix in which each element N(i,j) is the total number of neighboring molecules of type j, summed over all molecules of type i, up to a certain radius in the Kohonen’s map. The matrix is normalized by dividing each element by the product of the numbers of molecules of types i and j in the input set, giving V(i,j). A good correlation would give high values of V(i,i) with respect to the rest of V(i,j), and an estimate of the classification merit of the map can be defined as the value of Vii/〈Vij〉 for each line. The following table is the V(i,j) matrix corresponding to Figure 4, calculated up to a radius of two neighbors; at the beginning of each line is the individual line merit, confirming the best correlation for trimers and chains.

V(i,j) up to Two Neighbors
type  merit  D      T      A      C      L
D     0.778  0.033  0.051  0.013  0.024  0.091
T     2.284  0.051  0.074  0.000  0.015  0.022
A     1.376  0.013  0.000  0.041  0.038  0.057
C     1.486  0.024  0.015  0.038  0.044  0.027
L     0.000  0.091  0.022  0.057  0.027  0.000
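The line merits can be checked directly from the V(i,j) values: taking 〈Vij〉 as the mean over the whole line, diagonal included, reproduces the merit column. The sketch below also shows one possible way to build V(i,j) from map positions; the function names are hypothetical, and the neighborhood test (periodic grid, largest row or column offset no greater than the radius) is an assumption, since the exact definition of "neighbor" is left to the appendix.

```python
import numpy as np

TYPES = ["D", "T", "A", "C", "L"]

# V(i,j) for the unsupervised map of Figure 4, as tabulated in the text.
V = np.array([
    [0.033, 0.051, 0.013, 0.024, 0.091],
    [0.051, 0.074, 0.000, 0.015, 0.022],
    [0.013, 0.000, 0.041, 0.038, 0.057],
    [0.024, 0.015, 0.038, 0.044, 0.027],
    [0.091, 0.022, 0.057, 0.027, 0.000],
])

def line_merit(V):
    """Merit of each line: V(i,i) divided by the mean of line i."""
    return np.diag(V) / V.mean(axis=1)

def neighborhood_matrix(positions, labels, types, size, radius=2):
    """Count neighbors by type on a periodic map and normalize by n_i * n_j."""
    index = {t: k for k, t in enumerate(types)}
    counts = np.zeros(len(types))
    for lab in labels:
        counts[index[lab]] += 1
    N = np.zeros((len(types), len(types)))
    for a, (ra, ca) in enumerate(positions):
        for b, (rb, cb) in enumerate(positions):
            if a == b:
                continue
            dr = min(abs(ra - rb), size - abs(ra - rb))
            dc = min(abs(ca - cb), size - abs(ca - cb))
            if max(dr, dc) <= radius:  # assumed neighborhood criterion
                N[index[labels[a]], index[labels[b]]] += 1
    return N / np.outer(counts, counts)

merit = line_merit(V)
# merit is approximately [0.778, 2.284, 1.376, 1.486, 0.000]

# Toy map: two dimers side by side, one trimer far away.
V_toy = neighborhood_matrix([(0, 0), (0, 1), (5, 5)], ["D", "D", "T"], ["D", "T"], size=10)
```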
However, it is difficult to extract a classification of the molecular vectors from the map of Figure 4, since the neural network process, although it forms clusters of similarity, also spreads the molecules homogeneously over the map, without separating the clusters from each other. One procedure that allows separation into classes is to calculate several Kohonen’s maps of small size, which forces similar molecules to overlap, and then to suppose that two molecules whose FT(s) frequently overlap in different maps belong to the same class. In this way, we have calculated 16 maps, of sizes 4 × 4 and 5 × 5, with the set 47M, and we have built the corresponding overlapping matrix of dimension 47 × 47, where each element O(i,j) is the number of maps in which molecule i overlapped with molecule j. To extract classes from this matrix, we form groups of molecules that all overlap with each other in a number of maps, nm; next, we join groups that have nc or more molecules in common, but only if the remaining noncommon molecules overlap with each other in at least a given number np of maps. We take these groups as classes of a Kohonen’s self-classification; a figure of merit for a class, R, can be defined as the amount of intraclass overlapping found for the molecules in a class divided by the total amount of overlapping found involving these molecules and any other. Setting the conditions for forming a class at nm = 3, nc = 2, and np = 1, we found, among the 16
Kohonen’s maps, the following classes (R in parentheses), where molecule 45 is not classified.

13 44 56 (0.16)
35 53 78 92 (0.33)
70 71 73 (0.92)
14 52 56 66 67 69 74 79 83 (0.79)
13 18 31 54 57 75 81 89 91 (0.96)
2 23 35 36 37 39 78 80 90 (0.92)
8 27 28 29 32 38 42 50 51 58 68 72 93 (0.93)
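The overlapping matrix O(i,j) and the figure of merit R lend themselves to a short sketch. The reading of R used here (overlaps within the class divided by all overlaps involving its members, both summed over the members' rows of O) is one interpretation of the verbal definition, and the full grouping and joining procedure with nc and np is not reproduced. All names and the toy data are hypothetical.

```python
import numpy as np

def overlap_matrix(maps):
    """O(i,j): number of maps in which items i and j fall on the same cell.

    `maps` is a list with one entry per small Kohonen map; each entry lists
    the winning cell of every item, in the same item order.
    """
    n = len(maps[0])
    O = np.zeros((n, n), dtype=int)
    for cells in maps:
        for i in range(n):
            for j in range(i + 1, n):
                if cells[i] == cells[j]:
                    O[i, j] += 1
                    O[j, i] += 1
    return O

def class_merit(O, members):
    """R: intraclass overlap divided by all overlap involving the members."""
    members = list(members)
    intra = O[np.ix_(members, members)].sum()
    total = O[members, :].sum()
    return intra / total if total else 0.0

# Toy example: 4 items projected on 3 small maps.
toy_maps = [
    [(0, 0), (0, 0), (1, 1), (2, 2)],
    [(0, 0), (0, 0), (0, 0), (3, 3)],
    [(1, 0), (1, 0), (2, 2), (2, 2)],
]
O_toy = overlap_matrix(toy_maps)
R_toy = class_merit(O_toy, [0, 1])

# Pairs overlapping in at least nm maps are the seeds for forming groups.
nm = 3
seed_pairs = np.argwhere(np.triu(O_toy >= nm, 1)).tolist()
```

In the toy data, items 0 and 1 overlap in all three maps and so form a class with R = 0.75; the remaining quarter of their overlap is with item 2.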
As mentioned in the previous section, molecules 52, 53, 54, 13, 44, and 45 do not show similarity with any other (see Figure 3b); in addition, the first two groups above have a low figure of merit. So, to adjust the above preclassification, we first eliminate those six molecules; second, taking into account the repeated molecules (13, 56, 35, 78) and following the distribution shown in Figure 3b, we include molecule 92 in the new class K1. We then have the following classes, which include 41 of the 47 original molecules:

K1(10) 2 23 35 36 37 39 78 80 90 92
K2(13) 8 27 28 29 32 38 42 50 51 58 68 72 93
K3(7) 18 31 57 75 81 89 91
K4(8) 14 56 66 67 69 74 79 83
K5(3) 70 71 73
Figure 5. Map of distances D(i,j), as in Figure 3a, but with the molecules reordered by the classes K1, ..., K5 extracted from the reduced Kohonen’s maps (see text). The unclassified molecules 52 and 13 are placed between K4 and K2 and between K3 and K5, respectively; the remaining unclassified molecules, 44, 45, 53, and 54, are placed after K5.

Table 2. (a, b) Distribution of the 47 Molecules in Secondary Structure Types (D, T, A, C, L) and According to the Position of the Dominant Charge in the Pyrazole Ring (R3, SQ, R4, R5, R35)a
Figure 5 shows the matrix of distances D(i,j) among the FT(s) of the 47 molecules, ordered by these classes in the sequence K1, K4, 52, K2, K3, 13, K5, 45, 44, 53, 54, which is comparable to that of the previous Figure 3b. It is interesting to compare the optimized correlation matrices between the different classifications: the one obtained from the 47 × 47 D(i,j) map of the FT(s) molecular vectors, Di; the one obtained from the 16 reduced Kohonen’s maps, Ki; and the observed classification (D, T, A, C, L) codified in 47B. Below are these matrices with the corresponding correlation factors, CF, obtained by permuting rows and columns to get the maximum value, for comparative purposes. They show that classes Di and Ki are quite correlated.

Observed types versus Ki classes (CF = 0.586)
     K1  K2  K3  K4  K5
C     2   2   2   6   3
A     0   2   1   1   0
D     2   5   2   1   0
L     2   1   2   0   0
T     4   3   0   0   0

Observed types versus Di classes (CF = 0.612)
     D4  D1  D2  D3  D5
D     5   4   1   0   0
T     3   2   2   0   0
A     1   1   1   1   0
L     0   3   1   1   0
C     1   2   5   4   3

Di classes versus Ki classes (CF = 0.936)
     K3  K1  K4  K2  K5
D1    7   5   0   0   0
D2    0   5   5   0   0
D3    0   0   3   3   0
D4    0   0   0  10   0
D5    0   0   0   0   3
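The matrix relating the Ki classes to the observed types can be rebuilt from two pieces of information given in the text: the K1...K5 memberships and the numbering ranges of the Figure 1 caption (up to 32 dimers, up to 50 trimers, up to 58 tetramers, up to 83 chains, the rest layers). The sketch below only counts; it does not compute the correlation factor CF, whose formula is not given in this section. Function names are hypothetical.

```python
from collections import Counter

TYPES = ("D", "T", "A", "C", "L")

def observed_type(number):
    """Secondary-structure type from the numbering ranges of Figure 1."""
    if number <= 32:
        return "D"
    if number <= 50:
        return "T"
    if number <= 58:
        return "A"
    if number <= 83:
        return "C"
    return "L"

# Class memberships as listed in the text.
K_CLASSES = {
    "K1": [2, 23, 35, 36, 37, 39, 78, 80, 90, 92],
    "K2": [8, 27, 28, 29, 32, 38, 42, 50, 51, 58, 68, 72, 93],
    "K3": [18, 31, 57, 75, 81, 89, 91],
    "K4": [14, 56, 66, 67, 69, 74, 79, 83],
    "K5": [70, 71, 73],
}

def contingency(classes):
    """Number of molecules of each observed type (D, T, A, C, L) per class."""
    table = {}
    for name, members in classes.items():
        tally = Counter(observed_type(m) for m in members)
        table[name] = [tally.get(t, 0) for t in TYPES]
    return table

table = contingency(K_CLASSES)
# table["K4"] == [1, 0, 1, 6, 0]: one dimer, one tetramer, six chains
```

The counts agree with the Ki matrix above once its rows (C, A, D, L, T) are read in the D, T, A, C, L order used here.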
According to these correlations, it seems interesting to observe, in Table 2, the molecular distribution by secondary structure (D, T, A, C, L) and by the position of the dominant charge (SQ, R3, R4, R5, R35), mentioned in the previous section. Dimers seem to be quite selective for a charge at R4, while trimers are so for no dominant charge; chains prefer it at R3, and tetramers remain undecided between R3 and no charge. Dominant charges at R5 or in symmetrical situations do not appear to be significant in the sample. Table 2 also shows the distribution of Ki classes and Di classes, respectively, versus secondary structure and dominant charge site simultaneously. For example, the five molecules (D or T and R5) are equally classified by Ki or Di, six of them belonging to the same classes K2 or D4. Thus, it can be seen how the charges define the
a Panel a: Different colors show the self-classification in five classes, by considering the molecular overlapping in the Kohonen’s maps. Molecules of class K1 are in squared bold, class K2 in gray bold, class K3 in cursive, class K4 in squared normal, and class K5 in squared-grey normal. Panel b: Different colors show the self-classification in five classes, by considering the 47 × 47 distance matrix D(i,j) of Figure 3b. Molecules of class D1 are in cursive, class D2 in squared bold, class D3 in squared normal, class D4 in gray bold, and class D5 in squared-grey normal.
hot/cold areas of Figure 4: R3 around the hot area centered at molecule 52, SQ at molecule 44, and R4 in the cold area centered at 8; frontier zones are left in between. So it seems that the Kohonen’s classification for secondary structure is influenced by the position of the charges and by the possibility of more than one type of secondary structure being present in the actual packing.

Kohonen’s Maps of the FT(s) Vectors Codifying the Secondary Structures

In this section, we refer to the use of the FT(k) vectors for the 49 secondary structures (49S). The Kohonen’s
map, which is not shown, was calculated in the same way as for the isolated molecules. The corresponding neighborhood matrix V(i,j) up to two neighbors of the Kohonen’s map is shown below and indicates that this FT(s) distribution on the map has a correlation with the observed secondary structures:

V(i,j) up to Two Neighbors
type  merit  D      T      A      C      L
D     2.115  0.066  0.061  0.013  0.016  0.000
T     1.602  0.061  0.074  0.048  0.026  0.022
A     2.133  0.013  0.048  0.122  0.017  0.086
C     1.708  0.016  0.026  0.017  0.055  0.047
L     1.702  0.000  0.022  0.086  0.047  0.080
By using 12 reduced Kohonen’s maps of dimensions 4 × 4 and 5 × 5, we found the following classification KS with the conditions nm = 4, nc = 2, and np = 2, with molecules 66, 67, and 81 being left out. The class figures of merit R are in parentheses at the end of each class:

KS1(11) 8 13 18 27 28 29 31 38 42 50 65 (R = 0.88, without 13)
KS2(13) 2 14 23 32 53 64 74 79 89 90 91 92 93 (R = 0.84, without 79)
KS3(6) 13 29 51 52 54 69 (R = 0.48, without 29)
KS4(6) 45 68 70 71 72 73 (R = 0.89)
KS5(12) 35 36 37 39 44 56 57 58 75 78 80 83 (R = 0.92)
The correlation factor of this classification with the observed secondary structures is 0.612, and the optimized correlation matrix is

     KS2  KS3  KS1  KS4  KS5
T     0    0    3    1    5
C     3    1    1    5    4
A     1    3    0    0    3
D     4    2    7    0    0
L     5    0    0    0    0
Figure 6a is the D(i,j) distance matrix among the 49 FT(k) vectors of the secondary structures, ordered as in Figure 1. Figure 6a is comparable to Figure 3a, although with closer grouping, and also shows a nonrandom distribution of similarity among the vectors. Figure 6b is the distance matrix among the same vectors but ordered by the classes obtained from the 12 Kohonen’s maps mentioned above, KS1, KS3, KS2, KS4, KS5. The fact that layers might contain chains, chain dimers, trimers, or tetramers, and so on, may have some influence on the distribution of KS classes among the secondary structure types. Thus, it seems that the KS classes contain more information than just that corresponding to the secondary structure.

Figure 6. (a) Map of distances D(i,j), as in Figure 3a, but among the FT(k) vectors for the 49 clusters characterizing the secondary structures for each molecule. (b) Same as Figure 6a, but with the 49 FT(k) vectors for secondary structures reordered by the Kohonen classes KSi, in the same way as done in Figure 5 for the molecules. Molecules 66, 67, and 81 could not be classified and are placed after KS5.

Kohonen’s Maps for the 47 FT(k) Molecular Vectors Supervised by Their Observed Secondary Structures, B(t)

To bring out the information on the secondary structure carried by the molecular FT(k), we supervise the molecular classification with information on the actual secondary structure B(t) (see Appendix 1). Figure 7 is the Kohonen’s map of a supervised training (see Appendix 2), superimposed on the activation map, after 300 epochs and under the same conditions as the previous unsupervised training. The process converged, giving an average distance 〈|FT(k) - W(I,J,k)|〉 of 0.0031 (twice the value of 0.0016 obtained without supervision). Lines around some molecules with different observed secondary structure types are shown in Figure 7. They show a better separation of types than those of the unsupervised learning in Figure 4, a fact that can also be seen in the neighborhood matrices. However, the supervision is still not able to join each class of secondary structure completely.

Figure 7. Kohonen’s map of similarity for the 47 molecular vectors FT(k), supervised by the 47B binary vectors codifying the secondary structure. Numbers and shadows of this map correspond to those of Figure 4 for unsupervised learning. We draw lines separating the secondary structure clusters. The more populated types of secondary structure, dimers, chains, and trimers, are marked, respectively, as framed bold, bold, and underlined cursive.

Below is given the neighborhood matrix of the supervised map of Figure 7, showing how the supervision has considerably biased the classification despite the small dimension of 5 for the B vectors with respect to 32 for the M vectors. We will see later that this bias does not appear when trying to supervise the set of random vectors:

V(i,j) up to Two Neighbors
type  merit  D      T      A      C      L
D     3.244  0.116  0.020  0.000  0.024  0.018
T     3.140  0.020  0.123  0.016  0.015  0.022
A     0.000  0.000  0.016  0.000  0.010  0.000
C     2.448  0.024  0.015  0.010  0.098  0.053
L     3.153  0.018  0.022  0.000  0.053  0.160
The Kohonen’s network, with 47 vectors of dimension 32 and random components between -1 and +1, when supervised by the sequences of secondary structure, B(h), had not stabilized even after 500 epochs, giving an average distance of 0.084 (a value close to that of the unsupervised calculation). At this stage, we had the following neighborhood matrix:

V(i,j) up to Two Neighbors
type  merit  D      T      A      C      L
D     1.643  0.066  0.030  0.026  0.024  0.055
T     1.402  0.030  0.074  0.063  0.052  0.044
A     1.089  0.026  0.063  0.041  0.029  0.029
C     1.347  0.024  0.052  0.029  0.053  0.040
L     0.000  0.055  0.044  0.029  0.040  0.000
Not every set of vectors allows supervision. The lack of effect of the supervision by the secondary structure on the random set indicates that the molecular code FT(k) of the M vectors somehow includes information about the secondary structure, thus allowing a significant supervision, although the B(t) vectors contribute only 5 components versus the 32 of the FT(k) vectors; the maximum value of the components of the vectors for both systems is 1.

Supervised Kohonen’s Neural Network Training of a Subset of Molecular FT(k) Vectors. Prediction of the Secondary Structures of the Rest of the Molecular Vectors

From the supervised Kohonen’s map of Figure 7, we have chosen 37 molecules whose FT(k) vectors act as teachers of the synapse matrices, in a training supervised by the corresponding observed secondary structures B(t), and we leave the 10 remaining molecules (18, 29, 39, 44, 56, 58, 71, 73, 78, and 89) as test molecules, using only their molecular FT(k) vectors, to see how the learned matrix predicts their secondary structure B(t), now not included in the network. The test molecules are randomly selected among the different classes shown in Figure 7, corresponding to types of observed secondary structure. Note that this is a hard sort of test, as the 10/37 ratio of test to training molecules can be considered quite high. After 300 epochs of supervised learning, the synapse matrices converged, with an average distance between the FT(k) ∪ B(t) vectors and the W1(i,j,k) ∪ W2(i,j,t) column vectors of 0.00013. Figure 8 shows that the distribution of classes, among the 37 learning molecules,
Figure 8. Kohonen’s map for the selected 37 molecules to teach the Kohonen’s matrices. These matrices are used to predict the secondary structure of the 10 remaining molecules used as test. The test input vectors, preceded by “t”, are located in their predicted position by the similarity criterion; the teaching molecules around indicate their probable secondary structure. Areas of teaching molecules with the same assigned secondary structure are surrounded by lines, as in Figure 7. The more populated types of secondary structure, dimers, chains, and trimers are marked, respectively, as framed bold, bold, and underlined cursive.
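The prediction step described in the caption of Figure 8 can be sketched as a lookup: a test vector is located on the map using only the molecular part W1 of the learned matrix, and the supervision part W2 stored at that cell is read off as the predicted secondary structure. The toy 2 × 2 map, the three-component "FT" vectors, and the function names below are hypothetical; the dispersion is taken as a plain standard deviation over repeated runs, which is an assumption about how the authors estimated it.

```python
import numpy as np

TYPES = ["D", "T", "A", "C", "L"]

def one_hot(t):
    """Binary vector B for a type, from (10000) for dimers to (00001) for layers."""
    b = np.zeros(len(TYPES))
    b[TYPES.index(t)] = 1.0
    return b

def predict(W1, W2, ft):
    """Locate ft on the map with W1 only; return the cell and its W2 vector."""
    dist = np.linalg.norm(W1 - ft, axis=2)
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return (int(i), int(j)), W2[i, j]

def average_predictions(runs):
    """Mean and dispersion of the predicted vectors over repeated trainings."""
    runs = np.asarray(runs, dtype=float)
    return runs.mean(axis=0), runs.std(axis=0)

# Toy learned matrices (not trained here): 2 x 2 map, 3-component inputs.
W1 = np.zeros((2, 2, 3))
W1[0, 0] = [1.0, 0.0, 0.0]
W1[0, 1] = [0.0, 0.0, 1.0]
W1[1, 0] = [0.0, 0.0, -1.0]
W1[1, 1] = [0.0, 1.0, 0.0]
W2 = np.zeros((2, 2, 5))
W2[0, 0] = one_hot("D")
W2[0, 1] = one_hot("T")
W2[1, 0] = one_hot("A")
W2[1, 1] = one_hot("C")

cell, predicted = predict(W1, W2, np.array([0.9, 0.1, 0.0]))
predicted_type = TYPES[int(np.argmax(predicted))]

# Two hypothetical runs that disagree (dimer in one, tetramer in the other).
mean_pred, spread_pred = average_predictions([one_hot("D"), one_hot("A")])
```

Averaging over runs turns the binary answers of individual maps into a distribution over the five types, which is how fractional values with large dispersions can arise when a test molecule sits between clusters.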
is comparable to that of Figure 7, where the 47 molecules were used. Note, however, the dispersion of tetramers. The test vectors FTt(k) were located on the Kohonen’s map at KM(I,J) by looking for the closest column vector W1(I,J,k) of the learned matrix, the average being 〈|FTt(k) - W1(I,J,k)|〉 = 1.101. Their predicted secondary structures are W2(I,J,h), the average of their differences from the assigned secondary structure B(h) being 〈|B(h) - W2(I,J,h)|〉 = 0.561. By inspection of Figure 8, we found that the secondary structures of molecules 18, 29, 39, 44, 71, 73, and 78 are more or less well predicted, but not those of 56, 58, and 89. However, if we ignore the division lines, the test molecules are in general located between secondary structure clusters, like molecules 29, 44, or 56. Taking this into account, together with the fluctuation in the relative location of molecules among different Kohonen’s maps, which start from different random matrices, we decided to do some statistical analysis by repeating the network calculation 32 times. The following table shows the final average values of the W2(I,J,h) vectors as the predicted secondary structure for the test molecules, with an estimate of their dispersion (in parentheses, referred to the last digits of the average).

W2(I,J,h)
test molecule  18        29        39        44        56        58        71        73        78        89
D              0.69(44)  0.12(30)  0.15(33)  0.18(36)  0.05(15)  0.32(42)  0.04(10)  0.00(00)  0.01(02)  0.99(04)
T              0.03(10)  0.89(29)  0.90(29)  0.53(49)  0.45(48)  0.28(40)  0.02(05)  0.04(18)  0.24(38)  0.00(01)
A              0.51(45)  0.00(02)  0.03(17)  0.06(18)  0.00(01)  0.02(10)  0.02(04)  0.01(05)  0.00(00)  0.25(36)
C              0.03(15)  0.01(05)  0.19(35)  0.09(23)  0.68(42)  0.77(37)  1.00(00)  1.00(00)  0.89(30)  0.01(01)
L              0.20(38)  0.06(13)  0.06(20)  0.48(46)  0.28(40)  0.17(36)  0.02(06)  0.00(00)  0.11(26)  0.20(36)
observed type  D         D         T         T         A         A         C         C         C         L
The columns of the above table become the secondary structure components for each test molecule, which should be compared to the assigned binary vectors B(h), going from (10000) for dimers to (00001) for layers.
Values in bold in the table are those that should be one, with the rest of the values zero, for a prediction in agreement with the observed secondary structure. According to the table, then, the probable secondary structure for molecule 18 would be a dimer or, with lower probability, a tetramer, or, even less likely, layers. The table shows that, according to the published descriptions of the secondary structures, prediction is good for chains 71, 73, and 78, for dimer 18, and for trimer 39. The case of molecule 89, expected to form just layers and predicted as a dimer, led us to look into the published description; we then noticed that the overall hydrogen-bond system was in layers, but that within it there was a strong subsystem of hydrogen-bonded dimers. Indeed, the layers do not appear until a much higher ratio of donor-acceptor distance to the sum of van der Waals radii (0.58 for the strong O-H···N versus 0.72 for the weaker N-H···N). In the case of molecule 44, just after the system of trimers appears, it results in a scheme of “dimers of trimers”, which in the next step forms layers. This is a different sequence from that shown by molecule 39, in which the trimers form columns but not layers. Molecules 56 and 58 do indeed form tetramers, although these interpenetrate strongly, forming columns with all types of strong noncovalent interactions. Besides, molecules 56 and 58 present a prediction difficulty, as shown in Figure 8: the teacher tetramers 51, 52, 53, 54, and 57 are quite dispersed in this map and form no class, which probably prevents prediction. No trimers whatsoever could be found in the case of molecule 29, but its crystal structure is not clearly defined: it contains an ordered system of dimers, and it also presents a substructural segregation of disordered chains and dimers. For molecule 18, dimers clearly appear first, developing into layers.
Molecules 71 and 73 clearly appear first as chains, which then develop into the whole crystal. For molecule 78, two dimers first form the chains, which develop into the crystal. Thus, as we have already noticed in the section dealing with the FT(k) vectors for clusters of secondary structure, it seems that the molecular FT(k) codification includes more information than just that of the secondary system of hydrogen bonding; information on other noncovalent interactions seems to be present. There are no polymorphs among the studied molecules with which to see the effect of a molecule with a tendency toward several structural types, but molecule 14/79, with two conformations and two types of secondary structure, is an even more extreme case. In the molecular map (Figure 4), these conformations lie close to each other, and they appear in the same classes D2 and K4; they then become separated in the supervised maps of Figures 7 and 8.

Conclusions

The molecular codification by means of the 1D Fourier transform, FT(s), of the three-dimensional distribution of charges might allow a classification of N molecules by similarity of their corresponding FT(k) vectors. The classification is carried out either directly, by using the N × N distance matrix among all vectors, which is a similarity map, or by using Kohonen’s neural network, which projects the N molecules by similarity
on a two-dimensional Kohonen’s map. The classes of molecules obtained correlate with the secondary structures of the hydrogen-bonded clusters of molecules observed in the crystals. This correlation, although low in some cases, indicates that the FT(k) molecular codification includes some information on the possible secondary structure of the molecule in a crystal structure. A classification supervised by the observed secondary structure increases that correlation drastically, even though the supervision vector has only dimension 5, compared to dimension 32 for the FT(k) vectors. As a check, we used N vectors of dimension 32 with random components, which obviously cannot be classified. The important result, however, seems to be that the supervised classification of these random vectors, with the same secondary structures used for the molecular FT(k), failed, showing that this supervision works only for vectors that already contain information about the secondary structure. The supervised classification of 37 training molecules allows the building of a learned network matrix that provides a significant prediction of the secondary structure for the remaining 10 test molecules. The network predicts a distribution of probable secondary structures among the observed types, and the analysis of these predicted distributions, by inspection of the crystal structures, shows that the FT(k) molecular codification includes more intermolecular information than just the secondary structure. The secondary structure type of hydrogen-bonded molecules is involved in the rest of the noncovalent molecular interactions forming the crystal, which are difficult to analyze independently of it. We show that it is quite likely that an isolated molecule contains information on its own probable clustering interactions, enough to allow their prediction.
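The supervised classification and prediction summarized here, and described in Appendix 2, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the program used in this work: the Gaussian neighborhood, the linear decay of learning rate and radius, the averaging of the accumulated corrections, and the prediction rule (find the winning cell from the FT part alone, then read the structure components stored there) are all assumptions, and the function names are hypothetical.

```python
import math
import random

def torus_dist2(a, b, size):
    """Squared distance between cells a and b on a map periodic in both directions."""
    dx = min(abs(a[0] - b[0]), size - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), size - abs(a[1] - b[1]))
    return dx * dx + dy * dy

def best_cell(w, vec, size, n_used):
    """Cell (I, J) whose column vector is closest to vec on the first n_used components."""
    best, best_d = None, float("inf")
    for i in range(size):
        for j in range(size):
            d = sum((w[i][j][k] - vec[k]) ** 2 for k in range(n_used))
            if d < best_d:
                best, best_d = (i, j), d
    return best

def train(vectors, size=20, epochs=50, seed=0):
    """Train on joined vectors (FT part followed by B part), one batch update per epoch."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Initially all synapses are random.
    w = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(size)]
         for _ in range(size)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # decreases as the epoch increases
        rate = 0.5 * frac                      # assumed learning-rate schedule
        radius = max(1.0, (size / 2) * frac)   # assumed neighborhood radius
        delta = [[[0.0] * dim for _ in range(size)] for _ in range(size)]
        for vec in vectors:
            win = best_cell(w, vec, size, dim)
            for i in range(size):
                for j in range(size):
                    # Correction decreases as (i, j) moves away from the winner.
                    h = math.exp(-torus_dist2((i, j), win, size) / (2 * radius ** 2))
                    for k in range(dim):
                        delta[i][j][k] += rate * h * (vec[k] - w[i][j][k])
        # Apply the corrections accumulated over the epoch (averaged, by assumption).
        for i in range(size):
            for j in range(size):
                for k in range(dim):
                    w[i][j][k] += delta[i][j][k] / len(vectors)
    return w

def predict(w, ft, size, n_types=5):
    """Locate the winning cell from the FT part only; return its structure components."""
    i, j = best_cell(w, ft, size, len(ft))
    return w[i][j][-n_types:]
```

In this sketch the first 32 components of each column play the role of W1(i,j,k) and the last five that of W2(i,j,t), so a test molecule, for which only FT(k) is known, is assigned the five structure components stored at its winning cell.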
A good molecular classification according to, for example, secondary structure depends on how much this property is implied in the molecular codification. It therefore seems necessary to continue exploring molecular codifications, perhaps through new ways of weighting the molecular FT(k) rather than using just one set of atomic charges, which, although giving a correct trend in indicating the secondary types, does not seem to be accurate enough. The point seems to be that the whole molecular surface is involved in the packing, so information on several levels of secondary structure of the clusters may be present. Further work is in progress.

Acknowledgment. This work was supported by the Spanish DGICYT, under project number BQU2000-0868.

Appendix 1: Molecular and Structure Codifications. Data Sets. To codify the molecules by the 1D-FT of their three-dimensional distributions of charges, we used the system Cerius2,12 first to assign electrostatic charges $Q_i$ to the atoms of the molecules (of the available schemes, we used the one giving the largest charge differences, to amplify their effects) and then to create two sets of atomic coordinate files from the structures found in the crystals, one containing the isolated molecules and another containing representative clusters
of hydrogen-bonded molecules, formed by two molecules for dimers D, three for trimers T, four for tetramers A, six for chains C, and 12 for layers L. Vectors $FT(s) = \sum_i \sum_j Q_i Q_j \sin(r_{ij} s)/(r_{ij} s)$ were calculated at 32 values k of s, between 2.5 and 33.5 for isolated molecules and between 0.6 and 10 for the clusters of secondary structure, since the information on longer distances lies at smaller values of s. We tested the FT with different total ranges and different intervals to choose parameters that do not lose information. We noticed the overdependence of FT(s) on short distances by considering the two conformers of NIBFIN, which cocrystallize, each one forming a different secondary structure: the FT(s) of both molecules, when calculated for all interatomic distances, were too close to account for those structural differences. Hence, only $r_{ij}$ greater than 2 Å were used to calculate any molecular FT(s). We considered intramolecular interatomic distances for molecules and only intermolecular interatomic ones for clusters. Each vector FT(s) was scaled so that max(|FT(s)|) = 1, so all components lie between ±1. NIBFIN (see Figure 1) presents two conformers with two types of secondary structure, dimers for number 14 and chains for 79, so both are codified as different clusters and as different molecules. In the structure of DIGFEE, two types of chains can be distinguished, and they are codified as 65 and 66. GUFKIB presents overall layers, 89, but with a conspicuous chain system, 64. So we are left with 49 vectors for the secondary structures, set 49S, and 47 vectors for molecules, set 47M. For supervised learning in a Kohonen’s network, molecules were also codified by binary vectors B(h), (10000), (01000), ..., (00001), according to their secondary structure type, D, T, A, C, or L (set 49B). Finally, to have a “zero” reference, we generated 47 vectors of dimension 32 with random coordinates between -1 and 1 (set 47R).
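The molecular codification described above can be sketched in a few lines. This is a minimal illustration, not the Cerius2 workflow: it assumes evenly spaced values of s (the spacing is not stated above), takes coordinates in ångströms and charges as plain lists, and covers only the isolated-molecule case, where all intramolecular pairs with $r_{ij}$ greater than 2 Å contribute. The function names are hypothetical.

```python
import math
import random

def ft_vector(coords, charges, s_min=2.5, s_max=33.5, n=32, r_min=2.0):
    """FT(s) = sum_i sum_j Q_i Q_j sin(r_ij s)/(r_ij s) at n values of s.

    Pairs with r_ij <= r_min are skipped and the vector is scaled so that
    max |FT(s)| = 1. Even spacing of s is an assumption of this sketch.
    """
    s_values = [s_min + k * (s_max - s_min) / (n - 1) for k in range(n)]
    ft = [0.0] * n
    m = len(coords)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            r = math.dist(coords[i], coords[j])
            if r <= r_min:          # drop short distances, as in the text
                continue
            qq = charges[i] * charges[j]
            for k, s in enumerate(s_values):
                ft[k] += qq * math.sin(r * s) / (r * s)
    peak = max(abs(v) for v in ft)
    return [v / peak for v in ft] if peak > 0 else ft

def random_vectors(count=47, dim=32, seed=0):
    """Random reference set in the spirit of 47R: components between -1 and 1."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]

# Toy three-atom "molecule" with made-up coordinates and charges.
toy = ft_vector([(0, 0, 0), (3, 0, 0), (0, 4, 0)], [0.5, -0.3, -0.2])
```

For clusters, the same sum would run only over pairs of atoms belonging to different molecules, with s between 0.6 and 10.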
Appendix 2: Kohonen’s Neural Network. Figure 9 represents a Kohonen’s neural network; the cube at the right is the learning matrix of synapses W(i,j,k), which connects the 32 neurons of a vector FT(k) with the neurons of the 2D Kohonen matrix KM(i,j), at the top of the cube. KM(i,j) is the map onto which we project, by similarity, our 47 FT(k) vectors. To better understand the process, we consider the 3D matrix W(i,j,k) as formed by column vectors perpendicular to KM(i,j), as shown in Figure 9. For each input vector FT(k), the most similar column vector W(I,J,k) is found as the one with the lowest FT(k) - W(I,J,k) distance. The modifications of all column vectors W(i,j,k) are then calculated; they depend on the distance of each column to FT(k) and decrease as (i,j) moves away from (I,J) or as the epoch number increases. After one epoch, the total W(i,j,k) modifications, accumulated over every input vector, are applied to the synapses for the next epoch. Initially, all synapses are random. At the end, if the process converges, there are column vectors W(I,J,k) similar to the input vectors FT(k) and also similar to the neighboring columns. In this way, the FT(k) vectors are projected onto their KM(I,J) place of the Kohonen’s map, by similarity between them. Superposition of highly similar
Figure 9. Scheme of a Kohonen’s neural network, showing one input vector FT(k) and the 3D learning matrix W(i,j,k). When the process is finished, each input vector is projected onto the KM(I,J) cell of the two-dimensional Kohonen’s map (at the top of the matrix), W(I,J,k) being the column vector of the matrix most similar to FT(k).
molecules on the map is possible, depending on the map size. Altogether these projections form the so-called Kohonen’s map, periodic in two directions and of a reasonable chosen size, 20 × 20 in the present case, so that all molecules can be distinguished on the map and their neighborhoods visualized before class formation. When the learning is successful, that is, when the matrix values stay stable during the final epochs, the input vectors are distributed over the map by similarity, so classes can be defined by delimiting neighborhood areas. In supervised learning, instead of using just the molecular FT(k) vectors to form the Kohonen’s map, each of these vectors is extended with the corresponding vector B(t) representing its secondary structure; as there are five types, we use vectors of dimension 32 + 5. Two joined synapse matrices are now trained, one for each codification, W1(i,j,k) and W2(i,j,t). In this way, the resulting KM(I,J) point in the final map represents a molecule plus its assigned secondary structure, in our case with a relative weight of 32 to 5. After learning, an activation map $ACT(i,j) = \sum_m W(i,j,k) \cdot FT_m(k)$ can be calculated over the Kohonen’s map. Areas with high activation are more reliable for classification and prediction: the corresponding W(i,j,k) vectors are more representative of the majority of the surrounding FT(k) data, these vectors being more similar among themselves.

References
(1) Lommerse, J. P. M.; Motherwell, W. D. S.; Ammon, H. L.; Dunitz, J. D.; Gavezzotti, A.; Hofmann, D. W. M.; Leusen, F. J. J.; Mooij, W. T. M.; Price, S. L.; Schweizer, B.; Schmidt, M. U.; Van Eijck, B. P.; Verwer, P.; Williams, D. E. Acta Crystallogr. 2000, B56, 697-714.
(2) Motherwell, W. D. S.; Ammon, H. L.; Dunitz, J. D.; Dzyabchenko, A.; Erk, P.; Gavezzotti, A.; Hofmann, D. W. M.; Leusen, F. J. J.; Lommerse, J. P. M.; Mooij, W. T. M.; Price, S. L.; Scheraga, H.; Schweizer, B.; Schmidt, M. U.; Van Eijck, B. P.; Verwer, P.; Williams, D. E. Acta Crystallogr. 2002, B58, 647-661.
(3) Beyer, T.; Lewis, T.; Price, S. L. CrystEngComm 2001, 44, 1-35.
(4) Mooij, W. T. M.; Leusen, F. J. J. Phys. Chem. Chem. Phys. 2001, 3, 5063-5066.
(5) Sarma, J. A. R. P.; Desiraju, G. R. Cryst. Growth Des. 2002, 2, 93-100.
(6) Infantes, L.; Foces-Foces, C.; Claramunt, R. M.; López, C.; Jagerovic, N.; Elguero, J. Heterocycles 1999, 50, 227-233.
(7) (a) Infantes, L. Crystalline Packing Modes of Pirazole-NH Monocycles, Ph.D. Thesis, Universidad Autonoma de Madrid, 2000. (b) Foces-Foces, C.; Alkorta, I.; Elguero, J. Acta Crystallogr. 2000, B56, 1018-1028.
(8) Infantes, L.; Motherwell, S. Struct. Chem. 2004, in press.
(9) Allen, F. H.; Davies, J. E.; Johnson, O. J.; Kennard, O.; Macrae, C. F.; Mitchell, E. M.; Mitchell, G. F.; Smith, J. M.; Watson, D. G. J. Chem. Inf. Comput. Sci. 1991, 31, 187-204.
(10) Fayos, J.; Cano, F. H. Cryst. Growth Des. 2002, 2, 591-599.
(11) Gasteiger, J.; Marsili, M. Tetrahedron 1980, 36, 3219-3228.
(12) CERIUS2 Version 4.2, Mat. Sci., Molecular Simulations Inc., 240/250 The Quorum, Barnwell Road, Cambridge, England CB5 8RE, 2000.
(13) Kohonen, T. Neuroscience 1976, 2, 1065.
(14) Kohonen, T. Self-Organizing Maps, Springer-Verlag: Berlin, 1995; Series in Information Sciences, Vol. 30.
CG049903K