Crystal-Packing Prediction by Neural Networks

Jose Fayos* and F. H. Cano
Departamento de Cristalografia, Instituto Rocasolano, CSIC, Serrano 119, Madrid-28006, Spain

CRYSTAL GROWTH & DESIGN 2002 VOL. 2, NO. 6 591-599

Received May 24, 2002

ABSTRACT: In this work we propose the use of a neural network to predict the crystal packing mode of small organic molecules from just their 3-D molecular structure. A sample of 31 molecules, of quite different chemical characters and known crystal structures, has been employed. These molecules, encoded by the 1-D Fourier transform of their 3-D point-charge distributions, are used as input to a Kohonen neural network. Although no molecular packing information is given, the resulting similarity output maps self-classify the molecules according to the type of H-bonding pattern they present or, when no H-bonds exist, according to their observed molecular packing mode. The corresponding crystal packings of these molecules were encoded similarly, by the Fourier transform of a finite cluster of molecules built from their known crystal structures. The self-classification of these encoded packings on the Kohonen map correlates well with the classification found for the isolated molecules, and both also correlate well with the visually observed types of packing in the crystal structures. Thus, it seems that the Fourier transform of an isolated molecule includes enough packing information to allow its classification into packing modes. Finally, the same neural network, trained with part of the set of 31 molecules and supervised with their crystal packings, is used to predict the encoded packing of molecules not included in the training, in order to classify them into a mode of packing.

Introduction

The ab initio prediction of crystal packing, based on knowledge of just the molecular structure, is of great interest from both basic and applied points of view. However, the crystallization process is very complex, being a cooperative process which seems to be highly nonlinear; in addition, we have to take into account factors such as the kinetic and thermodynamic conditioning, the solvent composition, and the possibility of polymorphism.
Moreover, there are not yet good models for the noncovalent interactions involved in crystallization, as in most chemical events. This is probably the reason for the poor results obtained up to now in the prediction of packing, as shown for instance in the important work recently done by 11 crystallographic labs trying to predict the crystal packing of four simple molecules,1 although some improvements have been made in related works.2,3 In the meantime, the Cambridge Structural Database (CSD),4 although biased toward molecules of interest, has grown to include a quarter of a million molecules and their crystals, which leads to the idea that the CSD could contain enough information for the statistical prediction of molecular packing. This implies the general assumption that a molecule is associated with a preferred packing mode, although it is not so evident that a correlation exists between classes of molecules and their modes of packing, whatever sort of molecular classification is used. From the classical crystallographic point of view, the molecular crystal packing is described by the position and orientation of the molecule in the unit cell plus the crystallographic space group, which generates the equivalent molecules. However, it is known that drastic

* To whom correspondence should be addressed. Tel: 34-915619400. Fax: 34-91-564243. E-mail: [email protected].

changes in both unit cell and space group can produce almost identical molecular packings; therefore, this representation does not really describe the packing mode. Thus, a qualitative visual packing classification, although subjective, is usually more useful. Looking for a quantitative and more objective representation of molecular packing, we thought of the powder diffraction pattern (PDP), observed or calculated, as it is immune to small molecular movements in the crystal, despite the fact that these movements could cause great changes in the unit cell or space group. In addition, the PDP can be considered the fingerprint of the crystal, and in many instances it gives information on cell dimensions, space-group symmetry, and even the atomic structure. On the other hand, any molecular classification rests upon selecting some criterion for setting the equivalence: that is, an encoding that will choose some characteristics and reject others, thus imposing an external bias. This selection seems to be critical; therefore, we think that, to sharpen the possible correlation between molecules and their packing, the molecular characteristics most likely conditioning the packing have to be included in the molecular encoding, and molecular characteristics should be included in the encoded packing. Following the idea of using the PDP as representative of the molecular packing, we encoded the molecules, and also their packings, by their 1-D Fourier transforms (FT), as will be shown in the next section. To encode the packing, we used a small cluster of about 40 molecules instead of the infinite crystal; thus, we have a broader distribution of values in the FT. Then, we used a neural network to classify the encoded molecules and packings, as has been done elsewhere for predicting IR spectra from molecules.5 The encoding by FT also tries to solve the difficulty of putting all the different

10.1021/cg025530g CCC: $22.00 © 2002 American Chemical Society Published on Web 09/12/2002

Figure 1. The 31 molecular structures, ordered according to the neural network classification with Omin = 4. Classes are named M4(i/6), where i is the number of the class, 6 is the total number of classes found, and 4 is the value of Omin; L4 stands for the molecules not classified at the Omin = 4 level. The subjectively observed packing modes, from visual inspection of CERIUS outputs (see below), are given in parentheses under the molecule numbers h. A-E are H-bonded structures: A, 3-D network; A′, 2-D layers; B, strips of tetramers-dimers; C, zigzag chains; D, extended chains; E, isolated dimers. S, P, T, and Z denote structures with no H-bonds: S, highly symmetrical packing; P, parallel molecular chains; T, molecular packing in “T” shapes; Z, molecular packing in “Z” shapes.

molecules on an equal (one-dimensional) base representation, to allow their comparison and classification, although at the cost of losing 3-D spatial information. The FT encoding also has the advantage of being a holistic representation, which seems to be convenient for neural networks, as it reinforces their parallel-distributed processing.6 Once classifications were accomplished from this encoding, packing prediction could be possible, if there is a correlation between molecular classes and packing classes.

Results and Discussion

1. Molecular and Packing Encoding. We have encoded each molecular 3-D structure, as it appears in the crystal, into one n-dimensional vector, with components taken from the Fourier transform (FT), Im(sk) = ∑i≠j QiQj sin(rij·sk)/(rij·sk), for all the interatomic intramolecular distances rij, weighted by the point atomic charges Qi,7 with sk being the reciprocal discrete variable: a sampling with discrete values of the scattering variable 4π(sin θ)/λ.5 Thus, we made the reasonable hypothesis that the atomic charges especially condition the molecular packing. Although not reported below, weighting by atomic numbers was also checked, giving poorer results. Analogously, we have encoded the crystal packing by the FT, Ip(sk), of the cluster of molecules within 2 × 2 × 2 adjacent unit cells, such that their centers of mass are within this cell range, rij now being the interatomic intermolecular distances between all atoms belonging to different molecules. Notice that this encoding includes, indirectly, the intramolecular information. Due

Table 1. The 31 Molecules Selected for the Analysisa

 h   name                  formula   cluster
 1   acetanilide           C8NOH9      64
 2   cinnamide             C9NOH9      32
 3   adipic acid           C6O4H10     39
 4   alanine               C3NO2H7     32
 5   aniline               C6NH7       72
 6   aspartic acid         C4NO4H7     64
 7   benzamide             C7NOH7      32
 8   benzoic acid          C7O2H6      32
 9   β-succinic acid       C4O4H6      39
10   caprolactam           C6NOH11     64
11   citric acid           C6O7H8      32
12   naphthalene           C10H8       39
13   anthracene            C14H10      39
14   biphenyl              C12H10      39
15   quinone               C6O2H4      39
16   benzophenone          C13OH10     32
17   cubane                C8H8        27
18   adamantane            C10H16      35
19   ethanol               C2OH6       32
20   glucose               C6O6H12     40
21   glycine               C2NO2H5     16
22   hexamine              C6N4H12     35
23   ice-cubic             OH2         80
24   ice-hexagonal         OH4×0.5     32
25   imidazole             C3N2H4      32
26   n-octadecane          C18H38      27
27   n-octane              C8H18       27
28   terephthalic acid 1   C8O4H6      27
29   terephthalic acid 2   C8O4H6      27
30   toluene               C7H8        72
31   urea                  CN2OH4      24

a The numbering, h, is the molecular reference in this work, and under the cluster column we give the number of molecules in the crystal cluster used for the packing description (i.e. all molecules within 2 × 2 × 2 unit cells).

to the periodicity of the crystal, there appear to be no differences in the FT when using more unit cells. Other types of encoding, counting distances just from atoms in the central molecule to atoms in the surrounding ones, or just distances between centroids of molecular clusters, were also tried; although not reported below, they again gave worse results. Now, to go into detail, we have retrieved the 31 organic molecules shown in Table 1 (schematically in Figure 1), and their respective packings, from the CERIUS files;8 this set, we think, forms quite a diverse sample from the chemical point of view. The molecular Fourier transform vectors, Im,h(sk), for 1 ≤ h ≤ 31, were calculated for 32 values of sk in the range 2.5-15.0 Å-1; the corresponding packing vectors, Ip,h(sk), were also calculated for 32 points, but in the range 0.6-10.0 Å-1, as for longer distances the information resides at lower s values. These ranges should provide enough information to define the continuous FT and also avoid the huge peak near the origin. For every molecule, both the Qi values and the components of the Im,h(sk) and Ip,h(sk) vectors were scaled to unity, such that for any vector v(scaled) = v/|vi(max)|. It is interesting, as concerns the significance of the possible classification, to test the orthogonality (that is, the independence) of the 31 I(sk) vectors to be input in the neural network. We calculate that the average value of the dot products for the normalized vectors IN is 〈INm,h(sk)·INm,h′(sk)〉h≠h′ = 0.43 (range 0.24-0.56), while 〈INp,h(sk)·INp,h′(sk)〉h≠h′ = 0.20 (range 0.13-0.24); thus, overall, the encoded packings form a more independent set than the molecular set. Moreover, in order to test

Figure 2. Self-classification of the molecules by a Kohonen map of size 15 × 15 with 2-D periodicity. The numbers stand for the molecules (h), and the lettering gives the corresponding observed packing mode. The molecular classes M4(i/6), also shown in Figures 1 and 3, are surrounded by lines. They form three superclasses: water molecules, H-bonded packings, and packings with no H-bonds. The encoded molecules, Im,h(sk), are situated on this map such that the differences |∆m,h| = |Im,h(sk) - W(I,J,k)| are close to zero.

any initial similarity between each encoded molecule-packing pair, we have calculated the average 〈INm,h(sk)·INp,h(sk)〉h, which is 0.13, lower than the values mentioned above and ruling out any significant overall initial similarity between molecules and packings.

2. Neural Network Classification. To establish an equivalence criterion, or similarity measure, among the molecular FT vectors or among the packing FT vectors, we decided to classify them through the cyclic self-association algorithm of a Kohonen neural network.5,9,10,11 We also tried, although it is not reported here, a back-propagation neural network,6,9,10,11 but it in fact produced poorer results. However, we will see that the Kohonen network procedure is able to classify the vectors by similarity in their 32-dimensional space, reducing the high-dimensional complexity of the analysis to just two dimensions, those of the so-called Kohonen map, M(i,j), used for the self-classification of the 31 encoded molecules. Each input vector Ih(sk) is, at the end of the procedure, projected onto a position of the map, M(Ih,Jh), as shown in Figure 2. In Appendix 1 we explain how the Kohonen neural network operates. The distribution of molecules in the map reflects their similarity: the higher the proximity, the higher the similarity of their vectors, and thus a group of nearby molecules defines a class of equivalence. To assess the proximity relations, it is necessary to take into account the double periodicity of the maps and their resolution, given by the number, n × n, of points chosen for the maps. Although high n values avoid molecular (or packing) overlapping, they give rise to uniform distributions of molecules (or packings) on the maps, which makes it difficult to separate them into classes; therefore, they were not used in the delimitation of classes.
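The FT encoding of section 1 and the dot-product independence test above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the coordinates and point charges are toy values, and the paper's actual charge scheme (ref 7) is not reproduced here.

```python
import numpy as np

def ft_encode(coords, charges, s_grid):
    """Encode a 3-D point-charge distribution as the 1-D transform
    I(s_k) = sum_{i != j} Q_i Q_j sin(r_ij s_k) / (r_ij s_k),
    then scale the vector to unity, v / |v(max)|, as in the text."""
    coords = np.asarray(coords, float)
    charges = np.asarray(charges, float)
    n = len(coords)
    r = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    i, j = np.nonzero(~np.eye(n, dtype=bool))          # all pairs with i != j
    rij, qq = r[i, j], charges[i] * charges[j]
    I = np.array([np.sum(qq * np.sin(rij * s) / (rij * s)) for s in s_grid])
    return I / np.abs(I).max()

def mean_offdiag_dot(vectors):
    """Average pairwise dot product <v_h . v_h'> (h != h') of the
    unit-normalized vectors; values near zero indicate a nearly
    orthogonal (independent) set."""
    V = np.asarray(vectors, float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # normalize each row
    G = V @ V.T                                        # Gram matrix of dot products
    return G[~np.eye(len(V), dtype=bool)].mean()       # drop the diagonal
```

For the molecular vectors, the text samples 32 sk values in 2.5-15.0 Å-1, e.g. `s_grid = np.linspace(2.5, 15.0, 32)`.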
In Appendix 2 we describe the way used to define the classes from the Kohonen maps and to measure the degree of coherence of each class. Large maps, e.g. 15 × 15, were used in the figures just to show the distribution of classes on the map. The Kohonen algorithm has been applied to classify the 31 molecules Im,h(sk) and the 31 packings Ip,h(sk) independently, for


Figure 3. Scheme of the clustering of molecular classes when lowering Omin. Numbers refer to the molecules h, capital letters to the observed packing modes, and the nomenclature for the classes is given in the legend of Figure 1. Below each class is given its coherence value, as defined in Appendix 2.

the purpose of comparing and classifying the 62 vectors together in the same map. We have also done a molecular classification supervised by their packings, where 31 vectors Im,h(sk1)∪Ip,h(sk2) of dimension 64 were classified on the map. For prediction purposes we select a representative subset of molecules (in the sense that every previously determined molecular class has a representative item in the selected subset) and classify them supervised by the corresponding packings. In this way the net is “taught” and is then used to predict the FT-encoded packing of other test molecules, not used in the “learning” process, by localizing those most similar to them in the trained Kohonen map (see Appendix 1). 3. Classification Results. In Figure 2 we present the classification without supervision of the 31 molecular vectors Im,h(sk), identified on the map by the index h given in Table 1. The classes, described by the lines drawn enclosing these indices, correspond to the classifications given by the overlapping maps with 4 or more coincidences, Omin = 4, in 10 overlapping maps (see Appendix 2). They are referred to as classes M4(i/6), i = 1-6, in Figure 1, where the molecules are ordered following this classification, and in Figure 3, where the classification evolution can be seen as Omin decreases to 3, 2, and 1. Both figures show the seven molecular vectors left unclassified, named L4. The values given in Figure 3, after the names of the classes, are the measures of the coherence of the classes, as described in Appendix 2. It seems worthwhile to follow the evolution of the classes on going from Omin = 4 to 1. Although some classes remain stable, there is a tendency for fewer items to be left without classification and for the number of classes to diminish; thus, the classes become more populated.
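The 64-component supervised inputs described above are simply the concatenation of each molecular FT vector with its packing FT vector; a sketch (the 31 × 32 array shapes are assumed from the text):

```python
import numpy as np

def supervised_inputs(Im, Ip):
    """Concatenate each molecular FT vector Im,h (32 points) with its
    packing FT vector Ip,h (32 points) into one 64-component vector
    Im,h ∪ Ip,h, used to train the supervised Kohonen map."""
    Im, Ip = np.asarray(Im, float), np.asarray(Ip, float)
    return np.concatenate([Im, Ip], axis=1)
```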
Analogously, Figure 4 shows the unsupervised classification of the 31 packing vectors Ip,h(sk), with the classes corresponding to Omin = 2; they are referred to as P2(i/6) in Table 2, where there also appears an unnamed group, for h = 19, 20, with a low coherence value. In Figure 5 we show the classification of Im,h(sk) and Ip,h(sk) as a set of 62 independent vectors. The aim of this joint classification is to show the independence (already mentioned in the encoding section) between the

Figure 4. Kohonen map as in Figure 2 but for the packing classes P2(i/6); here also |∆p,h| = |Ip,h(sk) - W(I,J,k)| ≈ 0. The boldface numbers indicate the relative position (and thus the classification) of the predicted packing vectors for the test molecules (see Prediction of Encoded Packing).

vectors encoding molecules and those encoding their packings; we can see an inner molecular group and the packings out in a separate area. In addition, it can be observed that each molecule, h, is far from its packing, h + 50, confirming that their encodings are not similar. The Kohonen map for the molecules, Im,h(sk), supervised by the corresponding packings, Ip,h(sk), is not shown, but the resulting classification MSP2(i/6), corresponding to Omin = 2 and shown in Table 2, is similar to that obtained for the P2(i/6) packings alone (see also Figure 4). To have a reference against which to assess the above classifications, we have tried to classify an “unclassifiable” set formed by 24 orthogonal vectors of dimension 24: (1,0,...), (0,1,...), ..., (0,...,1). For this, we used 10 (5 × 5) Kohonen maps. With Omin = 4 we obtained no classification at all; with Omin = 3 or 2, only 10 vectors were classified, into 4 classes, but with more connections outside than inside each class, and so with very low coherence (see Appendix 2). At the Omin = 1 level, the 24 vectors were grouped into just one class. Thus, the set proved


Table 2. Packing Classesa

(a) P2(i/6): 26 molecules classified; 5, 15, 19, 20, and 22 without classification
P2(1/6):   3, 8, 9, 10, 16, 28, 29  (1.15)
P2(2/6):   1, 2, 7, 11, 25, 31      (1.21)
P2(3/6):   23, 24                   (1.00)
P2(4/6):   4, 6, 21                 (1.27)
P2(5/6):   12, 13, 14, 30           (0.93)
P2(6/6):   17, 18, 26, 27           (1.03)
(unnamed): 19, 20                   (0.35)

(b) MSP2(i/6): 30 molecules classified; molecule 22 without classification
MSP2(1/6): 1, 2, 5, 7, 11, 15, 19, 20, 25, 31  (0.90)
MSP2(2/6): 8, 10, 16, 28, 29                   (1.00)
MSP2(3/6): 4, 23, 24                           (1.00)
MSP2(4/6): 6, 21                               (1.23)
MSP2(5/6): 12, 13, 14, 18, 30                  (1.28)
MSP2(6/6): 17, 26, 27                          (0.80)
(unnamed): 3, 9                                (0.30)

a Part a gives the packing classes at the Omin = 2 level, with the molecules included in each class and the coherence index given in parentheses. Part b is the same as part a except for the molecular classification supervised by their respective packings. The nomenclature for classes is given in Figure 1.

Table 3. Correlation between Molecular Classes M4(i/6) and Packing Classes P2(i/6), with Different Values of Omin To Balance the Number of Classes in Both Setsa

P2(1/6):  M4(1/6): 3, 9, 10, 28 (D, D, E, D);  M4(2/6): 8, 29 (E, D);  L4: 16 (A)
P2(2/6):  M4(1/6): 1, 25, 31 (C, C, A);  M4(2/6): 2, 7, 11 (B, B, A)
P2(3/6):  M4(3/6): 23, 24 (A, A)
P2(4/6):  M4(4/6): 4, 6, 21 (A, A, A)
P2(5/6):  M4(5/6): 12, 14, 30 (T, Z, Z);  L4: 13 (T)
P2(6/6):  M4(5/6): 17 (S);  M4(6/6): 18, 26, 27 (S, P, P)
L2:       L4: 5, 15, 19, 20, 22 (C, A′, C, A, A)

a The numbers refer to the 31 chosen molecules, h, and the capital letters to their observed packing types. There is clearly grouping around the main diagonal, indicating high correlation between both classifications. The distribution of the visually observed packing types, given by the capital letters, also indicates their correlation with both classifications.

Figure 5. Joint classification of molecular Im,h(sk) and packing Ip,h(sk) vectors, showing the independence of their encoding. Im,h(sk) are numbered as in Table 1, from 1 to 31, while Ip,h(sk) are numbered here from 51 to 81, to clearly distinguish them.

to be unclassifiable, giving significance to our molecular and packing classifications. 4. Discussion of the Classifications. The capital letters in parentheses in Figure 1 and the capital letters in Figures 2-4 indicate the observed packing modes for the 31 molecules: that is, a qualitative visual packing

classification from a CERIUS display. Letters A-E represent hydrogen-bonded secondary structures of decreasing complexity, while the letters P, S, T, and Z represent different packing modes for molecules with no H-bonds. The correlation between the unsupervised molecular classes and the observed packing modes is evident in Figures 1-3, although it may be better appreciated in Table 3. From just the molecular information, there appears a clear distinction between molecules with and without possible H-bonding interactions. The last two classes separate the packing modes Z and P in M4(5/6) and M4(6/6); Figure 3 shows how all structures without H-bonds are finally classified in the one class M3(4/4). In addition, there appears to be a sort of grouping of the observed packing modes for the H-bonded structures; the classes M4(1,2,4/6) have molecules with analogous chemical groups and become just one class, M2(1/3) (see Figure 3), when the overlapping criterion is lowered. The similar clathrate packings of the two water molecules form one class, M4(3/6). It is interesting to notice that the Kohonen self-classification includes molecules 15, 16, and 22, with weak hydrogen interactions, among those with conventional hydrogen bonds. Thus, it could be said that the neural network has detected molecules able to form weak interactions. Molecules 15 and 16, with CH···O distances of 3.38,


3.50, and 3.51 Å, with good geometry (values less than 1.10 times the sum of the van der Waals radii), have 2-D and 3-D H-bond packings and are classified with the conventional hydrogen interactions in M2(1/3), while molecule 22, with longer CH···N distances of 3.8 Å, could not be classified but stays close to the H-bond areas in Figures 2 and 4. Analogously, in Table 3 there appears a correlation between the visually observed packings and the packing self-classification by the Kohonen maps, and so it seems that the network procedure is not very different from the one we used to classify packings by visual inspection. As was said in part 1 of this section, the set of molecules is quite broad from the chemical point of view, and so is the corresponding packing set. Therefore, the neural network gives a correspondingly coarse classification, as shown in Table 3. The presence/absence of H-bonds is quite evident again; there is a clear group formed by the observed B-C types and another formed by the D-E types; observations of type A, assigned to 3-D H-bonds, spread over the classes and include a class for the two water packings, somehow indicating the rough criteria employed to establish the observed packing. Structures with no H-bonds are separated into two classes, P2(5/6) for T and Z packing types and P2(6/6) for S and P packing types. Finally, in Table 3 there appears a clear correlation between unsupervised molecular classes and unsupervised packing classes, independent of the estimation of the observed packing modes: molecular classes M4(1,2/6) pack in packing classes P2(1,2/6), M4(4/6) in P2(4/6), M4(3/6) in P2(3/6), and nearly M4(5/6) in P2(5/6) and M4(6/6) in P2(6/6); moreover, almost the same molecules are left unclassified in both cases.
The molecular classification supervised by their packings (see Appendix 1), shown in Table 2b, is similar to the aforementioned unsupervised packing classification P2(i/6); however, the supervision has improved the classification, as the classes MSP2(i/6) now contain 30 molecules, with molecules 5, 15, 19, and 20 being included in the class MSP2(1/6). The presence of the aforementioned correlation implies that there is common information in the two sets of encoded vectors, Im,h(sk) and Ip,h(sk), which was not apparent in the measure of their linear independence or in the joint Kohonen map shown in Figure 5. It is obvious that some molecular information is included in the encoded packing; on the other hand, it has been shown that the network classifies in much the same way as a visual estimation. What is not so obvious is the inclusion of packing information in the Im,h(sk) molecular encoding, but somehow it appears to be there, as the molecular classes correlate with the visual modes of packing and with the packing classes from the network. Hence, the encoded isolated molecules contain some information about their possible packing, and we think this is the promising point for packing prediction. 5. Prediction of Encoded Packing. With the unsupervised Kohonen neural network self-classification of a set of n vectors, Im,h′(sk), of encoded molecules, together with the classification map Mm(i,j), we get the trained molecular matrix Wm(i,j,k), as explained in Appendix 1. To include in this classification a new


Figure 6. Kohonen map of size 15 × 15, showing the self-classification corresponding to the training of the learning matrices with the 16 training molecules representing the M2(i/3) classes (in plain lettering), supervised with their packings. The predicted positions (I,J) for the 15 test molecules (h′′), in boldface lettering, are situated so as to give minimal differences |∆m,h′′|, their corresponding Wp(I,J,k) values becoming their predicted encoded packings. Capital letters represent the observed packing types.

molecule not used for that training, h′′, the procedure is to find the Wm(I,J,k) column vector closest to the input vector Im,h′′(sk); then Mm(I,J) will place the test molecule into the above classification. If we make a supervised training using the pairs Im,h′(sk)/Ip,h′(sk), we also get a second trained packing matrix, Wp(i,j,k); then, the “unknown packing” for a new test molecule, Im,h′′(sk), can be predicted by locating Mm(I,J) and then the corresponding pair vector Wp(I,J,k), which is in fact the predicted encoded packing for the test molecule. This vector, besides being a quantitative result, can be used to classify the unknown packing on a map of previously classified packings, Mp(i,j). To test the packing prediction capacities of the Kohonen neural network, we used 16 training molecules (h′) out of the set of 31; the other 15 molecules (h′′) were left out for testing. The training molecules were selected among those well-centered in the areas of the map of Figure 2 corresponding to the classes M2(i/3). The resulting classification map is presented in Figure 6, where the old classes M2(i/3) have been marked (the associated network matrices being Wm(i,j,k) and Wp(i,j,k), as described in Appendix 1). On this map the 15 test molecules are situated at the (I,J) points such that |∆m,h′′(I,J,k)| = |Im,h′′(sk) - Wm(I,J,k)| is minimum, this difference estimating the molecular fit to the class, and the difference |∆p,h′′(I,J,k)| = |Ip,h′′(sk) - Wp(I,J,k)| being the predicted packing agreement. We have found the following values of |∆m,h′′(I,J,k)| and |∆p,h′′(I,J,k)|, in this order, for our 15 test molecules: molecule 1, (0.6, 2.6); molecule 6, (0.8, 1.6); molecule 15, (1.1, 3.5); molecule 11, (0.7, 3.0); molecule 16, (1.7, 2.5); molecule 21, (0.6, 0.9); molecule 25, (1.1, 2.5); molecule 28, (0.7, 2.6); molecule 29, (0.9, 2.1); molecule 24, (0.3, 2.2); molecule 19, (1.0, 2.5); molecule 12, (0.5, 2.0); molecule 18, (1.2, 2.3); molecule 26, (1.0, 0.9); molecule 22, (2.7, 3.0).
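The prediction step described above reduces to a nearest-column lookup in the trained matrices; a sketch, where Wm and Wp stand for the trained molecular and packing matrices of Appendix 1 (the training itself and the map periodicity are omitted here):

```python
import numpy as np

def predict_packing(im_vector, Wm, Wp):
    """Locate the map position (I, J) whose molecular weight column
    Wm[I, J, :] is closest (Euclidean) to the encoded test molecule,
    and return the paired packing column Wp[I, J, :] as the predicted
    encoded packing. Wm and Wp have shape (n, n, k)."""
    d = np.linalg.norm(Wm - im_vector, axis=-1)   # |Im - Wm(i,j,:)| over the map
    I, J = np.unravel_index(np.argmin(d), d.shape)
    delta_m = d[I, J]                             # molecular fit |Delta_m|
    return (I, J), delta_m, Wp[I, J, :]           # predicted encoded packing
```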
As we tried to minimize the first difference, ∆m, its values are lower than those found for the second difference, ∆p. Figure 7 shows the encoded vectors for


Figure 7. Examples of the prediction of encoded packing, with medium agreement, for the very different packing modes of adamantane (18), aspartic acid (6), and terephthalic acid (29). The y axis gives I(sk) values in arbitrary units, and the x axis gives the k step numbers in the s range. On the left is shown the molecular Fourier transform; on the right the corresponding observed packing transforms, Ip,h(sk), drawn as continuous lines, are compared with the packings predicted by the network, Wp(I,J,k), drawn as dashed lines.

some test molecules and their observed visual packing, together with their predicted packing: namely, molecule 18 with no H-bonds, molecule 6 with 3-D H-bonds, and molecule 29 with H-bond chains. The test molecules 29 and 28 are polymorphs, and the network predicts an average packing for both. As a matter of visual inspection, both polymorphs consist of almost identical layers of chains, the only difference being the type of sliding of one layer with respect to the other, which changes the phenyl-phenyl interactions between layers. The overall prediction can be analyzed in Figure 6, by checking where the test molecules are situated within the trained classes and seeing that they appear inside or near the expected areas corresponding to their M2(i/3) classes. In the same way, the predicted positions for the packings of the test molecules can be checked (bold numbers) in the unsupervised map of packings shown in Figure 4, trained with the 31 molecules. Both water molecules, 23 and 24, pack in two diamond-like forms, cubic and hexagonal: the first of the OH2 molecules


(used for training) and the second of the disordered OH4 molecules (used for testing). As the Im(s) values of the two molecules are almost identical, the predicted packing for 24 is very close to both observed packings, for 23 and 24. In general, the observed and predicted encoded packings fit better for small values of sk, showing that both structures overlap better at low resolution, although some packing refinement would still be needed for adjustment.

Conclusions

The encoding of molecules by the Fourier transform of their 3-D point-charge distribution allows their self-classification through a Kohonen neural network, with each class of molecules having a similar packing mode. The same encoding, applied to a crystalline cluster of molecules, allows a self-classification of packings, in the same way as that obtained for isolated molecules. The two classifications are correlated with each other and are also in agreement with a previous visual classification of the packing modes. Hence, the molecular FT encoding includes enough packing information to allow some approach to the prediction of the mode of packing from the isolated molecule. Moreover, each FT encoding includes not only individual characteristics but also some characteristics in common with other molecules, those which allow the classification. In any case, we think it should be remarked that, in this procedure, the predicted molecular packing is not the crystal structure but its one-dimensional FT. As a matter of fact, what we are presenting here, in addition to the aforementioned prediction, is a similarity/difference criterion to allow the classification of packing into modes: that is, a criterion for establishing when two packings are similar or different, independent of any visual perception.
Although it is difficult to derive the actual crystal structure from the encoded predicted packing, the latter is an approximation of the powder diffraction pattern, from which some information on the unit cell and symmetry could be obtained. Moreover, the PDP reflects the topological characteristics of the packing better than drastic symmetry criteria do. There is also the possibility of obtaining the 3-D structure from an accurate PDP.12 On the other hand, concerning the important question of how to encode molecules so as to involve as much packing-determining information as possible, weights other than the point charges could be explored for the FT, as for instance the parameters used in QSAR studies.

Appendix 1. The Kohonen Neural Network

The Kohonen neural network5,9,10,11 maps the 31 input vectors of 32 components, Im,h(sk) or Ip,h(sk), onto the two-dimensional map M(i,j) (e.g. 15 × 15), which is periodic in the two dimensions of the map. Within that map, proximity relationships indicate the similarity of vectors and so allow self-classification by similarity. The mapping is done through the three-dimensional learning matrix W(i,j,k), with the (i,j) column vectors corresponding to the positions M(i,j), as shown in Figure 8. Initially, at cycle (or learning epoch) zero, the 3-D matrix elements have random values between ±1.

Figure 8. Scheme of a Kohonen neural network, showing one input vector I(sk) and the learning matrix W(i,j,k). Each input vector is projected onto the Kohonen map at the point M(I,J), where W(I,J,k) is the matrix column vector most similar to I(sk).

Cycle 1 starts modifying these values by checking the first vector Im,1(sk) against all the matrix column vectors (i,j) and selecting the column such that the difference |∆1(I,J,k)| = |Im,1(sk) - W(I,J,k)| is minimum (with a Euclidean metric). This selection gives rise to a modification by δ1W(i,j,k) of all the matrix elements (as described below), depending on the current epoch and on the proximity of the current column (i,j) to the selected one (I,J). The process goes on, calculating all modifications, δhW(i,j,k), from all vectors in the input set but without yet applying them. An epoch ends once all the input vectors have been compared; then the sum of all modifications of every matrix element is added to it (so-called simultaneous or parallel learning). Epochs are repeated until the process converges, that is, until the ∆ values, and thus the δ values, go to zero, indicating that the input data can be classified. At this stage, a position on the map M(I,J) can be assigned to each input vector Im,h(sk), namely that corresponding to the most similar matrix column W(I,J,k), although some input vectors may end up overlapped in the same M(I,J) position. Thus, molecules are self-classified on the map by similarity, expressed as proximity. The matrix modification for every input h is δhW(i,j,k) = Ke·Kd·∆h(i,j,k), with Ke = [(amax - amin)·f + amin] being a factor depending on the current epoch, and Kd = [(bmax - bmin)·f·(1 - d/(dmax + 1)) + bmin] a factor that depends on the current epoch and on d, the proximity of the current matrix column (i,j) to (I,J), as measured by the Euclidean distance. The parameters ai and bi control the relative variations of both factors; f = (tmax - t)/(tmax - 1) is the epoch decay factor, with t counting the epochs up to tmax; dmax is fixed to the maximum possible d value, taking into account the periodicity of the maps.
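As an illustration, the parallel (batch) learning scheme described above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' own software: the function name `train_kohonen` and the default parameter values are ours, the map distance d is taken as the Euclidean distance on the periodic (toroidal) map, and the neighborhood factor is read as Kd = (bmax - bmin)·f·(1 - d/(dmax + 1)) + bmin, one plausible parsing of the printed formula.

```python
import numpy as np

def train_kohonen(inputs, n=15, t_max=300,
                  a_max=0.8, a_min=0.05, b_max=0.95, b_min=0.05, seed=0):
    """Batch ("simultaneous or parallel") Kohonen learning, per Appendix 1.

    inputs : array of shape (n_vectors, k), e.g. 31 vectors of 32 components.
    Returns the trained learning matrix W of shape (n, n, k).
    """
    rng = np.random.default_rng(seed)
    k = inputs.shape[1]
    # Cycle zero: matrix elements random between -1 and +1.
    W = rng.uniform(-1.0, 1.0, size=(n, n, k))
    # Maximum toroidal (periodic) distance between two map cells.
    d_max = np.hypot(n // 2, n // 2)

    for t in range(1, t_max + 1):
        f = (t_max - t) / (t_max - 1)          # decay epoch factor
        Ke = (a_max - a_min) * f + a_min       # epoch-dependent factor
        delta_sum = np.zeros_like(W)
        for x in inputs:
            diff = x - W                       # Delta(i,j,k) for every cell
            # Winner (I,J): column vector closest to x (Euclidean metric).
            dist2 = (diff ** 2).sum(axis=2)
            I, J = np.unravel_index(dist2.argmin(), dist2.shape)
            # Toroidal distance d of every cell (i,j) to the winner (I,J).
            di = np.minimum(abs(np.arange(n) - I), n - abs(np.arange(n) - I))
            dj = np.minimum(abs(np.arange(n) - J), n - abs(np.arange(n) - J))
            d = np.hypot(di[:, None], dj[None, :])
            Kd = (b_max - b_min) * f * (1.0 - d / (d_max + 1.0)) + b_min
            delta_sum += Ke * Kd[:, :, None] * diff
        W += delta_sum                         # apply all updates at once
    return W
```

After training, each input vector is assigned the map position of its most similar column, so vectors that share a cell (or neighboring cells) are judged similar.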
However, the values of ai and bi do not seem critical for the present results, even when other nonlinear decays are employed for Ke or Kd. The network process has been performed using local software (available on request), and we have worked with tmax = 300 epochs, enough for convergence, with 0.5 < amax < 1.0, 0.0 < amin < 0.1, 0.9 < bmax < 1.0, and


0.0 < bmin < 0.1. The same learning procedure has been applied to self-classify the input vectors Im,h(sk) or Ip,h(sk). For the supervised classification, two learning matrices for molecules and packings, Wm(i,j,k1) and Wp(i,j,k2), are used, and the joint vector Im,h(sk1)∪Ip,h(sk2) of dimension 64 is used to look for the closest column (I,J) in the corresponding total matrix Wm∪p(i,j,k1...k2) of dimensions (imax = jmax = n) n × n × (2 × 32), formed by stacking Wm(i,j,k1) on top of Wp(i,j,k2).

Appendix 2. Classification From Kohonen Maps

At a given resolution, n × n, the resulting Kohonen maps have the input molecules distributed over points M(I,J), those corresponding to the final W(I,J,k) matrix columns most similar to them. The higher the n value, the more extended the molecular distribution and the less the overlapping of input molecules on the map positions. To establish the classification, we have to group the molecules by their proximity in the maps, but this proved difficult, as they were spread quite uniformly. Therefore, we tried several procedures to distinguish classes. We first used extended maps to produce either a continuous density map or a sort of informational-entropy map based on this density distribution; in either case, the frontiers between classes appeared quite fuzzy. Then we decided to use several Kohonen output maps, about 10, obtained with different network parameters at low resolution, n ≈ 5, so as to force the overlapping of some molecules in a position M(I,J) of the map; the differences between the resulting maps were mostly due to their different starting random 3-D matrices. If two molecules overlap frequently, they are similar. Thus, we produced an overlapping matrix, 31 × 31 (31 input data), such that O(h,h′) is the number of maps in which molecule h overlaps with molecule h′.
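The overlap-counting step can be sketched as follows. The sketch assumes the winning cell M(I,J) of each molecule is already available for every low-resolution map; the function names `overlap_matrix` and `classes_at_level` are ours, and the grouping shown is a simple connected-components approximation of the paper's "closed group" criterion, which additionally weighs overlaps outside the group.

```python
import numpy as np

def overlap_matrix(position_lists):
    """O(h, h') = number of maps in which molecules h and h' fall on the
    same cell M(I, J).  `position_lists` holds one list of (I, J) winner
    cells per map, each of length n_molecules."""
    n_mol = len(position_lists[0])
    O = np.zeros((n_mol, n_mol), dtype=int)
    for positions in position_lists:
        for h in range(n_mol):
            for h2 in range(h + 1, n_mol):
                if positions[h] == positions[h2]:
                    O[h, h2] += 1
                    O[h2, h] += 1
    return O

def classes_at_level(O, O_min):
    """Group molecules whose pairwise overlap count reaches O_min, using
    connected components of the thresholded overlap graph (union-find)."""
    n_mol = O.shape[0]
    labels = list(range(n_mol))
    def find(a):
        while labels[a] != a:
            labels[a] = labels[labels[a]]  # path compression
            a = labels[a]
        return a
    for h in range(n_mol):
        for h2 in range(h + 1, n_mol):
            if O[h, h2] >= O_min:
                labels[find(h)] = find(h2)
    groups = {}
    for h in range(n_mol):
        groups.setdefault(find(h), []).append(h)
    # Singleton groups are "left outside the classification".
    return [g for g in groups.values() if len(g) > 1]
```

Raising `O_min` splits the data into more, smaller, but internally more homogeneous classes, mirroring the threshold behavior described in the text.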
Using this matrix and setting a threshold value Omin(h,h′) = 4, 3, 2, or 1 for the minimum number of overlaps required, we can form groups of molecules, that is, classes, and define an internal consistency, which is high when all molecules in the group overlap with other molecules of the group at least Omin times and overlap with few molecules outside it. These "closed" groups are to be considered classes of the molecular classification at the Omin level. The higher Omin, the more classes are obtained and the more molecules are left outside the classification, but each class is composed of more homogeneous


data. The coherence of each class can be measured as ∑g(nin,g - nout,g)/(ng·nmaps), where the summation runs over each molecule g of the class, formed by a total of ng molecules; molecule g overlaps with nin,g molecules of its class and with nout,g molecules outside it; nmaps is the number of maps calculated (10 in this case). The factor nin,g - nout,g indicates the individual contribution of molecule g to the coherence of the class. A Kohonen map of high resolution, e.g. 15 × 15, does not present any overlapping, and although it does not clearly separate the classes, it does not mix them either; therefore, this type of map is used to present the proximity (similarity) relationships among the molecules of a class, obtained by the aforementioned procedures.

Acknowledgment. This work was supported by the Spanish DGICYT, under Project No. BQU2000-0868.

References

(1) Lommerse, J. P. M.; Motherwell, W. D. S.; Ammon, H. L.; Dunitz, J. D.; Gavezzotti, A.; Hofmann, D. W. M.; Leusen, F. J. J.; Mooij, W. T. M.; Price, S. L.; Schweizer, B.; Schmidt, M. U.; Van Eijck, B. P.; Verwer, P.; Williams, D. E. Acta Crystallogr. 2000, B56, 697-714. Motherwell, W. D. S.; Ammon, H. L.; Dunitz, J. D.; Dzyabchenko, A.; Erk, P.; Gavezzotti, A.; Hofmann, D. W. M.; Leusen, F. J. J.; Lommerse, J. P. M.; Mooij, W. T. M.; Price, S. L.; Scheraga, H.; Schweizer, B.; Schmidt, M. U.; Van Eijck, B. P.; Verwer, P.; Williams, D. E. Acta Crystallogr. 2002, B58, 647-661.
(2) Mooij, W. T. M.; Leusen, F. J. J. Phys. Chem. Chem. Phys. 2001, 3(22), 5063-5066.
(3) Sarma, J. A. R. P.; Desiraju, G. R. Cryst. Growth Des. 2002, 2, 93-100.
(4) Allen, F. H.; Davies, J. E.; Johnson, O. J.; Kennard, O.; Macrae, C. F.; Mitchell, E. M.; Mitchell, G. F.; Smith, J. M.; Watson, D. G. J. Chem. Inf. Comput. Sci. 1991, 31, 187-204.
(5) Schuur, J. H.; Selzer, P.; Gasteiger, J. J. Chem. Inf. Comput. Sci. 1996, 36, 334-344.
(6) McClelland, J. L.; Rumelhart, D. E. Explorations in Parallel Distributed Processing; M.I.T. Press: Cambridge, MA, 1994.
(7) Gasteiger, J.; Marsili, M. Tetrahedron 1980, 36, 3219-3288.
(8) CERIUS2 Version 4.2 Mat.Sci.; Molecular Simulations Inc.: 240/250 The Quorum, Barnwell Road, Cambridge, England CB5 8RE, 2000.
(9) Zupan, J.; Gasteiger, J. Anal. Chim. Acta 1991, 248, 1-30.
(10) Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design, 2nd ed.; Wiley-VCH: Weinheim, Germany, 1999.
(11) Gasteiger, J.; Zupan, J. Angew. Chem., Int. Ed. Engl. 1993, 32, 503-527.
(12) Altomare, A.; Cascarano, G.; Giacovazzo, C.; Guagliardi, A.; Burla, M. C.; Polidori, G.; Camalli, M. J. Appl. Crystallogr. 1994, 27, 435-436 (http://www.netsci.org/Resources/Software/Struc/sirpow.html).

CG025530G