Molecular Crystal Prediction Approach by Molecular Similarity: Data Mining on Molecular Aggregation Predictors and Crystal Descriptors

Jose Fayos*

CRYSTAL GROWTH & DESIGN 2009 VOL. 9, NO. 7 3142–3153

Departamento de Crystalografía, Instituto Rocasolano, CSIC, Serrano 119, Madrid 28006, Spain

Received October 7, 2008; Revised Manuscript Received April 16, 2009

ABSTRACT: Assuming that isolated molecules contain information about their self-aggregation into the condensed state, molecules that are similar with respect to that information should form similar crystals. This assumption is used in this work to predict molecular crystal packing. To compare isolated molecules, they are codified by equi-dimensional vectors that include, as molecular aggregation predictors, the electric potential distribution on the surface of the molecular inertial ellipsoid with axes (L, M, S) and some other molecular form parameters. The corresponding molecular crystals are codified by equi-dimensional vectors that include, as packing descriptors, the cell axes (a, b, c) in decreasing order a > b > c, the cell/ellipsoid axes quotients (a/L, b/M, c/S), and some crystal symmetry information. The molecular vectors and the crystal vectors of 31 molecules containing an azol group are separately classified by similarity and, although both classifications are diffuse, they are shown to be correlated, indicating that an appropriate selection of aggregation predictors and packing descriptors was made. Finally, given an isolated molecule, even a new theoretical one, this correlation allows the prediction of the packing descriptors of the probable crystal that the molecule would form, in particular the crystal cell dimensions, with an average relative error of 18%.

Introduction

As long as the causes we consider determine observable effects, similar causes will produce similar effects. In this paper, we take as causes some molecular aggregation predictors and as effects some crystal descriptors that depend on those molecular properties. Under these considerations, similar molecules will form similar crystals, which allows crystal packing prediction by molecular similarity. The success of the prediction then depends on an appropriate selection of both molecular and crystal properties.
About 98% of molecules show a single polymorph [1] in the Cambridge Structural Database (CSD) [2], which seems to indicate that molecules prefer a specific aggregation mode. Although we are aware that different polymorphs can sometimes coexist under the same crystallization kinetics, we can in general assume that an isolated molecule contains the information on how it will aggregate under certain conditions. This is the hypothesis behind most approaches to the crystal packing prediction (CPP) of a molecule, and it is also our hypothesis in the present work. Because of its obvious interest, there are many publications relating molecular properties to crystal structure, from the well-known relationship between molecular form and crystal aggregation [3] to two more recent articles with the same suggestive title, “Are crystal structures predictable?” [4,5]. However, CPP is far from being solved, because the molecular aggregation process seems to be complex, cooperative, and nonlinear, which makes it difficult to simulate by computer. Even so, CPP has been approached by simulating molecular aggregation with lattice energy minimization methods using standard or observed [6] (for similar molecules) intermolecular potentials, for example, in the so-called “blind tests”, where several crystal laboratories competed to predict the crystal structure of some simple molecules [7-11]. These calculations are a big advance for CPP, especially after the recent results of the last CSP2007 [10], where improved force fields gave much better results. However, even for small and almost rigid molecules, they mostly provide, for each molecule, several possible theoretical polymorphs with quite similar packing energies, so that it can be difficult to identify the observed polymorph even when it is among those proposed.

Besides CPP by molecular aggregation simulation, there are other approaches to obtaining information about the crystal formed by a molecule; the papers by Pidcock and Motherwell [12-15] are of special interest to us, and some of their results are used in our work. They define the molecular ideal pattern coefficients, called p_c from now on, as [(a|L)/L, (a|M)/M, (a|S)/S], where L, M, and S are the large, medium, and short principal axes of inertia (PAI) of the molecular inertial ellipsoid, and a|L is the crystal cell axis most nearly parallel to L, and so on. The authors [12] found that the distribution histogram of these p_c's for more than 20 000 structures in the CSD shows three wide peaks around 0.86, 1.43, and 2.62 (we will call them standard p_c's), indicating that the molecular PAI delimit the possible values of the cell axes. The histogram of p_c's calculated with the cell axes in random order with respect to the PAI axes gave no peaks, validating the previous result. As an alternative to CPP by minimizing the molecular packing energy, we propose here a molecule/crystal data mining of the observed CSD [2] database, to uncover the above-mentioned relationships between a molecule and its packing, or other unexpected relations implicit in the database. The correlation between possible molecular classes and aggregation classes could then be used to predict, by molecular similarity, the unknown crystal associated with a given isolated molecule, even for new theoretical molecules. This CPP approach could also be useful for selecting the best polymorph among the equi-energetic options obtained after minimizing the packing energy.

* To whom correspondence should be addressed. Tel: 34-91-5619400; fax: 34-915642431; e-mail: [email protected].
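The pattern coefficients p_c defined above can be illustrated with a short sketch. It assumes Cartesian cell axis vectors and unit vectors along the PAI axes as inputs; the function name and argument layout are hypothetical, and the rule used to pick the "most parallel" cell axis (largest absolute cosine) is an assumption, not necessarily the procedure of ref 12 or of RPLUTO.

```python
import numpy as np

def pattern_coefficients(cell_vectors, pai_axes, pai_lengths):
    """Sketch of p_c = [(a|L)/L, (a|M)/M, (a|S)/S].

    cell_vectors: 3x3, rows are the cell axis vectors in Cartesian Å.
    pai_axes:     3x3, rows are unit vectors along L, M, S.
    pai_lengths:  the lengths (L, M, S) in Å.
    """
    cell_vectors = np.asarray(cell_vectors, float)
    pai_axes = np.asarray(pai_axes, float)
    cell_lengths = np.linalg.norm(cell_vectors, axis=1)
    cell_units = cell_vectors / cell_lengths[:, None]
    # |cos| of the angle between each PAI axis (rows) and each cell axis (columns).
    cosines = np.abs(pai_axes @ cell_units.T)
    # Assumption: each PAI axis independently takes its most parallel cell axis,
    # so two PAI axes may select the same cell axis in skewed cases.
    most_parallel = np.argmax(cosines, axis=1)
    return cell_lengths[most_parallel] / np.asarray(pai_lengths, float)
```

For an orthorhombic cell with axes of 10, 6, and 4 Å and a molecule whose L axis lies along the 6 Å axis, the sketch returns the ratio 6/L as the first coefficient.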
To uncover molecule/crystal correlations, in previous data mining work [16] we first codified both the molecule and a representative molecular cluster of its crystal by the one-dimensional (1D) Fourier transform (FT) vectors of their three-dimensional (3D) atomic charge distribution. For a group of chemically diverse molecules, both the molecular and the cluster vector sets were self-classified by Kohonen [17] neural networks (KNN), and some correlations were found between both classifications, allowing molecular aggregation prediction by molecular similarity. The predicted molecular cluster FT can be considered as a smoothed powder diffraction pattern (PDP) that would define the crystal structure, the PDP being immune to small packing modifications even though these could produce drastic alterations in the space group or in the cell dimensions. In a second data mining work [18], some molecules containing the azol group were also codified by their FT, and these were found to be correlated with their secondary structure type, which finally allowed its prediction. There are some interesting works [19,20] concerning the similarity analysis among PDP spectra, their classification in a Kohonen map, and their correlation with some molecular types.

10.1021/cg801122m CCC: $40.75 © 2009 American Chemical Society. Published on Web 05/15/2009

Figure 1. The 31 molecules selected for the present work, with the identification numbering and the CSD [2] codes.

Selection of Molecular Aggregation Predictors and Packing Descriptors. The molecular properties selected as aggregation predictors should condition the selected crystal descriptors. Moreover, as we attempt CPP by molecular similarity, in order to compare among molecules and among their crystals, we codified the molecular aggregation predictors by equi-dimensional vectors M(k) and the crystal descriptors by equi-dimensional vectors P(k), which allows one to verify whether molecules with similar M(k) produce crystals with similar P(k). In order to predict the packing P(k)T for a given test molecule codified by M(k)T, a group of learning molecules is needed, with known M(k)L and P(k)L that sufficiently represent M(k)T and its “unknown” P(k)T, respectively. Under these conditions, the probable P(k)T would be similar to those P(k)L associated with the M(k)L most similar to M(k)T. For the sake of clarity, a summary of symbols is given before the References.

We selected 31 molecules from the CSD containing the azol group, as in our previous work [18]. Looking for the simplest molecular aggregation problem, we considered only crystal cells containing one molecular species and one molecule per asymmetric unit (Z′ = 1). To preserve the molecular form from the isolated molecule to its crystal, we took for the isolated molecule the same conformation as that adopted in the crystal, assuming that molecular shape descriptors would not be significantly changed by packing forces. Figure 1 shows the 31 selected molecules.

Molecular Aggregation Predictors. For obvious reasons, we decided to consider as the main molecular packing conditioner the potential distribution $V_i = \sum_j q_j / r_{ij}$ on points i of the van der Waals (VDW) molecular surface. The formal charge q_j of atom j (at distance r_ij) was calculated with Cerius2 [21] by the Gasteiger [22] method, where each atom is identified by its orbital electronegativity and its connectivity to the rest. We


Table 1. The Nine Unscaled Components of the Blocks M1, M2, and M3 of the Aggregation Predictor Vector M(k) for the 31 Molecules with Azol Group (a)

No.  Code       L       M       S      M/L    S/L    S/M    VE/VM  SE/SM  SM/VM
1    BENSES     12.215  9.637   7.443  0.789  0.609  0.772  2.124  1.152  1.204
2    BEWLEU     14.178  9.228   5.200  0.651  0.367  0.564  1.579  1.063  1.192
3    BIWWEJ     19.380  7.004   6.029  0.361  0.311  0.861  1.601  1.160  1.181
4    FAQROE     10.793  6.195   4.005  0.574  0.371  0.647  1.150  0.981  1.287
5    FAQSAR     11.880  6.442   4.310  0.542  0.363  0.669  1.244  1.013  1.271
6    FAQSEV     10.666  7.338   3.969  0.688  0.372  0.541  1.169  0.962  1.260
7    FAQSOF     10.279  10.143  5.301  0.987  0.516  0.523  1.490  0.997  1.193
8    FAQSUL     10.580  7.024   4.020  0.664  0.380  0.572  1.115  0.930  1.252
9    FAQTAS     11.471  6.761   4.240  0.589  0.370  0.627  1.117  0.918  1.247
10   GISZIR     13.685  7.957   4.466  0.581  0.326  0.561  1.188  0.926  1.198
11   HDMPYZ     9.740   7.411   5.977  0.761  0.614  0.806  1.431  0.948  1.247
12   HEPHUF     10.437  6.177   6.044  0.592  0.579  0.978  1.558  1.085  1.262
13   HEPJAN     10.287  9.421   6.119  0.916  0.595  0.649  1.556  0.948  1.230
14   HEPJER     9.869   6.279   2.417  0.636  0.245  0.385  0.686  0.837  1.257
15   HIWJIG     7.188   6.534   3.743  0.909  0.521  0.573  1.225  1.046  1.355
16   HIWJIG1    7.266   6.321   2.911  0.870  0.401  0.461  0.945  0.952  1.348
17   KOSFUT     11.577  8.815   8.069  0.761  0.697  0.915  1.363  0.792  1.128
18   LEVVAJ     13.316  7.242   3.776  0.544  0.284  0.521  1.038  0.955  1.179
19   MBCPAZ     9.880   7.200   4.106  0.729  0.416  0.570  1.199  0.975  1.260
20   PAZDPY     10.701  8.084   2.903  0.755  0.271  0.359  0.824  0.855  1.204
21   POYXUW     16.846  7.676   3.530  0.456  0.210  0.460  1.011  1.006  1.156
22   PYRZAL10   10.093  6.116   5.340  0.606  0.529  0.873  1.265  0.949  1.252
23   RIVBAZ     11.071  7.186   6.433  0.649  0.581  0.895  1.326  0.882  1.193
24   RUPRID     11.666  6.814   2.980  0.584  0.255  0.437  0.865  0.921  1.218
25   RUPSEA     12.773  6.528   2.933  0.511  0.230  0.449  0.763  0.853  1.206
26   TAXLOT     13.800  9.421   6.850  0.683  0.496  0.727  2.036  1.146  1.202
27   TEHQAY     11.250  9.934   3.967  0.883  0.353  0.399  1.291  1.041  1.179
28   VAXLAH     13.123  10.013  5.360  0.763  0.408  0.535  1.509  0.963  1.204
29   VEHCOA     9.122   7.083   6.967  0.776  0.764  0.984  1.459  0.970  1.196
30   WILBAU     11.068  7.131   6.377  0.644  0.576  0.894  1.238  0.845  1.172
31   YAXZOM     15.768  6.924   4.451  0.439  0.282  0.643  1.182  1.000  1.194

(a) L, M, and S are the ellipsoid PAI axes (Å); VE and SE are the ellipsoid volume and surface; VM and SM are the molecular volume and surface.

selected these charges among other options in Cerius2, keeping the criteria of our previous packing prediction works [16,18], since their more extreme values amplify the V_i differences between molecules, increasing the dispersion of the M(k) vectors. The average values of the maximum and minimum atomic charges per molecule among the 31 molecules were 0.084(48) and -0.128(121), with the root-mean-square standard deviation (RMSSD) of the last digits in parentheses. To cope with the problem of comparing the V_i distribution between molecules with different VDW surfaces, we reduced the molecular surface to that of the ellipsoid of principal axes of inertia (PAI) with axes L, M, and S, adding to each semiaxis the hydrogen VDW radius of 1.17 Å; the PAI calculations were done with the program RPLUTO [23]. Considering the ellipsoid as a convex bipyramid, the potentials V_i were calculated on 26 points of its surface: the 6 vertices, the 12 edge centers, and the 8 face centers. These values were then ordered as components of the vector V(i), following the same point sequence on each ellipsoid for all molecules. Finally, in order to compare the V_i distributions between molecules, each molecule was oriented so that its most positive V_i zone lies at the positive L, M, and S side of the ellipsoid. Besides V_i, some molecular form descriptors were also included in the M(k) vector, for a total of 35 components divided into four blocks. The block M1(L, M, S) contains the three PAI axis dimensions. The block M2(M/L, S/L, S/M) describes explicitly the global molecular form. The block M3(volume_ellipsoid/volume_molecule, surface_ellipsoid/surface_molecule, surface_molecule/volume_molecule) contains other molecular form descriptors, which are related to one another. In fact, Pidcock and Motherwell [12] found that molecules with high surface_molecule/volume_molecule ratios have lower volume_mol/volume_box ratios (the box being one that encloses the molecule, like our PAI ellipsoid), giving lower p_c's and a more efficient packing.
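The sampling of the potential on 26 surface points can be sketched as follows. This is an illustration under stated assumptions, not the author's program: the 26 points are taken along the vertex, edge-center, and face-center directions of the bipyramid and placed on the ellipsoid surface itself, the semi-axes passed in are assumed to already include the 1.17 Å hydrogen radius, and the function names are hypothetical. Charges are in electron units and distances in Å, so the potential is in e/Å with no conversion constant.

```python
import itertools
import numpy as np

def ellipsoid_points(semi_axes):
    """Return 26 points on the ellipsoid with the given semi-axes (Å)."""
    semi_axes = np.asarray(semi_axes, float)
    # Every (u, v, w) with entries in {-1, 0, 1} except the origin:
    # one nonzero entry  -> 6 vertices,
    # two nonzero entries -> 12 edge centers,
    # three nonzero entries -> 8 face centers.
    dirs = np.array(
        [d for d in itertools.product((-1, 0, 1), repeat=3) if any(d)], float
    )
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Scaling a unit vector componentwise by the semi-axes lands on the ellipsoid.
    return dirs * semi_axes

def surface_potentials(points, atom_xyz, charges):
    """V_i = sum_j q_j / r_ij for every surface point i."""
    points = np.asarray(points, float)
    atom_xyz = np.asarray(atom_xyz, float)
    r = np.linalg.norm(points[:, None, :] - atom_xyz[None, :, :], axis=2)
    return (np.asarray(charges, float) / r).sum(axis=1)
```

The point order produced by `itertools.product` is fixed, so the resulting 26-component vectors share the same point sequence for all molecules, as the comparison requires.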
Note, however, that the lengths of our L, M, and S axes, and thus the corresponding p_c values, can be different from those used by the above authors [12]. Finally, the block M4(26 V_i) contains the potential distribution, independent of the molecular size, and holds 74% of the information of M(k) (26 of its 35 components). Table 1 shows only the nine observed components of the M1, M2, and M3 blocks for the 31 molecules. In order to classify the molecules by KNN, all 35 components of each M(k) were scaled by blocks between -1 for the minimum value and +1 for the maximum value: the 31 × 3 components of block M1 together, the 31 × 3 components of M2 together, and each component of M3 independently among the 31 molecules. The 26 V_i of M4 were scaled independently for each molecule to preserve its identity, since the influences of the V_i distribution on the molecular ellipsoid and of the molecular shape are complementary for the crystal packing. Scaling all 31 × 26 V_i together was rejected because the extreme V_i value of one molecule made the scaled V_i values too close to one another. Figure 2 shows the distribution of some molecular form descriptors and their tendency lines for the 31 molecules. Except for molecules 17 and 29, which have cube form, all have L/S > 1.5, indicating disk form if L/M < 1.5 or rod form [12] if L/M > 1.5 (the values of L/S and L/M per molecule are implicit in Table 1). The tendency behavior of L/M (dotted line in Figure 2) is inverse to that of Vol_Ellip/Vol_Mol (black) and, to a lesser extent, to that of Sur_Ellip/Sur_Mol (gray), showing a lower Vol_Ellip/Vol_Mol ratio for rods than for disks. On the other hand, the fact that the Sur_Ellip/Sur_Mol values (gray) are close to 1, independently of the molecular form, seems to justify our reduction of the VDW molecular surface to the PAI ellipsoid surface. Finally, the Sur_Mol/Vol_Mol ratios (discontinuous gray line) are almost constant, with an average of 1.22(5).

Figure 2. Molecular form components for the 31 observed M(k) vectors, where VE and SE are the volume and surface of the PAI ellipsoid box, and VM and SM are the VDW molecular volume and surface. The respective polynomial tendency lines are superimposed: black for VE/VM, dotted for L/M, gray for SE/SM, and discontinuous gray for SM/VM.

Packing Descriptors. Although the entire molecular packing information includes the unit cell, space group (SG), and atomic coordinates, that is to say, all the content in the CSD, these data obviously have to be reduced to some selected packing descriptor components in P(k), which should also be comparable between crystals. In addition, the P(k) components should be conditioned by the above molecular descriptors M(k) and, as we came to realize, they should be as few as possible in order to produce a less diffuse classification of the 31 P(k) vector set. After some correlation tests (shown below) between classes of the 31 M(k) and classes of the 31 P(k), considering all or only some of the components for both classifications, we decided to focus on the crystal cell, drastically reducing P(k) to just 10 components divided into four blocks. The cell dimensions, in decreasing order a > b > c to allow comparison between molecules, form the first block P1(a, b, c). The cell angles did not show any correlation with the M(k) components; thus, we included in the second block P2(⟨cos⟩) just one component, the average angular cosine ⟨cos⟩. For the third block P3, we considered two options: the cell/PAI quotients (a/L, b/M, c/S), called c_p_q from now on, and the ideal packing coefficients (p_c) defined above [12] as [(a|L)/L, (a|M)/M, (a|S)/S], where the cell axes most nearly parallel to L, M, or S were found for each molecule with RPLUTO [23]. As expected, our observed p_c components were distributed around the standard p_c's (0.86, 1.43, or 2.62), while the observed c_p_q components were distributed at random. The component distribution was analyzed by plotting all 31 × 3 observed components of each option in increasing order. The 93 p_c values, which spread widely from 0.41 to 6.75, could be fitted to three parallel straight lines separated by two steps, with average values for each line of 0.85, 1.43, and 2.57, whereas the 93 c_p_q values, from 0.78 to 3.05, were fitted to a single straight line without steps, indicating a random distribution.
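As a small worked example of the cell/PAI quotients (a/L, b/M, c/S), the sketch below sorts both sets of axes in decreasing order and divides them pairwise. The function name is hypothetical; the numbers are the BENSES values listed in Tables 1 and 2.

```python
def cell_pai_quotients(cell_axes, pai_axes):
    """Return (a/L, b/M, c/S) with a > b > c and L > M > S."""
    a, b, c = sorted(cell_axes, reverse=True)
    L, M, S = sorted(pai_axes, reverse=True)
    return (a / L, b / M, c / S)

# BENSES: observed cell axes (Å) given in arbitrary order, PAI axes from Table 1.
quotients = cell_pai_quotients((9.120, 17.356, 7.686), (12.215, 9.637, 7.443))
print([round(q, 2) for q in quotients])
```

Unlike p_c, this pairing ignores the orientation of the molecule in the cell, which is why no crystal structure is needed to define it beyond the cell lengths.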
We did not find a correlation between the molecular centers in the cell and the M(k) components, although some correlation was reported [14] between these centers and the molecular packing for a large molecular sample. The space group (SG) information is difficult to compare between crystals; we just considered for our fourth block P4(c, s, g) the higher or lower presence of the symmetry elements most involved in molecular packing: inversion centers c, screw axes s, and glide planes g. Here c = 1 or c = -1 means that there is or is not a center in the cell, and for s and g the values 1, 0.33, -0.33, or -1 mean that there are 3, 2, 1, or no independent elements in the cell. Although these components do not identify the SG, they could point to a probable one, taking into account the SG distribution of organic molecules in the CSD [12]: P21/c (35.5%), P-1 (21.6%), P212121 (8.6%), C2/c (7.7%), P21 (5.6%) (Σ = 79%). This is similar to the distribution found for our 31 molecules: 9 P21/n, 7 P21/c, 1 P21/a, 3 P212121, 3 Pbca, 2 Pna21, 1 Pca21, 2 P-1, 1 P21, 1 P2/c, 1 C2/c, where the 17 molecules belonging to the first three SG groups, with P4 block values (1, -0.33, -0.33), actually bias the P4 prediction toward these values. The components of P(k) were also scaled by blocks between -1 and 1: the 31 × 3 components of P1(a, b, c), the 31 × 3 of P3(c_p_q or p_c), and the 31 of P2(⟨cos⟩). The components of P4(c, s, g) were already scaled. The 10 observed unscaled packing components of the 31 P(k) are given in Table 2, together with our best predicted components, discussed later in the Packing Prediction section. This table shows the P3(c_p_q) block values, which we finally preferred to the p_c's since they give a better prediction. Atomic coordinates were not considered in this work, although our predicted cell and symmetry could indirectly help to approach the position and orientation of the molecule in the cell. The 31 crystals were chosen for their simplicity, treating the presence of polymorphism, whether reported in the CSD or not, as marginal; however, two molecules, GISZIR and HIWJIG, are reported with two polymorphs in the CSD. The polymorph GISZIR01 was not considered for prediction because the fluorine atoms of the trifluoromethyl group are disordered over two sets of positions.
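The P4 coding and the block scaling described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that "scaled by blocks" means one common minimum and maximum for all entries of a block; counting the independent symmetry elements of a space group is left to the caller.

```python
import numpy as np

# 3, 2, 1, or no independent elements map to 1, 0.33, -0.33, -1.
ELEMENT_CODE = {0: -1.0, 1: -1.0 / 3.0, 2: 1.0 / 3.0, 3: 1.0}

def encode_p4(has_center, n_screw, n_glide):
    """Return the (c, s, g) block for one crystal."""
    c = 1.0 if has_center else -1.0
    return (c, ELEMENT_CODE[n_screw], ELEMENT_CODE[n_glide])

def scale_block(block):
    """Scale all entries of a block jointly to the range [-1, 1].

    Assumes the block is not constant; otherwise max == min and the
    division below is undefined.
    """
    block = np.asarray(block, float)
    lo, hi = block.min(), block.max()
    return 2.0 * (block - lo) / (hi - lo) - 1.0

# A centrosymmetric cell with one independent screw axis and one glide plane,
# as for the P21/c-type entries, gives (1, -0.33, -0.33).
print(encode_p4(True, 1, 1))
```

To scale a single component independently, as stated for M3, or a single molecule independently, as stated for the M4 potentials, the same function can be applied to one column or one row at a time.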
We considered both polymorphs of N2C3H3-NH2, HIWJIG (15) and HIWJIG1 (16), to analyze their effects on the prediction. While the calculated atomic charges are equal for both, the torsion and bond angles at NH2 are different, modifying the molecular packing and cell dimensions, both


Table 2. The 10 Po(k) Unscaled Observed Components for the 31 Molecules (Above) and the Averaged Predicted Components Pp(k) from 30 Independent Trainings (Below) (a)

No.  Code       Row   a (Å)   b (Å)   c (Å)   ⟨cos⟩  a/L    b/M    c/S    c       s       g       ⟨x⟩    σ(x)   ⟨y⟩    σ(y)
1    BENSES     obs   17.356  9.120   7.686   0.020  1.421  0.947  1.033  1.000   -1.000  -0.333  3.064  0.098  0.786  0.331
                pred  16.979  10.832  9.869   0.065  1.390  1.124  1.326  -0.020  0.168   -0.608
2    BEWLEU     obs   16.843  12.948  5.778   0.000  1.188  1.403  1.110  -1.000  -0.333  0.333   1.738  0.064  2.243  0.068
                pred  17.864  13.242  9.703   0.031  1.260  1.435  1.866  0.966   0.098   0.105
3    BIWWEJ     obs   30.934  8.317   5.844   0.000  1.596  1.187  0.969  -1.000  1.000   -1.000  2.705  0.061  3.184  0.122
                pred  21.958  7.584   7.228   0.117  1.133  1.032  1.258  0.930   -0.767  -0.561
4    FAQROE     obs   12.948  7.817   7.607   0.144  1.200  1.262  1.898  1.000   -0.333  -0.333  0.937  0.032  1.095  0.081
                pred  11.419  8.667   7.646   0.030  1.058  1.399  1.909  0.942   -0.229  -0.206
5    FAQSAR     obs   13.685  9.778   6.120   0.077  1.151  1.518  1.419  1.000   -0.333  -0.333  1.985  0.023  0.444  0.028
                pred  13.080  9.521   7.525   0.106  1.101  1.478  1.746  1.000   -0.320  -0.320
6    FAQSEV     obs   13.738  7.975   7.225   0.071  1.288  1.088  1.821  1.000   -0.333  -0.333  1.735  0.134  0.602  0.285
                pred  12.373  9.994   6.608   0.079  1.160  1.362  1.665  0.669   -0.268  -0.347
7    FAQSOF     obs   12.698  12.422  7.791   0.142  1.236  1.224  1.469  1.000   -0.333  -0.333  2.227  0.100  0.780  0.163
                pred  12.880  10.934  8.831   0.077  1.253  1.078  1.666  0.964   -0.397  -0.333
8    FAQSUL     obs   10.067  9.830   8.001   0.021  0.951  1.400  1.989  1.000   -0.333  -0.333  0.898  0.035  1.023  0.053
                pred  12.812  8.928   7.421   0.123  1.211  1.271  1.846  0.966   -0.292  -0.289
9    FAQTAS     obs   12.054  10.291  7.817   0.111  1.051  1.523  1.844  1.000   -0.333  -0.333  1.212  0.093  0.499  0.132
                pred  14.132  9.601   6.568   0.086  1.232  1.420  1.549  0.994   -0.260  -0.262
10   GISZIR     obs   26.684  12.475  7.449   0.074  1.950  1.568  1.669  1.000   -0.333  0.333   2.342  0.074  2.112  0.319
                pred  15.957  10.153  6.136   0.052  1.166  1.276  1.374  -0.121  -0.169  -0.702
11   HDMPYZ     obs   15.211  8.870   6.541   0.082  1.562  1.196  1.095  1.000   -0.333  -0.333  2.172  0.102  0.943  0.414
                pred  14.191  11.205  8.177   0.051  1.457  1.512  1.368  0.966   0.151   0.165
12   HEPHUF     obs   16.198  8.383   5.423   0.047  1.552  1.357  0.897  1.000   -0.333  -0.333  2.648  0.058  1.257  0.309
                pred  16.021  9.179   8.202   0.036  1.535  1.486  1.357  0.803   0.378   0.308
13   HEPJAN     obs   17.159  16.080  8.343   0.000  1.668  1.707  1.365  1.000   1.000   1.000   2.476  0.131  2.218  0.098
                pred  15.297  11.767  6.933   0.083  1.487  1.249  1.133  0.983   -0.376  -0.376
14   HEPJER     obs   10.054  5.488   5.462   0.017  1.019  0.875  2.261  1.000   -0.333  -0.333  2.658  0.045  1.411  0.111
                pred  13.619  8.295   5.245   0.049  1.380  1.321  2.170  0.998   0.249   0.249
15   HIWJIG     obs   14.304  5.909   5.265   0.000  1.990  0.904  1.407  -1.000  -0.333  0.333   2.506  0.110  1.788  0.345
                pred  9.653   8.037   6.284   0.033  1.343  1.230  1.679  -0.049  0.217   -0.343
16   HIWJIG1    obs   9.330   7.370   5.923   0.000  1.283  1.165  2.034  -1.000  -0.333  0.333   2.037  0.055  1.487  0.151
                pred  10.681  7.459   4.914   0.005  1.470  1.180  1.688  -0.767  0.506   -0.606
17   KOSFUT     obs   14.040  11.804  11.567  0.056  1.213  1.340  1.434  1.000   -0.333  -0.333  2.916  0.097  0.754  0.305
                pred  14.298  13.564  11.936  0.053  1.235  1.354  1.681  0.778   -0.205  -0.296
18   LEVVAJ     obs   17.185  8.567   7.291   0.101  1.291  1.182  1.929  1.000   -0.333  -0.333  2.407  0.056  0.886  0.406
                pred  15.953  8.640   6.196   0.094  1.198  1.193  1.641  0.834   -0.238  -0.294
19   MBCPAZ     obs   14.146  13.422  7.804   0.000  1.432  1.865  1.901  1.000   1.000   1.000   1.429  0.078  2.255  0.193
                pred  12.587  9.065   6.393   0.026  1.274  1.259  1.557  -0.318  -0.198  -0.037
20   PAZDPY     obs   15.961  9.514   5.765   0.000  1.491  1.176  1.989  -1.000  1.000   -1.000  1.929  0.088  2.202  0.325
                pred  14.318  10.210  5.298   0.023  1.338  1.263  1.825  0.507   -0.024  -0.005
21   POYXUW     obs   21.132  11.264  10.764  0.000  1.254  1.467  3.050  1.000   1.000   1.000   1.865  0.045  2.466  0.055
                pred  25.724  9.004   5.249   0.063  1.527  1.173  1.487  0.989   -0.325  -0.328
22   PYRZAL10   obs   10.054  7.554   4.620   0.050  0.997  1.234  0.865  -1.000  -0.333  -1.000  1.558  0.126  2.121  0.366
                pred  15.079  7.608   7.407   0.104  1.494  1.244  1.387  0.755   -0.395  -0.455
23   RIVBAZ     obs   12.212  10.764  9.778   0.113  1.103  1.498  1.519  1.000   -0.333  -0.333  1.186  0.068  0.205  0.063
                pred  13.053  10.456  9.662   0.102  1.179  1.455  1.502  0.994   -0.352  -0.345
24   RUPRID     obs   17.198  7.817   5.870   0.000  1.475  1.146  1.970  -1.000  1.000   -1.000  2.354  0.081  1.692  0.211
                pred  16.682  9.744   4.261   0.035  1.200  1.217  1.349  -0.587  -0.265  -0.447
25   RUPSEA     obs   12.054  11.462  6.357   0.014  0.944  1.757  2.168  1.000   -0.333  -0.333  2.924  0.116  1.185  0.165
                pred  18.368  8.891   4.669   0.058  1.438  1.362  1.592  0.784   -0.132  -0.111
26   TAXLOT     obs   10.725  9.133   7.501   0.222  0.777  0.970  1.094  1.000   -1.000  -1.000  2.964  0.062  2.700  0.469
                pred  20.990  10.928  8.645   0.028  1.521  1.160  1.262  -0.068  0.191   -0.483
27   TEHQAY     obs   14.422  11.370  6.436   0.000  1.282  1.145  1.624  1.000   -0.333  -0.333  2.133  0.052  1.882  0.215
                pred  15.097  12.775  6.438   0.008  1.342  1.286  1.623  -0.722  0.087   -0.056
28   VAXLAH     obs   14.672  10.712  9.475   0.110  1.118  1.070  1.768  1.000   -0.333  -0.333  2.170  0.036  0.467  0.251
                pred  16.246  11.615  7.911   0.113  1.238  1.160  1.476  0.899   -0.439  -0.388
29   VEHCOA     obs   12.133  6.870   6.567   0.192  1.330  0.971  0.943  1.000   -1.000  -1.000  2.282  0.077  2.313  0.186
                pred  10.764  8.917   8.040   0.044  1.180  1.259  1.154  -0.578  -0.099  -0.777
30   WILBAU     obs   12.659  10.646  9.620   0.103  1.144  1.492  1.508  1.000   -0.333  -0.333  1.173  0.029  0.110  0.041
                pred  12.640  10.504  9.712   0.106  1.142  1.473  1.523  0.991   -0.340  -0.333
31   YAXZOM     obs   27.342  8.225   5.580   0.063  1.734  1.187  1.252  1.000   -0.333  -0.333  1.615  0.030  1.872  0.128
                pred  18.985  11.234  9.320   0.023  1.204  1.346  2.524  0.964   0.414   0.428

(a) For each molecule, the first row (obs) gives the observed components Po(k) and the second row (pred) the averaged predicted components Pp(k). ⟨x⟩ is the average of the localization distances of the Mo(k) scaled vectors in the 30 trained matrices W(I, J, k) (see Appendix, section c), and ⟨y⟩ is the average of the prediction error distances between the scaled Pp(k) and Po(k); σ(x) and σ(y) are the corresponding RMSSD.

probably due to the 0.8 Å reduction of the S axis of molecule 16 with respect to 15, which decreases its volume and surface. However, the distance between their predictor M(k) vectors is among the shortest found in the 31 M(k) set, and the distance between their P(k) is the shortest, which means that they are at the limit of being distinguishable in our data. The SGs of 15 and 16 are Pca21 and Pna21, with the same P4 components and, as their M(k)'s are the closest, when predicting the P1 and P3 components for 15 by molecular similarity we obtained those of 16, and vice versa. For the two polymorphs of GISZIR, although the packing of GISZIR01 was not predicted, we could guess it because the conformation of the molecule in its crystal is close to that of GISZIR in its crystal: the total torsion along the NH molecular bridge is 6° for GISZIR01 and 12° for GISZIR. Thus, the PAI (L, M, S) dimensions of GISZIR01 (13.75, 7.97, 4.51 Å) are very close to those of GISZIR (13.69, 7.96, 4.47 Å), as are the rest of the form components of both M(k) packing predictors. Then, as for HIWJIG and HIWJIG1, we would expect similar molecular packings for GISZIR and GISZIR01. However, this is not the case for these polymorphs, because the two observed P(k) vectors are quite different: (13.77, 8.64, 5.11, 0.11, 1.00, 1.08, 1.13, 1, -1, -1) for GISZIR01 versus (26.68, 12.48, 7.45, 0.07, 1.95, 1.57, 1.67, 1, -0.33, -0.33) for GISZIR, the first with all-parallel molecular stacking and the second with alternate parallel stacking, by a 30° rotation along the 7.45 Å axis. The apparently disproportionate change in the packing for such close molecular conformations could be related to the observed trifluoromethyl disorder in one polymorph. Note that the predicted cell for GISZIR (15.96, 10.15, 6.13) is closer to the observed reduced cell of GISZIR (13.85, 12.48, 7.46), and also to that of GISZIR01, than to the observed C2/c cell used in the P(k) vector of GISZIR.
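The idea of prediction by molecular similarity used in this discussion can be illustrated with a nearest-neighbor lookup. This is a deliberately simplified stand-in, not the KNN-based procedure of this work: the function names, the Euclidean distance, and the averaging over `n_neighbors` learning molecules are all assumptions made for the sketch. The second function computes a mean relative error of the kind quoted for the cell dimensions.

```python
import numpy as np

def predict_packing(m_test, m_learn, p_learn, n_neighbors=1):
    """Average the P(k) of the learning molecules whose M(k) is closest to m_test."""
    m_learn = np.asarray(m_learn, float)
    p_learn = np.asarray(p_learn, float)
    dist = np.linalg.norm(m_learn - np.asarray(m_test, float), axis=1)
    nearest = np.argsort(dist)[:n_neighbors]
    return p_learn[nearest].mean(axis=0)

def mean_relative_error(predicted, observed):
    """Mean of |predicted - observed| / observed over the components."""
    predicted = np.asarray(predicted, float)
    observed = np.asarray(observed, float)
    return float(np.mean(np.abs(predicted - observed) / observed))

# Cell axes of BENSES from Table 2: predicted versus observed (Å).
error = mean_relative_error((16.979, 10.832, 9.869), (17.356, 9.120, 7.686))
print(round(error, 2))
```

With a single neighbor, two learning molecules that are each other's closest M(k) simply exchange their P(k), which is the behavior described above for molecules 15 and 16.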
Figure 3 shows the dispersion of the observed M(k) and P(k) scaled components, with the values divided into five categories between white and black by color intensity. Figure 3a shows the 31 observed 35D scaled vectors M(k); the nine components of the three blocks M1, M2, and M3 are represented in C1-C9 and the 26 potentials on the (L, M, S) PAI surface in C10-C35. These potentials are distributed from the (100) to the (-100) direction, through three sheaves of eight directions cutting the PAI in three planes perpendicular to L. Figure 3a shows some similarity of the potential distribution among the molecules, due to their orientation with the most positive V_i values on the (+++) ellipsoid octant. Figure 3b shows the dispersion of the 31 observed 10D scaled vectors P(k) with the P3(c_p_q) option, where the 17 molecules with identical P4 block are at the left. Finally, as a reference, the dispersion of 31 35Drand vectors with random components between -1 and 1 is shown in Figure 3c, where no area of aggregation by similarity is observed. In each figure, a preclassification is made by ordering the molecular vectors along the x-axis by their average distance to the rest. These figures show that the observed vectors are more classifiable by similarity among their components than random vectors. Besides, there is some similarity between the two preclassified molecular orders of M(k) and P(k), as shown by Σ|ordM(k) - ordP(k)|/31 = 9.42, compared with the value of 11.5 calculated for two random orders, which indicates some correlation between the observed M(k) and P(k) sets of vectors. Finally, Figure 3a,b shows similarity for some components but not for all, suggesting diffuse classifications. A numerical dispersion analysis of the observed 31 M(k) and 31 P(k) vectors shown in Figure 3 is given in Table 3 of Appendix a. Perhaps the most relevant results are the dispersion differences between the observed vectors and the parallel random vector sets 31 35Drand and 31 10Drand. The average distance between the 31 35D M(k) is 4.03, with the average distance of each molecule to the rest ranging from 3.17 to 5.26 and an average angle between vectors of 76°, compared with the corresponding values of 4.81, from 4.50 to 5.25, and 90° for the 31 35Drand set. In the 10D space, where distances are not comparable with those in 35D, the average distance between the 31 10D P(k) is 1.92, ranging from 1.40 to 2.93, with an average angle of 57° (more aggregated than the 31 M(k) set), compared with 2.47, from 2.16 to 2.94, and 91° for the 31 10Drand vector set. Moreover, the larger differences, relative to the random sets, between the minimum and maximum molecular average distance to the rest of the molecules (and likewise between angles, as shown in Appendix a) show some aggregation of the observed vectors by similarity, implying vector classification.
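The dispersion measures quoted here (average distance of each vector to the rest, average angle between vectors, and the mismatch between the two preclassified orders) can be computed along the following lines. The function names are hypothetical and Euclidean distance is assumed; the sketch is meant only to make the quantities concrete.

```python
import numpy as np

def average_distance_to_rest(vectors):
    """For each vector, the mean Euclidean distance to all the other vectors."""
    v = np.asarray(vectors, float)
    d = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=2)
    return d.sum(axis=1) / (len(v) - 1)

def mean_angle_deg(vectors):
    """Mean angle, in degrees, over all distinct pairs of vectors."""
    v = np.asarray(vectors, float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    upper = np.triu_indices(len(v), k=1)
    return float(np.degrees(np.arccos(cos[upper])).mean())

def order_mismatch(m_vectors, p_vectors):
    """Sum of |ordM - ordP| divided by the number of molecules, where ord is
    the rank of a molecule when sorted by its average distance to the rest."""
    ord_m = np.argsort(np.argsort(average_distance_to_rest(m_vectors)))
    ord_p = np.argsort(np.argsort(average_distance_to_rest(p_vectors)))
    return float(np.abs(ord_m - ord_p).mean())
```

A mismatch of 0 means both sets order the molecules identically; larger values indicate less agreement between the two preclassifications.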
However, neither the analysis of Appendix a nor Figure 3a,b explicitly defines classes for the 31M(k) or 31P(k) sets.

Classification of Molecular Aggregation Predictors and Packing Descriptors Vector Sets and Correlation among Them. Each of the 31Mo(k) and 31Po(k) observed vector sets was self-classified by vector similarity using the Kohonen neural network KNN17 described in Appendix b, which allowed us to uncover correlations between both classifications. The KNN trains a cubic W(i, j, k) matrix so that all 31 observed vectors are represented by close image vectors (along the k direction), whose proximity in the trained matrix reflects their similarity. The vector classification depends on the error of this representation, which is the average distance between the 31 observed vectors and their image vectors. After 105 training epochs


Figure 3. The observed vector components, scaled between -1 and 1 in the y-axis, are visually compared among the 31 molecules, along the x-axis. The component values are divided into the five categories between white and black shown in the legend. (a) The dispersion of the 31 35D scaled vectors M(k); (b) The dispersion of the 31 10D scaled vectors P(k) with P3(c_p_q) and (c) the dispersion of 31 35D vectors with random components. In each figure, the molecules are ordered by their average distance to the rest.


with the same training conditions, we obtained a representation error of 0.64 for the set 31M1111 (with all blocks included, see Appendix a) and of 0.20 for 31P1111 using P3(c_p_q). These errors can be compared with 0.91 and 0.41, respectively, obtained when the 105 training epochs were applied to the random sets 31M35rand and 31P10rand, indicating that our observed data are more classifiable than vector sets with random components. Besides, a KNN training of 31M1111 supervised with 31P1111 (see Appendix c) showed that there was some correlation between both sets, because the error of the 35D vectors decreased to 0.61 from 0.64 for the unsupervised training, both with the same training conditions, in contrast with the error increase expected if the two sets were not correlated. In fact, the supervised training of the above random (hence uncorrelated) sets increased the error to 0.99 from 0.91 for the unsupervised training.

As we expected from Figure 3a,b, neither Kohonen map (see Appendix b) of the 31Mo(k) and 31Po(k) sets showed well-resolved molecular aggregations per class, which would correspond to separate sheaves of vectors; instead, they show a more homogeneous distribution corresponding to an overlapped sheaf. However, a molecular proximity (similarity) analysis on these unsupervised Kohonen maps allowed us to reach diffuse classifications of both the molecules and their packing, where diffuse means that some vectors were shared between classes. Finally, each shared vector was approximately assigned to just one class, allowing us to calculate the correlation (F) between both reduced classifications of Mo(k) and Po(k). To obtain it, we used the molecular coincidence matrix between classes of both sets and the entropy (E) of this matrix, the latter giving an inverse measure of that correlation, as shown in Appendix d.
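The kind of training and representation error discussed here can be sketched with a generic Kohonen self-organizing map. This is not the KNN of Appendix b, whose details are not reproduced in this section: the Gaussian neighbourhood, the linear decay of learning rate and radius, the grid size, and the names `train_map`, `closest_cell`, and `representation_error` are all assumptions chosen for illustration, and the toy data are synthetic.

```python
import math
import random

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_cell(w, x):
    """Grid position (i, j) of the column vector of w closest to x."""
    cells = ((i, j) for i in range(len(w)) for j in range(len(w[i])))
    return min(cells, key=lambda ij: dist(w[ij[0]][ij[1]], x))

def train_map(data, size=6, epochs=50, seed=0):
    """Train a size x size x dim matrix W on the data (generic SOM, assumed rules)."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(size)]
         for _ in range(size)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac) + 0.01              # assumed decay schedule
        radius = max(0.5 * size * (1.0 - frac), 0.5)  # assumed decay schedule
        for x in data:
            bi, bj = closest_cell(w, x)
            for i in range(size):
                for j in range(size):
                    g = math.exp(-((i - bi) ** 2 + (j - bj) ** 2) / (2.0 * radius ** 2))
                    cell = w[i][j]
                    for k in range(dim):
                        cell[k] += rate * g * (x[k] - cell[k])
    return w

def representation_error(w, data):
    """Average distance between each observed vector and its image vector."""
    total = 0.0
    for x in data:
        i, j = closest_cell(w, x)
        total += dist(w[i][j], x)
    return total / len(data)

# Synthetic toy set: three tight clusters of 5D vectors.
rng = random.Random(1)
centers = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
toy = [[c + rng.uniform(-0.05, 0.05) for c in center]
       for center in centers for _ in range(4)]
untrained = representation_error(train_map(toy, epochs=0), toy)
trained = representation_error(train_map(toy, epochs=50), toy)
```

In this sketch a supervised training in the sense used in the text would simply train on the concatenated (35 + 10)D vectors, whereas the unsupervised one uses the 35D or 10D vectors alone; the paper's matrix is 17 × 17 in (i, j), while the toy example uses a smaller grid to keep it fast.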
Although the ambiguity in the assignment of some molecules to a class produces some uncertainty in the correlation analysis between classes, we could find some significant correlations. The largest correlation is between the seven classes found for 31M0001 (only the M4 block) and the five classes for 31P1111, with E = 3.54 (3.79) and F = 0.81 (0.54); the values in parentheses, given as reference, are the noncorrelation values between the same classifications but with molecules assigned at random to each class (see Appendix d). This indicates that the potential distribution on the surface of the molecular PAI ellipsoid clearly conditions the molecular packing in the crystal. The total correlation between the six classes found for 31M1111 and the five classes for 31P1111, used in this work for CPP, was also significant, with E = 3.56 (3.83) and F = 0.56 (0.44), although lower than that using 31M0001. In fact, we found low correlation between the molecular form components 31M0110 and 31P1111, with E = 3.80 (3.89) and F = 0.57 (0.55), although there was some correlation between 31M0110 and 31M0001, with E = 3.88 (4.01) and F = 0.61 (0.55). In any case, we found that adding the molecular form components to the molecular surface potential distribution, as packing predictors, improved CPP.

Crystal Packing Prediction from Molecular Aggregation Predictors. The prediction success of the block components of P(k) from those of M(k) depends on the correlation between classes of the 31M(k) and 31P(k) sets, although it could also depend on the correlation between classes among blocks belonging to 31M(k) or among blocks of 31P(k). It is difficult to establish which and how many components are the best for CPP; however, taking into account the analysis shown in the above two sections, we decided to use all the blocks M1111 for the 35D M(k) vectors and all the blocks P1111 for the 10D P(k) vectors.
In order to predict the packing Pp(k)T of an observed Test molecule Mo(k)T between our 31 molecules, we used the supervised trained Kohonen matrix W(i, j, k) (see Appendix c)


with the rest of the observed 30Mo(k)L and 30Po(k)L learning vectors. As shown in the Appendix, the 30 (35 + 10)D image vectors on W(i, j, k) close to the corresponding Mo(k)L∪Pp(k)L observed vectors are stored and classified by similarity in the trained matrix. CPP by molecular similarity implies that the image vector in W(I, J, k) containing the vector most similar to Mo(k)T would also include the vector Pp(k)T. The packing prediction success for each molecule Mo(k)T depends on how much the classes of the remaining 30Mo(k)L and 30Po(k)L are correlated with each other and on how well they represent Mo(k)T and its “unknown” Po(k)T. Although the diffuse classification of the 30Mo(k)L and 30Po(k)L vectors is implicit in the data and cannot be avoided, it prevents neither the close localization of Mo(k)T in W(i, j, k) nor the prediction of its Pp(k)T. To avoid matrix overlearning (see Appendix c), we used 105 epochs for each supervised training of the cubic 17 × 17 × (35 + 10) W(i, j, k) learning matrix. Moreover, to take into account some influence of the initial random W(i, j, k) on the prediction, the components of Pp(k)T, the localization error (x-w) of Mo(k)T, and the prediction error (y-w), that is, the vector distance between Pp(k)T and Po(k)T, have been averaged over 30 independent trainings for each molecule.

As mentioned above, we used two alternatives to define the P3 block: the ordered c_p_q (a/L, b/M, c/S), so that (a, b, c) can also be calculated indirectly by multiplying the predicted P3 by (L, M, S), or the ideal p_c’s12 [(a|L)/L, (a|M)/M, (a|S)/S]. The unscaled predicted c_p_q’s and the cell axes calculated indirectly from them had average differences from the observed values of 0.25 and 1.93, respectively, while the predicted p_c’s and the directly predicted cell in the second alternative had average differences of 0.50 and 2.46, respectively.
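The indirect calculation of the cell axes from the c_p_q block amounts to multiplying each predicted quotient by the corresponding ellipsoid axis. A minimal sketch follows; the function names and the numerical values are hypothetical and serve only to show the arithmetic, not data from Table 2.

```python
def cell_from_quotients(quotients, ellipsoid_axes):
    """(a, b, c) from predicted (a/L, b/M, c/S) and the ellipsoid axes (L, M, S)."""
    return tuple(q * axis for q, axis in zip(quotients, ellipsoid_axes))

def average_difference(predicted, observed):
    """Mean absolute difference between predicted and observed components."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

# Hypothetical values, for illustration only.
predicted_quotients = (2.0, 1.5, 1.0)   # predicted (a/L, b/M, c/S)
axes = (5.0, 4.0, 3.0)                  # (L, M, S) of the test molecule
cell = cell_from_quotients(predicted_quotients, axes)
error = average_difference(cell, (11.0, 6.0, 2.0))  # hypothetical observed cell
```

Because (L, M, S) are known for any isolated molecule, this route yields cell axes for a test molecule from the predicted P3 block alone, which is what allows the comparison with the directly predicted P1 block.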
On the other hand, we observed that the KNN prediction process has the undesired tendency to bring the predicted components Pp(k) closer to the average values (see Appendix c); this effect for the cell components is 1.3 times greater using the p_c rather than the c_p_q prediction. For both reasons, we decided to use the c_p_q prediction option for the results shown from now on.

Table 2 shows, for each molecule, the observed Po(k) components and the average predicted components <Pp(k)>30 over the 30 trainings, together with the average localization error of Mo(k)T and the average prediction error in the 30 trained W matrices, with the corresponding RMSSD. The localization error of Mo(k)T averaged over the 30 trainings per molecule and over the 31 molecules was <<x-w>30>31 = 2.07(60,7), and the average of the corresponding prediction errors of Pp(k)T was <<y-w>30>31 = 1.48(78,20). The first value in parentheses is the RMSSD of the 31 average errors, and the second value is the average of the 31 RMSSD values of the <x-w>30 or <y-w>30 molecular averages; together they show large error differences between molecules but small error differences between the 30 different trainings of the same molecule, that is, a reproducible prediction. The above vector distance prediction error of 1.48 can be compared with the dispersion of 31Po(k), with an average distance of 1.92 (from 1.40 to 2.93) in Table 3 of Appendix a, or with the dispersion of the random set 31P10rand of 2.47 (from 2.16 to 2.94). Thus, the average distances between observed Po(k) and predicted Pp(k) vectors are on the order of the shorter distances between Po(k) vectors and considerably lower than the minimum distances between random P10rand vectors. The KNN prediction process uses vectors with components scaled between -1 and 1, which is also convenient for comparing |Po(k) - Pp(k)| errors between the different blocks P1 to P4.
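The three numbers in an entry such as 2.07(60,7) can be obtained from a table of errors indexed by molecule and training, as in the following sketch. It assumes that RMSSD denotes the root-mean-square deviation from the mean (computed here without a degrees-of-freedom correction); the function names and the toy error table are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def rmssd(values):
    """Root-mean-square deviation from the mean (assumed meaning of RMSSD)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def summarize(errors):
    """errors[m][t] is the error of molecule m in training t.

    Returns the overall average, the RMSSD of the per-molecule averages
    (spread between molecules), and the average of the per-molecule RMSSD
    values (spread between trainings of the same molecule).
    """
    molecule_means = [mean(row) for row in errors]
    molecule_spreads = [rmssd(row) for row in errors]
    return mean(molecule_means), rmssd(molecule_means), mean(molecule_spreads)

# Toy table: 2 molecules x 2 trainings (hypothetical numbers).
overall, between_molecules, between_trainings = summarize([[1.0, 3.0], [5.0, 7.0]])
```

A large second value together with a small third value is the pattern described in the text: errors that differ strongly from molecule to molecule but are reproducible between trainings.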
Table 4 in Appendix c shows the total average difference between predicted Pp(k) and observed Po(k) components for scaled and unscaled components including the average differences per block. The last


Figure 4. Packing prediction errors (differences between observed and predicted scaled components) for each one of the 31 10D Pp(k) vectors; points are the average prediction errors for all blocks, and cell and c_p_q points are prediction errors just for P1 and P3 blocks. At the right above A are the total average prediction errors for the 31 molecules and the average per block, including the less predictable P2 and P4 (cell_ort and sym) blocks. Above R are the total average prediction errors for 31 10D vectors with random components, to be compared with those predicted above A.

average difference <<|<Pp(k)>30 - Po(k)|>10>31 for the scaled predicted components, corresponding to the unscaled ones of Table 2, is 0.31, compared to 0.74 when the Po components are the same but those of Pp are random between -1 and 1. The error distribution per block (P1, P2, P3, P4) is (0.15, 0.41, 0.22, 0.52) for the predicted components versus (0.75, 0.75, 0.68, 0.79) for the random components. Table 4 in Appendix c also shows that the Pp values are closer to the average than the Po values, with average deviations of 0.20 and 0.32, respectively, which is probably produced in the averaging process of the W matrix components. The average RMSSD for the 10 scaled components of the Pp’s over the 30 trainings and 31 molecules is 0.12, compared to 0.56 for random Pp(k) components. Thus, our cell P1(abc) prediction is better than a random prediction by a factor of 0.75/0.15 = 5.0, the error improvement being 0.68/0.22 = 3.0 for block P3(a/L, b/M, c/S); however, the improvement is only 0.75/0.41 = 1.8 for block P2 (cell_ort) and 0.79/0.52 = 1.5 for P4(c, s, g). Hence, the efficiency of our prediction is concentrated on the P1 and P3 blocks, the prediction of P2 and P4 being less significant, although both were involved in the overall training of W. In any case, the predicted P2 and P4 could provide only a greater or lesser preference for an orthogonal cell or for some symmetry elements in the cell. The considerable variation of CPP errors between molecules and packing blocks can be seen in Figure 4, the errors depending on how well the molecules are represented among the 30M(k) and 30P(k) learning data. The same figure also shows the average prediction error for the 31 molecules and, to validate them, for 31 vectors with random components; the errors of the less predictable blocks P2 and P4 are represented only by their average values. As discussed above, the prediction is better for observed components closer to the average values in the learning

set; thus, the predicted components of P2 are biased toward 0.0 because of the presence of many orthogonal or quasi-orthogonal cells, and the prediction of P4 is biased toward (1, -0.33, -0.33) because 17 of the 31 crystals have these components.

Conclusions

In this work, it is assumed that most of the causes of molecular aggregation are implicit in the isolated molecule; some of these causes are proposed and codified so that they can be compared between molecules, and some packing descriptors associated with those packing predictors are also codified, which allows prediction of the molecular packing by molecular similarity. We used here as molecular packing predictor a 35D Mo(k) vector including some observed molecular form descriptors and 26 electrostatic potentials covering the surface of the molecular PAI ellipsoid. As molecular packing descriptor, we used a 10D Po(k) vector including the observed crystal cell dimensions, the c_p_q quotients between the cell and the PAI axes, and some information about the crystal symmetry. Both the packing predictor and the packing descriptor vector sets show more aggregation than random vector sets. The dispersion of the 31Mo(k) and their 31Po(k) vector sets for 31 molecules containing the azol group was analyzed by considering all or some of their components; the possible classifications of both vector sets show some correlation between them, indicating that similar Mo(k) packing predictors produce similar Po(k) packing descriptors. This allowed us to predict, by molecular similarity, the unknown Pp(k) vector for a given observed Mo(k) molecule. The condition for a significant prediction is that both vectors are sufficiently represented in the above 31Mo(k) and 31Po(k) learning vector sets, which implied quite different packing prediction errors for different target


molecules, although molecular packing predictions are reproducible across independent trainings. On the other hand, the packing prediction depends on correlations between all vector components, both the known correlations previously assumed and those unknown but uncovered after the self-classification of the learning bases by a KNN. Probably because the cell dimensions are the most represented in Po(k), by themselves and indirectly through the (a/L, b/M, c/S) quotients, they are the best predicted packing components, with an average relative error between the unscaled predicted cell axes and the observed cell of 18% ± 11%; this prediction is five times better than taking random cell dimensions in the observed interval. For a real, crystallized molecule, our cell prediction could be checked against its powder diffraction pattern, but of course our cell prediction also applies to a theoretical molecule that has not been synthesized. On the other hand, our KNN cell prediction by molecular similarity is independent of those proposed by lattice energy minimization methods; thus, KNN prediction could also be useful to select the best solution among the almost equienergetic ones found by these methods. Our less accurate prediction for the cell angles and crystal symmetry could only give some information on the crystal class and on the greater or lesser presence of some symmetry elements in the cell.

It is difficult to know a priori the best molecular selection for the learning set to predict the packing of a given molecule. In this work, we selected 31 chemically related molecules, although in fact they were not related by their size, form, or potential distribution, nor was their observed crystal packing especially related. Thus, as expected, chemical similarity does not guarantee similar packing, and for a more accurate molecular packing prediction, it would probably be better to use a much more extensive molecular learning set, independent of chemical type.
Such a molecular learning base would increase the probability of representing the packing predictors and packing descriptors of any test molecule.

Appendix

(a) Vector Dispersion Analysis for 31M(k) and 31P(k) Observed Vector Sets. Table 3 gives numerical information about the dispersion of the observed 31M(k) and 31P(k) vectors with P3(c_p_q), both scaled between -1 and 1, where M1111 indicates that all blocks, that is, the 35 components of M(k), have been considered and M0001 indicates that only the M4 block with the 26 Vi components was considered. Note that the

Table 3. Vector Dispersion Parameters for 31M(k) and 31P(k) Observed Sets, Both Scaled between -1 and 1a

0.19(20,7) 0.32(33,14) 0.65(44,16) 0.65(44,16) 0.51(88,24) 0.92(163,45) 0.68(116,29) 0.68(116,29) 0.20(20,7) 0.32(33,14)

Average component values over the 31 molecules
scaled Po   scaled Pp   Prand    unscaled Po   unscaled Pp
-0.186      -0.206      -0.024   15.329        15.152
-0.610      -0.598       0.103    9.748         9.953
-0.804      -0.792       0.020    7.202         7.369
-0.468      -0.447       0.006    0.059         0.061
-0.529      -0.544      -0.047    1.313         1.295
-0.553      -0.547      -0.044    1.285         1.292

31, in parentheses the total RMSSD and the RMSSD of the molecular averaged errors 10 versus that total average error, and at the right the errors per block P1, P2, P3, and P4. Next are the errors of Pp and Po with respect to the average vector, under 31 and under 31, respectively. Last, 10×31 is the average RMSSD for the 10 components and 31 molecules over the 30 trainings. The first row shows the errors for Po and Pp components scaled between -1 and 1. The second row shows the errors for the same components of Po, but with those of Pp random between -1 and 1. The third row has the Po components rescaled to the observed values and Pp rescaled with the same factors as Po. The fourth row is for rescaled components but with the directly predicted cell axes replaced by those calculated from the predicted (a/L, b/M, c/S). The fifth row (in bold) shows the final results for the above components scaled again between -1 and 1. At the end are the average values over the 31 molecules of the 10 scaled components for Po and Pp, together with the average random components of Prand, and the averages of the unscaled components for Po and Pp. a

significant correlation between the 31 Mo(k)L and 31 Po(k)L sets; thus, the supervised trained matrix with 30 Mo(k)L∪Po(k)L union vector sets could be used to predict the “unknown” packing Pp(k)T for the remaining Mo(k)T, taken as a test molecule not included in the W training. Since the prediction is done by vector similarity, once the closest column vector W(I, J, k1_35) to Mo(k)T is localized, the rest of that column vector, W(I, J, k36_45), provides the predicted Pp(k)T. Because of the 17 × 17 (i, j) size for 31 molecules, in general the found W(I, J, k1_45) and W(I, J, k36_45) are not the image vectors of any observed Mo(k)L∪Po(k)L vector, but a column vector lying between some similar image vectors around it. The vector distance between W(I, J, k1_35) and Mo(k)T is the localization error (x-w) of the test molecule, and the vector distance between the predicted Pp(k)T = W(I, J, k36_45) and the observed (although ignored so far) packing Po(k)T is its prediction error (y-w). Thus, the better Mo(k)T and Po(k)T are represented by the learning vector sets, the better is the packing prediction for the test molecule.

Although the average <x-w> and <y-w> errors decrease along the epochs during the W(i, j, k) training, W simultaneously becomes more homogeneous, decreasing the column vector aggregation by similarity, which also decreases the prediction capacity. This effect is called overlearning and can be minimized by stopping the training as soon as the prediction capacity decays; the epoch limit depends on the data to be classified and on the parameters controlling the classification, and it was about 105 epochs for our trainings. We observed that the overlearning cutoff also diminished an unwanted effect of the KNN process, namely, the overall approach of the predicted Pp(k)T vectors to the average vector. Our KNN prediction process uses vectors with components scaled between -1 and 1, which is also convenient for comparing Po(k) - Pp(k) errors between the different blocks P1-P4.
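The localization and prediction steps described in this appendix can be sketched as follows: the first 35 components of every column vector of the trained matrix are compared with the test molecule, and the remaining components of the closest column are read off as the predicted packing. The function names are hypothetical, and the tiny matrix in the example (2 molecular components plus 1 packing component) is a toy stand-in for the 17 × 17 × (35 + 10) matrix.

```python
import math

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict_packing(w, m_test, n_m):
    """Locate the column of w whose first n_m components are closest to m_test.

    Returns the predicted packing (remaining components of that column),
    the localization error (x-w), and the grid position (I, J).
    """
    best = None
    for i, row in enumerate(w):
        for j, column in enumerate(row):
            d = dist(column[:n_m], m_test)
            if best is None or d < best[0]:
                best = (d, i, j)
    localization_error, i, j = best
    return list(w[i][j][n_m:]), localization_error, (i, j)

def prediction_error(predicted, observed):
    """Prediction error (y-w): distance between predicted and observed packing."""
    return dist(predicted, observed)

# Toy trained matrix: 1 x 2 grid, 2 molecular + 1 packing component per column.
w_toy = [[[0.0, 0.0, 1.0], [1.0, 1.0, -1.0]]]
pp, x_w, cell = predict_packing(w_toy, [0.9, 1.0], n_m=2)
y_w = prediction_error(pp, [-0.5])  # hypothetical observed packing
```

In a leave-one-out test of the kind described in the text, `w_toy` would be replaced by a matrix trained on the other 30 union vectors, and `x_w` and `y_w` would then be averaged over the independent trainings.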
Table 4 shows the total average difference between predicted Pp(k) and observed Po(k) components for scaled and unscaled components (both with the scale factors used for the observed components), followed by the average differences per block. The table also shows other errors involving the average observed vector <Po>; note that although the scaled components of <Po> and <Pp> (at the bottom of the table) differ less

than 0.02, the average values of 31 and 31 indicate more approximation of the Pp values to the average, which may be related to the fact that 31 and 31 averages have similar values. Both effects are probably produced in the averaging process of the W matrix components, and they could not be corrected because the Pp values distributed equally up and down , with