Chem. Mater. 2006, 18, 3287-3296
A New Mapping/Exploration Approach for HT Synthesis of Zeolites
Avelino Corma,* Manuel Moliner, Jose M. Serra, Pedro Serna, María J. Díaz-Cabañas, and Laurent A. Baumes
Instituto de Tecnología Química, UPV-CSIC, Universidad Politécnica de Valencia, Avda. de los Naranjos s/n, 46022 Valencia, Spain
Received March 15, 2006. Revised Manuscript Received May 4, 2006
This work shows a methodology for the synthesis of self-assembled organic-inorganic materials which integrates high-throughput tools for the synthesis and characterization of solid materials and data-mining techniques in materials science. This is illustrated by a detailed exploration of the hydrothermal synthesis in the system SiO2:GeO2:Al2O3:F-:H2O:N(16) methylsparteinium. Data analysis and dimensional reduction were conducted by using principal components analysis and clustering algorithms, allowing the definition of a new and suitable structural vector which summarizes the X-ray diffraction characterization data as well as an improvement of data visualization and interpretation. Different modeling techniques were applied for the prediction of the properties of the materials considering the synthesis descriptors as input of the model. Furthermore, different “material property” descriptors were considered as outcome of the model, that is, the crystallinity of the formed phases, structural principal components computed by principal component analysis, or clustering results. It was found that the final properties of the materials could be successfully modeled using artificial neural networks and decision trees.
1. Introduction
The application of combinatorial and high-throughput (HT) techniques to materials science can help chemists to increase the number of variables of a given process that can be studied in a reasonable time period as well as to increase the number of samples produced and characterized (refs 1-3). Moreover, data mining and database technology are applied for the analysis and modeling of the large amounts of data generated, allowing in turn a speeding up of the discovery and optimization process while establishing scientific principles. In recent years, the usefulness of HT methods has been proven for the discovery of solid functional materials (refs 4-8). Indeed, these methods allow the simultaneous study of numerous synthesis and processing variables, this being especially important when dealing with highly nonlinear and multidimensional systems, as is the case for the synthesis of microporous molecular sieve systems. The hydrothermal crystallization processes of microporous materials are governed by a large number of parameters which determine the phases formed and the crystallization kinetics. Despite the notable efforts made to rationalize the synthesis of zeolites (refs 9-12), the relationship between synthesis variables and the zeolitic structure formed is not clearly understood, because of the metastable nature of zeolites and the complexity of the synthesis mechanisms involved. As a result, the discovery of new microporous materials is still predominantly an empirical process, though strongly helped by accumulated experience. High-throughput methods should be useful in this field (refs 13-17) to determine the effect of different synthesis parameters and to help in the discovery of new zeolites.
Very recently, a new zeolite, named ITQ-21, containing Si, Ge, and optionally Al as framework cations was reported (ref 18). This material presents a unique pore topology formed by nearly spherical large cavities of 1.18 nm diameter joined to six other neighboring cavities by circular 12-ring pore windows with an aperture of 0.74 nm, which results in a three-directional channel system of fully interconnected

* To whom correspondence should be addressed. Tel.: 34(96)3877800. Fax: 34(96)3877809. E-mail: [email protected].
(1) Combinatorial Materials Science; Xiang, X. D., Takeuchi, I., Eds.; Dekker: New York, 2003.
(2) Koinuma, H.; Takeuchi, I. Nat. Mater. 2004, 3, 429-438.
(3) Hanak, J. J. Appl. Surf. Sci. 2004, 223, 1-8.
(4) Gorer, A. U.S. Patent 6,723,678, 2004, to Symyx Technologies Inc.
(5) Sohn, K. S.; Seo, S. Y.; Park, H. D. Electrochem. Solid State Lett. 2001, 4, H26-H29.
(6) Boussie, T. R.; Diamond, G. M.; Goh, C.; Hall, K. A.; LaPointe, A. M.; Cheryl Lund, M. L.; Murphy, V.; Shoemaker, J. A. W.; Tracht, U.; Turner, H.; Zhang, J.; Uno, T.; Rosen, R. K.; Stevens, J. C. J. Am. Chem. Soc. 2003, 125, 4306-4317.
(7) Corma, A.; Serra, J. M.; Serna, P.; Argente, E.; Valero, S.; Botti, V. J. Catal. 2005, 229, 513-524.
(8) Klanner, C.; Farrusseng, D.; Baumes, L. A.; Mirodatos, C.; Schüth, F. Angew. Chem., Int. Ed. 2004, 43 (40), 5347-5349.
(9) Piccione, P. M.; Yang, S.; Navrotsky, A.; Davis, M. E. J. Phys. Chem. B 2002, 106, 3629.
(10) Corma, A.; Davis, M. E. ChemPhysChem 2004, 5 (3), 304-313.
(11) Schüth, F.; Schmidt, W. Adv. Eng. Mater. 2002, 4 (5), 269-279.
(12) Rajagopalan, A.; Suh, C.; Li, X.; Rajan, K. Appl. Catal., A 2003, 254, 147-160.
(13) Akporiaye, D. E.; Dahl, I. M.; Karlsson, A.; Wendelbo, R. Angew. Chem., Int. Ed. 1998, 37 (5), 609-611.
(14) Holmgren, J.; Bem, D.; Bricker, M.; Gillespie, R.; Lewis, G.; Akporiaye, D.; Dahl, I.; Karlsson, A.; Plassen, M.; Wendelbo, R. Stud. Surf. Sci. Catal. 2001, 135, 461-470.
(15) Bricker, M. L.; Sachtler, J. W. A.; Gillespie, R. D.; McGoneral, C. P.; Vega, H.; Bem, D. S.; Holmgren, J. S. Appl. Surf. Sci. 2004, 223 (1-3), 109-117.
(16) Pescarmona, P. P.; Rops, J. J. T.; van der Waal, J. C.; Jansen, J. C.; Maschmeyer, T. J. Mol. Chem. A 2002, 182-183, 319-325.
(17) Klein, J.; Lehmann, C. W.; Schmidt, H. W.; Maier, W. F. Angew. Chem., Int. Ed. 1999, 38, 3369.
(18) Corma, A.; Díaz-Cabañas, M. J.; Martínez-Triguero, J.; Rey, F.; Rius, J. Nature 2002, 418, 514-517.
10.1021/cm060620k CCC: $33.50 © 2006 American Chemical Society Published on Web 06/20/2006
large cavities. This zeolite was synthesized using a large and rigid structure-directing agent, N(16)-methylsparteinium (MSPT), and the directing effect of Ge toward the formation of structures containing double four rings seems decisive for the synthesis of ITQ-21 (ref 19). Zeolite ITQ-30 (ref 20) is a new structure of the MWW family, which is more closely related to MCM-56 (ref 21) but has clearly different X-ray diffraction (XRD) features. The thermal and hydrothermal stability of zeolites increases as the germanium content decreases. It is therefore important for catalytic applications to find the synthesis conditions under which fully crystalline samples of ITQ-21 can be obtained with the lowest amount (or none) of Ge and the highest acidity [determined by the (Si + Ge)/Al ratio].
Classical designs of experiments (DoE) (ref 22), such as factorial or combination designs, have been applied successfully when exploring synthesis gel conditions aimed at the discovery of new zeolites or the optimization of existing ones (refs 23-25). The synthesis variables should be carefully selected in order to cover the largest part of the most promising parameter space, while keeping the total number of experiments at a reasonable and feasible level. Moreover, the HT methods currently applied for parallel hydrothermal synthesis strongly constrain how the synthesis parameters can be experimentally studied. For instance, when using autoclave arrays (multiautoclaves with 15-96 wells), the intensive exploration of crystallization temperature and time is restricted. Therefore, DoE strategies should be developed which consider the specific aspects of HT methods in this field, while minimizing the number of experiments. On the basis of the data analysis/mining methodology applied in this work, we propose a new mapping/exploration approach for reducing the screening of low-promise conditions within the multivariate synthesis spaces found in microporous systems.
2.
Experimental Section and the Design of Experiments
A detailed exploration of the hydrothermal synthesis in the system SiO2:GeO2:Al2O3:F-:H2O:MSPT has been performed to understand the influence of these factors on the growth of ITQ-21 and ITQ-30 at 175 °C under static conditions. Parallel syntheses were carried out using a robotic system and 15-fold Teflon-lined stainless steel autoclaves for the crystallization (ref 25). Crystallinity was measured by means of XRD, using a multisample Philips X'Pert diffractometer employing Cu Kα radiation. A factorial experimental design (4 × 3² × 2² = 144) was selected for studying simultaneously the concentrations of the components in the starting gel, that is, the Al/(Si + Ge), MSPT/(Si + Ge), F-/(Si + Ge), and Si/Ge molar ratios, as well as the crystallization time. Table 1 shows the values and levels considered for the different variables. For experimental details, see the Supporting Information.

Table 1. Levels and Ranges of Synthesis Factors Employed in the Experimental Design

factor            number of levels   level 1   level 2   level 3   level 4
time (days)       2                  1         5
Si/Ge             4                  15        20        25        50
Al/(Si + Ge)      3                  0.02      0.04      0.067
MSPT/(Si + Ge)    2                  0.25      0.5
F/(Si + Ge)       2                  0.25      0.5
H2O/(Si + Ge)     3                  2         5         10

Figure 1. Phase diagram showing the occurring materials as a function of the five synthesis variables (starting gel molar ratios and crystallization time).

(19) Blasco, T.; Corma, A.; Díaz-Cabañas, M. J.; Rey, F.; Rius, J.; Sastre, G.; Vidal-Moya, J. A. J. Am. Chem. Soc. 2004, 126, 13414-13423.
(20) Corma, A.; Díaz-Cabañas, M. J.; Moliner, M.; Martínez, C. Discovery of a new catalytically active and selective zeolite (ITQ-30) by high-throughput synthesis techniques. J. Catal., in press.
(21) Fung, A. S.; Lawton, S. L.; Roth, W. J. U.S. Patent 5,362,697, 1994, to Mobil Oil Corp.
(22) Montgomery, D. C. Design and Analysis of Experiments, 4th ed.; John Wiley & Sons Inc.: New York, 1997.
(23) Tagliabue, M.; Carluccio, L. C.; Ghisletti, D.; Perego, C. Catal. Today 2003, 81, 405-412.
(24) Holmgren, J.; Bem, D.; Bricker, M. L.; Gillespie, R. D.; Lewis, G.; Akporiaye, D.; Dahl, I.; Karlsson, A.; Plassen, M.; Wendelbo, R. Proceedings of the 13th International Zeolite Conference; Montpellier, France, July 8-13, 2001; Galarneau, A., Di Renzo, F., Fajula, F., Vedrine, J., Eds.; Stud. Surf. Sci. Catal. 2001, 135, 461.
(25) Moliner, M.; Serra, J. M.; Corma, A.; Argente, E.; Valero, S.; Botti, V. Microporous Mesoporous Mater. 2005, 78, 73-81.
(26) Lobo, R. F.; Davis, M. E. Microporous Mater. 1994, 3, 61.

Different data-mining techniques have been applied to extract knowledge about the relationships between synthesis conditions and the occurrence of different zeolite phases, minimizing human participation in the analysis of the large amount of data generated. Furthermore, the advantages of data-mining techniques when processing, visualizing, and interpreting this type of nonlinear data have been shown. In this sense, three issues are key in our methodology: (i) the analysis and extraction of knowledge (i.e., Pareto analysis and data visualization techniques), (ii) a reduction of the complexity/dimensionality of the problem, minimizing the information loss (i.e., clustering analysis and principal component analysis, PCA), and (iii) modeling, enabling one to make a priori predictions (i.e., classification trees and neural networks, NNs). Moreover, this approach combining diverse data-mining techniques has been shown to be a realistic way of statistically treating data from materials science. Finally, we have used the NN model based on ITQ-21 crystallinity to minimize the germanium content present in the final structure, to increase its thermal stability while maintaining high crystallinity. More details on the data-mining techniques are given in the Supporting Information.
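As a rough illustration of the factorial design summarized in Table 1, the following Python sketch enumerates the runs with the standard library. The dictionary keys and the `full_factorial` helper are hypothetical names introduced here. One assumption is needed to reach the stated 144 runs: the sketch treats F/(Si + Ge) as set equal to MSPT/(Si + Ge) in every run, which is consistent with the five synthesis variables mentioned in the Figure 1 caption but is not stated explicitly in the text.

```python
from itertools import product

# Levels taken from Table 1. F/(Si + Ge) is deliberately left out of the
# independent factors (ASSUMPTION: it is coupled to MSPT/(Si + Ge)).
LEVELS = {
    "time_days": [1, 5],
    "Si/Ge": [15, 20, 25, 50],
    "Al/(Si+Ge)": [0.02, 0.04, 0.067],
    "MSPT/(Si+Ge)": [0.25, 0.5],
    "H2O/(Si+Ge)": [2, 5, 10],
}


def full_factorial(levels):
    """Return every combination of factor levels as a list of dicts."""
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


design = full_factorial(LEVELS)
for run in design:
    # ASSUMPTION: fluoride ratio follows the template ratio in each run.
    run["F/(Si+Ge)"] = run["MSPT/(Si+Ge)"]

print(len(design), "runs; first run:", design[0])
```

Under this reading the run count is 4 × 3² × 2² = 144; treating all six tabulated factors as independent would instead give 288 combinations.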
3. Results and Discussion
3.1. Screening Results: Phase Diagram. Figure 1 shows the phase diagram obtained following the factorial design
Figure 2. XRD patterns of ITQ-21 and ITQ-30.
Figure 3. Standardized Pareto chart for ITQ-21 and ITQ-30 formation, showing the effect of the different synthesis factors on the crystallinity of each zeolite. The length of each bar displayed in the frequency histogram is proportional to the absolute value of its associated estimated effect.
described above. ITQ-21, ITQ-30, and amorphous material were obtained in the explored space. The standard X-ray diffractograms for each crystalline phase are shown in Figure 2. The occurrence and crystallinity of each phase were calculated automatically by integrating the area of its characteristic peaks and referring this area to that of the fully crystalline material. For ITQ-21, the integrated 2θ range is 25.4-27.2°, and for ITQ-30, it is 24.6-25.4°. Because ITQ-30 also presents diffraction peaks in the 25.4-27.2° region, the ITQ-30 contribution is subtracted on the basis of the crystallinity measured from the peak located at 25.0°. Considering the crystallinity of the synthesized materials, three different groups have been created. A material is qualified as "amorphous" if both the ITQ-21 and ITQ-30 crystallinities are below 20%. "ITQ-21" is defined as a material for which the ITQ-21 crystallinity is higher than 20% and the ITQ-30 crystallinity is below 20%. If the ITQ-30 crystallinity is greater than 20%, the material is noted as "ITQ-30".
As a first approach, the Pareto analysis in Figure 3 shows the relative influence of each synthesis factor on the crystallinity of ITQ-21 and ITQ-30 samples. In this chart, the length of each bar is the estimated effect divided by its standard error, which is equivalent to computing a t statistic for each effect. Bars that extend beyond the vertical line on the plot correspond to effects that are statistically significant at the 95% confidence level. This statistical treatment of the results allows quantification of the hypothetical weight of the factors in the growth of the materials. Both ITQ-21 and ITQ-30 are negatively affected by the water and aluminum contents; that is, the more water or the higher the Al/(Si + Ge) ratio, the less crystalline are the samples. Next in importance, MSPT/(Si + Ge) and F/(Si + Ge) play a positive role in the formation of ITQ-21 and ITQ-30. However, some important differences can be observed when comparing the analyses for ITQ-21 and ITQ-30. On one hand, the relative importance of MSPT/(Si + Ge) and F/(Si + Ge) is higher for ITQ-30, because only in a few small zones can this material be obtained with the minimum content of MSPT/(Si + Ge) and F/(Si + Ge). On the other hand, Si/Ge appears as an important negative factor for ITQ-21 samples, while it becomes slightly positive for ITQ-30 samples. This result has to be understood as a penalization of the growth of ITQ-21 when increasing the Si/Ge ratio, because the crystallinity decreases and, in addition, some syntheses change to ITQ-30. The same reasoning explains the slight benefit of Si/Ge for ITQ-30, taking into account a balance between the loss of crystallinity and the appearance of new ITQ-30 points. ITQ-21 samples, however, appear at lower Si/Ge ratios. Finally, the relative influence of time on these materials is quite different, being much more important in the case of ITQ-30 than in that of ITQ-21. This effect of time could be understood as a retransformation process of ITQ-21, such that ITQ-30 can only be obtained in 1 day when working at the maximum levels of MSPT/(Si + Ge) and F/(Si + Ge) and the minimum level of Al/(Si + Ge).
3.2. Analysis and Knowledge Extraction from HT Experimental Data. In this section, different techniques of unsupervised analysis will be applied to the original data set derived from the XRD characterization of the whole set of samples, allowing an improvement in data visualization, classification, and the subsequent knowledge extraction. Indeed, structural vectors will be computed from the raw characterization data by means of dimensional reduction and analysis techniques, that is, clustering algorithms and PCA.
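The automatic crystallinity estimate and the three-way labelling described in Section 3.1 can be sketched in Python as follows. This is only a minimal illustration: the function names are hypothetical, the linear subtraction of the ITQ-30 contribution is an assumed form of the correction (the text does not give the exact formula), and samples lying exactly on the 20% threshold are assigned arbitrarily because the text does not cover that case.

```python
def band_area(two_theta, intensity, lo, hi):
    """Trapezoidal area of a diffractogram between lo and hi (degrees 2-theta)."""
    area = 0.0
    for k in range(len(two_theta) - 1):
        x0, x1 = two_theta[k], two_theta[k + 1]
        if x0 >= lo and x1 <= hi:
            area += 0.5 * (intensity[k] + intensity[k + 1]) * (x1 - x0)
    return area


def crystallinity(area, reference_area):
    """Percent crystallinity relative to a fully crystalline reference."""
    return 100.0 * area / reference_area


def itq21_crystallinity(raw_area, c30, itq30_overlap_area, ref21_area):
    """ITQ-21 crystallinity from the 25.4-27.2 degree band.

    ASSUMPTION: the ITQ-30 contribution to this band scales linearly with
    the ITQ-30 crystallinity c30 (percent); itq30_overlap_area is the area
    that fully crystalline ITQ-30 would contribute to the band.
    """
    corrected = raw_area - (c30 / 100.0) * itq30_overlap_area
    return max(0.0, crystallinity(corrected, ref21_area))


def label_phase(c21, c30, threshold=20.0):
    """Apply the 20% rules from the text to a pair of crystallinities."""
    if c30 > threshold:
        return "ITQ-30"
    if c21 > threshold:
        return "ITQ-21"
    return "amorphous"
```

For a measured pattern, `band_area(two_theta, intensity, 24.6, 25.4)` would give the ITQ-30 band and `band_area(two_theta, intensity, 25.4, 27.2)` the ITQ-21 band, each then divided by the corresponding reference area.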
Figure 4. Tree diagram (dendrogram) showing the Euclidean distances between the different clusters and subclusters.
Clustering analyses of raw XRD data allow classification of the as-synthesized samples into different structural groups without applying any previous knowledge. This can be of interest when the resulting materials contain mixtures of phases or unknown phases, for which conventional phase identification systems encounter difficulties. Moreover, this type of data classification allows a high degree of automation in the high-throughput experimental workflow.
3.2.A. Clustering Analysis. The k-means clustering algorithm examines each sample from the population and assigns it to one of the clusters, trying to minimize the intraclass variance and maximize the interclass variance. The centroid of a cluster is recomputed each time a new component is added to it, and the process is repeated until all of the components are grouped into the selected number of clusters. This methodology is sensitive to the initialization of the centroids: depending on the first randomly chosen centroids, the final solution can change considerably. Therefore, numerous assignments have been performed in order to get a stable and representative solution.
A first data set constituted by the XRD data of each sample has been taken into account for the clustering analysis. This involves vectors with 800 attributes, corresponding to the intensities obtained at each diffraction angle, for each of the 144 samples. The number of clusters chosen to perform the later analysis was investigated by means of a tree diagram (called a dendrogram), using Ward's clustering method (see the Clustering Analysis section in the Supporting Information). In this tree diagram (Figure 4), the different groups of samples are plotted as a function of the relative diversity of each group (linkage distance). This classification analysis shows that two big clusters can be clearly recognized, corresponding to amorphous and crystalline materials, whereas the latter cluster can be split into two new groups, corresponding to ITQ-21 and ITQ-30 samples. More specific subclusters can be related to slight differences in the XRD diffractograms for a given structure, because of changes in their crystallinity or germanium contents. From a practical point of view, we have selected three clusters, to make a first classification based on the three types of materials identified manually, that is, amorphous, ITQ-21, and ITQ-30.

Figure 5. XRD measurements of the as-synthesized samples ordered considering the cluster distribution obtained by the k-means algorithm using the second data set.
Figure 6. Identification of the formed phase using a k-means clustering analysis.
Figure 7. Averaged XRD diffractogram for the three clusters obtained by k-means analysis.

Table 2. Clustering Analysis Carried out Using the XRD Data, Showing the Match between Clustering Results and Phase Identification

k-means clusters   specific 2θ range match (%)   complete 2θ range match (%)
1. amorphous       87.3                          99.0
2. ITQ-21          100.0                         89.7
3. ITQ-30          92.3                          69.2

A second data set constituted by XRD data from the characteristic 2θ range (24.5-27.5°) of ITQ-30 for each sample was considered. Figure 5 shows a general visualization of the XRD data, ordered according to the clusters to which the samples were assigned by the k-means clustering algorithm using the second data set. Figure 6 shows the good match between the clusters obtained by k-means analysis for both data sets and the corresponding material/phase. The clustering analysis using the whole of the XRD data allows one to accurately distinguish amorphous and crystalline materials, whereas it fails only in a few samples when distinguishing between ITQ-21 and ITQ-30 phases (Table 2). However, it is possible to improve the quality of the separation between ITQ-21 and ITQ-30 samples by taking into account only the 2θ range where these two structures present different peaks (between 24.5° and 27.5°). The k-means clustering carried out in this way strongly improves the classification between both phases, although the classification
Figure 8. Distribution of the three different phases in the SPC coordinates. (PCA computed using the whole of the XRD data, first data set.)
Figure 9. Identification of different structural properties in the SPC space: distribution of ITQ-21 and ITQ-30 with different ranges of crystallinity.
accuracy of the amorphous samples is reduced. Figure 7 presents the averaged XRD pattern for each cluster (first data set), showing the good match between the clustering analysis and phase identification (see the real diffractograms of standard ITQ-21 and ITQ-30 samples mentioned previously). The characteristic peaks of ITQ-30 can be observed, and the averaged diffractogram can be clearly distinguished from the ITQ-21 XRD pattern.
3.2.B. Principal Component Analysis. The principal components computed from the whole of the XRD data will be referred to as structural principal components (SPCs) from here on. When PCA techniques are applied, it is possible to reduce the XRD vector of each sample (a vector of 800 intensities, one for each 2θ angle) to a vector with only three new variables (SPCs), without a loss of the main information of the original data, because 81.8% of the cumulative variance has been extracted. The corresponding percentage of variance for each component (SPC#1, SPC#2, and SPC#3) is 39.8%, 32.8%, and 9.2%, respectively. Because of this simplification of the original vector, we can now provide an easy visualization of the distribution of the samples in the virtual three-dimensional SPC space. The results of the k-means clustering algorithm and the PCA can be combined, as shown in Figure 8. The SPC projections of the samples are clearly separated from one cluster to another.
Diffraction data usually contain information about the type of crystalline phase as well as about the crystallinity of the material, crystallite size, zeolite framework composition, and so forth. Indeed, the fine-tuning of the ITQ-21 crystallite size from nanocrystals to large crystals has been reported (ref 19), achieved by controlling the rates of nucleation and crystal growth through the H2O/(Si + Ge) ratio. In the present study, trying to rationalize the meaning of the SPC space, we will study the variation of phase crystallinity and framework composition inside this new space. On one hand, Figure 9 shows the distribution of ITQ-21 and ITQ-30 samples with different degrees of crystallinity in the SPC space. It can be seen that they are clearly distributed in the space, so that crystallinity can be correlated with the SPCs. On the other hand, the correlation between the germanium content in the ITQ-21 framework and the SPCs was studied. Given that the Si/Ge ratio in the starting gel has been shown to be a highly influential factor in the final crystallinity of ITQ-21 (see the Pareto analysis in Figure 3), the variation of Si/Ge was followed separately from the correlation between the SPCs and crystallinity. Specifically, Figure 10 represents the third SPC as a function of Si/Ge, for three different degrees of crystallinity. It is clear that SPC#3 is strongly correlated with the structural changes produced by the variation of the framework Si/Ge ratio. In fact, this correlation is attributed to the information extracted by PCA from the XRD peak shift produced by the isomorphic substitution of Si by Ge in the zeolite framework, as can be clearly seen in the Figure 10 inset. No correlation was found between Si/Ge and the remaining two SPCs.
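The reduction of each diffractogram to a few SPCs can be sketched with a small, dependency-free PCA. The sketch below is an illustration rather than the authors' implementation: it uses power iteration with deflation on the covariance matrix (a choice made here for brevity), the function name `pca` is hypothetical, and the toy patterns are invented three-point "diffractograms", not data from this work. For real 800-point patterns a numerical library would normally be preferred.

```python
import math


def pca(rows, n_components, n_iter=200):
    """Return (scores, explained_variance_ratios) for the leading components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))
    components, ratios = [], []
    for _ in range(n_components):
        # Deterministic, non-uniform start vector for the power iteration.
        v = [1.0 + 0.1 * j for j in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to extract
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * cov[i][j] * v[j]
                     for i in range(d) for j in range(d))
        components.append(v)
        ratios.append(eigval / total_var)
        # Deflation: remove the variance captured by this component.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
              for c in centered]
    return scores, ratios


# Invented toy patterns (intensities at three angles), for illustration only.
toy_patterns = [[0.1, 0.0, 0.1], [0.9, 0.1, 0.2], [1.0, 0.2, 0.1],
                [0.1, 0.9, 1.0], [0.2, 1.0, 0.9]]
toy_scores, toy_ratios = pca(toy_patterns, n_components=2)
print("explained variance ratios:", toy_ratios)
```

The explained variance ratios returned here play the role of the 39.8%, 32.8%, and 9.2% figures quoted above, and each row of scores is the low-dimensional vector that replaces the full pattern.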
Figure 10. Identification of different structural properties in the SPC space for ITQ-21 samples: correlation between SPC#3 and Si/Ge in the starting gel, for three different degrees of crystallinity. Inset: Partial diffractograms corresponding to four samples with different Si/Ge ratios and the same crystallinity (20%), showing the peak shift.
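Returning to the clustering step of Section 3.2.A, the sensitivity of k-means to its starting centroids, and the remedy of repeating the assignment many times, can be sketched as follows. Note the substitution: this sketch uses the batch (Lloyd) form of k-means, in which centroids are recomputed after each full pass, rather than the incremental update described in the text. The function names and the one-dimensional example points are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from randomly chosen starting samples."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def kmeans(points, k, n_restarts=20, seed=0):
    """Repeat k-means from several random starts; keep the tightest result."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])


# Hypothetical one-dimensional example with two obvious groups.
example_points = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
example_labels, example_inertia = kmeans(example_points, k=2)
```

Keeping the run with the lowest within-cluster sum of squares is one common way to obtain the "stable and representative solution" mentioned in the text; for the XRD data each point would be a full intensity vector and k would be 3.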
Consequently, SPCs contain the summarized information of XRD patterns concerning the different structural and morphological changes in the whole of the materials explored. These results demonstrate that the application of dimensional reduction techniques, such as PCA, to the raw XRD data allows one to obtain, in a fully automated manner, a new series of structural components which entirely describe the properties of the synthesized samples. In addition, these structural vectors can be used to improve the prediction performance of QSAR/QSPR models, such as NNs, as well as for the development of new exploration tools (mapping) of nonlinear and multidimensional spaces, such as those found in the development of new microporous materials.

Figure 11. Prediction performance of the NN model using the synthesis factors as input and the crystallinity of ITQ-21 and ITQ-30 as output. (Net topology 5_10_4_2, trained using BackProp with the Momentum algorithm and 80% of the data.)

Table 3. NN and Decision Tree Prediction Performances of the Obtained Phase Using the Synthesis Variables as Model Input

class       % DT accuracy   % NN accuracy
amorphous   92.16           96.08
ITQ-21      93.10           93.10
ITQ-30      92.31           92.31

3.3. Construction of Predictive Models (QSPR/QSAR).
3.3.A. Predictive Modeling of Material Properties from Synthesis Descriptors. As a first step, NN models were obtained using the synthesis descriptors as input and the zeolite crystallinity as output. Very good prediction results could be obtained using a NN with a two-hidden-layer topology and the back-propagation training algorithm (α = 0.3). A total of 70% of the data were employed for the training process and the rest for testing. Figure 11 shows the experimental and predicted crystallinity for both zeolites, clearly illustrating the high accuracy of the model despite the experimental error associated with the synthesis and characterization steps. Subsequently, this predictive model was applied to find the theoretical synthesis conditions that optimize the ITQ-21 crystallinity while keeping the molar ratio Si/Ge > 30. Three different sets of conditions with predicted crystallinity around 60% were selected for experimental testing, with 2 days of crystallization time. The
Figure 12. Decision tree ID3-IV obtained using synthesis descriptors as model input and phase clusters as output. [The importance of each factor is as follows: Si/Ge 100%, Al/(Si + Ge) 79%, MSPT/(Si + Ge) 72%, H2O/(Si + Ge) 70%, and crystallization time 38%.] The initial data partition, called the initial branch or root, encompasses all data records. This root is split into subsets or child branches on the basis of the value of a particular input field, and these may in turn be split again into sub-branches and so on.
Figure 13. NN prediction performance of the SPC using the synthesis factors as input. The correlation factor for the crystallinity of ITQ-21 and ITQ-30 is 0.960 and 0.958, respectively. The inset shows the topology of the best NN.
Figure 14. Eigenvalues for two different data set sizes: on the left-hand side, 60% of the whole available amount of experiments is considered, while on the right side, only 40% is used for the calculation of the eigenvectors.
experimental crystallinity achieved was slightly lower than expected, being close to 50% for the samples, as shown in Figure 11 (filled squares).
Subsequently, predictive models based on decision trees and NNs were computed using just the type of formed material as output data. Figure 12 shows the best decision tree found, which successfully describes the type of material formed as a function of the synthesis variables. Table 3 compares the prediction performance of the NN and decision tree models; both are highly accurate, although the NN model is slightly better. The relative importance of each input factor in the occurrence of each phase follows, in both models, the order Si/Ge > Al/(Si + Ge) > MSPT/(Si + Ge) ≈ H2O/(Si + Ge) > time, contrasting with the standardized effect observed for the crystallinity of each phase (Figure 3), where H2O/(Si + Ge) and Al/(Si + Ge) played the major roles for ITQ-21 and ITQ-30, respectively.
As a second step, predictive models were computed using the SPCs as output for the model, whereas synthesis variables were used as input. This approach may allow prediction of the structural properties of a material, making it possible to distinguish between the type of phase (known or unknown), crystallinity, framework composition, and so forth. The SPC output is well-suited when the aims of the exploration are both the discovery of new structures and the optimization of a given feature when competing phases are also formed. Given that the synthesis variables have been shown by the Pareto analysis to be the main factors in the growth of both ITQ-21 and ITQ-30, and bearing in mind that SPCs are strongly correlated with the type of material formed, its crystallinity, and its framework composition, clear relationships between synthesis descriptors and SPCs are to be expected. Following this approach, an accurate NN model was obtained using the available data (70% for training and 30% for validation), trained with the back-propagation algorithm (α = 0.3). Figure 13 shows the observed SPCs versus the predicted ones, the averaged prediction error for the test samples being in the range of 10%.
Considering all of the predictive results based on decision trees and NNs, we can see in Figure 12 that the lowest Ge content in the ITQ-21 zeolite that can be synthesized with high crystallinity corresponds to a Si/Ge ratio of 37.5. This is in agreement with previous results (ref 19) that suggest that ITQ-21 could be obtained for a Si/Ge ratio of 25, but not for 50.
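To make the decision-tree step more concrete, the sketch below grows a small ID3-style tree by choosing, at each node, the synthesis factor with the highest information gain. It is a generic textbook version of ID3, not the specific ID3-IV variant named in the Figure 12 caption, and the four training rows and their labels are invented purely to exercise the code; they are not results from this study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one discrete feature."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def id3(rows, labels, features):
    """Return a nested dict tree; leaves are class labels."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    rest = [f for f in features if f != best]
    node = {"split_on": best, "children": {}}
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [y for _, y in pairs]
        node["children"][value] = id3(sub_rows, sub_labels, rest)
    return node


# INVENTED toy data: factor levels borrowed from Table 1, labels made up.
toy_rows = [
    {"Si/Ge": 15, "time_days": 1},
    {"Si/Ge": 15, "time_days": 5},
    {"Si/Ge": 50, "time_days": 1},
    {"Si/Ge": 50, "time_days": 5},
]
toy_labels = ["ITQ-21", "ITQ-21", "amorphous", "ITQ-30"]
toy_tree = id3(toy_rows, toy_labels, ["Si/Ge", "time_days"])
print(toy_tree)
```

In this toy case the root splits on Si/Ge because that factor removes more label uncertainty than time does; a ranking of factor importance such as the one quoted in the Figure 12 caption reflects the same kind of comparison made on the real data.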
Figure 15. 3D scatter plot with the first three principal components. On the left-hand side are represented the experiments corresponding to the 40% of the entire data set used for the calculation of the eigenvectors, while on the right side, unseen materials are projected.

Table 4. Best Selected NN: MLP 3:3-10-3:1 (a)

                   training set: 100% recognition   test set: 96% recognition
                   real class                       real class
predicted class    1      2      3                  1      2      3
1                  35     0      0                  58     0      0
2                  0      16     0                  0      17     0
3                  0      0      6                  1      2      9

(a) NN prediction performances of the obtained phase using the SPC coordinates as input.
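The recognition rates quoted in Table 4 follow directly from the confusion-matrix counts, as the short sketch below shows. The orientation used here (rows as predicted class, columns as real class) is an assumed reading of the table layout; the overall recognition rate does not depend on that choice, but the per-class figures do. The function names are hypothetical.

```python
# Counts from Table 4; ASSUMPTION: rows = predicted class, columns = real class.
TRAIN_COUNTS = [[35, 0, 0], [0, 16, 0], [0, 0, 6]]
TEST_COUNTS = [[58, 0, 0], [0, 17, 0], [1, 2, 9]]


def recognition_rate(counts):
    """Fraction of samples on the diagonal (correctly recognized)."""
    total = sum(sum(row) for row in counts)
    correct = sum(counts[i][i] for i in range(len(counts)))
    return correct / total


def per_class_recognition(counts):
    """Fraction of each real class (column) that was predicted correctly."""
    size = len(counts)
    return [counts[j][j] / sum(counts[i][j] for i in range(size))
            for j in range(size)]


print("training:", recognition_rate(TRAIN_COUNTS))
print("test:", recognition_rate(TEST_COUNTS))
print("test, per real class:", per_class_recognition(TEST_COUNTS))
```

The test-set counts give 84 correct assignments out of 87 samples, in line with the 96% recognition stated in the table header, and the 57 training and 87 test samples together account for the 144 experiments.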
This helps to fine-tune the synthesis conditions for the ITQ-21 samples with the lowest Ge content, which will have the maximum stability and better catalytic performance.
3.3.B. Predictive Modeling of Phase Type from the Structural Principal Components. Finally, the correlation between SPCs and the type of structure was studied by NN modeling. Care is required in this study to avoid overfitting the data while still presenting a realistic methodology. Therefore, the stability of the approach is tested by drastically reducing the number of experiments that are used for computing the PCA. Two different sizes, 40% and 60% of the whole available data set, have been used for the calculation of the eigenvectors, and the first three principal components have been kept for both analyses (see Figure 14). Then, the remaining unseen experimental data (60% and 40%, respectively) are projected into the modified space using the analytic definition of the selected principal components (i.e., the first three components; see Figure 15). Next, NNs are trained using only the materials used for the PCA calculations, with PCA coordinates as input and phase types as output. Thus, once the coordinates of the unseen solids have been calculated through the definition of the PCA axes, the NN is used in a second step to assign them a label corresponding to the expected phase class. Table 4 indicates the recognition rates for both training and test sets considering the most drastic PCA study (i.e., 40% of the data for component calculation). It can be argued that the NN plays a rather small role because the separation between classes in the PCA space is sharp. However, the results are excellent, and this approach appears to be of great interest.

Figure 16. Data mining applied in the development of new solid materials: methodology for automated data analysis, visualization, and QSPR modeling.

4. Conclusions
This work presents a complete study integrating high-throughput tools for the synthesis and characterization of solid materials and data-mining techniques in the discovery and optimization of new microporous materials. The phase diagram of the system SiO2:GeO2:Al2O3:F-:H2O:N(16)-methylsparteinium hydroxide has been systematically explored following a factorial design, and the effects of the starting gel composition and of the crystallization time have been determined. Two different zeolites (ITQ-21 and ITQ-30) were detected within the explored space. Data visualization and dimensional reduction were conducted by using principal components analysis and clustering algorithms, allowing extraction of the desired structural vectors from the XRD characterization data. These unsupervised techniques provide a view of the screening results closer to the topology of the explored multidimensional space, including information about the formed phase(s), crystallinity of the material, particle size, and isomorphic substitution degree, and they also reduce the experimental noise of the original characterization data. Moreover, the automation of this type of analysis can be easily implemented without any prior knowledge of the problem.
Different modeling techniques were applied for the prediction of the properties of the materials obtained, considering the synthesis data as input of the model. Furthermore, different "material property" descriptors were considered as outcome of the model, that is, crystallinity of the formed phase, SPCs computed by PCA, or clustering results. It was found that the final properties of the materials could be successfully modeled using neural networks, obtaining high-quality predictions, especially when applying SPCs as model output. This proposed methodology (see Figure 16) for unsupervised characterization analysis and subsequent predictive modeling could be applied when other material properties are to be explored or optimized, such as, for instance, acidity, fluorescence/phosphorescence, or adsorption properties, and when other characterization techniques are employed, such as Raman, NMR, photoluminescence spectroscopy, and IR imaging. Finally, these predictive models could be used for guiding the next experimental round, allowing one to skip the screening of virtually low-performing materials and promoting the synthesis of new dissimilar materials (with respect to the explored space), thereby accelerating the multiparametric space exploration.

Acknowledgment. Financial support from the Spanish government (Project MAT 2003-07945-C02-01 and Grants TIC2003-07369-C02-01 and FPU AP2003-4635) and the E.U. Commission (TOPCOMBI Project) is gratefully acknowledged. The authors thank I. Millet and J. Herrera for technical assistance.

Supporting Information Available: Details for data mining techniques. This material is available free of charge via the Internet at http://pubs.acs.org.

CM060620K