Rapid and Accurate Machine Learning

Aug 25, 2014 - ... Department of Chemistry, University of Ottawa, 10 Marie Curie Private, Ottawa, ... relationship (QSPR) models using advanced machin...
1 downloads 0 Views 281KB Size
Letter pubs.acs.org/JPCL

Rapid and Accurate Machine Learning Recognition of High Performing Metal Organic Frameworks for CO2 Capture Michael Fernandez, Peter G. Boyd, Thomas D. Daff, Mohammad Zein Aghaji, and Tom K. Woo* Centre for Catalysis Research and Innovation, Department of Chemistry, University of Ottawa, 10 Marie Curie Private, Ottawa, Ontario K1N 6N5, Canada S Supporting Information *

ABSTRACT: In this work, we have developed quantitative structure−property relationship (QSPR) models using advanced machine learning algorithms that can rapidly and accurately recognize high-performing metal organic framework (MOF) materials for CO2 capture. More specifically, QSPR classifiers have been developed that can, in a fraction of a section, identify candidate MOFs with enhanced CO2 adsorption capacity (>1 mmol/g at 0.15 bar and >4 mmol/g at 1 bar). The models were tested on a large set of 292 050 MOFs that were not part of the training set. The QSPR classifier could recover 945 of the top 1000 MOFs in the test set while flagging only 10% of the whole library for compute intensive screening. Thus, using the machine learning classifiers as part of a high-throughput screening protocol would result in an order of magnitude reduction in compute time and allow intractably large structure libraries and search spaces to be screened. SECTION: Surfaces, Interfaces, Porous Materials, and Catalysis

M

known MOF structures. For example consider the MOF Zn2(1,4-benzenedicarboxylate)2(pyrazine)21 or ZBP, whose calculated CO2 uptake can be enhanced from 0.9 in the unfunctionalized case (Figure 1a) to 4.2 mmol/g (0.15 bar, 298

etal−organic frameworks (MOF) are a class of nanoporous solids formed by the self-assembly of metal ions or clusters and polydentate organic linkers that function as structural building units (SBUs) to create open crystalline frameworks. MOFs have been synthesized with a diverse range of pore geometry, surface area, structural robustness, chemistry, and functionality suitable for different applications such as gas separation1 and storage,2,3 catalysis,4−6 nonlinear optics,7 sensing,8,9 controlled drug release,10 and light-harvesting.11 Due to the nearly limitless combinations possible, finding the optimal set of SBUs to construct a MOF for a given application presents a remarkable design challenge. Molecular simulations of nanoporous materials have advanced such that many structural and functional properties of a MOF can be reliably calculated given only the crystalline structure of the material.12 For example, the gas adsorption isotherms of small molecules such as CO2 can be computed in excellent agreement with experimental results.13 In such simulations, a single point DFT calculation is typically performed on the periodic structure to derive the required atomic charge parameters.14 These are then followed by a grand canonical Monte Carlo (GCMC) simulations, which can be completed on a modern workstation in under half an hour, to predict the gas uptake capacity of a material.12,15,16 Recent in silico generation of massive libraries of hypothetical nanoporous materials has resulted in millions of hypothetical zeolites17 and, so far, hundreds of thousands of hypothetical MOFs structures18,19 being readily available for high-throughput (HT) computational screening.20 In principle, libraries of tens of millions of MOFs structures could be generated simply by systematically functionalizing the SBUs of © 2014 American Chemical Society

Figure 1. (a) Organic SBUs of the MOF ZBP17 with the colored circles depicting the possible functionalizable positions available by symmetry. (b) Functionalization of ZBP which enhances the calculated CO2 uptake from 0.9 to 4.2 mmol/g (0.15 atm CO2, 298 K).

K) through the functionalization shown in Figure 1b. Considering four symmetric substitution points in ZBP (Figure 1a), finding the optimal functionalization with a small library of 35 functional groups gives rise to a total of ∼1.5 million (354) unique combinations. If single-point DFT calculations are required to determine accurate simulation charges for each structure, then exhaustive screening of millions of structures is generally not feasible. Applying larger libraries of functional groups or optimizing materials with more functionalizable Received: June 27, 2014 Accepted: August 19, 2014 Published: August 25, 2014 3056

dx.doi.org/10.1021/jz501331m | J. Phys. Chem. Lett. 2014, 5, 3056−3060

The Journal of Physical Chemistry Letters

Letter

the test set used to validate our models. In order to train and validate our models, GCMC simulations were performed to calculate the CO2 uptake of all 324 500 members of the database at 0.15 and 1 bar CO2 and 298 K. Details of the structure database, its construction, and further computational details are provided in the Supporting Information

positions (MOFs with nine different functional groups have been reported22) would result in combinatorially larger search spaces. Although such large search spaces are encountered when considering a single unfunctionalized MOF, there are nearly endless combinations of unfunctionalized organic and inorganic SBUs that can be combined to construct MOFs from. Thus, the exhaustive screening of such immense search spaces with GCMC simulations is practically impossible, even when using rapid methods to generate the simulation parameters.23,24 Machine learning and chemoinformatic models could be used prior to performing compute intensive simulations (i.e., GCMC or DFT) in order to preselect high-performing candidates while discarding low-performing structures. Because the prescreening of a candidate structure would take a fraction of a second rather than minutes or hours, the use of machine learning technology could make the virtual screening of intractably large search spaces feasible. For methane storage in MOFs, we developed accurate quantitative structure− property relationship (QSPR) models using purely geometrical features of the material, namely, pore size, surface area, and void fraction.25 Such geometry-only QSPR models performed poorly for predicting CO2 uptake in MOFs. As a result, we introduced the atomic property-weighted radial distribution function (AP-RDF) descriptor that also captures the chemical features of a periodic material in addition to its geometric features.26 Nonlinear regression models of AP-RDF descriptors yielded good predictions of the absolute CO2 uptake values at different pressures. For larger and much more chemically diverse MOF libraries, we expect that these descriptors would also yield good predictive models. In this Letter, we demonstrate that reliable machine learning classifiers can rapidly identify promising candidate MOFs for CO2 capture, thereby significantly reducing the number of higher level and compute intensive simulations required to virtually screen a large search space of materials. More specifically, QSPR classifiers that designate structures as high or low performing were trained on a diverse set of 32 450 structures using machine learning algorithms. Separate classifiers were built to recognize high-performing MOFs at 0.15 and 1 bar CO2 at 298 K, conditions relevant for CO2 capture from postcombustion flue gas and landfill methane purification, respectively.27 A set of 292 050 hypothetical MOFs was screened using the QSPR classifiers to rapidly identify highperforming candidate MOFs and the results compared to the uptakes obtained from GCMC simulations. In brief, the remainder of this communication is as follows: first, a description of the database of structures and the QSPR descriptors used to calibrate the models is given, then an account of how the classifiers were trained is provided. The work ends with a validation of the models and concluding remarks. A database of 324 500 hypothetical MOF structures generated by combining 66 SBUs and 19 functional groups was utilized. Figure S1 in the Supporting Information shows the complete set of SBUs and functional groups. Using a similar algorithm to that introduced by Wilmer et al.,18 we generated ∼1800 unfunctionalized base structures. A total of 324 500 noninterpenetrated MOF structures were then generated by the functionalization of symmetrical hydrogen positions with up to two different functional groups. All structures in the database were geometry optimized. We randomly selected 10% of the database (32 450 MOFs) to form the calibration set used to train the QSPR models. The remaining 292 050 MOFs formed

all atom pairs

RDF P(R ) = f



2

−B(rij − R ) PP i je

(1)

i,j

AP-RDF descriptors were used to develop the structure− absorption relationship in the calibration set. The AP-RDF profile of a MOF framework can be interpreted as the weighted probability distribution to find an atom pair in a spherical volume of radius R inside the unit cell. Equation 1 defines the AP-RDF function where the summation is over all atom pairs in the unit cell, rij is the minimum image convention distance of these pairs, B is a smoothing parameter, and f is simply a scaling or normalization factor. For the descriptor to account for chemical information, the radial probabilities are weighted by three tabulated element-based atomic properties, P, the electronegativity, polarizability, and van der Waals volume. Given the crystal structure of a MOF the AP-RDF profiles of the material can be evaluated in a fraction of a second on a standard workstation. Figure 2a shows the averaged electronegativity weighted APRDF profiles of all the MOFs in the calibration set with uptakes 26

Figure 2. Averaged electronegativity weighted AP-RDF plot of all MOFs in the calibration set that possessed CO2 uptakes greater than and lower than 1 and 6 mmol/g at 0.15 and 1 bar CO2, respectively.

greater than and less than 1 mmol/g at 0.15 bar CO2. On average, the AP-RDF profile of MOFs with higher uptakes have a maximum at 7.5 Å and show a steep decline after 10 Å. Meanwhile, MOFs with lower uptakes 1 mmol/g at 0.15 bar and (b) >4 mmol/g at 1 bar by the QSPR classifiers from the test set of 292 050 structures. Three different cutoffs are shown that recover varying percentages of all true-positives. The dashed line indicates the total number of actual high-performing MOFs (all of the true positives). The number in parentheses gives the percentage of the total test set that the QSPR classifier identified as high-performing.

The machine learning based prescreening may be used in a variety of scenarios. For example, if the higher level calculations are very compute intensive, then one may choose a cutoff that only recovers 30% of the actual high performing materials. In this case, over 95% of the MOFs flagged as high-performing will actually be so and worth performing the compute intensive calculations on. In other words, the false positive rate is very low. On the other hand, if one is interested in recovering as many high-performing MOFs as possible, then a lower cutoff that recovers 90% of the real high-performers is appropriate. In this case, the false positive rate also increases, meaning that candidates are being flagged for more compute intensive calculations that are not actually high-performing. Using the 90% cutoff value, the total number of MOFs flagged as highperforming constitutes 28% of the total database. In other words, to recover 90% of the high-performing MOFs, 28% of the total database would have to be screened at the more compute intensive level. An important question that naturally arises is what are the distributions of CO2 uptakes of the MOFs that are “missed” by the classifiers? To examine this, we compare the distribution of CO2 uptakes for the true positives captured by the classifiers to the distribution of uptakes for all of the actual high-performing MOFs. These are plotted in Figure 4 for both pressure regimes. The distribution of CO2 uptakes for all of the true high-

Figure 4. Distribution of the calculated CO2 uptakes of all true positives in the test set compared with the distribution of true positives recovered by the QSPR classifiers at different cut-offs. (a) The distributions at 0.15 bar CO2 pressure and (b) the distributions at 1 bar. 3058

dx.doi.org/10.1021/jz501331m | J. Phys. Chem. Lett. 2014, 5, 3056−3060

The Journal of Physical Chemistry Letters

Letter

performing MOFs is plotted in black. At 0.15 bar CO2, the majority of the high-performing MOFs have uptakes less than 2 mmol/g, although there are still hundreds of MOFs that have uptake capacities greater than 3 mmol/g, which is considered very high under these conditions. Also shown in Figure 4 are the distribution of true positives determined by the QSPR classifiers at various cut-offs. Deviation of the QSPR predicted true positives (colored lines) from the total true positive distribution (black line) reveals the uptake values of the highperforming MOFs that the QSPR classifiers “miss” (the falsenegatives). For all cut-offs, we see that the false-negatives are strongly biased toward the low end of CO2 uptake. Take, for example, the distribution of true positives at a 50% cutoff (red) for the 0.15 bar pressure. In the range between 1 and 2 mmol/g in Figure 4a the red distribution is well below the black distribution, meaning that the classifier is “missing” most of the MOFs in that uptake range. However, in the uptake range of 3 mmol/g or higher, the red and black distributions nearly match, revealing that the classifier is capturing most of the MOFs with the best uptakes. This is quantified in Table 1, which shows that

be relevant for other classes of crystalline materials as long as data is available to calibrate the QSPR models. Analysis of the AP-RDF profiles suggests that materials with more compact frameworks with interatomic distances in the range from 6 to 9 Å exhibit higher affinity for CO2 at both pressure regimes studied. Efforts are currently underway in our lab to identify high-performing materials based on other gas adsorption parameters such as the selectivity, working capacity and heat of adsorption. The QSPR classifiers described here are available for public use at http://titan.chem.uottawa.ca.



S Supporting Information *

Details of the MOF structure library, GCMC simulations, machine learning parameters, classifier details, and additional figures and tables. This material is available free of charge via the Internet at http://pubs.acs.org.



*E-mail: [email protected]. Notes

The authors declare no competing financial interest.



true-positive recovery cutoff 30%

50%

90%

0.15 bar 1 bar

790 (5%) 731 (7%)

945 (10%) 905 (12%)

999 (28%) 968 (33%)

AUTHOR INFORMATION

Corresponding Author

Table 1. Number of the Top 1000 MOFs from the 292 050 Member Test Set That the Classifiers Correctly Identifies with Different Classifier Cut-Offsa pressure

ASSOCIATED CONTENT

ACKNOWLEDGMENTS Authors thank the NSERC of Canada, Carbon Management Canada, and Canada Research Chairs Program for financial support and Compute Canada, Canada Foundation for Innovation, and the University of Ottawa for computing resources.

a The fraction of the test set flagged for GCMC simulations is given in parentheses.



the classifier with the 50% cutoff is able to capture 945 of the top 1000 MOFs. This is impressive given that the classifier only flags 10% (value in parentheses in Table 1) of the total test set for more compute intensive screening. In other words, using the QSPR classifier to prescreen materials for CO2 uptake at 0.15 bar, one would recover the highest performing materials (945 out of 1000) while performing more computationintensive simulations on only 10% of the total search space. Similar but less favorable results are obtained at the 1 bar CO2 pressure shown in Figure 4b and Table 1. Just as machine learning and bioinformatics have greatly influenced the understanding of the links between protein structure and function, and accelerated the drug discovery cycle, the application of informatics to the field of materials science is poised to greatly impact advanced materials discovery. In this work, machine learning and specialized structural descriptors were used to optimize QSPR models for rapidly identifying high-performing materials for CO2 capture at conditions relevant to postcombustion (0.15 bar) and landfill gas purification (1 bar). It was demonstrated on a test set of 292 050 MOFs that the 0.15 bar classifier could recover 945 of the top 1000 MOFs in the database while only flagging 10% of the total database for more compute intensive screening. Translating these recovery rates to a more concrete example, consider the case of the MOF ZBP with ∼1.5 million functional group combinations. Applying the classifier in a prescreening role would reduce the number of required GCMC simulations from ∼1.5 million to ∼150 000, thereby making the screening of an intractably large search space feasible. The approach should be applicable to avoid brute force screening of large structural libraries of MOFs for other applications and should

REFERENCES

(1) Li, J.-R.; Kuppler, R. J.; Zhou, H.-C. Selective Gas Adsorption and Separation in Metal−Organic Frameworks. Chem. Soc. Rev. 2009, 38, 1477. (2) Farha, O. K.; Yazaydın, A. Ö .; Eryazici, I.; Malliakas, C. D.; Hauser, B. G.; Kanatzidis, M. G.; Nguyen, S. T.; Snurr, R. Q.; Hupp, J. T. De Novo Synthesis of a Metal−Organic Framework Material Featuring Ultrahigh Surface Area and Gas Storage Capacities. Nat. Chem. 2010, 2, 944. (3) Suh, M. P.; Park, H. J.; Prasad, T. K.; Lim, D.-W. Introduction to Metal−Organic Frameworks. Chem. Rev. 2012, 112, 782. (4) Ma, L.; Abney, C.; Lin, W. Enantioselective Catalysis with Homochiral Metal−Organic Frameworks. Chem. Soc. Rev. 2009, 38, 1248. (5) Corma, A.; García, H.; Llabrés i Xamena, F. X. Engineering Metal Organic Frameworks for Heterogeneous Catalysis. Chem. Rev. 2010, 110, 4606. (6) Lee, J.; Farha, O. K.; Roberts, J.; Scheidt, K. A.; Nguyen, S. T.; Hupp, J. T. Metal−organic Framework Materials As Catalysts. Chem. Soc. Rev. 2009, 38, 1450. (7) Evans, O. R.; Lin, W. Crystal Engineering of NLO Materials Based on Metal−Organic Coordination Networks. Acc. Chem. Res. 2002, 35, 511. (8) Chen, B.; Xiang, S.; Qian, G. Metal−Organic Frameworks with Functional Pores for Recognition of Small Molecules. Acc. Chem. Res. 2010, 43, 1115. (9) Xie, Z.; Ma, L.; DeKrafft, K. E.; Jin, A.; Lin, W. Porous Phosphorescent Coordination Polymers for Oxygen Sensing. J. Am. Chem. Soc. 2010, 132, 922. (10) Horcajada, P.; Gref, R.; Baati, T.; Allan, P. K.; Maurin, G.; Couvreur, P.; Férey, G.; Morris, R. E.; Serre, C. Metal−Organic Frameworks in Biomedicine. Chem. Rev. 2012, 112, 1232. 3059

dx.doi.org/10.1021/jz501331m | J. Phys. Chem. Lett. 2014, 5, 3056−3060

The Journal of Physical Chemistry Letters

Letter

(11) Kent, C. A.; Mehl, B. P.; Ma, L.; Papanikolas, J. M.; Meyer, T. J.; Lin, W. Energy Transfer Dynamics in Metal−Organic Frameworks. J. Am. Chem. Soc. 2010, 132, 12767. (12) Düren, T.; Bae, Y.-S.; Snurr, R. Q. Using Molecular Simulation to Characterise Metal−Organic Frameworks for Adsorption Applications. Chem. Soc. Rev. 2009, 38, 1237. (13) Vaidhyanathan, R.; Iremonger, S. S.; Shimizu, G. K. H.; Boyd, P. G.; Alavi, S.; Woo, T. K. Direct Observation and Quantification of CO2 Binding Within an Amine-Functionalized Nanoporous Solid. Science 2010, 330, 650. (14) Campaña,́ C.; Mussard, B.; Woo, T. K. Electrostatic Potential Derived Atomic Charges for Periodic Systems Using a Modified Error Functional. J. Chem. Theory Comput. 2009, 5, 2866. (15) Smit, B.; Maesen, T. L. M. Molecular Simulations of Zeolites: Adsorption, Diffusion, and Shape Selectivity. Chem. Rev. 2008, 108, 4125. (16) Tafipolsky, M.; Amirjalayer, S.; Schmid, R. Atomistic Theoretical Models for Nanoporous Hybrid Materials. Microporous Mesoporous Mater. 2010, 129, 304. (17) Pophale, R.; Cheeseman, P. a; Deem, M. W. A Database of New Zeolite-Like Materials. Phys. Chem. Chem. Phys. 2011, 13, 12407. (18) Wilmer, C. E.; Leaf, M.; Lee, C. Y.; Farha, O. K.; Hauser, B. G.; Hupp, J. T.; Snurr, R. Q. Large-Scale Screening of Hypothetical Metal−Organic Frameworks. Nat. Chem. 2012, 4, 83. (19) Martin, R. L.; Simon, C. M.; Smit, B.; Haranczyk, M. In Silico Design of Porous Polymer Networks: High-Throughput Screening for Methane Storage Materials. J. Am. Chem. Soc. 2014, 136, 5006. (20) Colón, Y. J.; Snurr, R. Q. High-Throughput Computational Screening of Metal−Organic Frameworks. Chem. Soc. Rev. 2014, 43, 5735. (21) Henke, S.; Schmid, R.; Grunwaldt, J.-D.; Fischer, R. A. Flexibility and Sorption Selectivity in Rigid Metal−Organic Frameworks: the Impact Of Ether-Functionalised Linkers. Chemistry 2010, 16, 14296. (22) Deng, H.; Doonan, C. J.; Furukawa, H.; Ferreira, R. B.; Towne, J.; Knobler, C. B.; Wang, B.; Yaghi, O. M. Multiple Functional Groups of Varying Ratios in Metal−Organic Frameworks. Science 2010, 327, 846. (23) Wilmer, C. E.; Kim, K. C.; Snurr, R. Q. An Extended Charge Equilibration Method. J. Phys. Chem. Lett. 2012, 3, 2506. (24) Kadantsev, E. S.; Boyd, P. G.; Daff, T. D.; Woo, T. K. Fast and Accurate Electrostatic in Metal Organic Frameworks with a Robust Charge Equilibration Parametrization for High-Throughput Virtual Screening of Gas Adsorption. J. Phys. Chem. Lett. 2013, 4, 3056. (25) Fernandez, M.; Woo, T. K.; Wilmer, C. E.; Snurr, R. Q. LargeScale Quantitative Structure-Property Relationship (QSPR) Analysis of Methane Storage in Metal−Organic Frameworks. J. Phys. Chem. C 2013, 117, 7681. (26) Fernandez, M.; Trefiak, N. R.; Woo, T. K. Atomic Property Weighted Radial Distribution Functions Descriptors of MOFs for the Prediction of Gas Uptake Capacity. J. Phys. Chem. C 2013, 117, 14095. (27) Wilmer, C. E.; Farha, O. K.; Bae, Y.-S.; Hupp, J. T.; Snurr, R. Q. Structure−Property Relationships of Porous Materials for Carbon Dioxide Separation and Capture. Energy Environ. Sci. 2012, 5, 9849. (28) Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273.

3060

dx.doi.org/10.1021/jz501331m | J. Phys. Chem. Lett. 2014, 5, 3056−3060