Research Article pubs.acs.org/acscombsci

Machine Learning Using Combined Structural and Chemical Descriptors for Prediction of Methane Adsorption Performance of Metal Organic Frameworks (MOFs)

Maryam Pardakhti,† Ehsan Moharreri,‡ David Wanik,§ Steven L. Suib,‡,∥ and Ranjan Srivastava*,†

† Department of Chemical and Biomolecular Engineering, University of Connecticut, Storrs, Connecticut 06269, United States
‡ Institute of Materials Science, University of Connecticut, Storrs, Connecticut 06269, United States
§ Department of Civil and Environmental Engineering, University of Connecticut, Storrs, Connecticut 06269, United States
∥ Department of Chemistry, University of Connecticut, Storrs, Connecticut 06269, United States



S Supporting Information *

ABSTRACT: Using molecular simulation for adsorbent screening is computationally expensive and thus prohibitive to materials discovery. Machine learning (ML) algorithms trained on fundamental material properties can potentially provide quick and accurate methods for screening purposes. Prior efforts have focused on structural descriptors for use with ML. In this work, the use of chemical descriptors, in addition to structural descriptors, was introduced for adsorption analysis. Structural and chemical descriptors were evaluated with various ML algorithms, including decision tree, Poisson regression, support vector machine, and random forest, to predict methane uptake on hypothetical metal organic frameworks. To highlight their predictive capabilities, ML models were trained on 8% of a data set consisting of 130,398 MOFs and then tested on the remaining 92% to predict methane adsorption capacities. When structural and chemical descriptors were jointly used as ML input, the random forest model with 10-fold cross validation proved to be superior to the other ML approaches, with an R² of 0.98 and a mean absolute percent error of about 7%. The training and prediction using the random forest algorithm for adsorption capacity estimation of all 130,398 MOFs took approximately 2 h on a single personal computer, several orders of magnitude faster than actual molecular simulations on high-performance computing clusters.

KEYWORDS: metal−organic frameworks, methane adsorption, machine learning, computational screening, predictive modeling



© 2017 American Chemical Society
Received: March 29, 2017; Revised: August 7, 2017; Published: August 11, 2017
DOI: 10.1021/acscombsci.7b00056, ACS Comb. Sci. 2017, 19, 640−645

INTRODUCTION

Natural gas is an abundant resource widely regarded as a “transition fuel” toward renewable energy. Methane, the predominant component of natural gas, has the highest combustion energy per unit of carbon dioxide emitted of any hydrocarbon.1 Despite the high combustion energy (ΔH298 = −890 kJ mol−1),2 the volumetric energy density of methane at ambient conditions is only 0.11% of that of gasoline.3 To encourage broader use of natural gas as a transport fuel, adsorbed natural gas (ANG) technology is vigorously pursued by energy and environmental policy-making agencies.3 The most recent methane uptake target for adsorbent materials set by the U.S. Department of Energy is 350 mL(STP)/mL (assuming packing loss).3 The goal is to make ANG technology superior to compressed natural gas (CNG) technology by achieving higher capacity at lower pressures, as well as raising the energy density to 30% of that of gasoline. To that end, identifying superior methane adsorbent materials becomes important.

The search for the optimal adsorbent requires aggressive screening of a variety of porous materials. Computer-assisted screening is an emerging field that significantly increases the discovery rate of advanced materials.4,5 Metal organic frameworks (MOFs) are particularly amenable to computer-assisted high-throughput screening.6−14 MOFs possess unique characteristics such as regularity, variety, and designability.15 These features facilitate simulation of the physicochemical properties of MOFs. Through a reticular synthesis approach, MOFs are rationally designed by engineering ligand size, organic linker size, metal-containing cluster composition, and synthesis conditions.15,16 To determine various fundamental attributes of MOFs, such as substrate−adsorbent interactions, force-field-based molecular simulation methods have been used.17−20 Systematic synthesis of nodes and linkers as building blocks and high-throughput screening of hypothetical MOFs by grand canonical Monte Carlo (GCMC) simulations have proven to be an effective approach.21,22 One of the first molecular simulation packages specifically designed for MOFs was the RASPA software platform.23 RASPA is an evolution of the more general purpose MUltipurpose SImulation Code (MUSIC) developed by the Snurr group.24,25

Because of the vast number of possible MOF structures, performing molecular simulations as a screening tool is computationally prohibitive. Various approaches to overcome this barrier have been pursued. Simon et al. (2015) used a comprehensive database of various adsorbent materials and performed GCMC simulations to examine the relationship between structural properties and adsorption.6 The data set included zeolites,12 hypothetical MOFs (hMOFs),22 porous polymer networks (PPNs),26 hypothetical zeolitic imidazolate frameworks (hZIFs),27 and computation-ready experimental (CoRE) MOF14 structures. They suggested that, because of the high dimensionality of the data, machine learning approaches could help extract complicated relationships and yield accurate predictions.6

Fernandez et al. (2013) developed regression models using radial distribution functions as predictor variables to predict CO2 and N2 uptake.7 In a separate 2013 study, Fernandez et al. used structural variables, such as void fraction and pore size, to predict CH4 uptake and achieved an R² of 0.85.28 In another work, Fernandez et al. (2014) applied a classification approach based upon quantitative structure−property relationships (QSPR) to predict top performing MOFs for CO2 adsorption, achieving 94.5% accuracy.8 Sezginel et al. (2015) used structural properties such as surface area, crystal density, void fraction, and pore diameter, as well as the isosteric heat of adsorption (Qst), to predict CH4 uptake in a set of adsorbent MOFs. Use of heat of adsorption provides information not discernible from structural properties. However, the number of MOFs for which the isosteric heat of adsorption was available was limited.9 Fernandez et al. (2016) tried a variety of machine learning techniques and variables, reaching an R² of 0.94 in predicting CO2 uptake.11

However, structural features do not provide sufficient information to characterize adsorption properties fully. For methane uptake, which is the focus of this work, it is important to construct a comprehensive model to account for chemical interactions, particularly at higher levels of adsorption. Literature exists which emphasizes adsorbent features that cannot be fully captured exclusively by structural variables.9,10,29 Chemical composition variables, such as type and number of atoms, have been used for machine learning, albeit in drug discovery.30 In a recent study, Ash et al. (2017) applied a variety of molecular dynamics (MD) chemical descriptor sets for molecular biological analysis.31 Expanding the descriptor set helps extract more information about the subject, thus improving the machine learning models. Because of the overemphasis on physisorption in the literature, implementation of chemical variables has been overlooked in data mining studies concerning adsorbents.

In the current work, the use of chemical descriptors, which are available for MOF structures, was introduced and explored. There has been empirical evidence that certain machine learning algorithms perform robustly on high dimensional data.32 Although estimation of fundamental chemical variables solely from composition requires quantum or thermodynamic calculations, it is possible to use machine learning approaches to circumvent cumbersome computations. To explore this possibility, we introduced a comprehensive set of chemical composition features along with structural variables powered by robust machine learning models to capture the underlying intricacies of adsorption phenomena.



DATASET AND PREDICTORS

The data used to train and validate the machine learning algorithms were taken from the database of hypothetical MOFs provided by the Snurr group and available at http://hmofs.northwestern.edu.22 A total of 130 398 hMOFs were extracted from the database. Characteristics associated with the MOFs and available from the database include both volumetric-based uptake (cm3 methane/cm3 hMOF) and mass-based uptake of methane (cm3 methane/g hMOF) at 35 bar and 298 K. In addition, physical features and the crystal structures are provided for the hMOFs. Methane uptake was calculated by the GCMC simulation method (35 bar and 298 K). The crystal structures were designed and produced by the Wilmer et al. (2012) method.22 The predictors used for structural and chemical properties in this work are listed in Tables 1 and 2, respectively. Structural properties in Table 1 include surface area, density, and void fraction, with values ranging from 0 to 6947 m2 g−1, 0.118−4.042 g cm−3, and 0.051−0.967, respectively. The surface area and void fractions were calculated by Monte Carlo molecular simulation of nitrogen and helium adsorption, respectively.

Table 1. Parameters Characterizing MOF Structural Properties

parameter                              min.    median   mean    max.
void fraction                          0.05    0.69     0.65    0.97
surface area [m2/g]                    0.00    2703     2740    6947
density [g/cm3]                        0.12    0.79     0.86    4.04
dominant pore diameter                 0.00    6.75     7.77    24.75
maximum pore diameter                  0.00    8.25     9.34    24.75
interpenetration capacity              1.00    2.00     2.09    4.00
number of interpenetrating frameworks  1.00    1.00     1.51    4.00

Chemical predictors, shown in Table 2, were introduced in this work and extracted from crystal structures. They included the type and number of each atom, degree of unsaturation,33 metal to carbon ratio, halogen to carbon ratio, nitrogen to oxygen ratio, and degree of electronegativity. Each atom in the MOF structure has an important role in the adsorption process. The metals in the data set consisted of copper, vanadium, zirconium, and zinc. The effects of the metals in each hMOF unit cell on the training process were incorporated in three ways. First, the type of metal present was explicitly accounted for as a categorical variable. Next, the total number of atoms of each metal affecting methane adsorption represented another quantitative variable. Finally, the percentage of metal relative to carbon, called the metallic percentage, was introduced as a chemical variable to model the metals affecting methane uptake. The metallic percentage was motivated by the fact that the location of metal atoms relative to open sites significantly affects the adsorption yield of MOFs.34 Wu et al. (2009) examined methane adsorption on five different MOFs and showed that methane molecules close to open metal sites are attracted more strongly than those at other adsorption sites. Although pore volume was not explicitly incorporated into the analysis, it was implicitly accounted for via its impact on the density and void fraction values. Similarly, the total number of atoms per unit cell was not explicitly included in the model. Instead, the total number of atoms of each species present per unit cell was accounted for. All numerical predictors were normalized to the range [−1, 1], where the smallest and largest values of each predictor were −1 and 1, respectively.

Table 2. Parameters Characterizing MOF Chemical Properties

variable                                    min.   median  mean    max.   note
hydrogen (H)                                0      25      36.12   594    number of hydrogen atoms per unit cell
carbon (C)                                  6      53      68.28   606    number of carbon atoms per unit cell
nitrogen (N)                                0      3       7.103   216    number of nitrogen atoms per unit cell
oxygen (O)                                  8      16      21.39   216    number of oxygen atoms per unit cell
fluorine (F)                                0      0       2.07    120    number of fluorine atoms per unit cell
chlorine (Cl)                               0      0       1.86    112    number of chlorine atoms per unit cell
bromine (Br)                                0      0       1.79    108    number of bromine atoms per unit cell
vanadium (V)                                0      0       0.31    12     number of vanadium atoms per unit cell
copper (Cu)                                 0      0       0.89    24     number of copper atoms per unit cell
zinc (Zn)                                   0      2       3.47    24     number of zinc atoms per unit cell
zirconium (Zr)                              0      0       0.23    12     number of zirconium atoms per unit cell
metal type                                                                categorical variable: V, Cu, Zn, and Zr
total degree of unsaturation                6      39      51      565    {[(number of carbons × 2) + 2 − number of hydrogens]/2} (a)
degree of unsaturation per carbon           0.19   0.77    0.77    1.36   total degree of unsaturation/number of carbons
metallic percentage [%]                     0.85   7       8.35    50.0   [number of metal atoms/number of carbon atoms] × 100
oxygen to metal ratio                       6.5    8       9.61    51     [2 × number of oxygen atoms]/total number of metal atoms; surrogate of average oxidation state
electronegative atoms to total atoms ratio  0.12   0.27    0.26    0.62   [number of electronegative atoms]/[total number of atoms]
weighted electronegativity per atom         0.35   0.89    0.86    2.15   [sum of weighted (b) electronegative atoms]/[total number of atoms]
nitrogen to oxygen ratio                    0.00   0.25    0.41    6.50   number of nitrogen atoms/number of oxygen atoms

(a) For other elements: oxygens are ignored; halides (F, Cl, Br, I) are treated as hydrogen and nitrogen is counted as one-half of carbon.
(b) Electronegative atoms: O, N, F, Cl, and Br weighted by electronegativity.

For this study, 8% of the total hMOFs were used to train the ML algorithms. In many real simulation cases for materials screening, computational costs limit the generation of larger data sets for training. Furthermore, training on large data sets may not significantly increase accuracy relative to using a smaller data set. As a result, the data set was randomly divided into a training set consisting of 8% of the data and a test set consisting of the other 92% of the data. The choice of 8% for training was made to be consistent with the data set analysis of Simon et al. (2015), who also selected about 8% of the hMOF data for training.6 It should be noted that when larger fractions of the data set were used for training (up to 75%), results were still on par with the 8% analysis (data not shown).

METHODS AND MODELS

Four algorithms were evaluated in this work to compare how different models predict the data. The first was the decision tree (DT) algorithm, also referred to as a classification tree. It starts with a single pass that uses “if−then” logic to train a layer of classifications. Each layer applies the predictors that affect the decision-making process at that step, and layers continue forming until the best fit is reached.35−37 Poisson regression is a generalized linear model that allows for direct interpretation of the coefficients associated with the model.38 The support vector machine (SVM) approach uses hyperplane-based classification, constructing hyperplanes with maximum separation margins and accommodating nonlinear kernels. It has been one of the most widely used classification techniques.39,40 For the studies described here, a linear kernel with a tolerance of 0.001 was used. The random forest41 (RF) approach is a supervised learning method that is an extension of the DT algorithm. The forest consists of an ensemble of decision trees, where the final result is the average of the predictions from all of the decision trees. While the RF method is more complex than the DT algorithm, it is also more robust.41,42 The random forest used here consisted of 250 trees. Each algorithm was trained on the training data for methane uptake. The algorithms were then used to predict methane uptake for the test data set. For both volumetric and mass-based uptake, three classes of variables were tried and compared to evaluate predictive capability: structural only (SO) variables, chemical only (CO) variables, and structural and chemical (SC) variables together.

ALGORITHM EVALUATION

Performance of each ML algorithm was evaluated by calculating the R² value, the mean absolute percentage error (MAPE), the mean error (ME), and the root-mean-square error (RMSE). The R² value was calculated using eq 1, where n, yi, ui, and u̅ are the number of MOFs, simulated methane uptake, predicted methane uptake, and average methane uptake values, respectively.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - u_i)^2}{\sum_{i=1}^{n} (y_i - \bar{u})^2} \qquad (1)$$

The various error values for each algorithm were calculated using eqs 2 to 4.

$$\mathrm{MAPE} = \sum_{i=1}^{n} \frac{|y_i - u_i|}{y_i} \times \frac{100}{n} \qquad (2)$$
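As a worked illustration of eqs 1−4, the following is a minimal Python sketch and not the authors' code. The function name `regression_metrics` and the sample values are hypothetical. The article writes the average in eq 1 as u̅ without stating which series it averages; the sketch assumes the conventional choice, the mean of the simulated (GCMC) values.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y: Sequence[float], u: Sequence[float]) -> Dict[str, float]:
    """Return R2, MAPE (%), ME, and RMSE for simulated values y and predictions u."""
    if len(y) != len(u) or len(y) == 0:
        raise ValueError("y and u must be non-empty and of equal length")
    n = len(y)
    residuals = [yi - ui for yi, ui in zip(y, u)]
    ss_res = sum(r * r for r in residuals)
    # Assumption: the average in eq 1 is taken over the simulated values.
    y_mean = sum(y) / n
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        # MAPE is undefined if any simulated value is zero.
        "MAPE": 100.0 / n * sum(abs(r) / yi for r, yi in zip(residuals, y)),
        "ME": sum(residuals) / n,
        "RMSE": math.sqrt(ss_res / n),
    }


# Hypothetical uptake values (cm3/g), for illustration only.
simulated = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(simulated, predicted)
print(metrics)
```

ME keeps the sign of each residual, so over- and under-predictions can cancel (as they do in this sample), which is one reason to read it alongside RMSE and MAPE rather than on its own.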


Table 3. Evaluation of Predictive Performance of Mass-Based Methane Uptake Using Only Structural, Only Chemical, and Both Structural and Chemical Predictors

                               R²                             MAPE (%)
predictor type                 DT    SVM   Poisson  RF        DT     SVM    Poisson  RF
structural only (SO)           0.75  0.81  0.84     0.88      23.99  20.63  17.85    13.25
chemical only (CO)             0.34  0.42  0.42     0.65      69.3   66.03  64.86    42.3
structural and chemical (SC)   0.84  0.9   0.92     0.97      21.7   18.57  15.45    8.75
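The chemical-only (CO) predictors compared in Table 3 are simple functions of the unit-cell composition, as defined in Table 2. The following Python sketch illustrates those definitions under stated assumptions and is not the authors' implementation: the function name `chemical_descriptors` and the dictionary keys are hypothetical, the Pauling scale is assumed for the electronegativity weights (the article does not name a scale), and the example input is a Zn4O(BDC)3 formula unit (C24H12O13Zn4) rather than an entry from the hMOF database.

```python
from typing import Dict

# Assumed weights: Pauling electronegativities (the article does not name a scale).
PAULING = {"O": 3.44, "N": 3.04, "F": 3.98, "Cl": 3.16, "Br": 2.96}
METALS = ("V", "Cu", "Zn", "Zr")
HALOGENS = ("F", "Cl", "Br")


def chemical_descriptors(counts: Dict[str, int]) -> Dict[str, float]:
    """Compute Table 2 style descriptors from atom counts per unit cell.

    Assumes the cell contains carbon, oxygen, and at least one metal atom,
    so that none of the ratios divides by zero.
    """
    def n(element: str) -> int:
        return counts.get(element, 0)

    carbons = n("C")
    metals = sum(n(m) for m in METALS)
    halogens = sum(n(x) for x in HALOGENS)
    total_atoms = sum(counts.values())

    # This sketch's reading of footnote a: oxygen is ignored, halogens are
    # treated as hydrogen, and each nitrogen counts as one-half of a carbon.
    unsaturation = (2 * carbons + 2 + n("N") - n("H") - halogens) / 2

    return {
        "total_degree_of_unsaturation": unsaturation,
        "degree_of_unsaturation_per_carbon": unsaturation / carbons,
        "metallic_percentage": 100 * metals / carbons,
        "oxygen_to_metal_ratio": 2 * n("O") / metals,
        "electronegative_atom_ratio": sum(n(e) for e in PAULING) / total_atoms,
        "weighted_electronegativity_per_atom":
            sum(PAULING[e] * n(e) for e in PAULING) / total_atoms,
        "nitrogen_to_oxygen_ratio": n("N") / n("O"),
    }


# Illustrative input only: one Zn4O(BDC)3 formula unit, C24H12O13Zn4.
example = chemical_descriptors({"C": 24, "H": 12, "O": 13, "Zn": 4})
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because every descriptor is derived from atom counts alone, a table of such values can be built directly from crystal structure files without any simulation, which is the property the article relies on when it calls these variables readily available.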

Figure 1. Parity plots for predicted (ML) vs GCMC simulated mass-based methane uptake (cm3/g) using structural and chemical variables applied on (a) DT, (b) Poisson, (c) SVM, and (d) RF models. The red diagonal in each plot is a 45° line indicating perfect correspondence between ML predictions and GCMC simulation results. The color scale indicates the number of counts or the number of hMOFs that had the corresponding GCMC and ML result.
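The color scale in Figure 1 encodes how many hMOFs fall into each (GCMC, ML) cell of the plot. A minimal Python sketch of that binning follows. The function name `parity_counts`, the bin width, and the sample values are hypothetical, and the article does not describe the plotting procedure actually used.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def parity_counts(simulated: Sequence[float], predicted: Sequence[float],
                  bin_width: float) -> Dict[Tuple[int, int], int]:
    """Count (simulated, predicted) pairs per square bin of side bin_width."""
    counts: Counter = Counter()
    for y, u in zip(simulated, predicted):
        cell = (int(y // bin_width), int(u // bin_width))
        counts[cell] += 1
    return dict(counts)


# Hypothetical uptake values (cm3/g), for illustration only.
gcmc = [12.0, 18.0, 25.0, 31.0]
ml = [14.0, 22.0, 27.0, 29.0]
counts = parity_counts(gcmc, ml, bin_width=10.0)
# Cells crossed by the 45-degree line have equal row and column indices.
on_diagonal = sum(c for (i, j), c in counts.items() if i == j)
print(counts, on_diagonal / len(gcmc))
```

A model with good parity concentrates its counts in the cells along the diagonal; a wider spread of occupied off-diagonal cells indicates larger prediction errors.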

$$\mathrm{ME} = \frac{\sum_{i=1}^{n} (y_i - u_i)}{n} \qquad (3)$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - u_i)^2}{n}} \qquad (4)$$

For the RF model, an additional cross validation analysis was performed. To ensure the validity of the results, the k-fold cross-validation technique35 was applied with k = 10. The data were divided into ten parts, referred to as part 1 to part 10. For the first prediction, part 1 was the test data and parts 2−10 were the training data. For the second prediction, part 2 was the test data, and part 1 and parts 3−10 were the training data, and so on. The final results were based on the outcome of all 10 test sets of predictions. The ML process was performed using a PC running the 64-bit version of Windows 10 with an Intel Core i7 CPU.

RESULTS AND DISCUSSIONS

Model performance for predicting methane adsorption was evaluated through calculation of the R² value, mean absolute percentage error (MAPE), mean error (ME), and the root-mean-square error (RMSE). Adsorption capabilities were simulated for environmental conditions of 298 K and 35 bar. The results of model performance in predicting mass-based methane uptake are shown in Table 3, where the R² and MAPE values for each model using different groups of predictors (only structural, only chemical, or combined structural and chemical) are reported. ME and RMSE values are reported in the Supporting Information. Volumetric-based methane prediction results may also be found in the Supporting Information. These results were based on the prediction over a test data set consisting of 119 965 MOFs, or 92% of the whole data set. The training set used only a small fraction of the whole data set (8% = 10 433 MOFs).

Analyzing the results using structural only (SO), chemical only (CO), and combined structural and chemical (SC) predictor types showed that SO predictors performed better than CO predictors in each corresponding model. Using a combination of structural and chemical variables increased the predictive power of every model. Comparative parity plots of the four algorithms with combined structural and chemical features are shown in Figure 1. Also, in going from a relatively simple model (DT) to a more robust method (RF), the quality of the predictions increased. The RF using SC input resulted in the best predictive power among all of the groups.

To validate the RF results, a 10-fold cross validation was carried out on the combined structural and chemical data set. Using the 10-fold cross validation approach resulted in a slight improvement in the R² value for mass-based methane uptake, from 0.97 to 0.98. Similarly, prediction error was reduced from 8.75% to 7.18%. The same trend was seen for volumetric-based methane uptake predictions. Evaluating the RF model through 10-fold cross validation resulted in better values of R² and MAPE, 0.943 and 7.54%, respectively (Table 4 and Supporting Information).

Table 4. R² Coefficients and Error Ranges from Prediction of Methane Uptake by RF Using Combination of Structural and Chemical Variables

                                                     mass-based methane uptake (cm3/g)   volumetric-based methane uptake (cm3/cm3)
prediction method                                    R²     MAPE (%)                     R²     MAPE (%)
RF model prediction using test data set              0.97   8.75                         0.92   9.22
RF model prediction using 10-fold cross validation   0.98   7.18                         0.94   7.54

Additionally, it was possible to take advantage of the RF algorithm to identify the most critical parameters via the variable importance plot, shown in Figure S3. The density, void fraction, surface area, pore diameter, metallic percentage relative to carbon, and degree of unsaturation per carbon played the most important roles in prediction of the target variables.

Principal component analysis (PCA) was also carried out. Results are shown via a scree plot, pair plots, and a table of the loading of each variable in the principal components in Figures S6−S11 and Table S3. According to the scree plot and the eigenvalue criteria, nine principal components captured over 84% of the total variability. Reducing the dimensions from 29 variables in the original model to nine principal components would reduce the computational cost of the model; however, there would be a decline in accuracy of the results. Given the relative speed at which it was possible to carry out the ML training, compromising accuracy was not warranted in this case. In the future, for systems where a significant computational burden might be imposed, PCA provides a worthwhile strategy for making calculations more efficient.

To compare the ML approach to more traditional molecular simulation approaches, a high performance computing cluster was used to predict the methane uptake by GCMC simulation via the RASPA software platform.23 Using the molecular simulation approach for a minimum of 500 Monte Carlo cycles to calculate methane uptake for all the hMOFs took several days. However, when using the ML approach, the process for training and testing the data set consisting of 130 398 hMOFs took about 2 h of “wall” time on a personal computer for all four algorithms combined (DT, Poisson, SVM, and RF). ML proved to be several orders of magnitude faster than molecular simulation alone. This point is raised not to discredit the necessity of molecular simulation methods but rather to illustrate the potency of ML techniques to rapidly reproduce and predict adsorptive capabilities of the MOFs in screening studies.

CONCLUSIONS

Because of the relatively small computational overhead of machine learning methods compared to molecular simulation, coupled with the affordability of molecular simulation relative to experimentation, a cascade of screening methods encompassing all three approaches (machine learning, molecular simulation, and experiments) will likely be the way of the future in screening adsorbent materials. The current work shows that incorporating chemical variables into ML-based materials analysis can greatly enhance predictive accuracy while maintaining high computational speed. ML models based on structural and chemical variables, which are easily retrievable from crystal structure databases, provide reliable predictive power. The comprehensive structural and chemical model using a 10-fold cross-validation approach led to an R² value of 0.98 and a mean absolute percentage error of 7.18%.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acscombsci.7b00056.

Histogram figures of chemical variables, prediction performance for volumetric-based methane uptake, importance table, and principal component analysis (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Steven L. Suib: 0000-0003-3073-311X
Ranjan Srivastava: 0000-0003-4309-605X

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The authors thank Dr. Randy Snurr for providing the hMOFs database, including crystal structure files. The authors also thank the University of Connecticut High Performance Computing Center for providing computational resources.

REFERENCES

(1) Konstas, K.; Osl, T.; Yang, Y.; Batten, M.; Burke, N.; Hill, A. J.; Hill, M. R. Methane Storage in Metal Organic Frameworks. J. Mater. Chem. 2012, 22 (33), 16698−16708.
(2) Pucker, J.; Zwart, R.; Jungmeier, G. Greenhouse Gas and Energy Analysis of Substitute Natural Gas from Biomass for Space Heat. Biomass Bioenergy 2012, 38, 95−101.
(3) He, Y.; Zhou, W.; Qian, G.; Chen, B. Methane Storage in Metal−organic Frameworks. Chem. Soc. Rev. 2014, 43 (16), 5657−5678.
(4) Dubbeldam, D.; Krishna, R.; Calero, S.; Yazaydın, A. Ö. Computer-Assisted Screening of Ordered Crystalline Nanoporous Adsorbents for Separation of Alkane Isomers. Angew. Chem., Int. Ed. 2012, 51 (47), 11867−11871.
(5) Shah, M. S.; Tsapatsis, M.; Siepmann, J. I. Identifying Optimal Zeolitic Sorbents for Sweetening of Highly Sour Natural Gas. Angew. Chem. 2016, 128 (20), 6042−6046.
(6) Simon, C. M.; Kim, J.; Gomez-Gualdron, D. A.; Camp, J. S.; Chung, Y. G.; Martin, R. L.; Mercado, R.; Deem, M. W.; Gunter, D.; Haranczyk, M.; Sholl, D. S.; Snurr, R. Q.; Smit, B. The Materials Genome in Action: Identifying the Performance Limits for Methane Storage. Energy Environ. Sci. 2015, 8 (4), 1190−1199.
(7) Fernandez, M.; Trefiak, N. R.; Woo, T. K. Atomic Property Weighted Radial Distribution Functions Descriptors of Metal−Organic Frameworks for the Prediction of Gas Uptake Capacity. J. Phys. Chem. C 2013, 117 (27), 14095−14105.
(8) Fernandez, M.; Boyd, P. G.; Daff, T. D.; Aghaji, M. Z.; Woo, T. K. Rapid and Accurate Machine Learning Recognition of High Performing Metal Organic Frameworks for CO2 Capture. J. Phys. Chem. Lett. 2014, 5 (17), 3056−3060.
(9) Sezginel, K. B.; Uzun, A.; Keskin, S. Multivariable Linear Models of Structural Parameters to Predict Methane Uptake in Metal−organic Frameworks. Chem. Eng. Sci. 2015, 124, 125−134.
(10) Braun, E.; Zurhelle, A. F.; Thijssen, W.; Schnell, S. K.; Lin, L.-C.; Kim, J.; Thompson, J. A.; Smit, B. High-Throughput Computational Screening of Nanoporous Adsorbents for CO2 Capture from Natural Gas. Mol. Syst. Des. Eng. 2016, 1 (2), 175−188.
(11) Fernandez, M.; Barnard, A. S. Geometrical Properties Can Predict CO2 and N2 Adsorption Performance of Metal−Organic Frameworks (MOFs) at Low Pressure. ACS Comb. Sci. 2016, 18 (5), 243−252.
(12) Simon, C. M.; Kim, J.; Lin, L.-C.; Martin, R. L.; Haranczyk, M.; Smit, B. Optimizing Nanoporous Materials for Gas Storage. Phys. Chem. Chem. Phys. 2014, 16 (12), 5499−5513.
(13) Colón, Y. J.; Snurr, R. Q. High-Throughput Computational Screening of Metal−organic Frameworks. Chem. Soc. Rev. 2014, 43 (16), 5735−5749.
(14) Chung, Y. G.; Camp, J.; Haranczyk, M.; Sikora, B. J.; Bury, W.; Krungleviciute, V.; Yildirim, T.; Farha, O. K.; Sholl, D. S.; Snurr, R. Q. Computation-Ready, Experimental Metal−Organic Frameworks: A Tool To Enable High-Throughput Screening of Nanoporous Crystals. Chem. Mater. 2014, 26 (21), 6185−6192.
(15) Li, J.-R.; Sculley, J.; Zhou, H.-C. Metal−Organic Frameworks for Separations. Chem. Rev. 2012, 112 (2), 869−932.
(16) Lei, J.; Qian, R.; Ling, P.; Cui, L.; Ju, H. Design and Sensing Applications of Metal−organic Framework Composites. TrAC, Trends Anal. Chem. 2014, 58, 71−78.
(17) Frenkel, D.; Smit, B. Understanding Molecular Simulation: From Algorithms to Applications; Elsevier, 2002; Vol. 1.
(18) Düren, T.; Bae, Y.-S.; Snurr, R. Q. Using Molecular Simulation to Characterise Metal−organic Frameworks for Adsorption Applications. Chem. Soc. Rev. 2009, 38 (5), 1237−1247.
(19) Getman, R. B.; Bae, Y.-S.; Wilmer, C. E.; Snurr, R. Q. Review and Analysis of Molecular Simulations of Methane, Hydrogen, and Acetylene Storage in Metal−Organic Frameworks. Chem. Rev. 2012, 112 (2), 703−723.
(20) Guo, Z.; Wu, H.; Srinivas, G.; Zhou, Y.; Xiang, S.; Chen, Z.; Yang, Y.; Zhou, W.; O’Keeffe, M.; Chen, B. A Metal-Organic Framework with Optimized Open Metal Sites and Pore Spaces for High Methane Storage at Room Temperature. Angew. Chem., Int. Ed. 2011, 50 (14), 3178−3181.
(21) Ockwig, N. W.; Delgado-Friedrichs, O.; O’Keeffe, M.; Yaghi, O. M. Reticular Chemistry. Acc. Chem. Res. 2005, 38 (3), 176−182.
(22) Wilmer, C. E.; Leaf, M.; Lee, C. Y.; Farha, O. K.; Hauser, B. G.; Hupp, J. T.; Snurr, R. Q. Large-Scale Screening of Hypothetical Metal−organic Frameworks. Nat. Chem. 2012, 4 (2), 83−89.
(23) Dubbeldam, D.; Calero, S.; Ellis, D. E.; Snurr, R. Q. RASPA: Molecular Simulation Software for Adsorption and Diffusion in Flexible Nanoporous Materials. Mol. Simul. 2016, 42 (2), 81−101.
(24) Gupta, A.; Chempath, S.; Sanborn, M. J.; Clark, L. A.; Snurr, R. Q. Object-Oriented Programming Paradigms for Molecular Modeling. Mol. Simul. 2003, 29 (1), 29−46.
(25) Chempath, S.; Düren, T.; Sarkisov, L.; Snurr, R. Q. Experiences with the Publicly Available Multipurpose Simulation Code, Music. Mol. Simul. 2013, 39 (14−15), 1223−1232.
(26) Martin, R. L.; Simon, C. M.; Smit, B.; Haranczyk, M. In Silico Design of Porous Polymer Networks: High-Throughput Screening for Methane Storage Materials. J. Am. Chem. Soc. 2014, 136 (13), 5006−5022.
(27) Lin, L.-C.; Berger, A. H.; Martin, R. L.; Kim, J.; Swisher, J. A.; Jariwala, K.; Rycroft, C. H.; Bhown, A. S.; Deem, M. W.; Haranczyk, M.; Smit, B. In Silico Screening of Carbon-Capture Materials. Nat. Mater. 2012, 11 (7), 633−641.
(28) Fernandez, M.; Woo, T. K.; Wilmer, C. E.; Snurr, R. Q. Large-Scale Quantitative Structure−Property Relationship (QSPR) Analysis of Methane Storage in Metal−Organic Frameworks. J. Phys. Chem. C 2013, 117 (15), 7681−7689.
(29) Mertens, F. O. Determination of Absolute Adsorption in Highly Ordered Porous Media. Surf. Sci. 2009, 603 (10), 1979−1984.
(30) Reymond, J.-L.; Awale, M. Exploring Chemical Space for Drug Discovery Using the Chemical Universe Database. ACS Chem. Neurosci. 2012, 3 (9), 649−657.
(31) Ash, J.; Fourches, D. Characterizing the Chemical Space of ERK2 Kinase Inhibitors Using Descriptors Computed from Molecular Dynamics Trajectories. J. Chem. Inf. Model. 2017, 57 (6), 1286−1299.
(32) Caruana, R.; Karampatziakis, N.; Yessenalina, A. An Empirical Evaluation of Supervised Learning in High Dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08; ACM Press: New York, New York, USA, 2008; pp 96−103.
(33) Badertscher, M.; Bischofberger, K.; Munk, M. E.; Pretsch, E. A Novel Formalism To Characterize the Degree of Unsaturation of Organic Molecules. J. Chem. Inf. Comput. Sci. 2001, 41 (4), 889−893.
(34) Wu, H.; Zhou, W.; Yildirim, T. High-Capacity Methane Storage in Metal-Organic Frameworks M2(Dhtp): The Important Role of Open Metal Sites. J. Am. Chem. Soc. 2009, 131 (13), 4995−5000.
(35) Larose, D. T.; Larose, C. D. Data Mining and Predictive Analytics, 2nd ed.; John Wiley & Sons, 2015.
(36) Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; Chapman & Hall/CRC Press: Boca Raton, FL, 1984.
(37) Dahan, H.; Cohen, S.; Rokach, L.; Maimon, O. Proactive Data Mining with Decision Trees: Theory and Applications; Series in Machine Perception and Artificial Intelligence; World Scientific, 2014; Vol. 81.
(38) Cameron, A. C.; Trivedi, P. K. Regression Analysis of Count Data, 2nd ed.; Cambridge University Press, 2013.
(39) Burges, C. J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discovery 1998, 2, 121−167.
(40) Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20 (3), 273−297.
(41) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5−32.
(42) Ho, T. K. Random Decision Forests. Proc. Third Int. Conf. 1995, 1, 278−282.