Machine Learning Using Combined Structural and Chemical


Article

Machine learning using combined structural and chemical descriptors for prediction of methane adsorption performance of metal organic frameworks (MOFs)
Maryam Pardakhti, Ehsan Moharreri, David Wanik, Steven L. Suib, and Ranjan Srivastava
ACS Comb. Sci., Just Accepted Manuscript • DOI: 10.1021/acscombsci.7b00056 • Publication Date (Web): 11 Aug 2017
Downloaded from http://pubs.acs.org on August 14, 2017


ACS Combinatorial Science is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


ACS Combinatorial Science

Machine learning using combined structural and chemical descriptors for prediction of methane adsorption performance of metal organic frameworks (MOFs)

Maryam Pardakhti^a, Ehsan Moharreri^b, David Wanik^c, Steven L. Suib^b,d, Ranjan Srivastava^a,*

a Department of Chemical and Biomolecular Engineering, University of Connecticut, Storrs, CT 06269, United States
b Institute of Material Science, University of Connecticut, Storrs, CT 06269, United States
c Department of Civil and Environmental Engineering, University of Connecticut, Storrs, CT 06269, United States
d Department of Chemistry, University of Connecticut, Storrs, CT 06269, United States

*Corresponding Author: [email protected]

KEYWORDS Metal organic frameworks • methane adsorption • machine learning • computational screening • predictive modelling

ACS Paragon Plus Environment


Table of Contents Graphic: parity plot of predicted (ML) versus GCMC simulated mass-based methane uptake [cm3/g], arranged by algorithm (Poisson, decision tree, support vector machine, random forest; towards more effective algorithms) and by descriptor set (only chemical, only structural, structural and chemical; towards higher dimensionality).

ABSTRACT

Using molecular simulation for adsorbent screening is computationally expensive and thus prohibitive to materials discovery. Machine learning (ML) algorithms trained on fundamental material properties can potentially provide quick and accurate methods for screening purposes. Prior efforts have focused on structural descriptors for use with ML. In this work, the use of chemical descriptors, in addition to structural descriptors, was introduced for adsorption analysis. Structural and chemical descriptors were evaluated in combination with various ML algorithms, including decision trees, Poisson regression, support vector machines, and random forests, to predict methane uptake on hypothetical metal organic frameworks. To highlight their predictive capabilities, ML models were trained on 8% of a dataset consisting of 130,398 MOFs and then tested on the remaining 92% to predict methane adsorption capacities. When structural and chemical descriptors were jointly used as ML input, the random forest model with 10-fold cross validation proved to be superior to the other ML approaches, with an R² of 0.98 and a mean absolute percentage error of about 7%. Training and prediction using the random forest algorithm for adsorption capacity estimation of all 130,398 MOFs took approximately two hours on a single personal computer, several orders of magnitude faster than actual molecular simulations on high-performance computing clusters.


INTRODUCTION

Natural gas is an abundant resource widely regarded as a "transition fuel" towards renewable energy. Methane, the predominant component of natural gas, has the highest combustion energy per unit of carbon dioxide of all hydrocarbons (1). Despite this high combustion energy (ΔH298 = −890 kJ mol⁻¹) (2), the volumetric energy density of methane at ambient conditions is only 0.11% of the energy density of gasoline (3). To encourage broader use of natural gas as a transport fuel, adsorbed natural gas (ANG) technology is vigorously pursued by energy and environmental policy-making agencies (3). The most recent target for methane uptake by adsorbent materials set by the U.S. Department of Energy is 350 mL(STP)/mL (assuming packing loss) (3). The goal is to make ANG technology superior to compressed natural gas (CNG) technology by achieving higher capacity at lower pressures, as well as raising the energy density to 30% of that of gasoline. To that end, identifying superior methane adsorbent materials becomes important.

The search for the optimal adsorbent requires aggressive screening of a variety of porous materials. Computer-assisted screening is an emerging field that significantly increases the discovery rate of advanced materials (4,5). Metal organic frameworks (MOFs) are particularly amenable to computer-assisted high-throughput screening (6–14). MOFs possess unique characteristics such as regularity, variety, and designability (15). These features facilitate simulation of the physicochemical properties of MOFs. Through a reticular synthesis approach, MOFs are rationally designed by engineering ligand size, organic linker size, metal containing cluster composition, and synthesis conditions (15,16). To determine various fundamental attributes of MOFs, such as substrate-adsorbent interactions, force-field-based molecular simulation methods have been used (17–20). Systematic synthesis of nodes and linkers as building blocks and high-throughput screening of hypothetical MOFs by grand canonical Monte Carlo (GCMC) simulations have proven to be an effective approach (21,22). One of the first molecular simulation packages specifically designed for MOFs was the RASPA software platform (23). RASPA is an evolution of the more general purpose MUltipurpose SImulation Code (MUSIC) developed by the Snurr group (24,25).

Due to the vast number of possible MOF structures, performing molecular simulations as a screening tool is computationally prohibitive. Various approaches to overcome this barrier have been pursued. Simon et al. (2015) used a comprehensive database of various adsorbent materials and performed GCMC simulations to examine the relationship between structural properties and adsorption (6). The dataset included zeolites (12), hypothetical MOFs (hMOFs) (22), porous polymer networks (PPNs) (26), hypothetical zeolitic imidazolate frameworks (hZIFs) (27), and computation-ready experimental (CoRE) MOF (14) structures. They suggested that, due to the high dimensionality of the data, machine learning approaches could help extract complicated relationships and yield accurate predictions (6). Fernandez et al. (2013) developed regression models using radial distribution functions as predictor variables to predict CO2 and N2 uptake (7). In 2014, Fernandez et al. used structural variables, such as void fraction and pore size, to predict CH4 uptake and achieved an R² of 0.85 (28). In another work, Fernandez et al. (2014) applied a classification approach based upon quantitative structure-property relationships (QSPR) to predict top performing MOFs for CO2 adsorption, achieving 94.5% accuracy (8). Sezginel et al. (2015) used structural properties such as surface area, crystal density, void fraction, and pore diameter, as well as the isosteric heat of adsorption (Qst), to predict CH4 uptake in a set of adsorbent MOFs. The heat of adsorption provides information not discernible from structural properties. However, the number of MOFs for which the isosteric heat of adsorption was available was limited (9). Fernandez et al. (2016) tried a variety of machine learning techniques and variables, reaching an R² of 0.94 in predicting CO2 uptake (11).

However, structural features do not provide sufficient information to characterize adsorption properties fully. For methane uptake, which is the focus of this work, it is important to construct a comprehensive model that accounts for chemical interactions, particularly at higher levels of adsorption. Several studies emphasize adsorbent features that cannot be fully captured by structural variables alone (9,10,29). Chemical composition variables, such as type and number of atoms, have been used for machine learning, albeit in drug discovery (30). In a recent study, Ash et al. (2017) applied a variety of molecular dynamics (MD) chemical descriptor sets for molecular biological analysis (31). Expanding the descriptor set helps extract more information about the subject, thus improving the machine learning models. Due to the overemphasis on physisorption in the literature, chemical variables have been overlooked in data mining studies concerning adsorbents. In the current work, the use of chemical descriptors, which are available for MOF structures, was introduced and explored. There is empirical evidence that certain machine learning algorithms perform robustly on high dimensional data (32). Although estimation of fundamental chemical variables solely from composition requires quantum or thermodynamic calculations, it is possible to use machine learning approaches to circumvent cumbersome computations. To explore this possibility, we introduced a comprehensive set of chemical composition features along with structural variables, powered by robust machine learning models, to capture the underlying intricacies of adsorption phenomena.


DATASET and PREDICTORS

The data used to train and validate the machine learning algorithms were taken from the database of hypothetical MOFs provided by the Snurr group and available at http://hmofs.northwestern.edu (22). In total, 130,398 hMOFs were extracted from the database. Characteristics associated with the MOFs and available from the database include both volumetric-based uptake (cm3 methane/cm3 hMOF) and mass-based uptake of methane (cm3 methane/g hMOF) at 35 bar and 298 K. In addition, physical features and crystal structures are provided for the hMOFs. Methane uptake was calculated by the GCMC simulation method (35 bar and 298 K). The crystal structures were designed and produced by the method of Wilmer et al. (2012) (22).

The predictors used for structural and chemical properties in this work are listed in Tables 1 and 2, respectively. Structural properties in Table 1 include surface area, density, and void fraction, with values ranging from 0 to 6,947 m2 g-1, 0.118 to 4.042 g cm-3, and 0.051 to 0.967, respectively. The surface area and void fraction were calculated by Monte Carlo molecular simulation of nitrogen and helium adsorption, respectively. Chemical predictors, shown in Table 2, were introduced in this work and extracted from the crystal structures. They included the type and number of each atom, degree of unsaturation (33), metal to carbon ratio, halogen to carbon ratio, nitrogen to oxygen ratio, and degree of electronegativity.

Each atom in the MOF structure has an important role in the adsorption process. The metals in the dataset consisted of copper, vanadium, zirconium, and zinc. The effects of the metals in each hMOF unit cell on the training process were incorporated in three ways. First, the type of metal present was explicitly accounted for as a categorical variable. Next, the total number of atoms of each metal affecting methane adsorption was included as a quantitative variable. Finally, the percentage of metal relative to carbon, called the metallic percentage, was introduced as a chemical variable to model the effect of the metals on methane uptake. The metallic percentage was motivated by the fact that the location of metal atoms relative to open sites significantly affects the adsorption yield of MOFs (34). Wu et al. (2009) examined methane adsorption on five different MOFs and showed that methane molecules close to open metal sites experience stronger attraction than those at other adsorption sites.

Although pore volume was not explicitly incorporated into the analysis, it was implicitly accounted for via its impact on the density and void fraction values. Similarly, the total number of atoms per unit cell was not explicitly included in the model. Instead, the number of atoms of each species present per unit cell was accounted for. All numerical predictors were normalized to the range [-1, 1], such that the smallest and largest values of each predictor were mapped to -1 and 1, respectively.

Table 1. Parameters characterizing MOF structural properties.

variable                                 min.    median   mean    max.
void fraction                            0.05    0.69     0.65    0.97
surface area [m2/g]                      0.00    2703     2740    6947
density [g/cm3]                          0.12    0.79     0.86    4.04
dominant pore diameter                   0.00    6.75     7.77    24.75
maximum pore diameter                    0.00    8.25     9.34    24.75
interpenetration capacity                1.00    2.00     2.09    4.00
number of interpenetrating frameworks    1.00    1.00     1.51    4.00
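The normalization of numerical predictors to [-1, 1] described above can be illustrated with a minimal sketch. The function name `scale_to_unit_interval` is hypothetical, and the handling of a constant column (mapped to 0) is an assumption, since the text does not say how such a case was treated. The example values are the minimum, median, and maximum void fraction from Table 1.

```python
def scale_to_unit_interval(values):
    """Linearly map values so that min -> -1 and max -> 1.

    Assumption: a constant column is mapped to 0.0; the paper does not
    state how such a column would be handled.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Void fraction: min., median, and max. taken from Table 1.
void_fraction = [0.05, 0.69, 0.97]
scaled = scale_to_unit_interval(void_fraction)
print(scaled)
```

Because the mapping is linear, the ordering and relative spacing of the MOFs along each predictor are preserved; only the scale changes.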


Table 2. Parameters characterizing MOF chemical properties.

variable                                                       min.   median   mean    max.   note
hydrogen (H)                                                   0      25       36.12   594    number of hydrogen atoms per unit cell
carbon (C)                                                     6      53       68.28   606    number of carbon atoms per unit cell
nitrogen (N)                                                   0      3        7.103   216    number of nitrogen atoms per unit cell
oxygen (O)                                                     8      16       21.39   216    number of oxygen atoms per unit cell
fluorine (F)                                                   0      0        2.07    120    number of fluorine atoms per unit cell
chlorine (Cl)                                                  0      0        1.86    112    number of chlorine atoms per unit cell
bromine (Br)                                                   0      0        1.79    108    number of bromine atoms per unit cell
vanadium (V)                                                   0      0        0.31    12     number of vanadium atoms per unit cell
copper (Cu)                                                    0      0        0.89    24     number of copper atoms per unit cell
zinc (Zn)                                                      0      2        3.47    24     number of zinc atoms per unit cell
zirconium (Zr)                                                 0      0        0.23    12     number of zirconium atoms per unit cell
metal type                                                     -      -        -       -      categorical variable: V, Cu, Zn, and Zr
total degree of unsaturation                                   6      39       51      565    {[(number of carbons × 2) + 2 - number of hydrogens] / 2} *
degree of unsaturation per carbon                              0.19   0.77     0.77    1.36   total degree of unsaturation / number of carbons
metallic percentage [%]                                        0.85   7        8.35    50.0   [number of metal atoms / number of carbon atoms] × 100
oxygen to metal ratio (surrogate of average oxidation state)   6.5    8        9.61    51     [2 × number of oxygen atoms] / total number of metal atoms
electronegative atoms to total atoms ratio                     0.35   0.27     0.26    0.62   [number of electronegative atoms] / [total number of atoms]
weighted electronegativity per atom                            0.12   0.89     0.86    2.15   [sum of weighted** electronegative atoms] / [total number of atoms]
nitrogen to oxygen ratio                                       0.00   0.25     0.41    6.50   number of nitrogen atoms / number of oxygen atoms

* For other elements: oxygens are ignored; halides (F, Cl, Br, I) are treated as hydrogen and nitrogen is counted as one half of carbon.
** Electronegative atoms: O, N, F, Cl, Br weighted by electronegativity.
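Several of the chemical predictors in Table 2 follow directly from the atom counts per unit cell. The sketch below is one possible reading of the formulas in the note column and the first footnote, not the authors' code: the function name `chemical_descriptors`, the dictionary keys, and the example composition are all hypothetical, and the function assumes at least one carbon, one oxygen, and one metal atom per unit cell.

```python
METALS = ("V", "Cu", "Zn", "Zr")
HALOGENS = ("F", "Cl", "Br")


def chemical_descriptors(counts):
    """Derive composition-based predictors from atom counts per unit cell.

    `counts` maps element symbols to integer counts. Assumes C, O, and the
    total metal count are all nonzero (no guard against division by zero).
    """
    c = counts.get("C", 0)
    h = counts.get("H", 0)
    n = counts.get("N", 0)
    o = counts.get("O", 0)
    halogens = sum(counts.get(x, 0) for x in HALOGENS)
    metals = sum(counts.get(m, 0) for m in METALS)

    # Footnote reading (an interpretation): halides count as hydrogen,
    # nitrogen counts as half a carbon, oxygen is ignored.
    effective_c = c + n / 2
    effective_h = h + halogens
    dou = (2 * effective_c + 2 - effective_h) / 2

    return {
        "total_degree_of_unsaturation": dou,
        "degree_of_unsaturation_per_carbon": dou / c,
        "metallic_percentage": 100 * metals / c,
        "oxygen_to_metal_ratio": 2 * o / metals,
        "nitrogen_to_oxygen_ratio": n / o,
    }


# Made-up unit cell composition, for illustration only.
example = chemical_descriptors({"C": 48, "H": 24, "O": 26, "Zn": 8})
print(example)
```

For this made-up composition the total degree of unsaturation is 37, the metallic percentage is about 16.7%, and the oxygen to metal ratio is 6.5.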


For this study, 8% of the total hMOFs were used to train the ML algorithms. In many real simulation cases for materials screening, computational cost limits the generation of larger datasets for training. Furthermore, training on a larger dataset does not necessarily increase accuracy significantly relative to using a smaller one. As a result, the dataset was randomly divided into a training set consisting of 8% of the data and a test set consisting of the other 92%. The choice of 8% for training was made to be consistent with the dataset analysis of Simon et al. (2015), who also selected about 8% of the hMOF data for training (6). It should be noted that when larger fractions of the dataset were used for training (up to 75%), results were still on par with the 8% analysis (data not shown).

METHODS and MODELS

Four algorithms were evaluated in this work to compare how different models predict the data. The first was the decision tree (DT) algorithm, also referred to as a classification tree. Starting from a single pass, the method uses "if-then" logic to train a layer of classifications. Each layer applies predictors affecting the decision-making process for each consequence. Layers continue to form until the best fit is reached (35–37). Poisson regression is a generalized linear model that allows for direct interpretation of the coefficients associated with the model (38). The support vector machine (SVM) approach uses hyperplane-based classification, constructing hyperplanes with maximum separation margins and accommodating nonlinear kernels. It has been one of the most widely used classification techniques (39,40). For the studies described here, a linear kernel with a tolerance of 0.001 was used. The random forest (RF) approach (41) is a supervised learning method that extends the DT algorithm. The forest consists of an ensemble of decision trees, and the final result is the average of the predictions from all of the decision trees. While the RF method is more complex than the DT algorithm, it is also more robust (41,42). The random forest used here consisted of 250 trees.

Each algorithm was trained on the training data for methane uptake and then used to predict methane uptake for the test dataset. For both volumetric-based and mass-based uptake, three sets of variables were compared to evaluate predictive capability: structural only (SO) variables, chemical only (CO) variables, and structural and chemical (SC) variables together.

ALGORITHM EVALUATION

Performance of each ML algorithm was evaluated by calculating the R² value, the mean absolute percentage error (MAPE), the mean error (ME), and the root mean square error (RMSE). The R² value was calculated using equation 1, where n, y_i, u_i, and \bar{u} are the number of MOFs, the simulated methane uptake, the predicted methane uptake, and the average methane uptake, respectively.

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - u_i)^2}{\sum_{i=1}^{n} (y_i - \bar{u})^2} \tag{1} \]

The various error values for each algorithm were calculated using equations 2 to 4.

\[ \mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - u_i}{y_i} \right| \tag{2} \]

\[ \mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} (y_i - u_i) \tag{3} \]

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - u_i)^2} \tag{4} \]
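The four evaluation metrics can be written out directly from equations 1 to 4. The following is a minimal sketch rather than the code used in the study: the function names are hypothetical, `y` holds the simulated (GCMC) uptake values and `u` the predicted (ML) values, and the average in the R² denominator is assumed to be the mean of the simulated values, which is the usual convention for the coefficient of determination. MAPE assumes every simulated value is nonzero.

```python
import math


def r_squared(y, u):
    """Equation 1. Assumption: the average is taken over the simulated values y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - ui) ** 2 for yi, ui in zip(y, u))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def mape(y, u):
    """Equation 2, in percent. Requires every y_i to be nonzero."""
    return 100.0 / len(y) * sum(abs((yi - ui) / yi) for yi, ui in zip(y, u))


def mean_error(y, u):
    """Equation 3: signed mean of (simulated - predicted)."""
    return sum(yi - ui for yi, ui in zip(y, u)) / len(y)


def rmse(y, u):
    """Equation 4: root mean square error."""
    return math.sqrt(sum((yi - ui) ** 2 for yi, ui in zip(y, u)) / len(y))


# Made-up uptake values, for illustration only.
simulated = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(r_squared(simulated, predicted), mape(simulated, predicted),
      mean_error(simulated, predicted), rmse(simulated, predicted))
```

ME keeps its sign, so a negative value indicates that the model overpredicts on average, whereas MAPE and RMSE measure the size of the error regardless of direction.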


For the RF model, an additional cross validation analysis was performed. To ensure the validity of the results, the k-fold cross validation technique (35) was applied with k = 10. For the 10-fold cross validation used in this work, the data were divided into ten parts, referred to as part 1 to part 10. For the first prediction, part 1 was the test data and parts 2-10 were the training data. For the second prediction, part 2 was the test data, and part 1 and parts 3-10 were the training data, and so on. The final results were based on the outcome of all 10 test sets of predictions. The ML process was performed on a PC running the 64-bit version of Windows 10 with an Intel Core i7 CPU.

RESULTS and DISCUSSIONS

Model performance for predicting methane adsorption was evaluated through calculation of the R² value, mean absolute percentage error (MAPE), mean error (ME), and root mean square error (RMSE). Adsorption capabilities were simulated for conditions of 298 K and 35 bar. The results of model performance in predicting mass-based methane uptake are shown in Table 3, where the R² and MAPE values for each model using the different groups of predictors (only structural, only chemical, or combined structural and chemical) are reported.

Table 3. Evaluation of predictive performance of mass-based methane uptake using only structural, only chemical, and both structural and chemical predictors.

                                R²                                 MAPE (%)
predictor type                  DT     SVM    Poisson   RF         DT      SVM     Poisson   RF
structural only (SO)            0.75   0.81   0.84      0.88       23.99   20.63   17.85     13.25
chemical only (CO)              0.34   0.42   0.42      0.65       69.3    66.03   64.86     42.3
structural and chemical (SC)    0.84   0.9    0.92      0.97       21.7    18.57   15.45     8.75
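The values in Table 3 come from models trained on a random 8% of the hMOFs and tested on the remaining 92%. A minimal sketch of such a random split over row indices is given below. The function name `split_train_test`, the fixed seed, and the rounding rule are assumptions; the rounding the authors applied is not stated, so the subset sizes produced here may differ from theirs by a structure or so.

```python
import random


def split_train_test(n_items, train_fraction=0.08, seed=0):
    """Randomly partition the indices 0..n_items-1 into train and test lists.

    The seed and the use of round() are illustrative assumptions.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


# 130,398 is the number of hMOFs in the dataset.
train_idx, test_idx = split_train_test(130398)
print(len(train_idx), len(test_idx))
```

Each model would then be fitted on the rows selected by `train_idx` and scored on the rows selected by `test_idx`, once for each predictor set (SO, CO, SC).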


ME and RMSE values are reported in the Supplementary Information file. Volumetric-based methane prediction results may also be found in the Supplementary Information file. These results were based on predictions over a test dataset consisting of 119,965 MOFs, or 92% of the whole dataset. The training set used only a small fraction of the whole dataset (8% = 10,433 MOFs).

Table 4. R² coefficients and error ranges from prediction of methane uptake by RF using a combination of structural and chemical variables.

                                                      Mass-based methane       Volumetric-based methane
                                                      uptake (cm3/g)           uptake (cm3/cm3)
Prediction method                                     R²      MAPE             R²      MAPE
RF model prediction using test dataset                0.97    8.75%            0.92    9.22%
RF model prediction using 10-fold cross validation    0.98    7.18%            0.94    7.54%
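The second row of Table 4 relies on the 10-fold cross validation procedure described earlier, in which each of ten parts serves once as the test data while the other nine form the training data. The partitioning step can be sketched as follows. The generator name `k_fold_indices` is hypothetical, and the sketch assigns contiguous blocks of indices to folds without shuffling, which is an assumption; in practice the rows would normally be shuffled first.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross validation.

    Folds are contiguous blocks; the first n_items % k folds receive one
    extra item so that every index is tested exactly once.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Small illustrative case: 25 items split into 10 folds.
splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Pooling the predictions from all ten test parts gives one prediction per structure, from which R² and MAPE can be computed over the whole dataset.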

Analyzing the results using structural only (SO), chemical only (CO), and combined structural and chemical (SC) predictor types showed that SO predictors performed better than CO predictors in each corresponding model. Using a combination of structural and chemical variables increased the predictive power of every model. Comparative parity plots of the four algorithms with combined structural and chemical features are shown in Figure 1. Also, in going from a relatively simple model (DT) to a more robust method (RF), the quality of the predictions increased. The RF using SC input resulted in the best predictive power among all of the groups.

To validate the RF results, a 10-fold cross validation was carried out on the combined structural and chemical dataset. Using the 10-fold cross validation approach resulted in a slight improvement in the R² value for mass-based methane uptake, from 0.97 to 0.98. Similarly, prediction error was reduced from 8.75% to 7.18%. The same trend was seen for volumetric-based methane uptake predictions: 10-fold cross validation of the RF model resulted in better values of R² and MAPE, 0.943 and 7.54% respectively (Table 4 and Supplementary Information file).

[Figure 1: four parity plots, panels a) to d). Horizontal axis: GCMC simulated mass-based methane uptake (cm3/g). Vertical axis: predicted (ML) mass-based methane uptake (cm3/g).]

Figure 1. Parity plots for predicted (ML) vs. GCMC simulated mass-based methane uptake (cm3/g) using structural and chemical variables applied on a) DT, b) Poisson, c) SVM, and d) RF models. The red diagonal in each plot is a 45° line indicating perfect correspondence between ML predictions and GCMC simulation results. The color scale indicates the number of counts, that is, the number of hMOFs that had the corresponding GCMC and ML result.

Additionally, it was possible to take advantage of the RF algorithm to identify the most critical parameters via the variable importance plot, shown in Figure S3 in the Supplementary Information file. The density, void fraction, surface area, pore diameter, metallic percentage relative to carbon, and degree of unsaturation per carbon played the most important roles in prediction of the target variables.

Principal component analysis (PCA) was also carried out. Results are shown via a scree plot, pair plots, and a table of the loadings of each variable in the principal components in Figures S6 – S11 and Table S3. According to the scree plot and the eigenvalue criterion, nine principal components captured over 84% of the total variability. Reducing the dimensions from 29 variables in the original model to nine principal components would reduce the computational cost of the model; however, there would be a decline in the accuracy of the results. Given the relative speed at which it was possible to carry out the ML training, compromising accuracy was not warranted in this case. In the future, for systems where a significant computational burden might be imposed, PCA provides a worthwhile strategy for making calculations more efficient.

To compare the ML approach to more traditional molecular simulation approaches, a high performance computing cluster was used to predict the methane uptake by GCMC simulation via the RASPA software platform (23). Using the molecular simulation approach with a minimum of 500 Monte Carlo cycles to calculate methane uptake for all the hMOFs took several days. However, when using the ML approach, training and testing on the dataset of 130,398 hMOFs took about two hours of "wall" time on a personal computer for all four algorithms combined (DT, Poisson, SVM, and RF). ML proved to be several orders of magnitude faster than molecular simulation alone. This point is raised not to discredit the necessity of molecular simulation methods, but rather to illustrate the potency of ML techniques to rapidly reproduce and predict adsorptive capabilities of the MOFs in screening studies.

CONCLUSIONS


Due to the relatively small computational overhead of machine learning methods compared to molecular simulation, coupled with the affordability of molecular simulation relative to experimentation, a cascade of screening methods encompassing all three approaches (machine learning, molecular simulation, and experiments) will likely be the way of the future in screening adsorbent materials. The current work shows that incorporating chemical variables into ML-based materials analysis can greatly enhance predictive accuracy while maintaining high computational speed. The structural and chemical variables on which the ML models are based are easily retrievable from crystal structure databases, and the resulting models provide reliable predictive power. The comprehensive structural and chemical model using a 10-fold cross validation approach led to an R² value of 0.98 and a mean absolute percentage error of 7.18%.

SUPPORTING INFORMATION

Histogram figures of chemical variables, prediction performance for volumetric-based methane uptake, importance table, and principal component analysis.

ACKNOWLEDGEMENTS

The authors thank Dr. Randy Snurr for providing the hMOFs database, including crystal structure files. The authors also thank the University of Connecticut High Performance Computing Center for providing computational resources. REFERENCES (1)

Konstas, K.; Osl, T.; Yang, Y.; Batten, M.; Burke, N.; Hill, A. J.; Hill, M. R. Methane Storage in Metal Organic Frameworks. J. Mater. Chem. 2012, 22 (33), 16698–16708.

(2)

Pucker, J.; Zwart, R.; Jungmeier, G. Greenhouse Gas and Energy Analysis of Substitute Natural Gas from Biomass for Space Heat. Biomass

ACS Paragon Plus Environment

15

ACS Combinatorial Science

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 18

and Bioenergy 2012, 38, 95–101.

(3) He, Y.; Zhou, W.; Qian, G.; Chen, B. Methane Storage in Metal–organic Frameworks. Chem. Soc. Rev. 2014, 43 (16), 5657–5678.

(4) Dubbeldam, D.; Krishna, R.; Calero, S.; Yazaydın, A. Ö. Computer-Assisted Screening of Ordered Crystalline Nanoporous Adsorbents for Separation of Alkane Isomers. Angew. Chemie Int. Ed. 2012, 51 (47), 11867–11871.

(5) Shah, M. S.; Tsapatsis, M.; Siepmann, J. I. Identifying Optimal Zeolitic Sorbents for Sweetening of Highly Sour Natural Gas. Angew. Chemie 2016, 128 (20), 6042–6046.

(6) Simon, C. M.; Kim, J.; Gomez-Gualdron, D. A.; Camp, J. S.; Chung, Y. G.; Martin, R. L.; Mercado, R.; Deem, M. W.; Gunter, D.; Haranczyk, M.; Sholl, D. S.; Snurr, R. Q.; Smit, B. The Materials Genome in Action: Identifying the Performance Limits for Methane Storage. Energy Environ. Sci. 2015, 8 (4), 1190–1199.

(7) Fernandez, M.; Trefiak, N. R.; Woo, T. K. Atomic Property Weighted Radial Distribution Functions Descriptors of Metal–Organic Frameworks for the Prediction of Gas Uptake Capacity. J. Phys. Chem. C 2013, 117 (27), 14095–14105.

(8) Fernandez, M.; Boyd, P. G.; Daff, T. D.; Aghaji, M. Z.; Woo, T. K. Rapid and Accurate Machine Learning Recognition of High Performing Metal Organic Frameworks for CO2 Capture. J. Phys. Chem. Lett. 2014, 5 (17), 3056–3060.

(9) Sezginel, K. B.; Uzun, A.; Keskin, S. Multivariable Linear Models of Structural Parameters to Predict Methane Uptake in Metal–organic Frameworks. Chem. Eng. Sci. 2015, 124, 125–134.

(10) Braun, E.; Zurhelle, A. F.; Thijssen, W.; Schnell, S. K.; Lin, L.-C.; Kim, J.; Thompson, J. A.; Smit, B. High-Throughput Computational Screening of Nanoporous Adsorbents for CO2 Capture from Natural Gas. Mol. Syst. Des. Eng. 2016, 1 (2), 175–188.

(11) Fernandez, M.; Barnard, A. S. Geometrical Properties Can Predict CO2 and N2 Adsorption Performance of Metal–Organic Frameworks (MOFs) at Low Pressure. ACS Comb. Sci. 2016, 18 (5), 243–252.

(12) Simon, C. M.; Kim, J.; Lin, L.-C.; Martin, R. L.; Haranczyk, M.; Smit, B. Optimizing Nanoporous Materials for Gas Storage. Phys. Chem. Chem. Phys. 2014, 16 (12), 5499–5513.

(13) Colón, Y. J.; Snurr, R. Q. High-Throughput Computational Screening of Metal–organic Frameworks. Chem. Soc. Rev. 2014, 43 (16), 5735–5749.

(14) Chung, Y. G.; Camp, J.; Haranczyk, M.; Sikora, B. J.; Bury, W.; Krungleviciute, V.; Yildirim, T.; Farha, O. K.; Sholl, D. S.; Snurr, R. Q. Computation-Ready, Experimental Metal−Organic Frameworks: A Tool To Enable High-Throughput Screening of Nanoporous Crystals. Chem. Mater. 2014, 26 (21), 6185–6192.

(15) Li, J.-R.; Sculley, J.; Zhou, H.-C. Metal–Organic Frameworks for Separations. Chem. Rev. 2012, 112 (2), 869–932.

(16) Lei, J.; Qian, R.; Ling, P.; Cui, L.; Ju, H. Design and Sensing Applications of Metal–organic Framework Composites. TrAC Trends Anal. Chem. 2014, 58, 71–78.

(17) Frenkel, D.; Smit, B. Understanding Molecular Simulation: From Algorithms to Applications; Elsevier (formerly published by Academic Press), 2002; Vol. 1.


(18) Düren, T.; Bae, Y.-S.; Snurr, R. Q. Using Molecular Simulation to Characterise Metal–organic Frameworks for Adsorption Applications. Chem. Soc. Rev. 2009, 38 (5), 1237–1247.

(19) Getman, R. B.; Bae, Y.-S.; Wilmer, C. E.; Snurr, R. Q. Review and Analysis of Molecular Simulations of Methane, Hydrogen, and Acetylene Storage in Metal−Organic Frameworks. Chem. Rev. 2011, 112 (2), 703–723.

(20) Guo, Z.; Wu, H.; Srinivas, G.; Zhou, Y.; Xiang, S.; Chen, Z.; Yang, Y.; Zhou, W.; O’Keeffe, M.; Chen, B. A Metal-Organic Framework with Optimized Open Metal Sites and Pore Spaces for High Methane Storage at Room Temperature. Angew. Chemie Int. Ed. 2011, 50 (14), 3178–3181.

(21) Ockwig, N. W.; Delgado-Friedrichs, O.; O’Keeffe, M.; Yaghi, O. M. Reticular Chemistry: Occurrence and Taxonomy of Nets and Grammar for the Design of Frameworks. Acc. Chem. Res. 2005, 38 (3), 176–182.

(22) Wilmer, C. E.; Leaf, M.; Lee, C. Y.; Farha, O. K.; Hauser, B. G.; Hupp, J. T.; Snurr, R. Q. Large-Scale Screening of Hypothetical Metal–organic Frameworks. Nat. Chem. 2012, 4 (2), 83–89.

(23) Dubbeldam, D.; Calero, S.; Ellis, D. E.; Snurr, R. Q. RASPA: Molecular Simulation Software for Adsorption and Diffusion in Flexible Nanoporous Materials. Mol. Simul. 2016, 42 (2), 81–101.

(24) Gupta, A.; Chempath, S.; Sanborn, M. J.; Clark, L. A.; Snurr, R. Q. Object-Oriented Programming Paradigms for Molecular Modeling. Mol. Simul. 2003, 29 (1), 29–46.

(25) Chempath, S.; Düren, T.; Sarkisov, L.; Snurr, R. Q. Experiences with the Publicly Available Multipurpose Simulation Code, Music. Mol. Simul. 2013, 39 (14–15), 1223–1232.

(26) Martin, R. L.; Simon, C. M.; Smit, B.; Haranczyk, M. In Silico Design of Porous Polymer Networks: High-Throughput Screening for Methane Storage Materials. J. Am. Chem. Soc. 2014, 136 (13), 5006–5022.

(27) Lin, L.-C.; Berger, A. H.; Martin, R. L.; Kim, J.; Swisher, J. A.; Jariwala, K.; Rycroft, C. H.; Bhown, A. S.; Deem, M. W.; Haranczyk, M.; Smit, B. In Silico Screening of Carbon-Capture Materials. Nat. Mater. 2012, 11 (7), 633–641.

(28) Fernandez, M.; Woo, T. K.; Wilmer, C. E.; Snurr, R. Q. Large-Scale Quantitative Structure–Property Relationship (QSPR) Analysis of Methane Storage in Metal–Organic Frameworks. J. Phys. Chem. C 2013, 117 (15), 7681–7689.

(29) Mertens, F. O. Determination of Absolute Adsorption in Highly Ordered Porous Media. Surf. Sci. 2009, 603 (10), 1979–1984.

(30) Reymond, J.-L.; Awale, M. Exploring Chemical Space for Drug Discovery Using the Chemical Universe Database. ACS Chem. Neurosci. 2012, 3 (9), 649–657.

(31) Ash, J.; Fourches, D. Characterizing the Chemical Space of ERK2 Kinase Inhibitors Using Descriptors Computed from Molecular Dynamics Trajectories. J. Chem. Inf. Model. 2017, 57 (6), 1286–1299.

(32) Caruana, R.; Karampatziakis, N.; Yessenalina, A. An Empirical Evaluation of Supervised Learning in High Dimensions. In Proceedings of the 25th international conference on Machine learning - ICML ’08; ACM Press: New York, New York, USA, 2008; pp 96–103.


(33) Badertscher, M.; Bischofberger, K.; Munk, M. E.; Pretsch, E. A Novel Formalism To Characterize the Degree of Unsaturation of Organic Molecules. J. Chem. Inf. Comput. Sci. 2001, 41 (4), 889–893.

(34) Wu, H.; Zhou, W.; Yildirim, T. High-Capacity Methane Storage in Metal-Organic Frameworks M2(dhtp): The Important Role of Open Metal Sites. J. Am. Chem. Soc. 2009, 131 (13), 4995–5000.

(35) Larose, D. T.; Larose, C. D. Data Mining and Predictive Analytics, 2nd ed.; John Wiley & Sons, 2015.

(36) Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; Chapman & Hall/CRC Press: Boca Raton, FL, 1984.

(37) Dahan, H.; Cohen, S.; Rokach, L.; Maimon, O. Proactive Data Mining with Decision Trees: Theory and Applications; Series in Machine Perception and Artificial Intelligence; World Scientific, 2014; Vol. 81.

(38) Cameron, A. C.; Trivedi, P. K. Regression Analysis of Count Data, 2nd ed.; Cambridge University Press, 2013.

(39) Burges, C. J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov. 1998, 2, 121–167.

(40) Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20 (3), 273–297.

(41) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.

(42) Ho, T. K. Random Decision Forests. Proc. Third Int. Conf. 1995, 1, 278–282.
