A Robust Machine Learning Algorithm for the Prediction of Methane

Jun 7, 2019 - A Robust Machine Learning Algorithm for the Prediction of Methane Adsorption in Nanoporous Materials .... methane adsorption capacities ...
0 downloads 0 Views 2MB Size
Article Cite This: J. Phys. Chem. A 2019, 123, 6080−6087

pubs.acs.org/JPCA

A Robust Machine Learning Algorithm for the Prediction of Methane Adsorption in Nanoporous Materials George S. Fanourgakis,*,† Konstantinos Gkagkas,‡ Emmanuel Tylianakis,¶ Emmanuel Klontzas,†,§ and George Froudakis*,†

Downloaded via NOTTINGHAM TRENT UNIV on July 22, 2019 at 08:22:34 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.



Department of Chemistry and ¶Department of Materials Science and Technology, University of Crete, Voutes Campus, GR-70013 Heraklion, Crete, Greece ‡ Advanced Technology Division, Toyota Motor Europe NV/SA, Technical Center, Hoge Wei 33B, 1930 Zaventem, Belgium § Theoretical and Physical Chemistry Institute, National Hellenic Research Foundation, Vass. Constantinou 48, GR-11635 Athens, Greece S Supporting Information *

ABSTRACT: In the present study, we propose a new set of descriptors that, along with a few structural features of nanoporous materials, can be used by machine learning algorithms for accurate predictions of the gas uptake capacities of these materials. All new descriptors closely resemble the helium atom void fraction of the material framework. However, instead of a helium atom, a particle with an appropriately defined van der Waals radius is used. The set of void fractions of a small number of these particles is found to be sufficient to characterize uniquely the structure of each material and to account for the most important topological features. We assess the accuracy of our approach by examining the predictions of the random forest algorithm in the relative small dataset of the computation-ready, experimental (CoRE) MOFs (∼4700 structures) that have been experimentally synthesized and whose geometrical/structural features have been accurately calculated before. We first performed grand canonical Monte Carlo simulations to accurately determine their methane uptake capacities at two different temperatures (280 and 298 K) and three different pressures (1, 5.8, and 65 bar). Despite the high chemical and structural diversity of the CoRE MOFs, it was found that the use of the proposed descriptors significantly improves the accuracy of the machine learning algorithm, particularly at low pressures, compared to the predictions made based solely on the rest structural features. More importantly, the algorithm can be easily adapted for other types of nanoporous materials beyond MOFs. Convergence of the predictions was reached even for small training set sizes compared to what was found in previous works using the hypothetical MOF database.



INTRODUCTION

a function that will map the input variables to the output and will be able to provide reasonable predictions for unseen situations. In ML, it is desirable to use descriptors that have a clear physical meaning, can be measured experimentally, and/or can be easily computed. For the problem under study, such quantities are the mass density, the void fraction, the surface area, and a few more structural features of the material. In addition, the chemical composition of a material is usually easy to be determined experimentally and therefore is commonly used as a descriptor in these studies.4 In recent studies,3−5 several of the previously mentioned quantities were used as descriptors in ML algorithms. The accuracy of various approaches was examined, and in several

Accurate theoretical determination of the methane adsorption capacity of metal organic frameworks (MOFs) usually requires molecular simulations. Even though efficient simulation techniques and fast computer systems have been developed over the last two decades, the accurate characterization of these materials and the identification of the most promising ones still remain a demanding task due to the large number of MOFs that can potentially be synthesized. Machine learning (ML) offers an alternative approach for the study of nanoporous materials. Over the last few years, the accuracy of several ML algorithms in the prediction of various gas uptake capacities of MOFs1−5 and other crystalline nanoporous materials6 has been investigated. In principle, assuming that the appropriate input variables (commonly called descriptors) are provided to the ML algorithm along with a sufficient number of training examples (a set of input/output pairs), the learning algorithm will produce © 2019 American Chemical Society

Received: April 9, 2019 Revised: June 5, 2019 Published: June 7, 2019 6080

DOI: 10.1021/acs.jpca.9b03290 J. Phys. Chem. A 2019, 123, 6080−6087

Article

The Journal of Physical Chemistry A

Figure 1. Number of appearances of various chemical species in the 4764 MOFs of the CoRE database examined in this study. Notice the logarithmic scale in the y-axis.



METHODOLOGY Training Dataset (CoRE MOFs). In the present study, we have used the computation-ready, experimental (CoRE) MOF database.8 The CoRE database contains a total of 5109 MOFs that have been experimentally synthesized. The atomic coordinates for 4764 of them have been significantly modified with respect to those derived from experimental crystal data to become appropriate for molecular simulations.8 For the evaluation of the present approach, the 4764 MOFs of the CoRE dataset were used. Compared to the hMOFs used in our previous study, 5 the CoRE MOFs are chemically and structurally significantly more diverse. For example, the CoRE dataset contains over 50 types of metal clusters and more than 65 types of metal and metalloid atoms. The frequency of appearance of various types of atoms in the CoRE MOFs is shown in Figure 1. In contrast, the hMOF dataset contains a total of five different types of metal clusters and four different types of metals. Therefore, the CoRE MOF dataset is more appropriate to address the issues previously discussed. In addition, since the CoRE MOFs have been experimentally synthesized, the reported results regarding the methane adsorption capacities of MOFs at various thermodynamic conditions are of a broader interest. GCMC Simulations. Several structural properties of the CoRE database, such as the helium void fraction, surface area, and pore volume, have been provided by the developers of the dataset.8 These are denoted in Table 1 and have been used in the present study as descriptors in the ML algorithm. On the other hand, the methane adsorption capacity was provided only for a small number of MOFs at T = 298 K and P = 65 bar, which are not sufficient for the evaluation of our approach. For this reason, grand canonical Monte Carlo (GCMC) simulations were initially performed for the calculation of the methane adsorption capacities of all 4764 structures using our home-made code. The simulation temperature was set to T = 298 K and the pressure to P = 5.8 and 65 bar since these are the states that United States Department of Energy (DOE) set for deliverable capacity.9,10 GCMC simulations were also performed for P = 1 bar since previous studies have shown significantly lower accuracy in the ML predictions at this pressure. Therefore, evaluation of our approach at these conditions is a demanding test. Finally, the GCMC simulations at the three pressures mentioned were repeated for temperature T = 280 K, and the results were also used for the assessment of the new approach. Potential parameters for the framework atoms were taken from the universal force field (UFF)11 and from TraPPE for methane.12 Due to the negligible electrostatic interactions between the methane molecules, and between a methane molecule and the MOF, only van der Waals (vdW) interactions

cases, it was found that the ML algorithm provides satisfactory results for the methane uptake capacity of MOFs.4 However, in particular at low pressures, the accuracy of the results is not satisfactory.5 One may therefore conclude that the previous descriptors do not provide sufficient information for the accurate training of an ML algorithm. In addition to the previous requirements, the descriptors should be general enough so that the ML method will be able to provide predictions for significantly diverse datasets. In several previous studies,4,5 the hypothetical MOF (denoted as hMOF hereafter) dataset7 was used. This database contains more than 137,000 structures. However, there are only five different types of metallic corners that are present in all structures and only four different types of metal atoms. This is only a small portion of possible corners and metal atoms that an MOF can potentially be synthesized from. Based on the fact that the atom types and/ or the corner types were considered as descriptors in the previously mentioned works, it can be easily concluded that the ML algorithm will not be able to provide predictions for MOFs that contain different types of metal atoms or corners. It should also be mentioned that, despite the small diversity of the structural features in the hMOF dataset, it was found that, for converged predictions of the ML method, set sizes of ∼2000− 5000 samples had to be considered during the algorithm training. To improve the predictive accuracy of the ML algorithms for the given problem for a wide range of thermodynamic conditions, we propose in this work the use of a novel set of descriptors, along with the previously described standard structural features of MOFs. The proposed descriptors are based on the potential energy surface (PES) of a system that comprises the framework of the nanoporous material and a number of hypothetical particles of various sizes. For each descriptor of the set, the average value of the Boltzmann factor is computed. We assign to the new descriptors a clear physical meaning, similar to that of the helium atom void fraction. The computational cost for the accurate calculation of these descriptors is only a fraction of the time required for a molecular simulation. We demonstrate that the new scheme significantly increases the predictive accuracy of the ML algorithm for all the pressures that were considered and, more importantly, in the low-pressure regime. Details about the dataset employed in this study and the methodology used for the calculation of the methane uptake at different thermodynamic conditions are provided below. The proposed descriptors are extensively described, while the ML results are presented and discussed. 6081

DOI: 10.1021/acs.jpca.9b03290 J. Phys. Chem. A 2019, 123, 6080−6087

Article

The Journal of Physical Chemistry A

The Lorentz−Berthelod mixing rules were used to combine the σ and ε of the atoms i and j

Table 1. Descriptors Used in the Present Study in the ML Methodsa descriptor

minimum

maximum

Standard Structural Features void fraction 0.02 0.97 pore volume (cm3 g−1) 0.07 7.46 density (g cm−3) 0.13 5.18 grav. surface area (m2 g−1) 0.0 6832.6 pore-limiting diameter (Å) 2.4 71.5 largest cavity diameter (Å) 2.74 71.64 Probe Atoms probe-1 (σ = 2.5 Å, ε/kB = 50 K) 0.08 5.14 probe-2 (σ = 3.0 Å, ε/kB = 50 K) 0.04 12.37 probe-3 (σ = 3.5 Å, ε/kB = 50 K) 0.0 36.35 probe-4 (σ = 4.0 Å, ε/kB = 50 K) 0.0 131.7

mean

σij =

0.43 0.49 1.37 829.4 4.83 6.79

εij =

σi + σj 2

(2)

εiεj

(3)

The total potential energy of the system is given as the sum of all pair interactions between the methane molecules in the simulation box and between the methane molecules and framework atoms. Interactions were computed up to distance Rc = 12.8 Å, while no energy or pressure tail corrections were considered. The simulation cell of each MOF was constructed by replicating the corresponding unit cell in the three dimensions so that its linear sizes were at least twice the Rc. We performed 50,000 MC cycles for the equilibration of the system and 200,000 MC cycles for the production. In every cycle, a number of Monte Carlo moves equal to the number of the methane molecules found in the simulation box at each instant were attempted. A minimum of 20 moves were attempted in the case where less than 20 methane molecules were found in the MOF. The Monte Carlo moves and corresponding probabilities to attempt one of them were translation (50%), insertion (25%), and deletion (25%). The results of these simulations for the three pressures at the two different temperatures are provided in the Supporting Information (Microsoft Excel file). Notice that the results of the present GCMC simulations are in excellent agreement with previous published results of Chung et al.8 as well as with the results of the ∼120 structures at pressure values of 5.8 and 65 bar in ref 13. Descriptors. As descriptors during the ML training, structural features of MOFs were used. These are presented in the first part of Table 1, while their values are provided by Chung et al.8 (see related computational details). Since these quantities can be experimentally determined and can be routinely

1.12 1.53 2.30 3.79

a

The mean value of each descriptor for the 4764 CoRE MOFs along with its minimum and maximum values is presented. For the standard structural features, these values may be compared with the corresponding values of hypothetical MOFs in ref 4. In the second part of the table, similar values for the probe atoms are given. The LJ parameters of each probe atom (σ and ε) are given in parentheses. In the case where hard spheres (ε = 0) were used as descriptors, their diameters were equal to the vdW diameters (σ) of the probe atoms (see text for details).

were taken into account, which were described with a Lennard− Jones (LJ) potential. Notice that the methane molecule was treated as a one-center particle with parameters ε/kB = 148 K and σ = 3.73 Å. For a pair of particles i and j at a distance rij, the LJ energy is written as ÄÅ É ÅÅi y12 i y6ÑÑÑ ÅÅjj σij zz ÑÑ σ j z ij V (rij) = 4εijÅÅÅÅjjj zzz − jjjj zzzz ÑÑÑÑ j rij z ÑÑ ÅÅj rij z k { ÑÑÖ ÅÅÇk { (1)

Figure 2. Schematic illustration of the procedure used for the calculation of the probe particles in a specific MOF. The probe atom (a) is inserted in the framework, and its energy Ei is computed. Depending on the position i and the size of the probe, the interaction with the framework is less or greater than zero, while the corresponding Boltzmann factor e−βEi is greater or less than one. The final value of the descriptor is obtained by placing the probe in a large number of positions in the framework and taking the average value of the Boltzmann factor. The same procedure is repeated for all probes. 6082

DOI: 10.1021/acs.jpca.9b03290 J. Phys. Chem. A 2019, 123, 6080−6087

Article

The Journal of Physical Chemistry A computed today using well-established software packages, such as Zeo++,14 poreblazer,15 and RASPA,16 they will be referred to, hereafter, as standard structural descriptors to distinguish them from the new set of descriptors that are also based on structural features of the material. It will be discussed later that, by using solely the standard structural descriptors, the predictive accuracy of the ML algorithm is relative low, in particular at low pressures (1 and 5.8 bar). As also mentioned, due to the high diversity of the CoRE MOFs and the relatively small size of the database (compared to the hMOF database), considering as descriptors the types of atom groups (corners, linkers, and functional groups)5 that the MOF consists of will not be sufficient to provide satisfactory results. In an attempt to overcome the previous restrictions, a new set of descriptors based on the potential energy surface of the MOF was developed. In the procedure used, different probe atoms interacting with the framework were considered, and the average Boltzmann factor was computed according to probe − (a) =

1 N

probe particles provides more details about structural characteristics of the framework, such as the pore size distribution. 3. As mentioned by Ongari et al.,17 eq 4 expresses the probability of a single probe atom to be absorbed in the nanomaterial at a certain temperature by considering both the size of the atoms (the LJ parameter σ) and the strength of the interaction between the probe atom and all atoms of the nanomaterial (the LJ parameter ε). Alternative approaches may be used as well for the calculation of the materials’ void fraction. In the probe center pore volume (PC) or simply “pore volume” method (see ref 17 for additional details), all atoms are treated as hard spheres. In the case where a randomly placed probe atom does not overlap with any other atom (i.e., the distance of the probe atom a of radius σa/2 with all other atoms i in the framework is larger than (σa/2 + σi/2), where σi is the vdW diameter of atom i), a value of 1 is assigned; otherwise, the value is set to 0. The void fraction is then given as

N

∑ exp(−βEi(a)) i=1

N

(4)

hs − (a) =

where β = 1/kBT, with kB being the Boltzmann constant, while E(a) i represents the interaction energy of the framework with a single probe atom of type (a) found at position i. The average Boltzmann factor is computed (expressed here by a summation) for all possible positions i of the probe atom in the framework. In the present work, the LJ potential of eq 1 was used to describe the interaction between the probe particle and framework. The average Boltzmann factor was calculated for four different probe particles at T = 298 K using a simple Monte Carlo scheme according to which each probe particle was randomly placed inside the framework and its energy E(a) was computed. A i sufficient large number of random insertions was attempted to accurately approximate the sum of eq 4. The interactions of the probe particles with the MOF were extended up to a cutoff distance of 12.8 Å. The σ and ε values of the probe particles are shown in Table 1. A schematic representation of the procedure used for the calculation of the new set of descriptors is illustrated in Figure 2. A few points deserve clarification here: 1. The calculation of the proposed descriptors is similar to that of the helium void fraction. Usually, for the latter, a single helium atom is inserted at random positions in the material. It is assumed that it interacts with the framework with vdW interactions only (the LJ parameters for helium in eqs 1−3 are ε/kB = 10.9 K and σ = 2.64 Å) and the obtained energy is used for the calculation of the weighted Boltzmann factor in eq 4 (see, for example, ref 8 for a description of the computational procedure). Following the same procedure, the descriptors proposed in this work can be considered as the void fraction of hypothetical particles with sizes (i.e., vdW radius) that do not necessarily correspond to sizes of real atoms. Also, the helium void fraction, considered before as a standard structural feature, can also be considered as a probe particle. 2. The reason that a set of probes, instead of one probe, is needed is, first, to distinguish between different topologies of MOFs. For example, similar values of one probe (e.g., the helium void fraction) are not sufficient to distinguish between diverse framework topologies. Second, the set of

∑i = 1 value N

(5)

According to the latter definition, the strength of the vdW interactions between the probe particles and the crystal is not taken into account. The importance of considering the strength of interactions in the constructed descriptors is evaluated below by using hard spheres instead of interacting probe atoms in the ML method. 4. For all probes, the value of ε/kB was set arbitrarily to 50 K (see Table 1). Notice that, due to the functional form of the LJ potential (eq 1) and the Lorentz−Berthelot mixing rules, which were applied to determine the LJ parameters of the MOFs atoms and the probe, a constant multiplicative factor equal to β ε(a) [ε(a) refers to the LJ parameter of probe-(a)] will appear as a multiplicative constant in the exponent −βE(a) i of eq 4. This constant is also arbitrary. Since it is not attempted to use the ML algorithm for predictions at temperatures different than those used during the training of the algorithm, this arbitrariness should not affect the results. 5. In several previous works, descriptors were constructed based on physical and well-established properties. For example, Pardakhti et al.4 used atomic properties, such as the electronegativity per atom, the degree of unsaturation per carbon, and others as descriptors. In recent works,18,19 the atomic property-weighted radial distribution functions (AP-RDF) were used as descriptors. The AP-RDF were based on the electronegativity, polarizability, and vdW volumes of individual atoms. In contrast to the previous works,4,18,19 the present approach relies for the construction of descriptors on the potential energy surface of the material under study. Although the calculation of these descriptors requires an additional effort compared to the previous approaches, the computational cost is significantly lower compared to that of a molecular (e.g., GCMC) simulation. Another advantage of the present approach is that it can be used for any chemical composition of the crystal, assuming only that the LJ parameters of the atoms and the structure of the material are known. This makes the present approach applicable to a wide variety of nanoporous materials. 6083

DOI: 10.1021/acs.jpca.9b03290 J. Phys. Chem. A 2019, 123, 6080−6087

Article

The Journal of Physical Chemistry A n

It should be mentioned that another new approach was proposed recently, which is also based on the potential energy surface of a system that consists of material frameworks and a probe atom.1 In that study, the LJ parameters of the probe atom were set equal to that of the adsorbed gas under study (i.e., methane and hydrogen). Instead of the weighted Boltzmann factor used here, the histogram of the system’s potential energy was discretized using a small number of bins that served as descriptors for the ML algorithm. It is expected that the computational cost for the construction of the additional descriptors should be similar in the two approaches since, in both cases, a similar number of evaluations of the LJ potential is required. Despite the above-mentioned similarities of the latter approach with the present one, it is difficult to judge whether there are advantages of one approach over the other in terms of accuracy and efficiency. For that, evaluation of the two approaches is required using the same dataset and the same comparison protocol (e.g., ML technique, evaluation metrics, cross-validation method, etc.). Machine Learning Algorithm. The Random Forest algorithm (RF)20 has been extensively used in related studies,4 providing similar or more accurate predictions for the methane uptake capacities of MOFs compared to other ML methods, such as Support Vector Regression (SVR)21 and Decision Trees (DT). The same was also found for the case of CO2 and N2 guest molecules.22 Therefore, the Random Forest regressor was also employed in the present study. To investigate the dependence of the predictive accuracy of the ML algorithm on the size of the training set, datasets with sizes between 50 and 1000 of randomly chosen MOFs were used for the training of the RF algorithm, while the remaining of the total 4764 MOFs were used for the validation of the results. To ensure the stability of the ML results to the choice of the training set, for each training set size, the procedure was repeated 100 times by randomly picking a different set of MOFs every time. The performance of the algorithm for a specific training set size was taken as the average of the 100 independent calculations. The overall performance of the model was evaluated using the k-fold cross-validation technique23 for 10 folds. In this approach, the original dataset is divided into 10 equal parts. One of them serves as a validation set, while the remaining nine are used for the training of the algorithm. This procedure is repeated 10 times, using a different validation set each time. The estimation of the performance of the algorithm is computed as the average of the results of the 10 calculations. As in our previous studies,5 the results of the ML algorithm were evaluated by using the coefficient of determination R2 in eq 6, the mean absolute error (MAE) in eq 7, the root-mean-square error (RMSE) in eq 8, and the weighted absolute percent error (WAPE) in eq 9.

WAPE =

× 100 (9) F where Fi are the reference values (i.e., the methane uptake capacities of CoRE MOFs obtained from the GCMC simulations), F is the corresponding average value, and Ai are the predictions of the ML algorithm for each MOF.



RESULTS AND DISCUSSION Figure 3 shows the dependence of the R2 and WAPE metrics on the size of the training set for the predictions of the ML

Figure 3. Dependence of the R2 and WAPE metrics on the training set size. The dotted lines with circles correspond to the case where only the standard structural features were used in the ML algorithm. The dashed lines with diamonds and the solid lines with squares correspond to the cases where, in addition to the standard structural features, the hard spheres and probes, respectively, were used as descriptors in the ML algorithm. The results correspond to T = 298 K, while the pressure is indicated in the title of each graph.

algorithm for the volumetric-based methane adsorption capacities of MOFs at T = 298 K and for pressures P = 1, 5.8, and 65 bar. The dotted lines with circles correspond to the case where only the standard structural features were considered in the RF method, while the solid lines with squares correspond to the case where both standard structural features and probe atoms were taken into account. It appears that, in all cases, for sizes of the training set of ∼1000, all curves have reached a plateau. It was found (not shown) that larger training set sizes will not lead to any significant improvement of the results. In particular, for the case where both standard structural features and probes were taken into account, convergence was achieved for all temperatures and pressures even for ∼500 MOFs. This result somehow contradicts with previous findings using the hMOF database, in particular at low pressures, in which training set sizes of ∼2000−5000 cases were needed during the ML training to achieve convergence.4,5 As discussed in a previous work,5 the relatively low accuracy of the methane uptake that has been computed for the majority of the hMOFs may have affected the convergence properties of the ML algorithm. Figure 3 also shows the results for the case where the standard structural features are combined with the hard spheres (dashed lines with diamonds). According to the previous discussion, in this case, only the sizes (e.g., vdW radius) of the framework atoms but not the interaction strength of the framework with the guest atom are taken into account in the constructed descriptors. As expected, therefore, the accuracy of the obtained results is in-

n

R2 = 1 −

MAE =

1 n

∑i = 1 (Ai − Fi )2 n

∑i = 1 (Ai − F )2

(6)

n



Ai − Fi| (7)

i=1

n

RMSE =

∑i = 1 (Ai − Fi )2 n

∑i = 1 |Ai − Fi|

(8) 6084

DOI: 10.1021/acs.jpca.9b03290 J. Phys. Chem. A 2019, 123, 6080−6087

Article

The Journal of Physical Chemistry A

Figure 4. Scattering plot comparing the results of the ML algorithm with the results of the GCMC simulations for the volumetric-based methane uptake of CoRE MOFs at 5.8 bar. The graphs in the first row show the results at T = 280 K and at T = 298 K in the second row. In the first column, only structural features were considered during the ML training, only the probes in the second one, and both in the last one. Green symbols show the training dataset, while red symbols show the predictions of the ML algorithm.

Table 2. Evaluation of the ML Predictions Using the k-Fold Cross-Validation Approach for k = 10 Foldsa descriptor

pressure (bar)

AV (vSTP/v)

R2

RMSE (vSTP/v)

MAE (vSTP/v)

WAPE (%)

std struct. std struct. std struct. probes probes probes std struct. + hs std struct. + hs std struct. + hs std struct. + probes std struct. + probes std struct. + probes

1 5.8 65 1 5.8 65 1 5.8 65 1 5.8 65

28.4 (38.4) 67.4 (80.5) 127.9 (137.6) 28.4 (38.4) 67.4 (80.5) 127.9 (137.6) 28.4 (38.4) 67.4 (80.5) 127.9 (137.6) 28.4 (38.4) 67.4 (80.5) 127.9 (137.6)

0.686 (0.704) 0.791 (0.810) 0.930 (0.935) 0.892 (0.885) 0.890 (0.872) 0.866 (0.863) 0.849 (0.851) 0.888 (0.897) 0.943 (0.945) 0.928 (0.924) 0.932 (0.928) 0.955 (0.955)

11.51 (14.01) 16.06 (17.68) 16.39 (17.22) 6.75 (8.71) 11.65 (14.51) 22.72 (25.02) 7.98 (9.95) 11.71 (12.99) 14.82 (15.89) 5.52 (7091) 9.13 (10.89) 13.22 (14.38)

7.90 (9.79) 11.27 (12.37) 11.43 (12.02) 4.44 (5.95) 8.21 (10.38) 16.62 (18.36) 5.35 (6.76) 8.19 (9.25) 10.48 (11.25) 3.18 (4.37) 6.22 (7.62) 9.38 (10.15)

27.92 (25.60) 16.75 (15.38) 8.95 (8.75) 15.68 (15.53) 12.20 (12.91) 13.01 (13.36) 18.89 (17.65) 12.16 (11.50) 8.20 (8.19) 11.25 (11.43) 9.24 (9.47) 7.34 (7.38)

a

The results for T = 298 and 280 K (in parentheses) and pressures P = 1, 5.8, and 65 bar are tabulated. Different sets of descriptors were used, that is, (a) standard structural descriptors (std struct.), (b) probe atoms (probes), (c) standard structural descriptors + hard spheres (std struct. + hs), and (d) standard structural descriptors + probe atoms (std struct. + probes). The average volumetric-based methane uptake capacity of the 4764 CoRE MOFs (AV) examined in this study as obtained from the GCMC simulations at the various thermodynamic conditions (T and P) are also presented.

between the two previous cases where the probe atoms were combined or not together with the standard structural features. The accuracy of the ML predictions can be easily assessed by visualizing scatter plots similar to those in Figure 4. Results are shown in the graphs for the methane adsorption at P = 5.8 bar and for T = 280 K (first row) and T = 298 K (second row). The three columns present results for the case where the standard

structural features only (first column), the probe atoms only (second column), and the structural features combined with the probe atoms (third column) are used as descriptors in the ML algorithm. The x-axis in each graph corresponds to the reference values of the volumetric-based methane uptake of the MOFs, as computed from the GCMC simulations, and the y-axis to the predictions of the ML algorithm. It can be easily understood that 6085

DOI: 10.1021/acs.jpca.9b03290 J. Phys. Chem. A 2019, 123, 6080−6087

Article

The Journal of Physical Chemistry A

algorithm, the accuracy achieved in the predictions of the MOFs uptake capacity at low pressures is significantly higher than the accuracy of the results in our previous studies at the same conditions, where different sets of atom-based descriptors were employed.5 In that sense, the present approach is not only applicable for a wider range of systems, but at the same time, it also provides more accurate results than the previous approaches.

the points in the graph that fall on or close to the diagonal correspond to the MOFs for which the predictions are very close to the reference values. In contrast, points far from the diagonal indicate large errors in the ML predictions for these particular MOFs. In all cases, for the results in Figure 4, 1000 randomly selected MOFs were considered for the training of the ML algorithm and the remaining for the validation. The results in green and red correspond to the training and to the validation sets, respectively. The various metrics for the training and validation sets are also denoted in the graphs. By visualizing the previous graphs or by examining the various metrics, it can be easily concluded that, for both temperatures, the use of probe atoms combined with standard structural features as descriptors for the ML method provides significantly more accurate predictions compared to the other approaches examined. Table 2 summarizes the cross-validation results of the ML algorithm for temperatures T = 280 and 298 K and pressures P = 1, 5.8, and 65 bar. The various metrics (eqs 6−9) presented here were obtained by employing a 10-fold cross-validation approach. Various sets of descriptors were studied; they are discussed below. In agreement with previous studies, in the case where only the structural descriptors are used in the ML algorithm, the accuracy of the predictions is relatively low, in particular at low pressures, and improves at higher ones. Based on the values of R2 and WAPE, it appears that, for the same pressure, the accuracy is slightly higher at the lower temperature. It is easily seen that, by sorting the numerical values of the average adsorption capacities of the MOFs in Table 2, for increased average values of the adsorption capacity, the values of R2 and WAPE indicate higher accuracy. For this reason, in general, the ML predictions, which are based only on structural features, are more accurate for lower temperatures and higher pressures. In previous studies,4,5 the large amount of the available data in the hMOF dataset and the small diversity of the chemical species allowed us to consider the chemical character of the MOFs during the ML training by setting as descriptors the type of atoms or the type of atom groups (i.e., corners, linkers, and functional groups) that the MOFs were consisted of. A similar approach in the present study can not be employed, however, as it was explained before. Based on the results shown in Figure 3 and tabulated in Table 2, it becomes clear that significant improvements of the results can be achieved in the case where both standard structural features and probe atoms are considered. In the latter case, for the two temperatures and three pressures examined, the R2 becomes almost the same (0.93−0.96), while WAPE slightly decreases as pressure increases (i.e., 11, 9, and 7% for pressures of 1, 5.8, and 65 bar, respectively, at both temperatures). These are significantly more accurate predictions than before. For example, R2 = 0.69 and WAPE = 27.9% at T = 298 K and P = 1 bar when only the structural descriptors are considered. As expected, higher accuracy improvements are achieved for the adsorption capacities at low pressures and high temperatures. At the other limit (high pressure and low temperature) where standard structural descriptors alone already provide satisfactory results, the accuracy improvements by using all descriptors are smaller but not negligible. In the case where the standard structural descriptors are combined with the hard spheres instead of probe atoms (see Table 2), the improvements in the predictions are smaller, as shown before. It is important to note at this point that, based on the previous results, by using the probe particles as descriptors in the ML



CONCLUSIONS Previous studies5 for the methane uptake capacity of nanoporous materials have shown that ML algorithms that are based solely on physical and experimental measured quantities, such as the chemical composition and the standard structural features previously described, will not always lead to accurate predictions, particularly at low pressures. In the present study, it was found that, by using probe atoms of various sizes, it is possible to construct a set of descriptors for the ML algorithm that will lead to improved predictions. It was also demonstrated that the accuracy of the results is not affected by the thermodynamic conditions. For a wide range of temperatures (T = 280−300 K and pressures P = 1−65 bar), the accuracy of the obtained results remains the same. The benefits of the present approach are more significant at lower pressures. For example, for P = 1 bar, the R2, MAE, and WAPE metrics improved by a factor of 2−3 compared to the results obtained using only standard structural descriptors. At the highest pressure examined, P = 65 bar, the standard structural descriptors alone can provide accurate results; however, considering also the probe atoms in the ML algorithm, nonnegligible improvements are observed in the predictions. The computational cost for the calculation of each one of the different probe atoms is similar to that required for the calculation of the helium atom void fraction and therefore is significantly lower than the computational time required for an accurate GCMC simulation. It was shown that the choice of the potential parameters of the probe atoms (σ and ε) is not crucial and will not significantly affect the predictions of the ML method. It is important, however, for the methane gas, that the vdW radius of the probe atoms covers a range between 2.5 and 4 Å. Few additional simulations (the results are not presented) showed that probe atom sizes beyond 4 Å will not lead to any significant improvements in the accuracy of the ML algorithm. On the other hand, it is important to consider in the probes both the size and interaction strength with the framework atoms. The results obtained by considering hard spheres (zero interactions with the framework) are clearly of lower accuracy. In contrast to previous approaches,5 the present one does not require the training of the ML algorithm on specific atom types or group of atoms, limiting in this way its applicability only on systems with similar chemical compositions. It is assumed, however, that the interactions between the probe atoms and the framework are known. Based on the proposed set of descriptors (see Table 1), it appears that the present approach is general and could be applied not only in MOFs but also in other types of nanoporous materials. Further work is required though to validate this point. We also aim to investigate whether the present approach can be applied to different guest gases, in particular to those whose (in contrast to methane) electrostatic interactions between the framework and the guest molecules are significant. In this case, it will be examined whether the inclusion 6086

DOI: 10.1021/acs.jpca.9b03290 J. Phys. Chem. A 2019, 123, 6080−6087

Article

The Journal of Physical Chemistry A

(7) Wilmer, C. E.; Leaf, M.; Lee, C. Y.; Farha, O. K.; Hauser, B. G.; Hupp, J. T.; Snurr, R. Q. Large-scale screening of hypothetical metal− organic frameworks. Nat. Chem. 2012, 4, 83−89. (8) Chung, Y. G.; Camp, J.; Haranczyk, M.; Sikora, B. J.; Bury, W.; Krungleviciute, V.; Yildirim, T.; Farha, O. K.; Sholl, D. S.; Snurr, R. Q. Computation-ready, experimental metal-organic frameworks: A tool to enable high-throughput screening of nanoporous crystals. Chem. Mater. 2014, 26, 6185−6192. (9) Arpa-e funding opportunity announcements. https://arpa-e-foa. energy.gov/. (10) Peng, Y.; Krungleviciute, V.; Eryazici, I.; Hupp, J. T.; Farha, O. K.; Yildirim, T. Methane Storage in Metal-Organic Frameworks: Current Records, Surprise Findings, and Challenges. J. Am. Chem. Soc. 2013, 135, 11887−11894. (11) Rappe, A. K.; Casewit, C. J.; Colwell, K. S.; Goddard, W. A.; Skiff, W. M. UFF, a Full Periodic Table Force Field for Molecular Mechanics and Molecular Dynamics Simulations. J. Am. Chem. Soc. 1992, 114, 10024−10035. (12) Martin, M. G.; Siepmann, J. I. Transferable Potentials for Phase Equilibria. 1. United-Atom Description of n-Alkanes. J. Phys. Chem. C 1998, 102, 2569−2577. (13) Zhang, H.; Deria, P.; Farha, O. K.; Hupp, J. T.; Snurr, R. Q. A thermodynamic tank model for studying the effect of higher hydrocarbons on natural gas storage in metal−organic frameworks. Energy Environ. Sci. 2015, 8, 1501−1510. (14) Willems, T. F.; Rycroft, C. H.; Kazi, M.; Meza, J. C.; Haranczyk, M. Algorithms and tools for high-throughput geometry-based analysis of crystalline porous materials. Microporous Mesoporous Mater. 2012, 149, 134−141. (15) Sarkisov, L.; Harrison, A. Computational structure characterisation tools in application to ordered and disordered porous materials. Mol. Simul. 2011, 37, 1248−1257. (16) Dubbeldam, D.; Calero, S.; Ellis, D. E.; Snurr, R. Q. RASPA: Molecular simulation software for adsorption and diffusion in flexible nanoporous materials. Mol. Simul. 2015, 42, 81−101. (17) Ongari, D.; Boyd, P. G.; Barthel, S.; Witman, M.; Haranczyk, M.; Smit, B. Accurate characterization of the pore volume in microporous crystalline materials. Langmuir 2017, 33, 14529−14538. (18) Fernandez, M.; Boyd, P. G.; Daff, T. D.; Aghaji, M. Z.; Woo, T. K. Rapid and Accurate Machine Learning Recognition of High Performing Metal Organic Frameworks for CO2 Capture. J. Phys. Chem. Lett. 2014, 5, 3056−3060. (19) Fernandez, M.; Woo, T. K.; Wilmer, C. E.; Snurr, R. Q. Largescale quantitative structure-property relationship (QSPR) analysis of methane storage in metal-organic frameworks. J. Phys. Chem. C 2013, 117, 7681−7689. (20) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5−32. (21) Smola, A. J.; Schölkopf, B. A tutorial on support vector regression. Statistics and computing 2004, 14, 199−222. (22) Fernandez, M.; Barnard, A. S. Geometrical Properties Can Predict CO2 and N2 Adsorption Performance of Metal-Organic Frameworks (MOFs) at Low Pressure. ACS Comb. Sci. 2016, 18, 243−252. (23) Mosteller, F.; Tukey, J. In Revised Handbook of Social Psychology; Lindzey, G., Aronson, E., Eds.; Addison Wesley, 1968; Vol. 2; pp 80− 203.

of electrostatic interactions between the probe atoms and the framework is needed for accurate predictions.



ASSOCIATED CONTENT

* Supporting Information S

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jpca.9b03290. Distribution of the standard structural features, of the average Boltzmann factor for the various probe atoms, and of the volumetric-based methane adsorption capacities for the 4764 CoRE MOFs (PDF) Volumetric-based methane adsorption capacities of all 4764 CoRE MOFs as computed by GCMC simulations at the two temperatures and three pressures examined in this work (XLS)



AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected] (G.S.F.). *E-mail: [email protected] (G.F.). ORCID

Konstantinos Gkagkas: 0000-0001-6326-7317 George Froudakis: 0000-0002-6907-1822 Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This work has been financially supported by Toyota Motor Europe NV/SA. E.K. has been funded by State Scholarships Foundation (SSF), grant number 2016-050-0503-8189, and cofinanced by the European Union (European Social Fund (ESF)) and Greek national funds through the action entitled “Reinforcement of Postdoctoral Researchers”, in the framework of the Operational Programme “Human Resources Development Program, Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) 2014−2020.



REFERENCES

(1) Bucior, B. J.; Bobbitt, N. S.; Islamoglu, T.; Goswami, S.; Gopalan, A.; Yildirim, T.; Farha, O. K.; Bagheri, N.; Snurr, R. Q. Energy-based descriptors to rapidly predict hydrogen storage in metal−organic frameworks. Mol. Syst. Des. Eng. 2019, 4, 162−174. (2) Aghaji, M. Z.; Fernandez, M.; Boyd, P. G.; Daff, T. D.; Woo, T. K. Quantitative Structure-Property Relationship Models for Recognizing Metal Organic Frameworks (MOFs) with High CO2 Working Capacity and CO2/CH4 Selectivity for Methane Purification. Eur. J. Inorg. Chem. 2016, 2016, 4505−4511. (3) Borboudakis, G.; Stergiannakos, T.; Frysali, M.; Klontzas, E.; Tsamardinos, I.; Froudakis, G. E Chemically intuited, large-scale screening of MOFs by machine learning techniques. npj Comput. Mater. 2017, 3, 40. (4) Pardakhti, M.; Moharreri, E.; Wanik, D.; Suib, S. L.; Srivastava, R. Machine Learning Using Combined Structural and Chemical Descriptors for Prediction of Methane Adsorption Performance of Metal Organic Frameworks (MOFs). ACS Comb. Sci. 2017, 19, 640− 645. (5) Tsamardinos, I.; Fanourgakis, G. S.; Greasidou, L.; Klontzas, E.; Tsirlis, K.; Borboudakis, G.; Gkagkas, K.; Froudakis, G. E. An Automated Machine Learning architecture for the accelerated prediction of Metal-Organic Frameworks performance in energy and environmental applications. submitted 2019. (6) Thornton, A. W.; Winkler, D. A.; Liu, M. S.; Haranczyk, M.; Kennedy, D. F. Towards computational design of zeolite catalysts for CO2 reduction. RSC Adv. 2015, 5, 44361−44370. 6087

DOI: 10.1021/acs.jpca.9b03290 J. Phys. Chem. A 2019, 123, 6080−6087