
Oct 18, 2016 - Machine Learning Approach for Prediction and Search: Application to Methane Storage in a Metal–Organic Framework. Hiroshi Ohno and Yu...
Article pubs.acs.org/JPCC

Machine Learning Approach for Prediction and Search: Application to Methane Storage in a Metal−Organic Framework

Hiroshi Ohno* and Yusuke Mukae

Toyota Central R&D Laboratories, Inc., 41-1 Yokomichi, Nagakute, Aichi 480-1192, Japan

ABSTRACT: Machine learning is applied to predict methane uptake and to search for the pore properties and organic building blocks that maximize it. To show the importance of the molecular structure of the building block for methane uptake, we construct a graph-based kernel function that measures the degree of similarity between two molecular structures. The prediction model is a Gaussian process regression whose covariance function combines this kernel function with a Gaussian kernel function of the pore properties. Including the structural relationships of the molecular structure improves the prediction accuracy. After training the prediction model, we use a random search to identify candidate pore properties and organic building blocks that are predicted to yield an uptake larger than any seen in the training data set.



such as text and graphs,21 and a number of studies have examined graph kernels.22−25 A graph, which consists of a set of vertices and edges, is a useful way to describe the molecular structure of a compound.19 For example, the atoms and atom groups in a compound can be considered to be the vertices, and the bonds between the atoms or atom groups can be considered to be the edges. For this reason, we use a graph-based kernel function (structural kernel function) to treat the structural data of compounds. The Coulomb matrix is also a useful representation of the molecular structure of a compound.4−6 However, the graph representation is more extensible because it can flexibly describe the bond distances, bond angles, and atomic properties of the compound. We adopt Gaussian process (GP) regression26 to model the relationship between the methane uptake and the structure and pore properties of the MOF building blocks and to predict the methane uptake. GP regression is a nonparametric Bayesian method and a powerful nonlinear model. As the covariance function of the GP, we use a multiple kernel that combines a kernel function of the quantitative descriptors for the pore properties of the MOF with the structural kernel function. The kernel function for the quantitative descriptors is the Gaussian kernel function with automatic relevance determination.27 The structural kernel function is based on the optimal assignment kernel.18 This kernel function uses a simple matching between the vertex labels, the adjacent vertex labels, and the

INTRODUCTION

Following the Materials Genome Initiative,1 the use of machine learning (ML) techniques has been expanding rapidly in materials science. The usefulness of ML has been shown in several fields.2−11 In this study, we address the regression task and the search for new materials. Metal−organic frameworks (MOFs) allow the design of pore structure and functionality by using particular combinations of bonding metals and organic linker compounds.12−15 Wilmer et al. used database screening to identify over 300 MOFs that had a predicted methane-storage capacity.16 Fernandez et al. developed models (multilinear regression, decision trees, and support vector machines) for predicting methane storage from the structural features (such as void fraction and pore size) of MOFs.17 They optimized the response surface and determined the conditions of the pore properties that would allow an MOF to maximize methane uptake. However, the structural features are not related to structural relationships in the molecular structure of the MOF building blocks. In this study, we use ML techniques to investigate the relationship between the structural data of MOFs (organic building blocks) and methane uptake. This structural relationship is described in terms of the relations among the atoms or atom groups. Our contribution is to show the effect that the structural relationship of an MOF has on the predicted methane uptake. Moreover, we present a method that uses the prediction model to search for new materials that have the maximum methane uptake. Previous studies18,19 have shown that the properties of a compound can be inferred from the molecular structure. In these studies, the kernel method (e.g., chapter 6 of ref 20) was used. The kernel method is a powerful tool for treating structured data,

Received: July 28, 2016 Revised: September 15, 2016


DOI: 10.1021/acs.jpcc.6b07618 J. Phys. Chem. C XXXX, XXX, XXX−XXX


The Journal of Physical Chemistry C

quantitative descriptors of the MOF building block (described below). If ks is positive definite, then k is a positive definite kernel because the Gaussian kernel function is positive definite (section 4.3.4 of ref 26). The third term on the right-hand side is a jitter term, where σjit is estimated from the data set, and δp,q is a Kronecker delta, which is 1 when p = q and is 0 otherwise. For the quantitative descriptors of the MOF building block, we use the following Gaussian kernel function

distance between the vertices; this naive matching provides an intuitive measure of similarity between graphs. We also use a simple random search to explore new materials. The data set (499 samples) had a maximum methane uptake of 203 cm3 (STP)/cm3 at 3.5 MPa. To find materials exceeding this value, we used the random search with the prediction model to explore values of the structural features (pore properties) and types of MOF. We obtained a candidate material with a predicted uptake of 219.1 cm3 (STP)/cm3 at 3.5 MPa. This MOF had the largest number of benzene rings and the greatest length among the materials in the data set, which suggests that the methane uptake is affected by the number of benzene rings in the MOF.

$$k_g(x, x') = \exp\left(-\frac{1}{2}\sum_{i=1}^{p}\frac{(x_i - x'_i)^2}{\beta_i^2}\right)$$

where x, x′, and β all have dimensionality p, and β denotes the automatic relevance determination parameters, which are estimated from the data set. For the training data set {(x(i), y(i))}i=1,...,n, we assume y(i) = f(x(i)) + ε(i) with ε(i) ∼ N(0, σ2); that is, ε(i) is an independent and identically distributed random error term that follows the normal distribution with mean 0 and variance σ2. Then, the posterior over f is a Gaussian distribution with mean μ(x) and covariance cov(x, x′), as follows



METHODS

The framework for our analysis is shown in Figure 1, which shows the prediction and search system based on the data set.

$$\mu(x) = k(x)^{t}(K + \sigma^{2}I)^{-1}y, \qquad \mathrm{cov}(x, x') = k(x, x') - k(x)^{t}(K + \sigma^{2}I)^{-1}k(x')$$

where f(·) is an unknown function to be estimated, k(x) = (k(x(1), x), ..., k(x(n), x))t, K is the Gram matrix [k(x(i), x(j))]i,j=1,...,n, and y = (y(1), ..., y(n))t. A superscript t denotes the transpose of a matrix or a vector. For the test data samples, the above equations give the predictive mean and variance. It is necessary to estimate β from the training data set; for the training algorithm, we used a random search (e.g., ref 28) that maximizes the leave-one-out log predictive probability (section 5.4.2 of ref 26). Details of the training algorithm can be found in the Supporting Information (Section 2). Next, to search for new materials with the maximum methane uptake, we employ a simple random search using the prediction model described above. The random search consists of sampling values of the quantitative descriptors of the MOF building block and predicting the methane uptake with the learned prediction model. For the sampling distribution, we used the conditional distribution given the type of MOF. These two operations are repeated alternately until the maximum number of iterations is reached. This search algorithm is presented in detail in the Supporting Information (Section 3).
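To make the posterior equations concrete, the following minimal sketch evaluates the predictive mean and variance on a toy data set using only the Gaussian kernel with automatic relevance determination. The function names (`ard_gaussian_kernel`, `solve`, `gp_predict`), the toy data, and the values chosen for β and σ2 are assumptions made for illustration; the structural kernel and the jitter term are omitted, and this is not the authors' implementation.

```python
import math


def ard_gaussian_kernel(x, x2, beta):
    """k_g(x, x') = exp(-0.5 * sum_i (x_i - x'_i)^2 / beta_i^2)."""
    return math.exp(-0.5 * sum(((a - b) / s) ** 2 for a, b, s in zip(x, x2, beta)))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def gp_predict(X, y, x_star, beta, sigma2):
    """Return the posterior mean and variance of f at x_star."""
    n = len(X)
    # A = K + sigma^2 I, with K the Gram matrix over the training inputs.
    A = [[ard_gaussian_kernel(X[i], X[j], beta) + (sigma2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [ard_gaussian_kernel(xi, x_star, beta) for xi in X]
    alpha = solve(A, y)       # (K + sigma^2 I)^{-1} y
    v = solve(A, k_star)      # (K + sigma^2 I)^{-1} k(x*)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = ard_gaussian_kernel(x_star, x_star, beta) - sum(k * w for k, w in zip(k_star, v))
    return mean, var


# Toy data (assumed): two normalized descriptors and a normalized uptake.
X = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
y = [0.1, 0.6, 0.9]
beta = [0.5, 0.5]
mean, var = gp_predict(X, y, [0.5, 0.5], beta, sigma2=0.01)
print(mean, var)
```

Far from the training inputs the kernel vector k(x) vanishes, so the mean returns to 0 and the variance returns to the prior value k(x, x) = 1, which is the behavior the two equations imply.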

Figure 1. Schematic illustration of the overall analysis framework. A machine learning technique is used in the prediction model, which uses a Gaussian process with a multiple kernel function. The conditional distribution given the type of MOF is used as the sampling distribution in the random search method.
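The search loop in Figure 1 can be sketched as follows. This is only a schematic reading of the description in the text: the per-type samplers (independent Gaussians clipped to the normalized range), the type names, and the stand-in `toy_predict` function are hypothetical, and the actual algorithm is the one given in the Supporting Information (Section 3).

```python
import random


def random_search(predict, samplers, n_iter, seed=0):
    """Alternate sampling and prediction; keep the best candidate seen."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        mof_type = rng.choice(sorted(samplers))   # pick a type of MOF
        x = samplers[mof_type](rng)               # sample descriptors given the type
        score = predict(x, mof_type)              # predicted (normalized) uptake
        if best is None or score > best[0]:
            best = (score, x, mof_type)
    return best


def make_sampler(means, stds):
    """Hypothetical conditional distribution: independent Gaussians clipped to [0, 1]."""
    def sample(rng):
        return [min(1.0, max(0.0, rng.gauss(m, s))) for m, s in zip(means, stds)]
    return sample


# Assumed example: two MOF types with different descriptor distributions.
samplers = {
    "type_A": make_sampler([0.3, 0.6], [0.1, 0.1]),
    "type_B": make_sampler([0.7, 0.4], [0.1, 0.1]),
}


def toy_predict(x, mof_type):
    """Stand-in for the trained GP mean; peaks at x = (0.8, 0.5) for type_B."""
    bonus = 0.1 if mof_type == "type_B" else 0.0
    return bonus - (x[0] - 0.8) ** 2 - (x[1] - 0.5) ** 2


best_score, best_x, best_type = random_search(toy_predict, samplers, n_iter=500)
print(best_score, best_x, best_type)
```

Because each iteration is an independent draw, the loop needs no step size or gradient; the only control parameter in this sketch is the number of iterations.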

The kernel method is used in the model that predicts the methane uptake. The input to the model consists of quantitative descriptors (described below) and a graph representation of the molecular structure of each MOF. The output of the search system consists of the specifications (values of the quantitative descriptors and types of MOFs) of the material with the maximum methane uptake. To use the structural relationship of the MOF building blocks, we define a structural kernel function ks(G, G′) for two graphs, G and G′, based on the concept of the optimal assignment kernel.18,19 A graph G is represented by a triple (v, a, d), where v is the list of vertices, a is the list of adjacent vertices, and d is the list of distances between vertices of the graph G. The structural kernel function thus gives a similarity measure between two graphs. A detailed description is given in the Supporting Information (Section 1). The prediction model for methane uptake is a GP whose covariance function is a multiple kernel function, defined as follows



EXPERIMENTAL SETUP

To validate the effectiveness of the structural relationship of the MOF building blocks, we compared the results from a single-kernel GP that used only the quantitative descriptors with those from a multiple-kernel GP that used both the quantitative descriptors and graphs of the MOF building blocks. In addition, we compared the results of the GP with those of other nonlinear models (support vector regression and neural network) and of linear regression using the quantitative descriptors. The quantitative descriptors for the pore properties of the MOFs are shown in Table 1. We chose the attributes of the quantitative descriptors according to the results of ref 17 and its supporting information. The quantitative descriptors and the methane uptake were normalized to the range of 0 to 1. To investigate the effect of the MOF structure on the methane storage, we considered the molecular structure of the building blocks.16
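The normalization to the range of 0 to 1 can be illustrated with ordinary min-max scaling. The sketch below assumes that this is the scaling intended (the text does not name the method), and the helper names and toy uptake values are made up for the example.

```python
def fit_min_max(values):
    """Return the (min, max) pair used to scale a variable to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant variable")
    return lo, hi


def normalize(v, lo, hi):
    """Map a raw value to the normalized range."""
    return (v - lo) / (hi - lo)


def denormalize(z, lo, hi):
    """Map a normalized value back to physical units."""
    return lo + z * (hi - lo)


# Toy uptake values in cm3 (STP)/cm3 (assumed, not taken from the data set).
uptake = [50.0, 120.0, 200.0]
lo, hi = fit_min_max(uptake)
scaled = [normalize(u, lo, hi) for u in uptake]

# An error expressed in normalized units converts by the range only (no offset).
error_normalized = 0.05
error_physical = error_normalized * (hi - lo)
print(scaled, error_physical)
```

The last two lines show why an RMSE reported on the normalized scale can be converted to physical units by multiplying by the range of the uptake alone.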

$$k((x, G), (x', G')) = \alpha_1 k_g(x, x') + \alpha_2 k_s((v, a, d), (v', a', d')) + \sigma_{\mathrm{jit}}^{2}\,\delta_{(x, G),(x', G')}$$

where ks(·,·) denotes the structural kernel function, α1, α2 ≥ 0 (in the experiments, we set α1 and α2 to 0.5), and x denotes the


of its eigenvalues were positive, as shown in the Supporting Information (Section 5). The row and column numbers correspond to the building block numbers in figure S2 of ref 16. We created five training and test data sets randomly to remove the random effects of assignments. Table 2 shows the average

Table 1. Quantitative Descriptors of the Pore Properties of MOFs for Methane Storage Materials

description              unit
maximum pore diameter    Å
density                  cm3/g
surface area             m2/cm3
surface area             m2/g
void fraction

We used 499 structural data samples of MOFs (63 MOF building blocks) obtained from a hypothetical database.29 The values of the quantitative descriptors were calculated using Zeo++.30,31 For the density and the void fraction, the spherical radius of the probe molecule was set to 0 Å. For the surface area, the spherical radius of the probe molecule was set to 1.84 Å. In both cases, the number of samples for the simulation was 5000. To calculate the methane uptake at 3.5 MPa, the interactions between the methane molecule and the constituent elements of the MOF were represented by the Lennard-Jones potential and the Morse potential. RASPA 2.0 was used for the simulation of methane adsorption at 298 K (e.g., ref 32). Each run was equilibrated for 50 Monte Carlo steps, followed by 500 steps for the production period. The building blocks were labeled as in figure S2 in the supporting information of ref 16. The frequency of the building blocks is shown in Figure 2a, and the frequency of the building blocks that terminate with a nitrogen atom (see figure S3 of ref 16) is shown in Figure 2b. To apply the structural kernel function, we need to convert the graph of the molecular structure of the MOF building block to the list representation: G = (v, a, d). The data representation is described in detail in the Supporting Information (Section 4).

Table 2. Performance Results for Different Models with the Test Dataset

model                                              RMSE                 r2
linear regression with quantitative descriptors    0.08163 ± 0.0082     0.6425 ± 0.02908
SVR with quantitative descriptors                  0.07125 ± 0.01434    0.7279 ± 0.06914
NN with quantitative descriptors                   0.07155 ± 0.01471    0.7279 ± 0.05796
GP with quantitative descriptors                   0.07182 ± 0.01001    0.7189 ± 0.04992
GP with quantitative descriptors and structure     0.06194 ± 0.00977    0.7891 ± 0.05281
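For reference, the two metrics reported in Table 2 can be computed as in the following minimal sketch. It follows the description of r2 as the squared correlation coefficient; the function names and the toy values are assumptions for illustration and are unrelated to the reported results.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between targets and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def squared_correlation(y_true, y_pred):
    """Square of the Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


# Toy normalized uptake values (assumed).
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.75]
print(rmse(y_true, y_pred), squared_correlation(y_true, y_pred))
```

Averaging these two quantities over the five random train/test splits, together with their standard deviations, would give entries of the form shown in the table.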

Table 3. Estimated Values of the Automatic Relevance Determination for the Quantitative Descriptors of the GP

description               βi2
maximum pore diameter     1.32
density                   0.0246
surface area (m2/cm3)     0.0850
surface area (m2/g)       0.872
void fraction             0.174
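The values in Table 3 can be read as relevance scores: in the Gaussian kernel, each squared difference is divided by βi2, so a smaller βi2 makes the kernel more sensitive to that descriptor. The short sketch below ranks the descriptors on that basis; the dictionary simply restates Table 3, and the helper `exponent_contribution` and the difference of 0.1 are illustrative assumptions.

```python
# Estimated beta_i^2 values, restated from Table 3.
beta_sq = {
    "maximum pore diameter": 1.32,
    "density": 0.0246,
    "surface area (m2/cm3)": 0.0850,
    "surface area (m2/g)": 0.872,
    "void fraction": 0.174,
}

# Smaller beta_i^2 means the kernel value changes faster along that descriptor.
ranked = sorted(beta_sq, key=beta_sq.get)


def exponent_contribution(diff, b2):
    """Term 0.5 * (x_i - x'_i)^2 / beta_i^2 inside the kernel exponent."""
    return 0.5 * diff ** 2 / b2


for name in ranked:
    # Contribution of a normalized difference of 0.1 in this descriptor alone.
    print(name, exponent_contribution(0.1, beta_sq[name]))
```

The three descriptors at the top of this ranking are the density, the surface area (m2/cm3), and the void fraction.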

Table 4. Results of Search for Materials with Maximum Methane Uptake

uptake           x* (quantitative descriptors)           b (building block no.)
219.1 ± 115.6    (11.64, 1.036, 2557, 1871, 0.5802)      (47, 47)
215.6 ± 120.4    (10.21, 1.148, 2438, 2382, 0.6008)      (47, 47)
214.3 ± 112.1    (12.12, 1.029, 2460, 2044, 0.5972)      (23′, 47)
213.1 ± 104.4    (9.574, 1.073, 2336, 2852, 0.6717)      (47, 47)
211.3 ± 105.1    (10.30, 1.100, 2208, 883.6, 0.6435)     (47, 47)
209.2 ± 100.6    (12.32, 0.9576, 2460, 1285, 0.6690)     (47, 47)
207.5 ± 132.5    (7.676, 1.134, 2618, 3144, 0.5820)      (24′, 47)
206.3 ± 121.9    (9.519, 1.204, 2285, 2040, 0.7162)      (47, 47)
204.2 ± 105.4    (15.16, 1.202, 1795, 2811, 0.6072)      (47, 47)
204.2 ± 114.1    (11.29, 0.9968, 2409, 1458, 0.6053)     (47, 28′)

RESULTS AND DISCUSSION

Results for Prediction of Methane Uptake. Among the values of β obtained by training the GP with the quantitative descriptors, we selected those with the smallest error on the test data set. For the control parameters of the training algorithm, we set σ = 0.1 and T = 200. Note that the structural kernel function was not used in the GP when β was trained for the quantitative descriptors. This restricts the effect of the structural information obtained from the structural kernel function to the methane uptake (output). We applied the structural kernel function to the 63 MOF building blocks and calculated the Gram matrix for them. The obtained Gram matrix was positive definite because all

performance and the standard deviation of the error for the models in the test data set (50 samples). We compared the GP with the other nonlinear models: support vector regression (SVR) and neural network (NN). Note that, as with the GP, the

Figure 2. Frequency of building blocks in the data set. The x axis corresponds to the numbers in figure S2 in the supporting information of ref 16. (a) Building block nos. 6−47. (b) Building block nos. 6−47 that terminate in a nitrogen atom. The total size of the data set was 499, and the total number of distinct building blocks was 63.


Figure 3. Building blocks of the materials with the maximum methane uptake.

appears to have been due to the training data size (449 here versus 10 000 in ref 17) and the difference in the attributes of the input variables (the quantitative descriptors). Thus, on the basis of the results shown in Table 2, we have confirmed the effectiveness of using the structural relationship of the MOF building block. A significant difference was found between the RMSE obtained for the linear regression and that for the GP. All of the nonlinear models outperformed the linear regression model. Note that the performance (RMSE and r2) was measured on average. Therefore, to improve the performance, it is useful to investigate the samples (materials) for which the performance was worse than that of the GP with the quantitative descriptors. Further research should thus consider revising the data representation for the structural kernel function (Tables 1 and 2 in the Supporting Information).

Results of the Search for Materials with Maximum Methane Uptake. We conducted five searches, each beginning with a different random seed; T was set to 20, and L1 and L2 were set to 63, as described above. The top ten x* and b values and the mean value and standard deviation of the predicted methane uptake are summarized in Table 4 for the candidate materials with the highest methane uptake. Here, x* is a set of quantitative descriptors: maximum pore diameter, density, surface area (m2/cm3), surface area (m2/g), and void fraction. Moreover, b gives the pair of building block numbers for the MOF, where a prime indicates that a building block has both a carboxylic acid group and a nitrogen atom. The building blocks in Table 4 are also shown in Figure 3. Note that among the building blocks in the data set, no. 47 had the largest number of benzene rings and the greatest length. The predicted standard deviation was rather large because the obtained specifications were not included in the data set (499 samples) and the estimated noise (the jitter) was included.
First, to ensure the validity of the results, we used principal component analysis (PCA)33 for the quantitative descriptors. For the data set (499 samples), the first two principal components contained 96.37% of the cumulative explained variance (information on the data set). In Figure 4a, the data samples are plotted on the 2D subspace spanned by the first and second principal components. Figure 4b shows the data samples with the highest level of methane uptake (>0.7; normalized to the range of 0 to 1).
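The cumulative explained variance quoted above is the sum of the leading eigenvalues of the covariance matrix divided by its trace. The following sketch computes these ratios with power iteration and deflation in plain Python; the function names and the toy three-descriptor data set (whose third column is the sum of the first two, so that two components capture all of the variance) are assumptions for illustration and do not reproduce the 96.37% figure.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]


def mat_vec(C, v):
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def explained_variance_ratios(rows, k, iters=500, seed=0):
    """Ratios of the k largest covariance eigenvalues to the total variance."""
    C = covariance_matrix(rows)
    total = sum(C[i][i] for i in range(len(C)))  # trace = total variance
    rng = random.Random(seed)
    ratios = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in C]
        for _ in range(iters):                    # power iteration
            w = mat_vec(C, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, mat_vec(C, v)))  # Rayleigh quotient
        ratios.append(lam / total)
        # Deflate: remove the component just found before the next pass.
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(len(C))]
             for a in range(len(C))]
    return ratios


# Toy data (assumed): the third descriptor is the sum of the first two.
rows = [[float(i), float((i * i) % 7), float(i + (i * i) % 7)] for i in range(10)]
ratios = explained_variance_ratios(rows, k=2)
print(ratios, sum(ratios))
```

In practice a library routine would normally be used for the eigendecomposition; the explicit loop is shown only to keep the example self-contained.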

kernel function of the SVR was the Gaussian kernel function. The training of the GP and the nonlinear models is described in the Supporting Information (Section 6). Note that in Table 2 RMSE is the root-mean-squared error, and r2 is the coefficient of determination (that is, the squared correlation coefficient). Among the nonlinear models, the GP, the SVR, and the NN performed similarly when only the quantitative descriptors were used. The similarity in the performances of the GP and the SVR is due to the use of the same type of kernel function. For the GP, the estimated values of the automatic relevance determination for the quantitative descriptors are summarized in Table 3. For the jitter term, σ2jit was 0.1085. This shows that the methane uptake (output) in the training data set included ∼10% noise as its variance, which is somewhat high. Because the values of the methane uptake were simulated, the noise was likely due to the estimation accuracy (sampling error) of the simulation in the data preparation, because the number of Monte Carlo steps (50) was somewhat small. Thus it is suspected that the output of the GP was ∼10% lower than it would be for noise-free data. Unlike other nonlinear models (support vector regression, kernel ridge regression, and neural networks),6 the GP has the advantage of being able to estimate the noise level in the data. Note that the output of kernel ridge regression is equivalent to the mean of the GP (see eq 2.27 in section 2.2 of ref 26). As shown in Table 3, the density, surface area (m2/cm3), and void fraction were the dominant input variables. Regarding the linear regression, Fernandez et al. used the void fraction, dominant pore diameter (Å), and gravimetric surface area (m2/g) as the attributes of the inputs.17 The most dominant input variable was the void fraction. For the GP (nonlinear model), the void fraction was also an important attribute.
Therefore, with respect to the other input variables, the differences in their importance appear to have been caused by the difference between linear and nonlinear modeling. Further analysis of these results is left to future research. In the next section, we will use these values in the prediction model when we search for materials that have the maximum methane uptake. The best model was the GP with the quantitative descriptors and structure: RMSE = 0.06194 ± 0.00977 and r2 = 0.7891 ± 0.05281. The RMSE was 12.55 ± 4.215 cm3 (STP)/cm3 at 3.5 MPa. Compared with the previous study (table 1 of ref 17), in which R2 = 0.851 at 35 bar (for the test set), r2 was somewhat low. The difference in the generalization performance


Figure 4. Results of PCA. (a) The data set is plotted on the 2D subspace spanned by the first two principal components. (b) Data samples with the highest value of methane uptake (>0.7; 141.9 cm3 (STP)/cm3 at 3.5 MPa).

materials that would be expected to maximize methane uptake (see Table 4). These specifications were not included in the data set (499 samples), and the predicted maximum methane uptake was larger than that observed for the compounds in the data set. The specifications we obtained were reasonably consistent with values found in the literature. The technical contributions of this study can be summarized as follows:
• We provided a method by which to apply the kernel method to the structural relationship of an MOF (organic building block) to predict the methane uptake.
• We presented a kernel function for which the input consists of the graphs of molecular structures. This function is versatile and can be used effectively for other applications.
• We presented a simple random search algorithm to find materials expected to maximize methane uptake. This algorithm is relatively easy to use and does not require the setting of control parameters, such as step size (learning coefficient).
Our approach makes it possible to consider the structural relationship of MOFs in prediction and search tasks and can be easily adapted for other applications in materials science. As an area of future research, we intend to experimentally validate the predicted properties of the candidate materials.

Figure 5. Results of the plots of the results on the 2D subspace.

Then, we plotted the results for the quantitative descriptors on the 2D subspace, as shown in Figure 5. From the figure, it can be seen that the results lie in the region of high methane uptake because the samples fall in the same region as those in Figure 4b. Therefore, we conclude that the methane uptake of these materials is >0.7 (141.9 cm3 (STP)/cm3 at 3.5 MPa). Next, we compared the values of the quantitative descriptors for the maximum uptake (219.1 cm3 (STP)/cm3 at 3.5 MPa), as listed in Table 4, with the results shown in figure 2 (at 35 bar) of ref 17. We found that a surface area of 2557 m2/cm3 corresponded to a methane uptake range of 150 to 200 cm3 (STP)/cm3, a void fraction of 0.5802 corresponded to a range of 150 to 200 cm3 (STP)/cm3, and a maximum pore diameter of 11.64 Å corresponded to ∼150 cm3 (STP)/cm3. Although there was a slight discrepancy, a material with these values would be expected to have a high uptake of methane. We believe that the discrepancy seen in these results is due to the structural relationships of the MOFs that were used in the prediction model.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jpcc.6b07618. Technical descriptions of the methods and experiments and the results of the experiments. Section 1: Structural Kernel Function. Section 2: Training Algorithm. Section 3: Random Search Algorithm. Section 4: Data Representation for the Structural Kernel Function. Section 5: Gram Matrix. Section 6: Training of Nonlinear Models. (PDF)





CONCLUSIONS

We showed that the structural relationship of the MOF building blocks can be used to predict the uptake of methane. Having established this, we then used the prediction model to conduct a simple random search for the specifications of candidate

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.




REFERENCES

(1) Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 2013, 1, 011002.
(2) Behler, J.; Parrinello, M. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Phys. Rev. Lett. 2007, 98, 146401.
(3) Bartók, A. P.; Payne, M. C.; Kondor, R.; Csányi, G. Gaussian Approximation Potentials: The Accuracy of Quantum Mechanics, without the Electrons. Phys. Rev. Lett. 2010, 104, 136403.
(4) Rupp, M.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 108, 058301.
(5) Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 2013, 15, 095003.
(6) Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; von Lilienfeld, O. A.; Tkatchenko, A.; Müller, K.-R. Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. J. Chem. Theory Comput. 2013, 9, 3404−3419.
(7) Pilania, G.; Wang, C.; Jiang, X.; Rajasekaran, S.; Ramprasad, R. Accelerating materials property predictions using machine learning. Sci. Rep. 2013, 3, 2810.
(8) Lopez-Bezanilla, A.; von Lilienfeld, O. A. Modeling electronic quantum transport with machine learning. Phys. Rev. B: Condens. Matter Mater. Phys. 2014, 89, 235411.
(9) Meredig, B.; Agrawal, A.; Kirklin, S.; Saal, J. E.; Doak, J. W.; Thompson, A.; Zhang, K.; Choudhary, A.; Wolverton, C. Combinatorial screening for new materials in unconstrained composition space with machine learning. Phys. Rev. B: Condens. Matter Mater. Phys. 2014, 89, 094104.
(10) Ghasemi, S. A.; Hofstetter, A.; Saha, S.; Goedecker, S. Interatomic potentials for ionic systems with density functional accuracy based on charge densities obtained by a neural network. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 92, 045131.
(11) Huan, T. D.; Mannodi-Kanakkithodi, A.; Ramprasad, R. Accelerated materials property predictions and design using motif-based fingerprints. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 92, 014106.
(12) Eddaoudi, M.; Kim, J.; Rosi, N.; Vodak, D.; Wachter, J.; O’Keeffe, M.; Yaghi, O. M. Systematic Design of Pore Size and Functionality in Isoreticular MOFs and Their Application in Methane Storage. Science 2002, 295, 469−472.
(13) Yaghi, O. M.; O’Keeffe, M.; Ockwig, N. W.; Chae, H. K.; Eddaoudi, M.; Kim, J. Reticular synthesis and the design of new materials. Nature 2003, 423, 705−714.
(14) Getman, R. B.; Bae, Y.-S.; Wilmer, C. E.; Snurr, R. Q. Review and Analysis of Molecular Simulations of Methane, Hydrogen, and Acetylene Storage in Metal-Organic Frameworks. Chem. Rev. 2012, 112, 703−723.
(15) Martin, R. L.; Haranczyk, M. Optimization-Based Design of Metal-Organic Framework Materials. J. Chem. Theory Comput. 2013, 9, 2816−2825.
(16) Wilmer, C. E.; Leaf, M.; Lee, C. Y.; Farha, O. K.; Hauser, B. G.; Hupp, J. T.; Snurr, R. Q. Large-scale screening of hypothetical metal-organic frameworks. Nat. Chem. 2011, 4, 83−89.
(17) Fernandez, M.; Woo, T. K.; Wilmer, C. E.; Snurr, R. Q. Large-Scale Quantitative Structure-Property Relationship (QSPR) Analysis of Methane Storage in Metal-Organic Frameworks. J. Phys. Chem. C 2013, 117, 7681−7689.
(18) Fröhlich, H.; Wegner, J. K.; Sieker, F.; Zell, A. Optimal Assignment Kernels for Attributed Molecular Graphs. In Proceedings of the 22nd International Conference on Machine Learning; ACM: New York, 2005; pp 225−232.
(19) Fröhlich, H.; Wegner, J. K.; Sieker, F.; Zell, A. Kernel Functions for Attributed Molecular Graphs - A New Similarity-Based Approach to ADME Prediction in Classification and Regression. QSAR Comb. Sci. 2006, 25, 317−326.
(20) Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer-Verlag: Secaucus, NJ, 2006.
(21) Gärtner, T. A Survey of Kernels for Structured Data. SIGKDD Explorations 2003, 5, 49−58.
(22) Ramon, J.; Gärtner, T. Expressivity versus Efficiency of Graph Kernels. In Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, Cavtat-Dubrovnik, Croatia, September, 2003; pp 65−74.
(23) Borgwardt, K. M.; Ong, C. S.; Schönauer, S.; Vishwanathan, S. V. N.; Smola, A. J.; Kriegel, H.-P. Protein Function Prediction via Graph Kernels. Bioinformatics 2005, 21, 47−56.
(24) Ralaivola, L.; Swamidass, S. J.; Saigo, H.; Baldi, P. 2005 Special Issue: Graph Kernels for Chemical Informatics. Neural Networks 2005, 18, 1093−1110.
(25) Vishwanathan, S. V. N.; Schraudolph, N. N.; Kondor, R.; Borgwardt, K. M. Graph Kernels. J. Mach. Learn. Res. 2010, 11, 1201−1242.
(26) Rasmussen, C. E.; Williams, C. K. I. Gaussian Processes for Machine Learning; The MIT Press, 2006.
(27) Neal, R. M. Bayesian Learning for Neural Networks; Springer-Verlag: Secaucus, NJ, 1996.
(28) Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281−305.
(29) The dataset (CIF file) was obtained from http://hmofs.northwestern.edu/hc/crystals.php.
(30) Willems, T. F.; Rycroft, C. H.; Kazi, M.; Meza, J. C.; Haranczyk, M. Algorithms and tools for high-throughput geometry-based analysis of crystalline porous materials. Microporous Mesoporous Mater. 2012, 149, 134−141.
(31) Martin, R. L.; Smit, B.; Haranczyk, M. Addressing Challenges of Identifying Geometrically Diverse Sets of Crystalline Porous Materials. J. Chem. Inf. Model. 2012, 52, 308−318.
(32) Lyubchyk, A.; Esteves, I. A. A. C.; Cruz, F. J. A. L.; Mota, J. P. B. Experimental and Theoretical Studies of Supercritical Methane Adsorption in the MIL-53(Al) Metal Organic Framework. J. Phys. Chem. C 2011, 115, 20628−20638.
(33) Ivosev, G.; Burton, L.; Bonner, R. Dimensionality Reduction and Visualization in Principal Component Analysis. Anal. Chem. 2008, 80, 4933−4944.
