
Article pubs.acs.org/jcim

Quantitative Structure−Property Relationship Modeling of Electronic Properties of Graphene Using Atomic Radial Distribution Function Scores

Michael Fernandez,* Hongqing Shi, and Amanda S. Barnard

CSIRO Virtual Nanoscience Laboratory, 343 Royal Parade, Parkville, Victoria 3052, Australia

Supporting Information

ABSTRACT: The intrinsic relationships between nanoscale features and electronic properties of nanomaterials remain poorly investigated. In this work, electronic properties of 622 computationally optimized graphene structures were mapped to their structures using partial-least-squares regression and radial distribution function (RDF) scores. Quantitative structure−property relationship (QSPR) models were calibrated with 70% of a virtual data set of 622 passivated and nonpassivated graphenes, and we predicted the properties of the remaining 30% of the structures. The analysis of the optimum QSPR models revealed that the most relevant RDF scores appear at interatomic distances in the range of 2.0 to 10.0 Å for the energy of the Fermi level and the electron affinity, while the electronic band gap and the ionization potential correlate to RDF scores in a wider range from 3.0 to 30.0 Å. The predictions were more accurate for the energy of the Fermi level and the ionization potential, with more than 83% of explained data variance, while the electron affinity exhibits a value of ∼80% and the energy of the band gap a lower 70%. QSPR models have tremendous potential to rapidly identify hypothetical nanomaterials with desired electronic properties that could be experimentally prepared in the near future.



Received: July 21, 2015
© XXXX American Chemical Society
DOI: 10.1021/acs.jcim.5b00456 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Journal of Chemical Information and Modeling

INTRODUCTION

The exponential growth in the number of new nanomaterials, along with their increasing complexity, calls for the integration of information technology techniques into nanomaterial discovery. The Materials Genome Initiative (MGI) is an effort to leverage large-scale collaborations between materials scientists and computer scientists and to develop novel scalable approaches to discover, manufacture, and deploy advanced materials twice as fast at a fraction of the cost.1 In silico high-throughput (HT) characterization is being pursued by applying proven computational chemistry methods to multicomponent crystals2−4 and alloys,5 lithium-based batteries,6−8 optically active organic molecules,9 photovoltaic materials,10 metal−organic frameworks (MOFs),11−13 and graphenes.14

Graphene sheets, i.e., single atomic layers from graphite, as exemplified in Figure 1, have emerged as very promising nanomaterials that have attracted considerable attention in recent years because of their unique properties. The electric,15−17 magnetic, and optical properties18−23 of graphene can be tuned by controlling the shape and size of the sheets, which could be incorporated into a wide variety of electronics, optoelectronics, and electromagnetic devices. However, controlling the precise structure of individual graphenes remains challenging, which has hampered experimental combinatorial and HT screening approaches.24,25 In this context, computer simulations of nanomaterials provide large-scale structure−property data that could accelerate ground-breaking applications of nanomaterials and circumvent the need for exquisite control at the atomic level.

Experimental and theoretical evidence shows that the electronic structures and associated properties of nanomaterials are intrinsically linked to the physical structure.15−17 The structure of nanomaterials can be conveniently represented by structural fingerprints encoding topological, geometrical, and/or electronic features.26 Using nanomaterial fingerprints, we can measure the similarity among samples and/or build sophisticated quantitative structure−property relationship (QSPR) models.13,26 Data mining and machine learning algorithms can also be combined to unravel complex structure−property patterns. Moreover, the implementation of powerful function mapping algorithms inside the QSPR framework can increase the accuracy of predictions on large material databases.13,26

In this paper, we investigate the correlations between electronic properties and structural features of a virtual data set of 622 graphenes using partial-least-squares regression (PLSR) and radial distribution function (RDF) fingerprints. Using a genetic algorithm (GA), we explore the most significant RDF scores to predict graphene electronic properties, including the energy of the band gap (EG), the ionization potential (EI), the energy of the Fermi level (EF), and the electron affinity (EA). As we will show, this method of mapping the functional space of nanoparticles yields efficient QSPR models that can aid novel nanomaterial discovery.

Figure 1. Examples of (a) trigonal, (b) hexagonal, and (c) rectangular graphene structures in the data set.

DATA PREPARATION AND COMPUTATIONAL METHODS

Graphene Data Set. For this work, we gathered a data set of virtual nanographene samples with a large range of sizes that cover the ranges observed experimentally. The electronic structures of 622 graphenes ranging from 16 to 2176 carbon atoms were simulated using the density functional tight binding (DFTB) method as described elsewhere.14 The data set includes three classes of graphenes corresponding to hexagonal, rectangular, and trigonal morphologies, as depicted in Figure 1. The initial structures are available free of charge at the CSIRO Data Access Portal (DAP).27 Each structure in the data set is unique and was characterized by the set of RDF scores described below as well as the charge transfer properties EG, EI, EF, and EA derived from the DFTB calculations.28 From this point on, all manipulation, preprocessing, calibration, testing, and analysis of PLSR models with this data set were done in the Python programming language using the scikit-learn machine learning library.29

Radial Distribution Functions. The application of RDF scores and related descriptors was first proposed by Gasteiger and co-workers30,31 to encode the three-dimensional (3D) structures of molecules and crystals32 and has been used by others to examine different structure−property relationship problems.13,33−35 The 3D coordinates of the atoms of the system are transformed into a numerical RDF code that is independent of the size of the structure. Our adapted RDF scores are computed on H-depleted graphene structures and can be interpreted as the probability distribution of finding an atom pair in a spherical volume of radius R according to eq 1:

$$\mathrm{RDF}(R) = \sum_{i,j}^{N} e^{-B (r_{ij} - R)^{2}} \tag{1}$$

where the summation is over the N atom pairs in the graphene structure, $r_{ij}$ is the distance between the atoms of each pair, and B is a smoothing parameter (set equal to 10). The selection of the distance range and bin size can be somewhat arbitrary; here we used a bin size of 0.25 Å and a maximum distance of 30.0 Å, which are based on our previous reports on MOFs.13,34 Changing the bin size produces RDF score vectors of different sizes but with a similar overall fingerprint profile. We explored two maximum interatomic distances of 30 and 60 Å, which approximately represent the maximum interatomic distances of 15% and 50% of the graphene nanoflakes, respectively. Preliminary tests showed that RDF scores up to 30 Å give higher cross-validation accuracies because having fewer scores equal to zero improves the stability of the fold-out cross-validated model during GA-based feature selection. The RDF scores computed using eq 1 are not normalized by the number of atom pairs in the structure, so they can easily discriminate among different molecular sizes. In the case of graphene, the use of RDF scores to characterize the molecular structure is particularly convenient considering that bond lengths, and the distribution of interatomic distances in general, can be derived from experimental structure analysis of graphene films using high-resolution transmission electron microscopy (HRTEM),36 as has been successfully reported by McNerny et al.37

Figure 2. RDF scores averaged over hexagonal (blue), rectangular (black), and trigonal (red) graphenes.

Table 1. Details of the Optimum PLSR Models for EG, EI, EF, and EA of Graphene

property   number of factors   Q2TFO   R2test(a)   Q2F1(b)   Q2F2(b)
EG         39                  0.699   0.696       0.674     0.672
EI         25                  0.79    0.837       0.827     0.824
EF         50                  0.996   0.866       0.869     0.864
EA         35                  0.821   0.808       0.805     0.800

Interatomic distances (Å) of the RDF scores in each model:
EG: 2.75, 3.25, 3.5, 4.25, 4.5, 4.75, 5.5, 5.75, 7.75, 8.25, 9.0, 9.25, 10.25, 11.75, 12.0, 12.5, 14.25, 14.75, 15.25, 16.0, 17.5, 17.75, 18.0, 19.25, 20.0, 20.75, 21.0, 21.5, 22.25, 23.25, 23.5, 23.75, 24.5, 25.75, 26.0, 26.75, 27.25, 27.75, 29.75
EI: 2.0, 2.75, 3.0, 3.75, 5.0, 6.25, 6.5, 7.25, 7.75, 8.25, 9.25, 10.0, 11.5, 12.5, 13.25, 13.75, 15.5, 20.0, 21.75, 23.5, 24.25, 24.75, 25.0, 25.75, 26.25, 26.5, 26.75
EF: 2.0, 2.5, 3.0, 3.25, 3.75, 4.0, 4.5, 5.25, 5.5, 6.25, 7.0, 7.25, 8.0, 8.25, 8.5, 9.0, 9.25, 9.75, 10.75, 11.0, 11.25, 11.75, 12.75, 13.0, 13.25, 13.75, 14.25, 14.5, 15.25, 15.75, 17.0, 17.25, 17.5, 17.75, 18.0, 20.0, 20.75, 21.25, 22.75, 23.0, 24.0, 25.5, 26.0, 26.5, 27.0, 27.5, 27.75, 28.25, 28.5, 29.75
EA: 2.0, 2.75, 3.0, 3.25, 3.5, 4.75, 5.25, 5.5, 6.5, 7.75, 8.5, 9.75, 10.25, 10.5, 11.5, 14.0, 15.0, 15.25, 15.75, 16.25, 16.75, 17.0, 17.5, 18.0, 19.25, 19.5, 19.75, 21.0, 22.75, 23.0, 23.25, 24.5, 26.75, 27.0, 27.75

(a) Squared Pearson’s correlation coefficient of the test set. (b) Q2F1 and Q2F2 are predictive squared correlation coefficients of the test set calculated as described elsewhere.42

Figure 3. Relative linear correlation coefficients of the RDF scores in the optimum PLSR models for (a) EG, (b) EI, (c) EF, and (d) EA of graphene nanoflakes.

Figure 4. Scatter plots of the predictions for the test set consisting of 30% of the data set for (a) EG (R2test = 0.696), (b) EI (R2test = 0.837), (c) EF (R2test = 0.866), and (d) EA (R2test = 0.808) of graphene, where “actual” refers to the values calculated by DFTB simulations and “predicted” corresponds to the QSPR predictions.

Partial-Least-Squares Regression. PLSR finds hyperplanes of minimum variance between the response and independent variables. The predicted variables and the observable variables are projected to a new space, yielding a linear regression model. In our case, the resulting bilinear factor model accounts for the fundamental relation between the matrix X representing the RDF scores and the matrix Y representing each electronic property in the form of latent variable models of the covariance structures in these two spaces. PLS regression is particularly suitable when there is multicollinearity among the X values, such as in the case of the RDF scores of graphenes. The PLS model tries to find the multidimensional direction in the RDF score space that explains the maximum multidimensional variance direction in the electronic properties space according to the general eqs 2 and 3:

$$X = T P^{\mathsf{T}} + E \tag{2}$$

$$Y = U Q^{\mathsf{T}} + F \tag{3}$$

where X is the n × m matrix of RDF scores; Y is an n × p matrix of responses; T and U are n × l matrices that are projections of X (the X score, component, or factor matrix) and Y (the Y scores), respectively; P and Q are m × l and p × l orthogonal loading matrices, respectively; and matrices E and F are the error terms, assumed to be independent and identically distributed random normal variables. The decompositions of X and Y are made so as to maximize the covariance between T and U, where the factor and loading matrices T, U, P, and Q are estimated by constructing the linear regression between X and Y as $Y = X\bar{B} + \bar{B}_0$, where $\bar{B}$ and $\bar{B}_0$ are linear coefficients to be estimated as described elsewhere.38

In this work, PLSR was calibrated with RDF scores for 70% of the data set, while the remaining 30% of the data set was used to test the prediction ability of the models (see the Supporting Information for details). GA was used to automate the selection of the optimum number of factors or components through cross-validation. PLSR models were implemented using the scikit-learn toolbox29 in the Python programming language.

Genetic-Algorithm-Based Feature Selection. GAs are stochastic optimization methods inspired by evolutionary principles. The distinctive aspect of a GA is that it simultaneously explores different regions in parameter space, investigating many possible solutions.39 In addition to selecting the best combination of structural features, the GA output was also used to tune the number of factors or components of the PLSR as described elsewhere.40 In the GA framework, the fitness or cost function of each model was computed as the squared predictive correlation coefficient (Q2) of 3-fold-out (TFO) cross-validation (Q2TFO) to avoid overfitting. The characteristic reproductive cycle was run until the best fitness score remained unchanged for 90% of the generations or the maximum number of generations was reached. This algorithm was run 100 times to train 100 models in each population for a maximum of 100 generations. The best model for each run was selected from the population of the final generation. The statistical analysis of the most informative features was done by histogram plots of the features of the pooled “best” models of the final generations of 100 independent runs. The GA optimization was implemented in the Python programming language.

RESULTS AND DISCUSSION

The RDF fingerprints averaged over hexagonal, rectangular, and trigonal graphenes are shown in Figure 2, where differences among the structure representations can be observed. In graphene, the RDF fingerprints encode the C−C interatomic distances, showing an initial high peak at 2.5 Å. It is worth noting that the peak associated with C−C bond distances cannot be seen in the plot because we compute the RDF score in the range from 2.0 Å to 30.0 Å. We consider that this peak is common to all structures and brings no useful information to the fingerprints. By inspecting the plots, we observe that the RDF peaks mainly differ in height, with rectangular graphenes showing the highest values followed by hexagonal and trigonal structures in that order. Although the positions of the peaks along the interatomic distance range describe very similar patterns, the complexity of the profiles suggests that sophisticated pattern recognition techniques rather than simple correlation methods are required to map the electronic properties to the RDF space.

The RDF fingerprints depict several peaks for interatomic distances that uniquely characterize each graphene in the data set, but only a reduced number of peaks may be relevant for correlating the different electronic properties. In fact, PLSR models calibrated with the RDF scores in the entire range of interatomic distances yield cross-validation accuracies of 0.97, 0.63, 0.74, and 0.75 for EF, EG, EA, and EI, respectively. We further explored enhanced QSPR models with optimum combinations of RDF scores by training simultaneously several PLSR models that evolved for different generations according to the GA implementation previously described. During the GA optimization of the PLSR models, the fitness or cost function (1 − Q2TFO, as described in the Supporting Information) sharply decreases after 10 generations and then stabilizes after 75 generations for all of the electronic properties, albeit with different fitting accuracies (see Figure S1 in the Supporting Information for details). The analysis of 100 different GA runs showed that the majority of the RDF scores are represented in the optimum PLSR models of the four electronic properties, as shown in Figure S2 in the Supporting Information. However, the most relevant RDF scores for the properties EF and EA appear in the interatomic distance range from ∼2.0 to ∼10.0 Å, while EG and EI correlate to RDF scores in the larger range from ∼2.0 to ∼30.0 Å. Details of the optimum PLSR models appear in Table 1, where it can be observed that the predictions of EF exhibit the highest cross-validation accuracy of 0.99 and EG was poorly predicted with a cross-validation accuracy of 0.67, while the QSPR models describe >80% of the variance of the properties EA and EI. Interestingly, the optimum model of the latter property covers the majority of the RDF space with the lesser number of 25 RDF scores from 2.0 to 26.75 Å, while EG and EA each correlate to 35 RDF scores in the range from ∼2.0 to ∼29.75 Å. In turn, EF exhibits the highest correlation with the largest number of 50 RDF scores over the entire interatomic distance range from 2.0 to 30.0 Å.

The PLSR models were further validated by Y randomization, which showed no significant chance correlations, as depicted in Figure S3 in the Supporting Information. We also experimented with nonlinear support vector machines (SVMs).41 However, we found that SVM models with a radial basis function kernel tend to overfit the calibration data during GA feature selection and hyperparameter optimization. The optimum SVM models exhibited cross-validation accuracies of 92%, 66%, 70%, and 71% for EF, EG, EA, and EI, respectively, which are lower than those of the PLSR models in Table 1.

A fundamental advantage of PLSR is that, because of its linear nature, looking into the coefficients of the individual variables in the resulting linear equations provides a straightforward interpretation of the QSPR models. However, in this case the analysis of the coefficients of the PLSR models in Figure 3 reveals rather complex patterns with both negative and positive contributions of the RDF scores that are very evenly distributed across the relevant interatomic distances for each electronic property of graphene.

The RDF scores yielded accurate PLSR models of the electronic properties of the graphenes in the training set, with cross-validation accuracies of approximately 70% or higher. However, the predictions needed further validation on an external test set. To do this, we selected a test set of 205 graphenes that were not used during the calibration process. The scatter plots of the predictions of the four electronic properties for the test set appear in Figure 4, and the R2test values appear in Table 1 along with Q2F1 and Q2F2 calculated as described by Schuurmann et al.42 In general, the correlation coefficients of the test set predictions (R2test, Q2F1, and Q2F2) matched the cross-validation accuracies in Table 1. This reveals that the PLSR models not only “learned” the structural information from the training set with adequate accuracies but more importantly generalized it to the test set, yielding comparable predictions without significant overfitting. As can be observed, the highest accuracy corresponds to EF with R2test = 0.87 (Figure 4c), while EI and EA showed R2test > 0.8 (Figure 4b,d, respectively). Meanwhile, the predictions of EG showed lower accuracy, with R2test ≈ 0.70 (Figure 4a).

Examination of Figure 4 reveals that the QSPR models underestimate the properties of some graphenes, particularly EG and EF, while lower EA values are overestimated. This raises the question of how the QSPR models perform for each graphene class. To investigate the prediction accuracies across the three graphene classes in the data set, the graphenes were labeled according to rectangular, hexagonal, and trigonal classes, and the prediction accuracy was measured for each class. In Figure 5 we compare the predictions of the properties EF and EI for the different graphene classes in the test set. As can be observed, rectangular and trigonal structures exhibit more accurate predictions with R2test > 0.84, while hexagonal graphenes are less accurately predicted with R2test < 0.75. Figure S4 in the Supporting Information depicts similar behavior for the properties EG and EA.

Figure 5. Scatter plots of the predictions of (a, c, e) EF and (b, d, f) EI for (a, b) hexagonal [(a) R2test = 0.634; (b) R2test = 0.743], (c, d) rectangular [(c) R2test = 0.970; (d) R2test = 0.849], and (e, f) trigonal [(e) R2test = 0.846; (f) R2test = 0.925] graphenes in the test set, consisting of 30% of the data set. “Actual” refers to the values calculated by DFTB simulations, and “predicted” corresponds to the QSPR predictions.

The predictions of the highest values of EG in Figure 4 are less accurate, but the PLSR model correctly ranks the majority of their values.
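For readers who want to reproduce the three test-set statistics reported in Table 1, the following sketch computes them from plain arrays. The function name `external_metrics` and the toy numbers are illustrative assumptions; the formulas follow the usual definitions, in which Q2F1 references the training-set mean and Q2F2 references the test-set mean of the property.

```python
import numpy as np


def external_metrics(y_train, y_test, y_pred):
    """Return R2test, Q2F1 and Q2F2 for a set of external predictions."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    press = np.sum((y_test - y_pred) ** 2)  # predictive residual sum of squares
    r2_test = np.corrcoef(y_test, y_pred)[0, 1] ** 2  # squared Pearson r
    q2_f1 = 1.0 - press / np.sum((y_test - y_train.mean()) ** 2)
    q2_f2 = 1.0 - press / np.sum((y_test - y_test.mean()) ** 2)
    return {"R2test": r2_test, "Q2F1": q2_f1, "Q2F2": q2_f2}


# Small worked example with made-up numbers
metrics = external_metrics(y_train=[0.0, 2.0],
                           y_test=[1.0, 2.0, 3.0],
                           y_pred=[1.5, 2.0, 2.5])
print(metrics)  # R2test = 1.0, Q2F1 = 0.9, Q2F2 = 0.75
```

In the worked example the predictions are perfectly correlated with the actual values (R2test = 1) yet systematically compressed, which the two Q2 statistics penalize. Because the test-set mean minimizes the reference sum of squares, Q2F1 can never be smaller than Q2F2 for the same predictions, which agrees with the ordering of the two columns in Table 1.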
This ranking ability suggests that, as has been reported recently for nanoporous materials,13 machine learning models can be successfully applied to classify graphenes into classes with different potential applications rather than as alternative tools to predict the actual values of the graphene electronic properties. Since a certain threshold value of a given property can be indicative of a potential application, classification models can be built in a manner similar to the approach presented here to efficiently discriminate among potential nanomaterial candidates from virtual graphene libraries.

On the other hand, the RDF scores or fingerprints can certainly be improved in order to increase the prediction accuracy by adding a weighting scheme to the calculation of RDF scores in eq 1. For example, in addition to the interatomic distance distributions, RDF scores could use partial charges to better describe the electronic state of the carbon atoms in the graphene flakes.

The controlled synthesis of graphene flakes is currently rather difficult, as is obtaining readable graphene materials for direct applications. Although experimental validation is not possible at the moment, we have shown for over 600 regular graphene flakes with trigonal, hexagonal, and rectangular shapes and different degrees of passivation that electronic properties relevant for industrial applications can be controlled to a large extent via specific interatomic distributions of carbon atoms.
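To make eq 1 and the suggested weighting scheme concrete, here is a minimal sketch of an RDF score calculation on a grid from 2.0 to 30.0 Å with a 0.25 Å step and B = 10, as stated in the methods. Several points are assumptions rather than details taken from the paper: the function name `rdf_scores`, the choice to count each unordered atom pair once, and the form of the optional weighting (the product of per-atom values such as partial charges, in the spirit of atomic-property-weighted RDF descriptors).

```python
import numpy as np


def rdf_scores(coords, weights=None, r_min=2.0, r_max=30.0, step=0.25, B=10.0):
    """RDF scores of eq 1 on a distance grid; coords is an (n_atoms, 3) array in Å."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)  # each atom pair counted once (assumed)
    r_ij = np.linalg.norm(coords[i] - coords[j], axis=1)
    if weights is None:
        w = np.ones_like(r_ij)  # unweighted scores, as in eq 1
    else:
        weights = np.asarray(weights, dtype=float)
        w = weights[i] * weights[j]  # hypothetical atomic-property weighting
    R = np.arange(r_min, r_max + step / 2, step)
    # Loop over the grid to keep memory use low for flakes with many atom pairs
    scores = np.array([np.sum(w * np.exp(-B * (r_ij - R_k) ** 2)) for R_k in R])
    return R, scores


# Toy structure: one carbon hexagon with a 1.42 Å C-C bond length
angles = np.arange(6) * np.pi / 3
hexagon = np.column_stack([1.42 * np.cos(angles),
                           1.42 * np.sin(angles),
                           np.zeros(6)])
R_grid, scores = rdf_scores(hexagon)
print(len(R_grid), R_grid[np.argmax(scores)])  # 113 grid points, peak at 2.5 Å
```

For this single ring the strongest score falls at 2.5 Å, produced by the six second-neighbor distances of about 2.46 Å, while the 1.42 Å bond peak lies below the start of the grid. Passing per-atom values through `weights` rescales each pair contribution without changing the grid.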

CONCLUSION

Intensive research is still needed to understand and control the formation of nanomaterials, but this does not need to hinder the understanding of structure−property relationships. In silico HT screening of virtual nanomaterial libraries combined with machine learning techniques provides a powerful way to approach the structure−property relationship paradigm. Pattern recognition techniques can reveal which structural features are important and to what degree, and they are somewhat consistent with intuitive assumptions. In the case of graphene, RDF scores successfully describe more than 80% of the variance of EF, EI, and EA, while EG is more troublesome, with less than ∼80% explained variance. Each electronic property correlates to atomic arrangements at specific interatomic distances, where short-range distances up to 10 Å account for the properties EF and EA, while EG and EI correlate to atom distributions at wider interatomic distance ranges from 2.0 to 30.0 Å. To the best of our knowledge, this is the first machine learning prediction of electronic properties of graphene solely from its distribution of interatomic distances. The present methodology is a sound approach to efficiently discriminate among potential nanomaterial candidates from virtual graphene libraries.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.5b00456.
Details of the machine learning implementation, calibration, and testing; RDF scores for the training and calibration sets; scatter plots of the QSPR predictions of different graphene classes in the data set; and results of the Y randomization experiments of the optimum PLSR models (PDF)
Implementation of the PLSR models that can be used to predict EF, EI, EG, and EA of graphenes from RDF scores (XLSX)

AUTHOR INFORMATION

Corresponding Author

*Phone: +61 3 9662 7151. E-mail: michael.fernandezllamosa@csiro.au.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

Computational resources for this project were supplied by the Australian National Computing Infrastructure National Facility under Grant q27.

REFERENCES

(1) Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; Persson, K. A. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 2013, 1, 011002.
(2) Curtarolo, S.; Kolmogorov, A. N.; Cocks, F. H. High-throughput ab initio analysis of the BiIn, BiMg, BiSb, InMg, InSb, and MgSb systems. CALPHAD: Comput. Coupling Phase Diagrams Thermochem. 2005, 29, 155−161.
(3) Oganov, A. R.; Glass, C. W. Crystal structure prediction using ab initio evolutionary techniques: principles and applications. J. Chem. Phys. 2006, 124, 244704.
(4) Pickard, C. J.; Needs, R. J. Ab initio random structure searching. J. Phys.: Condens. Matter 2011, 23, 053201.
(5) Morgan, D.; Ceder, G.; Curtarolo, S. High-throughput and data mining with ab initio methods. Meas. Sci. Technol. 2005, 16, 296−301.
(6) Kang, K.; Meng, Y. S.; Bréger, J.; Grey, C. P.; Ceder, G. Electrodes with high power and high capacity for rechargeable lithium batteries. Science 2006, 311, 977−980.
(7) Chen, H.; Hautier, G.; Jain, A.; Moore, C.; Kang, B.; Doe, R.; Wu, L.; Zhu, Y.; Tang, Y.; Ceder, G. Carbonophosphates: A New Family of Cathode Materials for Li-Ion Batteries Identified Computationally. Chem. Mater. 2012, 24, 2009−2016.

(8) Hautier, G.; Jain, A.; Mueller, T.; Moore, C.; Ong, S. P.; Ceder, G. Designing multielectron lithium-ion phosphate cathodes by mixing transition Metals. Chem. Mater. 2013, 25, 2064−2074.
(9) Keinan, S.; Therien, M. J.; Beratan, D. N.; Yang, W. Molecular design of porphyrin-based nonlinear optical materials. J. Phys. Chem. A 2008, 112, 12203−12207.
(10) Olivares-Amaya, R.; Amador-Bedolla, C.; Hachmann, J.; Atahan-Evrenk, S.; Sánchez-Carrera, R. S.; Vogt, L.; Aspuru-Guzik, A. Accelerated computational discovery of high-performance materials for organic photovoltaics by means of cheminformatics. Energy Environ. Sci. 2011, 4, 4849−4861.
(11) Wilmer, C. E.; Leaf, M.; Lee, C. Y.; Farha, O. K.; Hauser, B. G.; Hupp, J. T.; Snurr, R. Q. Large-scale screening of hypothetical metal-organic frameworks. Nat. Chem. 2012, 4, 83−89.
(12) Wilmer, C. E.; Farha, O. K.; Bae, Y.-S.; Hupp, J. T.; Snurr, R. Q. Structure-property relationships of porous materials for carbon dioxide separation and capture. Energy Environ. Sci. 2012, 5, 9849−9856.
(13) Fernandez, M.; Boyd, P. G.; Daff, T. D.; Aghaji, M. Z.; Woo, T. K. Machine Learning Virtual Screening for Rapid and Accurate Recognition of High Performing MOFs for CO2 Capture. J. Phys. Chem. Lett. 2014, 5, 3056−3060.
(14) Shi, H.; Barnard, A. S.; Snook, I. K. High throughput theory and simulation of nanomaterials: exploring the stability and electronic properties of nanographene. J. Mater. Chem. 2012, 22, 18119−18123.
(15) Kosynkin, D. V.; Higginbotham, A. L.; Sinitskii, A.; Lomeda, J. R.; Dimiev, A.; Price, B. K.; Tour, J. M. Longitudinal unzipping of carbon nanotubes to form graphene nanoribbons. Nature 2009, 458, 872−876.
(16) Jiao, L.; Zhang, L.; Wang, X.; Diankov, G.; Dai, H. Narrow graphene nanoribbons from carbon nanotubes. Nature 2009, 458, 877−880.
(17) Campos-Delgado, J.; et al. Bulk production of a new form of sp(2) carbon: crystalline graphene nanoribbons. Nano Lett. 2008, 8, 2773−2778.
(18) Ritter, K. A.; Lyding, J. W. The influence of edge structure on the electronic properties of graphene quantum dots and nanoribbons. Nat. Mater. 2009, 8, 235−242.
(19) Han, M.; Özyilmaz, B.; Zhang, Y.; Kim, P. Energy Band-Gap Engineering of Graphene Nanoribbons. Phys. Rev. Lett. 2007, 98, 206805.
(20) Obradovic, B.; Kotlyar, R.; Heinz, F.; Matagne, P.; Rakshit, T.; Giles, M. D.; Stettler, M. A.; Nikonov, D. E. Analysis of graphene nanoribbons as a channel material for field-effect transistors. Appl. Phys. Lett. 2006, 88, 142102.
(21) Wang, X.; Ouyang, Y.; Li, X.; Wang, H.; Guo, J.; Dai, H. Room-Temperature All-Semiconducting Sub-10-nm Graphene Nanoribbon Field-Effect Transistors. Phys. Rev. Lett. 2008, 100, 206803.
(22) Kinder, J. M.; Dorando, J. J.; Wang, H.; Chan, G. K.-L. Perfect reflection of chiral fermions in gated graphene nanoribbons. Nano Lett. 2009, 9, 1980−1983.
(23) Yang, L.; Park, C.-H.; Son, Y.-W.; Cohen, M.; Louie, S. Quasiparticle Energies and Band Gaps in Graphene Nanoribbons. Phys. Rev. Lett. 2007, 99, 186801.
(24) Park, S.; Ruoff, R. S. Chemical methods for the production of graphenes. Nat. Nanotechnol. 2009, 4, 217−224.
(25) Gao, X.; Jang, J.; Nagase, S. Hydrazine and Thermal Reduction of Graphene Oxide: Reaction Mechanisms, Product Structures, and Reaction Design. J. Phys. Chem. C 2010, 114, 832−842.
(26) Fernandez, M.; Woo, T. K.; Wilmer, C. E.; Snurr, R. Q. Large-Scale Quantitative Structure-Property Relationship (QSPR) Analysis of Methane Storage in Metal-Organic Frameworks. J. Phys. Chem. C 2013, 117, 7681−7689.
(27) Barnard, A. Graphene Structure Set, version 1; CSIRO Data Collection, 2014.
(28) Shi, H. Q.; Barnard, A. S.; Snook, I. K. Quantum mechanical properties of graphene nano-flakes and quantum dots. Nanoscale 2012, 4, 6761−6767.
(29) Pedregosa, F.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
(30) Hemmer, M. C.; Steinhauer, V.; Gasteiger, J. Deriving the 3D structure of organic molecules from their infrared spectra. Vib. Spectrosc. 1999, 19, 151−164.
(31) Hemmer, M. C.; Gasteiger, J. Prediction of three-dimensional molecular structures using information from infrared spectra. Anal. Chim. Acta 2000, 420, 145−154.
(32) Oganov, A. R.; Valle, M. How to quantify energy landscapes of solids. J. Chem. Phys. 2009, 130, 104504.
(33) González, M. P.; Caballero, J.; Tundidor-Camba, A.; Helguera, A. M.; Fernández, M. Modeling of farnesyltransferase inhibition by some thiol and non-thiol peptidomimetic inhibitors using genetic neural networks and RDF approaches. Bioorg. Med. Chem. 2006, 14, 200−213.
(34) Fernandez, M.; Trefiak, N. R.; Woo, T. K. Atomic Property Weighted Radial Distribution Functions Descriptors of Metal-Organic Frameworks for the Prediction of Gas Uptake Capacity. J. Phys. Chem. C 2013, 117, 14095−14105.
(35) Schütt, K. T.; Glawe, H.; Brockherde, F.; Sanna, A.; Müller, K. R.; Gross, E. K. U. How to represent crystal structures for machine learning: Towards fast prediction of electronic properties. Phys. Rev. B: Condens. Matter Mater. Phys. 2014, 89, 205118.
(36) Plachinda, P.; Rouvimov, S.; Solanki, R. Structure analysis of CVD graphene films based on HRTEM contrast simulations. Proc. IEEE Conf. Nanotechnol. 2011, 2687, 764−769.
(37) McNerny, D. Q.; Viswanath, B.; Copic, D.; Laye, F. R.; Prohoda, C.; Brieland-Shoultz, A. C.; Polsen, E. S.; Dee, N. T.; Veerasamy, V. S.; Hart, A. J. Direct fabrication of graphene on SiO2 enabled by thin film stress engineering. Sci. Rep. 2014, 4, 5049.
(38) Geladi, P.; Kowalski, B. R. Partial least-squares regression: a tutorial. Anal. Chim. Acta 1986, 185, 1−17.
(39) Holland, J. H. Adaptation in Natural and Artificial Systems; The University of Michigan Press: Ann Arbor, MI, 1975.
(40) Fernández, M.; Miranda-Saavedra, D. Genome-wide enhancer prediction from epigenetic signatures using genetic algorithm-optimized support vector machines. Nucleic Acids Res. 2012, 40, e77.
(41) Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273−297.
(42) Schuurmann, G.; Ebert, R.-U.; Chen, J.; Wang, B.; Kuhne, R. External Validation and Prediction Employing the Predictive Squared Correlation Coefficient Test Set Activity Mean vs Training Set Activity Mean. J. Chem. Inf. Model. 2008, 48, 2140−2145.