Feature Engineering for Materials Chemistry – Does Size Matter?


Feature Engineering for Materials Chemistry – Does Size Matter? Roger Duncan Amos, and Rika Kobayashi J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00977 • Publication Date (Web): 07 Feb 2019 Downloaded from http://pubs.acs.org on February 10, 2019


Journal of Chemical Information and Modeling

Feature Engineering for Materials Chemistry – Does Size Matter? Roger D. Amos and Rika Kobayashi*† ANU Supercomputer Facility, Leonard Huxley Bldg 56, Mills Rd, Canberra, ACT, 2601, AUSTRALIA International Centre for Quantum and Molecular Structure, College of Sciences, Shanghai University, Shanghai 200444, CHINA



KEYWORDS: Machine Learning, featurisation, materials genome, band gaps, density functional theory.

ABSTRACT: The effect of structural featurisers on the prediction of band gaps has been investigated through machine learning, by application to a silver nanoparticle dataset and to 2254 potential light-harvesting materials with known band gaps. Elemental properties were extended with structural features via Voronoi polyhedra, which allow for neighbour effects and so presumably give a better representation of the extended system. However, we did not find any significant difference in the predictive performance of our model. The biggest improvement to our model came from including band gaps calculated from density functional theory. This resulted in a model which could predict the band gaps of the 2254 light-harvesting dataset with an accuracy reflected in a root-mean-square error of 0.232 eV and a mean absolute error of 0.142 eV. Furthermore, the good performance of our model was transferable to the prediction of a set of 72 experimental band gaps, independent of the training set, giving a root-mean-square error of 0.91 eV and a mean absolute error of 0.76 eV.

INTRODUCTION

Although the name Materials Genome was coined in 2002 as a “brand trademark”, it was not until 2011 that the Materials Genome Initiative really took off, when President Obama announced the provisioning of $100 million in the 2012 US budget1,2. The accompanying White House blog stated, “The initiative will fund computational tools, software, new methods for material characterization, and the development of open standards and databases that will make the process of discovery and development of advanced materials faster, less expensive, and more predictable.” This has spawned many associated enterprises integrating all these features, the most mature being Materials Project3,4 in the United States and Materials Cloud5 coming out of Europe. Both6,7 are based on the idea of a high-throughput workflow feeding into databases, followed by analysis, as illustrated in Figure 1.


Figure 1 Typical high-throughput workflow for materials design

This explosion of Big Data in materials space naturally lends itself to the application of Machine Learning (ML) in the final stage of the workflow. Machine learning in Materials Science is not new. There have been many papers over the years applying machine learning to a variety of properties and systems using many approaches. However, what is new is that the wealth of data now available is sufficient to highlight differences between the many machine learning models, enabling fine-tuning of the various parts of the machine learning process. In particular, the importance of featurisation, the representation of the system, has become apparent. Many of the early machine learning studies were based on property descriptors, but few combined these with a representation of the structure. Only relatively recently has the importance of structure-based descriptors been recognised. Rupp et al.8 demonstrated through the Coulomb matrix method that structural and compositional information is sufficient to infer quantum properties such as the atomization energy, an approach emulated in subsequent studies for a wide range of electronic structure properties9,10.

Intuitively, one would expect that for extended periodic systems such as materials, structural features, e.g. space group, coordination number and distance to nearest neighbour, would be important. Several groups have started work on featurisers combining property and structure-based information, e.g. Isayev et al.’s Universal Fragment Descriptors, which have been demonstrated to show superior performance in enhancing the accuracy of ML models11. Isayev et al.’s Property-Labelled Materials Fragments describe crystals as graphs of vertices corresponding to atoms, “labelled” with elemental chemical and physical properties, and their connectivity. The topology of the system is described by atom-centred Voronoi-Dirichlet polyhedra12,13,14.
Voronoi polyhedra provide a way of dividing the volume around each centre into non-overlapping but space-filling regions, and their faces correspond to nearest neighbours, thus providing information on the local environment. Further structural information is incorporated through properties associated with the shape, size and symmetry of the crystal unit cell. In this way they achieve a good chemical and physical representation of the full 3D molecular network. Similarly, Ward et al.15 have also included crystal structure attributes via Voronoi tessellations in their study of formation energies, and even earlier, Bartok et al.16 used their atomic overlap-based fragment descriptor on small silicon clusters. It has since been pointed out to us that the Voronoi tessellation approach of Ward et al.15 has been demonstrated to be inferior to Faber et al.’s Gaussian distribution function-based representation for the prediction of binding energies17.

The initial aim of our study was to implement structure-based featurisers in the style of Isayev et al. and investigate their performance in predicting material properties of interest. In particular, we have been interested in band gap predictions, as it is well known that density functional theory (DFT) is inadequate for such predictions18,19, significantly underestimating them, being in error by as much as a factor of 2, or even qualitatively wrong, e.g. predicting a material to be a metal when it is known to be a semiconductor. There are machine learning band gap predictions dating back a decade, such as Gu et al.’s20 study of band gaps for binary and ternary compound semiconductors. However, these works were hampered by a paucity of data, that study being limited to 25 (binary) and 28 (ternary) compounds. More recent studies have been able to make use of larger datasets, such as Castelli et al.’s21 band gap predictions of light-harvesting materials suitable for water-splitting devices, using around 2400 materials from the Materials Project database3,4, chosen to have positive band gaps ranging from 0.1 to 14 eV.

Since commencement of this study our attention has been drawn to Matminer22, which implements a variety of featurisers along the lines of what we intended, and thus we have chosen to make use of these. Beyond the elemental descriptors, Matminer provides featurisers to describe the system as a whole, for example Coulomb matrix8, bag-of-bonds23, radial distribution functions and Voronoi polyhedra.

Thus, to investigate the importance of including structural information in our featurisers, we looked at two applications. The first was a silver nanoparticle dataset containing properties, including band gaps, for clusters ranging in size from 13 to 2947 atoms. The second extended the set of elemental-property featurisers from the study of Castelli et al.21 with structural features, via Voronoi polyhedra, to allow for the influence of neighbours.


METHODOLOGY

Data sets

1. Silver Nanoparticle Data Set

This is a set of silver nanoparticle structures and properties downloaded from the CSIRO Data Access Portal24. The structures have been optimized using the density functional tight-binding method (DFTB)25 and range in size from 13 atoms to 2947 atoms, with a variety of shapes, as can be seen in Figure 2. The study was entirely computational, with no experimental data. The dataset contains a large set of values of structural and morphological properties and was used by Sun et al. to investigate the influences on the Fermi level with a view to tuning it to improve electron transport processes26. They tested different ML algorithms and carried out a principal component analysis to determine which characteristics most influenced the Fermi energy.

Panels: Ag13-IH, Ag947-DH, Ag2386-DH, Ag2947-RH

Figure 2 Examples of clusters from the nanoparticle dataset from the smallest (Ag13-IH) to largest (Ag2947-RH). All structures are available from the CSIRO Data Access Portal.

2. New Light Harvesting Materials Data Set

There is a wealth of band gap data archived in the many repositories that have sprung up from the various Material Genome initiatives. However, experimental data is relatively scarce and much of the missing data is supplemented by computation, mostly with relatively simple DFT methods, such as LDA (Local Density Approximation) or Gradient-Corrected functionals (GGA). Improved methods involve using better functionals that are range separated, such as HSE0627, or have corrections for the derivative discontinuity, such as GLLB-sc28,29. We chose to use the GLLB-sc calculated band gap database of Castelli et al.21, available from the Computational Materials Repository30,31. This contains band gaps of around 2400 experimentally known materials showing a band gap at the GGA level, together with their corresponding Materials Project identifiers, which were used to download 2254 structures from the Materials Project repository4. Note that the Materials Project database contains band gap information, but these values are from Perdew-Burke-Ernzerhof (PBE)32 calculations, which may be in error by about a factor of 2.

3. Experimental Data Set

A set of 72 experimental band gaps of semiconducting and insulating materials from a variety of sources33 has been curated by Morales-Garcia et al.34 in their study of band gap prediction by density functional band structure calculations. Values for the experimental band gaps were extracted from their Supporting Information, which categorised them by composition and space group. These could be matched to a Materials Project identifier from which structures could again be downloaded.


Featurisers

Featurisation is the process of representing the relevant information of the system of interest in a numerical format appropriate for machine learning. Matminer provides a sophisticated set of featurisers under the categories: bandstructure, base, composition, dos, function, site and structure.

A full list and descriptions of the featurisers can be found at the Matminer website https://github.com/hackingmaterials/matminer. In this study we used version 0.4.5 and started with the default elemental featurisers, as decided by Ward et al.22,35, from the composition module. This assigns basic properties to each atom or ion individually, with no structural consideration, from a list pre-defined in the Magpie library36. This is a general-purpose list, designed with the particular philosophy that it should be a universal descriptor able to describe as wide a range of properties and materials as possible, and it is not intended explicitly for band gaps. Thus we decided to refine this list further, as discussed in more detail below. Structural features were added via two featurisers involving Voronoi polyhedra from the structure module. The first calculated coordination numbers using the polyhedra (SiteStatsFingerprint.from_preset("CoordinationNumber")), and the second modified the elemental properties by including information from neighbours (SiteStatsFingerprint.from_preset("LocalPropertyDifference")).

Normalisation and standardisation

It is common practice in Machine Learning to scale the features and target, i.e. normalise and standardise them. Normalisation of the features matrix is carried out to ensure that all features are treated on an equal footing, otherwise the longest vector would dominate. This, however, only affects the attribution of features. Scaling the target vector is recommended if its distribution, here that of the band gaps, is a long way from being a normal distribution. A histogram of the band gaps is shown in Figure 3.
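As an illustration of the scaling step, the following is a minimal sketch of standardisation (zero mean, unit standard deviation per feature column) using only the Python standard library. The function name `standardise` and the toy feature values are assumptions made for this example and are not taken from the study; in practice scikit-learn's `StandardScaler` performs the equivalent operation.

```python
from statistics import mean, pstdev


def standardise(columns):
    """Scale each feature column to zero mean and unit standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mu = mean(values)
        sigma = pstdev(values)
        # A constant column carries no information; map it to zeros
        # rather than dividing by zero.
        scaled[name] = [0.0 if sigma == 0 else (v - mu) / sigma for v in values]
    return scaled


# Hypothetical toy features on very different numeric scales.
features = {
    "nuclear_charge": [8.0, 14.0, 22.0, 47.0],
    "electronegativity": [3.44, 1.90, 1.54, 1.93],
}
scaled = standardise(features)
for name, values in scaled.items():
    print(name, [round(v, 3) for v in values])
```

After scaling, both columns span a comparable numeric range, so neither dominates simply because of its units.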


Figure 3 Histograms of band gaps for the silver nanoparticle (left) and light-harvesting materials (right) datasets

Scaling to a normal distribution (or as close as can be managed) is recommended so that any random samples drawn during cross-validation tests are more equally balanced, i.e. to avoid accidental bias through choosing one part of the distribution more frequently than another. We applied standardisation and normalisation throughout, though it must be admitted that it had very little effect upon the results.

Machine Learning Model

Matminer itself does not contain implementations of machine learning algorithms, though it is designed to prepare datasets for hooking into standard machine learning packages. For this investigation we used a random forest regressor37 from the scikit-learn library38, following the recommendation of Ward et al.15. The random forest algorithm builds a “forest” of decision trees (we initially chose 300), each of which partitions the data into subsets of similar values based on a particular feature. The partitioning process is repeated, each new branch representing a different decision rule with different subsets, and at the end all the decision trees are aggregated to make the prediction. We also looked at gradient boosting39, an enhancement of decision tree models, which in our previous experience had been the best performing40. However, though it was possible to reduce the RMSE (root-mean-square error) and MAE (mean absolute error) to less than those found with the random forest method, this was at the cost of higher inaccuracies in the test sets, i.e. a tendency to overfit. Therefore, we have kept with the random forest results.

Cross-validation

Cross-validation is principally used to check the robustness of the fitting process when presented with unseen data, as there is always a danger of overfitting, i.e. if a scheme is trained to represent a particular set of data closely it is unlikely to be able to reproduce results from outside the training set. To estimate how prone a method is to overfitting, a sampling technique is used in which the data is divided into training and testing sets, with the fitting done on the training set and evaluated on the test set. The split into two sets is done randomly, and repeated several times to avoid bias. We have used a 10-fold cross-validation method. It should be noted that random forest regression is less prone to overfitting than some other methods simply because part of the process is, as the name suggests, random.

Cross-validation is also used during the tuning of hyperparameters. These are parameters associated with the training method itself, rather than with the data. The scikit-learn toolkit contains methods for choosing hyperparameters (sklearn.model_selection.RandomizedSearchCV). Applied to our random forest regressors, this showed that all the default parameters in the scikit-learn implementation were already optimal, except the number of trees, which was increased to 500.
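The k-fold procedure can be sketched as follows. This is only an illustration under stated assumptions: a least-squares straight line stands in for the random forest regressor, the data are synthetic, and all function names (`k_fold_indices`, `fit_line`, `cross_validate`) are hypothetical. The use of the sample standard deviation for the spread over folds is also an assumption. scikit-learn offers `KFold` and `cross_validate` for the same purpose.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares fit y = a*x + b (a stand-in for the real model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def mae(model, xs, ys):
    """Mean absolute error of the fitted line on the given points."""
    a, b = model
    return sum(abs(a * x + b - y) for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, k=10, seed=0):
    """Return (mean, std) of the MAE over k folds for train and test sets."""
    train_err, test_err = [], []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
        x_te, y_te = [xs[i] for i in held_out], [ys[i] for i in held_out]
        model = fit_line(x_tr, y_tr)  # fit on the training folds only
        train_err.append(mae(model, x_tr, y_tr))
        test_err.append(mae(model, x_te, y_te))
    return (mean(train_err), stdev(train_err)), (mean(test_err), stdev(test_err))


# Synthetic data: a noisy straight line, purely for illustration.
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [0.5 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]
(train_mean, train_std), (test_mean, test_std) = cross_validate(xs, ys)
print(f"train MAE {train_mean:.3f} +/- {train_std:.3f}")
print(f"test  MAE {test_mean:.3f} +/- {test_std:.3f}")
```

A test error much larger than the training error is the signature of overfitting that the procedure is designed to expose.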


RESULTS

1. Silver Nanoparticle Dataset

A random forest regression, taking the band gap as the target and using the full set of property descriptors described in Ref. 22 and the default set of featurisers from the silver nanoparticle dataset, produced a near-perfect correlation with an R2 value of 0.997, a Root-Mean-Square Error (RMSE) of 0.021 eV and a Mean Absolute Error (MAE) of 0.007 eV, as shown in Figure 4. The RMSE and MAE on the training set are 0.020 ± 0.001 eV and 0.007 ± 0.001 eV respectively, while the equivalents for the test set are 0.052 ± 0.007 eV and 0.019 ± 0.002 eV. These values come from the 10-fold cross-validation, which produces 10 different values of the RMSE and MAE. The values quoted are the mean of all 10 runs, with the standard deviation.
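For reference, the three error measures can be computed as in the sketch below. It assumes that R2 is the coefficient of determination, 1 - SS_res/SS_tot, which the text does not state explicitly; the function name and the numbers are hypothetical and serve only as a worked example.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return (RMSE, MAE, R^2) for paired lists of target values."""
    n = len(actual)
    residuals = [p - a for a, p in zip(actual, predicted)]
    ss_res = sum(r * r for r in residuals)
    rmse = sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot  # coefficient of determination (assumed definition)
    return rmse, mae, r2


# Hypothetical band gaps in eV, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
rmse, mae, r2 = regression_metrics(actual, predicted)
print(f"RMSE = {rmse:.3f} eV, MAE = {mae:.3f} eV, R2 = {r2:.3f}")
```

Because the RMSE weights large residuals more heavily, it is never smaller than the MAE for the same set of predictions, which makes the pair a quick consistency check on reported values.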

Figure 4 Predicted vs calculated band gap (in eV) for the silver nanoparticle dataset

An importance analysis of the features, as shown in Figure 5, found most features were irrelevant to the band gap, with the important ones deemed to be: average diameter, number of Ag atoms, total number of Ag-Ag bonds, total number of Ag-Ag angles, ionization potential and number of surface atoms.

Figure 5 Importance of features for the prediction of band gap in the silver nanoparticle dataset


This contrasts with the features found by Sun et al.26 to be most important for Fermi energy, viz. hexagonal close packed population, shape, amorphous population, surface coordination number 6 and surface coordination number 9. Furthermore, as the total number of bonds and angles are correlated with the number of atoms, these features can also be omitted. Repeating the regression with just data for the average diameter and the ionization potential, or alternatively, the number of silver and surface atoms with the ionization potential still manages to match the predictive performance of the full descriptor set as shown in Figure 6.
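One simple way to spot such redundant descriptors is to compute pairwise Pearson correlation coefficients and keep only one feature from each highly correlated group. The sketch below is an assumption about how this could be done, not the procedure used in this study; the 0.95 threshold, the function names and the toy descriptor values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy descriptors for five clusters (not taken from the dataset).
features = {
    "n_atoms": [13, 55, 147, 309, 561],
    "n_bonds": [42, 234, 696, 1548, 2910],
    "ionisation_potential": [5.9, 5.2, 5.4, 4.9, 5.1],
}
print(drop_correlated(features))
```

Here the bond count is almost perfectly correlated with the atom count and is dropped, while the ionisation potential is retained. The result depends on the order in which features are visited, so the more physically meaningful descriptor should be listed first.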

Figure 6 Predicted vs calculated band gap (in eV) for the silver nanoparticle dataset, with reduced set of features: average diameter and ionisation potential

This surprisingly good predictive performance indicates that even though DFTB is an approximate method, it has succeeded in capturing some basic physics. The DFTB band gaps are probably not very accurate (and there is no experimental data), but the feature importance analysis shows that they depend only on the size of the nanoparticle, the number of atoms, and the atomic ionisation energy. This is in accord with the physics described in Roduner’s discussion of size in nanoclusters41: it is the ratio of atoms on the surface to the total number of atoms, which is inversely proportional to the average diameter of a particle, that provides a key descriptor for the scaling of many properties. Note that we did not explicitly add any structural information, as the results would be hard to improve upon. However, it could be noted that the features we ended up with reflect properties of the whole system, e.g. the diameter of the system, the number of atoms and the ionization potential, and that the homogeneous nature of the particles rendered a more detailed structural description unnecessary. For most solid state studies, though, we still believe it should be necessary to explicitly add structural information.

2. New Light Harvesting Materials Data Set

Unlike the silver nanoparticle dataset, which is essentially self-featurising, this data set needed deeper consideration of the specific addition of various features. Various authors35,36,42,43 have differing suggestions as to which descriptors should be used to describe the elemental properties, depending on their purpose. The default set of elemental featurisers in Matminer35 was designed to describe as wide a range of properties as possible and may be over-engineered for the study of band gaps. For example, when an importance study is performed it shows (Figure 7) that the leading determinant of the band gap is the melting point.


Figure 7 Importance of features for the prediction of band gap in the new light harvesting materials dataset

This seems counterintuitive, though the correlation has been noticed before44,45. Remembering that "correlation does not imply causation", we concentrated on descriptors which are directly related to the electronic properties. In principle, only information about the electrons should be needed as that, with the nuclear positions and charges, determines the Schrödinger equation. We ended up with a slightly larger than minimum set comprising the nuclear charge, ionisation potential, electronegativity, the number of valence electrons of each type (s, p, d, f) and the covalent radius. This is similar to that adopted by Lee et al.42. Using our list of elemental descriptors, normalising and standardising the data and running it through a random forest regressor with 500 trees gave the fit illustrated in Figure 8.


Figure 8 Predicted vs GLLB-sc calculated band gap (in eV) for the new light harvesting materials dataset

This has an MAE of 0.272 ± 0.006 eV and an R2 value of 0.980 for the whole dataset, which is quite good. However, when a 10-fold cross-validation is run to assess robustness against predicting data outside the training set, we get an MAE of 0.283 ± 0.001 eV on the training set, but 0.764 ± 0.016 eV on the test set. Therefore, further improvements are needed. The attributions of feature importance (Figure 9) make more physical sense, however.


Figure 9 Importance of features for the prediction of band gap in the new light harvesting materials dataset following feature selection

Including more structural information via Voronoi tessellation gave the fit shown in Figure 10, apparently making very little difference.


Figure 10 Predicted vs GLLB-sc calculated band gap (in eV) for the new light harvesting materials dataset including structural features

The MAE of the training set in this case is 0.287 ± 0.001 eV and that of the test set is 0.762 ± 0.016 eV. These have hardly changed from the previous values.

One way to further improve the above results is to include an estimate of the band gap from a lower-level method. This approach has been used successfully in the past by Lee et al.42 and Pilania et al.43, who used Perdew-Burke-Ernzerhof32 or modified Becke-Johnson46 band gaps as a descriptor, and it essentially comes at no cost, as such band gaps are readily available from the Materials Project site4. Though the bulk of these have been obtained with the PBE functional, known to significantly underestimate the true band gap, they are able to provide a useful feature in a training set. Adding the PBE band gaps to the previous set of descriptors resulted in the fit shown in Figure 11.

Figure 11 Predicted vs GLLB-sc calculated band gap (in eV) for the new light harvesting materials dataset after inclusion of PBE band gaps as a feature


This has an MAE of 0.142 ± 0.004 eV and an R2 of 0.993, representing a significant improvement over the previous result. A 10-fold cross-validation reveals an MAE of 0.144 ± 0.001 eV on the training set, and 0.400 ± 0.010 eV on the test set. The test set is still underperforming relative to the training set, but it is doing much better than with just elemental descriptors.

So far we have demonstrated that it is possible to reproduce a large set of GLLB-sc calculated band gaps with quite high fidelity using a combination of elemental properties, some structural information and the approximate PBE band gap. However, a greater challenge would be to predict the experimental band gaps. There have been examples of using machine learning to fit experimental data. For example, Zhuo et al.47 fitted nearly 5000 individual band gaps corresponding to compounds with 2500 unique compositions with an MAE of 0.75 eV and an RMSE of 1.46 eV. This is actually better than achieved by simple DFT GGA calculations, but it is surprising that it is not even better. The probable reason is that there are multiple values of band gaps for the same composition, which cannot be distinguished without knowing the structures, which were not available.

Unfortunately, it is difficult to find accurate experimental band gaps for situations where the structure is also known exactly. We have made use of a collection of data by Morales-Garcia et al.34. This provided a set of 72 values of experimental band gaps for systems for which the composition and structure are known, along with the PBE band gap from the Materials Project site, and also some more accurate theoretical numbers using the G0W0 method48. Figure 12 shows the experimental band gaps for this set plotted against PBE results, which as expected do not agree, and against G0W0 results, which show much better agreement.

Figure 12 Experimental vs calculated by PBE (left) and G0W0 (right) band gaps

We have trained our model on GLLB-sc results of a particular, though large, set of compounds. There is no a priori reason to expect this to reproduce experimental values, especially as only 27 of the 72 systems are in the training set. Nevertheless, Figure 13 shows a surprisingly good predictive performance of our model with R2 = 0.90, RMSE = 0.91 eV and MAE = 0.76 eV. It is clear just from the shape of the graph that there is broad agreement. An interesting question is whether the 27 compounds which are in the experimental set and in the training set show better or worse agreement. The predictive fit for the compounds in common with the training set has R2 = 0.90, RMSE = 0.90 and MAE = 0.08 eV. The reason for the low MAE can be seen in the graph: the systems which are in common between the two sets happen to be those which have small band gaps.


Figure 13 Experimental vs predicted band gaps from our model (in eV)

CONCLUSION

The focus of this study was to investigate the importance of including structural information as well as elemental properties for the accurate prediction of band gaps by Machine Learning. In our investigations of the silver nanoparticle and light-harvesting datasets, including a more extended structural representation via Voronoi polyhedra, which can take neighbour effects into account as proposed by many groups, did not have as significant an effect as we expected. The biggest influence came from including an estimate of the band gap from density functional theory. The features found to be of importance were fundamentally the atomic ionisation potential and the number and type of valence electrons, though similar studies showing poorer predictive performance when considering only composition suggest that structure is a factor. It is not clear whether our findings indicate that structure representation by Voronoi polyhedra is not sufficient, or whether a comprehensive list of elemental descriptors is in fact able to adequately represent the full extended system. Indeed, there must be examples of systems where using structural information is a requirement, for example when there are multiple structures with the same composition. There is also a possibility that structural features are indirectly represented, e.g. latently in the PBE calculations. Nevertheless, a more targeted investigation including more examples of differing structures of similar composition should be our objective for further investigation.

ACKNOWLEDGEMENTS

This research was facilitated through a Shanghai University Foreign Expert Grant for RK. We gratefully acknowledge support from NVIDIA through their kind donation of a Titan V.


REFERENCES

(1) National Science and Technology Council (2011) Materials Genome Initiative for Global Competitiveness Washington, DC.
(2) Kalil, T.; Wadia, C. (2011, June 24) Materials Genome Initiative: A Renaissance of American Manufacturing [Blog post]. Retrieved from https://obamawhitehouse.archives.gov/blog/2011/06/24/materials-genome-initiative-renaissance-american-manufacturing.
(3) Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; Persson, K. A. The Materials Project: A Materials Genome Approach to Accelerating Materials Innovation APL Materials 2013, 1, 011002.
(4) “Materials Project” https://www.materialsproject.org/ (accessed September 18, 2018).
(5) “Materials Cloud” https://www.materialscloud.org/ (accessed September 18, 2018).
(6) Jain, A.; Hautier, G.; Moore, C. J.; Ong, S. P.; Fischer, C. C.; Mueller, T.; Persson, K. A.; Ceder, G. A High-Throughput Infrastructure for Density Functional Theory Calculations Comput. Mater. Sci. 2011, 50, 2295-2310.
(7) Pizzi, G.; Cepellotti, A.; Sabatini, R.; Marzari, N.; Kozinsky, B. AiiDA: Automated Interactive Infrastructure and Database for Computational Science Comput. Mater. Sci. 2016, 111, 218-230.
(8) Rupp, M.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning Phys. Rev. Lett. 2012, 108, 058301.
(9) Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Machine Learning of Molecular Electronic Properties in Chemical Compound Space New Journal of Physics 2013, 15, 095003:1-16.
(10) Ramakrishnan, R.; von Lilienfeld, O. A. Many Molecular Properties from One Kernel in Chemical Space CHIMIA Int. J. for Chem. 2015, 69, 182-186.
(11) Isayev, O.; Oses, C.; Toher, C.; Gossett, E.; Curtarolo, S.; Tropsha, A. Universal Fragment Descriptors for Predicting Properties of Inorganic Crystals Nature Commun. 2017, 8, 15679:1-12.
(12) Voronoi, G. Recherches sur les Paralléloèdres Primitives J. Reine Angew. Math. 1908, 134, 198-287.
(13) Dirichlet, G. L. Über die Reduktion der Positiven Quadratischen Formen mit Drei Unbestimmten Ganzen Zahlen J. Reine Angew. Math. 1850, 40, 209-227.
(14) Blatov, V. A. Voronoi–Dirichlet Polyhedra in Crystal Chemistry: Theory and Applications Cryst. Rev. 2004, 10, 249-318.
(15) Ward, L.; Liu, R.; Krishna, A.; Hegde, V. I.; Agrawal, A.; Choudhary, A.; Wolverton, C. Including Crystal Structure Attributes in Machine Learning Models of Formation Energies via Voronoi Tessellations Phys. Rev. B 2017, 96, 024104.
(16) Bartók, A. P.; Kondor, R.; Csányi, G. On representing chemical environments Phys. Rev. B 2013, 87, 184115:1-16.


(17) Faber, F. A.; Christensen, A. S.; Huang, B.; von Lilienfeld, O. A. Alchemical and Structural Distribution Based Representation for Universal Quantum Machine Learning J. Chem. Phys. 2018, 148, 241717:1-12.
(18) Perdew, J. P. Density Functional Theory and the Band Gap Problem Int. J. Quant. Chem. 1985, 28, 497-523.
(19) Baerends, E. J. Density Functional Approximations for Orbital Energies and Total Energies of Molecules and Solids J. Chem. Phys. 2018, 149, 054105.
(20) Gu, T.; Lu, W.; Bao, X.; Chen, N. Using Support Vector Regression for the Prediction of the Band Gap and Melting Point of Binary and Ternary Compound Semiconductors Solid State Sciences 2006, 8, 129-136.
(21) Castelli, I. E.; Hüser, F.; Pandey, M.; Li, H.; Thygesen, K. S.; Seger, B.; Jain, A.; Persson, K. A.; Ceder, G.; Jacobsen, K. W. New Light-Harvesting Materials Using Accurate and Efficient Bandgap Calculations Adv. Energy Mater. 2015, 5, 1400915.
(22) Ward, L.; Dunn, A.; Faghaninia, A.; Zimmermann, N. E. R.; Bajaj, S.; Wang, Q.; Montoya, J. H.; Chen, J.; Bystrom, K.; Dylla, M.; Chard, K.; Asta, M.; Persson, K.; Snyder, G. J.; Foster, I.; Jain, A. Matminer: An Open Source Toolkit for Materials Data Mining Comput. Mater. Sci. 2018, 152, 60-69.
(23) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space J. Phys. Chem. Lett. 2015, 6, 2326-2331.
(24) Barnard, A.; Sun, B. Silver Nanoparticle Data Set. v2. 2017 CSIRO Data Collection DOI: 10.4225/08/595f2a960c870.
(25) Elstner, M.; Porezag, D.; Jungnickel, G.; Elsner, J.; Haugk, M.; Frauenheim, Th.; Suhai, S.; Seifert, G. Self-Consistent-Charge Density-Functional Tight-Binding Method for Simulations of Complex Materials Properties Phys. Rev. B 1998, 58, 7260-7268.
(26) Sun, B.; Fernandez, M.; Barnard, A. S. Machine Learning for Silver Nanoparticle Electron Transfer Property Prediction J. Chem. Inf. Model. 2017, 57, 2413-2423.
(27) Krukau, A. V.; Vydrov, O. A.; Izmaylov, A. F.; Scuseria, G. E. Influence of the Exchange Screening Parameter on the Performance of Screened Hybrid Functionals J. Chem. Phys. 2006, 125, 224106.
(28) Gritsenko, O.; van Leeuwen, R.; van Lenthe, E.; Baerends, E. J. Self-Consistent Approximation to the Kohn-Sham Exchange Potential Phys. Rev. A 1995, 51, 1944-1954.
(29) Kuisma, M.; Ojanen, J.; Enkovaara, J.; Rantala, T. T. Kohn-Sham Potential with Discontinuity for Band Gap Materials Phys. Rev. B 2010, 82, 115106.
(30) Landis, D. D.; Hummelshøj, J. S.; Nestorov, S.; Greeley, J.; Dulak, M.; Bligaard, T.; Nørskov, J. K.; Jacobsen, K. W. The Computational Materials Repository Comput. in Sci. & Eng. 2012, 14, 51-57.
(31) “Computational Materials Repository” https://www.cmr.fysik.dtu.dk/ (accessed October 25, 2018).
(32) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation Made Simple Phys. Rev. Lett. 1996, 77, 3865-3868.


(33) Madelung, O. Semiconductors: Data Handbook, 3rd ed.; Springer-Verlag; New York, 2004.
(34) Morales-García, Á.; Valero, R.; Illas, F. An Empirical, yet Practical Way To Predict the Band Gap in Solids by Using Density Functional Band Structure Calculations J. Phys. Chem. C 2017, 121, 18862-18866.
(35) Ward, L.; Wolverton, C. Atomistic Calculations and Materials Informatics: A review Curr. Opin. Solid State Mater. Sci. 2017, 21, 167-176.
(36) Ward, L.; Agrawal, A.; Choudhary, A.; Wolverton, C. A General-Purpose Machine Learning Framework for Predicting Properties of Inorganic Materials Npj Computational Materials 2016, 2, 16028.
(37) Breiman, L. Random Forests Mach. Learn. 2001, 45, 1-32.
(38) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python J. Mach. Learn. Res. 2011, 12, 2825-2830.
(39) Friedman, J. H. Greedy function approximation: A gradient boosting machine Ann. Statist. 2001, 29, 1189-1232.
(40) Hutchinson, S. T.; Kobayashi, R. Solvent-Specific Featurisation for Predicting Free Energies of Solvation through Machine Learning Submitted for publication.
(41) Roduner, E. Size Matters: Why Nanomaterials are Different Chem. Soc. Rev. 2006, 35, 583-592.
(42) Lee, J.; Seko, A.; Shitara, K.; Nakayama, K.; Tanaka, I. Prediction Model of Band Gap for Inorganic Compounds by Combination of Density Functional Theory Calculations and Machine Learning Techniques Phys. Rev. B 2016, 93, 115104.
(43) Pilania, G.; Gubernatis, J. E.; Lookman, T. Multi-Fidelity Machine Learning Models for Accurate Bandgap Predictions of Solids Comput. Mat. Sci. 2017, 129, 156-163.
(44) Nag, B. R. An Empirical Relation Between the Melting Point and the Direct Bandgap of Semiconducting Compounds J. Elec. Mater. 1997, 26, 70-72.
(45) Li, J.; Zhao, X.; Liu, X.; Zheng, X.; Yang, X.; Zhu, Z. Correlation Between the Band Gap Expansion and Melting Temperature Depression of Nanostructured Semiconductors J. Appl. Phys. 2015, 118, 124304.
(46) Tran, F.; Blaha, P. Accurate Band Gaps of Semiconductors and Insulators with a Semilocal Exchange-Correlation Potential Phys. Rev. Lett. 2009, 102, 226401:1-4.
(47) Zhuo, Y.; Mansouri Tehrani, A.; Brgoch, J. Predicting the Band Gaps of Inorganic Solids by Machine Learning J. Phys. Chem. Lett. 2018, 9, 1668-1673.
(48) Hedin, L. New Method for Calculating the One-Particle Green's Function with Application to the Electron-Gas Problem Phys. Rev. 1965, 139, A796-823.


TOC Graphic:
