
Machine Learning and Statistical Analysis for Materials Science: Stability and Transferability of Fingerprint Descriptors and Chemical Insights

Praveen Pankajakshan,*,† Suchismita Sanyal,† Onno E. de Noord,‡ Indranil Bhattacharya,§ Arnab Bhattacharyya,§ and Umesh Waghmare∇

† Shell India Markets Pvt. Ltd., Bangalore, India
‡ Shell Global Solutions International B.V., The Hague, The Netherlands
§ Indian Institute of Science, Bangalore, India
∇ Jawaharlal Nehru Center for Advanced Scientific Research (JNCASR), Bangalore, India





ABSTRACT: In the paradigm of virtual high-throughput screening for materials, we have developed a semiautomated workflow or "recipe" that can help a material scientist to start from a raw data set of materials with their properties and descriptors, build predictive models, and draw insights into the governing mechanism. We demonstrate our recipe, which employs machine learning tools and statistical analysis, through application to a case study leading to identification of descriptors relevant to catalysts for CO2 electroreduction, starting from a published database of 298 catalyst alloys. At the heart of our methodology lies the Bootstrapped Projected Gradient Descent (BoPGD) algorithm, which has significant advantages over commonly used machine learning (ML) and statistical analysis (SA) tools such as the regression coefficient shrinkage-based method (LASSO) or artificial neural networks: (a) it selects descriptors with greater stability and transferability, with the goal of understanding the chemical mechanism rather than merely fitting the data, and (b) while being effective for smaller data sets such as in the test case, it employs clustering of descriptors to scale far more efficiently, in terms of computational speed, to large descriptor sets. In addition to identifying the descriptors that parametrize the d-band model of catalysts for CO2 reduction, we predict the work function to be an essential and relevant descriptor. Based on this result, we propose a modification of the d-band model that includes the chemical effect of the work function, and we show that the resulting predictive model gives the binding energy of CO to the catalyst fairly accurately. Since our scheme is general and particularly efficient in reducing a large set of descriptors to a minimal one, we expect it to be a versatile tool for obtaining chemical insights into complex phenomena and developing predictive models for the design of materials.



INTRODUCTION

The typical time frame from the initial research and discovery of a material to its incorporation in applications for first use is remarkably long, ∼10−15 years.1 This is partly due to the lack of an integrated approach that considers all aspects of materials development and the continued reliance of research on scientific intuition and trial-and-error exploratory experimentation. It is desirable to have a generic approach that links knowledge about a material, in terms of its structure, processing, and properties, in an insightful way.38 Design of materials from first-principles simulations using advanced High Performance Computing (HPC) infrastructure is an attractive avenue to move away from this empiricism, although the process is still lengthy and time-consuming. The next stage of innovation will lie in automating the process of intuition that is involved during materials synthesis: taking it from idea to the desired material.

What we need is a versatile tool (a) that can analyze and correct the data for obviously erroneous entries, (b) that identifies a minimal set of descriptors from a large dataset that essentially govern the desired property of a material, and (c) whose results are not dependent on a specific subset of the large dataset used in the analysis, thus naturally acquiring transferability. Tools of machine learning can be quite useful in identifying connections between complex phenomena as expressed in such data, particularly when simple mechanistic principles are not readily applicable. Virtual high-throughput screening (VHTS) addresses the issue of the long time frame for production by combining the predictive power of materials modeling with advances in data

Received: October 4, 2016
Revised: April 11, 2017
Published: April 11, 2017

DOI: 10.1021/acs.chemmater.6b04229 Chem. Mater. 2017, 29, 4190−4201


Figure 1. Forward problem: accelerating materials property calculation. Machine learning involves the task of mapping the set of descriptors D (as features) to the KPI y for every candidate material (C) in the database, by engineering them to an extended set XE and fingerprinting to a minimal set X using a selection function or operator f.

analytics and machine learning to bring insights and help in accelerating materials discovery.3,20 Transferring the paradigm from drug design in the pharmaceutical industry to materials science has become realistic thanks to the recent advances in the speed of computational resources, the availability of open-sourced data and knowledge, and the efficiency and stability of materials modeling packages. National programs such as the Materials Genome Initiative1 launched by the U.S. government in 2011 have also provided a significant boost to research in this area. For a comprehensive compilation of different initiatives globally, see ref 39. The goal of this article is to develop advanced machine learning methodologies in conjunction with statistical analyses that can work for a relatively small dataset, for example, the one available2 for catalysts for CO2 electroreduction. Here, we present a semiautomatic recipe that can help a material scientist to start from a raw dataset, build predictive multiscale models, and explain the governing mechanism:
• our Bootstrapped Projected Gradient Descent (BoPGD) method (see the Supporting Information), in the context of a machine learning scheme, scales efficiently to a larger set of descriptors in the data,
• the recipe is transferable to any application where it is important to understand the underlying mechanism governing the pathways, and
• the method fingerprints the sparsest but stable set of descriptors that can predict a given property.
At the heart of our recipe is the "Descriptor Selection" tool that takes in the entire set of descriptors (typically very large) as input and provides the fingerprints that map to the key performance indicator (KPI). The underlying idea behind selecting sparse fingerprints is the principle of Occam's razor or the law of parsimony, which is based on the premise that simpler parametrizations are a useful heuristic for developing and testing theoretical models. In contrast to earlier works,20 where the grouping of data is done in the candidate materials space, in our article the grouping is done with respect to the descriptors or the extended bag of descriptors. This (a) makes it more efficient and (b) gives the added benefit of providing chemical insights. A database in materials science typically consists of descriptors (D) of a set of materials (C). While building the forward model ℳ (see Figure 1), which maps the set of descriptors to the KPI (a target property relevant to an application), often the principal method used in materials science, either for prediction or for classification, is the artificial neural network (ANN).40,41 The ANN is a nonlinear model giving a black-box approach to modeling and analysis, and it is quite useful when the following conditions are satisfied:
• the number of samples or experiments is really large,
• there is not much a priori information or knowledge on the dataset or the science, and
• a goal is accurate prediction within the model domain (i.e., interpolation only) given the above circumstances.
Unlike genomic biodatabases, in materials datasets we often deal with a relatively smaller number of experiments or samples, and there is the definite risk that the developed predictive model is overfit to the dataset. The result of scaling/extrapolating this model to a new dataset is often questionable, especially if there is no backing from materials science to support the findings. While building interpretable models for small or large data, we recommend that the process of mapping from the descriptor space to the KPI space go through intermediate steps of descriptor engineering and then reducing the extended set of engineered descriptors to get the fingerprinted descriptors (Figure 1). This entire process and recipe are described in the Methodology section of this article. Reduction of the dimension of the descriptor space is very important,6 because it helps not only in reducing the complexity of the model24 but also in understanding better the governing mechanism that underlies the true model, ℳ, that generated the data. Another category of methods from computational statistics are methods such as the Least Absolute Shrinkage and Selection Operator (LASSO)7 that shrink the coefficients of the regression model (ℳ) to provide only the sparsest set of descriptors that best describes the KPI, and, in this way, they implicitly perform descriptor selection. In shrinkage-based or search-based methods, there is also a high probability that chance correlations might appear9,10 in single trials. When the descriptors are highly correlated, during multiple trials of the descriptor selection process, LASSO and similar methods select any one of the correlated descriptors.30 This results in inconsistency and noise in the trial-based selection process, making interpretation risky and follow-up difficult for datasets with a larger number of descriptors. An alternative approach is to use methods based on latent variables. The philosophy of latent variable (LV) methods is that observations are a reflection of underlying phenomena that cannot be observed/measured directly. The manifest variables are a reflection of these latent variables and, as such, are not the most interesting for interpretation. Each LV is an orthogonal weighted linear combination of the manifest variables and in that way solves the collinearity problem in regression. The most


common LV methods are principal component analysis (PCA), which operates on a single (X) data block,11 and partial least squares (PLS), which links the two (X and KPI y) data blocks.12 As a regression technique, PLS can be seen as a regularization method, similar to Ridge Regression.13

Descriptor Importance and Fingerprinting. There are many articles expounding the importance of material descriptors, either in accelerating material property calculations or in material design. In ref 8, the authors describe the characteristics desired for a set of descriptors, and we recall them here:
• descriptor calculations must not be as intensive as those for the KPI,
• a descriptor uniquely characterizes the material, as well as property-relevant elementary processes,
• materials that are very different (similar) should be characterized by very different (similar) descriptor values, and
• its dimension should be as low as possible.
While determining the descriptors for both materials prediction and discovery, the choice and number of descriptors should generally be as large as possible at the beginning. In this article, we will define a new term: fingerprint descriptors (see Figure 1). In the machine learning perspective, the fingerprint descriptors are a subset or daughter of the superset of mother descriptors, which are unique for a property and material. It is the fingerprint's dimension or cardinality that is kept as low as possible (sparse), while the original descriptor space can be as large as possible. In addition, the built model that maps the fingerprint descriptors to the property or the KPI is also a unique mathematical mapping. The engineering23 of the descriptors (nonlinear combinations of simple descriptors, D) ensures that the relationship between the fingerprinted descriptors X and the KPI is linear, so that the true model ℳ can be simply expressed within an error ϵ as

ℳ:  y = βX + ϵ

where the fingerprint X is obtained by down-selecting24 the engineered descriptors XE as X := f(XE) and the descriptor engineering with Φ as XE = Φ(D), so that ℳ can be written as

ℳ:  y = βf(Φ(D)) + ϵ

and the determination of the model estimate ℳ̂ reduces to estimating the coefficients β̂ of the model. Once the model ℳ̂ is built, the process of predicting the KPI for a set of new materials, ynew, based on the descriptors Dnew becomes a process of applying the model ℳ̂ to the new material's descriptors, such that

ℳ̂:  Dnew → ŷnew

We demonstrate that it is critical how these fingerprint descriptors are chosen and how the unique mapping is uncovered/estimated, especially while dealing with small datasets. Conventional statistical and machine learning approaches have stability issues21 when relatively small datasets are involved: when several trials of the approach are run, the results are often not consistent across the trials.30,31 They very often overtrain the model to the specific dataset, with the result that extrapolation of the model to a new dataset can be problematic. Our formulated recipe is shown to overcome such stability issues and therefore lends itself to robust descriptor selection in small datasets, as well as being scalable to larger datasets.

Case Study. To demonstrate our recipe, we choose the case study of catalyst screening for the problem of CO2 electroreduction.2 Electrochemical reduction of CO2 into value-added chemicals using renewable energy is looked upon as a promising approach toward achieving carbon neutrality, while providing a method to store or utilize renewable energy from intermittent sources, both reducing our dependence on fossil fuels.14 Electrocatalysts are needed to bind and activate CO2 in order to reduce the high overpotentials typically encountered. Also, catalysts can drive selective formation of desired products. During the past few decades, efforts have mostly focused on different metal catalysts and the various products that can be formed using those metals.15,16 Ma et al.2 have demonstrated the use of machine learning methods such as an artificial neural network, trained with a dataset of ab initio adsorption energies and electronic fingerprints of idealized bimetallic surfaces, to capture complex, nonlinear interactions of adsorbates (e.g., *CO) on multimetallics with ∼0.1 eV error, outperforming the known two-level interaction model22 or the d-band model18 in prediction. With this model and the well-established descriptor-based catalyst design approach, they have identified promising {100}-terminated Cu multimetallics with a lower overpotential and possibly higher selectivity in CO2 electroreduction to C2 species, from a calculated database. The database itself contains 298 alloys with 13 descriptors (see Table 1 for notations and value ranges) and 1 key performance indicator (KPI) or target property, the binding energy of the adsorbed CO on a metal surface, ΔECO. For generating the database, electronic structure calculations were performed within the framework of spin-polarized density functional theory (DFT) implemented in Quantum ESPRESSO, using the Perdew−Burke−Ernzerhof (PBE) parametrization of the gradient-corrected exchange-correlation functional.

Table 1. Descriptors, Their Notations, and Value Ranges (a)

descriptor | notation | value range
filling of a d-band | f | 0.87−0.98
center of a d-band | εd | −4.49 to −0.75 eV
width of a d-band | Wd | 0.86−2.16 eV
skewness of a d-band | γ1 | −0.8 to 4.06
kurtosis of a d-band | γ2 | 2.75−29.99
work function | WF | 5.01−6.74 eV
atomic radius | r0 | 1.38−1.59 Å
spatial extent of d-orbitals | rd | 0.67−1.04 Å
ionization potential | IE | 7.58−9.23 eV
electron affinity | EA | 0.56−2.31 eV
Pauling electronegativity | χ0 | 1.9−2.54
local Pauling electronegativity | χ | 1.49−2.54
adsorbate-metal interatomic d-coupling matrix element squared | Vad2 | 1−3.9

(a) We note that the center of the d-band here is obtained relative to the Fermi energy.2

Notations. We use D ∈ ℜN×P to represent the P-dimensional set of descriptors (listed in Table 1) and y ∈ ℜN×1 to represent the KPI for each of the N candidate materials. β ∈ ℜK represents the set of coefficients of a model, having a reduced dimensionality K.
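To make the mapping y = βf(Φ(D)) + ϵ concrete, a minimal sketch of the forward model is given below, assuming a scikit-learn environment. The selection step here is a generic stand-in (SelectKBest), not the BoPGD operator used later in this article, and D_train, y_train, and D_new are hypothetical arrays.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# A minimal stand-in for y = beta * f(Phi(D)) + eps: Phi = polynomial descriptor
# engineering, f = a generic selection operator, beta = linear coefficients.
forward_model = Pipeline([
    ("engineer", PolynomialFeatures(degree=2, include_bias=False)),  # Phi: extended set X_E
    ("fingerprint", SelectKBest(score_func=f_regression, k=5)),      # f: down-select to X
    ("regress", LinearRegression()),                                 # beta-hat
])
# forward_model.fit(D_train, y_train); y_new_hat = forward_model.predict(D_new)
```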




METHODOLOGY

In Figure 2, we show a simple graphical abstract of the workflow that is generally adopted for the entire process involved in developing a model or drawing insights. This is a generic process recipe for any application that might involve going from a dataset to fingerprinting the descriptors to model generation and discovery. We cannot insist enough on the involvement of the subject matter expert at all stages of the process, from data preparation to the drawing of insights and the definition of the final model. As mentioned in ref 8, historically the reliance was on chemical intuition for deciding the important descriptors for a particular application and developing the relationship between them for best explaining the physical property. Our idea here is to automate the entire process leading to chemical insights and discovery by rational design, with minimal necessity of chemical intuition. We demonstrate this entire process for the specific case study described in the section above. According to this flowchart and the application to the case study, this article is divided into the following sections: data preprocessing, exploratory analysis, descriptor fingerprinting, building models and testing hypotheses, drawing insights with the subject matter expert, and conclusions.

Figure 2. Recipe for going from data to fingerprinting descriptors to insights to models and discovery. It involves the following steps: (1) preprocessing, (2) exploratory analysis, (3) fingerprinting descriptors, (4) statistical model or linear/nonlinear model building and validations, and (5) insights from a subject matter expert.

Data Preprocessing. Preprocessing of the data is important to improve the quality and reliability of the data. It can, in fact, add some insights that are quite useful in terms of the process involved in the generation of the data. This is discussed in detail in the section on drawing insights from outliers as part of the exploratory analysis.

Handling Missing Data. There might be cases when data are missing from the database. If there is only a limited set of discontinuous data missing from a few of the descriptors, one might consider data imputation techniques to fill in the missing information. Most often, in materials databases, it might make sense to discard the entire row (remove the material ci), even if only a few columns are missing, to avoid having any error in the imputations. Including a row with missing columns could propagate errors to the model or to the insights derived from it. The data that we used in this article, from ref 2, did not have any missing values.

Normalize. Because of the enormous variations in the range of the input and the output data, in order to do a fair comparison within the descriptors (simple or extended), we normalize the entire dataset or the engineered dataset before processing it. Normalizing the attributes using z-score normalization is done by

X̃ = (X − μX)/σX

where μX and σX are the population mean and standard deviation for each column of X. Since the population means and standard deviations are not available, we can estimate them from the dataset. A naïve bootstrapping process27 is a method to re-create several subsets of the original dataset by randomly sampling subsets from it with replacement. In random sampling with replacement, once a sample is chosen from a bag of samples, it is replaced back into the original bag of data before the next sample is drawn from the same bag. For estimating the mean and the variance, we choose several hundred draws or trials of the data and calculate the empirical bootstrapped distribution of the means and variances. These are averaged to estimate the μX and σX values.

Exploratory Analysis. The original dataset before normalization is quite diverse in terms of the KPI (see Figure 3), distributed in the range between −2 eV and 0 eV. Except for three descriptors (IP, EA, and EN), there is a good spread in the descriptor range as well (see Figure 3). On evaluating the statistical outliers for these descriptors, it was noticed that the entries with Ag as the substrate had the most outliers, in comparison to other alloys, for CDB, SDB, and KDB. In Figure 3, we show a violin plot for the z-score normalized descriptors. The long whiskers on these three descriptors correspond to outliers that are well outside the limits imposed by the fences Q1 − 1.5IQR and Q3 + 1.5IQR, where IQR is the interquartile range28 and Q1 and Q3 are the first and third quartiles, respectively. The IQR is a measure of variability, based on dividing a dataset into quartiles. Quartiles divide a rank-ordered dataset into four equal parts. The values that divide each part are called the first, second, and third quartiles, and they are denoted by Q1, Q2, and Q3, respectively. For data that are symmetric and almost normally distributed, the fences are within the 3σ bounds. The violin plot is similar to the box plot conventionally used by statisticians to see the spread of the data, the mean values, and the outliers. In addition to these statistical values, the violin plot also captures the shape of the probability density function for the different descriptors. In the violin plot of Figure 3, the black line in the middle is the mean for each descriptor and the red line gives the median. Since the data are normalized, the black line is along the zero center.
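As an illustration of the normalization step just described, the following sketch (NumPy assumed; the number of draws is arbitrary) z-scores each descriptor column using bootstrapped estimates of the mean and standard deviation.

```python
import numpy as np

def bootstrap_zscore(X, n_draws=500, seed=0):
    """z-score each column of X using bootstrapped estimates of the mean and
    the standard deviation, as described above (a sketch of the preprocessing)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    means, stds = [], []
    for _ in range(n_draws):
        sample = X[rng.integers(0, n, size=n)]  # draw rows with replacement
        means.append(sample.mean(axis=0))
        stds.append(sample.std(axis=0, ddof=1))
    mu, sigma = np.mean(means, axis=0), np.mean(stds, axis=0)
    return (X - mu) / sigma
```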


Figure 3. Violin plot showing the diversity of the KPI (BE) and the 13 descriptors. The descriptors KDB, SDB, and CDB have outliers outside the data limit fences imposed by Q1 − 1.5IQR and Q3 + 1.5IQR (refer to the text), as shown by the lengthy whiskers, while the diversity and spread of the three descriptors IP, EA, and EN is sparse. The black lines in the violin plot are the mean values for the descriptors, here zero as the data are normalized, while the red lines indicate the median values.
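For reference, the whisker fences used in Figure 3 can be computed as below; this is a generic sketch (NumPy assumed), and the column indices in the usage comment are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(x):
    """True for entries lying outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR
    used for the whiskers in Figure 3 (a generic sketch)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Hypothetical usage: drop alloys that are outliers in CDB, SDB, and KDB at once
# (i_cdb, i_sdb, i_kdb are assumed column indices in the descriptor matrix D):
# bad = iqr_outlier_mask(D[:, i_cdb]) & iqr_outlier_mask(D[:, i_sdb]) & iqr_outlier_mask(D[:, i_kdb])
# D, y = D[~bad], y[~bad]
```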

The median function takes a vector, sorts it in ascending or descending order, and then takes the middle value or the average of the two middle values in the sorted result. The median is a robust estimate of the mean and ensures that extreme low or high values do not affect the estimation of the mean. From the violin plot in Figure 3, we find that the descriptors KDB (γ2), SDB (γ1), and CDB (ϵd) have outliers outside the data limit fences. The higher-order moments, kurtosis and skewness, usually were found to have larger outliers. The alloys "Ag−Sc@Ag", "Ag−Y@Ag", and "Ag−Zr@Ag" were the outliers for which all three descriptors (KDB, SDB, and CDB) had values well outside the statistically significant fences and cannot simply be ignored. These are eliminated from any further evaluation here. Therefore, the number of entries in the database to analyze decreased to 295. We generally found that the outliers in the DFT calculations of the d-band characteristics are high for the bilayered alloys (of the form "X−Y@Ag") of Ag. The three descriptors IP, EA, and EN have a very discontinuous distribution of the data, which might highlight the fact that these were probably added to the database from a published table available in the literature. Since the dataset descriptor dimension is small (here, 13), the data can be visually analyzed, pairwise, for some insights. In Figure 4, the KPI (y), which is the binding energy of the adsorbed CO on a metal surface (ΔECO), is evaluated against one of the descriptors D, the filling of a d-band (f). We observe that the descriptor f has values ranging between a minimum of 0.87 and a maximum of 0.98 (see Figure 4). For copper, silver, and gold, the range of the d-band fillings is highly limited, between 0.96 and 0.98. We observe a shift in the d-band filling fraction from gold (Au) to silver (Ag) to copper (Cu). The low reactivity of these metals is reflected in the very small change in this descriptor f. These metals also do not interact easily with air/oxygen, which is possibly the reason for the "almost" filled d-subshells.

Figure 4. Scatter plot between the "Filling of a d-band" (FDB) f and the "Binding Energy of adsorbed CO on a metal surface" (BE) ΔECO (in eV). The alloys are grouped into three clusters by their core metal, with Ni−Pt in group 1, Pd in group 2, and Au−Cu−Ag in group 3.

Similarly, seven other descriptors (WF, AR, SE, IP, EA, EN, and VAD2) are dependent only on the substrate and do not vary for the ligand layer on top of the substrate. If we treat all the different members belonging to a common substrate and the rest in subsequent substrates, we find that these seven descriptors are very useful in discriminating between clusters but do not offer any new information within a cluster. By clustering of descriptors, we refer to the grouping of descriptors in such a way that there is a common link that unifies the elements of a cluster. On a cursory analysis, by looking visually at the scatter plot, it does seem that the binding energy ΔECO and the d-band filling f are independent. This could point to three possibilities: (a) the "filling of the d-band", FDB (f), is not unique for the alloys,2 or


There are multiple insights we get from the Pearson's correlation plot in Figure 5, which can be connected with the d-band model17 that captures the reactivity of metallic surfaces:
(1) Most of the d-band characteristics are directly or inversely correlated with each other and, in turn, with the binding energy: the "filling of the d-band" (f) has a direct relationship with the "skewness" SDB (γ1) and the "kurtosis" KDB (γ2), while f has an inverse relation with the "center of the d-band" CDB (εd) and the "width of the d-band" WDB (Wd). Unlike in ref 18, we did not find any correlation between ϵd and Wd, which might point to possible independent causal links between f and ϵd and between f and Wd.
(2) There is a strong correlation between the "binding energy" BE (ΔECO) and the "filling of a d-band" FDB (f).
(3) There is a direct and strong linear relationship between the "ionization potential" (IP) and the "Pauling electronegativity" (EN) χ0.
(4) There is a direct relation between the "adsorbate−metal interatomic d-coupling matrix element squared" (Vad2) and the "width of a d-band" Wd.
(5) There is a strong direct relation between the "spatial extent" SE (rd) and the "adsorbate−metal interatomic d-coupling matrix element squared" Vad2.
(6) There is a direct relation between the "spatial extent" SE (rd) and the "Pauling electronegativity" EN (χ0).
It is well-established that the d-band characteristics indeed have a strong relationship with the BE, which validates the first finding above and that observed from Figure 6. Generally, there is a nonlinear relationship between the d-band characteristics and the BE. It has been well established that the d-band model developed by Nørskov et al.17 presents a theoretical framework for the description of basic trends in the reactivity of different systems. In ref 18, the authors mention that, according to tight-binding theory, the "width of the d-band" (WDB) Wd can be related to the "interatomic matrix element" (Vad), which describes bonding between an atom and its environment. In fact, rd(i), a characteristic length for metal i, is related to the spatial extent of the d-orbitals of that metal. The matrix element Vad and rd are related as Vad ∝ (rd(i))3/2. This nonlinear relationship could possibly explain the slight weakness that we notice in the Pearson's correlation heatmap (see Figure 5) between the matrix element Vad and the spatial extent. In fact, in the later clustering analysis, discussed in the next section, we found that these two parameters are so synonymous and similar in their values that they could be replaced by a single unique descriptor rd that describes both. The WDB is an intrinsic property of the metal, whereas Vad is a property of the molecule−substrate interaction. However, the spatial extent of the d-orbitals (rd) determines the coupling/overlap between the two d-bands in the same metal, as well as Vad. Therefore, WDB and Vad are both determined by rd, and this comes out of our analysis. This is also one of the strengths of this approach: it is able to bring out the hidden correlations and the underlying descriptor (here, rd), which is not evident in the original database. In ref 19, Watson and Bennett mention that the Pauling electronegativity (χ) can be expressed in terms of the s and p ionization potentials. This could be one of the reasons for the strong correlation between the IP and the EN. Interestingly, we did not find a strong correlation between the

(b) they are unique but the resolution is low or numerically truncated, or (c) the way that the descriptor is generated during DFT should be re-evaluated. However, on a detailed analysis using the alloy labels, we find that there is a grouping in the scatter plot in terms of the core metal (Ni, Pt, Pd, Au, Cu, Ag) of the alloy and no relationship with the shell metal in the alloy. The relationship between FDB and BE is strongly established in the Pearson's product moment correlation, given by the heatmap of Figure 5 and the graph plot of Figure 6.

Figure 5. Pearson’s Pairwise Product Moment Correlation Heatmap among the KPI: binding energy (BE) and the different descriptors. The strong positive correlation between the FDB and BE is given by the yellow boxes in the plot.

Figure 6. Graph plot showing the Pearson’s correlation between the variables described in Figure 5. Green color shows positive correlation while red color shows negative correlation between the KPI and the descriptors. The thickness of the links shows the strength of the relationship. The correlation cutoff is placed at 0.7 for better visibility of the relation.

The P columns in D are the individual descriptor vectors; when they are similar, the correlations are close to 1 or −1 (yellow and dark blue in the plot), and when they are dissimilar (green in the plot), the correlations are close to zero.
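A heatmap such as Figure 5 can be reproduced from the correlation matrix of the KPI and the descriptors; the following sketch (NumPy assumed) is illustrative only and assumes y and D are the arrays defined in the Notations subsection.

```python
import numpy as np

def pearson_matrix(D, y):
    """Pairwise Pearson correlations among the KPI and the descriptors,
    the quantity visualized in Figures 5 and 6 (a sketch)."""
    M = np.column_stack([y, D])          # first column = BE, then the descriptors
    return np.corrcoef(M, rowvar=False)  # (P + 1) x (P + 1) correlation matrix

# Example: indices of strongly related pairs, using the 0.7 cutoff of Figure 6
# corr = pearson_matrix(D, y)
# strong_pairs = np.argwhere(np.triu(np.abs(corr) > 0.7, k=1))
```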


"local Pauling electronegativity" (LEN), χ, and the "Pauling electronegativity" (EN), χ0, but the Euclidean distance, d(χ, χ0), between the two is very small. This could be attributed to the fact that the Pauling electronegativity data are not very spread out (see Figure 3), unlike the local electronegativity data, which are very diverse. Similarly, from Figure 3, we see that the electron affinity and the ionization potentials are also very sparse in their spread.

Bootstrap Projected Gradient Descent (BoPGD) Method. In ref 20, Meredig and Wolverton have introduced a "Clustering-Ranking-Modeling" (CRM) framework for identifying the unique descriptors that can predict the characteristics of a new dopant. Their idea is very simple: cluster the dopants together using the X-means algorithm,20 rank the descriptors using a regression fit, and then model the behavior within each cluster using the unique descriptors. This works well if there are indeed clusters within the different sample datasets (here, the four dopant clusters). Using a regression model for ranking the descriptors, however, forcefully fits all the descriptors into the prediction model for the target property, with a high risk that even those descriptors that do not contribute significantly and might be redundant are included. In addition, such a descriptor selection problem has stability issues when several batches of the training sets are used, with each batch run giving a different set of descriptors. Finally, the selected descriptors are often the ones that best predict the target property but might not necessarily point to the governing mechanism. When the nature of the relationship between the KPI and the descriptors is not known, a hypothesis can be built assuming a nonlinear relationship between the two, as in tools like ANNs, Kernel Ridge Regression (KRR), Random Forests, or Support Vector Regression (SVR). In such cases, the entire dataset is used and the model is built by using only a part of the data as an input for training. Perturbation analysis of the resulting model, as observed in ref 2, yields the descriptor importances, i.e., the descriptors that might contribute most to the KPI prediction. Post-analysis based on perturbation yields those variables that produce an accurate mapping of the descriptor space to the KPI but need not necessarily indicate dependence on the governing mechanism. In this article, we have used hierarchical clustering and the graph plot (Figure 6) as a precursor to the Bootstrapped Projected Gradient Descent (BoPGD) algorithm,25 which is discussed in the Descriptor Fingerprinting by the BoPGD Algorithm section of the Supporting Information. The results obtained from the BoPGD fingerprinting algorithm are given in Table 2.
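The clustering precursor mentioned above can be sketched as follows; this is not the BoPGD algorithm itself (which is detailed in the Supporting Information) but a generic correlation-based hierarchical clustering of the descriptors, with an assumed SciPy implementation and an arbitrary distance threshold. Table 2 lists the fingerprints that the full BoPGD procedure returns.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_descriptors(D, threshold=0.3):
    """Group descriptors whose correlation distance (1 - |r|) is small, a generic
    stand-in for the hierarchical-clustering precursor used before BoPGD
    (the threshold is an assumption, not a value from the paper)."""
    corr = np.corrcoef(D, rowvar=False)
    dist = 1.0 - np.abs(corr)                          # distance between descriptors
    condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed distance vector
    labels = fcluster(linkage(condensed, method="average"),
                      t=threshold, criterion="distance")
    return labels                                      # cluster label per descriptor
```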

Table 2. Fingerprinted Descriptors Obtained from the BoPGD Algorithm (a)

No. | clusters/descriptor | description | contribution (%)
1 | SE, VAD2 | spatial extent, adsorbate−metal interatomic d-coupling matrix element squared | 39.1
2 | LEN | local Pauling electronegativity | 37.4
3 | AR | atomic radius | 11.6
4 | CDB | d-band center | 8.7
5 | WF | work function | 3.2

(a) The KPI binding energy ΔECO is a function of these fingerprinted descriptors alone.

Statistical Analysis: Uncovering the Governing Mechanism. Principal Component Analysis. Principal component analysis helps in understanding the amount of variability in the data. Any data D ∈ ℜN×P can be decomposed into its K primary principal components, such that the loading matrix P is orthogonal, with P′P = I, where I is the identity. Therefore, expressing the descriptors as a linear combination of the principal components:

D ≈ TP′ + E = ∑i=1..K ti pi′ + E

where T contains the scores and P the loadings. The scores contain information on the materials and the loadings on the descriptors. In Figure S1 in the Supporting Information, the score plot of the first two principal components clearly shows a strong clustering of the materials. Since Au is clustered into two different groups, it might be either that the second cluster, highlighted by the yellow circle in Figure S1, is an outlier, or it is likely that some intermediate materials that would link the two clusters together for the Au substrate are missing. We intuitively believe that the latter might be the case: there are a few alloys with Au as the substrate that are not included in these study calculations. From our analysis, we found that Principal Component 1 has high positive loadings for many variables, negative loadings for KDB, FDB, and SDB, and 0 for CDB. It separates "rather purely" (because the Principal Component 2 score is close to 0) the clusters corresponding to "@Pt" (highest score), "@Pd" (close to average score), and "@Cu" (lowest score). Principal Component 2 has the largest negative loading for CDB and positive loadings for SDB, AR, and FDB, and it separates the "@Ag" from the "@Ni" clusters, but it also has large within-cluster variability for "@Ag".

Partial Least Squares (PLS) Regression To Uncover the Latent Governing Mechanism. While the decomposition is done on the descriptor space D in the case of PCA, one must choose between orthogonal scores or orthogonal loadings in PLS. In the present study, the first option was used. The reason to perform partial least squares (PLS) analysis instead of ordinary LS regression is that the datasets D and y may contain random noise, which should be excluded from the regression. Decomposing D and y into a latent space can ensure that the regression is performed only on the most reliable variation. PLS regression was applied to link D to y. Different cross-validation schemes were applied, and 5 LVs were selected as the optimum model complexity. This model explains 90% of the variance in D and 92% of the variance in y (i.e., R2 = 0.92), and the estimated prediction error is 0.17.

Model Performance and Fingerprinted Descriptors Validation. We first define the Median Absolute Error (MAE) performance metric that will help us in evaluating the sanity of selecting only a few of the descriptors:

MAE = Median(|ŷi − yi|),  ∀ i = 1, 2, ..., N

As mentioned in the subsection entitled "Exploratory Analysis", while using violin plots, the median is a robust estimate of the mean. The median metric ensures that extreme low or high values do not skew the evaluation and the validation process. The above metric can be used to evaluate the predicted property (ŷi) against the actual property (yi) for


Figure 7. (a) Absolute error bar graph obtained from the difference between the actual property yi and the predicted property ŷi for the different sets of test data. The training data fraction varies from 5% to 95%, and for each training data fraction, T = 10 000 trials were generated. We see from the graph, as expected, that as the training size decreases, the accuracy of prediction reduces and the variability increases. An increase in the training dataset increases the accuracy, but only to a minimum error of 0.104 eV and a minimum variability of 0.0095. To balance the trade-off between the accuracy and the variability, the fraction of data used for training is between 30% and 55%. (b) The plot between the property predicted by the ensemble linear model and the actual property after choosing 55% of the data for training.

the test dataset when the model was built on the training dataset. Overcoming Nonlinearity by Using Decision Trees: By looking at Figure 7, obtained from the model fit between the selected variables and the BE, it is clear that the relationship is nonlinear, especially in the ranges [−2 eV, −1.5 eV] and [−0.2 eV, 0 eV]. In future work, we will use an ensemble of decision

trees to capture the nonlinear relationship between the selected descriptors and the BE.
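The training-fraction study summarized in Figure 7 can be emulated with a sketch like the one below; it uses a plain linear model on the already fingerprinted descriptors and fewer trials than the paper, so the specific numbers it produces are illustrative rather than the authors' results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def median_error_vs_training_fraction(X, y, fractions=(0.05, 0.3, 0.55, 0.95),
                                      n_trials=1000, seed=0):
    """Median absolute prediction error on held-out data for several training
    fractions, echoing the accuracy/variability trade-off of Figure 7
    (a sketch with a plain linear model, not the paper's ensemble)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    summary = {}
    for frac in fractions:
        errors = []
        for _ in range(n_trials):
            order = rng.permutation(n)
            n_train = max(2, int(frac * n))
            train, test = order[:n_train], order[n_train:]
            model = LinearRegression().fit(X[train], y[train])
            errors.append(np.median(np.abs(model.predict(X[test]) - y[test])))
        summary[frac] = (np.mean(errors), np.std(errors))  # accuracy, variability
    return summary
```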



CHEMICAL INSIGHTS AND MODEL/HYPOTHESIS VALIDATION

We now discuss the principal chemical insights derived from our analysis and their connection with well-known ideas in the chemistry of materials, given in this context with the widely


used d-band model.17,42 In recent years, the d-band model has been used in understanding the effects of strain on the reactivity of metal surfaces18,43 and perovskite oxides.44 It has been used as a predictive model in screening multimetallic electrocatalysts for the oxygen reduction reaction.45 The shape of the d-band was also recently found to be relevant to the reactivity of transition-metal alloys.46 A new chemical idea of an orbital-wise coordination number has been proposed by Xin et al. to be relevant to predicting the reactivity of metal nanocatalysts.47

Validating from Prior Knowledge. While the finding that WF, AR, SE, IP, EA, EN, and VAD2 are dependent only on the substrate is interesting, this is so by construction of the data, which was generated for substrates only. Thus, it only verifies that a blind statistical analysis effectively points out such correlations (which happen to be obvious here!). Such effectiveness of our analysis is also evident in our finding that the Pauling electronegativity and the ionization potential are strongly correlated with each other, as was also noted earlier in ref 19. The correlation between the binding energy and the filling factor is interesting. This can be understood in terms of the d-band model, in which the binding energy is linearly dependent on f.2,18 However, the data reveal that the d-band filling varies only from 0.87 to 0.98 in the entire database, while the BE exhibits quite a diverse distribution. Thus, much of the variation in the energy arises from other factors that depend more sensitively on the other descriptors identified here (see the equation presented below). Our analysis shows that various attributes of the d-band (SDB, KDB, WDB, and CDB) are directly or inversely correlated with each other. This forms the basis for why the d-band model has been so successful in a wide range of catalysis problems. It essentially means that the overall shape of the d-band (the peak in the density of states) is qualitatively similar in these alloys. If we approximate it by a triangle centered at the CDB, it is evident that the width of the d-band is inversely proportional to its height, because the area under the d-band curve is 5. Thus, there is an inverse relationship between the width of the d-band and the higher-order moments (skewness and kurtosis). A direct relation of the squared coupling between adsorbate and metal with the WDB can be understood: the latter is a measure of the interatomic orbital coupling in the metal and, hence, a measure of the radius of the transition-metal ion, or the delocalization or spatial extent of the d-orbital. A more delocalized d-orbital facilitates its direct overlap with the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) of the adsorbate and, consequently, a stronger adsorbate−metal coupling Vad. This stronger coupling, in turn, results in a stronger binding energy (see Figure 8).

Modification of the d-Band Model. In addition to highlighting the relevance of the d-band center, the coupling between adsorbate and metal, the filling of the d-band, and the atomic radii, our BoPGD (see Table 2) and PLS analysis (see Figure 9) reveal the importance of the work function26 for the binding energy BE. While the former are quite understandable and included in the d-band model (see ref 18), our finding about the role of the work function suggests that the d-band model could be slightly modified and possibly improved! In fact, both algorithms, BoPGD and the PLS analysis, show that there is an inverse relationship between the binding energy (KPI) and the work function. We now show that the work

Figure 8. Schematic of the interaction between the HOMO (5σ) and LUMO (2π*) orbitals of CO, the adsorbate molecule (right), with the d-band of a transition metal (left). It is clear that the center of the d-band can be specified, with respect to a vacuum, by adding the work function (WF) to its energy relative to the Fermi energy. The shaded arrows (green and purple) show the gain in energy of the d-band of the metal and the HOMO of the adsorbate resulting from their mutual interactions. The metal Fermi energy is for Cu, and the energy of the free CO molecule is calculated using PBE-DFT.

function is essential in aligning the energy scale of the metallic bands with the energies of the frontier orbitals of the adsorbate. From the d-band model depicted in Figure 8, it is clear that the energy of binding of an adsorbate with a metal results from the lowering of the energies of (a) its HOMO level and (b) the d-states of the metal. Within perturbation theory, both of these are directly proportional to the squared coupling between the adsorbate levels and the metallic d-band, and inversely proportional to the difference between the energies of the d-band and the molecular orbitals. The common reference of energy for the metallic d-band and the adsorbate orbitals is the vacuum energy (set to 0 eV in Figure 8). In earlier analyses in terms of the d-band model,2 the Fermi energy (ϵF) has been used as a reference. Since ϵF, with respect to a vacuum, is given by the work function W, it naturally enters into the perturbative analysis of the metal−adsorbate interaction (see Figure 8, where we have obtained the energies of the HOMO and LUMO of CO from a DFT calculation). A typical term2,18 describing the energy of this interaction, as given in the Supporting Information of ref 2 and modified here with the work function W, is

ΔEd = −4f [ V2π*2 / |ϵ2π* − ϵd − W| + S2π*V2π* ] − 2 [ (1 − f) V5σ2 / |ϵ5σ − ϵd − W| + (1 + f) S5σV5σ ]

where V2π*2 = Vad2 β(R0), R0 is the atomic radius, and β is an adsorbate-dependent function. Here, 5σ and 2π* are the HOMO and LUMO of the adsorbate, respectively, ϵ5σ and ϵ2π* are their energies, and V5σ and V2π* are their couplings with the metal, which depend on the interaction of the adsorbate with a d-orbital of the metal through the adsorbate-dependent function β(R0). Finally, S5σ and S2π* are the overlaps that give the repulsive interactions between adsorbate and metal, which depend on the overlap of their wave functions, i.e., on Vad. From Figure 8, we note that the energy difference can be determined meaningfully only when the two energies are aligned on the same scale. If we set the energy under vacuum to zero, we note that the location of the d-band center is given by −(W + ϵd).
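For illustration, the modified interaction term can be evaluated once the descriptors are known; the sketch below assumes all energies are in eV, recovers V from its square via a square root, and the example values in the call are hypothetical, not entries from the database.

```python
import math

def delta_E_d(f, eps_d, W, eps_2pi, eps_5sig, V2pi_sq, V5sig_sq, S2pi, S5sig):
    """Modified d-band interaction energy (eV). The work function W aligns the
    d-band center and the CO frontier-orbital energies to the vacuum level,
    following the expression above (a sketch, not the authors' code)."""
    V2pi, V5sig = math.sqrt(V2pi_sq), math.sqrt(V5sig_sq)
    term_2pi = V2pi_sq / abs(eps_2pi - eps_d - W) + S2pi * V2pi
    term_5sig = (1.0 - f) * V5sig_sq / abs(eps_5sig - eps_d - W) + (1.0 + f) * S5sig * V5sig
    return -4.0 * f * term_2pi - 2.0 * term_5sig

# Hypothetical, purely illustrative values (not taken from the database):
print(delta_E_d(f=0.97, eps_d=-2.5, W=5.0, eps_2pi=-2.0, eps_5sig=-9.0,
                V2pi_sq=2.0, V5sig_sq=1.5, S2pi=0.05, S5sig=0.06))
```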


Figure 9. The partial least squares method lists the contribution of the descriptors to Latent Variable 1 while fitting to the KPI y. Clearly, the d-band characteristics and the work function (WF) are the dominating factors that, in combination, indicate an underlying mechanism that influences the binding energy (BE).
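A PLS fit of the kind summarized in Figure 9 can be sketched with scikit-learn's PLSRegression (an assumed implementation; the authors do not state which package they used); X is the z-scored descriptor matrix and y the binding energy.

```python
from sklearn.cross_decomposition import PLSRegression

def pls_lv1_contributions(X, y, n_components=5):
    """Fit a PLS model linking the descriptors X to the KPI y and return the
    explained variance (R^2) together with the descriptor weights on the first
    latent variable, mirroring Figure 9 (a sketch, not the paper's code)."""
    pls = PLSRegression(n_components=n_components, scale=False)
    pls.fit(X, y)
    r2 = pls.score(X, y)                # fraction of variance in y explained
    lv1_weights = pls.x_weights_[:, 0]  # contribution of each descriptor to LV1
    return r2, lv1_weights
```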

Thus, we argue that the metal-specific work function should enter the expressions of the d-band model, rather than the energies relative to the Fermi energy ϵF, resulting in its slight modification. We note that the original and the subsequent flavors of the formulation of the d-band model use the d-band center relative to the Fermi level, which implicitly includes information about the work function W. The general physics remains the same in our modification of the d-band model, but it has the advantage that explicit inclusion of the work function facilitates referencing the Fermi energy to the vacuum and hence to the HOMO/LUMO energies of a specific molecule, giving a more complete description of the metal−adsorbate interaction. Indeed, our machine learning and statistical analysis has not only identified the important descriptors that are the essential ingredients of the successful d-band model of catalysts, but has also pointed out an additional descriptor (work function, WF) that can be readily imported into the d-band model.

CONCLUSIONS

In conclusion, we have presented a recipe of statistical analyses and machine learning methodology developed here to help a material scientist to start from a raw dataset, build predictive models, and uncover the governing mechanism. We have applied this entire recipe to a published database containing 298 alloys.2 Our proposed machine learning method based on BoPGD provides significant advantages over conventionally used methods of descriptor selection, such as artificial neural networks and LASSO-based methods, in the following ways:
(1) Fingerprinting and Transferability. Our BoPGD method down-selects fewer descriptors (5 descriptors, compared to the 13 used in the neural network model of ref 2) that are sufficient to describe the dataset variability and the binding energy (BE), providing a significant advantage over artificial neural networks. Our method is expected to result in more transferable descriptors with the addition of new materials to the dataset, as the 10 experiments show tightly bound errors. This is because it is based on mechanism as opposed to fitting and therefore has a huge potential in lending itself to extrapolation of the model to new materials for consideration in design.
(2) Stability and Consistency of BoPGD over LASSO. Applying regression coefficient methods such as LASSO to this small dataset reveals their intrinsic instability, pointing to different descriptors for randomly chosen training and test sets, whereas BoPGD demonstrates remarkable stability by selecting the same descriptors, independent of the random sampling runs. LASSO does not converge to a unique solution, because multiple solutions are possible. Furthermore, increasing the number of features beyond the five descriptors down-selected by BoPGD did not decrease the error rates in the linear or nonlinear fits. The maximum error is 0.3 eV between the predicted and the calculated BE, and the minimum error is 0 eV. With regard to the computational efficiency of BoPGD over LASSO, BoPGD scales more efficiently to large sets of descriptors than LASSO. Without the bootstrap step, the computational time complexity is O(M²N + M³) for LASSO against O(K²N + K log K) for BoPGD, where K is the reduced dimension of the fingerprint, M is the scaled-up descriptor space, and K ≪ M < N.
(3) Chemical Insights. From the PLS, it is revealed that a combination of d-band characteristics (FDB, CDB, SDB, KDB, WDB), along with the work function W and the atomic radius, hold the key to the CO binding energy on catalyst surfaces. This is a strong validation of the existing d-band theory, as well as an advancement of it to the next level by adding the influence of W to the model of catalyst performance, which has also been confirmed via a chemical derivation in this work. The generality of our scheme, the transferability of its resulting minimal descriptors through identification of chemical mechanisms, and the efficiency of its scaling to a large dataset make it attractive for use in a wide range of materials problems, integrating features at different scales and stages of development.
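The stability comparison in point (2) can be probed with a simple bootstrap experiment such as the sketch below (scikit-learn's Lasso assumed, with an arbitrary regularization strength); descriptors whose selection frequency fluctuates across resamples illustrate the instability discussed above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_selection_frequency(X, y, alpha=0.05, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each descriptor receives a
    nonzero LASSO coefficient -- a simple probe of selection (in)stability
    (a sketch; this is not the BoPGD procedure)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts += np.abs(coef) > 1e-8
    return counts / n_boot  # values near 1 indicate a stably selected descriptor
```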




(4) Wold, S.; Albano, C.; Dunn, III, W. J.; Edlund, U.; Esbensen, K.; Geladi, P.; Hellberg, S.; Jonahnsson, E.; Lindberg, W.; Sjöström, M. In Chemometrics−Mathematics and Statistics in Chemistry; Kowlaski, B. R., Ed.; NATO ASI, Ser. C, No. 138; Reidel Publishing Co., 1984. (5) van de Waterbeemd, H., Ed. Chemometric Methods in Molecular Design; VCH Publishers: Weinheim, Germany, 1995. (6) Haury, A.-C.; Gestraud, P.; Vert, J.-P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 2011, 6, e28210. (7) Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267−288. (8) Ghiringhelli, L. M.; Vybiral, J.; Levchenko, S. V.; Draxl, C.; Scheffler, M. Big Data of Materials Science: Critical Role of the Descriptor. Phys. Rev. Lett. 2015, 114, 105503. (9) Topliss, J. G.; Edwards, R. P. Chance factors in studies of quantitative structure-activity relationships. J. Med. Chem. 1979, 22, 1238−1244. (10) Jouan-Rimbaud, D.; Massart, D. L.; de Noord, O. E. Random correlation in variable selection for multivariate calibration with a genetic algorithm. Chemom. Intell. Lab. Syst. 1996, 35, 213−220. (11) Wold, S.; Esbensen, K.; Geladi, P. Principal Component Analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37−52. (12) Wold, S.; Ruhe, A.; Wold, H.; Dunn, W. J., III. The collinearity problem in linear regression. The Partial Least Squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 1984, 5, 735−743. (13) De Jong, S. PLS shrinks. J. Chemom. 1995, 9, 323−326. (14) Jhong, H.-R.; Ma, S.; Kenis, P. J. A. Electrochemical conversion of CO2 to useful chemicals: Current status, remaining challenges, and future opportunities. Curr. Opin. Chem. Eng. 2013, 2, 191−199. (15) Hori, Y.; Wakebe, H.; Tsukamoto, T.; Koga, O. Electrocatalytic process of CO selectivity in electrochemical reduction of CO2 at metal−electrodes in aqueous-media. Electrochim. Acta 1994, 39, 1833− 1839. (16) Qiao, J.; Liu, Y.; Hong, F.; Zhang, J. A review of catalysts for the electroreduction of carbon dioxide to produce low-carbon fuels. Chem. Soc. Rev. 2014, 43, 631−675. (17) Hammer, B.; Scheffler, M. Local Chemical Reactivity of a Metal Alloy Surface. Phys. Rev. Lett. 1995, 74, 3487−3490. (18) Kitchin, J. R.; Nørskov, J. K.; Barteau, M. A.; Chen, J. G. Role of Strain and Ligand Effects in the Modification of the Electronic and Chemical Properties of Bimetallic Surfaces. Phys. Rev. Lett. 2004, 93, 156801. (19) Watson, R. E.; Bennett, L. H. A Mulliken electronegativity scale and the structural stability of simple compounds. J. Phys. Chem. Solids 1978, 39, 1235−1242. (20) Meredig, B.; Wolverton, C. Dissolving the Periodic Table in Cubic Zirconia: Data Mining to Discover Chemical Trends. Chem. Mater. 2014, 26, 1985−1991. (21) Varoquaux, G.; Gramfort, A.; Thirion, B. Small-sample brain mapping: Sparse recovery on spatially correlated designs with randomization and clustering. In 29th International Conference on Machine Learning (ICML 2012), Edinborough, U.K., June 26−July 1, 2012; John, L., Joelle, P., Eds.; 2012. (22) Hammer, B.; Morikawa, Y.; Nørskov, J. K. CO Chemisorption at Metal Surfaces and Overlayers. Phys. Rev. Lett. 1996, 76, 2141− 2144. (23) Li, Z.; Ma, X.; Xin, H. Feature engineering of machine-learning chemisorption models for catalyst design. Catal. Today 2017, 280, 232−238. (24) Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16−28. (25) Bhattacharya, I. 
Feature Selection under Multicollinearity & Causal Inference on Time Series. M.S. Thesis, Indian Institute of Science, Bangalore, India, 2017. (26) Baker, L. Adsorption of Xenon on Vicinal Copper and Platinum Surfaces. Thesis, Carnegie Mellon University, Pittsburgh, PA, 2008. (27) Efron, B.; Tibshirani, R. An Introduction to the Bootstrap; Chapman & Hall/CRC Press: Boca Raton, FL, 1993.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.chemmater.6b04229. Brief paragraph on the application of machine learning and statistics to materials science; introduction to the Bootstrap Projected Gradient Descent (BoPGD) Method and details of the algorithm used in the Methodology section; score plot of the two principal components described in the Principal Component Analysis section (Figure S1) (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].
ORCID

Umesh Waghmare: 0000-0002-9378-155X
Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.
Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS The first authors would like to thank and acknowledge the support of Shell for funding this research. The second author would like to thank JNCASR for their support.



ABBREVIATIONS
BE = binding energy
FDB = filling of a d-band
CDB = center of a d-band
WDB = width of a d-band
SDB = skewness of a d-band
KDB = kurtosis of a d-band
WF = work function
AR = atomic radius
SE = spatial extent
IP = ionization potential
EA = electron affinity
EN = Pauling electronegativity
LEN = local Pauling electronegativity
VAD2 = adsorbate-metal interatomic d-coupling matrix element squared
LASSO = Least Absolute Shrinkage and Selection Operator
DFT = density functional theory
KPI = key performance indicator
IQR = interquartile range



REFERENCES

(1) Materials Genome Initiative for Global Competitiveness; White Paper, Group on Advanced Materials, June 2011; www.whitehouse.gov/sites/default/files/microsites/ostpmaterials_genome_initiativefinal.pdf. (2) Ma, X.; Li, Z.; Achenie, L. E. K.; Xin, H. Machine-Learning-Augmented Chemisorption Model for CO2 Electroreduction Catalyst Screening. J. Phys. Chem. Lett. 2015, 6, 3528−3533. (3) Pilania, G.; Wang, C.; Jiang, X.; Rajasekaran, S.; Ramprasad, R. Accelerating materials property predictions using machine learning. Sci. Rep. 2013, 3, 2810.


(28) Upton, G.; Cook, I. Understanding Statistics; Oxford University Press: Cambridge, U.K., 1996. (29) Beck, A.; Teboulle, M. A Fast Iterative Shrinkage Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sci. 2009, 2, 183−202. (30) Zhao, P.; Yu, B. On Model Selection Consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541−2563. (31) Zou, H. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418−1429. (32) Wainwright, M. J. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. Technical Report, Dept. of Statistics, UC Berkeley: Berkeley, CA, 2006; p 709. (33) Zou, H.; Hastie, T. Regularization and variable selection via the Elastic Net. J. R. Stat. Soc. B 2005, 67, 301−320. (34) She, Y. Sparse regression with exact clustering. Electron. J. Stat. 2010, 4, 1055−1096. (35) Bühlmann, P.; Rütimann, P.; van de Geer, S.; Zhang, C.-H. Correlated variables in regression: clustering and sparse estimation. J. Stat. Plan. Infer. 2013, 143, 1835−1858. (36) Bach, F. R. Bolasso: Model Consistent Lasso Estimation through the Bootstrap. Proc. Twenty-Fifth ICML 2008, 33−40. (37) Jain, P.; Tewari, A.; Kar, P. On Iterative Hard Thresholding Methods for High-dimensional M-Estimation. In Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, Canada, Dec. 8−13, 2014; pp 685−693. (38) Committee on Integrated Computational Materials Engineering, National Materials Advisory Board, Division on Engineering and Physical Sciences, National Research Council. Integrated Computational Materials Engineering: A Transformational Discipline for Improved Competitiveness and National Security; National Academies Press: Washington, DC, 2008 (ISBN 9780309178211). (39) Nosengo, N. Can artificial intelligence create the next wonder material? Nature 2016, 533, 22−25. (40) Bhadeshia, H. Neural Networks in Materials Science. ISIJ Int. 1999, 39, 966−979. (41) Bhadeshia, H. Neural Networks and Information in Materials Science. Stat. Anal. Data Mining: ASA Data Sci. J. 2009, 1, 296−305. (42) Nørskov, J. K.; Abild-Pedersen, F.; Studt, F.; Bligaard, T. Density functional theory in surface chemistry and catalysis. Proc. Natl. Acad. Sci. U. S. A. 2011, 108, 937−943. (43) Mavrikakis, M.; Hammer, B.; Nørskov, J. K. Effect of Strain on the Reactivity of Metal Surfaces. Phys. Rev. Lett. 1998, 81, 2819−2822. (44) Akhade, S. A.; Kitchin, J. R. Effects of strain, d-band filling, and oxidation state on the bulk electronic structure of cubic 3d perovskites. J. Chem. Phys. 2011, 135, 104702. (45) Xin, H.; Holewinski, A.; Linic, S. Predictive Structure−Reactivity Models for Rapid Screening of Pt-Based Multimetallic Electrocatalysts for the Oxygen Reduction Reaction. ACS Catal. 2012, 2, 12−16. (46) Xin, H.; Vojvodic, A.; Voss, J.; Nørskov, J. K.; Abild-Pedersen, F. Effects of d-band shape on the surface reactivity of transition-metal alloys. Phys. Rev. B: Condens. Matter Mater. Phys. 2014, 89, 115114. (47) Ma, X.; Xin, H. Orbitalwise Coordination Number for Predicting Adsorption Properties of Metal Nanocatalysts. Phys. Rev. Lett. 2017, 118, 036101.
