Machine Learning and Statistical Analysis for Materials Science


Article

Machine Learning and Statistical Analysis for Materials Science: Stability and Transferability of Fingerprint Descriptors and Chemical Insights Praveen Pankajakshan, Suchismita Sanyal, Onno E. de Noord, Indranil Bhattacharya, Arnab Bhattacharyya, and Umesh Waghmare Chem. Mater., Just Accepted Manuscript • DOI: 10.1021/acs.chemmater.6b04229 • Publication Date (Web): 11 Apr 2017 Downloaded from http://pubs.acs.org on April 13, 2017


Chemistry of Materials is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.



Machine Learning and Statistical Analysis for Materials Science: Stability and Transferability of Fingerprint Descriptors and Chemical Insights

Praveen Pankajakshan*†, Suchismita Sanyal†, Onno E. de Noord§, Indranil Bhattacharyaǂ, Arnab Bhattacharyyaǂ, Umesh Waghmare+

†Shell India Markets Pvt. Ltd., +Jawaharlal Nehru Center for Advanced Scientific Research, §Shell Global Solutions International B.V., ǂIndian Institute of Science.

ABSTRACT: In the paradigm of virtual high-throughput screening for materials, we have developed a semi-automated workflow, or 'recipe', that can help a materials scientist start from a raw dataset of materials with their properties and descriptors, build predictive models, and draw insights into the governing mechanism. We demonstrate our recipe, which employs machine learning tools and statistical analysis, through application to a case study leading to the identification of descriptors relevant to catalysts for CO2 electroreduction, starting from a published database of 298 catalyst alloys. At the heart of our methodology lies the Bootstrapped Projected Gradient Descent (BoPGD) algorithm, which has significant advantages over commonly used Machine Learning (ML) and Statistical Analysis (SA) tools such as the regression coefficient shrinkage-based method (LASSO) or artificial neural networks: (a) it selects descriptors with greater stability and transferability, with the goal of understanding the chemical mechanism rather than merely fitting the data, and (b) while being effective for smaller datasets such as our test case, it employs clustering of descriptors to scale far more efficiently to large descriptor sets in terms of computational speed. In addition to identifying the descriptors that parametrize the d-band model of catalysts for CO2 reduction, we predict the work function to be an essential and relevant descriptor. Based on this result, we propose a modification of the d-band model that includes the chemical effect of the work function, and show that the resulting predictive model gives the binding energy of CO to the catalyst fairly accurately.
As our scheme is general and particularly efficient in reducing a large set of descriptors to a minimal one, we expect it to be a versatile tool for obtaining chemical insights into complex phenomena and for developing predictive models for the design of materials.


INTRODUCTION

The typical time frame from the initial research and discovery of a material to its incorporation in first-use applications is remarkably long, about 10 to 15 years [1]. This is partly due to the lack of an integrated approach that considers all aspects of materials development, and the continued reliance of research on scientific intuition and trial-and-error exploratory experimentation. It is desirable to have a generic approach that links knowledge about a material in terms of its structure, processing and properties in an insightful way [38]. Design of materials from first-principles simulations using advanced High Performance Computing (HPC) infrastructure is an attractive avenue to move away from this empiricism, although the process is still lengthy and time-consuming. The next stage of innovation will lie in automating the intuition involved in materials synthesis: taking it from idea to the desired material. What we need is a versatile tool (a) that can analyze the data and correct obviously erroneous entries, (b) that identifies a minimal set of descriptors from a large dataset that essentially govern the desired property of a material, and (c) whose results do not depend on a specific subset of the large data used in the analysis, thus naturally acquiring transferability. Tools of machine learning can be quite useful in identifying connections between complex phenomena as expressed in such data, particularly when simple mechanistic principles are not readily applicable. Virtual High-Throughput Screening (VHTS) addresses this long production time frame by combining the predictive power of materials modeling with advances in data analytics and machine learning to bring insights and help accelerate materials discovery [3, 20].
Transferring this paradigm from drug design in the pharmaceutical industry to materials science has become realistic thanks to recent advances in the speed of computational resources, the availability of open-sourced data and knowledge, and the efficiency and stability of materials modeling packages. National programs such as the Materials Genome Initiative [1], launched by the US government in 2011, have also provided a significant boost to research in this area. For a comprehensive compilation of different initiatives globally, refer to Ref [39]. The goal of this article is to develop advanced machine learning methodologies, in conjunction with statistical analyses, that can work for a relatively small dataset, for example the one available [2] for catalysts for CO2 electroreduction. We present here a semi-automatic recipe that can help a materials scientist start from a raw dataset, build predictive multi-scale models and explain the governing mechanism:

• our Bootstrapped Projected Gradient Descent (BoPGD) method (see supplementary information), at the core of the machine learning scheme, scales efficiently to larger sets of descriptors in the data,

• the recipe is transferable to any application where it is important to understand the underlying mechanism governing the pathways, and

• the method fingerprints the sparsest yet stable set of descriptors that can predict a given property.

At the heart of our recipe is the "Descriptor Selection" tool, which takes the entire set of descriptors (typically very large) as input and provides the fingerprints that map to the Key Performance Indicator (KPI). The underlying idea behind selecting sparse fingerprints is the principle of Occam's razor, or the law of parsimony: simpler models are a useful heuristic for developing theoretical models and testing them. In contrast to earlier works [20], where the grouping of data is done in the candidate-materials space, here the grouping is done with respect to the descriptors, or the extended bag of descriptors. This (a) is more efficient and (b) gives the added benefit of providing chemical insights.


Figure 1. Forward Problem: Accelerating Materials Property Calculation. Machine learning involves the task of mapping the set of descriptors D (as features) to the KPI y for every candidate material (C) in the database, by engineering them into an extended set XE and fingerprinting them to a minimal set X using a selection function or operator f.

A database in materials science typically constitutes descriptors (D) of a set of materials (C). While building the forward model (see Figure 1), which maps the set of descriptors to the KPI (a target property relevant to an application), the principal method used in materials science for prediction is often the artificial neural network (ANN) [40, 41]. The ANN is a non-linear model giving a black-box approach to modeling and analysis, and is quite useful when the following conditions are satisfied:

• the number of samples or experiments is really large,

• there isn't much a priori information or knowledge about the dataset or the science, and

• the goal is accurate prediction within the model domain (i.e., interpolation only) given the above circumstances.

Unlike genomic bio-databases, in materials datasets we often deal with a relatively small number of experiments or samples, and there is a definite risk that the developed predictive model is over-fit to the dataset. The result of scaling/extrapolating this model to a new dataset is often questionable, especially if there is no backing from materials science to support the findings. While building interpretable models for small or large data, we recommend that the process of mapping from the descriptor space to the KPI space go through intermediate steps: descriptor engineering, and then reduction of the extended set of engineered descriptors to obtain the fingerprinted descriptors (Figure 1). This whole process and recipe are described in the Methodology section of this article. Reduction of the dimension of the descriptor space is very important [6] because it helps not only in reducing the complexity of the model but also in better understanding the governing mechanism that underlies the true model that generated the data.

Another category of methods from computational statistics are shrinkage methods such as the Least Absolute Shrinkage and Selection Operator (LASSO) [7], which shrink the coefficients of the regression model to provide only the sparsest set of descriptors that best describes the KPI, and in this way implicitly perform descriptor selection. In shrinkage-based or search-based methods, there is also a high probability that chance correlations appear [9, 10] in single trials. When the descriptors are highly correlated, across multiple trials of the descriptor selection process LASSO and similar methods select an arbitrary one of the correlated descriptors [30]. This results in inconsistency and noise in the trial-based selection process, making interpretation risky and follow-up difficult for datasets with a larger number of descriptors. An alternative approach is to use methods based on latent variables.
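This selection instability is easy to reproduce numerically. The sketch below is our own illustration (not code from this work): a minimal coordinate-descent LASSO is fit on bootstrap resamples of synthetic data containing two nearly identical descriptors, and the "winning" descriptor can change from trial to trial.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Minimal coordinate-descent LASSO (illustrative; no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]          # partial residual
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return w

rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)    # near-duplicate, highly correlated descriptor
x3 = rng.normal(size=n)                # irrelevant descriptor
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.1 * rng.normal(size=n)

picks = {0: 0, 1: 0}
for _ in range(200):
    idx = rng.integers(0, n, size=n)   # bootstrap resample with replacement
    w = lasso_cd(X[idx], y[idx], alpha=0.1)
    picks[int(np.argmax(np.abs(w[:2])))] += 1

print(picks)  # counts of which correlated descriptor "won" per trial
```

The tally in `picks` shows how an arbitrary member of the correlated pair can be selected depending on the resample, which is exactly the interpretability hazard described above.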
The philosophy of latent variable (LV) methods is that observations are a reflection of underlying phenomena that cannot be observed or measured directly. The manifest variables are a reflection of these latent variables and are not in themselves the most interesting for interpretation. Each LV is an orthogonal weighted linear combination of the manifest variables and in that way solves the collinearity problem in regression. The most common LV methods are Principal Component Analysis (PCA), which operates on a single (X) data block [11], and Partial Least Squares (PLS), which links the two (X and KPI y) data blocks [12]. As a regression technique, PLS can be seen as a regularization method, similar to Ridge Regression [13].

Descriptor Importance and Fingerprinting

There are many articles expounding the importance of material descriptors in either accelerating material property calculation or material design. In [8], the authors describe the characteristics desired of a set of descriptors, which we recall here:

• descriptor calculations must not be as intensive as those for the KPI,

• it uniquely characterizes the material as well as property-relevant elementary processes,

• materials that are very different (similar) should be characterized by very different (similar) descriptor values, and

• its dimension should be as low as possible.
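As an aside, the latent-variable idea discussed earlier is compact enough to sketch in a few lines. The toy single-response PLS below (the NIPALS algorithm, written by us for illustration on synthetic collinear data) shows that the extracted latent variables are mutually orthogonal weighted combinations of the manifest variables.

```python
import numpy as np

def pls1(X, y, n_components):
    """Minimal PLS1 via NIPALS with deflation (illustrative sketch)."""
    X, y = X.copy(), y.copy()
    scores = []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)          # weight vector
        t = X @ w                       # latent variable (score)
        p = X.T @ t / (t @ t)           # X loading
        q = (y @ t) / (t @ t)           # y loading
        X -= np.outer(t, p)             # deflate X
        y -= q * t                      # deflate y
        scores.append(t)
    return np.column_stack(scores)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=50)   # collinear descriptor pair
y = X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=50)

T = pls1(X, y, 3)
print(np.round(T.T @ T, 6))  # off-diagonal entries near 0: scores are orthogonal
```

Despite the collinear pair of columns, the score vectors come out orthogonal, which is how LV methods sidestep the collinearity problem that trips up coefficient-shrinkage selection.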

While determining the descriptors for both materials prediction and discovery, the choice and number of descriptors should generally be as large as possible at the beginning. In this article, we define a new term: fingerprint descriptors (see Figure 1). From a machine learning perspective, the fingerprint descriptors are a subset (daughter) of the superset of mother descriptors, and are unique for a property and material. It is the fingerprint's dimension, or cardinality, that is kept as low as possible (sparse), while the original descriptor space can be as large as possible. Additionally, the mathematical mapping from the fingerprint descriptors to the property or KPI is also unique. The engineering of the descriptors (nonlinear combinations of simple descriptors, D) ensures that the relationship between the fingerprinted descriptors X and the KPI is linear, so that the true model M can be expressed within an error ε as:

M: y = βX + ε

where the fingerprint X is obtained by down-selecting the engineered descriptors XE as X = f(XE), and the descriptor engineering with Φ gives XE = Φ(D), so that M can be written as:

M: y = β f(Φ(D)) + ε.

The determination of the model estimate M̂ reduces to estimating the coefficients β of the model. Once the model M̂ is built, predicting the KPI for a set of new materials based on their descriptors Dnew becomes a process of applying the model to the new materials' descriptors, M̂: Dnew ↦ ŷnew.

We demonstrate that it is critical how these fingerprint descriptors are chosen and how the unique mapping is uncovered/estimated, especially while dealing with small datasets. Conventional statistical and machine learning approaches have stability issues [21] when dealing with relatively small datasets: when several trials of the approach are run, the results are often not consistent across the trials. They very often overtrain the model to the specific dataset, with the result that extrapolation of the model to a new dataset can be problematic. Our formulated recipe is shown to overcome such stability issues, and therefore lends itself to robust descriptor selection in small datasets as well as scalability to larger datasets.

Case Study

To demonstrate our recipe, we choose the case study of catalyst screening for CO2 electroreduction [2]. Electrochemical reduction of CO2 into value-added chemicals using renewable energy is looked upon as a promising approach towards achieving carbon neutrality, while providing a method to store or utilize renewable energy from intermittent sources, both reducing our dependence on fossil fuels [14]. Electrocatalysts are needed to bind and activate CO2 in order to reduce the high over-potentials typically encountered. Catalysts can also drive selective formation of desired products. During the past few decades, efforts have mostly focused on different metal catalysts and the various products that can be formed using those metals [15, 16].

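The forward mapping y = β f(Φ(D)) + ε described above can be sketched numerically. In this illustration (ours, on synthetic data), the engineering step Φ uses pairwise products and the selector f keeps the single engineered column most correlated with the KPI; both are assumed stand-ins, not the BoPGD selector of this work.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
D = rng.normal(size=(100, 4))                         # raw descriptors
y = D[:, 0] * D[:, 1] + 0.05 * rng.normal(size=100)   # KPI hides a product term

# Phi: extend the descriptors with all pairwise products
pairs = list(combinations(range(4), 2))
XE = np.column_stack([D] + [D[:, i] * D[:, j] for i, j in pairs])

# f: down-select the engineered column most correlated with the KPI
scores = [abs(np.corrcoef(XE[:, k], y)[0, 1]) for k in range(XE.shape[1])]
k_best = int(np.argmax(scores))
X = XE[:, [k_best]]                                   # fingerprint (here 1-D)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # estimate coefficients
pred = X @ beta
print("selected engineered column:", k_best)
```

Because the engineered set contains the true interaction D0*D1 as a column, a plain linear fit on the fingerprint recovers the KPI; this is the sense in which engineering Φ linearizes the final model.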

Xianfeng Ma et al. [2] have demonstrated the use of machine learning methods, such as an artificial neural network trained with a data set of ab initio adsorption energies and electronic fingerprints of idealized bimetallic surfaces, to capture complex, nonlinear interactions of adsorbates (e.g., *CO) on multi-metallics with ∼0.1 eV error, outperforming the known two-level interaction model [22] and the d-band model [18] in prediction. With this model and the well-established descriptor-based catalyst design approach, they identified, from a calculated database, promising {100}-terminated Cu multi-metallics with a lower over-potential and possibly higher selectivity in CO2 electroreduction to C2 species. The database itself contains 298 alloys with 13 descriptors (see Table 1 for notations and value ranges) and one Key Performance Indicator (KPI), or target property: the binding energy ∆ECO of adsorbed CO on the metal surface. For generating the database, electronic structure calculations were performed within the framework of spin-polarized Density Functional Theory (DFT) implemented in Quantum ESPRESSO, using the Perdew-Burke-Ernzerhof (PBE) parametrization of the gradient-corrected exchange-correlation functional.

Table 1. Descriptors, their Notations and Value Ranges. We note that the center of a d-band here is obtained relative to the Fermi energy [2].

Descriptor | Notation | Value Range
Filling of a d-band | f | 0.87 to 0.98
Center of a d-band | εd | -4.49 eV to -0.75 eV
Width of a d-band | Wd | 0.86 eV to 2.16 eV
Skewness of a d-band | γ1 | -0.8 to 4.06
Kurtosis of a d-band | γ2 | 2.75 to 29.99
Work function | W | 5.01 eV to 6.74 eV
Atomic radius | r0 | 1.38 Å to 1.59 Å
Spatial extent of d-orbitals | rd | 0.67 Å to 1.04 Å
Ionization potential | IE | 7.58 eV to 9.23 eV
Electron affinity | EA | 0.56 eV to 2.31 eV
Pauling electronegativity | χ0 | 1.9 to 2.54
Local Pauling electronegativity | χ | 1.49 to 2.54
Adsorbate-metal interatomic d coupling matrix element squared | Vad² | 1 to 3.9

Notations. We use D ∈ ℝ^(N×P) to represent the P-dimensional set of descriptors (listed in Table 1) and y ∈ ℝ^(N×1) to represent the KPI for each of the N candidate materials. β ∈ ℝ^K represents the set of coefficients of a model, having a reduced dimensionality K.


METHODOLOGY

Fig. 2. A recipe for going from data to fingerprinted descriptors to insights to models and discovery. It involves Step 1: Pre-processing; Step 2: Exploratory Analysis; Step 3: Fingerprinting Descriptors; Step 4: Statistical model or linear/nonlinear model building and validation; Step 5: Insights from the Subject Matter Expert.

In Figure 2, we show a simple graphical abstract of the workflow that is generally adopted for the whole process involved in developing a model or drawing insights. This is a generic process recipe for any application that involves going from a dataset to fingerprinting the descriptors to model generation and discovery. We cannot stress enough the importance of involving the subject matter expert at all stages of the process, from data preparation to the drawing of insights and the definition of the final model. As mentioned in [8], historically the reliance was on chemical intuition for deciding the important descriptors for a particular application and developing the relationship between them that best explains the physical property. Our idea here is to automate the entire process leading to chemical insights and discovery by rational design, with minimal need for chemical intuition. We demonstrate this entire process for the specific case study described in the section above. Following this flow chart and its application to the case study, this article is divided into the following sections: data preprocessing, exploratory analysis, descriptor fingerprinting, building the model and testing hypotheses, drawing insights with the subject matter expert, and conclusions.

Data preprocessing

Pre-processing of the data is important to improve its quality and reliability. It can in fact add some insights that are quite useful in terms of the process involved in the generation of the data. This is discussed in detail in the section on drawing insights from outliers as part of the Exploratory Analysis.


Handling Missing Data: There might be cases where data are missing from the database. If only a limited set of discontinuous data is missing from a few of the descriptors, one might consider data imputation techniques to fill in the missing information. Most often in materials databases, it makes sense to discard the entire row (remove the material ci) even if only a few columns are missing, so as to avoid imputation errors that could propagate to the model or to the insights derived from it. The data that we used in this article, from [2], did not have any missing values.

Normalize: Due to the enormous variations in the range of the input and output data, and in order to do a fair comparison within the descriptors (simple or extended), we normalize the entire or engineered dataset before processing it. Normalizing the attributes using z-score normalization is done by:

x̃ = (x − μd) / σd

where μd and σd are the population mean and standard deviation for each column of X. As the population means and standard deviations are not available, we estimate them from the dataset. Naive bootstrapping [27] recreates several subsets of the original dataset by randomly sampling from it with replacement: once a sample is chosen from a bag of samples, it is placed back into the original bag of data before the next sample is drawn from the same bag. For estimating the mean and the variance, we take several hundred draws or trials of the data and calculate the empirical bootstrapped distributions of the means and variances; these are averaged to estimate μd and σd.

Exploratory analysis

The original dataset before normalization is quite diverse in terms of the KPI (see Figure 3), which is distributed in the range between -2 eV and 0 eV. Except for three descriptors, IP, EA and EN, there is a good spread in the descriptor ranges as well (see Figure 3).
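The bootstrapped z-score normalization can be sketched as follows (our own illustration on synthetic stand-in data, not the authors' code; the number of bootstrap draws is an assumed parameter):

```python
import numpy as np

def bootstrap_moments(X, n_trials=500, seed=0):
    """Estimate per-column mean/std by averaging over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mus, sds = [], []
    for _ in range(n_trials):
        sample = X[rng.integers(0, n, size=n)]     # resample with replacement
        mus.append(sample.mean(axis=0))
        sds.append(sample.std(axis=0, ddof=1))
    return np.mean(mus, axis=0), np.mean(sds, axis=0)

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(295, 13))  # stand-in for the 295 x 13 table
mu, sd = bootstrap_moments(X)
Xz = (X - mu) / sd                                  # z-score normalized data
print(np.round(Xz.mean(axis=0)[:3], 3))             # columns centered near 0
```

After normalization each column has mean near 0 and standard deviation near 1, which makes the violin plots and correlation screens below directly comparable across descriptors.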

Figure 3. Violin plot showing the diversity of the KPI (BE) and the 13 descriptors. The descriptors KDB, SDB and CDB have outliers outside the data limit fences imposed by Q1 − 1.5 IQR and Q3 + 1.5 IQR (refer to text), as shown by the lengthy whiskers, while the diversity and spread of the three descriptors IP, EA and EN is sparse. The black lines in the violin plot are the mean values for the descriptors, here zero as the data is normalized, while the red lines indicate the median values.

ACS Paragon Plus Environment

Chemistry of Materials

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

On evaluating the statistical outliers for these descriptors, we noticed that the entries with Ag as the substrate had the most outliers, in comparison to other alloys, for CDB, SDB and KDB. In Figure 3, we show a violin plot for the z-score normalized descriptors. The long whiskers on these three descriptors correspond to the outliers that are well outside the limits imposed by Q1 − 1.5 IQR and Q3 + 1.5 IQR, where IQR is the InterQuartile Range [28] and Q1 and Q3 are the first and third quartiles, respectively. The IQR is a measure of variability based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts; the values that divide the parts are called the first, second, and third quartiles, denoted Q1, Q2, and Q3, respectively. For data that are symmetric and nearly normally distributed, the fences lie within the 3σ bounds. The violin plot is similar to the box plot conventionally used by statisticians to see the spread of the data, the mean values and the outliers; in addition to these statistical values, the violin plot also captures the shape of the probability density function of each descriptor. In the violin plot of Figure 3, the black line in the middle is the mean for each descriptor and the red line gives the median. Since the data is normalized, the black line is along the zero center. The median function takes a vector, sorts it in ascending or descending order, and takes the middle value (or the average of the two middle values) of the sorted result. The median is a robust estimate of the mean and ensures that extreme low or high values do not affect its estimation. From the violin plot in Figure 3, we find that the descriptors KDB (γ2), SDB (γ1) and CDB (εd) have outliers outside the data limit fences. The higher-order moments, kurtosis and skewness, were usually found to have larger outliers.
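The fence computation itself is short; a sketch with a synthetic column standing in for a descriptor (the injected extreme values are illustrative, not from the database):

```python
import numpy as np

def iqr_outliers(x):
    """Flag points outside the Tukey fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lo) | (x > hi)

rng = np.random.default_rng(3)
# 290 well-behaved values plus 5 injected extremes (stand-in for KDB)
kdb = np.concatenate([rng.normal(size=290),
                      np.array([8.0, 9.5, 12.0, 15.0, 30.0])])
mask = iqr_outliers(kdb)
print(mask.sum(), "values flagged as outliers")
```

Rows whose descriptors fall outside both fences across several columns are the candidates for removal discussed next.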
The alloys "Ag-Sc@Ag", "Ag-Y@Ag" and "Ag-Zr@Ag" had outliers for which all three descriptors, KDB, SDB and CDB, took values well above the statistically significant fences and cannot simply be ignored; these are eliminated from any further evaluation here. The number of entries in the database to analyze thus drops to 295. In general, we found that the outliers in the DFT calculations of the d-band characteristics are high for the bi-layered (of the form "X-Y@Ag") alloys of Ag. The three descriptors IP, EA and EN have a very discontinuous distribution, which might indicate that these were added to the database from a published table available in the literature. As the descriptor dimension of the dataset is small, here 13, the data can be visually analyzed, pairwise, for insights. In Figure 4, the KPI (y), which is the binding energy of adsorbed CO on a metal surface (∆ECO), is plotted against one of the descriptors (D), the filling of a d-band (f). We observe that the descriptor f has values ranging between a minimum of 0.87 and a maximum of 0.98 (see Figure 4). For copper, silver and gold the range of the d-band fillings is highly limited, between 0.96 and 0.98. We observe a shift in the d-band filling fraction from gold (Au) to silver (Ag) to copper (Cu). The low reactivity of these metals is reflected in the very small change in this descriptor f; these metals also do not interact easily with air/oxygen, which possibly has to do with their "almost" filled d-subshells. Similarly, seven other descriptors, WF, AR, SE, IP, EA, EN, and VAD2, depend only on the substrate and do not vary with the ligand layer on top of the substrate. If we group all members belonging to a common substrate, we find that these seven descriptors are very useful in discriminating between clusters but do not offer any new information within a cluster.
By clustering of descriptors, we refer to the grouping of descriptors in such a way that there is a common link that unifies the elements of a cluster. On a cursory analysis, looking visually at the scatter plot, it does seem that the binding energy ∆ECO and the d-band filling f are independent. This could point to three possibilities:

a. the "Filling of a d-band", FDB (f), is not unique for the alloys [2], or

b. they are unique but the resolution is low or numerically truncated, or

c. the way the descriptor is generated during DFT should be re-evaluated.


Figure 4. Scatter plot between the "Filling of a d-band" (FDB), f, and the "Binding Energy of adsorbed CO on a metal surface" (BE), ∆ECO (in eV). The data are grouped as clusters by the core of the alloy: Ni-Pt, Pd, and Au-Cu-Ag. On a detailed analysis using the alloy labels, however, we find that the scatter plot groups in terms of the core metal (Ni, Pt, Pd, Au, Cu, Ag) of the alloy, with no relationship to the shell metal in the alloy.

Figure 5. Pearson's pairwise product-moment correlation heatmap among the KPI, binding energy (BE), and the different descriptors. The strong positive correlation between FDB and BE is given by the yellow box in the plot.

The relationship between FDB and BE is strongly established in the Pearson's product-moment correlation, given by the heatmap of Figure 5 and the graph plot of Figure 6. The P columns in D are the individual descriptor vectors; when two are similar, their correlation is close to 1 or -1 (yellow and dark blue in the plot), and when they are dissimilar it is close to 0 (green in the plot).
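The screen behind Figures 5 and 6 amounts to thresholding the Pearson correlation matrix at |r| > 0.7; a sketch with synthetic stand-in columns (names and value ranges borrowed from Table 1, the generating relationships assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 295
f = rng.uniform(0.87, 0.98, size=n)                # d-band filling, Table 1 range
be = 15.0 * f - 14.0 + 0.05 * rng.normal(size=n)   # KPI made to track f (assumption)
wd = rng.uniform(0.86, 2.16, size=n)               # unrelated stand-in for Wd

data = np.column_stack([be, f, wd])
names = ["BE", "FDB", "WDB"]
R = np.corrcoef(data, rowvar=False)                # Pearson correlation matrix

# report only the edges that would survive the 0.7 cutoff of Figure 6
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(R[i, j]) > 0.7:
            print(f"{names[i]} -- {names[j]}: r = {R[i, j]:.2f}")
```

Only the BE-FDB edge survives the cutoff here, mirroring how the graph plot keeps the strong links and hides the weak ones.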

Figure 6. A graph plot showing the Pearson's correlation between the variables described in Figure 5. Green shows positive correlation while red shows negative correlation between the KPI and the descriptors. The thickness of the links shows the strength of the relationship. The correlation cutoff is placed at 0.7 for better visibility of the relations.

There are multiple insights we get from the Pearson's correlation plot in Figure 5, which can be connected with the d-band model [17] that captures the reactivity of metallic surfaces:

1. most of the d-band characteristics are directly or inversely correlated with each other and in turn with the binding energy: the "filling of the d-band" (f) has a direct relationship with the skewness SDB (γ1) and the kurtosis KDB (γ2), while f has an inverse relation with the "center of the d-band" CDB (εd) and the "width of the d-band" WDB (Wd). Unlike in [18], we did not find any correlation between εd and Wd, which might point to a possible independent causal link between f and εd, and between f and Wd;

2. there is a strong correlation between the "Binding Energy" BE (∆ECO) and the "Filling of a d-band" FDB (f);

3. direct relation between the "Ionization Potential" IP and the "Pauling Electronegativity" EN (χ0);

4. direct relation between the "Adsorbate-metal interatomic d coupling matrix element squared" Vad² and the "Width of a d-band" Wd;

5. there is a strong direct relation between the "Spatial Extent" SE (rd) and the "Adsorbate-metal interatomic d coupling matrix element squared" Vad²;

6. direct relation between the "Spatial Extent" SE (rd) and the "Pauling Electronegativity" EN (χ0).

It is well established that the d-band characteristics indeed have a strong relationship with the binding energy, which validates the first finding above and from Figure 6. In general, there is a non-linear relationship between the d-band characteristics and the BE. It has been well established that the d-band model developed by Nørskov et al. [17] presents a theoretical framework for the description of basic trends in the reactivity of different systems.
In [18], the authors mention that, according to tight-binding theory, the "Width of the d-band" WDB (Wd) can be related to the "interatomic matrix element" Vad that describes the bonding between an atom and its environment. In fact, r(i)d, a characteristic length for metal i, is related to the spatial extent of the d orbitals of that metal. The matrix element Vad and rd are related as V²ad ∝ r³d. This nonlinear relationship could possibly explain the slight weakness we notice in the Pearson's Correlation Heatmap (Figure 5) between the matrix element V²ad and the Spatial

ACS Paragon Plus Environment

Page 11 of 19


Chemistry of Materials

Extent. In fact, in the later clustering analysis, discussed in the next section, we found that these two parameters are so similar in their values that they could be replaced by a single unique descriptor, rd, that describes both. The Width of the d-band is an intrinsic property of the metal, whereas Vad is a property of the molecule-substrate interaction. However, rd, the spatial extent of the d-orbitals, determines the coupling/overlap between the two d-bands in the same metal as well as Vad. Thus, both the Width of the d-band and Vad are determined by rd, and this comes out of our analysis. This is also one of the strengths of this approach: it is able to bring out hidden correlations and a descriptor, here rd, that is not available in the original database. In [19], R. E. Watson and L. H. Bennett mention that the Pauling Electronegativity, χ, can be expressed in terms of the s and p Ionization Potentials. This could be one of the reasons for the strong correlation between the IP and the EN. Interestingly, we did not find a strong correlation between the "Local Electronegativity" (LEN), χ, and the "Pauling Electronegativity" (EN), χP, but the Euclidean distance, d(χ, χP), between the two is very small. This could be attributed to the fact that the Pauling Electronegativity data is not very spread out (see Figure 3), unlike the Local Electronegativity data, which is very diverse. Similarly, from Figure 3, we see that the Electron Affinity and the Ionization Potential are also very sparse in their spread.

Bootstrap Projected Gradient Descent (BoPGD) Method

In [20], Meredig and Wolverton introduced a "Clustering-Ranking-Modeling" (CRM) framework for identifying the unique descriptors that can predict the characteristics of a new dopant. Their idea is simple: cluster the dopants together using the X-means algorithm [20], rank the descriptors using a regression fit, and then model the behavior within each cluster using the unique descriptors.
This works well if there are indeed clusters within the different sample datasets (here the 4 dopant clusters). However, using a regression model for ranking the descriptors forcibly fits all the descriptors into the prediction model for the target property, with a high risk of including even those descriptors that do not contribute significantly and might be redundant. In addition, such a descriptor selection has stability issues when several batches of training sets are used, with each batch run giving a different set of descriptors. Finally, the selected descriptors are often the ones that best predict the target property but need not necessarily point to the governing mechanism.

Fingerprint Descriptors

No. | Clusters/Descriptor | Description | Contribution (in %)
1 | SE, VAD2 | Spatial Extent, Adsorbate-metal interatomic d coupling matrix element squared | 39.1
2 | LEN | Local Pauling Electronegativity | 37.4
3 | AR | Atomic Radius | 11.6
4 | CDB | d-band Center | 8.7
5 | WF | Work Function | 3.2

Table 2: Fingerprinted descriptors obtained from the BoPGD algorithm. The KPI Binding Energy (ΔE) is a function of these fingerprinted descriptors alone.

When the nature of the relationship between the KPI and the descriptors is not known, a hypothesis can be built by assuming a nonlinear relationship between the two, as in tools like Artificial Neural Networks (ANN), Kernel Ridge Regression (KRR), Random Forests, or Support Vector Regression (SVR). In such cases, the model is built by using only a part of the data as an input for training. Perturbation analysis of the resulting model, as in [2], yields the descriptors that might contribute most to the KPI prediction. However, such perturbation-based post-analysis yields the variables that produce an accurate mapping of the descriptor space to the KPI, but need not necessarily indicate the governing mechanism.


In this article, we have used the Hierarchical Clustering and Graph Plot (Figure 6) as a precursor to the Bootstrapped Projected Gradient Descent (BoPGD) algorithm, which is discussed in the section on Descriptor Fingerprinting by the BoPGD Algorithm in the SI. The results obtained from the BoPGD fingerprinting algorithm are given in Table 2.
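The full BoPGD algorithm is specified in the SI; as a rough sketch of the idea suggested by its name, the following assumes the projection step is a hard threshold onto a k-sparse coefficient vector and that bootstrap runs are aggregated by how often each descriptor survives. Every function name, parameter, and data value here is hypothetical, not the paper's implementation:

```python
import random

def pgd_sparse(X, y, k, steps=200, lr=0.1):
    # Projected gradient descent for least squares with a k-sparsity
    # projection (hard thresholding): keep only the k largest |w_j|.
    n, m = len(X), len(X[0])
    w = [0.0] * m
    for _ in range(steps):
        resid = [sum(X[i][j] * w[j] for j in range(m)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(m)]
        w = [w[j] - lr * grad[j] for j in range(m)]
        top = set(sorted(range(m), key=lambda j: -abs(w[j]))[:k])
        w = [w[j] if j in top else 0.0 for j in range(m)]
    return w

def bopgd(X, y, k, n_boot=20, seed=0):
    # Bootstrap the sparse solver and count how often each descriptor
    # survives; the fingerprint is the k most frequently selected ones.
    rng = random.Random(seed)
    counts = [0] * len(X[0])
    for _ in range(n_boot):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        w = pgd_sparse([X[i] for i in idx], [y[i] for i in idx], k)
        for j, wj in enumerate(w):
            if wj != 0.0:
                counts[j] += 1
    return sorted(range(len(counts)), key=lambda j: -counts[j])[:k]

# Toy problem: y depends only on descriptors 0 and 2 of four candidates.
rng = random.Random(1)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(40)]
y = [3.0 * row[0] - 2.0 * row[2] for row in X]
fingerprint = bopgd(X, y, k=2)
```

On this noiseless toy problem the same two descriptors survive every bootstrap run, which is the kind of selection stability the method aims for.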

Statistical Analysis: Uncovering the Governing Mechanism

Principal Component Analysis

Principal component analysis helps in understanding the amount of variability in the data. Any data D ∈ ℜ^(N×M) can be decomposed into its K primary principal components, where P is an orthonormal matrix with P′P = I, and I is the identity. Expressing the descriptors as a linear combination of the principal components:

D ≈ TP′ + E = Σ_{k=1}^{K} t_k p_k′ + E,

where T contains the scores and P the loadings. The scores contain information on the materials and the loadings on the descriptors. In Figure 7, the score plot of the first two principal components clearly shows a strong clustering of the materials. Since Au is clustered into two different groups, either the second cluster, highlighted by the yellow circle in Figure 7, is an outlier, or there are some intermediate materials missing that would link the two clusters together for the Au substrate. We intuitively believe the latter is the case: there are a few alloys with Au as substrate that are not included in these calculations. From our analysis, we found that Principal Component 1 has high positive loadings for many variables, negative loadings for KDB, FDB and SDB, and zero for CDB. It separates "rather purely" (because the Principal Component 2 score is close to 0) the clusters corresponding to '@Pt' (highest score), '@Pd' (close to average score) and '@Cu' (lowest score). Principal Component 2 has the largest negative loading for CDB and positive loadings for SDB, AR and FDB, and it separates the '@Ag' cluster from the '@Ni' cluster, but has large within-cluster variability for '@Ag' as well.
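The decomposition D ≈ TP′ + E behind the score plot can be illustrated with a bare-bones power iteration for the first loading vector and its scores. The data below is a toy matrix, not the alloy database:

```python
def first_pc(D, iters=100):
    # Power iteration on the covariance matrix of mean-centered data D
    # (rows = materials, columns = descriptors); returns the first
    # loading vector p1 and the scores t1 = D_centered @ p1.
    n, m = len(D), len(D[0])
    means = [sum(row[j] for row in D) / n for j in range(m)]
    Dc = [[row[j] - means[j] for j in range(m)] for row in D]
    cov = [[sum(Dc[i][a] * Dc[i][b] for i in range(n)) / (n - 1)
            for b in range(m)] for a in range(m)]
    p = [1.0] * m
    for _ in range(iters):
        q = [sum(cov[a][b] * p[b] for b in range(m)) for a in range(m)]
        norm = sum(x * x for x in q) ** 0.5
        p = [x / norm for x in q]
    scores = [sum(Dc[i][j] * p[j] for j in range(m)) for i in range(n)]
    return p, scores

# Toy data: two descriptors varying together, one nearly constant,
# so the first component loads on the two correlated columns.
D = [[1.0, 2.1, 0.50], [2.0, 3.9, 0.49], [3.0, 6.1, 0.51], [4.0, 8.0, 0.50]]
p1, t1 = first_pc(D)
```

The near-constant third column gets a loading close to zero, mirroring how loadings expose which descriptors drive each component.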


Figure 8: Partial Least Squares Method lists the contribution of the descriptors to the Latent Variable 1 while fitting to the KPI y. Clearly the d-band characteristics and the Work Function seem to be the


dominating factors that, in combination, point to an underlying mechanism that influences the Binding Energy.

Partial Least Squares (PLS) Regression to Uncover the Latent Governing Mechanism

While the decomposition is done on the descriptor space D in the case of PCA, one has to choose between orthogonal scores or orthogonal loadings in PLS. In the present study, the first option was used. The reason to perform Partial Least Squares (PLS) instead of ordinary LS regression is that the datasets D and y may contain random noise, which should be excluded from the regression. Decomposing D and y into a latent space can ensure that the regression is performed only on the most reliable variation. PLS regression was applied to link D to y. Different cross-validation schemes were applied, and 5 LVs were selected as the optimum model complexity. This model explains 90% of the D variance and 92% of the y variance (i.e. R² = 0.92), and the estimated prediction error is 0.17.

Model Performance and Fingerprinted Descriptors Validation

We first define the Median Absolute Error (MAE) performance metric that will help us in evaluating the sanity of selecting only a few of the descriptors:

MAE = Median(|ŷᵢ − yᵢ|), ∀ i = 1, 2, …, N.

As mentioned in the subsection on Exploratory Analysis, while using Violin Plots, the Median is a robust alternative to the mean. The median metric ensures that extreme low or high values do not skew the evaluation and the validation process. The above metric can be used to evaluate the predicted property, ŷᵢ, against the actual property, yᵢ, for the test dataset when the model was built on the training dataset.
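A minimal sketch of the metric, with hypothetical predictions, shows why the median is preferred here: one badly missed prediction barely moves it, while the mean absolute error jumps:

```python
def median_absolute_error(y_true, y_pred):
    # MAE = median(|y_hat_i - y_i|): robust to a few extreme misses.
    errs = sorted(abs(a - b) for a, b in zip(y_pred, y_true))
    n = len(errs)
    mid = n // 2
    return errs[mid] if n % 2 else 0.5 * (errs[mid - 1] + errs[mid])

# One extreme outlier barely moves the median, unlike the mean.
y_true = [-1.8, -1.2, -0.9, -0.4, -0.1]
y_pred = [-1.7, -1.3, -0.8, -0.5, 2.0]   # last prediction is way off
mae_med = median_absolute_error(y_true, y_pred)
mean_ae = sum(abs(a - b) for a, b in zip(y_pred, y_true)) / len(y_true)
```

Here the first four predictions each miss by 0.1, so the median error stays at 0.1 even though the single 2.1 miss pulls the mean error up to 0.5.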

(a)


(b)

Figure 9. (a) Absolute Error bar graph obtained from the difference between the actual property y and the predicted property ŷ for the different sets of test data. The training data fraction varies from 5% to 95%, and for each training data fraction, T = 10000 trials were generated. We see from the graph, and intuitively also, that as the training size decreases, the accuracy of prediction reduces and the variability increases. An increase in the training dataset increases the accuracy, but only to a minimum error of 0.104 eV and a minimum variability of 0.0095. To balance the tradeoff between accuracy and variability, the fraction of data used for training is between 30% and 55%. (b) The plot of the property predicted by the ensemble linear model against the actual property after choosing 55% of the data for training.

Overcoming nonlinearity by using decision trees: From Figure 9, obtained from the model fit between the selected variables and the Binding Energy, it is clear that the relationship is nonlinear, especially in the ranges [-2 eV, -1.5 eV] and [-0.2 eV, 0 eV]. In future work, we will use an ensemble of decision trees to capture the nonlinear relationship between the selected descriptors and the Binding Energy.

CHEMICAL INSIGHTS AND MODEL/HYPOTHESIS VALIDATION

We now discuss the principal chemical insights derived from our analysis and their connection with well-known ideas in the chemistry of materials, given here in the context of the widely used d-band model [17, 42]. In recent years, the d-band model has been used in understanding the effects of strain on the reactivity of metal surfaces [18, 43] and perovskite oxides [44]. It has been used as a predictive model in screening multi-metallic electrocatalysts for the Oxygen Reduction Reaction [45]. The d-band shape was also recently found to be relevant to the reactivity of transition metal alloys [46].
A new chemical idea of orbital-wise coordination number has been proposed by Xin et al. to be relevant to the prediction of the reactivity of metal nanocatalysts [47].

Validating from Prior Knowledge

While the finding that WF, AR, SE, IP, EA, EN and VAD2 depend only on the substrate is interesting, this is so by construction of the data, which was generated for substrates only. Thus, it only verifies that a blind statistical analysis effectively points to such correlations (which happen to be obvious here!). Such effectiveness of our analysis is also evident in our finding that the Pauling Electronegativity and the Ionization Potential are strongly correlated with each other, as was also pointed out earlier in Ref. [19]. The correlation between the Binding Energy and the filling factor is interesting. This can be understood in terms of the d-band model, in which the binding energy depends linearly on f [18, 2]. However, the data reveals that the d-band filling varies only from 0.87 to 0.98 in the entire database, while the Binding Energy exhibits quite a


diverse distribution. Thus, much of the variation in the energy arises from other factors that depend more sensitively on the other descriptors identified here (see the equation below). Our analysis shows that various attributes of the d-band (SDB, KDB, WDB and CDB) are directly or inversely correlated with each other. This forms the basis for why the d-band model has been so successful in a wide range of catalysis problems. It essentially means that the overall shape of the d-band (the peak in the density of states) is qualitatively similar in these alloys. If we approximate it by a triangle centered at the CDB, it is evident that the width of the d-band is inversely proportional to its height, because the area under the d-band curve is fixed. Thus, there is an inverse relation between the width of the d-band and the higher-order moments, the Skewness and the Kurtosis. A direct relation of the squared adsorbate-metal coupling with the width of the d-band can also be understood: the latter is a measure of the inter-atomic orbital coupling in the metal, and hence of the radius of the transition metal ion, i.e. the delocalization or spatial extent of the d-orbital. A more delocalized d-orbital facilitates direct overlap with the HOMO and LUMO orbitals of the adsorbate, and consequently a stronger adsorbate-metal coupling Vad. As a result, this coupling also results in a stronger binding energy (see Figure 10).

Modification of the d-Band Model

In addition to highlighting the relevance of the d-band center, the coupling between adsorbate and metal, the filling of the d-band and the atomic radii, our BoPGD (see Table 2) and PLS analyses (see Figure 8) reveal the importance of the Work Function in relevance to the binding energy. While the former are quite understandable and included in the d-band model (see [18]), our finding about the role of the work function suggests that the d-band model could be slightly modified and possibly improved!
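The d-band attributes discussed above can be checked numerically: for a discretized model d-band, the filling, center, width, skewness and kurtosis all follow from moments of the density of states, and a symmetric triangular band makes the relations explicit. The grid and band parameters below are illustrative only, not taken from the dataset:

```python
def band_moments(energies, dos, e_fermi):
    # Moments of a model d-band: center (1st moment), width (sqrt of
    # the 2nd central moment), skewness and kurtosis (normalized 3rd
    # and 4th moments), plus the filling up to the Fermi level.
    area = sum(dos)
    center = sum(e * d for e, d in zip(energies, dos)) / area

    def mu(k):
        return sum((e - center) ** k * d
                   for e, d in zip(energies, dos)) / area

    width = mu(2) ** 0.5
    skew = mu(3) / width ** 3
    kurt = mu(4) / width ** 4
    filling = sum(d for e, d in zip(energies, dos) if e <= e_fermi) / area
    return center, width, skew, kurt, filling

# Symmetric triangular d-band centered at -2.5 eV with half-width 2 eV
# on a uniform energy grid from -4.5 to -0.5 eV (illustrative numbers).
grid = [-4.5 + 0.01 * i for i in range(401)]
tri = [max(0.0, 1.0 - abs(e + 2.5) / 2.0) for e in grid]
center, width, skew, kurt, filling = band_moments(grid, tri, e_fermi=-1.0)
```

For this symmetric triangle the skewness comes out near zero and the width is fixed by the half-width alone, consistent with the shape-based reasoning in the text.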
In fact, both algorithms, BoPGD and the PLS analysis, show that there is an inverse relationship between the Binding Energy (KPI) and the Work Function. We now show that the work function is essential in aligning the energy scale of the metallic bands with the energies of the frontier orbitals of the adsorbate. From the d-band model depicted in Figure 10, it is clear that the energy of binding of an adsorbate with a metal results from the lowering of the energies of (a) its HOMO level and (b) the d-states of the metal. Within perturbation theory, both of these are directly proportional to the squared coupling between the adsorbate levels and the metallic d-band, and inversely proportional to the difference between the energies of the d-band and the molecular orbitals. The common reference of energy for the metallic d-band and the adsorbate orbitals is the vacuum energy (set to 0 eV in Figure 10). In earlier analyses in terms of the d-band model [2], the Fermi energy (εF) has been used as a reference. Since εF with respect to vacuum is given by the work function, W naturally enters the perturbative analysis of the metal-adsorbate interaction (see Figure 10, where we have obtained the energies of the HOMO and LUMO of CO from a DFT calculation). A typical term [2, 18] describing the energy of this interaction, as given in the SI of [2] and modified with the work function, is:

ΔEd = −4 [ f V²2π* / |ε2π* + (W + εd)| + α2π* V²2π* ] − 2 [ (1 − f) V²5σ / |ε5σ + (W + εd)| + (1 + f) α5σ V²5σ ],

where V²2π*,5σ = V²ad β(R0), R0 is the Atomic Radius, and β is an adsorbate-dependent function. Here, 5σ and 2π* label the HOMO and LUMO, ε5σ and ε2π* are their energies, and V5σ and V2π* are their couplings with the metal, which depend on the interaction of the adsorbate with the d-orbital of the metal through the adsorbate-dependent function β(R0). Finally, the α5σ and α2π* terms describe the repulsive interactions between adsorbate and metal, which depend on the overlap of their wavefunctions, i.e. Vad. From Figure 10, we note that the energy difference can be determined meaningfully only when the two energies are aligned on the same scale. If we set the vacuum energy to zero, we note that the location of the d-band center is given by −(W + εd). Thus, we argue that the metal-specific work function should enter the expressions of the d-band model rather than the energies relative to the Fermi energy, resulting in its slight modification. We note that the original and subsequent flavors of the formulation of the d-band model use the d-band center relative to the Fermi level, which implicitly includes information about the work


function. The general physics remains the same in our modification of the d-band model, but it has the advantage that the explicit inclusion of the work function facilitates referencing the Fermi energy to the vacuum, and hence to the HOMO/LUMO energies of a specific molecule, giving a more complete description of the metal-adsorbate interaction. Indeed, our machine learning and statistical analysis has not only identified the important descriptors that are the essential ingredients of the successful d-band model of catalysts, but has also pointed out an additional descriptor (the Work Function) that can be readily imported into the d-band model.
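The energy-scale alignment provided by the work function can be sketched as follows. The numerical values are illustrative only (order-of-magnitude for CO on a late transition metal, not from the paper's calculations), and the sign convention assumes εd is the depth of the d-band center below the Fermi level, matching the −(W + εd) form used above:

```python
def aligned_levels(W, eps_d, eps_homo, eps_lumo):
    # Put metal and adsorbate levels on a common vacuum-referenced
    # scale (vacuum = 0 eV). eps_d is the depth of the d-band center
    # below the Fermi level, so on the absolute scale the center sits
    # at -(W + eps_d). The gaps are the energy denominators of the
    # perturbative interaction terms.
    d_center = -(W + eps_d)
    return {
        "d_center": d_center,
        "gap_homo": abs(d_center - eps_homo),
        "gap_lumo": abs(eps_lumo - d_center),
    }

# Illustrative inputs: work function 4.6 eV, d-band center 2.5 eV
# below E_F, and vacuum-referenced CO frontier levels at -9.0 eV
# (HOMO, 5-sigma) and -2.0 eV (LUMO, 2-pi*).
levels = aligned_levels(W=4.6, eps_d=2.5, eps_homo=-9.0, eps_lumo=-2.0)
```

Changing W shifts the d-band center on the absolute scale and therefore changes both denominators, which is how the work function enters the metal-adsorbate interaction terms.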

Figure 10: A schematic of the interaction between the HOMO (5σ) and LUMO (2π*) orbitals of CO, the adsorbate molecule (right), with the d-band of a transition metal (left). It is clear that the center of the d-band can be specified with respect to vacuum by adding the work function (W) to its energy relative to the Fermi energy. The shaded arrows (green and purple) show the gain in energy of the d-band of the metal and the HOMO of the adsorbate resulting from their mutual interaction. The metal Fermi energy is for Cu, and the energy levels of the free CO molecule were calculated using PBE-DFT.

CONCLUSIONS

In conclusion, we have presented a recipe of statistical analyses and machine learning methodology developed here to help a materials scientist start from a raw dataset, build predictive models, and uncover the governing mechanism. We have applied this entire recipe to a published database containing 298 alloys [2]. Our proposed machine learning method based on BoPGD provides significant advantages over conventionally used methods of descriptor selection, such as artificial neural network and LASSO based methods, in the following ways:

1. Fingerprinting and Transferability: Our BoPGD method down-selects fewer descriptors (5 descriptors as compared to the 13 used in the Neural Network model of [2]) that are sufficient to describe the dataset variability and the Binding Energy, providing a significant advantage over artificial neural networks. Our method is expected to yield more transferable descriptors as new materials are added to the dataset, since the 10 experiments show tightly bound errors. This is because it is based on mechanism as opposed to fitting, and it therefore has great potential for extrapolating the model to new materials for consideration in design.

2. Stability and consistency of BoPGD over LASSO: Applying regression-coefficient methods like LASSO to this small dataset reveals their intrinsic instability, with different descriptors selected for different randomly chosen training and test sets, whereas BoPGD demonstrates remarkable stability by selecting the same descriptors, independent of the random sampling runs. LASSO need not converge to a unique answer, because multiple solutions are possible. Furthermore, increasing the number of features beyond the 5 descriptors down-selected by BoPGD did not decrease the error rates of the linear or nonlinear fits. The maximum error is 0.3 eV between the predicted and the


calculated BE, and the minimum error is 0 eV.

3. Computational efficiency of BoPGD over LASSO: BoPGD scales more efficiently to large sets of descriptors than LASSO. Without the bootstrap, the computational time complexity is O(M²N + M³) for LASSO against O(K²N + K log K) for BoPGD, where K is the reduced dimension of the fingerprint, M is the scaled-up descriptor space, and K ≪ M.
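The stability claim in point 2 can be quantified with a generic resampling harness: run a selector on many random subsamples and measure how often the same descriptor set comes back. The selector below is a simple correlation-ranking stand-in (not the paper's BoPGD or LASSO), and all data is synthetic:

```python
import random

def select_top_k(X, y, k):
    # Rank descriptors by |correlation with y| and keep the top k
    # (a stand-in for any selector: LASSO, BoPGD, ...).
    n, m = len(X), len(X[0])

    def score(j):
        col = [row[j] for row in X]
        mc, my = sum(col) / n, sum(y) / n
        cov = sum((a - mc) * (b - my) for a, b in zip(col, y))
        sc = sum((a - mc) ** 2 for a in col) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return abs(cov / (sc * sy))

    return frozenset(sorted(range(m), key=score, reverse=True)[:k])

def selection_stability(X, y, k, runs=50, seed=0):
    # Fraction of random 70% subsamples on which the selected set equals
    # the most common one: 1.0 means perfectly stable selection.
    rng = random.Random(seed)
    picks = []
    for _ in range(runs):
        idx = rng.sample(range(len(X)), int(0.7 * len(X)))
        picks.append(select_top_k([X[i] for i in idx],
                                  [y[i] for i in idx], k))
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / runs

# Strong 2-descriptor signal among 5 candidates: selection is stable.
rng = random.Random(7)
X = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(60)]
y = [4.0 * row[1] - 3.0 * row[3] + 0.1 * rng.gauss(0, 1) for row in X]
stability = selection_stability(X, y, k=2)
```

A stability score near 1.0 on repeated subsamples is the behavior reported for BoPGD, while a selector that flips between descriptor sets across runs would score much lower.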