Bias-Free Chemically Diverse Test Sets from ... - ACS Publications

Jul 19, 2017 - Michelle L. Coote,. ‡ and Amanda S. Barnard. †. †. Data61 CSIRO, Molecular & Materials Modelling, Door 34, Goods Shed, Village St...
0 downloads 0 Views 1MB Size
Research Article pubs.acs.org/acscombsci

Bias-Free Chemically Diverse Test Sets from Machine Learning Ellen T. Swann,*,† Michael Fernandez,† Michelle L. Coote,‡ and Amanda S. Barnard† †

Data61 CSIRO, Molecular & Materials Modelling, Door 34, Goods Shed, Village Street, Docklands, Victoria 3008, Australia ARC Centre of Excellence for Electromaterials Science, Research School of Chemistry, Australian National University, Canberra, Australian Capital Territory 2601, Australia



S Supporting Information *

ABSTRACT: Current benchmarking methods in quantum chemistry rely on databases that are built using a chemist’s intuition. It is not fully understood how diverse or representative these databases truly are. Multivariate statistical techniques like archetypal analysis and K-means clustering have previously been used to summarize large sets of nanoparticles however molecules are more diverse and not as easily characterized by descriptors. In this work, we compare three sets of descriptors based on the one-, two-, and three-dimensional structure of a molecule. Using data from the NIST Computational Chemistry Comparison and Benchmark Database and machine learning techniques, we demonstrate the functional relationship between these structural descriptors and the electronic energy of molecules. Archetypes and prototypes found with topological or Coulomb matrix descriptors can be used to identify smaller, statistically significant test sets that better capture the diversity of chemical space. We apply this same method to find a diverse subset of organic molecules to demonstrate how the methods can easily be reapplied to individual research projects. Finally, we use our bias-free test sets to assess the performance of density functional theory and quantum Monte Carlo methods. KEYWORDS: machine learning, benchmarking, quantum chemistry, bias-free test sets



INTRODUCTION

quantify and it is not known how truly representative these types of databases are. To overcome these challenges, multivariate statistical techniques can be used to screen molecules and remove human biases from the selection process of these data sets. Different dimension reduction techniques can be applied to find patterns in high complexity data sets. For example, kmeans clustering and principal component analysis are techniques for unsupervised pattern recognition commonly used to identify intrinsic patterns in chemical data.17,18 Clustering methods assign structures that share similar features to individual groups or clusters. Principal component analysis transforms the data into a new coordinate system of principal components using an orthogonal linear transformation where the axes are oriented to account for maximal variation in the data set. Archetypal analysis19 is a relatively new matrix factorization method that results in easily interpretable components similar to principal components but with added flexibility over clustering. First introduced in 1994 by Cutler and Breiman,19 it can identify any individual sample whose convex combinations of features optimally approximate the features of the entire data set, yielding a set of archetypes or “pure chemical structures” of the convex hull.20 This method was recently

The development of new quantum chemical methods requires extensive testing to demonstrate robustness and to identify potential weaknesses and shortcomings. This has resulted in the construction of numerous databases to try and standardize this process. The first of these databases were limited to basic molecular properties with well-established experimental values like atomization energies, ionization potentials and electron affinities of small molecules.1−9 With the advent of high-level methods these databases now include a much more diverse range of properties (e.g., GMTKN30, a collection of 30 general main group thermochemistry, kinetics, and noncovalent interaction test sets10 and the Minnesota Database 2.011). These databases are designed to represent a diverse subset of chemical space, but they are often biased by chemical intuition and a general inclination for “good” results, severely limiting the chemical space studied. This has led to overconfidence, and the application of methods that perform well for some test sets being applied to real life problems where they have failed (i.e., the B3LYP12,13 functional gained popularity after it was shown to have good results with atomization energies but can fail unexpectedly for other reactions10). The converse is also true; functionals can test poorly on the established test sets but perform much better for real-life problems (i.e., the PBE14 functional15). To try and remove this bias Grimme et al. used an algorithm to randomly generate structures,16 but while chemical diversity is intuitively understood it is challenging to Published XXXX by the American Chemical Society

Received: May 30, 2017 Revised: July 5, 2017 Published: July 19, 2017 A

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

function theory36 and offers a good compromise between accuracy and cost. The biggest limitation for DFT is the unknown form of the exchange-correlation functional and performance for DFT methods can be sporadic, depending on the approximation to this functional. Often the methods are parametrized for a given set of problems and will perform well for systems similar to the training set, but can fail catastrophically for other systems.37 Quantum Monte Carlo methods are a stochastic alternative to ab initio and DFT methods and have been shown to be highly accurate for energetic38,39 and structural properties.40

extended to nanoparticles in combination with k-means clustering to identify the most representative nanostructures with unique combinations of structural features.21 To apply these multivariate statistical methods to molecules the chemical structures need to be deconstructed into numerical descriptors. Traditional molecular structure representations like Cartesian coordinates or molecular formula are not amenable to these methods. Alternative representations use numerical vectors built using the physical, geometric or electrostatic properties22 but these whole-molecule, lowdimension descriptors convey very little information about the substructure of the molecule. Descriptors based on one-, two- or three-dimensional representations of a molecule are preferable. One-dimensional descriptors such as molecular fingerprints take the molecular formula and deconstruct it into a bit string with a simple check for the presence of a predefined set of functional groups. They can represent extremely highdimensional chemistry-spaces, varying between 150 to 200 bits for MDL applications, up to a few thousand for Tripos and Daylight applications, and even up to millions for pharmacore fingerprints.23 Two-dimensional descriptors use the molecular structure and contain information on the size, shape, symmetry, branching, connectivity and cyclicity. However, one and twodimensional descriptors do not contain any information on the stereochemistry of the molecule. They are identical for isomers and will be very sparse for small molecules. Three-dimensional descriptors consider the three-dimensional structure of the molecule and are invariant to rotations, translations and permutations of equivalent atoms. A wide range of these descriptors have been developed.24−31 The Coulomb matrix M26 is a matrix representation of the threedimensional structure of a molecule. Its entries are given by eq 1, ⎧ 0.5Z 2.4 for I = J I ⎪ ⎪ MIJ = ⎨ ZIZJ for I ≠ J ⎪ ⎪ |R I − R J | ⎩



COMPUTATIONAL METHODS

Data Set. The data set was built using geometries and energies taken from the CCCBDB.35 This is an extensive collection of molecules and energies but is not comprehensive for all methods and basis sets. Due to the nature of the data in this database not all molecules have energies for all methods. A compromise was made between using a highly accurate reference method and using as many data points as possible to sample a broad section of chemical space. We used molecules with ωB97XD/6-31G* geometries and energies for the data set and G4 energies for reference values. The final data set had 1499 molecules. Descriptors. We use the MACCS 166 keys41 fingerprints and 19 connectivity-based topological descriptors (outlined in the Supporting Information), both generated using RDKIT.42 We combined three versions of Coulomb matrices; random, sorted and eigenvectors, described in the Supporting Information. Correlated columns were removed for each set of descriptors so that only the column with the highest variance was retained for each intercorrelated pair (R2 > 0.9). This left a total of 149 fingerprint descriptors, 13 topological descriptors and 435 coulomb matrix descriptors. Statistical Analysis. Principal component analysis produces orthogonal latent variables amenable to interpretation by systematically extracting the variation of the CCCBDB data set into so-called “principal components”,43 expressed as uncorrelated linear combinations of the original geometrical features as described by

(1)

where RI and ZI are the Cartesian coordinates and nuclear charge of atom I respectively. Diagonal elements encode a polynomial fit of atomic energies to nuclear charge and offdiagonal elements of the matrix correspond to the Coulomb repulsion between atoms I and J. It was introduced to correlate chemical structure to accuracy of approximate quantum chemical methods and has been used for machine learning models.26,32,33 Machine learning systematically identifies similarities among data to make quantitative predictions.34 In this work we compare the performance of three classes of descriptors (one, two and three- dimensional) and demonstrate the functional relationship between the descriptors and electronic energy of a system using machine learning techniques. Molecules are taken from the NIST Computational Chemistry Comparison and Benchmark Database (CCCBDB).35 The CCCBDB is an online database of experimental and computed thermochemical data for a selected set of 1709 gas-phase atoms molecule, ranging in size from 1 to 26 atoms. It encompasses a vast array of properties including enthalpies of formation, entropies, heat corrections, geometries and atomic charges. We use our new test sets to assess the performance of density functional theory (DFT) and quantum Monte Carlo methods. DFT is a popular alternative to traditional ab initio wave

X = t 1p1′ + t 2p2′ + ... + t ApA′ + E = TP′ + E

(2)

where X is the original data matrix, A is the total number of extracted principal components, and E is the residual matrix. The new latent variables, t scores, show how the molecules relate to each other, while the p loadings reveal the importance of the original variables (structural features) for the patterns seen in the scores. In addition to condensing the geometrical information into principal components, the molecules were conveniently grouped into clusters using k-means clustering. This matrix factorization method partitions the n molecules into k clusters according to its nearest mean. A set of observations (x1, x2, ..., xn) corresponding to d-dimensional real vectors is allocated into k (≤n) sets S = {S1, S2, ..., Sk} so as to minimize the withincluster sum of squares, according to the equation below: 2

k

args min ∑



x − μi

i = 1 x ∈ Si

(3)

where μi is the mean of points in Si. B

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

metries taken the CCCBDB.35 DFT calculations use the 6311+G(3df,2p) basis set. We also compare the performance of MP2 and the G4(MP2)-6X composite method.65 G4(MP2) is widely accepted as a highly accurate method close to chemical accuracy37 and G4(MP2)-6X65 is an approximation to G4(MP2) with improved speed for larger molecules. All DFT calculations including the starting orbitals for diffusion Monte Carlo were performed using Gaussian.66 CCSD(T) and MP2 calculations were performed using Molpro.67 Diffusion Monte Carlo (DMC) calculations were performed using the CMQMC code.68 We use Slater-Jastrow trial wave functions of the form

As mentioned above, archetypal analysis describes the variance in the data set as linear combinations of a few “pure types” or archetypes, they do not need to be contained in the data. For each molecule to be represented as a mixture of archetypes, the n × m matrix X representing the data set with n molecules and m geometrical features is transformed to find a k × m matrix Z that corresponds to the archetypal patterns. Archetypal analysis yields the two n × k coefficient matrices α and β which minimize the residual sum of squares: n

RSS =

=



Xi −

∑ αijZj

i=1

j=1

n

k

∑ i=1

2

k

Xi −

n

ΨT = e J D↑D↓

2

∑ αij ∑ βjlXl j=1

l=1

where D↑D↓ are Slater determinants constructed from singleparticle orbitals, taken from B3LYP calculations. It has been shown that Kohn−Sham orbitals give better quantum Monte Carlo energies than Hartree−Fock orbitals.69 J is a Jastrow factor containing explicit electron correlation terms. Here the Jastrow factor we use is a sum of electron−electron (ee) and electron−nucleus (eN) terms. It has recently been shown that including electron−electron−nucleus (eeN) interactions has little effect on the final energy but increases computational time.70 The free parameters in the Jastrow factor were optimized by minimizing the total energy at the variational Monte Carlo level. Pseudopotentials are routinely used in diffusion Monte Carlo calculations to replace the chemically inert core electrons and reduce local energy fluctuations, speeding up the calculations. Here we use energy consistent Burkatzki−Filippi−Dolg pseudopotentials with the associated triple-ζ basis set and improved H atom potential71,72 with imaginary time step sizes τ = 0.01. Nonlocal pseudopotentials were treated beyond the locality approximation in diffusion Monte Carlo using the size-consistent T-moves approach.73

(4)

Under the the following constraints: (1) ∑jk= 1 αij = 1 with αij ≥ 0 and i = 1, ..., n and (2) ∑ln= 1 βjl = 1 with βjl ≥ 0 and j = 1, ..., k. Constraint 1 indicates that predictors of Xi are finite mixtures k of archetypes, Xi = ∑ j = 1 αijZj , while constraint (2) implies that archetypes Zj are convex combinations of the observations, n Zj = ∑l = 1 βjl Xl . This can be conveniently implemented in the R package archetypes.44 Cluster centroids and archetype patterns can be conveniently associated with the nearest molecules, which constitute prototypes and archetypes of the data set of structures as we have recently proposed.45 Machine Learning. The magnitude of the error, ΔE, defined as the difference between G4 and ωB97X-D3/6-31G* energies (ΔE = EG4 − EωB97X‑D3/6‑31G*), was correlated to the descriptors using random forests,46 which is a well established machine learning technique. Model optimization and feature selection was performed with 50% of the data set, while the remaining 50% of the data set was used to test the prediction ability of the models (see Supporting Information for details). Atomization Energies. The new test sets were used to assess the performance of a variety of methods. An estimation to the “gold-standard” coupled cluster with single, double and perturbative triple excitations extrapolated to the complete basis set limit (CCSD(T)/CBS) was made using eq 5 CCSD(T) MP2 ECBS = ECBS + (E CCSD(T) − E MP2)|smallbasisset

(6)



RESULTS AND DISCUSSION Test Sets. Principal component analysis was used to explore the variance within the descriptors (see Supporting Information). The fingerprints (FP) descriptors are essentially independent from each other; each variable only accounts for the presence or absence of a particular functional group. Not surprisingly, these descriptors have the lowest explained variance with the first, second and third principal components accounting for only 6.6%, 5.6%, and 4.6% of the variance respectively, for a combined total of only 16.7%. In comparison the principal components for the topological (TOPO) and Coulomb matrix (CMAT) descriptors recover a much higher proportion of the variance. The first, second and third principal components for TOPO explain 28.8%, 18.9% and 13.5% of the variance respectively (combined total of 61.2%). In the case of CMAT descriptors the first, second and third principal components explain 25.3%, 9.7%, and 7.0% of the variance respectively for a combined total of 42.0%. The CCCBDB is a diverse data set and so results in especially unique fingerprints. Principal component analysis is an orthogonal transformation and is not expected to recover much variance for uncorrelated variables. In contrast the TOPO and CMAT descriptors are more easily generalized by the principal components. There are fewer TOPO descriptors so the principal components will account for more variance. Despite having more variables (435 compared to 149) the

(5)

This difference does not depend significantly on the basis set and is a good approximation to CCSD(T)/CBS.47 EMP2 CBS was determined using the extrapolation scheme of Helgaker and coworkers48 and cc-pVTZ and cc-pVQZ basis sets.49 The small basis set difference was calculated at the cc-pVDZ level.49 For larger molecules (48, 167, 264, 768, 769, 773, 1433, 1477, and 1482) MP2 calculations are replaced with the approximate resolution of the identity MP2 (RI-MP2) method.50−52 Absolute and relative RI-MP2 energies agree well with MP2 values.53 We compare the accuracy of DFT using 10 different exchange-correlation functionals. The types of functionals evaluated include local density approximation (SVWN54,55), generalized gradient approximation (GGA) functional (PBE14), meta-GGAs (TPSS,56 M06L57), hybrid GGAs (B3LYP,12,13 PBE0, 58,59 ωB97XD 60 ), hybrid meta-GGAs (B1B95, 61 MPW1B95 62 ), and double-hybrid GGAs (B2PLYP, 63 mPW2PLYP64). We use the same ωB97XD/6-31G* geoC

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

Figure 1. Hierarchical clustering dendrograms of the cluster prototypes of (a) TOPO and (b) CMAT descriptors (molecules with at least one halogen atom (F, Cl, or Br) are highlighted in green).

nanoparticles21 and a test set of corrosion inhibitors.74 For each cluster we selected the molecule with the shortest Euclidean distance to the cluster centroid as the cluster prototype, which were used for further analysis. In k-means clustering the number of clusters is set a priori. A choice of too few clusters can miss important information whereas too many clusters becomes redundant. The optimum number of clusters was selected by analyzing the amount of explained variance as a function of number of clusters (see Supporting Information). To keep the number of clusters to a minimum we selected sufficient clusters to explain 70% of the explained variance (including more clusters would increase the explained variance, as required). This gave 10 TOPO prototypes and 66 CMAT prototypes (outlined in Supporting Information) The TOPO descriptors describe a chemical space that is more easily summarized by a select few prototypes whereas the CMAT space is quite diverse and spread out. Clustering the fingerprint descriptors recovers very little information; a consequence of the diversity of the data set.

principal components for CMAT recover more variance than for FP, suggesting the variables are more correlated. Although FPs routinely used in pharmaceutical research and the development of quantitative structure activity relationships, molecules in these applications are generally larger and the fingerprints are unique (but similar) and dense. The TOPO and CMAT descriptors may be readily reduced by removing descriptors with a large number of zeros without losing too much information, but in the dense FPs the zeros are meaningful and important and so they cannot be reduced in the same way. Reducing the dimensionality of the FPs loses too much information and introduces unwanted bias, potentially making them unsuitable for small molecules studies such as these. In general, however, the reason for the poor performance of the FP descriptors is a topic that warrants further research. To characterize the diversity of the data set we used k-means clustering to identify groups (clusters) of molecules that share structural or functional similarity based on the three types of descriptors. This process has previously been used for a set of D

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

Figure 2. Simplex plots of the (a) TOPO and (b) CMAT descriptors. Archetypes (large circles) are located toward the edges of the regular polygons and molecules are scattered as projections of the archetypes in the simplex. Prototypes are shown as triangles. Points are colored according to the magnitude of ΔE, defined as the difference between G4 and ωB97X-D3/6-31G* energies.

groups is next to azobenzene (48) with two benzene rings and a nitrogen double bond. The upper-right branch shows four molecules with very different structures, despite containing only C and H atoms. As we have stated, k-means clustering only identifies structures that characterize the main groups of molecules that share similar properties, which fails to identify structures that represent unique combinations of features and lie on the complex hull of the data set. To find the outliers archetypal analysis is needed to identify molecules with the shortest Euclidean distance to each “pure type”. The explained variance as a function of number of archetypes is shown in Supporting Information. To keep the analysis of archetypes consistent with the prototypes we selected enough archetypes to explain 70% of the variance, resulting in 5 TOPO archetypes and 13 CMAT archetypes (Supporting Information). Fingerprint descriptors are poorly summarized by archetypes since they represent a high-dimensional chemical space that cannot be encapsulated by a convex hull. Since each molecule can be described as a linear combination of archetypes we can plot the data on a two-dimensional simplex plots, shown in Figure 2. The edges are defined by the archetypes and all other molecules are scattered at relative positions given by the contributions of each archetype toward each molecule. Points are colored according to the magnitude of ΔE, defined as the difference between G4 and ωB97X-D3/631G* energies. Smaller errors are shown in indigo and larger errors in red. Generally we can see from these results that some archetypes are more important in describing the remainder of the data than others, though all are necessary to some degree. The clustering of the data, however, is an artifact of the arrangement of the archetypes in the simplex plot, and a more even distribution simply implies that more archetypes are needed to explain the variation in the data, and is not a reflection on the quality of the results. Regions of the plot that are sparsely populated represent a configuration space occupied by molecules that are not represented in the CCCBDB database, or do not exist.

Clustering recovers less variance for all three classes of descriptors when compared to principal component analysis. Hierarchical clustering was used to further analyze these prototype structures, measuring the similarity within the descriptors at different levels. The resulting dendrograms are shown in Figure 1a for TOPO descriptors and Figure 1b for CMAT descriptors. The TOPO prototypes are quite diverse and are depicted by nine hierarchical branches for the 10 prototypes. A broad range of functional groups is captured in this small set. Topological descriptors tend to emphasize physical structure or connectivity information over charge and this is seen in the grouping of the branches. Linear molecules (GeO2 (318), CH2CCCH2 (779), GeS2 (168)), and cage-like molecules (the P4 cluster (230) and adamantane (773)) belong to the same branches. From a chemical perspective the separation is not intuitive. For example GeO2 (318) is more similar to 1,2,3-butatriene (779) than GeS2 (168). The species selected as prototypes include molecules that are challenging from a theoretical perspective (i.e., P45,75 and diazene N2H276−95). These statistically relevant molecules are unlikely to be those intuitively selected by chemists but are necessary to provide cases that challenge the accuracy and suitability of quantum chemical methods. CMAT descriptors required 66 prototypes to account for 70% of the explained variance and the resulting dendrogram is crowded. By virtue of having more points there is a broader range of atoms and functional groups represented in the CMAT prototypes than the TOPO prototypes and molecules range in size from two to twenty-six atoms. CMAT descriptors are calculated using nuclear charge and the three-dimensional coordinates of the molecule, and this is reflected in the dendrogram. Highlighted in green are all molecules with at least one halogen atom (F, Cl or Br) and in general they share the same branches. Organic molecules are also grouped together; for example, 3-penten-1-yne (555) and 1-buten-3-yne (1401) both have a carbon double and triple bond and share the same branch. Despite this, not all grouping is intuitive. For example the upper-left branch has myo-inositol (1433) with 6 OH E

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

The TOPO and CMAT archetype molecules cover a broad range of molecule sizes; the largest molecule for both sets of descriptors has 14 heavy atoms (azobenzene (48) for TOPO and anthracene (167) for CMAT). Both descriptors include a small, hydrogen-containing archetype (the hydrogen radical (267) for TOPO and 2H (460) for CMAT) and for both descriptors this is the most important archetype. Using descriptors based entirely on the structural information on the molecules we have found archetypes which distribute molecules according to their error, or rather, how challenging they are for computational methods. In general the archetype molecules have relatively large errors, supporting the idea that archetypes are more unique and consequently more challenging molecules. CMAT and TOPO descriptors emphasize different attributes of a molecule i.e. topological descriptors highlight connectivity, but Coulomb matrices emphasize charge. This can be very useful when deciding what types of descriptors to use when seeking a new test set, depending on future research plans. It is not unexpected that there is little overlap between their prototypes and archetypes, only CBr4 (990) shows up as an archetype for both, the benzyl radical (682) is a prototype for both and azobenzene (48) is a TOPO archetype but a CMAT prototype. The broad distribution of the molecules in CMAT space is evidenced by the overlap of archetypes and prototypes. Azobenzene (990), 2,2,3,3-tetamethylbutane (1035) and octafluoropropane (1264) and are both CMAT prototypes and archetypes and it can be seen in the simplex plot that there are prototype structures lying on the archetype convex hull. Once identified, the archetypes and prototypes can be combined to form a reliable, statistically significant and diverse test set for quantum chemical methods. To assess the performance of CMAT and TOPO archetypes and prototypes in representing the database as a whole we compared the distribution of the errors. We define the error as the difference between the ωB97XD/6-31G* and G4 energies. Results are shown in Figure 3. The error distribution of the entire data set is shown in gray. The TOPO prototypes (light red) are more widely distributed than the CMAT prototypes (light blue) and the broad distribution of the CMAT prototypes more closely resembles that of the whole data set. In contrast, the TOPO archetypes (dark red) encompass molecules with both small and large errors. The CMAT archetypes are more widely distributed than the archetypes, but a skew toward larger errors is seen. For both descriptors the archetypes have larger errors than the prototypes. This again shows the prototypes are more representative structures whereas the archetypes (especially in the case of TOPO archetypes) capture molecules at the extreme ends of the distribution. These archetypes and prototypes were found using only structural information, yet the molecules identified as archetypes and prototypes reflect the energies of the whole data set. As seen in the dendrogram and simplex plots, these structures are not the ones that would be intuitively selected, and might even be selected against for being problematic systems (i.e., P4 and diazene N2H2). By removing human biases we have found a diverse and statistically significant subset of molecules. These new test sets are labeled CMolsC-196 (CMAT test set) and CMolsT-197 (TOPO test set) and are available online. The CCCBDB is an especially diverse data set but most applications using test sets focus on a specific property or functional group. To this end we devised a smaller test set containing all molecules with at least one C and one H atom

Figure 3. Error (ΔE) histogram for (a) the whole test set and (b) a smaller set of organic molecules, defined as the difference between the ωB97XD/6-31G* and G4 energies, measured in Hartrees (Ha).

and no transition metals, to mimic an organic test set that would normally be used by chemists. This resulted in 700 molecules. Given the number of molecules and the number of CMAT descriptors, we limited CMAT descriptors to only include ones that had values for more than 25% of the data set. This resulted in 194 variables compared to 435 which were used for the entire data set. Once correlated columns were removed there were 13 TOPO descriptors and 143 fingerprint descriptors. We performed the same principal component analysis, kmeans clustering and archetypal analysis as described above and results are shown in Supporting Information. For further analysis we selected archetypes and prototypes for 70% variance again, resulting in 5 TOPO archetypes, 11 TOPO prototypes and 62 CMAT prototypes (see Supporting Information). CMAT archetypes did not recover 70% of the variance and were omitted from further study. The dendrograms for the CMAT and TOPO prototypes are shown in Supporting Information. Once again we see the TOPO prototypes are quite diverse, with eight distinct branches for the 11 molecules. They cover a broad range of functional groups with an emphasis on functional groups containing oxygen. For the CMAT prototypes we see a similar grouping of halogencontaining molecules as was seen for the larger data set. F

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

Table 1. Details of the Optimum Regression Models of ΔE (Defined As the Difference between the ωB97XD/6-31G* and G4 Energies)

a

model

descriptor

parametersa

RCV

STDCV (Ha)

RTest

STDTest (Ha)

Random Forest

CMAT TOPO

n = 25, s = 2 n = 25, s = 2

0.975 0.739

0.032 0.105

0.964 0.733

0.041 0.104

n is the number of estimator trees and s is the minimum sample split in the random forest model.

Figure 4. Scatter plot of ΔE predictions for the test set consisting of 50% of the data set using TOPO (a) and CMAT (b) descriptors. ΔEActual refers to the values calculated by energy differences between the two methods and ΔEPredicted corresponds to the random forest predictions.

tion) appear in Table 1. The optimal random forest models exhibit cross-validation correlation coefficient (R2CV) of ∼0.97 and ∼0.74 for CMAT and TOPO descriptors, respectively. Similarly, scatter plots in Figure 4 illustrate the better transferability capacity of the CMAT machine learning model, where the accuracy of the predictions of ΔE values of the test set structures is >0.96, while the TOPO correlation is 0.2 units lower. The tridimensional information on the CMAT descriptors is more suitable for describing the energy difference between the two methods, which suggest that the set of archetype and prototype structures selected by this descriptor type are a better representation of the data set in the particular context of ωB97XD/6-31G* and G4 energy gaps. It is important to note that some molecules exhibit large energy differences (>2 Ha), evidencing that the database does contain sufficiently computationally challenging cases to be useful in this context (such as molecules with hypervalent bonding, halogens or second row atoms). Atomization Energies. While total atomic energies are readily computed, in practice they are rarely reported on. Energy differences are of more chemical interest but accurate values rely on good cancellation of errors. Electron correlation energy makes up a small part of the total energy but becomes significant for energy differences. Atomization test sets like the G(n) series1−6 are predominantly made up of small, stable molecules with well-defined experimental values. Our new CMolsC-1 and CMolsT-1 test sets provide a more robust test of a methods performance. Here, we use them to test the performance of a selection of DFT methods, as well as MP2, G4(MP2)-6X, and diffusion Monte Carlo. Results are reported in Table 2. For most methods the most challenging set of molecules is the CMolsC-1 test set, and the archetypes are generally more challenging structures than prototypes, with larger mean

The simplex plot for the TOPO archetypes is shown in Supporting Information. The TOPO archetypes again include the smallest molecule in the set (in this case the CH radical (832)) and one of the largest (adamantane (773)). Molecules are more evenly distributed in the simplex plot compared to Figure 2a but again molecules are distributed according to ΔE, despite no energetic information being included in the descriptors. The CMAT descriptor was not suitable for this smaller data set as there was too much variance in the data set in CMAT space to be encapsulated by the archetypes. Despite this, the distribution of molecules within the simplex plot for the TOPO descriptors shows it is suitable for this smaller set. The histogram of the errors of the archetypes and prototypes for the organic subset is shown in Figure 3b. The distribution of archetypes and prototypes is very similar to those found for the whole test set. The TOPO archetypes capture molecules at the extremes of the distribution again, whereas the prototypes for both descriptors better represent the entire subset. This reiterates the suitability of prototypes and archetypes for forming a reliable and statistically significant data set. In cases where the CMAT descriptor is too diverse, the TOPO descriptor is still appropriate. These new test sets are labeled CMolsC-org98 (CMAT organic test set) and CMolsT-org99 (TOPO organic test set) and are available online. We have demonstrated the use of our method for organic molecules but the same approach could be used to find test sets of more exotic species. Machine Learning Prediction of Energy Differences using TOPO and CMAT Descriptors Machine Learning. We calibrated regression machine learning models to predict the difference between the ωB97XD/6-31G* and G4 energies (ΔE) using the TOPO and CMAT descriptors. Random forest models trained with 50% of the data set sample around archetype and prototype structures (see Supporting InformaG

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

improvable by including more determinants. In a previous study of G2 atomization energies using a multideterminant trial wave function reduced the error from 2.1 to 1.2 kcal/mol.39 The same approach could reduce the error here.

Table 2. Mean Absolute Deviation from Estimated CCSD(T)/CBS Reference Values (kcal/mol) for Atomization Energies of the Molecules in the CMolsC-1 and CMolsT-1 Test Sets

SVWN PBE TPSS M06L B3LYP PBE0 ωB97XD B1B95 B2PLYP mPW2PLYP G4 MP2 DMC

CMolsC1A

CMolsC1P

CMolsT1A

CMolsT1P

overall

196.4 25.1 4.6 10.9 22.1 8.8 8.2 9.0 19.7 18.0 4.0 21.0 10.0

150.8 23.9 6.3 8.3 11.1 8.5 5.6 6.6 11.3 10.5 2.9 12.5 6.3

191.4 36.4 6.4 7.4 12.7 9.9 7.4 5.7 13.0 13.0 4.0 9.5 11.1

156.2 26.2 9.0 11.0 12.4 9.9 5.4 9.7 11.0 10.9 4.1 12.9 8.4

159.8 25.0 6.3 8.9 12.8 8.8 6.0 7.2 12.5 11.7 3.2 13.6 7.3



CONCLUSIONS The development of new quantum chemical methods requires extensive testing to demonstrate robustness and to identify potential weaknesses and shortcomings. Numerous databases are available to test and standardize this process. However, selecting robust structural test sets for specific problems is vulnerable to human bias and intuition. We have demonstrated that archetypes and prototypes are diverse subsets of molecular structures that can serve as reference to build robust calibration and test sets. Fingerprint descriptors, used in material science, are insufficient to describe chemical space, but both Coulomb matrix and topological descriptors were useful to identify the ideal subset of structures. Archetypal analysis found the molecules with the largest errors with reference to G4 methods without calculating the energy while the prototypes are a good representation of the data set as a whole. When tested on a smaller set of organic molecules, the archetypes and prototypes exhibit similar results. By calculating atomization energies for the CMolsC-1 and CMolsT-1 test set we have shown DFT performance depends on the parametrization and treatment of the exchange-correlation functional but diffusion Monte Carlo is a consistent method for challenging problems. These conclusions are difficult to draw from other test sets as they are biased by the same chemical intuition used to build the training sets. Prototypes and archetypes can produce test sets without need of assumptions, prior knowledge or a completed body of work, as we have shown with both the general case and the specific case of organic molecules. These methods can be used to create small, robust test sets that avoid the burden on computational resources that is presented by the existing biased test sets. The test sets presented here are useful for main-group and organic chemistry but these methods can be extended to create test sets for virtually any property or type of system. Test sets for more exotic molecules or metallic systems could be constructed in the same manner, using descriptors that favor these exotic features. High-cost electronic structure calculations are not required to identify potentially large errors or challenging structures where popular methods might fail; just a sound statistical analysis of the chemical diversity of the structures is required.

absolute deviation values for CMolsC-1A compared to CMolsC-1P. Results are not as consistent for CMolsT-1, functionals like M06L and B1B95 have larger errors for prototypes compared to archetypes but for ωB97XD and the double-hybrid functionals B2PLYP and mPW2PLYP the archetypes are more challenging. Topological descriptors align more with our understanding of chemical intuition compared to the Coulomb matrix descriptors, and the CMolsT1 test set highlights how parametrization and the treatment of the exchange−correlation functional affect DFT performance. In contrast, diffusion Monte Carlo is not a parametrized method and the errors are consistent for archetypes and prototypes from both sets. Regardless of accuracy, diffusion Monte Carlo is more transferrable. Density functionals can be grouped together according to their treatment of the exchange-correlation term in what is known as a “Jacob’s Ladder”.100 There is significant variation within the different classes of methods. LDA functionals such as SVWN form the lowest rung of the Jacob’s ladder and have the largest errors with an overall mean absolute deviation of 159.8 kcal/mol. The double-hybrid functionals B2PLYP and mPW2PLYP lie at the top of the Jacob’s ladder, but have comparable performance with lower level methods like B3LYP. They have been shown to be more basis-set dependent than hybrid functionals,101 but these two test sets highlight some potential shortcomings compared to other methods. The best DFT method is ωB97XD, with an overall error of 6.0 kcal/mol. ωB97XD is a long-range corrected functional and includes 100% Hartee-Fock exchange for long-range electron−electron interactions. It is known to perform well for challenging systems like the DC9 test set101 and performs better than other functionals for noncovalent systems.60 G4(MP2)-6X has the lowest mean absolute deviation of all methods studied here. It is a composite method that uses an additivity scheme based on ab initio wave function calculations but has been parametrized on a large test set. Diffusion Monte Carlo has an overall mean absolute deviation of 7.3 kcal/mol and performs better than the double-hybrid functionals but worse than ωB97XD and TPSS. These two new test sets include more exotic molecules with second and third-row atoms or hypervalent bonding that are generally more challenging for diffusion Monte Carlo methods. Unlike DFT methods, diffusion Monte Carlo is systematically



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acscombsci.7b00087. Details on the Coulomb matrix and topological descriptors, machine learning models, and explained variance, as well as information on the archetypes and prototypes (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Ellen T. Swann: 0000-0002-0465-0380 H

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

(17) Potyrailo, R.; Rajan, K.; Stoewe, K.; Takeuchi, I.; Chisholm, B.; Lam, H. Combinatorial and high-throughput screening of materials libraries: review of state of the art. ACS Comb. Sci. 2011, 13, 579−633. (18) Fernandez, M.; Boyd, P. G.; Daff, T. D.; Aghaji, M. Z.; Woo, T. K. Machine Learning Virtual Screening for Rapid and Accurate Recognition of High Performing MOFs for CO2 Capture. J. Phys. Chem. Lett. 2014, 5, 3056−3060. (19) Cutler, A.; Breiman, L. Archetypal Analysis. Technometrics 1994, 36, 338−347. (20) Stone, E.; Cutler, A. Archetypal analysis of spatio-temporal dynamics. Phys. D 1996, 90, 209−224. (21) Fernandez, M.; Barnard, A. S. Identification of particle Prototypes and Archetypes. ACS Nano 2015, 9, 11980−11992. (22) Katritzky, A. R.; Lobanov, V. S.; Karelson, M. CODESSA 2.0 (Comprehensive Descriptors for Structural and Statistical Analysis); University of Florida: USA, 1996. (23) Gardiner, E. J.; Gillet, V. J.; Haranczyk, M.; Hert, J.; Holliday, J. D.; Malim, N.; Patel, Y.; Willett, P. Turbo similarity searching: Effect of fingerprint and dataset on virtual-screening performance. Statistical Analysis and Data Mining 2009, 2, 103−114. (24) De, S.; Bartok, A. P.; Csanyi, G.; Ceriotti, M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 2016, 18, 13754−13769. (25) Sadeghi, A.; Ghasemi, S. A.; Schaefer, B.; Mohr, S.; Lill, M. A.; Goedecker, S. Metrics for measuring distances in configuration spaces. J. Chem. Phys. 2013, 139, 184118. (26) Rupp, M.; Tkatchenko, A.; Muller, K.-R.; Von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 2012, 108, 58301. (27) Bartók, A. P.; Gillan, M. J.; Manby, F. R.; Csányi, G. Machinelearning approach for one- and two-body corrections to density functional theory: Applications to molecular and condensed water. Phys. Rev. B: Condens. Matter Mater. Phys. 2013, 88, 054104. (28) Bartók, A. P.; Kondor, R.; Csányi, G. On representing chemical environments. Phys. Rev. B: Condens. Matter Mater. Phys. 2013, 87, 184115. (29) von Lilienfeld, O. A. First principles view on chemical compound space: Gaining rigorous atomistic control of molecular properties. Int. J. Quantum Chem. 2013, 113, 1676−1689. (30) Pietrucci, F.; Andreoni, W. Graph Theory Meets Ab Initio Molecular Dynamics: Atomic Structures and Transformations at the Nanoscale. Phys. Rev. Lett. 2011, 107, 085504. (31) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer New York: New York, NY, 2009. (32) Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R.; Von Lilienfeld, O. A. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 2013, 15, 095003. (33) Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; Von Lilienfeld, O. A.; Tkatchenko, A.; Muller, K. R. Assessment and validation of machine learning methods for predicting molecular atomization energies. J. Chem. Theory Comput. 2013, 9, 3404−3419. (34) Varnek, A.; Baskin, I. Machine learning methods for property prediction in chemoinformatics: quo vadis? J. Chem. Inf. Model. 2012, 52, 1413−1437. (35) NIST Computational Chemistry Comparison and Benchmark Database NIST Standard Reference Database Number 10, Release 18, October 2016. http://cccbdb.nist.gov/. (36) Kohn, W.; Becke, A. D.; Parr, R. G. Density Functional Theory of Electronic Structure. J. Phys. Chem. 1996, 100, 12974−12980. (37) Cohen, A. J.; Mori-Sánchez, P.; Yang, W. Challenges for Density Functional Theory. Chem. Rev. 2012, 112, 289−320. (38) Morales, M. A.; McMinis, J.; Clark, B. K.; Kim, J.; Scuseria, G. E. Multideterminant Wave Functions in Quantum Monte Carlo. J. Chem. Theory Comput. 2012, 8, 2181−2188. (39) Petruzielo, F. R.; Toulouse, J.; Umrigar, C. J. Approaching chemical accuracy with quantum Monte Carlo. J. Chem. Phys. 2012, 136, 124116.

Michelle L. Coote: 0000-0003-0828-7053 Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS



REFERENCES

Computational resources were supplied by the Australian National Computing Infrastructure (NCI) national facility under the Partner Allocation Scheme grant dr4. E.T.S. acknowledges support from the Office of the Chief Executive (OCE), CSIRO. We thank Claudia Filippi for sharing an unpublished hydrogen pseudopotential and basis set.

(1) Curtiss, L. A.; Jones, C.; Trucks, G. W.; Raghavachari, K.; Pople, J. a. Gaussian-1 theory of molecular energies for second-row compounds. J. Chem. Phys. 1990, 93, 2537−2545. (2) Curtiss, L. A.; Raghavachari, K.; Trucks, G. W.; Pople, J. A. Gaussian-2 theory for molecular energies of first-and second-row compounds. J. Chem. Phys. 1991, 94, 7221−7230. (3) Curtiss, L. A.; Raghavachari, K.; Redfern, P. C.; Pople, J. A. Assessment of Gaussian-2 and density functional theories for the computation of enthalpies of formation. J. Chem. Phys. 1997, 106, 1063−1079. (4) Curtiss, L. A.; Redfern, P. C. J. Chem. Phys. 1998, 109, 42−55. (5) Curtiss, L. A.; Raghavachari, K.; Redfern, P. C.; Pople, J. a. Assessment of Gaussian-3 and density functional theories for a larger experimental test set. J. Chem. Phys. 2000, 112, 7374−7383. (6) Curtiss, L. A.; Redfern, P. C.; Raghavachari, K. Assessment of Gaussian-3 and density-functional theories on the G3/05 test set of experimental energies. J. Chem. Phys. 2005, 123, 124107. (7) Lynch, B. J.; Truhlar, D. G. Robust and Affordable Multicoefficient Methods for Thermochemistry and Thermochemical Kinetics: The MCCM/3 Suite and SAC/3. J. Phys. Chem. A 2003, 107, 3898−3906. (8) Zhao, Y.; Truhlar, D. G. The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: Two new functionals and systematic testing of four M06-class functionals and 12 other function. Theor. Chem. Acc. 2008, 120, 215−241. (9) Peverati, R.; Truhlar, D. G. An improved and broadly accurate local approximation to the exchange-correlation density functional: The MN12-L functional for electronic structure calculations in chemistry and physics. Phys. Chem. Chem. Phys. 2012, 14, 13171. (10) Goerigk, L.; Grimme, S. Efficient and Accurate Double-HybridMeta-GGA Density Functionals?Evaluation with the Extended GMTKN30 Database for General Main Group Thermochemistry, Kinetics, and Noncovalent Interactions. J. Chem. Theory Comput. 2011, 7, 291−309. (11) Peverati, R.; Truhlar, D. G. Quest for a universal density functional: the accuracy of density functionals across a broad spectrum of databases in chemistry and physics. Philos. Trans. R. Soc., A 2014, 372, 20120476. (12) Becke, A. D. Density-functional thermochemistry. III. The role of exact exchange. J. Chem. Phys. 1993, 98, 5648−5652. (13) Stephens, P. J.; Devlin, F. J.; Chabalowski, C. F.; Frisch, M. J. Ab Initio Calculation of Vibrational Absorption and Circular Dichroism Spectra Using Density Functional Force Fields. J. Phys. Chem. 1994, 98, 11623−11627. (14) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation Made Simple. Phys. Rev. Lett. 1996, 77, 3865−3868. (15) Grimme, S.; Steinmetz, M.; Korth, M. How to compute isomerization energies of organic molecules with quantum chemical methods. J. Org. Chem. 2007, 72, 2118−2126. (16) Korth, M.; Grimme, S. Mindless” DFT Benchmarking. J. Chem. Theory Comput. 2009, 5, 993−1003. I

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

(40) Cleland, D. M.; Per, M. C. Performance of quantum Monte Carlo for calculating molecular bond lengths. J. Chem. Phys. 2016, 144, 124108. (41) Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273−1280. (42) RDKit: Open source Cheminformatics, 2016. http://www.rdkit. org. (43) Jackson, J. E. A User’s Guide to Principal Components; John Wiley & Sons, Inc., 1991; pp i−3. (44) Marinetti, S.; Finesso, L.; Marsilio, E. Archetypes and principal components of an IR image sequence. Infrared Phys. Technol. 2007, 49, 272−276. (45) Fernandez, M.; Wilson, H. F.; Barnard, A. S. Impact of distributions on the prediction of nanoparticle prototypes and archetypes. Nanoscale 2017, 9, 832−843. (46) Breiman, L. Random forests. Machine learning 2001, 45, 5−32. (47) Jurečka, P.; Hobza, P. On the convergence of the (ΔECCSD(T) - ΔEMP2) term for complexes with multiple H-bonds. Chem. Phys. Lett. 2002, 365, 89−94. (48) Halkier, A.; Helgaker, T.; Jørgensen, P.; Klopper, W.; Koch, H.; Olsen, J.; Wilson, A. K. Basis-set convergence in correlated calculations on Ne, N2, and H2O. Chem. Phys. Lett. 1998, 286, 243−252. (49) Dunning, T. H. Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen. J. Chem. Phys. 1989, 90, 1007. (50) Feyereisen, M.; Fitzgerald, G.; Komornicki, A. Use of approximate integrals in ab initio theory. An application in MP2 energy calculations. Chem. Phys. Lett. 1993, 208, 359−363. (51) Vahtras, O.; Almlöf, J.; Feyereisen, M. Integral approximations for LCAO-SCF calculations. Chem. Phys. Lett. 1993, 213, 514−518. (52) Bernholdt, D. E.; Harrison, R. J. Large-scale correlated electronic structure calculations: The RI-MP2 method on parallel computers. Chem. Phys. Lett. 1996, 250, 477−484. (53) Jurečka, P.; Nachtigall, P.; Hobza, P. RI-MP2 calculations with extended basis sets: a promising tool for study of H-bonded and stacked DNA base pairs. Phys. Chem. Chem. Phys. 2001, 3, 4578−4582. (54) Slater, J. C. A simplification of the Hartree-Fock method. Phys. Rev. 1951, 81, 385. (55) Vosko, S. H.; Wilk, L.; Nusair, M. Accurate spin-dependent electron liquid correlation energies for local spin density calculations: a critical analysis. Can. J. Phys. 1980, 58, 1200−1211. (56) Tao, J.; Perdew, J. P.; Staroverov, V. N.; Scuseria, G. E. Climbing the density functional ladder: nonempirical meta-generalized gradient approximation designed for molecules and solids. Phys. Rev. Lett. 2003, 91, 146401. (57) Zhao, Y.; Truhlar, D. G. A new local density functional for maingroup thermochemistry, transition metal bonding, thermochemical kinetics, and noncovalent interactions. J. Chem. Phys. 2006, 125, 194101. (58) Ernzerhof, M.; Scuseria, G. E. Assessment of the Perdew-BurkeErnzerhof exchange-correlation functional. J. Chem. Phys. 1999, 110, 5029−5036. (59) Adamo, C.; Barone, V. Toward reliable density functional methods without adjustable parameters: The PBE0 model. J. Chem. Phys. 1999, 110, 6158−6170. (60) Chai, J.-D.; Head-Gordon, M. Long-range corrected hybrid density functionals with damped atom-atom dispersion corrections. Phys. Chem. Chem. Phys. 2008, 10, 6615−6620. (61) Becke, A. D. Density-functional thermochemistry. IV. A new dynamical correlation functional and implications for exact-exchange mixing. J. Chem. Phys. 1996, 104, 1040−1046. (62) Zhao, Y.; Lynch, B. J.; Truhlar, D. G. Development and Assessment of a New Hybrid Density Functional Model for Thermochemical Kinetics. J. Phys. Chem. A 2004, 108, 2715−2719. (63) Grimme, S. Semiempirical hybrid density functional with perturbative second-order correlation. J. Chem. Phys. 2006, 124, 034108.

(64) Schwabe, T.; Grimme, S. Towards chemical accuracy for the thermodynamics of large molecules: new hybrid density functionals including non-local correlation effects. Phys. Chem. Chem. Phys. 2006, 8, 4398−4401. (65) Chan, B.; Deng, J.; Radom, L. G4(MP2)-6X: A Cost-Effective Improvement to G4(MP2). J. Chem. Theory Comput. 2011, 7, 112− 120. (66) Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Mennucci, B.; Petersson, G. A.; Nakatsuji, H.; Caricato, M.; Li, X.; Hratchian, H. P.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, J. A., Jr.; Peralta, J. E.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; Millam, J. M.; Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.; Farkas, O.; Foresman, J. B.; Ortiz, J. V.; Cioslowski, J.; Fox, D. J. Gaussian 09, revision E.01; Gaussian, Inc.: Wallingford, CT, 2009. (67) Werner, H.-J.; Knowles, P. J.; Knizia, G.; Manby, F. R.; Schütz, M. Molpro: a general-purpose quantum chemistry program package. Wiley Interdisciplinary Reviews: Computational Molecular Science 2012, 2, 242−253. (68) Per, M. C. CSIRO Quantum Monte Carlo Software Package; CSIRO: Canberra, Australia, 2016. (69) Per, M. C.; Walker, K.; Russo, S. P. How Important is Orbital Choice in Single-Determinant Diffusion Quantum Monte Carlo Calculations? J. Chem. Theory Comput. 2012, 8, 2255−2259. (70) Swann, E. T.; Coote, M. L.; Barnard, A. S.; Per, M. C. Efficient protocol for quantum Monte Carlo calculations of hydrogen abstraction barriers: Application to methanol. Int. J. Quantum Chem. 2017, 117, e25361. (71) Burkatzki, M.; Filippi, C.; Dolg, M. Energy-consistent pseudopotentials for quantum Monte Carlo calculations. J. Chem. Phys. 2007, 126, 234105. (72) Dolg, M.; Filippi, C. Private communication, 2014. (73) Casula, M.; Moroni, S.; Sorella, S.; Filippi, C. Size-consistent variational approaches to nonlocal pseudopotentials: Standard and lattice regularized diffusion Monte Carlo methods revisited. J. Chem. Phys. 2010, 132, 154113. (74) Fernandez, M.; Breedon, M.; Cole, I. S.; Barnard, A. S. Modeling corrosion inhibition efficacy of small organic molecules as non-toxic chromate alternatives using comparative molecular surface analysis (CoMSA). Chemosphere 2016, 160, 80−88. (75) Grimme, S. Accurate Calculation of the Heats of Formation for Large Main Group Compounds with Spin-Component Scaled MP2Methods. J. Phys. Chem. A 2005, 109, 3067−3077. (76) Winter, N. W.; Pitzer, R. M. Theoretical description of the diimide molecule. J. Chem. Phys. 1975, 62, 1269. (77) Parsons, C. a.; Dykstra, C. E. Electron correlation and basis set effects in unimolecular reactions. A study of the model rearrangement system N2H2. J. Chem. Phys. 1979, 71, 3025. (78) Jensen, H. J.; Joergensen, P.; Helgaker, T. Ground-state potential energy surface of diazene. J. Am. Chem. Soc. 1987, 109, 2895−2901. (79) Andzelm, J.; Sosa, C.; Eades, R. Theoretical study of chemical reactions using density functional methods with nonlocal corrections. J. Phys. Chem. 1993, 97, 4664−4669. (80) McKee, M. L. Catalyzed cis/trans isomerization of diazene. A computational study in the gas and aqueous phases. J. Phys. Chem. 1993, 97, 13608−13614. (81) Smith, B. J. Isomers and transition structures of diazene. J. Phys. Chem. 1993, 97, 10513−10514. (82) Angeli, C.; Cimiraglia, R.; Hofmann, H.-J. On the competition between the inversion and rotation mechanisms in the cis-trans J

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX

ACS Combinatorial Science

Research Article

thermal isomerization of diazene. Chem. Phys. Lett. 1996, 259, 276− 282. (83) Jursic, B. Ab initio and density functional theory study of the diazene isomerization. Chem. Phys. Lett. 1996, 261, 13−17. (84) Mach, P.; Masik, J.; Urban, J.; Hubac, I. Single-root multireference Brillouin-Wigner coupled-cluster theory. Rotational barrier of the N2H2 molecule. Mol. Phys. 1998, 94, 173−179. (85) Martin, J. M. L.; Taylor, P. R. Benchmark ab initio thermochemistry of the isomers of diimide, N2H2, using accurate computed structures and anharmonic force fields. Mol. Phys. 1999, 96, 681−692. (86) Stepanic, V.; Baranovic, G. Ground and excited states of isodiazene - an ab initio study. Chem. Phys. 2000, 254, 151−168. (87) Chattaraj, P. K.; Perez, P.; Zevallos, J.; Toro-Labbe, A. Theoretical study of the trans-N2H2 -> cis-N2H2 and F2S2 -> FSSF reactions in gas and solution phases. J. Mol. Struct.: THEOCHEM 2002, 580, 171−182. (88) Hwang, D.; Mebel, A. Reaction Mechanism of N2/H2 Conversion to NH3: A Theoretical Study. J. Phys. Chem. A 2003, 107, 2865−2874. (89) Pu, X.; Wong, N.-B.; Zhou, G.; Gu, J.; Tian, A. Substituent effects on the trans/cis isomerization and stability of diazenes. Chem. Phys. Lett. 2005, 408, 101−106. (90) Biczysko, M.; Poveda, L.; Varandas, J. Accurate MRCI study of ground-state N2H2 potential energy surface. Chem. Phys. Lett. 2006, 424, 46−53. (91) Chaudhuri, R. K.; Freed, K. F.; Chattopadhyay, S.; Sinha Mahapatra, U. Potential energy curve for isomerization of N2H2 and C2H4 using the improved virtual orbital multireference Møller-Plesset perturbation theory. J. Chem. Phys. 2008, 128, 144304. (92) Mahapatra, U. S.; Chattopadhyay, S. Evaluation of the performance of single root multireference coupled cluster method for ground and excited states, and its application to geometry optimization. J. Chem. Phys. 2011, 134, 044113. (93) Jana, J. Relative stabilities of two difluorodiazene isomers: density functional and molecular orbital studies. Reports Theor. Chem. 2012, 1, 1−10. (94) Musiał, M.; Łupa, L.; Szopa, K.; Kucharski, S. Potential energy curves via double ionization potential calculations: example of 1,2diazene molecule. Struct. Chem. 2012, 23, 1377−1382. (95) Sand, A. M.; Schwerdtfeger, C.; Mazziotti, D. Strongly correlated barriers to rotation from parametric two-electron reduceddensity-matrix methods in application to the isomerization of diazene. J. Chem. Phys. 2012, 136, 034112. (96) Swann, E.; Fernandez, M.; Barnard, A.; Coote, M. CMolsC-1 Quantum Chemical Test Set. v1. CSIRO Data Collection 2017, DOI: 10.4225/08/58bcf1565950a. (97) Swann, E.; Fernandez, M.; Barnard, A.; Coote, M. CMolsT-1 Quantum Chemical Test Set. v1. CSIRO Data Collection 2017, DOI: 10.4225/08/58bcf21ca85b6. (98) Swann, E.; Fernandez, M.; Barnard, A.; Coote, M. CMolsC-org Quantum Chemical Test Set. v1. CSIRO Data Collection 2017, DOI: 10.4225/08/58bcf3005e549. (99) Swann, E.; Fernandez, M.; Barnard, A.; Coote, M. CMolsT-org Quantum Chemical Test Set. v1. CSIRO Data Collection 2017, DOI: 10.4225/08/58bcf2cf53bbe. (100) Perdew, J. P.; Schmidt, K. Jacob’s ladder of density functional approximations for the exchange-correlation energy. AIP Conf. Proc. 2000, 577, 1−20. (101) Goerigk, L.; Grimme, S. Efficient and Accurate Double-HybridMeta-GGA Density Functionals?Evaluation with the Extended GMTKN30 Database for General Main Group Thermochemistry, Kinetics, and Noncovalent Interactions. J. Chem. Theory Comput. 2011, 7, 291−309.

K

DOI: 10.1021/acscombsci.7b00087 ACS Comb. Sci. XXXX, XXX, XXX−XXX