
Kinetic Energy of Hydrocarbons as a Function of Electron Density and Convolutional Neural Networks

Kun Yao and John Parkhill*

Department of Chemistry, University of Notre Dame, 251 Nieuwland Science Hall, Notre Dame, Indiana 46556, United States

J. Chem. Theory Comput. 2016, 12, 1139−1147. DOI: 10.1021/acs.jctc.5b01011
Received: October 24, 2015. Published: January 26, 2016.

ABSTRACT: We demonstrate a convolutional neural network trained to reproduce the Kohn−Sham kinetic energy of hydrocarbons from an input electron density. The output of the network is used as a nonlocal correction to conventional local and semilocal kinetic functionals. We show that this approximation qualitatively reproduces Kohn−Sham potential energy surfaces when used with conventional exchange-correlation functionals. The density which minimizes the total energy given by the functional is examined in detail. We identify several avenues to improve on this exploratory work by reducing numerical noise and changing the structure of our functional. Finally, we examine the features in the density learned by the neural network to anticipate the prospects of generalizing these models.



INTRODUCTION

The ground state energy of a many-electron system is determined up to an additive constant by the electron density n(r).1 However, the overwhelming majority of 'DFT' applications use the Kohn−Sham (KS) formalism, which instead yields the energy as a functional of a noninteracting wave function, which is in turn a functional of the density.2 KS-DFT introduces computational overhead in large systems because the Kohn−Sham determinant uses at least one wave function for each electron. In orbital-free (OF) DFT only one function, the density n(r), is needed. Indeed, existing OF software packages are able to treat systems roughly an order of magnitude larger than Kohn−Sham implementations on modest computer hardware.3−6 The total energy in OF-DFT can be written as

$$E_{\text{total}}^{\text{OF-DFT}} = T[n] + E_{\text{nu-el}}[n(r)] + E_{\text{Hartree}}[n(r)] + E_{\text{xc}}[n(r)] + E_{\text{nu-nu}} \qquad (1)$$

where T[n] is the kinetic energy functional. In the Kohn−Sham scheme, the noninteracting kinetic energy of a determinant, Ts, is used instead, and the correlation contribution to the kinetic energy (T − Ts) is usually accounted for by the exchange-correlation energy functional. Inexpensive and accurate density functionals are known for all parts of the energy besides the kinetic piece. One must only provide an accurate approximation to Ts[n] to enjoy the computational advantages of OF-DFT.7 In this paper we follow a totally naive and empirical route8 to approximations of Ts[n], based on convolutional neural networks. We call the approach CNN-OF.9−14 This approach is a three-dimensional extension of the pioneering work of the Burke group, based on other machine learning (ML) techniques.15 Our approximation is able to predict bonding and shell structure. It is designed to be compatible with existing Kohn−Sham exchange-correlation functionals and the algorithms used to evaluate them. The functional is nonlocal and is implemented in real space without a pseudopotential approximation.16,17 Other groups have reported quantitative approximations to the Born−Oppenheimer potential energy surface (BO-PES) using neural networks and other techniques.18−24 In terms of theoretical detail, our functional lies between KS-DFT and ML approximations to the BO-PES that depend only on atomic positions.18,19,22 OF-DFT is positioned to become an inexpensive approximation to Kohn−Sham theory and is promising in multiresolution schemes.25−28 Accelerating progress is being made toward accurate kinetic energy functionals; a complete review is beyond the scope of this paper.3,4,29−51 Most kinetic functionals are related to the local Thomas−Fermi (TF) or semilocal von Weizsäcker (VW) functionals, which are exact for the uniform electron gas and for one-orbital systems, respectively.52 They can be written as

$$T_{\text{TF}} = \int C_{\text{TF}}\, n(r)^{5/3}\, dr \quad \text{and} \quad T_{\text{VW}} = \frac{1}{8}\int \frac{|\nabla n(r)|^{2}}{n(r)}\, dr$$

where $C_{\text{TF}} = \frac{3}{10}(3\pi^{2})^{2/3}$. Extensions to these functionals can be classified by their locality, the first group being semilocal approximations based on gradient information and physical constraints.4,36−42,45−49,53,54 The accuracy of modern generalized gradient approximation (GGA) kinetic functionals has remarkably reached ∼1% for atoms.37,45,50,51,54,55 Chemical bonds are described by some of these functionals,37,45,56 but


further improvements will be required for these functionals to become routine tools. Nonlocal functionals of the density are another major class, and there are two important subtypes: two-point functionals based on a physical relation between the density response and the kinetic energy57−62 and empirical functionals based on the kernel method (KM) of machine learning.63,64 When combined with pseudopotentials, the nonlocal two-point functionals usefully predict bulk properties of metals, semiconductors,29,33,43 and even some molecules.33,34 Including angular momentum dependence further improves the accuracy of these functionals.59 However, they require a local pseudopotential and perform best when the metallic Lindhard density response function is an accurate model. The other class of nonlocal functionals appeared recently: empirical functionals based on the kernel method of machine learning.15 Groundbreaking studies of kernel-method functionals have been restricted to one-dimensional models of molecules but have demonstrated several promising features, including bonding behavior.65 This work is related to the KM approach but makes several different design choices:
• Our functional takes the form of an enhancement functional, F{r,n}, for a hybrid of the TF and VW functionals. The computational effort required to evaluate F{r,n} is independent of system size.
• We use convolutional neural networks (CNNs) rather than the KM to learn the enhancement functional.
• CNN-OF is evaluated in real space, and no pseudopotential (PS) is required.
• Our functional targets the positive semidefinite, noninteracting Kohn−Sham kinetic energy density τ+,66 so that we can work with an integrand of positive sign.
We explain the motivation and impact of each of these design choices in the remainder of this paper. Our calculations will use common exchange-correlation approximations and assume that correlation's contribution to the kinetic energy, T[n] − Ts[n], is accounted for by the exchange-correlation energy functional.67 For our OF functional to be practically useful, it must be compatible with existing KS functionals, and so this approximation is a useful expedient that could be relaxed in future work. Note that there is no unique kinetic energy density;66,68 modeling τ+ has practical advantages. There are known conditions on the exact XC functional that are not satisfied by the approximate XC functionals69 in common use today. Likewise, there are known features of the exact kinetic functional that are not satisfied by this work,70 including density scaling relations,71,72 response relations, and asymptotic limits.73 We expect that enforcing these conditions will be key to future developments. Within our scheme it is simple to enforce physical constraints with data, using the same techniques that enforce rotational invariance in CNNs which classify images: rotated versions of the input data13 are used to augment the data set, causing the network to respect the principle of rotational invariance. We defer an investigation of these constraints and acknowledge from the outset that our functional is neither exact nor uniquely defined. This paper simply establishes that convolutional neural networks are able to predict the Kohn−Sham kinetic energy in three-dimensional molecules from the electron density.

FUNCTIONAL FORM

Like most exchange-correlation functionals, the form of our nonlocal kinetic energy is based on a local kinetic energy density that is exact for a physical model system (VW or TF) multiplied by a unitless nonlocal enhancement functional F, defined as

$$F_{\text{tf(vw)}}\{r, n\} = \tau_{+}(r)/\tau_{\text{tf(vw)}}(r) \qquad (2)$$

where n is a sample of the density at position r, $\tau_{+}(r) = \frac{1}{2}\sum_i |\nabla\phi_i(r)|^{2}$ is the positive semidefinite Kohn−Sham kinetic energy density, and τtf(vw) is the local TF or VW kinetic energy density.68 The total noninteracting kinetic energy can be written as

$$T_s\{n\} = \int F_{\text{tf(vw)}}\{r, n\}\, \tau_{\text{tf(vw)}}(r)\, dr \qquad (3)$$
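To make eqs 2 and 3 concrete, the short sketch below evaluates the two local kinetic energy densities, the corresponding enhancement factors, and the quadrature sum for Ts. The density values, gradient magnitudes, τ+ values, and grid weights are fabricated placeholders; in the paper these quantities come from a converged Kohn−Sham calculation on a molecular quadrature grid.

```python
import numpy as np

# Hypothetical quadrature data: only the density, its gradient magnitude,
# a reference tau_plus, and the grid weights are needed for this sketch.
n      = np.array([0.9, 0.5, 0.2, 0.05])     # electron density at grid points (a.u.)
grad_n = np.array([0.4, 0.3, 0.15, 0.02])    # |grad n| at the same points
tau_p  = np.array([0.55, 0.22, 0.07, 0.01])  # positive KS kinetic energy density
w      = np.array([0.1, 0.2, 0.4, 0.8])      # quadrature weights

c_tf   = 0.3 * (3.0 * np.pi**2) ** (2.0 / 3.0)   # C_TF = (3/10)(3 pi^2)^(2/3)
tau_tf = c_tf * n ** (5.0 / 3.0)                 # Thomas-Fermi kinetic energy density
tau_vw = grad_n ** 2 / (8.0 * n)                 # von Weizsaecker kinetic energy density

F_tf = tau_p / tau_tf    # enhancement factor relative to TF (eq 2)
F_vw = tau_p / tau_vw    # enhancement factor relative to VW

# Eq 3: the noninteracting kinetic energy is the weighted sum of F * tau_local.
T_s = np.sum(w * F_tf * tau_tf)   # equals np.sum(w * tau_p) by construction
print(T_s)
```

Because F multiplied by the local kinetic energy density reproduces τ+ pointwise by construction, the quadrature sum recovers the Kohn−Sham kinetic energy when F is exact; the CNN's task is to predict F without access to the orbitals.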

One can write the nonlocal part of a two-point functional57−62 in a similar form, $T_{\text{NL}}\{n\} = \int\!\!\int K(r, r', n(r), n(r'))\, dr\, dr'$, where K is related to the density response function. The two-point functionals are a continuous integral convolution. In our approach a discrete convolution of the density is performed, but there is no physical significance in the similarity. As described in the next section, a fixed number of density samples, {n(r′m)}, are passed to the convolutional neural network for each r, and the network outputs Ftf(vw). The computational cost of evaluating F at each point is a constant, independent of the size of the molecule. The overall cost of evaluating our model for Ts is extensive with the number of grid points. We choose to use τ+ as the local KE density since it is positive, which avoids numerical problems, and the ratios of τtf and τvw to τ+ are well-behaved functions. Ideally, Ftf should equal 1 for any constant density input, and Fvw should likewise equal 1 for any one-orbital system.74 We have examined several different choices of Ftf vs Fvw, as described later on.
Convolutional Neural Networks. If the set of orbitals generating the density is known, calculating the noninteracting kinetic energy is trivial. This observation suggests that Ts[n] can be thought of as recognizing orbitals that sum to the density and that robust statistical models for recognition are a logical choice for kinetic energy functionals. CNNs have emerged over the past few decades as the most powerful models for classification of image data. Previous 1D machine-learning work has employed a KM, kernel ridge regression (KRR), to learn the kinetic energy functional.15,64,65,75 The KM benefits from a straightforward, deterministic learning process. However, it is difficult to scale kernel methods to large amounts of data with high dimensionality.76 Neural networks (NNs), and in particular convolutional neural networks, are known for their ability to digest vast, high-dimensional data sets. The universal approximation theorem77 shows that neural networks are capable of approximating arbitrary maps on compact subspaces,78 which makes them appealing models for density functional approximations. NNs, depicted in Figure 1, are nonlinear functions of many variables, structured into 'layers'. Each layer, i, is composed of many functions ('neurons', m) which take several input variables, y_n^{i−1}, from the previous layer. The neuron multiplies its inputs by learned parameters ('weights', w_n^{i,m}) and sums the result with a bias b^{i,m}. The result is fed into a nonlinear activation function, usually f(x) = max(0,x) or tanh(x). The activation function used in this work is the rectified linear unit (ReLU), f(x) = max(0, x), which we choose over tanh(x) because it is positive and unbounded like F, our target. The formula for the output of the mth neuron in the ith layer is thus $y^{i,m} = f\!\left(b^{i,m} + \sum_n w^{i,m}_{n} y^{i-1}_{n}\right)$. Inputs for the next layer are taken from





the outputs of the previous layer. All layers except for the first and last are called "hidden" layers. In our work the last layer has only a single output, the desired kinetic enhancement factor, F. The values of all the weights and biases are initialized randomly and are optimized by gradient descent,10 minimizing the prediction error of the network, which is the squared difference between the target enhancement factor and the output of the NN. It is advantageous to have as few parameters as possible. A neural network with excess parameters will learn the probability distribution of its training data, and since the training set must be finite, this ruins the performance of the network on independent tests. Conversely, the difference between training and test error is a measure of how different the distributions of the training and test sets are. If the data have spatial locality, then distant inputs y_n^1 are uncorrelated, and it is useful to constrain the shape of the network to eliminate the weights which would couple distant inputs. Convolutional neural networks are a locally constrained form of neural network, inspired by the structure of the animal visual cortex.79 They are appropriate for data like images (and the electron density) which have local features. The functions in a convolutional layer are broken up into 'filters'. A filter is a rectangular collection of neurons that are forced to have the same weights and biases. The weights and biases of a filter are structured as matrices because the input is rectangular, and they have a small number of weights to model local features. Each function in a filter receives only a limited sample of the previous layer's output. The input fields received by the neurons in the filter are spread over the input data as in a discrete two-dimensional convolution. The right panel of Figure 1 shows how four neurons in a filter locally sample the previous layer. The part of a convolution due to a neuron mn in a filter of size p × q, producing output y^i_{mn}, is given by the formula

$$y^{i}_{mn} = f\!\left(b^{i} + \sum_{a=0}^{p}\sum_{b=0}^{q} w^{i-1,i}_{ab}\, y^{i-1}_{(m+a)(n+b)}\right) \qquad (4)$$

where y^i_{mn} is the value of the neuron in layer i at position (m,n), and w_ab is the weight matrix of the filter. Filters in the convolutional layer capture different features of the data in the previous layer.80 They reduce the dimension of their input and improve the robustness of the NN.80,81 Since the weight matrices w^{i−1,i}_{ab} are shared across an area in convolutional layers, the number of parameters in a convolutional network is much smaller. For data with locality, CNNs generalize better than fully connected networks.80,81 The network's output is computed in a series of tensor contractions reminiscent of the coupled-cluster equations.82 Both models derive flexibility by nonlinearly transforming their input.

Figure 1. Left panel: A schematic of a typical fully connected neural network. It includes an input layer, two hidden layers, and an output layer. Right panel: A simple example shows how four neurons in a convolutional filter locally sample the previous layer. The weight matrix of this filter is smaller than the dimension of the input to this layer. Neurons in the filter overlap to cover the input.

Choices of Input and Network Shape. The density of an entire molecule is an impractical amount of information to provide to a neural network to predict F. Partial samples of n(r)83 also determine the energy in principle, but functionals based on extremely small volumes of density must be numerically unstable, so we sought a compromise. While designing our functional we looked at the shape of F for several small molecules and also experimented with several sorts of input. Based on the structure of the exact F, we allow our neural network's model of F to depend on two one-dimensional lines of the density centered at r′. These lines are oriented toward the nearest nuclei to let the functional perceive nearby shell structure. This choice of input is arbitrary and should be refined in future work. The bottom panel of Figure 3 shows the results of some experiments with different versions of the density. The dimensionless gradient along the two lines mentioned above is the scheme used for our results, but we also experimented with the density, the square root of the density, and other variations. The density itself does not display much shape and has a large dynamic range. The reduced gradient, s(r) = |∇n(r)|/(2k_F(r)n(r)), is a better choice: it is dimensionless and has more local shell structure.84 One can see from Figure 3 that using s as input results in a large improvement over using n^{1/2} as input. Each vector of s fed into the network is normalized using the local response normalization function81

$$n(s_x) = \frac{s_x}{\left(1 + 0.01\sum_{x'=x-5}^{x+5} (s_{x'})^{2}\right)^{0.5}} \qquad (5)$$

where x is the position in the line. The normalized s further improves numerical stability by focusing the network on spatial features. The performance of CNNs depends on their layer structure, which must be found by a combination of intuition and trial and error.85 Small networks do not have enough complexity to model their training data. Large networks which are underdetermined by their training data do not generalize. The top panel of Figure 3 shows the learning curves of three different CNNs with different numbers of convolutional layers. Predictably, a CNN with more convolutional layers has less training error; however, the test error shows the actual performance of the CNN, and one can see that the test errors of the CNNs with three and four convolutional layers are quite similar. Based on this test and the computational constraints of evaluating the network in our GPU-CNN code, we chose three convolutional layers as our production model. The performance of a fully connected neural network and of KRR has also been tested, and the results of both are shown in Table SI-2. A CNN has a smaller test error than both methods. The size of the CNN we settled on is summarized in Figure 2 and Table SI-1. In this paper we focused on training a kinetic energy functional for alkanes. The functional is trained on minimum energy structures and ten randomized structures of seven molecules: butane, 2,5-dimethylhexane, 2-ethyl-3,3-dimethylpentane, octane, tetramethylbutane, 2,3,3,4-tetramethylpentane, and 5-ethyl-2-methylheptane.
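A minimal sketch of the input pipeline just described, assuming the standard definition k_F(r) = (3π²n(r))^{1/3} for the Fermi wave vector (the paper does not spell this out): compute the reduced gradient along a sampled line, normalize it with eq 5, and pass it through a single ReLU convolution filter as in eq 4 (shown here in its 1D form). The density line and the filter weights are invented for illustration; the production filter sizes are those of Table SI-1.

```python
import numpy as np

def reduced_gradient(n, dx):
    """Dimensionless s(r) = |grad n| / (2 k_F n), assuming k_F = (3 pi^2 n)^(1/3)."""
    grad = np.gradient(n, dx)
    k_f = (3.0 * np.pi**2 * n) ** (1.0 / 3.0)
    return np.abs(grad) / (2.0 * k_f * n)

def local_response_norm(s, size=5, alpha=0.01):
    """Eq 5: divide each s_x by (1 + alpha * sum of s_x'^2 over a window)^(1/2)."""
    out = np.empty_like(s)
    for x in range(len(s)):
        lo, hi = max(0, x - size), min(len(s), x + size + 1)
        out[x] = s[x] / (1.0 + alpha * np.sum(s[lo:hi] ** 2)) ** 0.5
    return out

def conv_filter_1d(x, weights, bias):
    """One ReLU filter sliding along a line, the 1D analogue of eq 4."""
    p = len(weights)
    y = np.array([bias + np.dot(weights, x[m:m + p]) for m in range(len(x) - p + 1)])
    return np.maximum(0.0, y)   # rectified linear unit

# Made-up density line sampled toward the nearest nucleus (not real data).
dx = 0.1
line = np.exp(-np.abs(np.linspace(-5.0, 5.0, 101))) + 1e-3

s = local_response_norm(reduced_gradient(line, dx))
feature = conv_filter_1d(s, weights=np.array([-1.0, 2.0, -1.0]), bias=0.0)
print(feature.shape)
```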


optimize the network. Because the amount of programming effort is quite significant, we defer further investigation of density optimization to future work, but we show that our network finds a physical density minimum and that this minimum is stable along bonding coordinates. In all other sections the density we feed into the CNN is the converged KS density. Each molecule is separated into three parts: a near-carbon-core region, a valence region, and a tail region. The near-carbon-core region includes grid points which lie less than 0.17 bohr from the nearest carbon nucleus, and the learning target for grid points in this region is Ftf. This region usually has less than 1% of the total grid points but contributes more than 20% of the total kinetic energy. Because the density in this region is also chemically inert, it is treated separately. As mentioned above, Ftf grows nonlinearly at long range, whereas Fvw nicely converges to 1, so Fvw was chosen as the learning target for grid points in the tail region, which lie at least 2.3 bohr from the nearest nucleus. All other grid points make up the valence region, which is most sensitive to bonding, and the learning target for grid points in this region is Ftf. Each region was trained separately but with the same CNN architecture. To learn the functional we minimize the sum-squared F-prediction errors over the training densities as a function of the network's weights and biases. This error is itself a nonlinear function, which we minimize from a random guess using stochastic gradient descent86 with analytic gradients provided by backpropagation.81,87 The training was performed on an Nvidia Tesla K80 GPU to accommodate the memory demands of the density input. L2 regularization88 was applied to prevent overfitting. Minibatch stochastic gradient descent89 with momentum, using batches of 128 samples, was used during training. With this CNN structure and training scheme, the training wall-time for each part is shown in Table SI-3. It is worth mentioning that our learning target is the kinetic energy enhancement factor instead of the kinetic energy itself. Cancellation of errors in the learning objective does not lead to cancellation of errors in the kinetic energy. This is the reason that the noisy error is uniformly distributed over grid points. These results could be further improved by manipulating the learning target.
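The training procedure described above, minibatch stochastic gradient descent with momentum, batches of 128 samples, and L2 regularization, can be sketched as follows. The toy linear model, the learning rate, the momentum, and the regularization strength are placeholders standing in for the CNN and for hyperparameters that are not published in the paper.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=1e-3, momentum=0.9, l2=1e-4):
    """One minibatch update: v <- mu*v - lr*(dL/dw + l2*w); w <- w + v."""
    for key in params:
        velocity[key] = momentum * velocity[key] - lr * (grads[key] + l2 * params[key])
        params[key] += velocity[key]

# Toy linear model trained on random data, standing in for the CNN.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1024, 10)), rng.normal(size=1024)
params = {"w": np.zeros(10), "b": np.zeros(1)}
velocity = {k: np.zeros_like(v) for k, v in params.items()}

batch = 128
for epoch in range(5):                      # an epoch is one pass over the training set
    for i in range(0, len(X), batch):
        xb, yb = X[i:i + batch], y[i:i + batch]
        pred = xb @ params["w"] + params["b"]
        err = pred - yb                     # residual of the squared-error objective
        grads = {"w": 2 * xb.T @ err / len(xb), "b": np.array([2 * err.mean()])}
        sgd_momentum_step(params, grads, velocity)
```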

Figure 2. The orange dot is the quadrature point (r) at which the functional is being evaluated. The two lines that sample the normalized reduced gradient and intersect this point are the inputs to the network, which consists of three convolutional and two fully connected layers. The final layer outputs F(r).
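A rough sketch of the network in Figure 2, written in PyTorch purely for illustration (the paper's implementation was built on Cuda-Convnet): two input channels, one per sampled line of normalized s, pass through three convolutional layers and two fully connected layers to produce the scalar F(r). The channel counts, kernel sizes, and line length below are invented placeholders, not the values listed in Table SI-1.

```python
import torch
import torch.nn as nn

class EnhancementCNN(nn.Module):
    """Illustrative analogue of the Figure 2 architecture; layer sizes are guesses."""
    def __init__(self, line_length=1000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 8, kernel_size=9, padding=4), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(8 * line_length, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s_lines):            # s_lines: (batch, 2, line_length)
        h = self.conv(s_lines)
        return self.fc(h.flatten(start_dim=1)).squeeze(-1)   # predicted F at each point

model = EnhancementCNN()
example = torch.rand(4, 2, 1000)           # four quadrature points, fabricated input
print(model(example).shape)                # torch.Size([4])
```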



RESULTS

Our intended applications for these functionals are force-field-like approximations to the BO-PES and an inexpensive method to solve for the densities of large systems. The wall-time cost of evaluating the local kinetic energy at each grid point is constant regardless of system size, and the number of quadrature grid points in a molecule grows linearly with system size. One only needs to add up the local kinetic energy contribution of each point, and so the functional is trivially linear scaling and trivially parallel. We demonstrate that the functional learns properties of the kinetic energy rather than merely the total energy by showing its ability to predict F throughout different regions of a molecule. We then show that the functional is accurate enough to predict bonding semiquantitatively and produces smooth PESs despite its nonlinearity, by examining its accuracy in describing the bonds in ethane and in predicting the kinetic energy along a KS molecular dynamics (MD) trajectory of 2-methylpentane. We also examine the features learned by the CNN and the minimum density found by the network. Test molecules do not occur in the training set that was used to optimize the CNN, differing in both their bonding and geometry. A locally interfaced hybrid of a modified BAGEL code,90 using the libXC library,91 and Cuda-Convnet81 was

Figure 3. Top: Ftf prediction errors (unitless) of CNNs with different numbers of convolutional layers for the valence region, as a function of learning epochs. An epoch is one pass of gradient steps over the whole training set. CNNs with more convolutional layers produce less training error; however, the test error saturates with three layers. Bottom: Learning curves of CNNs with different input types. A CNN using n^{1/2} as input is less accurate than a CNN using the normalized s as input.

2% of the standard xc-grid quadrature points of each structure were sampled as training samples: 3.7 gigabytes of data. Optimization of the density is discussed in the last section. In our current implementation it is very costly to take the functional derivative of the CNN, although this operation can be implemented with the same speed as the gradient used to


structure is 0.03. Considering the range of Ftf, between 0 and 13, the stochastic nature of our minimization, and the single-precision arithmetic used, the CNN's Ftf is remarkably accurate. The shape of the error is relatively uniform noise distributed throughout the volume of the molecule, on the order of our numerical precision. This noisy error is a challenge facing CNN-OF, especially because noise near the core can render the functional chemically inaccurate. The noise is related to the nonlinearity of the CNN and our limited-precision arithmetic. Future work should separate these sources of error.96 One can imagine several remedies: solving for an ensemble of networks, training on adversarial examples, or denoising.97 We will show in the following sections that although noise is the dominant problem, it does not preclude chemical applications of CNN-OF. The Supporting Information applies 37 kinetic functionals to the test cases and shows that CNN-OF predicts hydrocarbon bonding better than the GGAs at our disposal. However, we omit these data from the main text because comparisons of physically motivated functionals with machine-learning functionals are biased by the training set. We include these results merely to show that ML functionals can fill a useful niche. We also cannot say anything about the relative performance of two-point functionals because existing codes cannot apply these functionals to hydrocarbons. Bonding and Potential Energy Surfaces. Poor prediction of chemical bonds and bond energies has kept OF-DFT out of the mainstream of computational chemistry. The large magnitude of the kinetic energy decreases as atoms are drawn apart, and the failure to bond is simply due to a small error in the slope. As Figure 5 shows, the bonding curve generated with our trained kinetic energy functional successfully reproduces local minima for both the C−C and C−H bonds, and the curves are smooth, especially in the vicinity of the minimum. Both the predicted C−C and C−H bond lengths lie within 50 milliangstroms of the KS value. We also tested GGA-type functionals, including all the kinetic energy functionals in libXC and the VT84F kinetic energy functional.46 As shown in Figure SI-2 and Figure SI-3, several GGA-type functionals in libXC have local minima along the C−C bonding coordinate, but no GGA functional predicts the C−H bond. Accuracy along an MD Trajectory. To test the generalization of our functional away from the minimum, we examined a section of a high-temperature Kohn−Sham MD trajectory. This test assesses whether the method produces a qualitatively correct surface for the KS geometries when several atoms are moving at once. Seventeen consecutive steps of the molecular dynamics trajectory of 2-methylpentane obtained at 1800 K were sampled (see Figure 6). In our current implementation this is still a demanding task, because dozens of gigabytes of data are stored and processed for each density. The CNN kinetic functional captures the general shape of the PES, including the positions of stationary points, although the curvature is imperfect. The PES calculated with other GGA functionals is shown in Figure SI-4; none of them capture the general shape of the KS PES. Optimal Density. As mentioned above, due to the current limitations of our code, the density fed into the CNN thus far has been the converged KS density.
The ability to optimize the density from an initial guess is critical.35,37,45,63,65,98 Furthermore, previous studies have shown that it is nontrivial to optimize the density with a machine-learning functional.63,65,99 One can calculate the density gradient of the kinetic energy functional that we constructed with code similar to the weight-

used to produce these results. A conventional pruned Lebedev atom-centered grid92 was used for integration of the exchange-correlation energy and the orbital-free kinetic energy, in conjunction with Becke's atomic weight scheme.93 The grid saturates the accuracy of most kinetic functionals to better than a microhartree. Q-Chem was also used to produce some comparison results.94 Any quantity calculated with the CNN is obtained with single-precision arithmetic. The bonding curve test for ethane includes 9 points each for C−C and C−H bonding, containing 1,905,000 grid points in total. The MD trajectory of 2-methylpentane includes 17 steps, which contain 3,706,000 grid points in total. There are 533,760 training samples, each consisting of 2000 inputs, and there are more than 800,000 parameters in the CNN. The B3LYP exchange-correlation functional95 and the SVP basis were used to generate the hydrocarbon training data. For one 2-methylpentane molecule, it takes 5.78 s for an Nvidia Tesla K80 to evaluate the kinetic energy of all three regions. As shown in Figure SI-5, this wall-time cost scales linearly with the number of atoms. The test samples outnumber the training samples by a factor of 10, which means that our kinetic energy functional has to generalize well to achieve the desired accuracy. Since the carbon core region and the valence region contribute more than 99% of the total kinetic energy, our discussion below mainly focuses on Ftf, and the plot of Fvw is shown in Figure SI-1. Prediction of F. After training a CNN to reproduce the nonlocal enhancement factor, we wanted to establish whether there were trends in the accuracy of its prediction. To measure this we plot the F produced by our learned model alongside the 'accurate' Kohn−Sham enhancement factor. The accurate Ftf for ethane and the one generated by the trained CNN, plotted along the C−C−H plane, are shown in Figure 4. As one can see, the simulated Ftf is smooth and reproduces the fine structure of the Kohn−Sham Ftf surface, including the shell structure of the carbon atom near the core. The root-mean-square deviation between the accurate Ftf and the CNN's model at its equilibrium

Figure 4. Top left panel: The desired Ftf calculated by Kohn−Sham. Top right panel: Ftf predicted by the trained neural network. Bottom panel: Error of the simulated Ftf compared with the accurate Ftf. The neural network simulates the exact Ftf accurately, and structureless noise is the dominant error.
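One of the remedies suggested above, solving for an ensemble of networks, is easy to illustrate: if each network's error behaves like independent structureless noise, averaging N networks shrinks the RMS error by roughly 1/√N. The "networks" below are simulated by adding Gaussian noise to a fabricated reference F; no trained models are involved.

```python
import numpy as np

rng = np.random.default_rng(1)
f_exact = rng.uniform(0.0, 13.0, size=10000)   # reference F_tf values on a grid (fabricated)

def noisy_network(f, sigma=0.03):
    """Stand-in for one trained CNN: the exact F plus structureless noise."""
    return f + rng.normal(scale=sigma, size=f.shape)

def rms(err):
    return float(np.sqrt(np.mean(err ** 2)))

single = noisy_network(f_exact)
ensemble = np.mean([noisy_network(f_exact) for _ in range(8)], axis=0)

print("RMS error, one network:", rms(single - f_exact))            # roughly sigma
print("RMS error, 8-network ensemble:", rms(ensemble - f_exact))   # roughly sigma / sqrt(8)
```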


Kohn−Sham wave function. By brute force search we find the determinant yielding the density which minimizes the total energy, and the shape of the energy with respect to density perturbations around the minimum. This process can be written as

$$\min_{U}\; E_{\text{CNN}}\!\left[\, n\!\left(U_{ai}U_{bj}\cdots U_{ck}\,\psi_{\text{KS}}\right)\right] \qquad (6)$$

where ψKS is the Kohn−Sham wave function, U_{ai}U_{bj}⋯U_{ck} are the unitary matrices of all possible rotations between occupied and virtual orbitals, and E_CNN is the energy predicted with our functional for the density n. The BLYP exchange-correlation functional100,101 and the SVP basis were applied in this section. As shown in Figure SI-7, the optimal density found by our CNN is close to the KS density. The real-space difference between the KS density and the CNN density is shown in Figure 7. The largest density difference is an excess of 0.01

Figure 7. Left panel: Difference between the converged KS density and the density found by the CNN along the bonding surface of the hydrogen molecule at a bond length of 1.4 bohr. Right panel: Total percentage error of the CNN density compared with the converged KS density at different bond lengths. The error is calculated by Δn = ∫|nCNN(r) − nKS(r)|dr. The CNN density lies within an error of 7% of the converged KS density at every bond length.
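The brute-force search of eq 6 and the error measure quoted in Figure 7 can be illustrated with a one-dimensional stand-in: rotate an occupied orbital into a virtual one, rebuild the density, score it with a placeholder energy expression, and compute the integrated density error against a reference. The Gaussian orbitals, the grid, and the energy expression below are all fabricated; the actual calculation minimized E_CNN over products of occupied-virtual rotations with BLYP/SVP.

```python
import numpy as np

x, dx = np.linspace(-6.0, 6.0, 601, retstep=True)

def normalize(phi):
    return phi / np.sqrt(np.sum(phi ** 2) * dx)

phi_occ  = normalize(np.exp(-0.5 * x ** 2))        # stand-in occupied orbital
phi_virt = normalize(x * np.exp(-0.5 * x ** 2))    # stand-in virtual orbital
n_ks = 2.0 * phi_occ ** 2                          # reference "Kohn-Sham" density

def model_energy(n):
    """Placeholder for E_CNN[n]; any functional evaluated on the grid works here."""
    return np.sum(n ** (5.0 / 3.0)) * dx

best_theta, best_e = 0.0, np.inf
for theta in np.linspace(-0.5, 0.5, 201):          # scan occupied-virtual rotations (eq 6)
    phi = np.cos(theta) * phi_occ + np.sin(theta) * phi_virt
    n = 2.0 * phi ** 2
    e = model_energy(n)
    if e < best_e:
        best_theta, best_e = theta, e

n_best = 2.0 * (np.cos(best_theta) * phi_occ + np.sin(best_theta) * phi_virt) ** 2
delta_n = np.sum(np.abs(n_best - n_ks)) * dx       # integrated density error, as in Figure 7
print(best_theta, delta_n)
```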

Figure 5. Top panel: PES with the kinetic energy calculated by Kohn−Sham and by the CNN kinetic energy functional along the C−C bonding coordinate. Bottom panel: PES with the kinetic energy calculated by Kohn−Sham and by the CNN functional along the C−H bonding coordinate. Our CNN-trained kinetic functional successfully reproduces the local minimum with a reasonable bond length.

atomic units located near the nucleus. The integrated absolute density error ∫|nKS(r) − nCNN(r)|dr of the CNN density at all bond lengths is smaller than 7%, and the shape of the energy with respect to perturbations around this density is parabolic and smooth. What the Network Learns. We can examine the weights in the lowest convolutional layer to gain some insight into the neural network. These weight vectors are 'features' recognized in the density by the CNN. For example, in a network trained to identify images of people, the weights of low-lying convolutional layers look like patterns observed in images (edges and corrugation). The weight vectors of higher layers take on the shapes of increasingly complex objects in images (eyes, hands, etc.). The smooth structure of the weight vectors is entirely due to the learning process, because the network is initialized with random numbers. Features tend to be nearly orthogonal, and by observing features we can diagnose overparametrization. Excess filters do not lose their noisy character even when learning is complete. For this experiment we trained a network on the Madelung wave function, n(r)^{1/2}. The lowest feature layers correspond directly to the density of the molecule, and the higher levels learn abstract representations of τ+, features that are difficult to interpret. Looking at the weight matrices of the lowest convolutional layer (Figure 8), the nonlocality of the kinetic functional is superficially obvious. Most features extend several angstroms away from the point at which the enhancement is being evaluated. We can also infer that the real-space size of the sample we feed the network (10 Å) is

Figure 6. PES with the kinetic energy calculated by Kohn−Sham and by the CNN kinetic energy functional along 17 steps of a Kohn−Sham molecular dynamics trajectory of 2-methylpentane. For better comparison, the CNN curves are translated so that they start at the same point. The PES calculated with the CNN kinetic energy follows the KS surface, and the relative errors are similar to the difference between hybrid B3LYP and GGA PBE.

gradient used to optimize the network.10 However, it is a significant undertaking to efficiently implement the gradient. Instead, in this section we exhaustively explore every possible density of the hydrogen molecule to find the optimal density for our functional and the shape of the surface. We explore the space of possible densities using a redundant parametrization based on all possible occupied-virtual orbital rotations of a


Figure 8. Left panel: Madelung wave function (inset: molecular orbitals) along the sampled line of ethane (top: line pointing to the carbon atom; bottom: line pointing to the hydrogen atom). Middle panel: the values of the weights of the filters in the first convolutional layer. Right panel: curves after the density is transformed by the filters. The label on each curve corresponds to the labeled filter in the middle panel that performed the transformation. The transformed density near the carbon atom shows an obvious nodal structure, while the transformed hydrogen density does not.



adequate, since weights at the edges are near zero in most filters. The number of features is essentially saturated given our data, because small noisy oscillations persist in the weight vectors. Looking at the output of the lowest convolutional layer shows how the network is able to distinguish the shell structure of different atoms using its convolutional filters. We can see from Figure 8 that even though the densities along the lines passing through a hydrogen atom and a carbon atom have a similar single-peak shape, they become distinguishable after the transformation of the first convolutional layer. After the transformation, the outputs of carbon's density have many nodes, while the outputs of hydrogen's density have none. Subsequent layers can easily detect these edges in their input to discriminate atoms. Ultimately, to describe many materials, the network must learn the shell structures of several atoms; this basic classification is the first step in that process.

DISCUSSION AND CONCLUSIONS

We have shown that an empirically trained convolutional neural network can usefully predict the Kohn−Sham kinetic energy of real hydrocarbons from their electron density. Our scheme can be practically integrated with existing functionals and codes without pseudopotentials. The network is able to learn nonlocal structure and predict bond lengths despite being constructed in a naive way. We have shown (as predicted) that nonlinearity causes noise, which is the key challenge to further development, but that useful accuracy is already possible. There are several avenues to improve on the model, for example by enforcing physical constraints, improving numerical precision, and improving the data used to train the network. We also show that the kinetic energy functional learned by the CNN can predict the KS density accurately. Implementing the gradient of the functional to complete the whole self-consistent loop will be the primary goal of our future work. For the foreseeable future OF-DFT will, at best, be an inexpensive approximation to Kohn−Sham, and so the computational cost of evaluating the functional must be considered alongside these improvements.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.5b01011. Additional results and details (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We thank The University of Notre Dame’s College of Science and Department of Chemistry and Biochemistry for generous start-up funding and NVidia corporation for some processors used in this work.






REFERENCES

(1) Hohenberg, P.; Kohn, W. Phys. Rev. 1964, 136, B864−B871. (2) Kohn, W.; Sham, L. J. Phys. Rev. 1965, 140, A1133−A1138. (3) Chen, M.; Xia, J.; Huang, C.; Dieterich, J. M.; Hung, L.; Shin, I.; Carter, E. A. Comput. Phys. Commun. 2015, 190, 228−230. (4) Karasiev, V. V.; Sjostrom, T.; Trickey, S. Comput. Phys. Commun. 2014, 185, 3240−3249. (5) Xiang, H.; Zhang, X.; Neuhauser, D.; Lu, G. J. Phys. Chem. Lett. 2014, 5, 1163−1169. (6) Das, S.; Iyer, M.; Gavini, V. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 92, 014104. (7) Chai, J.-D.; Lignères, V. L.; Ho, G.; Carter, E. A.; Weeks, J. D. Chem. Phys. Lett. 2009, 473, 263−267. (8) Wellendorff, J.; Lundgaard, K. T.; Jacobsen, K. W.; Bligaard, T. J. Chem. Phys. 2014, 140, 144107. (9) Fukushima, K. Biol. Cybern. 1980, 36, 193−202. (10) LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Proc. IEEE 1998, 86, 2278−2324. (11) Simard, P.; Steinkraus, D.; Platt, J. C. Best practices for convolutional neural networks applied to visual document analysis. Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on. 2003; pp 958−963. 1145


Journal of Chemical Theory and Computation (12) Matsugu, M.; Mori, K.; Mitari, Y.; Kaneda, Y. Neural Networks 2003, 16, 555−559. (13) Hinton, G. E.; Osindero, S.; Teh, Y.-W. Neural Comput. 2006, 18, 1527−1554. (14) Ciresan, D.; Meier, U.; Schmidhuber, J. Multi-column deep neural networks for image classification. Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 2012; pp 3642− 3649. (15) Snyder, J. C.; Rupp, M.; Hansen, K.; Müller, K.-R.; Burke, K. Phys. Rev. Lett. 2012, 108, 253002. (16) Huang, C.; Carter, E. A. Phys. Chem. Chem. Phys. 2008, 10, 7109−7120. (17) Lehtomäki, J.; Makkonen, I.; Caro, M. A.; Harju, A.; LopezAcevedo, O. J. Chem. Phys. 2014, 141, 234102. (18) Behler, J.; Parrinello, M. Phys. Rev. Lett. 2007, 98, 146401. (19) Handley, C. M.; Popelier, P. L. J. Phys. Chem. A 2010, 114, 3371−3383. (20) Khaliullin, R. Z.; Eshet, H.; Kühne, T. D.; Behler, J.; Parrinello, M. Nat. Mater. 2011, 10, 693−697. (21) Lopez-Bezanilla, A.; von Lilienfeld, O. A. Phys. Rev. B: Condens. Matter Mater. Phys. 2014, 89, 235411. (22) Rupp, M. Int. J. Quantum Chem. 2015, 115, 1058−1073. (23) Behler, J. Phys. Chem. Chem. Phys. 2011, 13, 17930−17955. (24) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. J. Phys. Chem. Lett. 2015, 6, 2326−2331. (25) Wesolowski, T. A.; Warshel, A. J. Phys. Chem. 1993, 97, 8050− 8053. (26) Wesołowski, T. A. Phys. Rev. A: At., Mol., Opt. Phys. 2008, 77, 012504. (27) Shin, I.; Carter, E. A. Modell. Simul. Mater. Sci. Eng. 2012, 20, 015006. (28) Laricchia, S.; Constantin, L. A.; Fabiano, E.; Della Sala, F. J. Chem. Theory Comput. 2014, 10, 164−179. (29) Shin, I.; Carter, E. A. J. Chem. Phys. 2014, 140, 18A531. (30) Xia, J.; Carter, E. A. J. Power Sources 2014, 254, 62−72. (31) Huang, C.; Carter, E. A. Phys. Rev. B: Condens. Matter Mater. Phys. 2012, 85, 045126. (32) Yang, W.; Parr, R. G.; Lee, C. Phys. Rev. A: At., Mol., Opt. Phys. 1986, 34, 4586. (33) Xia, J.; Carter, E. A. Phys. Rev. B: Condens. Matter Mater. Phys. 2012, 86, 235109. (34) Xia, J.; Huang, C.; Shin, I.; Carter, E. A. J. Chem. Phys. 2012, 136, 084102. (35) Xia, J.; Carter, E. A. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 91, 045124. (36) Lembarki, A.; Chermette, H. Phys. Rev. A: At., Mol., Opt. Phys. 1994, 50, 5328. (37) Karasiev, V. V.; Trickey, S. Comput. Phys. Commun. 2012, 183, 2519−2527. (38) Chai, J.-D.; Weeks, J. D. Phys. Rev. B: Condens. Matter Mater. Phys. 2007, 75, 205122. (39) Chai, J.-D.; Weeks, J. D. J. Phys. Chem. B 2004, 108, 6870−6876. (40) Iyengar, S. S.; Ernzerhof, M.; Maximoff, S. N.; Scuseria, G. E. Phys. Rev. A: At., Mol., Opt. Phys. 2001, 63, 052508. (41) Chakraborty, D.; Ayers, P. W. J. Math. Chem. 2011, 49, 1810− 1821. (42) Perdew, J. P. Phys. Lett. A 1992, 165, 79−82. (43) Huang, C.; Carter, E. A. Phys. Rev. B: Condens. Matter Mater. Phys. 2010, 81, 045206. (44) Shin, I.; Ramasubramaniam, A.; Huang, C.; Hung, L.; Carter, E. A. Philos. Mag. 2009, 89, 3195−3213. (45) Karasiev, V. V.; Trickey, S. Chapter Nine: Frank Discussion of the Status of Ground-State Orbital-Free DFT. Adv. Quantum Chem. 2015, 71, 221−245. (46) Karasiev, V. V.; Chakraborty, D.; Shukruto, O. A.; Trickey, S. Phys. Rev. B: Condens. Matter Mater. Phys. 2013, 88, 161108. (47) Karasiev, V. V.; Sjostrom, T.; Trickey, S. Phys. Rev. B: Condens. Matter Mater. Phys. 2012, 86, 115101.

(48) Karasiev, V.; Jones, R.; Trickey, S.; Harris, F. E. Phys. Rev. B: Condens. Matter Mater. Phys. 2009, 80, 245120. (49) Trickey, S.; Karasiev, V.; Jones, R. Int. J. Quantum Chem. 2009, 109, 2943−2952. (50) Trickey, S.; Karasiev, V. V.; Chakraborty, D. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 92, 117101. (51) Xia, J.; Carter, E. A. Phys. Rev. B: Condens. Matter Mater. Phys. 2015, 92, 117102. (52) Chan, G. K.-L.; Cohen, A. J.; Handy, N. C. J. Chem. Phys. 2001, 114, 631−638. (53) Wang, Y. A.; Carter, E. A. Theoretical methods in condensed phase chemistry; Springer: 2002; pp 117−184. (54) Espinosa Leal, L. A.; Karpenko, A.; Caro, M.; Lopez-Acevedo, O. Phys. Chem. Chem. Phys. 2015, 17, 31463−31471. (55) Constantin, L. A.; Fabiano, E.; Laricchia, S.; Della Sala, F. Phys. Rev. Lett. 2011, 106, 186406. (56) Karasiev, V.; Trickey, S.; Harris, F. E. J. Comput.-Aided Mater. Des. 2006, 13, 111−129. (57) García-González, P.; Alvarellos, J. E.; Chacón, E. Phys. Rev. B: Condens. Matter Mater. Phys. 1996, 53, 9509−9512. (58) García-González, P.; Alvarellos, J. E.; Chacón, E. Phys. Rev. B: Condens. Matter Mater. Phys. 1998, 57, 4857−4862. (59) Ke, Y.; Libisch, F.; Xia, J.; Wang, L.-W.; Carter, E. A. Phys. Rev. Lett. 2013, 111, 066402. (60) Wang, L.-W.; Teter, M. P. Phys. Rev. B: Condens. Matter Mater. Phys. 1992, 45, 13196−13220. (61) Wang, Y. A.; Govind, N.; Carter, E. A. Phys. Rev. B: Condens. Matter Mater. Phys. 1998, 58, 13465−13471. (62) Wang, Y. A.; Govind, N.; Carter, E. A. Phys. Rev. B: Condens. Matter Mater. Phys. 1999, 60, 16350−16358. (63) Snyder, J. C.; Rupp, M.; Müller, K.-R.; Burke, K. Int. J. Quantum Chem. 2015, 115, 1102−1114. (64) Vu, K.; Snyder, J. C.; Li, L.; Rupp, M.; Chen, B. F.; Khelif, T.; Müller, K.-R.; Burke, K. Int. J. Quantum Chem. 2015, 115, 1115−1128. (65) Snyder, J. C.; Rupp, M.; Hansen, K.; Blooston, L.; Müller, K.-R.; Burke, K. J. Chem. Phys. 2013, 139, 224104. (66) Sim, E.; Larkin, J.; Burke, K.; Bock, C. W. J. Chem. Phys. 2003, 118, 8140−8148. (67) Whittingham, T. K.; Burke, K. J. Chem. Phys. 2005, 122, 134108. (68) Anderson, J. S.; Ayers, P. W.; Hernandez, J. I. R. J. Phys. Chem. A 2010, 114, 8884−8895. (69) Perdew, J. P.; Ruzsinszky, A.; Sun, J.; Burke, K. J. Chem. Phys. 2014, 140, 18A533. (70) Levy, M.; Ou-Yang, H. Phys. Rev. A: At., Mol., Opt. Phys. 1988, 38, 625. (71) Chan, G. K.-L.; Handy, N. C. Phys. Rev. A: At., Mol., Opt. Phys. 1999, 59, 2670. (72) Borgoo, A.; Tozer, D. J. J. Chem. Theory Comput. 2013, 9, 2250−2255. (73) Lee, D.; Constantin, L. A.; Perdew, J. P.; Burke, K. J. Chem. Phys. 2009, 130, 034107. (74) García-Aldea, D.; Alvarellos, J. E. Phys. Chem. Chem. Phys. 2012, 14, 1756−1767. (75) Li, L.; Snyder, J. C.; Pelaschier, I. M.; Huang, J.; Niranjan, U.-N.; Duncan, P.; Rupp, M.; Müller, K.-R.; Burke, K. Int. J. Quantum Chem. 2015, DOI: 10.1002/qua.25040. (76) Huang, P.-S.; Avron, H.; Sainath, T.; Sindhwani, V.; Ramabhadran, B. Kernel methods match Deep Neural Networks on TIMIT. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. 2014; pp 205−209. (77) Hornik, K. Neural Networks 1991, 4, 251−257. (78) Larochelle, H.; Bengio, Y.; Louradour, J.; Lamblin, P. J. Mach. Learn. Res. 2009, 10, 1−40. (79) Hubel, D. H.; Wiesel, T. N. J. Physiol. 1968, 195, 215−243. (80) LeCun, Y.; Bengio, Y.; Hinton, G. Nature 2015, 521, 436−444. (81) Krizhevsky, A.; Sutskever, I.; Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 2012, 1097−1105. 
(82) Parkhill, J. A.; Head-Gordon, M. Mol. Phys. 2010, 108, 513.


Journal of Chemical Theory and Computation (83) Mezey, P. G. Mol. Phys. 1999, 96, 169−178. (84) Zupan, A.; Perdew, J. P.; Burke, K.; Causa, M. Int. J. Quantum Chem. 1997, 61, 835−845. (85) Kavzoglu, T. Determining optimum structure for artificial neural networks. Proceedings of the 25th Annual Technical Conference and Exhibition of the Remote Sensing Society; 1999; pp 675−682. (86) Kiwiel, K. C. Math. Prog. 2001, 90, 1−25. (87) Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. Nature 1986, 323, 533−536. (88) Hastie, T.; Tibshirani, R.; Friedman, J.; Franklin, J. Math. Intel. 2005, 27, 83−85. (89) Li, M.; Zhang, T.; Chen, Y.; Smola, A. J. Efficient mini-batch training for stochastic optimization. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining 2014, 661−670. (90) Shiozaki, T.; Parker, S.; Reynolds, R.; Le, H.-A.; Kim, I.; MacLeod, M.; Kelley, M.; Caldwell, M.; Bates, J. BAGEL, Brilliantly Advanced General Electronic-structure Library. (91) Marques, M. A.; Oliveira, M. J.; Burnus, T. Comput. Phys. Commun. 2012, 183, 2272−2281. (92) Lebedev, V. I. Zh. Vychisl. Mater. Mater. Fiz. 1975, 15, 48−54. (93) Becke, A. D. J. Chem. Phys. 1988, 88, 2547−2553. (94) Shao, Y.; et al. Mol. Phys. 2015, 113, 184−215. (95) Becke, A. D. J. Chem. Phys. 1993, 98, 5648−5652. (96) Goodfellow, I. J.; Shlens, J.; Szegedy, C. arXiv:1412.6572. arXiv.org e-Print archive 2014, http://arxiv.org/abs/1412.6572v3 (accessed Jan 21, 2016). (97) Burger, H. C.; Schuler, C. J.; Harmeling, S. Image denoising: Can plain Neural Networks compete with BM3D? Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 2012; pp 2392−2399. (98) Hung, L.; Huang, C.; Carter, E. A. Commun. Comput. Phys. 2012, 12, 135. (99) Snyder, J. C.; Mika, S.; Burke, K.; Müller, K.-R. Empirical Inference; Springer: 2013; pp 245−259. (100) Becke, A. D. Phys. Rev. A: At., Mol., Opt. Phys. 1988, 38, 3098. (101) Lee, C.; Yang, W.; Parr, R. G. Phys. Rev. B: Condens. Matter Mater. Phys. 1988, 37, 785.
