Fast and Accurate Molecular Property Prediction: Learning Atomic Interactions and Potentials with Neural Networks

Masashi Tsubaki∗,† and Teruyasu Mizoguchi∗,‡

†National Institute of Advanced Industrial Science and Technology
‡Institute of Industrial Science, University of Tokyo

E-mail: [email protected]; [email protected]

Abstract

The discovery of molecules with specific properties is crucial to developing effective materials and useful drugs. Recently, to accelerate such discoveries with machine learning, deep neural networks (DNNs) have been applied to quantum chemistry calculations based on density functional theory (DFT). While various DNNs for quantum chemistry have been proposed, these networks require various chemical descriptors as inputs and a large number of learning parameters to model atomic interactions. In this paper, we propose a new DNN-based molecular property prediction model that (i) does not depend on descriptors, (ii) is more compact, and (iii) uses additional neural networks to model the interactions between all the atoms in a molecular structure. In considering the molecular structure, we also model the potentials between all the atoms; this allows the neural networks to simultaneously learn the atomic interactions and potentials. We emphasize that these atomic "pair" interactions and potentials are characterized using the global molecular structure as a function of the depth of the neural networks; this leads to the implicit or indirect consideration of atomic "many-body" interactions and potentials within the DNNs. In the evaluation of our model with the benchmark QM9 dataset, we achieved fast and accurate prediction performance for various quantum chemical properties. In addition, we analyzed the effects of learning the interactions and potentials on each property. Furthermore, we demonstrated an extrapolation evaluation, i.e., we trained a model with small molecules and tested it with large molecules. We believe that the insights from the extrapolation evaluation will be useful for developing more practical applications of DNN-based molecular property prediction.

The discovery of molecules with specific properties is crucial to developing effective materials and useful drugs. To facilitate such discoveries, quantum chemistry calculations based on density functional theory (DFT)1 have been widely used; however, there is a trade-off between computational speed and prediction accuracy. To address this problem, various machine learning (ML) techniques have recently been applied to quantum chemistry calculations for approximating molecular energies and finding density functionals.2–5 The advantage of ML techniques is that, once the model is trained, predictions can be made extremely quickly. If such an ML-based approach is practical, i.e., not only fast but also accurate, the application range of DFT calculations will be enormous, enabling large-scale molecular screening and molecular dynamics (MD) simulations.

Very recently, owing to their various advantages, deep neural networks (DNNs)6–8 have been successful in achieving fast and accurate quantum chemistry calculations.9–11 Compared with kernel methods,12 for example, the training time of DNNs scales linearly with the number of data samples. In addition, while kernel methods require chemical descriptors as inputs, DNNs can directly transform a molecular structure, i.e., the atom types and their positions, into an efficient low-dimensional representation via hierarchical non-linear functions. For the benchmark QM9 dataset,13 which contains approximately 130k small organic molecules and provides 13 types of quantum chemical properties for each molecule, including HOMO (highest occupied molecular orbital), LUMO (lowest unoccupied molecular orbital), and the atomization energy, DNNs such as neural message passing (NMP),14 the deep tensor neural network (DTNN),9 and SchNet10,11 have yielded better prediction performance than other ML methods. However, NMP uses various descriptors, such as the acceptor, donor, and hybridization states (i.e., sp, sp2, and sp3); DTNN and SchNet do not use such descriptors, but their modeling of atomic interactions is based on a tensor and a residual network (ResNet),15 which requires a large number of learning parameters. In addition, some specific properties in the QM9 dataset, for example, the electronic spatial extent, static polarizability, vibrational frequency, and heat capacity, have not been predicted well by NMP and SchNet.

Based on the above observations, there is room to develop more effective DNN-based molecular property prediction models that (i) do not depend on descriptors, as NMP does, (ii) are more compact, i.e., have fewer learning parameters than DTNN and SchNet, and (iii) can achieve high prediction performance for the properties that NMP and SchNet cannot predict well. Note that such a compact model requires only the atom types and their positions in a molecule as input, and can then predict the various properties of the molecule. This process, from input to prediction, can be performed extremely quickly; it takes a few minutes for 10k molecules (shown in Figure 5), whereas a DFT calculation takes a few minutes for a single molecule. This allows a significant acceleration of the discovery of new materials and drugs.

In this paper, we propose a new DNN-based molecular property prediction model that models the interactions between all the atoms in a molecular structure to achieve the above aims. In considering the molecular structure, we also model the potentials between all the atoms; this allows the neural networks to simultaneously learn the atomic interactions and potentials. We emphasize that these atomic "pair" interactions and potentials are characterized using the global molecular structure as a function of the depth of the neural networks; this leads to the implicit or indirect consideration of atomic "many-body" interactions and potentials within the DNNs. Figure 1 shows the entire process of our model, and the caption describes each sub-process.

Figure 1: An overview of our model: (0) the input is a molecular structure, i.e., the atom types and their positions, and (5) the output is a set of quantum chemical properties in the QM9 dataset, e.g., HOMO, LUMO, and the atomization energy. (1) First, we transform all the atoms in the molecule into real-valued d-dimensional vectors (in the above example, d = 2). The vectors are randomly initialized according to the atom types, i.e., different (the same) atom types have different (the same) initialized random vectors. (2) Then, we update each atom vector using a neural network, which connects all the atom pairs in the molecule and models the interactions and potentials between the atoms (details in the Supporting Information). At the first layer ℓ = 1, the atomic interactions and potentials are "pair" interactions and potentials, respectively. However, (3) these gradually become atomic "many-body" interactions and potentials implicitly or indirectly within the DNN, because each atom vector gradually captures more of the global molecular structure as the layer ℓ increases. (4) At the final layer (i.e., depth) ℓ = L, we sum over the atom vectors and obtain a molecular vector. (5) Finally, minimizing the mean squared errors (MSEs) between the model outputs and the properties, we use backpropagation to optimize all the learning parameters in the network, including the atom vectors.

In the evaluation of our model with the benchmark QM9 dataset, fast and accurate prediction performance was achieved for various quantum chemical properties. In addition, the effects of learning the interactions and potentials on each property were analyzed. Furthermore, we demonstrated an extrapolation evaluation, i.e., we trained a model with small molecules and tested it with large molecules.

Formally, we define a molecule M as a set of atom types and their positions (Figure 1(0)), i.e., M = {(a_1, r_1), (a_2, r_2), ..., (a_Natom, r_Natom)} = {(a_i, r_i)}_{i=1}^{Natom}, where a_i is the type (e.g., hydrogen or oxygen) of the i-th atom, r_i is the 3D Cartesian coordinate of the i-th atom, and Natom is the number of atoms in the molecule M. In this paper, we design a function that maps a molecule M to a real-valued d-dimensional vector, i.e., z_M ∈ R^d (Figure 1(4)). Using the molecular vector z_M, we predict its real-valued quantum chemical properties, i.e., t_M ∈ R, e.g., HOMO, LUMO, and the atomization energy (Figure 1(5)).

Given a molecule M = {(a_i, r_i)}_{i=1}^{Natom} as described above, we first transform each a_i to x_i ∈ R^d, a real-valued d-dimensional vector of the i-th atom (Figure 1(1)). In this paper, we refer to x_i as an "atom vector"; its properties are as follows: (i) the dimensionality d is a hyperparameter (e.g., d = 50), (ii) x_i is randomly initialized according to the atom type, i.e., different (the same) atom types have different (the same) initialized random vectors, and (iii) x_i is then trained via backpropagation to predict the quantum chemical properties. Therefore, we have X_M = {(x_i, r_i)}_{i=1}^{Natom}, which is a vector-based input representation of the molecule M for the neural network described in the following.
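As a concrete illustration of step (1), the shared, trainable atom vectors can be implemented with an embedding table. The following sketch assumes PyTorch; the names and values (ATOM_TYPES, the example atom list) are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

# Each atom type shares one trainable, randomly initialized d-dimensional
# vector (the "atom vector"), which is updated by backpropagation.
ATOM_TYPES = ["H", "C", "N", "O", "F"]          # QM9 molecules contain C, H, O, N, F
type_to_index = {a: i for i, a in enumerate(ATOM_TYPES)}
d = 100                                          # dimensionality used in this paper

embedding = nn.Embedding(num_embeddings=len(ATOM_TYPES), embedding_dim=d)

# Example: a formamide-like atom list (types only; positions r_i are kept separately).
atoms = ["N", "C", "O", "H", "H", "H"]
indices = torch.tensor([type_to_index[a] for a in atoms])
x = embedding(indices)                           # x_i in R^d for each atom, shape (N_atom, d)
```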

Next, we describe the computation of the neural network, which connects all the atom pairs in the molecule, models the interactions and potentials between the atoms, and then "updates" each atom vector considering the molecular structure (Figure 1(2)). This strategy of updating vectors is commonly used in neural networks for graph-structured data.16 More precisely, given X_M = {(x_i, r_i)}_{i=1}^{Natom} as defined above, we refer to the i-th atom vector at layer ℓ of the neural network as x_i^(ℓ), where ℓ can also be regarded as the number of updates. We update x_i^(ℓ) and obtain x_i^(ℓ+1) as follows:

x_i^(ℓ+1) = f(x_i^(ℓ)) + Σ_{j ∈ M\i} g(x_j^(ℓ), V_ij^(ℓ), α_ij^(ℓ)),    (1)

where f is a neural network, j indexes the other atoms in M, and g is a neural network that is a function of three variables: x_j^(ℓ), the potential V_ij^(ℓ) ∈ R between the i-th and j-th atoms, and the interaction α_ij^(ℓ) ∈ R between the i-th and j-th atoms. Note that we consider (i) the potential within the atom vector (i.e., as a feature of the atom) and (ii) the interaction by weighting the atom vector (i.e., as a computation on the atom). The Supporting Information describes the details of the two neural networks, the potential, and the interaction. Therefore, by iterating the above process over the layers ℓ, each atom vector x_i^(ℓ) is updated with the other atoms in the molecule and gradually captures more of the global molecular structure, considering the atomic interactions and potentials (Figure 1(3)). Indeed, there are some related studies modeling atomic potentials within machine learning,4,17,18 e.g., neural network potentials;19 we discuss them in the Supporting Information.
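The update of Eq. (1) can be sketched as follows. This is a minimal illustration in which f and g are reduced to simple placeholder networks and the potentials V_ij and interactions α_ij are passed in as precomputed tensors; the forms actually used for f, g, V_ij, and α_ij are given in the Supporting Information.

```python
import torch
import torch.nn as nn

class SimplifiedUpdate(nn.Module):
    """Sketch of Eq. (1): x_i <- f(x_i) + sum_{j != i} g(x_j, V_ij, alpha_ij).

    f and g are minimal placeholders here; the forms used in the paper
    (Eqs. (2)-(4) of the Supporting Information) are richer.
    """
    def __init__(self, d):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.w_pot = nn.Linear(1, d)  # embeds the scalar potential V_ij into R^d

    def forward(self, x, V, alpha):
        # x: (N, d) atom vectors, V: (N, N) pair potentials, alpha: (N, N) interaction weights
        N = x.size(0)
        fx = self.f(x)                                    # f(x_j) for all atoms
        v = torch.relu(self.w_pot(V.unsqueeze(-1)))       # v_ij in R^d, shape (N, N, d)
        g = alpha.unsqueeze(-1) * (fx.unsqueeze(0) + v)   # g(x_j, V_ij, alpha_ij)
        mask = 1.0 - torch.eye(N)                         # exclude j == i from the sum
        return fx + (mask.unsqueeze(-1) * g).sum(dim=1)

x = torch.randn(6, 100)                          # 6 atoms, d = 100
V = torch.rand(6, 6)                             # pair potentials (random, for illustration)
alpha = torch.softmax(torch.randn(6, 6), dim=1)  # interaction weights (random, for illustration)
x_next = SimplifiedUpdate(100)(x, V, alpha)      # updated atom vectors, shape (6, 100)
```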

Note that, as the Supporting Information describes, the above atomic interaction is characterized by the atomic environment at the layer ℓ, i.e., α_ij^(ℓ) is a function of (x_i^(ℓ), x_j^(ℓ), d_ij), where d_ij = ||r_i − r_j|| is the Euclidean distance between the i-th and j-th atoms. In addition, the atomic potential V_ij^(ℓ) is also a function of (x_i^(ℓ), x_j^(ℓ), d_ij). These functions are implemented via neural networks, and their parameters are learned via backpropagation. For example, at the first layer ℓ = 1, the atomic interaction and potential are only the "pair" interaction and potential, respectively; however, these gradually become the atomic "many-body" interaction and potential implicitly or indirectly, because the atom vectors x_i^(ℓ) and x_j^(ℓ) gradually capture more of the global molecular structure according to the layer ℓ (i.e., the number of updates of the atom vectors).

So far, we have described the neural network that updates each atom vector in the molecule; we now turn our attention to the final output, i.e., the molecular vector (Figure 1(4)). To obtain the molecular vector z_M ∈ R^d, we simply sum over the atom vectors in the molecule, i.e., z_M = Σ_{i=1}^{Natom} x_i^(L), where L is the final layer. Note that L is a hyperparameter and can be regarded as the number of layers, or the depth, of the neural network. Using z_M, we predict the quantum chemical properties with a linear regression model (Figure 1(5)). The Supporting Information describes the details of the regression model, implementation, and optimization.
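A minimal sketch of the readout in steps (4)-(5) is shown below; the variable names are illustrative, and the linear regression head corresponds to the model t′_M = w^⊤ z_M + b described in the Supporting Information.

```python
import torch
import torch.nn as nn

# Sum the final-layer atom vectors into a molecular vector z_M and predict a
# property with a linear regression head. `x_final` stands for the atom
# vectors at the final layer L (illustrative name).
d = 100
x_final = torch.randn(6, d)          # atom vectors x_i^(L) for a 6-atom molecule
z_M = x_final.sum(dim=0)             # molecular vector z_M in R^d

regressor = nn.Linear(d, 1)          # t'_M = w^T z_M + b
t_pred = regressor(z_M)              # predicted (normalized) property value
```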

We use the QM9 dataset13 to evaluate our model. This dataset contains approximately 130k small organic molecules made up of CHONF; the molecular sizes range from 3 to 29 atoms, and 13 types of quantum chemical properties are provided for each molecule, including HOMO, LUMO, and the atomization energy. These properties were obtained via DFT calculations (Gaussian 09) at the B3LYP/6-31G(2df, p) level of theory. For the evaluation, we randomly shuffled and split the dataset into training/development/test sets with a ratio of 8:1:1, of which the development (or validation) set is used to tune the model hyperparameters (see the Supporting Information).

Table 1: The mean absolute errors (MAEs) of NMP, SchNet, and our model for the 13 quantum chemical properties in the QM9 dataset. Our model uses 100-dimensional atom vectors, and the depth of the neural network is six.

Property   Unit       NMP      SchNet   Our model
U0         eV         0.019    0.014    0.005
U          eV         0.019    0.019    0.005
H          eV         0.017    0.014    0.005
G          eV         0.019    0.014    0.005
⟨R2⟩       Bohr²      0.180    0.260    0.019
α          Bohr³      0.092    0.593    0.044
ω1         cm⁻¹       1.9      —        0.049
Cv         cal/molK   0.040    0.033    0.032
ϵHOMO      eV         0.043    0.041    0.138
ϵLUMO      eV         0.037    0.034    0.069
∆ϵ         eV         0.069    0.063    0.091
µ          Debye      0.030    0.033    0.098
ZPVE       eV         0.0015   0.0017   0.0141

Table 1 shows the main result: the mean absolute errors (MAEs) of NMP,14 SchNet,11 and our model for the 13 quantum chemical properties in the QM9 dataset. As Table 1 shows, all the MAEs of our model are low, that is, the prediction performance is accurate; in particular, the performances for the properties related to the atomization energies, i.e., U0, U, H, and G, are more accurate than those of NMP and SchNet. In addition, our model achieves the most accurate performances for ⟨R2⟩ (the electronic spatial extent), α (the norm of the static polarizability), ω1 (the highest fundamental vibrational frequency), and Cv (the heat capacity).


Figure 2: Examples of molecules in the test dataset (SMILES: CC(O)CC1CCC1C, CC12C3CC1OC(=N)C23, and CCC(=O)C1(C)CN1C) and their properties (HOMO, LUMO, gap, U0, U, H, G, ⟨R2⟩, α, µ, ZPVE, ω, and Cv) predicted by our model. Note that the values in brackets are calculated based on the DFT.

Conversely, the performances for the properties related to the electronic structure, i.e., ϵHOMO, ϵLUMO, and ∆ϵ, are worse than those of NMP and SchNet. NMP uses various descriptors as inputs to its neural network, such as the acceptor, donor, and hybridization states (i.e., sp, sp2, and sp3); SchNet does not use such descriptors, and its modeling of interactions is based on a ResNet-style architecture,15 which requires a large number of learning parameters. We believe that these considerations from the viewpoint of machine learning (i.e., input descriptors and learning parameters) may improve the performance with respect to the electronic structure. However, even when such complex or specific descriptors and a large number of learning parameters are used, the above-mentioned properties, i.e., ⟨R2⟩, α, ω1, and Cv, cannot be predicted well by NMP and SchNet. Conversely, our model considers the electronic structure implicitly, from the viewpoint of physical chemistry, by modeling and learning the potentials between the atoms, and achieves the most accurate performances for these properties.


Figure 3: Left: the potential profiles obtained by DFT simulation. Right: the potential curves learned by our DNN. We describe the potentials between two atoms (C-H and C=O) for two molecules: a small molecule (SMILES: O=CNC=O) and a relatively large molecule (SMILES: CCC(=O)C1(C)CN1C). Note that the learned potentials are obtained from the 6th (i.e., final) layer of our DNN.

Overall, while there is room for improvement in terms of the electronic structure, our model achieved the best prediction performance for 8 out of the 13 quantum chemical properties in the QM9 dataset. Figure 2 shows examples of molecules in the test dataset and their properties predicted by our model.

Figure 3 shows the potential profiles obtained by DFT simulation (left) and the potential curves learned by our DNN (right). In order to compare our DNN with the DFT results, we have to normalize the potentials, because in our DNN the potential for one bond (e.g., C=O) is obtained together with all the other pair interactions in the molecule (e.g., C-H).


Figure 4: The learned interaction (bond) strengths between the atoms in a molecule (SMILES: O=CNC=O). These bond strengths are the weight values α_ij^(ℓ) in Eq. (9), and we show the values at two layers: ℓ = 2 and ℓ = 6.

For the horizontal axis, the bond length was normalized so that the minimum of the potential lies at 1; for the vertical axis, the energy was normalized so that the difference between the value at the minimum and the value at 6 on the horizontal axis equals 1. As shown in these figures, the pair potentials obtained by our DNN reproduce the characteristics of the DFT simulation. In particular, the potential curve of C=O is steeper than that of C-H, indicating that the C=O bond is harder than the C-H bond. In addition, our DNN captures the rough tendency of the potentials for a small and a large molecule (see the difference between the two C-H curves). Note that the DNN and DFT results do not match perfectly. The potential learned by the DNN model reflects all the atomic pair potentials (and interactions) in the molecule; on the other hand, to describe a specific pair potential curve as shown in Figure 3, we extracted a single atomic pair potential, i.e., C=O or C-H, from the model. Furthermore, the model does not learn the potential curves themselves but the molecular properties; that is, the potential curves are learned indirectly in order to predict the molecular properties. We believe that these learned curves are close to the results of the DFT calculations and that the DNN reproduces the characteristic features, e.g., that C=O is stronger than C-H.
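For illustration, one possible implementation of this normalization is sketched below; it reflects our reading of the procedure (minimum of the horizontal axis at 1, vertical scale fixed by the values at the minimum and at 6), and the Morse parameters in the example are arbitrary.

```python
import numpy as np

def normalize_curve(r, v):
    """Normalize a pair-potential curve: the horizontal axis is scaled so the
    minimum lies at 1, and the vertical axis is scaled so the difference
    between the value at the minimum and the value at 6 on the normalized
    axis equals 1 (our reading of the procedure described above)."""
    r_min = r[np.argmin(v)]
    r_norm = r / r_min
    v_at_6 = np.interp(6.0, r_norm, v)           # value at 6 on the normalized axis
    v_norm = (v - v.min()) / (v_at_6 - v.min())
    return r_norm, v_norm

# Example with an arbitrary Morse-like curve (illustrative parameters only).
r = np.linspace(0.8, 8.0, 200)
v = 1.5 * (1.0 - np.exp(-1.1 * (r - 1.2))) ** 2
r_n, v_n = normalize_curve(r, v)
```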

Figure 4 shows the learned interaction (bond) strengths, i.e., the weight values α_ij^(ℓ) in Eq. (9), between the atoms in a molecule (SMILES: O=CNC=O).


Figure 5: The scalability in terms of the prediction time with respect to (a) the number of data samples N and (b) the molecular size Natom. These panels show that the prediction not only scales linearly in N and Natom but also can be performed extremely quickly.

As shown in the two panels of Figure 4 (ℓ = 2 and ℓ = 6), all the weight values are similar at the shallow layer; however, as the layers get deeper, the weights change gradually and approach values that reflect the chemical bond strengths in the molecular structure (see the weight value of C=O at the 6th layer). While analyses such as those in Figures 3 and 4 cannot be shown and evaluated exhaustively for all molecules in the QM9 dataset, (i) learning the interactions and potentials significantly improves the total prediction performance (see Figure 7(a)), (ii) the model becomes more interpretable because the potential curves and bond strengths can be described, and (iii) we believe that such analyses can be helpful for interpreting the results of data-driven machine learning.

Figure 5 shows the scalability of our model in terms of the prediction time with respect to (a) the number of data samples N and (b) the molecular size Natom. As this figure shows, the prediction time scales linearly with increases in N and remains unaffected by increases in Natom. In addition, Figures 6(a) and 6(b) show the distributions of the molecular sizes in the training and test sets of the QM9 dataset, and Figure 6(c) shows the MAE for each molecular size. As these figures show, even though the number of large molecules (e.g., Natom ≥ 25) is relatively small, accurate performance can be achieved for these molecules (Figure 6(c)).


Figure 6: Panel (a) shows the distribution of the data samples in terms of the molecular size Natom in the training dataset, and panel (b) shows the distribution for the test dataset. These panels indicate that the data distributions are the same for the training and test samples. Panel (c) shows the mean MAE of the 13 properties for each molecular size. Interestingly, the highest prediction performance is not achieved for the most frequent molecular size in the training samples (Natom = 19), whereas the larger molecules (Natom > 25) are predicted well.

Based on these observations in terms of accuracy and scalability, the application range of DFT calculations can be enormous when using DNNs, e.g., for large-scale molecular screening as described above.

Figure 7 shows the effects of the atomic interactions and potentials in our model. The analyses are described with learning curves, where the x-axis is the epoch, i.e., the number of iterations over the training dataset, and the y-axis is the mean MAE of the 13 properties on the test dataset. In Figure 7(a), we observe that, if we do not consider the interactions, i.e., we simply average the hidden vectors by using 1/Natom instead of α_ij^(ℓ) in Eq. (9) (see the Supporting Information), the performance becomes relatively poor. In addition, if we do not consider the potentials, the performance becomes very poor; this shows that learning the potentials is more crucial than learning the interactions in our model. Furthermore, we show the effects of the atomic interactions and potentials with the learning curves of the atomization energy, HOMO, and LUMO in Figures 7(b), 7(c), and 7(d). Comparing Figures 7(b) and 7(c), we observe that learning the atomic interactions improves the performance for all three properties.


Figure 7: The effects of the modeling and learning of the interactions and potentials between the atoms: (a) interactions + potentials, (b) interactions + potentials, (c) no interactions, and (d) no potentials. In these learning curves, the x-axis is the epoch, i.e., the number of iterations over the training dataset, and the y-axis is the mean MAE of the 13 properties on the test dataset.

We believe that the knowledge of atomic interactions leads to the knowledge of HOMO, LUMO, and other electronic structure information. This is natural because the local atomic interaction is determined by the atomic species and their distances and angles, and by the ionicity, covalency, and hybridization of orbitals; this chemical bonding also determines HOMO, LUMO, and other electronic structure information. Our DNN learns the correlations between the atomic interactions and the chemical bonding through training. On the other hand, as Figure 7(d) shows, if we do not consider the potentials, the performances for HOMO and LUMO are very poor, whereas that for the atomization energy remains accurate. Indeed, the potentials provide information about the electronic structure in a molecule; that is, learning the potentials leads to improvements in the predictions of the electronic properties, such as HOMO and LUMO.


Figure 8: Panel (a) shows the distribution of the training dataset, and panel (b) shows the distribution of the test dataset. This is the setting for our extrapolation evaluation. Panel (c) shows the mean MAE of the 13 properties for each (large) molecular size and panel (d) shows the MAE of the atomization energy, HOMO, and LUMO for each (large) molecular size.

On the other hand, the prediction errors for these electronic properties are still larger in our model than in NMP and SchNet. More complex or specific descriptors related to the electronic structure, such as donor, sp2, and sp3, would be necessary to achieve better performance. Note that, without considering the atomic potentials in a molecular structure, the model achieved the most accurate performance for the atomization energy (see Figure 7(d)).

As we described in the caption of Table 1 and in the Supporting Information, we obtained the best prediction performances with a relatively shallow model, i.e., a six-layer neural network.


Considering recent successful DNNs for image recognition, such as ResNet,15 which has over 100 layers, we believe that such an extremely deep network is not necessary for modeling molecules. Indeed, while we varied the number of layers and investigated the change in performance, increasing the number of layers did not lead to improvements. However, the QM9 dataset includes only small molecules, i.e., 3 ≤ Natom ≤ 29, and both the training and test datasets include all molecular sizes equally because of the random shuffle and split of the dataset (see Figures 6(a) and 6(b)). Considering this, we believe that accurate performance can be achieved even with a shallow model. Based on these observations, we considered training a model with small molecules and then testing it with large molecules; that is, we investigated "extrapolation" with deeper models (related work is discussed in the Supporting Information). In the evaluation of the extrapolation, we used 3-, 5-, 7-, and 9-layer models, i.e., L = {3, 5, 7, 9}. We trained a model with a dataset including only the small molecules, i.e., Natom = {3, 4, ..., 19}, and then tested the trained model with each dataset of larger molecules, i.e., Natom = {20, 21, ..., 29}. Figure 8(c) shows that all the models achieved accurate prediction performance for molecules slightly larger than the training molecules, e.g., Natom = {20, 21, 22}. However, the performance gradually decreased (i.e., the mean MAE gradually increased) as the molecular size increased, even when we used deeper models. Figure 8(d) shows, however, that the atomization energy can be predicted well even when the molecular size is large. In our model, we assume that the total molecular energy is the sum of the atomic energies; this assumption is also used in other studies.9,19,20 Our model likewise sums over the atom vectors in a molecule to output the molecular vector (see Figure 1(4)). Based on this assumption and operation, we believe that the stable performance for the atomization energy is reasonable, because each atom captures the global molecular structure in a deep network (see Figure 1(3)). On the other hand, the extrapolation performances for HOMO and LUMO are significantly poorer; these properties correspond more strongly to the electronic structure than the atomization energy does.


As a result, we need to consider more complex relations between the molecular structure and HOMO or LUMO in a deep network. Therefore, our model may not be suitable for predicting properties such as HOMO and LUMO of larger molecules that do not exist in the QM9 dataset (e.g., Natom ≥ 50). Furthermore, even though we tuned the other model hyperparameters (e.g., the dimensionality and the L2 regularization), improvements were hardly observable, indicating that the current model does not have any component to handle extrapolation. We believe that extrapolation is difficult but important for developing more practical applications of ML-based DFT calculations; we leave this to future work.

In conclusion, we proposed a new DNN-based molecular property prediction model that does not use any descriptors, is more compact than previously proposed DNNs, and learns the atomic interactions and potentials in a molecular structure. On the benchmark QM9 dataset, we achieved fast and accurate prediction performance. In addition, we analyzed the effects of learning the interactions and potentials and showed the importance of the potentials for predicting the properties related to the electronic structure. Furthermore, we demonstrated an extrapolation evaluation, i.e., we trained a model with small molecules and tested it with large molecules, and achieved high performance for molecules slightly larger than the training molecules. We believe that the insights from this evaluation will be useful for developing more practical applications. We leave extrapolation in DNN-based molecular property prediction to future work.

Acknowledgements

This research was supported by NEDO, Japan, and JST-PRESTO.


Supporting Information

The neural network f in Eq. (1) is given by

f(x_i^(ℓ)) = ReLU(W_atom x_i^(ℓ) + b_atom),    (2)

where ReLU is the element-wise rectified linear unit,21 i.e., ReLU(x) = max(0, x), W_atom ∈ R^{d×d} is a weight matrix to be learned, and b_atom ∈ R^d is a bias vector to be learned. Note that W_atom and b_atom do not depend on the layer ℓ; the learning parameters of the neural networks described below have the same property. The neural network g in Eq. (1) is given by

g(x_j^(ℓ), V_ij^(ℓ), α_ij^(ℓ)) = α_ij^(ℓ) (f(x_j^(ℓ)) + v_ij^(ℓ)),    (3)

where f is the same neural network as in Eq. (2) and v_ij^(ℓ) ∈ R^d is the vector representation of the potential V_ij^(ℓ) ∈ R between the i-th and j-th atoms. The scalar variable α_ij^(ℓ) ∈ R is the weight on f(x_j^(ℓ)) + v_ij^(ℓ), which we regard as the interaction strength (or bond strength) between the i-th and j-th atoms. In other words, we consider (i) the potential within the atom vector (i.e., as a feature of the atom) and (ii) the interaction strength by weighting the atom vector (i.e., as a computation on the atom). Such modeling provides interpretable machine learning; that is, we can describe a curve of the learned potential based on V_ij^(ℓ) (see Figure 3) and analyze the strength of a learned interaction based on α_ij^(ℓ) (see Figure 4). Since these are characterized by ℓ, we can observe the differently learned potentials and interactions at every layer of a deep network.
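A minimal PyTorch sketch of the two networks in Eqs. (2) and (3) is given below; v_ij and α_ij are treated as given inputs here (they are defined by Eq. (4) and Eqs. (10)-(13), respectively), and the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class AtomNet(nn.Module):
    """Sketch of Eq. (2): f(x_i) = ReLU(W_atom x_i + b_atom); the same weights
    are shared across layers, as stated above."""
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, d)

    def forward(self, x):
        return torch.relu(self.linear(x))

def g(f, x_j, v_ij, alpha_ij):
    """Sketch of Eq. (3): g(x_j, V_ij, alpha_ij) = alpha_ij * (f(x_j) + v_ij),
    where v_ij is the vector representation of the potential (Eq. (4)) and
    alpha_ij the scalar interaction weight (Eqs. (10)-(13))."""
    return alpha_ij * (f(x_j) + v_ij)

f = AtomNet(d=100)
x_j = torch.randn(100)
v_ij = torch.randn(100)            # placeholder; computed from the potential in practice
out = g(f, x_j, v_ij, torch.tensor(0.2))
```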

The vector representation of the potential in Eq. (3) is given by

v_ij^(ℓ) = ReLU(w_potential V_ij^(ℓ) + b_potential),    (4)

where w_potential ∈ R^d is a weight vector and b_potential ∈ R^d is a bias vector. In this paper, we use the Morse potential for V_ij^(ℓ), i.e.,

V_ij^(ℓ) = D_ij^(ℓ) (1 − exp(−a_ij^(ℓ) (d_ij − r_ij^(ℓ))))^2,    (5)

where d_ij = ||r_i − r_j|| is the Euclidean distance between the i-th and j-th atoms. In our model, D_ij^(ℓ), a_ij^(ℓ), and r_ij^(ℓ) are also learning parameters and are computed by the neural networks as follows:

D_ij^(ℓ) = σ(w_D^⊤ (x_i^(ℓ) ⊕ x_j^(ℓ)) + b_D),    (6)
a_ij^(ℓ) = σ(w_a^⊤ (x_i^(ℓ) ⊕ x_j^(ℓ)) + b_a),    (7)
r_ij^(ℓ) = σ(w_r^⊤ (x_i^(ℓ) ⊕ x_j^(ℓ)) + b_r),    (8)

where σ is the sigmoid function, σ(x) = 1/(1 + e^(−x)), and ⊕ denotes the concatenation of two vectors. Note that, in order to describe the potential curves and compare their steepness (see Figure 3), we impose constraints, i.e., positive values for D and r via the sigmoid function and a = 1/(a_ij + ϵ), where we set ϵ = 0.2. Therefore, our model considers the potential between the i-th and j-th atoms to be the vector v_ij^(ℓ), which is obtained from x_i^(ℓ), x_j^(ℓ), and d_ij. Because the atom vectors are characterized by the layer ℓ and contain information concerning the global molecular structure, the model allows us to learn atomic "many-body" potentials implicitly or indirectly within the neural network. Note that, for the scalar-valued potential, we constrain the parameters of the Morse potential to be positive; we then apply the ReLU function to the vector-valued representation (i.e., the feature vector) of the potential, which allows us to correctly propagate the feature information in the forward computation of a deep network. (We could also consider other potentials, such as the Lennard-Jones potential, and learn its parameters, i.e., ϵ (the depth of the potential well) and σ (the finite distance); however, we used the Morse-type potential in this paper, and we believe that the specific type of potential is not critically important.)

Page 19 of 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry Letters

respectively. We then re-write Eq. (1) as follows: (ℓ+1)

xi

(ℓ)

= hi +



(ℓ)

(ℓ)

(9)

αij hij .

j∈M\i

(ℓ)

(ℓ)

The motivation to introduce αij is that, if we simply sum or mean over the hidden vectors hij (ℓ)

without αij , the model considers all the hidden vectors to be equally important. However, their relative importance, which can be assumed to be their interaction strengths, are indeed different and are determined by the i-th and j-th atom states and their distances. In this (ℓ)

paper, we consider the interaction strength to be αij , which is (i) represented as a non-linear dot product in a projected space with a neural network, (ii) used as a weight described in Eq. (9), and (iii) learned by backpropagation. (ℓ)

(ℓ)

(ℓ)

More precisely, using the hidden vectors hi and hij as inputs, two new vectors yi and (ℓ)

yij are obtained as follows: (ℓ) yi (ℓ)

yij

( = ReLU

)

(ℓ) Wint hi

(10)

+ bint , ) (ℓ) = ReLU Wint hij + bint , (

(11)

where Wint ∈ Rd×d is the weight matrix and bint ∈ Rd is the bias vector. Note that, as seen (ℓ)

in Eq. (10), we use the same neural network for hi

(ℓ)

and hij because we wish to consider

the interaction strengths in a common space projected via a neural network. Then, taking (ℓ)

(ℓ)

the non-linear dot product between yi and yij , we obtain (ℓ)

(⟨

sij = σ

(ℓ) (ℓ)

yi yij

⟩) .

(12)

(ℓ)

Then normalizing sij with a softmax function, we obtain

(ℓ) αij

( (ℓ) ) exp sij =∑ ( (ℓ) ) . k exp sik

19

ACS Paragon Plus Environment

(13)

The Journal of Physical Chemistry Letters 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Finally, we obtain the weighted sum of the hidden vectors, i.e., (ℓ)

(ℓ)

Page 20 of 24

∑ j∈M\i

(ℓ) (ℓ)

αij hij , as described

(ℓ)

(ℓ)

in Eq. (9). Therefore, αij is also a function of xi and xi as is the potential Vij ; this leads to the learning of atomic “many-body” interactions implicitly or indirectly within the neural network. Note that the above computation is inspired by “neural attention mechanism,” 22–24 which is widely used in deep learning–based machine translation systems. In linear regression, the training objective is to minimize the mean squared errors (MSEs) between the model output t′M = w⊤ zM + b and the quantum chemical property tM in the ∑ ′ 2 training dataset, i.e., the loss function is L(Θ) = 21 N i=1 ||tMi − tMi || , where Θ is the set of all learning parameters in our model, N is the number of data samples (molecules), and Mi is the i-th molecule in the training dataset. Note that each property tM in the training dataset is normalized to have a mean of 0 and a variance of 1. Then, we use the mean absolute errors (MAEs) to evaluate the prediction performance. We implemented the above model using PyTorch 25 version 0.4 (https://pytorch.org/), and the training details are as follows: the optimization is achieved via Adam 26 , which is a stocastic gradient descent (SGD)-based algorithm; the dimensionality of the atom vector is d = 100; and the number of layers is L = 6. In addition, we used the dev decay scheme; we kept track of the best performance on the development (or validation) set and decayed the learning rate by a constant factor if the model did not obtain a new best performance. In our settings, the constant factor is 0.5. Note that, while neural networks are usually trained using minibatches, its size for our implementation is 1 because we achieved the best performance in terms of the convergence of accuracy. The accuracy (MAE) saturated after approximately 20 epochs (see the learning curves in Figure 7), whereas, for example, SchNet 11 requires from 750 to 2,400 epochs with 32 minibatch sizes to saturate. There are some related studies modeling the atomic potentials within machine learning; 4,17–19 in particular, Smith et al., 2017 19 developed neural network potentials (NNPs). The NNP computes each atomic potential in a molecule, and then the total molecular energy is obtained by the sum of atomic energies produced by the NNP. On the other hand, our

20

ACS Paragon Plus Environment

Page 21 of 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry Letters

model computes the potentials between two atoms in a molecule, which are always considered in all layers of the deep network. This allows us to learn the atomic potentials considering the global molecular structure. We believe that this is a huge difference between their model and ours. In addition, Smith et al., 2017 19 also investigated the extrapolation in terms of the molecular size and energy prediction; their evaluation setting is similar to ours. They also trained the proposed model called “ANI-1” with small (8 heavy atoms) molecules and test the trained ANI-1 model with larger (10 heavy atoms) molecules. The prediction performance for the molecular energy is poorer when the model is tested with larger molecules, i.e., RMSE = 1.3 (8 heavy atoms) and RMSE = 1.9 (10 heavy atoms).

References (1) Kohn, W.; Sham, L. J. Self-consistent equations including exchange and correlation effects. 1965; p A1133. (2) Rupp, M.; Tkatchenko, A.; Müller, K.-R.; Von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. 2012; p 058301. (3) Snyder, J. C.; Rupp, M.; Hansen, K.; Müller, K.-R.; Burke, K. Finding density functionals with machine learning. 2012; p 253002. (4) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; Von Lilienfeld, O. A.; MuÌĹller, K.-R.; Tkatchenko, A. Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space. 2015; pp 2326–2331. (5) Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S.; Dahl, G. E.; Vinyals, O.; Kearnes, S.; Riley, P. F.; von Lilienfeld, O. A. Prediction errors of molecular machine learning models lower than hybrid DFT error. 2017; pp 5255–5264. (6) Krizhevsky, A.; Sutskever, I.; Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems. 2012. 21

ACS Paragon Plus Environment

The Journal of Physical Chemistry Letters 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(7) Graves, A.; Mohamed, A.-r.; Hinton, G. Speech recognition with deep recurrent neural networks. Acoustics, speech and signal processing (icassp), 2013 ieee international conference on. 2013. (8) Sutskever, I.; Vinyals, O.; Le, Q. V. Sequence to sequence learning with neural networks. Advances in neural information processing systems. 2014. (9) Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantumchemical insights from deep tensor neural networks. 2017; p 13890. (10) Schütt, K.; Kindermans, P.-J.; Felix, H. E. S.; Chmiela, S.; Tkatchenko, A.; Müller, K.R. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems. 2017. (11) Schütt, K. T.; Sauceda, H. E.; Kindermans, P.-J.; Tkatchenko, A.; Müller, K.-R. SchNet–A deep learning architecture for molecules and materials. 2018; p 241722. (12) Schölkopf, B.; Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. 2002. (13) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. 2014; p 140022. (14) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; Dahl, G. E. Neural message passing for quantum chemistry. Proceedings of the international conference on machine learning. 2017. (15) He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. (16) Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. 2009; pp 61–80.

22

ACS Paragon Plus Environment

Page 22 of 24

Page 23 of 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry Letters

(17) others„ et al. Deep potential: A general representation of a many-body potential energy surface. 2017. (18) Zhang, L.; Han, J.; Wang, H.; Car, R.; Weinan, E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. 2018; p 143001. (19) Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. 2017; pp 3192–3203. (20) Lubbers, N.; Smith, J. S.; Barros, K. Hierarchical modeling of molecular energies using a deep neural network. 2018; p 241715. (21) LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. 2015; p 436. (22) Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. 2014. (23) Luong, M.-T.; Pham, H.; Manning, C. D. Effective approaches to attention-based neural machine translation. 2015. (24) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems. 2017. (25) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. 2017. (26) Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. 2014.
