Automated Fragmentation Polarizable Embedding Density Functional

Dec 19, 2016 - Finally, we use this subset of snapshots to calculate the NMR shielding constants at the PE-KT3/pcSseg-2 level of theory for all atoms ...
0 downloads 0 Views 2MB Size
Subscriber access provided by Fudan University

Article

Automated Fragmentation Polarizable Embedding DFT Calculations of NMR Shielding Constants of Proteins with Application to Chemical Shift Predictions Casper Steinmann, Lars Andersen Bratholm, Jógvan Magnus Haugaard Olsen, and Jacob Kongsted J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.6b00965 • Publication Date (Web): 19 Dec 2016 Downloaded from http://pubs.acs.org on December 20, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Automated Fragmentation Polarizable Embedding DFT Calculations of NMR Shielding Constants of Proteins with Application to Chemical Shift Predictions Casper Steinmann,∗,†,‡ Lars Andersen Bratholm,¶ J´ogvan Magnus Haugaard Olsen,‡ and Jacob Kongsted‡ †Centre for Computational Chemistry, School of Chemistry, University of Bristol, Bristol BS8 1TS, United Kingdom ‡Department of Physics, Chemistry and Pharmacy, University of Southern Denmark, DK-5230 Odense M, Denmark ¶Department of Chemistry, University of Copenhagen, DK-2100 Copenhagen, Denmark E-mail: [email protected] Abstract Full-protein NMR shielding constants based on ab initio calculations are desirable since they can assist in elucidating protein structures from NMR experiments. In this work we present NMR shielding constants computed using a new automated fragmentation (J. Phys. Chem. B 2009, 113, 10380–10388) approach in the framework of polarizable embedding density functional theory. We extend our previous work to give both basis set recommendations and comment on how large the quantum mechanical region should be in order to successfully compute

13

C NMR shielding constants that

are comparable with experiment. The introduction of a probabilistic linear regression

1

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

model allows us to substantially reduce the number of snapshots that are needed to make comparisons with experiment. This approach is further improved by augmenting snapshot selection with chemical shift predictions by which we can obtain a representative subset of snapshots that gives the smallest predicted error compared to experiment. Finally, we use this subset of snapshots to calculate the NMR shielding constants at the PE-KT3/pcSseg-2 level of theory for all atoms in the protein GB3.

1

Introduction

NMR spectroscopy is a very powerful tool to elucidate protein structures. 1,2 There are, however, limitations on the types of structures NMR spectroscopy is useful for due to complexity of the resulting spectra as the system size increases. Use of computational tools to help understand these spectra has seen widespread adoption in recent years. 3 Parametrized NMR prediction tools based on simple rules about molecular geometry are available but genuine high-quality predictions of key NMR parameters directly from quantum mechanical (QM) calculations are generally preferable. The application of ab initio calculations to large molecular systems, such as proteins, gives rise to a variety of different problems most notably the severe scaling issue associated with even a modest level of theory. In recent years, it has become computationally feasible to perform Kohn-Sham density functional theory (KS-DFT) calculations on systems upwards and beyond thousands of atoms. 4 For example, Hartman et al. recently 5 computed NMR shielding constants for all atoms within a radius of up to 7 ˚ A of the substrates N -(40 -trifluoromethoxybenzoyl)-2-aminoethyl phosphate and 2aminophenol quinonoid intermediate in the α and β active sites, respectively, of tryptophan synthase. 6,7 Here, the strategy to reduce computational cost was to use a locally-dense basis set approach 8,9 in combination with ONIOM 10,11 -type calculations. Similarly, Flaig et al. used a linear-scaling approach 13,14 to compute NMR shielding constants for a wide range of systems (solute-solvent as well as proteins) with a QM/MM approach using more than a thousand atoms in the QM region. 12 Even though it is surely desirable to use KS-DFT 2

ACS Paragon Plus Environment

Page 2 of 37

Page 3 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

for calculations of molecular properties of large molecular systems based on a full-structure quantum mechanical description, 15 problems with current exchange-correlation (XC) functionals might lead to artifacts in such computations. XC functionals that seemingly perform adequately for small systems may not be suitable for larger systems. In fact, even for relatively small molecules, approximate XC functionals can in some cases lead to large errors in the computed quantities. 16 For example Jakobsen et al. showed that the electrostatic potential of insulin (787 atoms) generated by the B3LYP 18–20 functional, among others, had large errors compared to the MP2 reference electrostatic potential, clearly showing that the electron density produced by B3LYP was not even qualitatively correct. 17 Such an error in the electronic density is bound to also give erroneous NMR shielding constants. The error in the electrostatic potential of insulin could be partially corrected by using the CAM-B3LYP 21 functional. 17 However, XC functionals optimized for NMR shielding constants such as the KT3 22 generalized-gradient approximation XC functional used in this work might be prone to the same problems as the B3LYP functional. An alternative to full-structure KS-DFT calculations is to use a focused multiscale embedding approach provided that the system can be divided into two non-overlapping regions. The central/important region is treated using quantum mechanics, i.e. the quantum region. The effects from the remaining parts of the system, i.e. the classical region, are modeled using a polarizable embedding potential. Traditional QM/MM-embedding approaches only include a static charge-distribution described by partial charges. In this work we use the polarizable embedding (PE) model 23,24 where the embedding potential is based on multicenter multipole expansions for both the static and first-order induced charge-distributions. More specifically, each atomic site in the environment is assigned a set of permanent multipole moments as well as an anisotropic dipole-dipole polarizability which are derived from fragment ab-initio calculations. For large molecules a fragmentation scheme based on the molecular fractionation using conjugate caps (MFCC) is used. 25,26 To reproduce the electronic properties of the quantum region, assuming that they are well-localized, it is imperative that the electrostatic

3

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

potential from the classical region is accurate. It has been shown that using parameters, i.e. multipoles and polarizabilities, based on fragment calculations leads to highly accurate electrostatic potentials for solvents,, 23,27,28 DNA, 29 and proteins. 30 Comparing the electrostatic potential of insulin obtained at the MP2 level of theory with those obtained using different classical potentials, Olsen et al. showed that polarizable potentials based on fragment calculations, such as the ones used in the PE model, were able to reproduce the MP2 results with high accuracy compared to traditional force fields and rival even a full-structure quantummechanical treatment except at very short distances. 30 The PE model has been formulated with special focus on response and transition properties and has thus been combined with quantum-mechanical response theory 31 for both single reference (Hartree-Fock and KohnSham density functional theory) methods 23,24 and multi-reference methods. 32 See ref. 33 for a perspective on recent developments. In order to obtain reliable NMR shielding constants, the PE method includes corrections 34,35 to remove the dependence on the gauge-origin in the obtained shielding tensors based on gauge-including atomic orbitals. 36–38 In this work we seek a way to potentially calculate NMR shielding constants of all atoms in a protein. Setting up polarizable embedding DFT (PE-DFT) calculations for each residue in a protein for several snapshots is an exhaustive task that definitively should be automated. Therefore we have developed our own automated fragmentation (AF) method originally proposed by He et al. 39 which automatically derives a suitable quantum mechanical region for each residue based on simple geometrical rules. The entire protein is fragmented into nonoverlapping fragments called core regions which for proteins are single amino acid residues as illustrated in Figure 1. Nearby residues or solvent molecules within a certain distance from the core unit are included in a buffer region. The combined core and buffer region is treated using KS-DFT. In the original formulation of the automated fragmentation approach 39 as well as in later works, 40–42 the remainder of the system, outside the buffer region, is described using an embedding potential based on partial charges obtained either from an AMBER force field, 43 Mulliken charges or semi-empirical charge models. 44,45 Later, an implicit solvent

4

ACS Paragon Plus Environment

Page 4 of 37

Page 5 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 1: The base unit of computation in the automated fragmentation scheme is a single amino acid residue called the core region. R1 is the amino acid sidechain whereas the generic R group links to the protein backbone. In the simplest model system used to compute protein NMR shielding constants R is replaced with H. More elaborate model systems are constructed to increase accuracy. model was included which showed some improvement for the predicted hydrogen chemical shifts. 41,46 When analyzing the computed NMR shielding constants only results from the core region are used. This process is repeated for all residues in a protein and allows for embarrassingly parallel scaling since all NMR shielding calculations are independent. In this work, we have combined the automated fragmentation method with polarizable embedding density functional theory calculations (AF-PE-DFT) to enable predictions of NMR shielding constants in large systems. Our focus will be on NMR shielding constants for carbon atoms as they are experimentally most relevant but we will include discussions about hydrogen atoms too. Discussions on computed NMR shielding constants for other heavy elements are limited as the methodological requirements are out of scope for this work and will be subject to future studies. The definition of the core was presented above and in Section 2.2 we describe in detail how the buffer is constructed. The rest of the system is represented by an embedding potential as outlined above. From our previous work on NMR shielding constants in solute-solvent systems 34,35 it is our hypothesis that a large quantum mechanical region might not be necessary provided that the surrounding environment is described using a high-quality polarizable embedding potential. Comparison of computed NMR shielding constants with experimental chemical shifts are often done by averaging over several snapshots extracted from a molecular dynamics simulation. The number of snapshots to include in such a comparison is a compromise between computational cost versus the uncertainty in the calculated mean of the NMR

5

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

shielding constants. In fact, the appropriate number of snapshots to be included in the averaging can be hard to quantify. Because of this, we have tested three different approaches to select snapshots which are based on probabilistic models: The first approach is based on selecting snapshots randomly without considering individual residues. The two final approaches includes additional residue specific information, the first of which is based on using clustering techniques 47,48 to group similar structures to reduce the number of redundant calculations required. Finally we tested if augmenting snapshot selection by employing fast chemical shift predictions using ProCS15 49 would provide additional information. This paper is organized as follows. In Section 2 we will provide computational details on the molecular dynamics simulation, the potential generation, the automated fragmentation procedure and NMR shielding constant calculations. In Section 3 and subsections therein we will discuss the results we have obtained by applying the proposed AF-PE-DFT approach to obtain NMR shielding constants. This includes discussions on basis sets, required buffer size and possible ways to reduce computational cost of AF-PE-NMR using the above mentioned use of probabilistic models. Finally, in Section 4 we summarize our findings and provide an outlook for future work.

2 2.1

Computational Details GB3 and Molecular Dynamics

In this work we used the refined structure of the third IgG-binding domain of Protein G (GB3, PDB: 2OED) as our starting point. 50 GB3 is well studied and has plenty of experimental data available and thus serves as an excellent benchmark system for this study. The initial structure was protonated at pH = 6.5 using PDB2PQR 51,52 and PROPKA 53 resulting in an overall charge of −2. The protonated GB3, which consists of 56 residues, which amount to 862 atoms, was prepared for classical molecular dynamics simulations in GROMACS 54–56 with the CHARMM22 protein force field. 57,58 GB3 was solvated in 7059 TIP3P 59,60 water 6

ACS Paragon Plus Environment

Page 6 of 37

Page 7 of 37

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

core. For example, if the core region is Lys10, then residues 9 and 11 are included as buffers in the quantum region. The third model, 3, is an extension of model 1. Here, molecules that are hydrogen bonded (solvent molecules or nearby residues) but not covalently bonded are included in the buffer region. Finally, we use a combination of 2 and 3 to generate model 4 where covalently bonded residues and molecules that form hydrogen bonds to the core are included in the buffer region. In all cases the core+buffer region is capped with hydrogen atoms to satisfy valency. This procedure results in a total of four models for each of the 56 residues in GB3 for all 93 snapshots giving rise to a total of 20832 individual PE-DFT calculations. Each of these core+buffer regions are subject to the embedding potential described below in Section 2.3. PE-DFT calculations are discussed in Section 2.4.

2.3

Embedding Potential

The procedure for deriving the embedding potential has been described extensively in previous works 30,62 so we provide only a brief summary. Based on the molecular fractionation with conjugate caps 25 (MFCC) approach, S¨oderhjelm and Ryde 26 devised a computational protocol to derive multicentered classical parameters, e.g. atom-centered, for a large system based on a series of fragment calculations. Each amino acid is treated as a fragment and each fragment is appropriately capped with N -methyl and acetyl groups to satisfy valency and to mimic neighboring residues. Neighboring capping groups are joined to form socalled concaps. We then obtain distributed atom-centered multipole moments up to and including quadrupoles and anisotropic electric-dipole polarizabilities from calculations on the capped fragments and concaps. The embedding potential for the entire system can then be constructed by making the appropriate sums over atomic sites in both capped fragments and concaps. 26 All classical parameters are based on the LoProp procedure 63 using MOLCAS. 64,65 We used the B3LYP XC functional together with an ANO-type recontraction of the 6-31+G(d) 66–68 basis set as required by the LoProp procedure. 63 For solvent molecules, 8

ACS Paragon Plus Environment

Page 8 of 37

Page 9 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

the embedding potential is straightforwardly obtained from a gas-phase calculation at the same level of theory. This procedure has been automated in an in-house script which has been interfaced with FragIt 69 to remove any restrictions on fragmentation.

2.4

NMR Shielding Constants

In all calculations involving the evaluation of NMR shielding constants we used the KT3 22 generalized-gradient approximation XC functional on the core+buffer region together with the recent segment contracted 70 polarization consistent basis sets optimized specifically for NMR shielding constants (pcSseg-n). 71 We used the double-, triple- and quadruple-zeta quality, i.e. pcSseg-n (n=1,2,3), basis sets. In this work, we have not attempted to incorporate locally-dense basis sets 8,9,72 but this will be the subject of future studies. All NMR shielding constant calculations were carried out in DALTON 73,74 which is interfaced with the polarizable embedding library. 75 London orbital contribtions to the one-electron integrals were included through Gen1Int. 76,77 In all calculations we used a hydrogen link-atom scheme where classical charges within 1.5 ˚ A from any quantum-region atom were redistributed to the three nearest classical sites outside this threshold. Higher-order multipole moments and polarizabilities within this threshold distance were discarded. This was done in order to avoid overpolarization effects. 78

3

Results

Considering that the embedding potential for each of the 93 snapshots is to be calculated followed by PE-DFT calculations using the four models for each of the 56 residues in the GB3 protein, it is key to find a good compromise between accuracy and computational cost. It would therefore be preferable if the double-zeta pcSseg-1 basis set is adequate when comparing calculated NMR shielding constants with experiment. We start out in Section 3.1 by comparing NMR shielding constants obtained at the PE-KT3/pcSseg-1 and

9

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

PE-KT3/pcSseg-2 levels to constants from PE-KT3/pcSseg-3 calculations. Based on these results, Section 3.2 investigates how well NMR shielding constants calculated using the AFPE-DFT method reproduce experimental chemical shifts. This is done using straightforward averaging based on all 93 snapshots included in this study. However, even with a reduction in basis set size and model system size the computational costs are still high. This is in part because of the embedding potential construction but also because of the high number of embedded core+buffer calculations that must be performed. Regarding the former issue we are currently working on constructing averaged embedding parameters. This work is already completed for several solvent molecules, 28 the four nucleotides in DNA 29 and we are working on phospholipids. The latter issue can be dealt with using appropriate statistics and we will demonstrate in Section 3.3 how we can control the loss in accuracy from calculated NMR shielding constants when reducing the number of included snapshots. Finally, in Section 3.4 we present results based on PE-KT3/pcSseg-2 NMR shielding constants evaluated for an appropriate subset of snapshots which is selected based on the statistical analysis carried out in Section 3.3.

3.1

Basis-Set Effects

Figure 3 shows the correlation between

13

C NMR shielding constants calculated at the PE-

KT3/pcSseg-3 level of theory are and constants obtained using the pcSseg-1 and pcSseg-2 basis sets for model 2. The presented NMR shielding constants include only one of each residue type present in GB3 and only from the first snapshot due the computational demand of the pcSseg-3 basis set. For the largest calculations involving model 2 the number of basis functions in the calculations was around 3000. Calculations on the larger model systems (3 and 4) are too computationally demanding when using the pcSseg-3 basis set. It can be seen in Figure 3 that the NMR shielding constants obtained at the PE-KT3/pcSseg-1 and PE-KT3/pcSseg-2 levels show a linear relationship with PE-KT3/pcSseg-3. A maximum likelihood estimation of the regression parameters in a linear model between the 10

ACS Paragon Plus Environment

Page 10 of 37

Page 11 of 37

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Table 1: Calculated RMSDs between pcSseg-3 NMR shielding constants and back-calculated shielding constants from a linear fit to the pcSseg-1 and pcSseg-2 basis sets.

All C C’ Cα Cβ Cγ Cδ Cǫ all H all N all O

RMSD [ppm] pcSseg-1 pcSseg-2 0.78 0.22 0.46 0.13 0.30 0.10 0.79 0.28 0.77 0.30 0.70 0.20 0.70 0.22 0.34 0.14 0.98 0.37 4.09 1.62

Slope pcSseg-1 pcSseg-2 1.03 1.00 1.03 1.00 1.07 1.01 1.04 1.00 1.04 1.00 1.04 1.00 1.04 1.00 1.01 1.01 1.06 1.00 0.96 0.99

The RMSD of 0.78 ppm for all carbon atoms can be broken down into individual atom types from the side chains. In Table 1 we present RMSDs and slopes for all those fits. We observe that for Cα the RMSD drops to 0.30 and 0.10 ppm for pcSseg-1 and pcSseg-2, respectively, compared to the all carbon RMSD, whereas for Cβ , Cγ , Cδ the RMSDs are largely unchanged from the all carbon result for pcSseg-1 which are 0.79, 0.77 and 0.70 ppm, respectively. This observation is in agreement with the already mentioned NMR shielding constants obtained for Cδ from Glu15 and also the carboxylate groups in Thr11. In general an increased basis set size is preferable but the data presented in Table 1 shows that the improvement depends on which carbon atom is under investigation. For carbonyl carbons it is clear that the additional flexibility from the pcSseg-2 basis set should be included in a locally-dense basis-set approach to obtain better agreement with the pcSseg-3 basis set. Even though there is some loss of accuracy from using pcSseg-1 over pcSseg-2 compared to pcSseg-3, as reflected in both Figure 3 and the RMSDs in Table 1, we find that the predicted NMR shielding constants at the PE-KT3/pcSseg-1 level of theory are accurate enough to be used in comparison with experiment when we, in the following section, investigate all four proposed models.

12

ACS Paragon Plus Environment

Page 12 of 37

Page 13 of 37

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 14 of 37

Page 15 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Table 2: RMSD between calculated NMR shielding constants and experimental chemical shifts based on a linear regression model. All units in ppm. Method 1 2 3 4

Cα 2.81 1.40 2.84 1.25

Cβ 4.38 1.62 4.29 1.70

HN 0.68 0.68 0.66 0.61

Hα 0.63 0.59 0.63 0.55

NH 6.00 4.11 6.33 4.15

All C 5.01 2.79 4.76 2.68

All H 2.36 0.78 2.07 0.69

of the backbone, like model 2, and hydrogen-bonded neighbors, like model 3. Overall, we see that for hydrogen atoms in general, model 2 is on par with model 4 with RMSDs of 0.78 and 0.69 ppm, respectively. We note that this includes non-polar hydrogen atoms which have fewer requirements on both the quantum-mechanical and environmental description. The errors for hydrogen NMR shielding constants are lower than for other elements because of the smaller values of NMR shielding constants that hydrogen atoms have in general. We also include backbone nitrogen NMR shielding constant RMSDs in Table 2 as well because they have some importance in protein structure determination. They show much larger RMSDs (> 4 ppm for all models) than what was obtained for carbon but this is not unexpected as the large variations observed for nitrogen depend more heavily on the level of electronic structure theory. 80 Chemical shifts of nitrogen also exhibits a larger span in general than carbon atoms in protein structures. 81 Based on the above analysis, we conclude that for the purpose of computing NMR shielding constants of

13

C, model 2 is sufficiently accurate. In the following sections all data

analysis are therefore based on PE-KT3/pcSseg-1 level of theory on model 2.

3.3

Selecting Snapshots to Reduce Computational Cost

The above approach is traditional in the sense that 1) we select a number of snapshots evenly separated in time and 2) calculate all NMR shielding constants of all atoms of all residues for all snapshots. In this limit, where all 93 snapshots are included, we know the uncertainty of the computed NMR shielding constants as presented in the previous section. We now ask

15

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 37

the question: Is there a representative subset of snapshots, e.g. 20, that would yield similar results, both in terms of mean NMR shielding constants and variance, but at a reduced cost? To answer this we will explore the use of a probabilistic linear regression model:

yi = aµi + b + ǫ and xik = µi + ǫi .

(1)

Here, yi is the experimental chemical shift of atom i, xik is the calculated NMR shielding constant of atom i from snapshot k. All calculated shielding constants, xik , are distributed around the population mean µi which in principle should be evaluated as the mean in the P limit of infinite snapshots but in practice calculated from x¯i = K1 K n=1 xik , for atom i. In

eq 1 a and b are regression coefficients. We assume ǫ and ǫi are normally distributed errors

ǫ ∼ N (0, σ 2 ) and ǫi ∼ N (0, s2i ). The total variance of atom i in the predicted chemical shifts based on eq 1 is σ 2 + s2i where s2i is the unbiased sample variance of the calculated shielding of atom i and σ 2 is the rest of the variance. Both σ 2 and s2i will be discussed in more detail below. To simplify the probability distribution of yi we integrate out the population mean µi . This is done in detail in the appendix (eqs 6 to 8). The resulting probability distribution of yi becomes  yi ∼ N a¯ xi + b, σ 2 + a2 s2i /K ,

(2)

which means that the experimental chemical shift for an atom i is normally distributed with a mean a¯ xi + b, calculated from the mean NMR shielding constant, x¯i , and a variance that is comprised of two terms σ 2 + a2 s2i /K. The first term, σ 2 , describes the variance that arises from the choice of method used to calculate NMR shielding constants. The second term, a2 s2i /K, stems mainly from the variation of geometry in the MD snapshots but can have some dependence on the choice of method etc., but as the number of snapshots included is increased this term goes to zero. We will investigate both of these terms in Section 3.3.1. The probabilistic model of the chemical shift, yi , for atom i as presented in eq 1 will give a statistical prediction, yipred = E[yi ], calculated from the median of a series of Monte Carlo 16

ACS Paragon Plus Environment

Page 17 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(MC) simulations which samples the regression coefficients a and b as well as the σ 2 a finite number of times. In all calculations we assumed uniform priors for a and b and the Jeffrey’s prior (σ −2 ) for σ 2 , i.e. π (a, b, σ 2 ) ∝ σ −2 . Monte Carlo simulations were carried out with the python package PyMC. 82 This probabilistic approach is different from the results presented in Table 2 which was a maximum likelihood estimation of a and b from all 93 snapshots used to calculate yi . 3.3.1

Random Selection of Snapshots

We start out by selecting K snapshots randomly from the 93 snapshots followed by Monte Carlo simulations to simulate the parameters a, b and σ 2 for all atoms in those snapshots. This was repeated 100 times for each K. Based on that we calculate yipred as described above. In Figure 6 we plot the median of the standard deviations from our predictions of the chemical shift (yipred ) of Cα to the experimental chemical shift (yiexp ) as a function of the number of included snapshots used to calculate yipred (solid red line). In order to show the error distribution we plot data corresponding to ±68% centered around the mean in light red. This would correspond to an uncertainty of one standard deviation for normally distributed data. In Figure 6 we see that the standard deviation starts out at 1.7 ppm when including only two snapshots. As the number of snapshots used in the MC simulation is increased, the error in the predicted chemical shift falls to below 1.5 ppm at just seven snapshots. This should be compared to the limiting value of 1.42 ppm when K = 50 snapshots are included. We also show the maximum likelihood estimation from Table 2 as a dashed black line. Clearly, acceptable predictions are obtainable with less than 93 snapshots even if they are selected at random. Reducing the computational cost by a factor of three (∼ 31 snapshots) gives predicted errors of 1.44 ppm with uncertainties ±0.2 ppm which would be sufficiently accurate to make predictions. In Figure 7 we show the median of the residue-specific part of the variance, a2 s2i /K, i.e. the second term in eq 2 from the 100 Monte-Carlo simulations for K = 5, 10 and 20 snapshots. Here it can be seen that the minimum number of snapshots

17

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 18 of 37

Page 19 of 37

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 37

of the descriptors for the remaining residues. Here we focus on the NMR shielding constant of Cα , so the first clustering descriptor is based on inter-atomic Cα distances between all residues in a single snapshot. We call this scheme CA. Before clustering the data according to this descriptor, the dimensionality of each descriptor was reduced from 55 to 10 using principal component analysis. The other descriptor we used was the Sorted Coulomb Matrix 85,86 where a matrix Mk is constructed for each residue for each snapshot k with elements

Mijk =

   0.5 · Z 2.4   

i

for i = j

Zi Zj |Ri −Rj |

for i 6= j

.

(3)

Here, Zi is the nuclear charge of atom i and Ri is the Cartesian coordinates of atom i. The diagonal term in eq 3 corresponds to atomization energies and the off-diagonal terms are Coulomb repulsion between nuclei i and j. Different from the CA model above, the Coulomb Matrix descriptor uses all atoms in the protein from a single snapshot. However, only atoms within 8 ˚ A of the Cα atom in a residue are included to reduce noise. This corresponds, in effect, to placing all atoms outside of this threshold at infinite distance. As the number of atoms within this cutoff distance will differ between residues and even snapshots, the Mk matrices are padded with zeros to make them equal in size. After the matrices are constructed for each atom, the indices are permuted in such a way that the rows (and columns) are sorted with respect to their l2 -norm. 87 This scheme is called SCM. We clustered the data for both CA and SCM into either N = 5 or N = 10 groups according to the two descriptors presented above. From each of the clusters we selected the snapshot that was closest to the mean within each cluster. Based on these five or 10 snapshots for each residue we again performed a Monte-Carlo simulation to estimate the parameters a, b and σ 2 and predict the chemical shift. In Figure 6 we have marked the standard deviation of these predictions as crosses. They are 1.52 and 1.42 ppm for the CA model using five or 10 snapshots, respectively. The SCM results are 1.49 and 1.40 ppm for 20

ACS Paragon Plus Environment

Page 21 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

five or 10 snapshots, respectively. Both descriptors provide similar results but in all cases they are slightly better than the random selection from the previous section. Thus, the added flexibility by providing some residue-specific information by these clustering models gives rise to this modest improvement. We also attempted to improve snapshot selection for each individual residue based on two schemes (A and B) using the ProCS15 chemical shift predictor to help guide residue selection. For both schemes we predicted the chemical shifts of Cα of all residues of all 93 snapshots in GB3 with ProCS15. The predicted chemical shifts for Cα in residue j are sorted according to magnitude to construct the cumulative distribution function (CDF) We found that selecting the Kj snapshots for residue j that were equally spaced on the CDF of the predicted chemical shifts gave good results, as this maximized the predicted variance without impairing the predicted mean. This was necessary since the probability distributions of the ProCS15 predicted chemical shifts and the PE-KT3/pcSeg-1 results are not equal. For example, to select Kj = 3 snapshots for residue j from all 93 snapshots, we partitioned the CDF at three equisized intervals and the snapshots were selected as the center of these intervals. We illustrate this concept in Figure S1 in the Supporting Information for Kj = 3. This selection strategy was the same for all choices of Kj . The snapshots representing these data points were then used to calculate yipred from the PE-KT3/pcSseg-1 results based again on Monte-Carlo simulations like it was done both in the random selection of snapshots and in the clustering-based results above. For the ProCS15A model this was done for Cα for each residue in GB3 and the results shown in Figure 6 for Kj = 5 and Kj = 10. The predictions made by ProCS15A for Kj = 5 and Kj = 10 are 1.52 and 1.43 ppm, respectively. The difference to SCM, for example, is negligible which had values of 1.49 and 1.40 ppm, respectively. We note that for the random scheme, the CA, SCM and ProCS15A schemes the computational demand is K × 54 calculations, i.e. 270 and 540 for the examples used above which should be compared to the naive approach of performing ∼ 5000 calculations. Figure 7 showed that the variance in the predicted chemical shifts varies greatly with

21

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

the specific residue. In order to take this variance into account we construct the ProCS15B scheme. Here we start out by selecting Kj = 3 snapshots for each residue j according to the procedure outlined above. Thus, the average number of snapshots included for all residues is ¯ = 3. However, if a Cα atom in residue j has a large residue specific variance s2j (calculated K from ProCS15 predictions) we increase Kj by one. This process was repeated until the total number of snapshots included per residue on average were equal to those of ProCS15A, i.e ¯ = 5 in ProCS15B which resulted in a total of 270 calculations but varying between 3 K ¯ = 10 a (for Tyr3 for example) and 17 (for Lys10) snapshots per residue. Similarly, for K total of 540 calculations are required. The snapshots we used for each residue is listed in Tables S8 and S9 in the Supporting Information. The ProCS15B results are 1.46 and 1.35 ¯ j = 5 and K ¯ j = 10, respectively and are shown in Figure 6. The ProCS15B ppm for K ¯ j = 10 is better than the results we obtain by including all 93 snapshots in result for K the computation. This result is, however, not unexpected since the observed variance in the results (shown in light red) does allow for random combinations of snapshots to give results with such an error in the prediction. ProCS15A and ProCS15B differ in the way that snapshots are selected from the ProCS15 chemical shift predictions. In ProCS15A we select a partitioning of the CDF and select the Kj = 5 or Kj = 10 most representative structures based on the predicted chemical shifts from ProCS15. Whereas in ProCS15B we start out by ¯ j = 3 selected the same way as for ProCS15A. If the residue a small number of snapshots K specific variance is large we refine the snapshots by increasing Kj by one. The ProCS15B scheme would in principle be applicable, with very low computational overhead, to an entire MD trajectory and not just a subset of snapshots as done in this work since the ProCS15B scheme does not require any a priori computed PE-KT3 results either. Currently the combined cost of evaluating the embedding potential in this case together with the NMR calculations makes such an approach infeasible. We are, however, working on extending our current database of polarizable embedding potentials to also include amino acids at which point we will pursue the above computational strategy.

22

ACS Paragon Plus Environment

Page 22 of 37

Page 23 of 37

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

mean signed error and mean absolute error of the data in Figure 8B is 0.02 and 0.63 ppm, respectively. A linear fit of either of the two presented sets of computed NMR shielding constants to the experimental results give comparable results when evaluating RMSDs for ¯ j = 10 and K ¯ j = 5. These back-calculated shielding constants yield 2.29, and 2.30 ppm for K results are obtained from 540 and 270 calculations at the PE-KT3/pcSseg-2 level of theory, respectively, which is far fewer calculations than the corresponding results obtained for all K = 93 snapshots at the PE-KT3/pcSseg-1 results 5208 calculations yielding an RMSD of back-calculated shielding constants of 2.35 ppm.

4

Summary and Outlook

We have taken the first steps towards an automated procedure to obtain NMR properties of large molecular structures in the framework of polarizable embedding DFT (PE-DFT). This procedure is realized in the form of an automated fragmentation 39 strategy to automatically generate appropriate quantum mechanical regions from a core (a single residue) and a buffer (nearby residues) for each residue in a protein. Each quantum region is surrounded by a classical polarizable embedding potential and PE-DFT calculations are carried out to evaluate NMR shielding constants for each atom. We have used our implementation of the automated fragmentation procedure to evaluate the NMR shielding constants of all carbon atoms in the third IgG-binding domain of Protein G (GB3) in order to compare how well our results compare with experimental chemical shifts. Initial tests on basis set requirements suggested that the pcSseg-1 basis set was adequate in comparison with pcSseg-3 (RMSD of 0.78 ppm). The same calculations done with pcSseg2 lowered the RMSD to 0.22 ppm. The pcSseg-1 basis set allowed us to test a large amount of quantum regions of varying size and composition for all snapshots included in this study. Comparing the computed NMR shielding constants using pcSseg-1 with experimental values we found the expected linear anti-correlation. Based on the calculations of the varying

24

ACS Paragon Plus Environment

Page 24 of 37

Page 25 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

region sizes we found that in order to accurately calculate

13

C NMR shielding constants a

buffer region consisting of only the neighboring covalently bonded residues gave good results. Including hydrogen-bonded residues and nearby water molecules yielded little improvement. It is expected that other buffer regions will be necessary for other atoms types such as backbone HN which will be the subject of future studies. In order to be able to afford the pcSseg-2 basis set a reduction in the number of snapshots required was needed. To make this reduction possible and gain insight into the convergence behavior of the computed NMR shielding constants we used a probabilistic linear regression model. We show that by selecting snapshots randomly, 30 snapshots are adequate to obtain a reasonable estimate of the NMR shielding constants of Cα atoms in GB3. However, we also found that individual residues showed very different convergence behavior with respect to the number of snapshots required to obtain a certain threshold in the computed NMR shielding constant. To take this residue-specific convergence behavior into account we found that it was possible to use a fast chemical shift predictor in the selection of appropriate snapshots for each residue. This procedure yielded results that were of the same quality in terms of mean and variance, as when using all snapshots for all residues but at a reduced cost. The procedure is outlined in Appendix B. We have identified several important future directions to take in this manuscript. One direction is to investigate other atom types which also have experimental relevance such as the HN , N and Hα atoms. These atoms will most likely have other requirements for the core and buffer region than the carbon atoms tested in this study. We have also shown in this work, that for some atoms, the basis set requirements are quite large whereas for others they are not. As such, a locally-dense basis set approach would be interesting and a subject of future studies. Even though chemical shift predictors based on empirical rules or quantum mechanical calculations perform well, the strength in the automated fragmentation approach coupled with polarizable embedding DFT calculations lies in the fact that it can be applied to systems

25

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 37

where chemical shift predictors are not applicable nor adequate, for example in investigations when a ligand binds to a protein.

Acknowledgement Computational resources were provided by the DeIC National HPC Center at the University of Southern Denmark. C.S. thanks the Danish Council for Independent Research (the Sapere Aude program) for financial support (Grant no. 4181-00370). J.M.H.O. acknowledges financial support from the Danish Council for Independent Research (DFF) through the Sapere Aude research career program (Grant no. DFF–1323-00744 and DFF–1325-00091). J.K. thanks the Danish Council for Independent Research (the Sapere Aude program), the Villum Foundation, and the Lundbeck Foundation.

Supporting Information Available Tables S1 to S6 are mean correlation values of NMR shielding constants of carbon-atoms in GB3. Tables S7 and S8 lists the snapshots selected based on the ProCS15B scheme for ¯ j = 5 and K ¯ j = 10, respectively, Figure S1 shows the cumulative distribution function for K Cα in GB3. Zip archive with snapshots extracted from a molecular dynamics simulation. This material is available free of charge via the Internet at http://pubs.acs.org/.

Appendix A

Probability Distribution for the Linear Regression Model

In a probabilistic linear regression model defined as

yi = aµi + b + ǫ,

26

ACS Paragon Plus Environment

(4)

Page 27 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

where a and b are linear regression coefficients, ǫ is a normal distributed error, i.e. ǫ ∼ N (0, σ 2 ) and µi is the population mean of atom i defined as

xik = µi + ǫi .

(5)

Here, xik is the calculated chemical shift of atom i in snapshot k. There is an atom specific normal distributed error associated with this term ǫi ∼ N (0, s2i ). The probability to obtain the chemical shift yi of atom i, given the probabilistic linear model above with parameters a, b, σ 2 , x¯i and s2i , is p(yi |a, b, σ

2

, x¯i , s2i ) 1 K

where x¯i is the sample mean



Z

PK k



p(yi |a, b, σ 2 , µi ) · p(µi |{xik }, s2i )dµi ,

(6)

−∞

xik and K is the number of snapshots. Inserting the

expressions for the probabilistic linear model from eqs 4 and 5 then gives the following expression

p(yi |a, b, σ

where we use that

2

, x¯i , s2i )

QK k



Z

∞ 2

N (yi − aµi − b, σ ) · −∞

K Y

N (µi − xik , s2i )dµi ,

(7)

k

N (µi − xik , s2i ) = N (µi − x¯i , s2i /K). Inserting this expression and

carrying out the integral gives

p(yi |a, b, σ 2 , x¯i , s2i ) ∝ N (yi − a¯ xi − b, σ 2 + as2i /K),

(8)

which is the result shown in eq 2 in the main text.

B

Recommended Procedure To Extract Snapshots

Based on the calculations presented in the manuscript, we here outline what we believe to be the best procedure to obtain a reduced set of snapshots which are selected based on

27

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

experimental chemical shifts. In the following, we will only discuss the ProCS15B scheme (steps 4, 5 and 6a to 6c) specifically which requires the use of ProCS15 49 to obtain chemical shifts for all geometries whereas steps 1, 2, 3 and 7 are agnostic to the programs or properties involved. The property of interest (step 7) is in this work NMR shielding constants but we expect to use the procedure outlined below in future studies. 1. Obtain .pdb file and experimental chemical shifts for the structure of interest. 2. Run classical molecular dynamics at experimental conditions. 3. Extract N snapshots from the classical molecular dynamics simulation. 4. Evaluate NMR chemical shifts for Cα using ProCS15 for all residues for all N snapshots. 5. For each residue j set the number of snapshots to include to Kj = 3. 6. (a) Select the snapshots for each residue j which partitions the cumulative distribution function in Kj equal intervals. See Supporting Information Figure S1 for an example of this. (b) If the calculated variance of the predicted shielding constants for residue j is larger than some threshold, σ thres , increase Kj by one and repeat (6a). Otherwise keep Kj for residue j constant until (6c) is also finished. If all Kj are constant, go to (7) ¯ j is equal to (c) When the average number of snapshots to include j residues, i.e. K some predefined value, e. g. 10, then go to (7). 7. Evaluate the property of interest using the snapshots selected above.

28

ACS Paragon Plus Environment

Page 28 of 37

Page 29 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

References (1) Barrett, P. J.; Chen, J.; Cho, M.-K.; Kim, J.-H.; Lu, Z.; Mathew, S.; Peng, D.; Song, Y.; Van Horn, W. D.; Zhuang, T.; S¨onnichsen, F. D.; Sanders, C. R. Biochemistry 2013, 52, 1303–1320. (2) Mittermaier, A. Science 2006, 312, 224–228. (3) Robustelli, P.; Stafford, K. A.; Palmer, A. G. J. Am. Chem. Soc. 2012, 134, 6365–6374. (4) Ratcliff, L. E.; Mohr, S.; Huhs, G.; Deutsch, T.; Masella, M.; Genovese, L. ArXiv e-prints 2016, arXiv:1609.00252 [physics.chem-ph]. (5) Hartman, J. D.; Neubauer, T. J.; Caulkins, B. G.; Mueller, L. J.; Beran, G. J. O. J Biomol NMR 2015, 62, 327–340. (6) Lai, J.; Niks, D.; Wang, Y.; Domratcheva, T.; Barends, T. R. M.; Schwarz, F.; Olsen, R. A.; Elliott, D. W.; Fatmi, M. Q.; en A. Chang, C.; Schlichting, I.; Dunn, M. F.; Mueller, L. J. J. Am. Chem. Soc. 2011, 133, 4–7. (7) Caulkins, B. G.; Yang, C.; Hilario, E.; Fan, L.; Dunn, M. F.; Mueller, L. J. Biochim. Biophys. Acta 2015, 1854, 1194–1199. (8) Chesnut, D. B.; Moore, K. D. J. Comput. Chem. 1989, 10, 648–659. (9) DiLabio, G. A. J. Phys. Chem. A 1999, 103, 11414–11424. (10) Svensson, M.; Humbel, S.; Froese, R. D. J.; Matsubara, T.; Sieber, S.; Morokuma, K. J. Phys. Chem. 1996, 100, 19357–19363. (11) Dapprich, S.; Kom´aromi, I.; Byun, K.; Morokuma, K.; Frisch, M. J. J. Mol. Struc.: THEOCHEM 1999, 461-462, 1–21. (12) Flaig, D.; Beer, M.; Ochsenfeld, C. J. Chem. Theory Comput. 2012, 8, 2260–2271.

29

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(13) Ochsenfeld, C.; Kussmann, J.; Koziol, F. Angew. Chem. Int. Ed. 2004, 43, 4485–4489. (14) Kussmann, J.; Ochsenfeld, C. J. Chem. Phys. 2007, 127, 054103. (15) Cole, D. J.; Hine, N. D. M. J. Phys-Condens. Mat. 2016, 28, 393001. (16) Eriksen, J. J.; Sauer, S. P.; Mikkelsen, K. V.; Christiansen, O.; Jensen, H. J. A.; Kongsted, J. Mol. Phys. 2013, 111, 1235–1248. (17) Jakobsen, S.; Kristensen, K.; Jensen, F. J. Chem. Theory Comput. 2013, 9, 3978–3985. (18) Becke, A. D. J. Chem. Phys. 1993, 98, 5648–5652. (19) Vosko, S. H.; Wilk, L.; Nusair, M. Can. J. Phys. 1980, 58, 1200–1211. (20) Lee, C.; Yang, W.; Parr, R. G. Phys. Rev. B 1988, 37, 785–789. (21) Yanai, T.; Tew, D. P.; Handy, N. C. Chem. Phys. Lett. 2004, 393, 51–57. (22) Keal, T. W.; Tozer, D. J. J. Chem. Phys. 2004, 121, 5654. (23) Olsen, J. M.; Aidas, K.; Kongsted, J. J. Chem. Theory Comput. 2010, 6, 3721–3734. (24) Olsen, J. M. H.; Kongsted, J. Adv. Quantum Chem.; Elsevier BV, 2011; Vol. 61; pp 107–143. (25) Zhang, D. W.; Zhang, J. Z. H. J. Chem. Phys. 2003, 119, 3599. (26) S¨oderhjelm, P.; Ryde, U. J. Phys. Chem. A 2009, 113, 617–627. (27) Schwabe, T.; Olsen, J. M. H.; Sneskov, K.; Kongsted, J.; Christiansen, O. J. Chem. Theory Comput. 2011, 7, 2209–2217. (28) Beerepoot, M. T. P.; Steindal, A. H.; List, N. H.; Kongsted, J.; Olsen, J. M. H. J. Chem. Theory Comput. 2016, 12, 1684–1695.

30

ACS Paragon Plus Environment

Page 30 of 37

Page 31 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(29) Nørby, M. S.; Steinmann, C.; Olsen, J. M. H.; Li, H.; Kongsted, J. J. Chem. Theory Comput. 2016, Accepted. (30) Olsen, J. M. H.; List, N. H.; Kristensen, K.; Kongsted, J. J. Chem. Theory Comput. 2015, 11, 1832–1842. (31) Olsen, J.; Jørgensen, P. J. Chem. Phys. 1985, 82, 3235. (32) Hedeg˚ ard, E. D.; List, N. H.; Jensen, H. J. A.; Kongsted, J. J. Chem. Phys. 2013, 139, 044101. (33) List, N. H.; Olsen, J. M. H.; Kongsted, J. Phys. Chem. Chem. Phys. 2016, 18, 20234– 20250. (34) Kongsted, J.; Nielsen, C. B.; Mikkelsen, K. V.; Christiansen, O.; Ruud, K. J. Chem. Phys. 2007, 126, 034510. (35) Steinmann, C.; Olsen, J. M. H.; Kongsted, J. J. Chem. Theory Comput. 2014, 10, 981–988. (36) London, F., J. Phys. Radium 1937, 8, 397–409. (37) Wolinski, K.; Hinton, J. F.; Pulay, P. J. Am. Chem. Soc. 1990, 112, 8251–8260. (38) Helgaker, T.; Jorgensen, P. J. Chem. Phys. 1991, 95, 2595–2601. (39) He, X.; Wang, B.; Merz, K. M. J. Phys. Chem. B 2009, 113, 10380–10388. (40) Zhu, T.; He, X.; Zhang, J. Z. H. Phys. Chem. Chem. Phys. 2012, 14, 7837–7845. (41) Zhu, T.; Zhang, J. Z. H.; He, X. J. Chem. Theory Comput. 2013, 9, 2104–2114. (42) Tang, S.; Case, D. A. J. Biomol. NMR 2011, 51, 303–312.

31

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(43) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. J. Am. Chem. Soc. 1995, 117, 5179–5197. (44) Storer, J. W.; Giesen, D. J.; Cramer, C. J.; Truhlar, D. G. J. Comput. Aid Mol. Des. 1995, 9, 87–110. (45) Zhu, T.; Li, J.; Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. J. Chem. Phys. 1998, 109, 9117. (46) Swails, J.; Zhu, T.; He, X.; Case, D. A. J Biomol NMR 2015, 63, 125–139. (47) Jain, A. K.; Murty, M. N.; Flynn, P. J. ACM Comput. Serv. 1999, 31, 264–323. (48) Berkhin, P. Grouping Multidimensional Data; Springer Science + Business Media, pp 25–71. (49) Larsen, A. S.; Bratholm, L. A.; Christensen, A. S.; Channir, M.; Jensen, J. H. PeerJ 2015, 3, e1344. (50) Ulmer, T. S.; Ramirez, B. E.; Delaglio, F.; Bax, A. J. Am. Chem. Soc. 2003, 125, 9179–9191. (51) Dolinsky, T. J.; Nielsen, J. E.; McCammon, J. A.; Baker, N. A. Nucleic Acids Res. 2004, 32, W665–W667. (52) Dolinsky, T. J.; Czodrowski, P.; Li, H.; Nielsen, J. E.; Jensen, J. H.; Klebe, G.; Baker, N. A. Nucleic Acids Res. 2007, 35, W522–W525. (53) Li, H.; Robertson, A. D.; Jensen, J. H. Proteins 2005, 61, 704–721. (54) Essmann, U.; Perera, L.; Berkowitz, M. L.; Darden, T.; Lee, H.; Pedersen, L. G. J. Chem. Phys. 1995, 103, 8577.

32

ACS Paragon Plus Environment

Page 32 of 37

Page 33 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(55) Spoel, D. V. D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. E.; Berendsen, H. J. C. J. Comput. Chem. 2005, 26, 1701–1718. (56) Hess, B.; Kutzner, C.; van der Spoel, D.; Lindahl, E. J. Chem. Theory Comput. 2008, 4, 435–447. (57) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wi´orkiewicz-Kuczera, J.; Yin, D.; Karplus, M. J. Phys. Chem. B 1998, 102, 3586–3616. (58) Mackerell, A. D.; Feig, M.; Brooks, C. L. J. Comput. Chem. 2004, 25, 1400–1415. (59) Jorgensen, W. L. J. Am. Chem. Soc. 1981, 103, 335–340. (60) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. J. Chem. Phys. 1983, 79, 926–935. (61) Schr¨odinger, LLC., The PyMOL Molecular Graphics System, Version 1.8. 2015. (62) Steinmann, C.; Kongsted, J. J. Chem. Theory Comput. 2015, 11, 4283–4293. (63) Gagliardi, L.; Lindh, R.; Karlstr¨om, G. J. Chem. Phys. 2004, 121, 4494–4500. (64) Karlstr¨om, G.; Lindh, R.; Malmqvist, P.-˚ A.; Roos, B. O.; Ryde, U.; Veryazov, V.; Widmark, P.-O.; Cossi, M.; Schimmelpfennig, B.; Neogrady, P.; Seijo, L. Comput. Mater. Sci. 2003, 28, 222–239. (65) Aquilante, F.; De Vico, L.; Ferr´e, N.; Ghigo, G.; Malmqvist, P.-˚ A.; Neogr´ady, P.; Pedersen, T. B.; Pitoˇ n´ak, M.; Reiher, M.; Roos, B. O.; Serrano-Andr´es, L.; Urban, M.; Veryazov, V.; Lindh, R. J. Comput. Chem. 2010, 31, 224–247. 33

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(66) Hehre, W. J.; Ditchfield, R.; Pople, J. A. J. Chem. Phys. 1972, 56, 2257– 2261. (67) Hariharan, P. C.; Pople, J. A. Theor. Chim. Acta 1973, 28, 213–222. (68) Clark, T.; Chandrasekhar, J.; Spitznagel, G. W.; Schleyer, P. V. R. J. Comput. Chem. 1983, 4, 294–301. (69) Steinmann, C.; Ibsen, M. W.; Hansen, A. S.; Jensen, J. H. PLoS ONE 2012, 7, e44480. (70) Jensen, F. J. Chem. Theory Comput. 2014, 10, 1074–1085. (71) Jensen, F. J. Chem. Theory Comput. 2015, 11, 132–138. (72) Reid, D. M.; Kobayashi, R.; Collins, M. A. J. Chem. Theory Comput. 2014, 10, 146– 152. (73) Aidas, K.; Angeli, C.; Bak, K. L.; Bakken, V.; Bast, R.; Boman, L.; Christiansen, O.; Cimiraglia, R.; Coriani, S.; Dahle, P.; Dalskov, E. K.; Ekstr¨om, U.; Enevoldsen, T.; Eriksen, J. J.; Ettenhuber, P.; Fern´andez, B.; Ferrighi, L.; Fliegl, H.; Frediani, L.; Hald, K.; Halkier, A.; H¨attig, C.; Heiberg, H.; Helgaker, T.; Hennum, A. C.; Hettema, H.; Hjertenæs, E.; Høst, S.; Høyvik, I.-M.; Iozzi, M. F.; Jansik, B.; Jensen, H. J. A.; Jonsson, D.; Jørgensen, P.; Kauczor, J.; Kirpekar, S.; Kjærgaard, T.; Klopper, W.; Knecht, S.; Kobayashi, R.; Koch, H.; Kongsted, J.; Krapp, A.; Kristensen, K.; Ligabue, A.; Lutnæs, O. B.; Melo, J. I.; Mikkelsen, K. V.; Myhre, R. H.; Neiss, C.; Nielsen, C. B.; Norman, P.; Olsen, J.; Olsen, J. M. H.; Osted, A.; Packer, M. J.; Pawlowski, F.; Pedersen, T. B.; Provasi, P. F.; Reine, S.; Rinkevicius, Z.; Ruden, T. A.; Ruud, K.; Rybkin, V.; Salek, P.; Samson, C. C. M.; de Mer´as, A. S.; Saue, T.; Sauer, S. P. A.; Schimmelpfennig, B.; Sneskov, K.; Steindal, A. H.; Sylvester-Hvid, K. O.; Taylor, P. R.; Teale, A. M.; Tellgren, E. I.; Tew, D. P.; Thorvaldsen, A. J.; Thøgersen, L.; Vahtras, O.; Watson, M. A.; Wilson, D. J. D.; Ziolkowski, M.; ˚ Agren, H. WIREs Comput. Mol. Sci. 2013, doi: 10.1002/wcms.1172.

34

ACS Paragon Plus Environment

Page 34 of 37

Page 35 of 37

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(74) Dalton,

a

Molecular

Electronic

Structure

Program,

version

(2016.0),

see

http://daltonprogram.org/, accessed March 1st 2016. (75) The PE library developers, PElib: The Polarizable Embedding library (version 1.0.8). 2014; https://gitlab.com/pe-software/pelib-public, accessed August 29th, 2016. (76) Gao, B.; Thorvaldsen, A. J.; Ruud, K. Int. J. Quantum Chem 2011, 111, 858–872. (77) Gao, B. Gen1Int (version 0.2.1). 2012; https://gitlab.com/bingao/gen1int, accessed August 29th, 2016. (78) Senn, H. M.; Thiel, W. Angew. Chem. 2009, 48, 1198–1229. (79) V¨ogeli, B.; Kazemi, S.; G¨ untert, P.; Riek, R. Nat. Struct. Mol. Biol. 2012, 19, 1053– 1057. (80) Fadda, E.; Casida, M. E.; Salahub, D. R. J. Phys. Chem. A 2003, 107, 9924–9930. (81) Bank, B. M. R. D. Statistics Calculated for Selected Chemical Shifts from Atoms in the 20 Common Amino Acids. 2016; http://www.bmrb.wisc.edu/ref_info/statsel. htm, accessed August 29th, 2016. (82) Patil, A.; Huard, D.; Fonnesbeck, C. J. Stat. Software 2010, 35 . (83) Kabsch, W.; Sander, C. Biopolymers 1983, 22, 2577–2637. (84) Ward, J. H. J. Am. Stat. Assoc. 1963, 58, 236–244. (85) Rupp, M.; Tkatchenko, A.; M¨ uller, K.-R.; von Lilienfeld, O. A. Phys. Rev. Lett. 2012, 108, 058301. (86) Montavon, G.; Hansen, K.; Fazli, S.; Rupp, M.; Biegler, F.; Ziehe, A.; Tkatchenko, A.; Lilienfeld, A. V.; M¨ uller, K.-R. In Advances in Neural Information Processing Systems 25 ; Pereira, F., Burges, C. J. C., Bottou, L., Weinberger, K. Q., Eds.; Curran Associates, Inc., 2012; pp 440–448. 35

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(87) Rupp, M.; Ramakrishnan, R.; von Lilienfeld, O. A. J. Phys. Chem. Lett. 2015, 6, 3309–3313. (88) Neese, F. WIREs. Comput. Mol. Sci. 2011, 2, 73–78. (89) Valeev, E. F. A library for the evaluation of molecular integrals of many-body operators over Gaussian functions. 2014; http://libint.valeyev.net/, accessed August 29th, 2016. (90) Sure, R.; Grimme, S. J. Comput. Chem. 2013, 34, 1672–1685. (91) Grimme, S.; Antony, J.; Ehrlich, S.; Krieg, H. J. Chem. Phys. 2010, 132, 154104. (92) Grimme, S.; Ehrlich, S.; Goerigk, L. J. Comput. Chem. 2011, 32, 1456–1465. (93) Kruse, H.; Grimme, S. J. Chem. Phys. 2012, 136, 154101. (94) Dunning, T. H. J. Chem. Phys. 1989, 90, 1007. (95) Woon, D. E.; Dunning, T. H. J. Chem. Phys. 1993, 98, 1358.

36

ACS Paragon Plus Environment

Page 36 of 37

Page 37 of 37

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment