Improving the numerical stability of the NAST force ... - ACS Publications

Apr 19, 2019 - We describe changes to the formulation (NAST improved - NASTI) which smooth the dihedral energy term when ... View: PDF | PDF w/ Links...
0 downloads 0 Views 4MB Size
Subscriber access provided by BUFFALO STATE

Structure Prediction

Improving the numerical stability of the NAST force field for RNA simulations Nils Petersen, Thomas Ort, and Andrew E. Torda J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.9b00089 • Publication Date (Web): 19 Apr 2019 Downloaded from http://pubs.acs.org on April 25, 2019

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Improving the numerical stability of the NAST force field for RNA simulations Nils P. Petersen,∗,† Thomas Ort,‡ and Andrew E. Torda† †Centre for Bioinformatics, University of Hamburg, Bundesstr. 43, 20146 Hamburg, Germany ‡Laboratory automation and biomanufacturing engineering, Fraunhofer Institute for Manufacturing Engineering and Automation IPA, Nobelstr. 12, 70569 Stuttgart, Germany E-mail: [email protected] Phone: +49 40 428387335. Fax: +49 40 428387332 Abstract The NAST force field is a popular tool for modelling RNA and is typical of lowresolution approaches. Unfortunately, some combinations of bond and dihedral angles can reach cliffs on the energy landscape which lead to numerical disasters. We describe changes to the formulation (NAST improved - NASTI) which smooth the dihedral energy term when neighbouring angles become flat. We also improved the fit to experimental structures by replacing the harmonic term for the backbone angles with spline functions and using a more sophisticated approach to calculate energies for fragments that span both helix and loop regions. A newer, larger set of structures was used for the parameterisation. The new formulation can be run for millions of steps without a thermostat, whereas NAST routinely suffers numerical catastrophes. Simulations with NASTI showed no decrease in the quality of the structures as reflected by slightly better GDT-TS scores and, in three of the five cases, marginally better RMSD values when compared to the crystal structures.

1

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Introduction

Low-resolution force fields should allow one to visit large parts of phase space, albeit with a less detailed model. The forms used for interaction functions may, however, bring less obvious problems. Here we describe the numerical problems in one popular force field for RNA (NAST by Jonikas et al. 1 ) and how they can be repaired, with another admittedly ad hoc interaction form. Atomistic models might be the most intuitive representation of molecules, but they may be too detailed for some applications. Not every atom is of interest if one is only looking at gross structural properties and merging particles into united particles leads to fewer energy calculations. 2 One will introduce limitations, but these may be something one has to live with. Energy models will be less accurate. It will be harder to build in anisotropic interactions and removing degrees of freedom will not help the temperature-transferability of the model. The art and craft is hidden in finding interaction functions which do not do too much damage, but this has its own dangers. A form may often work, but occasionally fail disastrously. This is the problem adressed in this work. Less detail implies less accuracy, but it also implies less versatility or transferability. 3 A coarse-grain model is always some kind of average of a more detailed model, but it is parameterized so as to work where it is more important. This might mean restricting the application to a certain temperature range or chemical environment. One could even try to make a virtue of this vice and restrict the application area to a specific niche. This is how we would see the NAST model for RNA simulations. In this field, one often believes the secondary structure of a molecule is known, either from experiment 4,5 or from a prediction method. 6,7 NAST uses this information to restrain the structure in regions which are expected to form classic base-paired helices. NAST can also enforce tertiary contacts via a simple harmonic term. This leads to some requirements and restrictions. One needs a list of basepairs and one should accept that the goal is not a physically realistic simulation. Instead, one will use Newtonian dynamics as a sampling method to find conformations which are 2

ACS Paragon Plus Environment

Page 2 of 25

Page 3 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

both consistent with the secondary structure and energetically plausible. These requirements led to the form of the NAST force field. The first feature is the use of one particle per nucleotide, centered at the C3’ atom in the sugar ring. This is sufficient if one is most interested in the overall shape of the molecule. The second feature is more unusual. Sites are labeled as to whether they are within or outside base-paired/helical regions. Within helical regions, angles and dihedrals are simply driven to ideal A-helix geometry with force constants chosen by trial and error. Outside of these regions, reference values for geometry are taken from averages over ribosome structures, assuming symmetric distributions and basing force constants on an inverse-Boltzmann relation. The force field is only applicable if one has information or beliefs about base-pairs, but this is often the case with RNA. The NAST force field has been used this way with group I introns, 8 tRNA, 9 bistable sequences, 10 RNA junctions, 11 trans activation response regions, 12 the Panicum mosaic virus 13 and parts of the tobacco mosaic virus. 14 Weinreb et al. used NAST to implement a pipeline to predict RNA 3D structures using contacts from evolutionary couplings. 15 The NAST force field may perfectly address one application domain, but there may be a problem waiting to wreak numerical havoc on one’s simulations. A force field for Newtonian dynamics should be continuous and differentiable, but there is a less obvious limitation. The rate of change of energy with respect to coordinates cannot be too large. If it is, the linear approximation used by a first-order integrator is not valid and one suffers numerical errors or catastrophes. 16 The NAST force field usually operates within this limitation, but there are conformations, easily reached in normal use, which cause the system to explode. Our operational definition of explosion will be when the temperature changes by orders of magnitude over a handful of time steps. We are not the first to see this problem. The distributed version of the code takes measures to catch an explosion and re-assign reasonable velocities to all particles. NAST’s problem stems from the dihedral angle term and a soft term for neighboring 3

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

angles. The dihedral term has a conventional form, Eφ = k cos(φ − φ0 ) in which φ is the dihedral angle and φ0 a reference value. If the adjacent bond angle were 0 or π rad the dihedral would be undefined. This may not happen, but if the system approaches this point, the energy landscape can be so steep that it will never be handled by a simple integrator (Fig. 1 A). A tiny movement in Cartesian coordinates can lead to a massive change in energy. This is not pessimistic doomsaying. The problem occurs in normal simulations, but is obscured in the original code by resetting velocities and very tight coupling to a temperature bath. We would suggest it is more elegant to remove the problem. We propose changes to the interaction functions without changing any of the core features. Given that we make improvements to the original force field, we will refer to the modified version as NASTI (NAST improved). The main difference compared to NAST is the functional form of the dihedral energy term. In NASTI, the energy associated with each dihedral angle is a function of the torsion angle, but also of the two included bond angles as shown in Figure 1 B. The torsional term has its normal periodic form, but as either of the included bond angles approaches 0 or π the energy function becomes flatter, avoiding the large energy jumps observed in NAST. This work also provided the opportunity to update the form and parameterization of some other terms such as the treatment of helix regions and energy terms involving particles in loops and helices. All force constants are fitted to distributions from crystallographic structures instead of enforcing ideal helix geometry. Energy terms involving particles from both loop and helix regions are parameterized differently, based on the base pairing pattern of the particles. We also replaced the harmonic energy terms on the backbone angles with spline functions to account for skew in the angle distributions of crystallographic structures. For evaluation, we generated structure ensembles using both force fields and computed root mean square differences (RMSD) and global distance test (GDT-TS) values 17 with respect to the crystal structures. We also compared the distributions of the geometries measured in the ensembles to those found in ribosomal subunits. Two requirements were set for NASTI. First, it must reproduce distributions of 4

ACS Paragon Plus Environment

Page 4 of 25

Page 5 of 25

B

undefined

E( )

A

0

0

0+

0

0

0+

E( , , )

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 1: Relative energies of stable and unstable dihedral angles in (A) NAST and (B) NASTI. The center row shows four arrangements of atoms. The top row are the corresponding NAST energies as a function of the dihedral angle φ. If the adjacent angle passes π rad, φ instantly changes it’s value by π rad. In the most extreme case the energy jumps from the minimum to the maximum of this energy term, as indicated by the star in the far right plot. The bottom row shows the same energy dependence in NASTI. structural properties as well as the original version. Second, simulations should be so close to conserving energy, that they run without problems with minimal influence from temperature coupling.

2 2.1

Methods Datasets

Three sets of RNA structures were used in the parameterization and evaluation. For parameterization of energy terms, 40 structures, given in supplementary Table S1, were taken from 5

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 25

the list of non-redundant RNA-structures by Leontis and Zirbel. 18 We used only structures with a chain length of more than 150 nucleotides. Global structural alignments were computed for all pairs using FRIEs. 19 If the percentage of structural identity 20 within a pair was greater than 0.7, the molecule with the shorter chain was removed. Five structures given by Weinreb et al. 15 were used for evaluation simulations. Base pairs and contacts were determined using RNAView. 21 Contacts along the Watson-Crick edges of the nucleotides were used for the secondary structure input and pseudoknots smaller than three nucleotides were removed. For each structure a small list of additional tertiary contacts was used. A contact was added if it was amongst the strongest N predictions (for chain length N ) by Weinreb et al. 15 and not already in the list of secondary structure restraints. NASTI was parameterized on a larger set of structures than NAST, so it might be unfair to compare simulation snapshots against the larger parameterization set. To avoid this problem, distributions of distances, angles and torsion angles in simulations were evaluated by comparison against the three ribosomal subunits used by Jonikas et al. 1

2.2

Energy function

NASTI uses the same coarse grained representation as its predecessor with one pseudoparticle per nucleotide centered at the C3’ atom. As in NAST, each particle is labeled as to whether it is part of a base pair or a non-paired loop particle. The total energy of the system is

Etotal = Enon_bonded + Ebonds + Eangles + Edihedrals ,

(1)

where Enon_bonded is the energy due to non-bonded particles and Ebonds , Eangles , Edihedrals are energy terms due to bond distances, bond angles and dihedrals respectively. For Enon_bonded NASTI uses exactly the same functional form and parameters as NAST. The bonded energy terms stem from every backbone bond distance, angle and torsion angle. Specific helix geometry terms act on sites within base pairs. If particle i is paired with particle j, there are

6

ACS Paragon Plus Environment

Page 7 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

four additional restraints. One force restrains the distance rhlx between the two particles and another one the angle θhlx affecting the triplet (i + 1, i, j). There are two different dihedral restraints: φhlx for particles (i+1, i, j, j −1) and φknight for (i−1, i, i+1, j −1). Topologically these terms have the same form as in NAST, but some of the functional forms were altered. We first describe the parameterization and then the functional forms. NASTI applies different energy contributions to the particles in helix, loop and border regions. In the following we use a nomenclature of 0 for a non-paired nucleotide and 1 for a base-paired site. The sequence of labels gives each geometry a signature. A backbone angle for example involves three particles and thus there are eight possible different signatures. The signature 011 would describe an angle between three particles of which the first is not paired while the other two are. We collected all values for each combination of geometry and signature from the Leontis and Zirbel set. We then fitted the parameters of the corresponding energy term to the data using the Boltzmann relation

E(x) = −RT ln P (x),

(2)

where E is a pseudo-energy contribution, R and T are the gas constant and temperature and P (x) the probability of geometry x. Bond distances are restrained between neighboring particles along the backbone and between the two nucleotides that form a base pair. Normal distributions were fitted to the observed distributions. This directly leads to a harmonic energy term Ebond (ri ) = kibond (ri − ri0 )2

(3)

with distance ri , distance optimum ri0 and force constant kibond for bond i. The form of the energy term is thus the same as in NAST while the parameters are different. Before fitting, outlier distances were removed. If Q1 and Q3 are the first and third quartiles, and we define the interquartile range δQ = Q3 −Q1 , we removed instances smaller than Q1 −1.5δQ or larger 7

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 25

than Q3 + 1.5δQ . The energy contribution from bond angles was given by

Eangles = Ebackbone_angles + Ehelix_angles

(4)

Ehelix_angles accounts for the angles between base pairs and the backbone in helices. They were parameterized by fitting normal distributions, again yielding a harmonic potential. Ebackbone_angles were parameterized with a more sophisticated approach using splines. First a histogram was built using bin width w following Scott: 22 1

(5)

w = 3.49σN − 3 ,

where σ is the standard deviation and N the number of samples. The probability density from this histogram was converted to pseudo-energies using equation 2. Finally, quadratic splines were fit to the energies to account for the asymmetry in the distributions. We used an implementation of a B-Spline fitting algorithm by Dierckx 23 provided in scipy (http://www.scipy.org). The energy terms in our force field implementation have the form

Ebackbone_angle (θi ) =

Nintervals X

sj (θi )

(6)

j=1

with

sj (θi ) =

   ai,j θ2 + bi,j θ + ci,j , if ti,j ≤ θ < ti,j+1

(7)

otherwise

  0,

where ti,j is the boundary of the interval j and ai,j , bi,j and ci,j are the interval specific coefficients for angle θi . Dihedral angle distributions were fit to von-Mises distributions, 24 giving a pseudo-energy

8

ACS Paragon Plus Environment

Page 9 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

term: Eperiodic (φi ) = kidih cos(φi − φ0i )

(8)

where kidih is the force constant and φ0i the reference dihedral for dihedral φi . To avoid the problems with sudden energy jumps, this was coupled to the angle term    0.5 cos[π( θ − 1)] + 0.5, if θ < b  b    f (θ) = 1, if b ≤ θ ≤ π − b       0.5 cos[π( θ−π + 1)] + 0.5, if θ > π − b b

(9)

where θ is the angle and b is an arbitrarily set parameter that controls the range in which a changing angle changes the energy. The value was chosen empirically to be small while still giving numerically stable simulations. We set b to 0.1 rad for all dihedrals. Finally, we combined equation 8 and 9 to get Edih (φi , αi , βi ) = f (αi )f (βi )(Eperiodic (φi ) − 2kidih ) + 2kidih

(10)

where αi and βi are the neighbor angles to dihedral φi .

2.3

Initial random structures

Initial random structures were generated following Chen et al. 8 Starting from a random position in the sequence, particles are added successively at both ends at random coordinates but with a correct bond length and avoiding collisions. After each particle addition 5 × 104 steps of simulation at 300 K were used to relax the system.

9

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

2.4

Simulations and decoy sampling

All simulations were done twice, once with NAST and once with NASTI. Using different initial random number seeds, we ran 50 simulations without a thermostat (NVE) and 100 simulations with an Anderson thermostat (NVT) for each structure in the test set in both force fields. The collision frequency was set to 1 ps−1 for equilibration and to 1 ps−1 and 0.01 ps−1 during production in NAST and NASTI respectively. Each simulation started with a different initial random structure. After equilibration of 104 steps, each case was simulated for 106 steps with a time step of 0.005 ps and a snapshot saved every 103 steps. NAST simulations were run using the most recent version (release 1.0 from https://simtk.org/projects/nast).

2.5

Structure quality evaluation

RMSD and GDT-TS values were computed for every snapshot with respect to the PDBstructure. The RMSD between two sets of C3’ atoms measures the average error between the coordinates in angstroms. GDT-TS scores were originally popularised for protein comparisons 17 where they are the average of the fraction of Cα atom pairs superimposable to within 1, 2, 4 and 8 Å. This gives a convenient score between 1 (very similar structures) and 0 (very different). They were adapted for RNA using C3’ backbone atoms.

2.6

Bootstrap percentile confidence interval and permutation test

The significance of differences in RMSD and GDT-TS scores was evaluated using bootstrap confidence intervals and permutation tests. 25 This kind of test assumes independence of measurements, but this is not the case for measurements within a trajectory. To minimize the effect of correlations, we took the average of all RMSD or GDT-TS scores for snapshots from the same trajectory. This led to 100 mean values from 100 simulations for both RMSD and GDT-TS. Working with average values from the NAST force field could be problematic, since an unstable trajectory led to samples with outrageous energy values. This meant

10

ACS Paragon Plus Environment

Page 10 of 25

Page 11 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

outliers had to be removed. If δQ for each run was the interquartile energy range from the sample distribution, then points were removed if E > Q3 + 1.5δQ where Q3 is the third quartile. In the following a sample refers to a collection of per simulation mean RMSD or GDT-TS scores. For the 99 % bootstrap percentile confidence interval, we took 50 000 resamples with replacement from the original sample of per simulation averages. Each resample had the same size as the original. We took the average of each resample. From the averages we took the 0.5th and 99.5th percentile as lower and upper limits of the 99 % bootstrap percentile confidence interval. We used a two sided permutation test against the null hypothesis that the means in the samples from NAST and NASTI are equal. A joint sample came from merging the outcomes of NAST and NASTI, again using per simulation averages. We sampled 50 000 permutations of this joint distribution. Each permutation was divided into two groups with the sizes of the original samples and the difference of the means of these subsamples was computed. The collection of these differences is referred to as the permutation distribution. Mean and standard deviations as parameters of a t-distribution were used to compute p-values for the observed differences.

2.7

Histogram distance

Histogram distances were computed following Cao and Petzold 26 to compare sampled distributions of bonds, angles and dihedrals with the respective distributions from the PDB reference data. The bin width was calculated using equation 5 with the statistics of the reference data.

2.8

Implementation

NASTI was implemented using the OpenMM library. 27 The new energy terms for dihedrals and angles were added to OpenMM as a C++ plugin. The main routines of NASTI are 11

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

written in python calling OpenMM functions through the provided interface.

2.9

Availability

NASTI is available at https://gitlab.com/nilspetersen/nasti.

3

Results

All simulations were run on the five RNA structures used by Weinreb et al. 15

3.1

Simulation stability

All simulations began by discarding 104 steps for equilibration, initially coupled to an Andersen thermostat at 300 K with collision frequency 1 ps−1 , followed by production simulations, each of 5 ns, with sampling. These were not coupled to a thermostat. Figure 2 shows the most important result of this work very clearly. The original NAST force field frequently leads to a numerical catastrophe. Simulations with NASTI are completely stable, each near 300 K as shown by the inset.

3.2

Sampling structure ensembles

Given that NASTI seems to be much more stable than NAST, it remains to be shown that its sampling behavior is as good as its forefather. Trajectories were calculated with a weakly coupled temperature bath at 300 K. 105 snapshots were saved for each member of the test set, for both force fields. These were used to calculate distributions of bond lengths, angles and dihedral angles and then for comparisons of snapshots with the reference structures from the protein data bank. 28 The first calculations reflect local geometry. The original NAST force field was parameterized on a set of ribosome structures, but NASTI was based on a larger set. It would be unfair to test NAST on this set, so we restrict the first comparisons to the ribosomal 12

ACS Paragon Plus Environment

Page 12 of 25

Page 13 of 25

}

6 × 102

10

Kinetic Energy [K]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

32

1028

4 × 102

1024

3 × 102

NAST

1020 10

2 × 102

16

1012 108 104 10

NASTI

}

0

0

1000

2000

3000

4000

5000

Time [ps] Figure 2: Kinetic energy in temperature units from trajectories without coupling to a thermostat. Labels show whether NAST or NASTI was used. The inset is zoomed to show the slight difference between stable simulations. Colors are arbitrary and only serve to separate calculations. There are 250 simulations for both force fields. structures of Jonikas et al. 1 The distributions for each structural property corresponding to an energy term were collected over all sampled conformations from all test structures and compared by a histogram distance. Table 1 shows that in all but one of the cases, the distributions sampled with NASTI are closer to the reference than those from NAST. The most striking observation concerns an angle restraining the helix geometry (Fig. 3 A). The values produced by NAST have a mean of 1.3 rad and standard deviation σ = 0.2 rad, differing significantly from those found in the helices of crystal structures, where the mean is 1.1 rad and σ = 0.1 rad. The corresponding distribution sampled with NASTI reproduces the reference distribution more closely with a mean of 1.1 rad and σ = 0.1 rad. Using better 13

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

A

5

i+1 j

Probability

i

Ribosome NAST structures

B

4

3

NAST

2

1

0 0.0

0.5

1.0

1.5

5

2.0

2.5

3.0

Ribosome NASTI structures

4

Probability

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 25

3

NASTI

2

1

0 0.0

0.5

1.0 11 hlx

1.5

2.0

2.5

3.0

Angle [rad]

Figure 3: (A) Comparison of inner helix angle (shown left) from ribosome reference structures (blue) with values from simulation snapshots (red) from NAST (top center) and NASTI (bottom center). (B) Helices from structures sampled with NAST (pink, top right) and NASTI (green, bottom right) superimposed on the target structure (red). parameters for this angle term allows NASTI to sample helices which are closer to those found in crystal structures (Fig. 3 B). We also found that dihedrals in NASTI are less restrained and visit a wider range than in NAST. This is especially evident for those which include both helix and non-helix particles. Figures S1 and S2 illustrate the sampled distributions of which those from NAST have significantly smaller variances compared to those from NASTI. The last comparison was between calculated coordinates and the structures in the protein data bank. RMSD and GDT-TS scores were computed with respect to the PDB-structure for each sample. Figure 4 shows box plots of the metrics for all five structures with median and 14

ACS Paragon Plus Environment

Page 15 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Table 1: Histogram distances between sampled distributions and those collected from crystal structures. The smaller value in each line is printed bold. The abbreviations bb and hlx refer to values computed on the backbone or helices, including particles on both sides of the helix. rbb

rhlx θbb

θhlx φbb

φhlx φknight

00 01 10 11 11 000 001 010 011 100 101 110 111 10 11 0000 0001 0010 0011 0101 0110 0111 1000 1001 1011 1100 1101 1110 1111 10 11 111

NAST NASTI 0.02 0.03 0.03 0.03 0.04 0.03 0.06 0.03 0.03 0.01 0.09 0.07 0.18 0.05 0.17 0.06 0.09 0.06 0.09 0.07 0.20 0.05 0.11 0.04 0.12 0.06 0.26 0.17 0.45 0.07 0.08 0.02 0.08 0.04 0.19 0.05 0.15 0.06 0.14 0.04 0.16 0.05 0.15 0.04 0.07 0.04 0.14 0.05 0.14 0.06 0.17 0.04 0.17 0.07 0.15 0.05 0.11 0.03 0.17 0.03 0.07 0.02 0.07 0.02

15

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 25

quartiles. Judged by RMSD values, the NASTI samples are closer to the PDB structures in three of five cases. For two riboswitch structures (1y26 and 2gdi), the original NAST force field produced structures closer to the reference coordinates. On the other hand, the GDT-TS scores were always better from NASTI. Bootstrapping was used to evaluate whether these observations are significant or could be due to chance. We computed 99 % bootstrap confidence intervals and performed permutation tests on the sampled means of per simulation averages of RMSD and GDT-TS. Table 2 shows the means, bootstrap confidence intervals and permutation test p-values for the 5 test cases. The results clearly support the previous observations. The means have similar behaviour to the medians and quartiles. The 99 % confidence intervals do not overlap and the highest p-value from a permutation test is ∼ 0.5 %. Table 2: Comparison of NAST and NASTI snapshots with PDB structures. Mean RMSD and GDT-TS of simulation averages with 99 % confidence interval and permutationtest p-value. The means of the more correct results are printed bold. RMSD Mean (CI) in angstroms NAST

NASTI

1fir 15.1 (14.6, 15.6) 9.4 (9.1, 9.9) 1y26 11.1 (10.8, 11.4) 11.5 (11.4, 11.6) 2gdi 13.4 (13.2, 13.7) 13.9 (13.8, 14.0) 3q3z 12.3 (11.9, 12.7) 10.7 (10.1, 11.4) 4lvv 12.3 (11.8, 12.7) 10.2 (9.6, 11.0)

P -Value 0 −03

5.1 × 10 5.8 × 10−04 8.2 × 10−07 1.7 × 10−09

GDT-TS Mean (CI) NAST 1fir 1y26 2gdi 3q3z 4lvv

0.19 0.22 0.15 0.17 0.19

(0.18, (0.22, (0.14, (0.16, (0.18,

NASTI 0.20) 0.23) 0.15) 0.18) 0.20)

0.30 0.26 0.16 0.23 0.27

16

(0.29, (0.26, (0.16, (0.22, (0.26,

P -Value

0.30) 0 0.27) 2.2 × 10−16 0.17) 3.3 × 10−10 0.24) 0 0.28) 0

ACS Paragon Plus Environment

Page 17 of 25

30

NAST NASTI

RMSD (Å)

25 20 15 10 5 0

1fir

1.0

1y26 2gdi

3q3z

4lvv

3q3z

4lvv

NAST NASTI

0.8 GDT-TS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

0.6 0.4 0.2 0.0

1fir

1y26 2gdi

Figure 4: RMSD (left) and GDT-TS (right) for structure ensembles sampled with NAST and NASTI. Unstable NAST simulations occasionally led to very high RMSD values outside of the range of the plots and are not shown.

4

Discussion

The most important results can be quickly summarized. The new force field produces similar results to its forefather, but without numerical instabilities and with excellent energy conservation. The NASTI force field may be just as ad hoc as its predecessor, but it can be used for sampling in a better defined micro-canonical ensemble since strong coupling to the heat bath is not necessary. As well as being intellectually appealing, this has practical consequences. Firstly, the original code for restarts and resets is not necessary. Secondly, the analysis of trajectories is much simpler. One no longer has to look for outliers. The method 17

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

can now be built into the kind of pipeline constructed by Weinreb et al. 15 There is a less obvious practical consequence. The model can be used on larger systems. Since the number of dihedral angles scales with the size of a molecule, the probability of a numerical explosion also scaled with the size of a molecule in the original NAST formulation. Continuing in this vein, NASTI is better suited to replica exchange methods than its parent. Jonikas et al. 1 noted that the quality of sampled structures may be limited by the model quality more than by sampling efficiency. Now, at least, one can sample reliably. Rather than basking in self-accredited pride, it is more helpful to look at weaknesses in the current work. The formulations used in NASTI have little physical basis. A critic would say they are simply a mechanism to persuade the geometry of RNA molecules to conform to the statistics of known structures. A harsher critic would add all the problems of knowledgebased force fields and ask whether it is correct to use a free-energy approximation as if it were a potential energy. 29 One can always defend this kind of force field by conceding that it is a pragmatic scoring function that will be used for conformational searching without any claims to statistical mechanical rigor. We would also admit that its functional forms are not works of beauty. The problems are best seen by comparison with a conventional atomistic force field. In an atomistic force field, bond angles are extremely rigid and often the highest frequency degrees of freedom. If bond angles do not vary much, than there are no numerical problems with dihedral angles. In a force field such as NAST, the angles represent the geometry between sites which are not really bonded to each other. If one wants to match the distributions from crystallographic structures, these angles have to be rather flexible. This led to the instability in the original dihedral angle terms. Our solution uses a cross term involving angles and torsion angles and it does reproduce crystallographic geometry rather well. A more elegant solution may have been to construct a single term simultaneously parameterized on angles and torsion angles. If one accepts that the goal is just to sample expected geometries, than the NASTI force 18

ACS Paragon Plus Environment

Page 18 of 25

Page 19 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

field is a real improvement over NAST. Although the new force field was parameterized on more structures, it reproduces the geometry distributions of the original ribosomes used for parameterizing NAST as well or better than NAST. For most properties, the differences are very small, but some observations can be made. From Table 1, it seems that there are small improvements in both the backbone geometry (bond lengths, angles and torsion angles) as well as helix geometry. We see two possible reasons. First, NASTI energies distinguish between cases where all particles are within a loop or within a helix and specific cases where some particles are part of a base pair and others are not. The effect of this is especially clear for dihedrals in helix border regions. In contrast, in NAST, if two or more base pairs are part of a helix, the dihedral along the backbone is restrained to match helix geometry, resulting in distributions with a smaller spread compared to the reference distributions. NASTI’s parameterization is slightly more complicated, but the resulting distributions are closer to the reference values. Second, NASTI terms for backbone angles allow for skew in the distributions which is necessary to better reproduce the observed angle distributions. The most striking difference between NAST and NASTI was observed for the angle given by the orientation of a base pair to the backbone. The distribution sampled with NAST differs significantly from that found in the reference structures. This seems to simply be a result of a non-optimal parameter choice. In the distributed code and supplementary material, the corresponding ideal value is 1.5 rad. In the main paper it is referred to as 1.76 rad. Neither value is close to the mean (1.08 rad) seen in crystallographic structures. Perhaps local geometry is not of interest if one is looking at the ovarall folding of a molecule. RMSD and GDT-TS scores, comparing models to data bank structures, provide a better overview. The GDT-TS scores from both force fields seem low, but this reflects the measures sensitivity to coordinate differences larger than 1 Å. In these flexible force fields, C3’ are not accurately placed. The simulation results do show differences between RMSD and GDT-TS. On three of five tests, NASTI samples had a lower RMSD than the 19

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

equivalents from NAST. This is obviously not significant. Considering GDT-TS, NASTI was always slightly better. We would not claim that this represents any interesting improvement, although the resampling tests suggest that it is a real effect and not noise. What one can see is a difference between the quality measures. RMSD values can be skewed by outliers that can reflect mobile parts of molecules. Because of the way the GDT-TS score is calculated, it is more likely to reward good local geometry. An evangelist could probably find more examples that are favored by either NAST or NASTI. The rigidity of the NAST backbone will be advantageous when the correct structure of a fragment with unpaired bases really is close to helix geometry. NASTI on the other hand is more likely to successfully reproduce loop structures that differ significantly from that found in helices. NASTI is also better at reproducing helices that are similar to those found in crystal structures. It is more fruitful to consider where future improvements could be made within the NAST-style framework. We see a case for restraints to enforce coaxial stacking or specific loop motifs, 30 although this would have to be balanced against the extra complication. We should also note that NASTI is just an improvement and not a wholesale reconstruction. For example, there were no changes to the non-bonded parameters. The problems addressed here are of wider interest than RNA structure. A coarse-grain force field should not be judged solely on its energies or structures. The shape of the energy landscape can have serious practical consequences. In the specific area of RNA structure calculations, NASTI and its forefather only address a specific application. How can one quickly sample plausible conformations allowed by predicted or measured base pairing? This niche is different to the problems addressed by more general force fields, where the emphasis lies more in the direction of de novo structure prediction. 31–33 This area is more ambitious, but requires more interaction sites, more computer time and is a step towards atomistic simulations. For the moment, the NAST approach remains an area of interest. 15 We would suggest that newer model building pipelines should 20

ACS Paragon Plus Environment

Page 20 of 25

Page 21 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

be built with NASTI in mind. The structures will be at least as good, but generating the trajectories is numerically less traumatic.

Supporting Information Available The following file is freely available at http://pubs.acs.org. • supplement.pdf: Geometry distributions, the parameterization set and the force field parameters.

References (1) Jonikas, M. A.; Radmer, R. J.; Laederach, A.; Das, R.; Pearlman, S.; Herschlag, D.; Altman, R. B. Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA 2009, 15, 189–199. (2) Kmiecik, S.; Gront, D.; Kolinski, M.; Wieteska, L.; Dawid, A. E.; Kolinski, A. Coarsegrained protein models and their applications. Chem. Rev. 2016, 116, 7898–7936. (3) Kar, P.; Feig, M. Recent advances in transferable coarse-grained modeling of proteins. Adv. Protein Chem. Struct. Biol. 2014, 96, 143–180. (4) Merino, E. J.; Wilkinson, K. A.; Coughlan, J. L.; Weeks, K. M. RNA structure analysis at single nucleotide resolution by Selective 2’-Hydroxyl Acylation and Primer Extension (SHAPE). J. Am. Chem. Soc. 2005, 127, 4223–4231. (5) Mortimer, S. A.; Weeks, K. M. A fast-acting reagent for accurate analysis of RNA secondary and tertiary structure by SHAPE chemistry. J. Am. Chem. Soc. 2007, 129, 4144–4145. (6) Mathews, D. H.; Turner, D. H. Prediction of RNA secondary structure by free energy minimization. Curr. Opin. Struct. Biol. 2006, 16, 270–278. 21

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(7) Rivas, E. The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective. RNA Biol. 2013, 10, 1185–1196. (8) Chen, C.; Mitra, S.; Jonikas, M.; Martin, J.; Brenowitz, M.; Laederach, A. Understanding the role of three-dimensional topology in determining the folding intermediates of group I introns. Biophys. J. 2013, 104, 1326–1337. (9) Sen, A.; Thakur, S.; Bothra, A. K.; Sur, S.; Tisa, L. S. Identification of TTA codon containing genes in Frankia and exploration of the role of tRNA in regulating these genes. Arch. Microbiol. 2012, 194, 35–45. (10) Fürtig, B.; Wenter, P.; Pitsch, S.; Schwalbe, H. Probing mechanism and transition state of RNA refolding. ACS Chem. Biol. 2010, 5, 753–765. (11) Jung, S.; Schlick, T. Candidate RNA structures for domain 3 of the foot-and-mouthdisease virus internal ribosome entry site. Nucleic Acids Res. 2013, 41, 1483–1495. (12) Mitrasinovic, P. M.; Tomar, J. S.; Nair, M. S.; Barthwal, R. Modeling of HIV-1 TAR RNA-ligand complexes. Med. Chem. 2011, 7, 301–308. (13) Wang, Z.; Parisien, M.; Scheets, K.; Miller, W. A. The cap-binding translation initiation factor, eIF4E, binds a pseudoknot in a viral cap-independent translation element. Structure 2011, 19, 868–880. (14) Zeng, Y.; Larson, S. B.; Heitsch, C. E.; McPherson, A.; Harvey, S. C. A model for the structure of satellite tobacco mosaic virus. J. Struct. Biol. 2012, 180, 110–116. (15) Weinreb, C.; Riesselman, A. J.; Ingraham, J. B.; Gross, T.; Sander, C.; Marks, D. S. 3D RNA and functional interactions from evolutionary couplings. Cell 2016, 165, 963–975. (16) Winger, M.; Trzesniak, D.; Baron, R.; van Gunsteren, W. F. On using a too large integration time step in molecular dynamics simulations of coarse-grained molecular models. Phys. Chem. Chem. Phys. 2009, 11, 1934–1941. 22

ACS Paragon Plus Environment

Page 22 of 25

Page 23 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(17) Zemla, A.; Venclovas, C.; Moult, J.; Fidelis, K. Processing and analysis of CASP3 protein structure predictions. Proteins 1999, S3, 22–29. (18) Leontis, N. B.; Zirbel, C. L. In RNA 3D structure analysis and prediction; Leontis, N. B., Westhof, E., Eds.; Springer: Berlin, Heidelberg, 2012; pp 281–298. (19) Wiegels, T.; Bienert, S.; Torda, A. E. Fast alignment and comparison of RNA structures. Bioinformatics 2013, 29, 588–596. (20) Capriotti, E.; Marti-Renom, M. A. RNA structure alignment by a unit-vector approach. Bioinformatics 2008, 24, i112–i118. (21) Yang, H.; Jossinet, F.; Leontis, N.; Chen, L.; Westbrook, J.; Berman, H.; Westhof, E. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. 2003, 31, 3450–3460. (22) Scott, D. W. On optimal and data-based histograms. Biometrika 1979, 66, 605–610. (23) Dierckx, P. An algorithm for smoothing, differentiation and integration of experimental data using spline functions. J. Comput. Appl. Math. 1975, 1, 165–184. (24) Mardia, K. V.; Jupp, P. E. Directional Statistics; John Wiley & Sons: Chichester, 2008. (25) Moore, D.; McCable, G.; Craig, B. Introduction to the Practice of Statistics, 6th ed.; W. H. Freeman and Company: New York, 2009; Chapter 16, pp 1–59. (26) Cao, Y.; Petzold, L. Accuracy limitations and the measurement of errors in the stochastic simulation of chemically reacting systems. J. Comput. Phys. 2006, 212, 6–24. (27) Eastman, P.; Swails, J.; Chodera, J. D.; McGibbon, R. T.; Zhao, Y.; Beauchamp, K. A.; Wang, L. P.; Simmonett, A. C.; Harrigan, M. P.; Stern, C. D.; Wiewiora, R. P.; Brooks, B. R.; Pande, V. S. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comput. Biol. 2017, 13, e1005659. 23

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(28) Berman, H.; Henrick, K.; Nakamura, H. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. 2003, 10, 980. (29) Thomas, P. D.; Dill, K. A. Statistical potentials extracted from protein structures: How accurate are they? J. Mol. Biol. 1996, 257, 457–469. (30) Bayrak, C. S.; Kim, N.; Schlick, T. Using sequence signatures and kink-turn motifs in knowledge-based statistical potentials for RNA structure prediction. Nucleic Acids Res. 2017, 45, 5414–5422. (31) Boniecki, M. J.; Lach, G.; Dawson, W. K.; Tomala, K.; Lukasz, P.; Soltysinski, T.; Rother, K. M.; Bujnicki, J. M. SimRNA: A coarse-grained method for RNA folding simulations and 3D structure prediction. Nucleic Acids Res. 2016, 44, e63. (32) Poblete, S.; Bottaro, S.; Bussi, G. A nucleobase-centered coarse-grained representation for structure prediction of RNA motifs. Nucleic Acids Res. 2018, 46, 1674–1683. (33) Bernauer, J.; Huang, X.; Sim, A. Y.; Levitt, M. Fully differentiable coarse-grained and all-atom knowledge-based potentials for RNA structure evaluation. RNA 2011, 17, 1066–1075.

24

ACS Paragon Plus Environment

Page 24 of 25

Page 25 of 25

Graphical TOC Entry Kinetic Energy [K]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

104

102

NAST NAST improved 100 0

2000

4000

Time [ps]

25

ACS Paragon Plus Environment