IsRNA: An Iterative Simulated Reference State Approach to Modeling

Publication Date (Web): March 2, 2018 ... Through iterative molecular dynamics simulations, we build the correlation effects into the reference states...
0 downloads 4 Views 2MB Size
Article Cite This: J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

pubs.acs.org/JCTC

IsRNA: An Iterative Simulated Reference State Approach to Modeling Correlated Interactions in RNA Folding Dong Zhang and Shi-Jie Chen* Department of Physics, Department of Biochemistry, and MU Informatics Institute, University of Missouri, Columbia, Missouri 65211, United States S Supporting Information *

ABSTRACT: Coarse-grained RNA folding models promise great potential for RNA structure prediction. A key component in a coarse-grained folding model is the force field. One of the challenges in the coarse-grained force field calculation is how to treat the correlation between the different degrees of freedoms. Here, we describe a new approach (IsRNA) to extract the correlated energy functions from the known structures. Through iterative molecular dynamics simulations, we build the correlation effects into the reference states, from which we extract the energy functions. The validity of IsRNA is supported by the close agreement between the simulated Boltzmann-like probability distributions for all the structure parameters and those observed from the experimentally determined structures. The correlated energy functions derived here may provide a new tool for RNA 3D structure prediction. with physics-based atomic force fields, such as AMBER9−12 and CHARMM,13−15 can give detailed structure changes and folding dynamics. However, the astronomically large conformational space makes all-atom exhaustive conformational sampling computationally not feasible. As a result, computer simulations are mostly useful for small RNA molecules, such as RNAs up to 40 nucleotides (nt),16−19 while simulations for medium or large size RNAs are often hindered by the limited coverage of the conformational space. To overcome the limitations due to the large conformational space, we resort to a coarse-grained model by representing a group of atoms using a pseudobead and smoothing the rough free energy landscape of the all-atom system through the coarse-grained representation. The key ingredient of a coarsegrained model is a reliable force field that can capture the essential physical features of the energy functions while maintaining a sufficiently simple description of the conformational system for efficient sampling. There are two commonly used methods to construct a force field for a coarse-grained model:31 physics-based approach and

1. INTRODUCTION Ribonucleic acid (RNA) functions depend on their threedimensional (3D) structures and dynamic behaviors,1,2 which often involve distinct biologically active conformations.3,4 Therefore, understanding RNA biological functions requires a detailed knowledge of RNA 3D structures and thermodynamic stabilities. Experimental methods, such as X-ray crystallography and NMR spectroscopy, can determine RNA 3D structures at high resolution. However, the experiments are often laborious and time-consuming and unable to provide sufficient information about folding stability and conformational changes. In the past decade, inspired by the increasing discoveries of new RNA functions, a number of highly promising computational methods, such as FARNA/FARFAR,5 MC-Fold/MCSym,6 Vfold,7 DMD,23 and RNAComposer,8 have been developed to predict RNA 3D structures. These theoretical approaches employed a variety of approaches, such as bioinformatic algorithms, all-atom computer simulations, and coarse-grained (CG) models. However, these methods are, to a certain extend, limited by the availability of the known structure templates/fragments. Furthermore, most structure prediction models do not treat conformational changes or folding dynamics. In contrast, Molecular Dynamics (MD) simulations © XXXX American Chemical Society

Received: December 6, 2017 Published: March 2, 2018 A

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

approaches for protein folding,32−36 one of the main challenges in the knowledge-based approach to RNA folding is how to handle the reference state, which accounts for the expected probability distribution in the free state with potential energy E(x) = 0. For a given structure parameter x, such as a bond angle θ, torsion angle ϕ, or pairwise distance r, the way the reference state is treated would directly influence the resultant potential energy through the Inverse Boltzmann law. An accurate energy function and reference state should correctly account for the chain connectivity, excluded volume, and the interdependence/correlation between different structure parameters. These features, however, cannot be fully captured in most reference state models, such as the noninteracting ideal gas model and the quasi-chemical model.26−29 Furthermore, most existing approaches for the parametrization of energy functions are based only on the native structures. Therefore, the extracted energy functions (force field) are strongly biased toward the native structures and cannot account for the nonnative interactions and the resultant intermediate/misfolded structures. In the present study, based on a four/five-bead coarsegrained representation of RNA nucleotides, we develop an iterative simulated RNA reference state method (IsRNA) for the extraction of the knowledge-based RNA coarse-grained force field. The IsRNA method has two major advantages. First, through MD-based iterative simulated construction of the reference states, the energy terms for the different structure parameters are added stepwise to the total force field. This stepby-step procedure allows us to account for correlations between different structure parameters as well as the effects from chain connectivity and excluded volume. In each iteration step, the energy functions extracted in the previous steps are used to generate the reference state for the next structure parameter. The iterative process continues until the simulated probability distributions for all the structure parameters agree with those derived from the PDB structures. Second, in each iteration step in IsRNA, the MD simulations sample both native-like and non-native conformations to generate the reference state ensemble. Therefore, the IsRNA accounts for both native and non-native interactions in RNA folding.

knowledge-based approach. The former is based on fundamental physics principles and is usually computed through specific averaging schemes for all-atom interactions, while the force field in the latter approach is calculated from the frequencies of occurrence of the different interactions in the solved 3D structures. The third approach consists of hybrid knowledge-based and physics-based methods. For instance, the local potentials, including bond stretching, bond angle bending, and torsion angle constraints may be derived from knowledgebased approach, while the nonlocal energy functions may be constructed through physics-based approaches, such as the multipole-multipole base−base interactions in NARES-2P21 and the many-body orientation-dependent base−base interactions in the HiRE-RNA model.29 A variety of coarse-grained models for RNA folding have been proposed at the different level of coarse graining of the conformational representation.20−29 For instance, NAST,20 a one pseudobead per residue coarse-grained model, can generate, cluster, and rank RNA 3D structures using an RNA-specific knowledge-based energy function and the secondary structure and tertiary contact constraints. He et al. developed NARES-2P,21 a two-particle nucleotide representation, and demonstrated that average interactions between base dipoles, together with chain connectivity and excluded-volume interactions, are sufficient to stabilize helix structures for DNA and RNA molecules. Denesyuk and Thirumalai developed the Three Interaction Site (TIS) model,22 a thermodynamically robust coarse-grained model, in which the force field parameters are determined by matching simulations with experimental melting data, to predict folding thermodynamics of RNA in monovalent salt solutions. In iFoldRNA,23 based on a three-bead nucleotide representation and the decomposition of the nearest-neighbor interactions to parametrize the force fields, for 150 structurally diverse RNAs as test cases, Discrete Molecular Dynamics can predict structures with RMSD < 4 Å and 94% of the total native base-pairs. Shi et al.24 used a threebead representation coarse-grained model (Vfold model7) to predict 3D structures with salt effect. Using a single rigid body with five interaction sites to represent a nucleotide and the nearest-neighbor interaction model, Louis et al.25 developed a nucleotide-based model, oxRNA, to predict the structural, mechanical, and thermodynamic properties of RNA. Xia et al.26,27 and Bell et al.30 introduced a five-bead coarse-grained nucleotide model and a set of knowledge-based coarse-grained potentials derived from the structural statistics. With Watson− Crick (WC) base pairs and small-angle X-ray scattering (SAXS) structural information as constraints, the model is able to characterize complex tertiary structures of large size RNAs. SimRNA, a multilevel representation RNA coarse-grained model,28 employs a statistical potential to approximate the intrachain contacts. The model can fold small size RNA molecules using only sequence information and more complex 3D structures with additional constraints such as secondary structure and/or long-range contacts. The latest version of the HiRE-RNA coarse-grained force field,29 with specifically designed energy terms for base-stacking and base-pairing, including noncanonical pairs, can predict the structure, stability, and free energy surfaces for RNAs of 12 to 76 nucleotides long at the different levels of structural complexities. As more and more RNA 3D structures are determined, the coarse-grained force field derived from the knowledge-based approaches becomes increasingly reliable for RNA 3D structure prediction. However, similar to the statistical potential

2. METHODS 2.1. Coarse-Grained RNA Conformational Model. Similar to other RNA coarse-grained models,7,23,26−28 in the current IsRNA model, we describe the RNA backbone using two coarse-grained beads located at the P (denoted as bead P) and C4′ atoms (denoted as bead S) to represent the phosphate group and the ribose sugar ring, respectively. For the base moiety, we define three coarse-grained beads for purines and two coarse-grained beads for pyrimidines, such as beads RC, GO, and GN for base G and beads YC and CN for base C. The details of the transformations from the all-atom model to the coarsegrained representation for all four types of nucleotides are shown in Figure 1. All the coarse-grained beads are positioned at the center-of-mass of the constituent heavy-atom groups. In such a coarse-grained representation, a base pair can be reproduced by two pairwise interactions, which enables us to well capture the hydrogen bonding configurations and the energies for the different base pairing modes.37 As shown in Table S1 of the Supporting Information (SI), a total of ten unique types of coarse-grained beads are used in the IsRNA model. B

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

caused by the chain connectivity and many-body nonlocal interactions including the volume exclusion between the different segments. Previous studies showed that improper treatment of the correlation effect could result in incorrect magnitudes and even incorrect signs of the extracted statistical potentials.38 For two structure parameters x1 and x2, such as two correlated torsional angles, the correlation is embedded in both pobs(x1) and pobs(x2). The correlation between x1 and x2 should be counted only once in the total energy function. Thus, we need a procedure to remove the redundancy when extracting the energy terms from the observed statistics. Considering that E(x) in eq 2 is the energy in excess of the reference state interactions, as explained below, we can account for the correlation effect through proper construction of the reference state. 2.4. Accounting for Correlation through Iterative Construction of the Reference States. Coarse-Grained Molecular Dynamics (CGMD) Simulation Uncovers the Correlation Effect. We start from the most fundamental interactions for building a chain molecule: the bond stretching energies Ebond (chain connectivity) and the excluded volume interactions ELJ in eq 1. By treating pref(b) as a bond length bindependent constant, namely, assuming no interactions in the reference state, we extract the statistical potentials Ebond and ELJ from eq 2. Based on the bond stretching and volume exclusion energies, the backbone energy Ebk = Ebond + ELJ as the (coarsegrained) force field, CGMD generates a conformational ensemble ref0. For some structure parameters, such as the bond angle x = θ2(GO-RC-GN), whose conformation is fixed by the chain connectivity (bond stretching) and volume exclusion effects, the distributions in ref0 would match the observed statistics pobs(x). According to eq 2, we have E(x) ≃ 0 (see Figure 2a). This example illustrates two important results. First, a small (large) |E(x)| = |Eobs(x) − Eref(x)| may indicate a strong (weak) correlation between x and the structure parameters considered in the reference state. Second, the structure parameters, such as θ2(GO-RC-GN) here, whose motion is fully correlated with the structure parameters considered in the reference state force field (the bond stretching and volume exclusion here), should not be included in the force field in order to avoid redundancy. As a result, not all the structure parameters need to be included in the final coarse-grained force field. For other structure parameters, such as x = θ1(S-RC-GO), the distribution pref0(x) of x in ref0 (the thick red line in Figure 2a) is not close to pobs(x) (black squares in Figure 2a), suggesting that θ1(S-RC-GO) is not fully correlated with the chain connectivity (bond stretching) and excluded volume interactions, and its distribution is also dependent on its own interaction. However, if we add the excess energy for x, E(x) = Eobs(x) − Eref0(x) (eq 2), to the force field, the resultant simulated distribution, psim(x) (the thick blue line in Figure 2a), would match the observed pobs(x). Iterative Construction of the Reference State and the Force Field. For illustrative purposes, we consider three correlated torsion angles, x1 = ϕ(5′S-P-S-P3′), x2 = ϕ(5′RCS-P-S3′), and x3 = ϕ(5′S-P-S-RC3′) (see Figure 2b). These torsion angles are correlated due to their spatial arrangements and the chain connectivity. We use x0 to collectively denote the fundamental structure parameters including the bond stretching (b), bond angle bending (θ), and the LJ excluded volume

Figure 1. Coarse-grained representation of an RNA structure. The backbone is described by two coarse-grained beads located at P and C4′ (denoted as S throughout this paper for its connection to the sugar pucker). The purine bases (A and G) are represented by three coarse-grained beads: (RC, AC, AN) for A and (RC, GO, GN) for G. The pyrimidine bases (U and C) are represented by two coarse-grained beads: (YC, UO) for U and (YC, CN) for C. The above coarse-grained beads for the bases are positioned at the center-of-mass of corresponding heavy-atom groups.

2.2. Coarse-Grained Force Field. In the IsRNA model, the total force field consists of two types of energy terms: the local interactions, which depend on the local RNA structure, and the nonlocal terms, which describe the interactions between non-neighboring nucleotide residues: Etotal = E bond(b) + Eangle(θ ) + Etorsion(ϕ) + Epair(r ) + Eele(r ) + E LJ(r )

(1)

Here Ebond, Eangle, and Etorsion are the local energy terms, representing the bond stretching, bond angle bending between two successive bonds, and torsional energies between three successive bonds, respectively. The nonlocal term Epair accounts for base−base contacts, including base pairing and base stacking, and base-backbone interactions. The electrostatic term Eele accounts for backbone−backbone electrostatic interactions. The last term ELJ describes the excluded volume interaction between any two nonbonded coarse-grained beads. Details of the energy terms, including the functional forms and parameters, are given in the SI. 2.3. Knowledge-Based Approach to the Extraction of the Correlated Energies. The basic assumption in the knowledge-based approach is that for a given structure parameter x, such as the bond length b, bond angle θ, torsion angle ϕ, or pairwise distance r, the probability distribution pobs(x) observed from the native structure (PDB) database follows the Boltzmann distribution law. Therefore, the potential of mean force for x, i.e., the excess energy relative to the reference state, can be extracted as E(x) = −kBT ln

pobs (x) pref (x)

= Eobs(x) − Eref (x)

Eobs(x) = −kBT ln pobs (x)

(2)

Eref (x) = −kBT ln pref (x)

where Eobs(x) and Eref(x) are the energy functions corresponding to the observed and the reference states, respectively. One of the challenges in the calculation of energy parameters through knowledge-based approach is the correlation between the different structure parameters. The correlation can be C

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Here u(xi) is the direct interaction term for xi, and cij is the cross-correlation term between xi and xj. Eobs(xi) = u(xi) +

∑ cij (5)

j

is the total energy for xi, which determines the observed statistics for xi. To extract the force field E(x0,x1,x2,x3) from the observed statistics pobs(xi) (i = 0,1,2,3), we calculate the energy terms E(xi) in eq 3 using iteratively updated reference states for the different structure parameters. 1. We start from x0. Based on Ebk(x0) as the force field, CGMD yields a simulated distribution psim(x0) for x0. Since psim(x0) matches pobs(x0), similar to the aforementioned θ1(SRC-GO), following eq 5, the force field Ebk(x0) should have accounted for all the direct and correlated interactions related to x0: E bk (x0) ≃ u(x0) + c01 + c02 + c03

(6)

2. Based on Ebk(x0) for x0, we derive the energy term Etorsion(x1) for x1 as the next structure parameter to be determined in eq 3. The key is to use the x0-related force field Ebk(x0) to generate the reference state (labeled as ref0) for x1. Then, the energy function for x1 is given by eq 2: Etorsion(x1) = Eobs(x1) − E ref 0(x1)

We note that the force field for the reference state ref0, which accounts for the x0 -related energy only, affects the x 1 distribution through the correlation between x0 and x1: Eref0(x1) ≃ c01. Therefore, combining the above equation with eq 5 for x1, we have

Figure 2. Correlations between the different structure parameters. (a) The reference state probability distributions pref(θ) for two representative bond angles. The reference state is generated by the coarse grained molecular dynamics (CGMD) simulation using only the bond stretching and excluded volume interactions as the force field. Also shown in the figure are the observed probability distributions pobs(θ) from the native structure data set and the simulated Boltzmann-like distribution psim(θ1) from the CGMD simulation with E(θ1), the extracted energy for θ1, added to the force field. (b) The iteration procedure for the extraction of the energy for the torsional angle ϕ = ϕ(5′S-P-S-RC3′). The key issue in the calculation is how to treat the correlation between ϕ and other torsional angles such as x1 = ϕ(5′S-P-S-P3′) and x2 = ϕ(5′RC-S-P-S3′) in the current system. The black squares denote the ϕ distribution observed in the known structures. The other curves show the CGMDgenerated distributions based on the different force fields. Here the CGMD-generated reference state refn is guided by the force field n FF(n) = ∑i = 0 E(xi) (n = 0, 1, 2).

Etorsion(x1) ≃ u(x1) + c12 + c13

E(x0 , x1 , x 2 , x3) = E bk (x0) + Etorsion(x1) + Etorsion(x 2) (3)

Here Ebk(x0) is the backbone energy Ebond + Eangle + ELJ, and Etorsion(x1), Etorsion(x2), and Etorsion(x3) are the torsional potential energies. The apparent additive energy in eq 3 accounts for, albeit implicitly, the correlations between different structure parameters. This total energy can be heuristically decomposed into explicit correlated terms:

Etorsion(x 2) ≃ u(x 2) + c 23

(9)

Physically, Etorsion(x2) accounts for the direct interaction and the correlation for the structure parameter x3, while the correlation effects from x0 and x1 have already been included in Ebk(x0) and Etorsion(x1). As shown in Figure 2b, the distribution for x3 in the reference state ref1 (squares with red cross inside) approaches pobs(x3) (black filled squares), indicating a correlation between x1 and x3 .

E(x0 , x1 , x 2 , x3) ≃ u(x0) + u(x1) + u(x 2) + u(x3) + c01 + c02 + c03 + c12 + c13 + c 23

(8)

Physically, Etorsion(x1) accounts for the direct interaction and the correlation effects for x1 from all the other structure parameters except x0, whose correlation effect has already been included in Ebk(x0). As shown in Figure 2b, the reference distribution for x3 = ϕ in ref0 (red empty squares in Figure 2b) shows no similarity to the observed statistics (black filled squares), suggesting that x0 and x3 are only weakly correlated and the x0-related correlation c03 contributes weakly to the observed distribution pobs(x3). 3. Following the same approach, we extract the energy Etorsion(x2) for x2 in eq 3. Using the x0 and x1-related energy functions Ebk(x0)+Etorsion(x1) as the force field, CGMD simulation gives a new conformational ensemble ref1. Treating ref1 as the reference state for x2, from eq 2, we have Etorsion(x2) = Eobs(x2) − Eref1(x2). The x2 distribution in the reference state ref1 is determined by the correlation effects from x0 and x1: Eref1(x2) ≃ c02+c12. This result combined with eq 5 for x2 yields

interactions (r). In this simplified x0,x1,x2,x3 four-structure parameter system, according to eq 1, the total energy is given as

+ Etorsion(x3)

(7)

(4) D

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation 4. Continuing the iteration, we have the energy term for x3: Etorsion(x3) = Eobs(x3) − Eref2(x3) ≃ u(x3), where the reference state ref2 is the CGMD-simulated conformational ensemble driven by the force field Ebk(x0) + Etorsion(x1) + Etorsion(x2). Because here all the correlation energies related to x3 have been accounted for in the ref2, Etorsion(x3) is determined only by the direct interaction for x3. 5. Adding the energy terms above together gives the total energy E(x0,x1,x2,x3) in eq 3. The resultant CGMD-simulated distribution of x3 (the blue filled squares in Figure 2b) matches the observed statistics (black filled squares), as do the distributions of the other structure parameters x0, x1, and x2. The above iterative simulated construction of the reference state (the “iterative simulated reference stater” approach) may represent a new approach to the extraction of the knowledgebased energy parameters. Compared with the previous approaches,27,29 the current method distinguishes itself by treating the correlation effects in a direct way, instead of employing heuristic parameters such as the different weight coefficients for the different energy terms. Furthermore, the CGMD simulation samples both native-like and non-native conformations. Therefore, this new method can account for not only the native state but also the folding energy landscape. Moreover, for each iterative step, such as the n-th step (n = 2, n−2 3, 4, 5 in the above example), the force field ∑i = 0 E(xi) contains all the interactions and correlations for x0,···,xn−2. Therefore, the CGMD-simulated distributions match the observed statistics for these structure parameters. Because the later steps do not change the energy functions derived in the previous steps, the iteration process leads to a monotonically increasing number of the structure parameters whose simulated distributions agree with their observed statistics. Furthermore, the individual energy term E(xi) may be dependent on the iteration order of the structure parameters. However, the sum of the energy terms (the final total force field) as shown in eq 3 appears to be independent of the iteration order. In fact, the different iteration orders of the structure parameters only correspond to the different ways to parse the interactions and correlations. It is important to note that the termination condition for the iteration is that the simulated distributions are consistent with the observed statistics for all the structure parameters. Therefore, the final force field after the termination of the iteration may be independent of iteration order. For illustration, Figure 3 shows the results for the C-G base pairing energy terms E(r1) and E(r2), where r1 = r(GN−CN) and r2 = r(GO−CN), are derived by following two different iterations orders: r1 → r2 (Figure 3a) and r2 → r1 (Figure 3b). Although the reversed iteration orders are used, the resultant energy functions appear to be the same. 2.5. Structure Data Set and the Observed Statistics for Structure Parameters. We extract the observed distribution pobs(x) for different structure parameter x’s from the database of known structures (PDB). The structure data set includes all the known RNA structures, excluding those with modified bases or bound with ligand/protein/DNA. Structures with redundant/homologous sequences or NMR structures with low resolutions (>5 Å) are also excluded. In total, there are 299 structures in the final structure data set. The data set covers the different types of RNA structures such as ribozyme, tRNA, riboswitch, and other types of structures. We then transform all the (atomic) structures into coarse-grained conformations and count the occurrence frequency (probability) pobs(x) for each

Figure 3. Different iteration orders in the IsRNA lead to the same final total energies. (a) E(r1) is first extracted followed by the extraction of E(r2). The r1 and r2 distributions in the initial reference states are shown as red squares for pair r1 and red empty circles for r2. Here the initial reference state is generated based on only the local energies and the excluded volume interactions. The distributions in the simulated and the reference states, after the energy E(r1) is extracted and added to the force field, are shown as the blue squares for psim(r1) and red full circles for pref(r2). The distribution in the simulated conformational ensemble (with both E(r1) and E(r2) added to the force field) are shown as green squares for r1 and blue circles for r2. (b) Similar to (a) with the iteration order reversed.

concerned structure parameter x. From eq 2, we can calculate the observed energy Eobs(x) = −kBT ln pobs(x). For our coarsegrained conformational representation, a total of 12 bonds (b), 18 bond angles (θ), 22 torsion angles (ϕ), and 55 contact pair types (r) are observed from the native structure data set. The statistical bin sizes for the bond length, bond angle, torsion angle, and pairwise distance are 0.01 Å, 1◦, 6◦, and 0.2 Å, respectively. For the pairwise distributions pobs(r), only the coarse-grained beads within pairwise distance r < 20 Å are considered in the statistics. 2.6. CGMD-Simulated Conformational Ensemble for the Reference State and the Simulated Distributions. In each iteration step, we select 40 structures from the structure database to generate the conformational ensemble, from which we compute the probability distributions pref(x) and/or psim(x) for the different structure parameters. The collection of these 40 structures is fixed for all the iteration steps, and the PDB IDs are shown in Table S2 of the SI. To avoid the possible bias, these 40 structures are carefully selected to include a variety of structural types, such as stem-loop, pseudoknot, triplex, multibranched junctions, and kissing-loop structures, with sizes ranging from 22 to 188 nucleotides. The total sequence length is 2609 nucleotides. E

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation For a given force field, using the above 40 structures as initial states, we perform a series of coarse-grained molecular dynamics (CGMD) simulations using the modified open source software, LAMMPS,39 to generate a conformational ensemble. For each iterative simulated step, the CGMDgenerated conformational ensemble is used to calculate the distribution psim(x) of the concerned structure parameter x. For a reference state conformational ensemble, psim(x) is treated as the reference distribution pref(x). It is important to note that, although the simulations start out with the native structures, the equilibrium distribution in the CGMD-generated ensemble is fully governed by the force field and can cover both the nativelike and non-native interactions. Furthermore, our control test shows that the CGMD-generated conformational ensemble appears to have properly covered the conformational space, and the extracted energy parameters are likely independent of the selection of the training set (the 40 structures here) (see Figure S5 in the SI). 2.7. Parametrization of the Force Field. Although the iteration order of structure parameters does not alter the total energy, it can affect the convergence rate of the iteration and the choice of the (independent) structure parameters used in the final force field. In our calculation, we follow the iteration order: bond angle → torsion angle → pairwise interaction to extract the energy terms. Within each structure parameter category (such as all the bond angles), the structure parameter x of the largest |Eobs(x)−Eref(x)| is selected in each iteration step. In that way, the convergence rate of the iteration is optimized, and the number of the total structure parameters in the final force field is expected to minimal. 1. We first determine all 12 bond stretching energies Ebond(b) from the observed statistics pobs(b) through eq 2: Ebond(b) = −kBT ln(pobs(b)/pref(b)) by assuming a constant pref(b). As a result, we have Ebond(b) = Eobs(b). From the bond length geometry, we estimate the diameters of all coarse-grained beads in the excluded volume interactions ELJ. 2. Based on the bond stretching and excluded volume interactions, we parametrize the bond angle energies one-byone through iterative simulated construction of the reference states until the simulated distributions psim(θ) are consistent with the observed distributions pobs(θ) for all the 18 bond angles. 3. With the bond angle energies added to the bond stretching and excluded volume interactions, the torsion energy functions are parametrized one-by-one through the iterative simulated reference state approach until the simulated distributions psim(ϕ) match the observed ones pobs(ϕ) for all the 22 torsional angles. 4. Finally, with the torsion angle energies added to the bond stretching, bond angle, and excluded volume interactions, the pairwise interactions are determined step-by-step through the iterative process until the simulated distributions psim(r) are consistent with the observed statistics pobs(r) for all the 55 pairwise interactions. See Figure 4 for a flowchart that summarizes the above procedure. Figure 5 shows how the first energy term in the category of nonlocal pairwise interactions is parametrized. After all the local energies (bond stretching, bond angle, torsion angle, and excluded volume) have been extracted and integrated into the force field, we run CGMD simulations to construct a reference state refs. From this reference state, we calculate the reference distributions prefs(r) for all the pairwise distances r, among

Figure 4. A flowchart for the step-by-step extraction of the correlated energy functions through iterative simulated construction of the reference states.

Figure 5. Parametrization of the first nonlocal pairwise energy (rs = r(GN−CN)) after all the local energy terms have been extracted. In the reference state for rs, the bond stretching, bond angle, torsion angle, and the excluded volume interactions are all included. By fitting the formula for Epair(r) (eq S4) (the magenta line) to Eobs(rs) − Eref(rs) (see eq 2), the parameters for Epair(rs) can be determined, and the resultant simulated distribution matches the observed statistics. For comparison, the result from the noninteracting ideal gas model (Efree ref (r) ∼ −2kBT ln(r)) as the reference state is also given (orange dashed line).

which rs = r(GN−CN) has the largest |Eobs(rs)−Erefs(rs)|, and thus E(rs) is selected for the energy extraction for the step. As shown in Figure 5, the energy Erefs(rs) (see eq 2) determined from the reference state refs is lower than that from F

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 6. A selected set of the extracted energy functions. The first row shows the energies for the bond stretching b(P-S), bond angle θ(P-S-RC), and torsion angle ϕ(S-RC-GO-GN), respectively. The second and the third rows show the two main pairwise interactions E(r) for the canonical base pairs GC, AU, and GU, respectively. 2 the noninteracting ideal gas model Efree ref (rs) ∼ −kBT ln(4πrs ) in the short-range of rs. The difference is attributed to the correlations between the different local interactions. From eq 2, the formula for Epair(r) (eq S4 in the SI), and the distributions pobs(rs) and prefs(rs), we can find the energy parameters that give the best fit to the pairwise energy Epair(rs) = Eobs(rs) − Erefs(rs). After the pairwise energy Epair(rs) is determined and added to the force field, we rerun the CGMD simulations to generate a new conformation ensemble. The simulated distribution psim(rs) from this new conformation ensemble would match the observed distribution pobs(rs) (see Figure 5). Moreover, this new conformational ensemble serves as an updated reference state for the next structure parameter whose energy function is to be determined in the subsequent step.

and Figures S1−S4 in the SI for a complete list of all the structure parameters. As a further test of the IsRNA method, we use the whole data set of the 299 structures as the training set. The test results show the same agreement between the simulated distributions and the observed statistics (see Figure S5). The anharmonic profiles of bond stretching (such as b(P-S)) and bond angles (such as θ(P-S-RC)) lead to the functional form of a combined harmonic and Gaussian energy functions for Ebond(b) (eq S1) and Eangle(θ) (eq S2). Compared with the widely used harmonic functions,22,24,26,27,29 the anharmonic potential energy has the advantage of providing a broader and more accurate sampling for the backbone conformations. The trigonometric function in Etorsion(ϕ) (eq S3) is used to describe the multiple local minima in the observed torsion angle distribution. The distributions of the six main pairwise interactions for the canonical base pairs GC, AU, and GU show oscillatory behaviors, arising primarily from the interactions presented in the helices. Moreover, the extracted pairwise interactions are weaker for the GU base pair than the GC and AU base pairs. In general, we found that the combination of a Morse potential and two Gaussian terms in Epair(r) (eq S4) can give good agreement between the observed pobs(r) and simulated psim(r) distributions for the pairwise interactions. Because the energy functions are extracted based on the sampling of the whole conformation space, including the nonnative conformations, our approach may be able to account for the folding stability. Specifically, the extracted force field may be

3. RESULTS AND DISCUSSION The validity of the iterative simulated reference state approach in IsRNA is supported by the fact that the simulated distributions psim(x) agree with the observed distributions pobs(x) for all the structure parameters, including the bond stretching (b), bond angles (θ), torsion angles (ϕ), and pairwise distances (r). Furthermore, due to the correlation between different structure parameters, not all the structure parameters are independent variables in the final force field. In fact, our calculation shows that only 12 of the 18 bond angle types, 14 of the 22 torsion angle types, and 38 of the 55 nonlocal pairwise interaction types are independent variables. See Figure 6 for a selected set of the extracted energy functions G

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 7. Equilibrium conformational distributions (in terms of the RMSDs to the native structures) for three selected RNAs at the different stages of the iterative simulated procedures for the extraction of energy functions: “Bond” denotes the force field that consists of all the bond stretching energies (Ebond(b)) and excluded volume interactions (ELJ(r)), “Bond angle” for all the Ebond(b), ELJ(r), and bond angle energies (Eangle(θ)), “Torsion angle” for all the Ebond(b), ELJ(r), Eangle(θ), and torsion angle energies (Etorsion(ϕ)), “Half pairwise” for all the local energies (Ebond(b), Eangle(θ), Etorsion(ϕ), and ELJ(r)), and the first half of the 38 extracted nonlocal pairwise interactions, and “All” for the full set of all the energy functions. (a) Stem-loop structure (PDB id: 2l2j). (b) Pseudoknot (PDB id: 1ymo). (c) tRNA (PDB id: 1zo3). The RMSD calculations are based on all the coarse-grained beads. The insets show the most probable conformations after all the energy functions have been extracted and implemented in the force field. The corresponding native structures are indicated by the orange color.

the CGMD simulation, is still not as close to the native structure as the hairpin and pseudoknot structures shown above. As shown in the inset of Figure 7c, we find that the large RMSD mainly comes from an incorrect spatial arrangement of the helices in the 4-way junction for the simulated structures. We note that with the native secondary structure as the constraint, the SimRNAweb41 predicts this tRNA structure with RMSD = 18 Å. These results show that additional factors, such as the ion effects and the sequence-dependent many-body interactions, may be important for the spatial arrangement of helices in multibranched junctions.

able to stabilize the native structures over the non-native folds. As shown in Figure 7, for three selected illustrative RNA systems, the step-by-step parametrization of the force field can indeed lead to systematic improvements in the prediction of the native-like structures, as indicated by the decreasing root-meansquare deviation (RMSD) to the native structure. For all three cases, before the nonlocal pairwise interactions are extracted and included in the force field (after the bond stretching, bond angle, and torsion energies are determined), the CGMDgenerated structures show notably large RMSD, suggesting the importance of the nonlocal pairwise interactions in stabilizing the native folds. Following our iteration order, we find that the first half of the 38 extracted nonlocal pairwise energy terms is dominated by the canonical and noncanonical base pairing as well as stacking interactions, while the other half mainly comes from the other interactions such as base-backbone interactions. As shown in Figure 7a, for the stem-loop structure, which includes helices, internal loops, and hairpin loops (PDB id: 2l2j), a near-native structure distribution can be achieved if only the first half of pairwise interactions is parametrized and included in the force field, and the most probable conformations have an RMSD around 5 Å. When all the energy terms are extracted and included in the force field, the RMSD is reduced to 3 Å. For the triple helix pseudoknot in Figure 7b, which contains rich loop-helix interactions (PDB id: 1ymo), without including the second half of the nonlocal pairwise energy terms, whether or not including the first half of the pairwise interactions into the force field makes nearly no difference to the RMSD distribution, which suggests these interactions alone are not sufficient the stabilize the structure. However, a full force field including the second half of the pairwise energy terms can lead to a near-native structure distribution with RMSD = 7 Å. The result suggests a cooperatively coupled interactions network40 to stabilize this triplex pseudoknot structure. For the tRNA structure in Figure 7c, which includes a 4-way junction (PDB id: 1zo3), the iterative inclusion of the nonlocal pairwise interactions into the force field indeed leads to a more native-like structure distribution in the conformation ensemble. However, the most probable distribution (with RMSD = 14 Å), after all the energy functions are constructed and included in

4. CONCLUSION One of the greatest challenges for a coarse-grained model is an accurate force field that can capture the key structural features. As a widely used method to construct the coarse-grained force field, the knowledge-based approach is challenged by the reference state problem. A physical reference state should correctly reflect the excluded volume effect, inherent chain connectivity, and correlations between the different energy terms (to avoid overconstraint). To tackle the challenges, here we develop the IsRNA method. The key component of IsRNA is to utilize CGMD-generated conformational ensembles as iteratively updated reference states to extract the different (correlated) energy terms. The validity of IsRNA is supported by the close agreement between the simulated conformational distribution and the (experimentally) observed statistics for all the structure parameters. Because the IsRNA method is based on the conformational sampling of both non-native and native-like structures, the energy functions extracted here can account for, to some extent, the effect of the folding energy landscape. Furthermore, the IsRNA-derived coarse-grained force field may lead to a new method for de novo RNA 3D structure prediction. Therefore, the current study serves as a promising starting point for the development of a new RNA structure prediction method.



ASSOCIATED CONTENT

* Supporting Information S

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.7b01228. H

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation



molecular energy, minimization, and dynamics calculations. J. Comput. Chem. 1983, 4, 187−217. (14) MacKerell, A. D., Jr.; Banavali, N.; Foloppe, N. Development and current status of the CHARMM force field for nucleic acids. Biopolymers 2000, 56, 257−265. (15) Brooks, B. R.; Brooks, C. L., 3rd; Mackerell, A. D., Jr; Nilsson, L.; Petrella, R. J.; Roux, B.; Won, Y.; Archontis, G.; Bartels, C.; Boresch, S.; Caflisch, A.; Caves, L.; Cui, Q.; Dinner, A. R.; Feig, M.; Fischer, S.; Gao, J.; Hodoscek, M.; Im, W.; Kuczera, K.; Lazaridis, T.; Ma, J.; Ovchinnikov, V.; Paci, E.; Pastor, R. W.; Post, C. B.; Pu, J. Z.; Schaefer, M.; Tidor, B.; Venable, R. M.; Woodcock, H. L.; Wu, X.; Yang, W.; York, D. M.; Karplus, M. CHARMM: the biomolecular simulation program. J. Comput. Chem. 2009, 30, 1545−1614. (16) Zhuang, Z.; Jaeger, L.; Shea, J.-E. Probing the structural hierarchy and energy landscape of an RNA T-loop hairpin. Nucleic Acids Res. 2007, 35, 6995−7002. (17) Bowman, G. R.; Huang, X.; Yao, Y.; Sun, J.; Carlsson, G.; Guibas, L. J.; Pande, V. S. Structural insight into RNA hairpin folding intermediates. J. Am. Chem. Soc. 2008, 130, 9676−9678. (18) Zhang, Y.; Zhang, J.; Wang, W. Atomistic analysis of pseudoknotted RNA unfolding. J. Am. Chem. Soc. 2011, 133, 6882− 6885. (19) Chen, A. A.; Garca, A. E. High-resolution reversible folding of hyperstable RNA tetraloops using molecular dynamics simulations. Proc. Natl. Acad. Sci. U. S. A. 2013, 110, 16820−16825. (20) Jonikas, M. A.; Radmer, R. J.; Laederach, A.; Das, R.; Pearlman, S.; Herschlag, D.; Altman, R. B. Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA 2009, 15, 189−199. (21) He, Y.; Maciejczyk, M.; Odziej, S.; Scheraga, H. A.; Liwo, A. Mean-field interactions between nucleic-acid-base dipoles can drive the formation of a double helix. Phys. Rev. Lett. 2013, 110, 098101. (22) Denesyuk, N. A.; Thirumalai, D. Coarse-grained model for predicting RNA folding thermodynamics. J. Phys. Chem. B 2013, 117, 4901−4911. (23) Ding, F.; Sharma, S.; Chalasani, P.; Demidov, V. V.; Broude, N. E.; Dokholyan, N. V. Ab initio RNA folding by discrete molecular dynamics: From structure prediction to folding mechanisms. RNA 2008, 14, 1164−1173. (24) Shi, Y.-Z.; Wang, F.-H.; Wu, Y.-Y.; Tan, Z.-J. A coarse-grained model with implicit salt for RNAs: Predicting 3D structure, stability and salt effect. J. Chem. Phys. 2014, 141, 105102. (25) Šulc, P.; Romano, F.; Ouldridge, T. E.; Doye, J. P. K.; Louis, A. A. A nucleotide-level coarse-grained model of RNA. J. Chem. Phys. 2014, 140, 235102. (26) Xia, Z.; Gardner, D. P.; Gutell, R. R.; Ren, P. Coarse-grained model for simulation of RNA three-dimensional structures. J. Phys. Chem. B 2010, 114, 13497−13506. (27) Xia, Z.; Bell, D. R.; Shi, Y.; Ren, P. RNA 3D structure prediction by using a coarse-grained model and experimental data. J. Phys. Chem. B 2013, 117, 3135−3144. (28) Boniecki, M. J.; Lach, G.; Dawson, W. K.; Tomala, K.; Lukasz, P.; Soltysinski, T.; Rother, K. M.; Bujnicki, J. M. SimRNA: a coarsegrained method for RNA folding simulations and 3D structure prediction. Nucleic Acids Res. 2016, 44, e63. (29) Cragnolini, T.; Laurin, Y.; Derreumaux, P. Pasquali S. Coarsegrained HiRE-RNA model for ab initio RNA folding beyond simple molecules, including noncanonical and multiple base pairings. J. Chem. Theory Comput. 2015, 11, 3510−3522. (30) Bell, D. R.; Cheng, S. Y.; Salazar, H.; Ren, P. Capturing RNA folding free energy with coarse-grained molecular dynamics simulations. Sci. Rep. 2017, 7, 45812. (31) Noid, W. G. Perspective: Coarse-grained models for biomolecular systems. J. Chem. Phys. 2013, 139, 090901. (32) Thomas, P. D.; Dill, K. A. Statistical potentials extracted from protein structures: how accurate are they? J. Mol. Biol. 1996, 257, 457− 469.

Detailed formulas and parameters for energy terms in the coarse-grained force field, technical details about coarsegrained MD simulation, complete list of the extracted energy functions for all the structure parameters, results for the benchmark tests over the whole data set of the 299 structures, properties of the ten types of the coarsegrained (CG) beads, and PDB IDs of the 40 simulated structures (PDF)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Shi-Jie Chen: 0000-0002-8093-7244 Funding

This research was supported by NIH grants GM063732 and GM117059. Most of the computations involved in this research were performed on the HPC resources at the University of Missouri Bioinformatics Consortium (UMBC). The HPC equipment for the computational work is supported by the National Science Foundation CNS-1429294. Notes

The authors declare no competing financial interest.



REFERENCES

(1) Holbrook, S. R. RNA structure: the long and the short of it. Curr. Opin. Struct. Biol. 2005, 15, 302−308. (2) Strobel, E.; Seeling, K.; Tebbe, C. C. Diversity responses of rumen microbial communities to Fusarium-contaminated feed, evaluated with rumen simulating technology. Environ. Microbiol. 2008, 10, 483−496. (3) Scott, W. G. Ribozymes. Curr. Opin. Struct. Biol. 2007, 17, 280− 286. (4) Serganov, A.; Patel, D. J. Molecular recognition and function of riboswitches. Curr. Opin. Struct. Biol. 2012, 22, 279−286. (5) Das, R.; Baker, D. Automated de novo prediction of native-like RNA tertiary structures. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 14664−14669. (6) Parisien, M.; Major, F. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 2008, 452, 51−55. (7) Cao, S.; Chen, S.-J. Physics-based de novo prediction of RNA 3D structures. J. Phys. Chem. B 2011, 115, 4216−4226. (8) Popenda, M.; Szachniuk, M.; Antczak, M.; Purzycka, K. J.; Lukasiak, P.; Bartol, N.; Blazewicz, J.; Adamiak, R. W. Automated 3D structure composition for large RNAs. Nucleic Acids Res. 2012, 40, e112. (9) Pearlman, D. A.; Case, D. A.; Caldwell, J. W.; Ross, W. S.; Cheatham, T. E.; Debolt, S.; Ferguson, D.; Seibel, G.; Kollman, P. Amber, a package of computer-programs for applying molecular mechanics, normal-mode analysis, molecular-dynamics and free-energy calculations to simulate the structural and energetic properties of molecules. Comput. Phys. Commun. 1995, 91, 1−41. (10) Cheatham, T. E., 3rd; Cieplak, P.; Kollman, P. A. A modified version of the Cornell et al. force field with improved sugar pucker phases and helical repeat. J. Biomol. Struct. Dyn. 1999, 16, 845−862. (11) Case, D. A.; Cheatham, T. E., 3rd; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M., Jr.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J. The Amber biomolecular simulation programs. J. Comput. Chem. 2005, 26, 1668−1688. (12) Pérez, A.; Marchán, I.; Svozil, D.; Sponer, J.; Cheatham, T. E., 3rd; Laughton, C. A.; Orozco, M. Refinement of the AMBER force field for nucleic acids: Improving the description of alpha/ gamma conformers. Biophys. J. 2007, 92, 3817−3829. (13) Brooks, B. R.; Bruccoeri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. CHARMM: A program for macroI

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation (33) Zhou, H.; Zhou, Y. Distancescaled, finite idealgas reference state improves structurederived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002, 11, 2714−2726. (34) Miyazawa, S.; Jernigan, R. L. Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules 1985, 18, 534−552. (35) Miyazawa, S.; Jernigan, R. L. An empirical energy potential with a reference state for protein fold and sequence recognition. Proteins: Struct., Funct., Genet. 1999, 36, 357−369. (36) Huang, S. Y.; Zou, X. Q. An iterative knowledge-based scoring function to predict protein-ligand interactions: I. Derivation of interaction potentials. J. Comput. Chem. 2006, 27, 1866−1875. (37) Lemieux, S.; Major, F. RNA canonical and noncanonical base pairing types: a recognition method and complete repertoire. Nucleic Acids Res. 2002, 30, 4250−4263. (38) Thomas, P. D.; Dill, K. A. An iterative method for extracting energy-like quantities from protein structures. Proc. Natl. Acad. Sci. U. S. A. 1996, 93, 11628−11633. (39) Plimpton, S. Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. 1995, 117, 1−19. (40) Homan, P. J.; Favorov, O. V.; Lavender, C. A.; Kursun, O.; Ge, X.; Busan, S.; Dokholyan, N. V.; Weeks, K. M. Single-molecule correlated chemical probing of RNA. Proc. Natl. Acad. Sci. U. S. A. 2014, 111, 13858−13863. (41) Magnus, M.; Boniecki, M. J.; Dawson, W.; Bujnicki, J. M. SimRNAweb: a web server for RNA 3D structure modeling with optional restraints. Nucleic Acids Res. 2016, 44 (Web Server issue), W315−W319.

J

DOI: 10.1021/acs.jctc.7b01228 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX