Improved Accuracy in RNA–Protein Rigid Body Docking by

Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064,...
2 downloads 15 Views 4MB Size
Subscriber access provided by Northern Illinois University

Article

Improved accuracy in RNA-protein rigid body docking by incorporating force field for molecular dynamics simulation into the scoring function Junichi Iwakiri, Michiaki Hamada, Kiyoshi Asai, and Tomoshi Kameda J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.6b00254 • Publication Date (Web): 05 Aug 2016 Downloaded from http://pubs.acs.org on August 15, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Improved accuracy in RNA-protein rigid body docking by incorporating force field for molecular dynamics simulation into the scoring function Junichi Iwakiri,† Michiaki Hamada,‡,¶ Kiyoshi Asai,†,¶ and Tomoshi Kameda∗,¶ Graduate School of Frontier Sciences, The University of Tokyo, 5–1–5 Kashiwanoha, Kashiwa, Chiba 277–8562, Japan., Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N–06–10, 3–4–1, Okubo Shinjuku-ku, Tokyo 169–8555, Japan., and Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2–4–7 Aomi, Koto-ku, Tokyo 135–0064, Japan. E-mail: [email protected]

Phone: +81-3-3599-8612. Fax: +3-3599-8612

Abstract RNA-protein interactions play fundamental roles in many biological processes. To understand these interactions, it is necessary to know the three-dimensional structures of RNAprotein complexes. However, determining the tertiary structure of these complexes is often difficult, suggesting that an accurate rigid body docking for RNA-protein complexes is needed. In general, the rigid body docking process is divided into two steps: generating candidate structures from the individual RNA and protein structures and then narrowing down the candidates. ∗ To

whom correspondence should be addressed University of Tokyo ‡ Waseda University ¶ National Institute of Advanced Industrial Science and Technology (AIST) † The

1 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

In this study, we focus on the former problem to improve the prediction accuracy in RNAprotein docking. Our method is based on the integration of physico-chemical information about RNA into ZDOCK, which is known as one of the most successful computer programs for protein-protein docking. Because recent studies showed the current force field for molecular dynamics simulation of protein and nucleic acids is quite accurate, we modeled the physicochemical information about RNA by force fields such as AMBER and CHARMM. A comprehensive benchmark of RNA-protein docking, using three recently developed datasets, reveals the remarkable prediction accuracy of the proposed method compared with existing programs for docking: the highest success rate is 34.7% for the predicted structure of the RNA-protein complex with the best score and 79.2% for 3,600 predicted ones. And, three full atomistic force fields for RNA (amber94, amber99 and CHARMM22) produced almost same accurate result, which showed current force fields for nucleic acids are quite accurate. In addition, we found that the electrostatic interaction and the representation of shape complementary between protein and RNA plays the important roles for accurate prediction of the native structures of RNA-protein complexes.

1 Introduction The recognition of specific RNAs by RNA-binding proteins is a hallmark of biological regulatory processes, such as splicing, localization and translation. Recently, macromolecular interactions between RNA and RNA-binding proteins were comprehensively identified by high-throughput experimental methods combining crosslinking, next-generation sequencing, microarrays and mass spectrometry. 1–4 Using these methods, thousands of proteins interacting with messenger RNAs were simultaneously identified in human cells, and 30% of these had not previously been reported as RNA-binding proteins. To understand the mechanisms underlying such RNA-protein interactions, the atomic details of interactions in the three-dimensional structures of RNA-protein complexes need to be analyzed. However, experimental determination of these complexes is still difficult, which is evident in the slow growth of the number of structures of RNA-protein complexes

2 ACS Paragon Plus Environment

Page 2 of 29

Page 3 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

deposited in the Protein Data Bank (PDB). 5 To mitigate this experimental limitation, accurate computational modeling of three-dimensional structures of RNA-protein complexes is urgently required. 6 In the modeling of complexes, one important approach is computational docking, in which an appropriate orientation of two macromolecules is sought using the three-dimensional structures of their unbound states. Computational docking is usually divided into two steps. 7 The first step is sampling: searching the possible orientations and conformations of two macromolecules from the diverse conformational space and generating candidate structures for complexes (called decoys). The second step is rescoring: among the generated decoys, discriminating the near-native complexes from the non-native ones using scoring functions customized to the types of macromolecules (e.g. proteinprotein, protein-DNA, and RNA-protein) forming the complex. In recent years, most of the studies related to RNA-protein docking have focused on the rescoring step for narrowing down the large number of decoys generated by docking software, such as FTdock and GRAMM. 8–12 However, in the sampling step, these computer programs often fail to generate near-native structures in the resulting set of decoys because their internal scoring functions were developed not for RNA-protein docking but for protein-protein docking. 13 For example, Wang et al. generated 10,000 decoys for a target RNA-protein complex using FTdock to demonstrate their new scoring function but FTdock could not find enough near-native structures in 16 out of 33 cases. 11 This suggests that generating an appropriate set of decoys that includes several near-native structures in the sampling step is the major bottleneck in RNA-protein docking, and the information for RNA-protein interaction should be included properly. One solution for improving the sampling step is to include an appropriate scoring function, reflecting RNA-protein interactions observed in native complexes, into the docking software. Over the past two decades, RNA-protein interactions have been frequently investigated using the threedimensional structures of RNA-protein complexes. 14–19 In all of these analyses, electrostatic interactions between the positively charged amino acids of proteins and the negatively charged phosphate groups of RNAs were frequently observed in the RNA-protein interfaces, which implies

3 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

that electrostatic interactions need to be included in the scoring function. However, such physicochemical interactions are not included in existing docking software that can deal with RNA-protein complexes. In this study, we modeled physico-chemical information about RNA, such as the partial charge and vdW radii of atoms, by force filed of molecular dynamics (MD) simulation for nucleid acids, such as AMBER 20–22 and CHARMM. 23–26 Recent studies showed the MD simulation with the current molecular force field is capable of reproducing the properties of protein and RNA molecules. D.Shaw and coworkers showed MD simulation with such force field from unfolded form of protein can fold to their experimentally determined native structures. 27 They also carried out molecular dynamics simulations of that ligands and a protein were initially located in water box at random positions, and confirmed the ligand bound its binding site of the protein correctly. 28 And, Sakuraba et al. showed RNA duplex dimerization free energy changes upon mutations derived from the MD simulation quantitatively agree with the ones from the experiments (the absolute deviation of 0.55 kcal / mol and the R2 value of 0.97). 29 In addition, we also included the statistical potential for RNA, which was developed in the same manner as in ZDOCK3.0 for amino acids and compound. 30 To assess the performance of our method, comprehensive benchmarking was undertaken using three recently developed non-redundant datasets of RNA-protein complexes that include a large number of unbound structures of proteins and RNAs. 31–33

2 MATERIALS AND METHODS 2.1 Datasets Recently, three benchmark datasets of RNA-protein complexes derived from the Protein Data Bank (PDB) 5 have been developed to evaluate the performance of docking software for predicting the structures of RNA-protein complexes. The first dataset developed by Zou et al. (dataset1) consists of 72 complexes, comprising 52 RNA-protein complexes with unbound structures of both proteins and RNAs (unbound-unbound targets), and 20 complexes with unbound structures of proteins and 4 ACS Paragon Plus Environment

Page 4 of 29

Page 5 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

bound structures of RNAs (unbound-bound targets). 33 The second dataset developed by Fernandez et al. (dataset2) consists of 71 complexes comprising 9 unbound-unbound targets and 62 unboundbound targets. 31 The third dataset developed by Bahadur et al. (dataset3) consists of 45 complexes comprising 9 unbound-unbound targets and 36 unbound-bound targets. 32 From this dataset, only one unbound-bound target, involving a RNA/DNA hybrid duplex (PDBID: 1ZBI), was excluded from our benchmark because this hybrid duplex is not appropriate for benchmarking RNA-protein docking. In these three datasets, modified residues are removed and only heavy atoms of the standard 20 types of amino acids and 4 types of nucleotides are used for our benchmarks. In cases of structures determined by NMR, the first model is used for the benchmark.

2.2 Docking software In this study, three docking software packages—ZDOCK, 34 FTdock 13 and GRAMM 35 —are used for predicting the structures of RNA-protein complexes. ZDOCK is one of the best docking programs for predicting the structure of protein-protein complexes. 36 However, it can also be used for RNA-protein complexes. In a recent CAPRI (Critical Assessment of PRediction of Interactions) contest involving RNA-protein docking (targets 33 and 34), ZDOCK was used to predict RNA-protein complexes. 37 The latest version, ZDOCK 3.0.2, 38 is used in this study and its docking procedure is performed with the default settings: a 1.2 Å grid step, variable grid size for fitting the size of a protein (considered as a receptor) and RNA (considered as ligand), with a 15◦ angle step for rotation of the ligand. In recent studies, FTdock 13 has been frequently used to generate a number of predicted structures of RNA-protein complexes (decoys). 11,39,40 In this study, two settings were used: (i) 1.2 Å grid step and 15◦ angle step; and (ii) 0.875 Å grid step (FTdock default) and 9◦ angle step (minimum value in FTdock). Setting (i) was used to keep same grid and angle steps as used in ZDOCK. In contrast, the minimum angle step and smaller grid step in setting (ii) were used to obtain better predictions. Setting (ii) is used to give representative results for FTdock, because its prediction performance was better than that of setting (i). 5 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

GRAMM was also used to generate decoys for RNA-protein docking by Tuszynska and Bujnicki. 9 As with FTdock, two settings were used: (i) the parameters used in a previous study; 9 and (ii) an angle step of 15◦ (only the angle step was changed from (i)) to have the same angle step as used in ZDOCK. In this study, setting (i) is considered as representative of GRAMM because it produced better predictions than setting (ii). Note that the minimum grid step allowed by GRAMM was used for each target complex to obtain better predictions in these two settings due to the limitation of GRAMM caused by its fixed grid size (64 × 64 × 64).

2.3 Scoring functions In this paper, we propose a novel sampling method for RNA-protein docking that reflects the nature of RNA-protein interactions. From version 2.3, ZDOCK has predicted the three-dimensional structure of protein-protein complexes by using a score function composed of three terms: 30,41 the potential for pairwise shape complementary among proteins, the electrostatic potential (Coulomic  energy with a distance-dependent dielectric function ε ri j . see Detail in 13,42 ) and the statistical potential (an energy function for the weight of each amino acid-amino acid and amino acidcompound pair, which is derived from the analysis of known protein complex structures). To represent the scoring function, information about the Van der Waals radii, the partial charge of atoms and the statistical potential is required. Unfortunately, this information for nucleic acids is not included by default in ZDOCK. In a recent CAPRI contest involving RNA-protein docking (targets 33 and 34), Weng and coworkers predicted RNA-protein complexes by using ZDOCK. 37 Although their scoring function was specifically developed for protein-protein docking, their prediction for target 34 was very close to being acceptable. This result implies that the current scoring function of ZDOCK also works well for RNA. To improve the predictive ability for RNA, we imported relevant parameters for nucleic acids from other sources and added them into the scoring function of ZDOCK 3.0.2. The partial charge of RNA is imported from a force field for molecular dynamics simulation. In default ZDOCK, the partial charge of protein was derived from CHARMM19 force fild, 34 6 ACS Paragon Plus Environment

Page 6 of 29

Page 7 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

our strategy is resembling.

Various force fields have been proposed. We used four force fields

(AMBER94, 20 AMBER03, 21,22 CHARMM19 23–25 and CHARMM22 26 ) and compared their performance for structural prediction. In AMBER94, AMBER03 and CHARMM22, the protein and RNA molecule is represented by all atoms, the protein and RNA molecule is represented by heavy atoms and polar hydrogen atoms in CHARMM19. However, in ZDOCK, the molecule is represented by only heavy atoms (i.e. not using hydrogen atom). In this study, we used only the heavy atoms of RNA and proteins for prediction like as default ZDOCK. And, we integrated the partial charge of hydrogen atoms into that of the heavy atom connected to them and the van der Waals radii of heavy atoms of RNA were imported from CHARMM19 force fields (for details, see the Supplementary Figure S1). The statistical potential of RNA were imported from the parameters of compounds, such as adenosine triphosphate (ATP), supplied in the original ZDOCK3.0.2. Supplementary Figure S2 shows the generation procedure for the statistical potential of adenine from the statistical potential for ATP as an example. All values of the Van der Waals radii, the partial charges and the indexes for the statistical potential are provided through uniCHARMM files (for details, see the Supplementary information).

2.4 Performance evaluation Each docking program was used to generate 3,600 decoys for a target RNA-protein complex in the datasets. These decoys were evaluated by calculating ligand-RMSD (Root Mean Square Deviation) which is commonly used to measure the deviation of the predicted decoy from a corresponding native structure. 9,11,39 The ligand-RMSD between decoys and native structures were calculated by the following procedure. For each target complex, primary sequences of decoy and native structures were aligned to find the aligned (equivalent) residues by using clustalW software. 43 Among these residues, Cα atoms of the decoy were superimposed on those of the native structure. Then, the ligand-RMSD between the decoy and the native structure was calculated using all heavy atoms with equivalent residues in the RNA chains. These superimposition and RMSD calculations were performed using the McLachlan algorithm 44 as implemented in the program ProFit (Martin, 7 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

A.C.R. and Porter, C.T., http://www.bioinf.org.uk/software/profit/). In the ligand-RMSD calculation, missing atoms in decoys and native structures were excluded. Among the 3,600 generated decoys, decoys with ligand-RMSDs less than 10 Å were defined as “near-native” structures. This threshold was frequently used to evaluate the performance of RNA-protein docking. 9,11,39,45 The prediction performance of our method, as well as that of FTdock and GRAMM, was evaluated on the basis of success rate and the number of near-native structures in 3,600 generated decoys ranked according to their docking scores. The success rate was taken to be the percentage of target RNA-protein complexes for which the docking software predicts at least one near-native prediction within a given number of top-scoring decoys. 30 These decoys were also evaluated with another reaction coordinate, Q-value. Q-value is a measure of nativeness of a biomolecule structure. The value of Q close to unity means the conformation is similar to the native structure, while the Q-value of zero means the conformation is dissimilar to the native structure. Q-value is well used in the research of protein folding. 46–48 To define Q-value, we regard that the i-th amino acid and j-th nucleotide are in the “native contacts”, if one of the non-hydrogen atoms in the ith amino acid residue are within 4 Å to any non-hydrogen atom in the jth nucleotide. The Q-value of a predicted structure is defined as the number of the native contacts in the predicted structure divided by the total number of the native contacts in the native structure.

3 RESULTS AND DISCUSSION 3.1 Docking model for Protein and RNA Before showing our result, we introduce our model briefly. For protein-RNA docking, we used ZDOCK program, which is known as one of the best docking programs for predicting the structure of protein-protein complexes. 36 Unfortunately, the parameter for nucleic acids, such as partial charge of atoms, vdW radii, is not included in ZDOCK, we developed the parameter from force field for MD simulation. In detail, please see Material and Methods, Supplementary Figures S1 8 ACS Paragon Plus Environment

Page 8 of 29

Page 9 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

and S2.

3.2 Prediction performance The prediction performance of three docking methods (ZDOCK with our scoring method based on AMBER94 force field, GRAMM and FTDock) were benchmarked using the three different datasets. For dataset1, the success rates of FTdock for the top 1, top 10, top 100 and top 3600 predicted complexes were 11.1%, 20.8%, 34.7% and 70.8%, respectively. Similar results were obtained by GRAMM: top1: 2.8%; top10: 6.9%; top100: 23.6%; and top3600: 72.2% (Figure 1a). In contrast, the success rates of ZDOCK with our scoring method were 29.2% (top 1), 40.3% (top 10), 61.1% (top 100) and 87.5% (top 3600), which are much better than the results from the other two programs. For the other two datasets, ZDOCK with our scoring method was also found to be superior (Supplementary Figures S3a and S4a). In addition, the average numbers of near-native structures in 3,600 predicted structures (decoys) generated by ZDOCK with our scoring method were 4-6 times larger than those generated by FTdock and GRAMM (Figure 1b, Supplementary Figures S3b and S4b). Furthermore, overall success rates and the average number of near-native structures of ZDOCK with our scoring method were better than those obtained from the original ZDOCK, which does not use our proposed scoring (Figure 1, Supplementary Figures S3 and S4). For dataset1, Huang and Zou categorized the 72 targets of RNA-protein complexes into three groups, “easy”, “medium” and “difficult”, according to the deviations of unbound structures from bound ones. 33 In the “easy” group, ZDOCK with our scoring method could generate at least one near-native structure for 47 out of 49 targets (96%), whereas GRAMM and FTdock generated such structures for only 38 (78%) and 41 (84%) targets, respectively (Supplementary Table S1). Similarly for the “medium” group which includes 16 targets, ZDOCK with our scoring method, GRAMM and FTdock generated near-native structures for 14 (88%), 12 (75%) and 9 (56%) targets, respectively. However, none of the docking methods could find near-native structures for most of the targets in the “difficult” group. These results suggest that ZDOCK with our scoring method has an advantage over GRAMM and FTdock for predicting RNA-protein complex structures that 9 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

(a) 1

0.8 Success rate

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

0.6

Proposed ZDOCK (default)

0.4

0.7

FTDock

0.35

0.2

GRAMM

0 0

0 0

1200

2400

50

100 3600

Number of decoys

(b) GRAMM

FTDock

ZDOCK (default)

Proposed 0

5 10 15 20 25 30 Average number of near-native structures

Figure 1: Comparison of prediction performances for ZDOCK with our scoring method based on AMBER94 force field (Proposed), ZDOCK default, GRAMM 35 and FTdock 13 for dataset1 including 72 targets of RNA-protein complexes. 33 (a) Success rates for 3,600 generated decoys. Success rates for top-100-scoring decoys are shown in an inset. Success rate is the percentage of target RNA-protein complexes for which the docking software matches at least one near-native prediction within a given number of top-scoring decoys. Success rates for a top scoring structure: Proposed (29.2%), ZDOCK default (20.8%), GRAMM (2.8%) and FTdock (11.1%). (b) Average numbers of near-native structures (ligand-RMSD of predicted complexes less than 10 Å from a native RNA-protein complex) in 3,600 decoys. experience slight or moderate conformational changes upon binding to partner molecules. This advantage could be explained by the three major differences in their scoring functions. First, the electrostatic interactions in RNA are considered only in the scoring function of our method. In GRAMM, the score function does not contain the electrostatic interaction 35 and in FTdock, although the score for electrostatic interactions is calculated separately (ESratio), it is not incorporated into the final score. 13 Second, the score for shape complementary between RNA and protein is different in these programs. In GRAMM and FTdock, the same value is assigned

10 ACS Paragon Plus Environment

Page 10 of 29

Page 11 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

for all positions on the surface of molecules. 13,35 On the other hand, in ZDOCK (from version 2.1 49 ), the shape complementary score in a position is determined by the total number of neighbor atoms, which improves the performance of tertiary structure prediction in the case of proteinprotein complexes. 42 Third, additional scoring for the knowledge-based statistical potential energy is implemented only in ZDOCK 30 on which our method is based. The effect of three factors was investigated in latter section. In order to make more accurate predictions, ZDOCK with our scoring method was benchmarked again after changing the angular interval of rotation from 15◦ to 6◦ (Supplementary Figures S5, S6 and S7); Using 6◦ for the angle step, the success rate improved slightly: from 29.2% to 34.7% for the top 1, and from 40.3% to 45.8% for the top 10 structures. Recently, Huang et al. developed a novel RNA-protein docking protocol consisting of two programs: RPDOCK for generating decoys and DECK-RP for re-scoring the generated decoys. 45 In their protocol, the scoring function of RPDOCK was mainly based on FTdock and customized for RNA-protein complexes, for example, a special score is added for stacking interactions between aromatic side chains and nucleotide bases. According to their results, the success rates of RPDOCK for a top 1 prediction for dataset1 and dataset2 (called testing set II and testing set I in their paper) were estimated to be approximately 18% and 3%, respectively, whereas the success rates of ZDOCK with our scoring method were 29% and 17%. The success rates of RPDOCK for top 100 predictions were reported as 56% (dataset1) and 50% (dataset2), whereas the success rates of ZDOCK with our scoring method were 61% (dataset1) and 48% (dataset2) (Figure 1a, Supplementary Figure S3a). Each dataset contains not only unbound-unbound docking cases but also many unbound-bound docking cases. In one unbound-bound case, either protein or RNA is derived from same PDB entries of the native RNA-protein complexes. In another case, either protein or RNA is derived from similar RNA-protein complexes to native ones. In order to make real unbound-unbound cases, all of these unbound-bound cases were excluded from each datasets. After excluding the unboundbound cases, only 2, 9 and 5 unbound-unbound cases remained in dataset1, dataset2 and dataset3,

11 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 29

respectively. Using these three selected datasets for evaluation, the success rates of top1, top10, top100 and top3600 predictions of our method were 0%, 0%, 0% and 100% for dataset1, 0%, 0%, 0% and 60% for dataset2,and 0%, 0%,11% and 56% for dataset3. The detailed prediction results of ZDOCK with our scoring method and other three methods using these real unbound-unbound cases are shown in Supplementary Table 2. Note that the numbers of real unbound-unbound cases in the the current RNA-protein docking benchmark datasets are too small for evaluating the prediction performances of RNA-protein docking methods. Our results show that development of accurate docking methods for the real unbound-unbound cases is still challenging.

3.3 Evaluation of various force fields In the previous section, it is shown the electrostatic interaction may be the one of important factors for accurate prediction. For calculating the electrostatic interaction between protein and RNA, we imported the partial charge of RNA atoms from a force field for molecular dynamics simulation. In this section, we consider the prediction of three-dimensional structures using four force fields (amber94, amber03, CHARMM19 and CHARMM22), and compare the effects of different force fields on prediction performance (Figure 2 and Supplementary Figures S8 and S9). The calculation using three force fields (AMBER94, AMBER03 and CHARMM22) showed good and similar success rates, which reconfirmed current force fields for MD simulation are quite accurate, as well as other studies. 27–29 On the other hand, the success rate of CHARMM19 was lower (Figure 2a and Supplementary Figures S8a and S9a). A similar tendency was observed in terms of the average numbers of near-native structures: the results of AMBER94, AMBER03 and CHARMM22 are almost same, while the result of CHARMM19 is worse (Figure 2b and Supplementary Figures S8b and S9b). To understand the reason for these differences, we investigated the differences between CHARMM19 and the other force fields and found that the total charge per nucleotide is different: -0.32 in CHARMM19 and -1.0 in the other three force fields. This means the electrostatic interaction of CHARMM19 is about 3 times weaker than that of the other force fields. Thus, the poorer per12 ACS Paragon Plus Environment

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(a) 1

0.8 Success rate

Page 13 of 29

0.6 amber94 amber03 0.4

0.7

charmm19

0.35

charmm22

0.2 0 0

50

100

0 0

1200 2400 Number of decoys

3600

(b)

charmm22

charmm19

amber03

amber94 0

5 10 15 20 25 Average number of near-native structures

30

Figure 2: Prediction performance of our method combined with four different force fields (AMBER94, AMBER03, CHARMM19 and CHARMM22) for dataset1 which contains 72 targets. 33 (a) Success rates for 3,600 generated decoys. Success rates for the top 100 highest-scoring decoys are shown in the inset. Success rates for the top scoring structure: AMBER94 (29.2%), AMBER03 (27.8%), CHARMM19 (25.0%) and CHARMM22 (27.8%). (b) Average numbers of near-native structures (L-RMSD less than 10 Å from a native RNA-protein complex) in 3,600 decoys. formance of CHARMM19 may be attributed to an underestimation of the electrostatic interaction.

3.4 Investigation of electrostatic effects Most of the recent studies which performed an analysis of the interfaces of RNA-protein complex structures consistently showed the importance of the electrostatic interactions between the positively charged amino acids of proteins and the negatively charged phosphate groups of RNA. 14–16,18 These observations imply that the scoring for electrostatic interactions should seriously affect the prediction accuracy of RNA-protein docking. In fact, we showed that predictions using CHARMM19 were less accurate, which may be attributed to an underestimation of the electrostatic interaction

13 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

(a) 1

0.8 Success rate

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

w/o elec 0.6

elec x0.5 elec x1

0.4

0.2

0.7

elec x2

0.35

elec x10 elec x50

0 0

0 0

1200

2400

50

100 3600

Number of decoys

(b) w/o elec elec x0.5 elec x1 elec x2 elec x10 elec x50 0

5 10 15 20 25 Average number of near-native structures

30

Figure 3: Prediction performance of our method with various weights for a scoring of electrostatic interactions for dataset1 which contains 72 targets. 33 (a) Success rates for 3,600 generated decoys. Success rates for the top 100 highest-scoring decoys are shown in the inset. Success rates for a top scoring structure: w/o elec (20.8%), elec ×0.5 (25.0%), elec ×1 (29.2%), elec ×2 (27.8%), elec ×10 (18.1%) and elec ×50 (4.2%). (b) Average numbers of near-native structures (L-RMSD less than 10 Å from a native RNA-protein complex) in 3,600 decoys. The AMBER94 force field was used. in previous section. To investigate the effects of electrostatic interactions on the prediction performance, several weights were applied to the electrostatic interactions in the scoring function in ZDOCK with our scoring method based on AMBER94 force field (Figure 3 and Supplementary Figures S10 and S11). Note that weight 1 (elec x1) indicates the default weight from the original ZDOCK which was already optimized for protein-protein docking. As expected from the analysis mentioned above, removing electrostatic interactions from the scoring led to a remarkable decrease in overall success rate. In addition, a greater weight, such as 10- or 50-fold, for this interaction also decreased the success rate (Figure 3a and Supplementary Figures S10a and S11a). However, using such greater weights slightly increased the average 14 ACS Paragon Plus Environment

Page 14 of 29

Page 15 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(a) 2ZKO, Normal weight

(b) 1K8W, Normal weight

(c) 2ZKO, 50-fold weight

(d) 1K8W, 50-fold weight

Figure 4: Representative near-native structures generated for two target complexes; dsRNA interacting with the NS1 protein of human influenza virus A (PDBID: 2ZKO), T stem-loop RNA interacting with pseudouridine synthase TruB (PDBID: 1K8W). (a) and (b) Near-native structures generated by our method with normal weight for scoring of electrostatic interactions. (c) and (d) Near-native structures generated by our method with 50-fold weight. Native RNA structures are red-colored. number of near-native structures (Figure 3b and Supplementary Figures S10b and S11b). To understand this discrepancy, we investigated the predicted near-native structures in detail, and found the predicted structures with highest weight tended to include steric clashes between RNA and proteins in several targets. For example, some atoms of the RNA clearly overlapped with those of the protein (Figure 4), suggesting that higher weights for the electrostatic interaction reduce the penalty in the shape complementary score for overlapping atoms and lead to the generation of artificial near-native structures. For dataset1, Supplementary Table S1 shows the “rank”, which indicates the order of the near-

15 ACS Paragon Plus Environment

1

0 -1 -2 -3

-1 -2 -3 -4

-5

-5 10

20

30

40

50

60

1

0

-4 70

-1 -2 -3 -4

30

40

50

-5

0

10

20

0.8

1

30

40

50

60

RMSD 2 1

0 -1 -2 -3

0 -1 -2 -3 -4 -5 -6 -7

-5

Q-value

-5

60

-4 0.6

-4

-8 20

zdock score

1

zdock score

2

0

0.4

-3

RMSD

1

0.2

-2

-7 10

RMSD

0

0 -1

-6

0

2

Page 16 of 29

2

zdock score

2

1

zdock score

2

0

zdock score

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

zdock score

Journal of Chemical Theory and Computation

-8 0

0.2

0.4

0.6

0.8

1

0

Q-value

0.2

0.4

0.6

0.8

1

Q-value

Figure 5: An example of changing the weight of the electrostatic interaction in our method for a target involving double stranded RNA interacting with the NS1 protein of the human influenza virus (PDBID: 2ZKO): 50 (a) without scoring for electrostatic interactions, (b) normal weight and (c) 50-fold weight for electrostatic interactions. The top images are the highest-scoring decoys generated by our method. The protein fixed at the center is shown as a surface, colored according to its electrostatic potential (blue: positive and red: negative). The centroids of the predicted RNAs are shown as spheres, colored according to their ligand-RMSD from a native structure (a smaller RMSD is shown in dark green). The middle figures show the score-RMSD dependencies of the 3,600 decoys: the y-axis represents the Z-score of the ZDOCK score, the x-axis represents the ligand-RMSD. The bottom figures show the score-Q-value dependencies of the 3,600 decoys: the y-axis represents the Z-score of the ZDOCK score, the x-axis represents the Q-value. In the middle and bottom figures, near-native structures are plotted in red and non near-native decoys in black. Near-native structures are plotted in red and non near-native decoys in black. native structure with the highest score among the 3,600 decoys for each target. For 16 out of 72 targets, using a 50-fold weight for electrostatic interactions provided a better rank than that obtained using normal weight. In contrast, using a 50-fold weight provided a worse rank than that obtained using normal weight for 47 targets. For the remainder (9 targets), the ranks are the same. Figure 5 shows a typical example of our sampling results, when the 50-fold weight provides a better rank than normal weight. In this result, for a target involving double stranded RNA (dsRNA) interacting with the NS1 protein of the human influenza virus (PDBID: 2ZKO), 50 the predicted

16 ACS Paragon Plus Environment

Page 17 of 29

2

2

2

0

-2 -4 -6 -8 -10

0

zdock score

zdock score

zdock score

0

-2 -4 -6 -8

-14

-12 0

20

40

60

80

100

120

2

10

20

30

40

50

60

70

80

90

0

-6 -8 -10

-2 -4 -6 -8

-14

-12 0.8

1

40

50

60

-2 -4 -6 -8

-10

-12

30

0

zdock score

zdock score

-4

Q-value

20

2

0

0.6

10

RMSD

2

0

0.4

-6

RMSD

-2

0.2

-4

-10 0

RMSD

0

-2

-8

-10

-12

zdock score

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

-10 0

0.2

0.4

0.6

0.8

1

0

Q-value

0.2

0.4

0.6

0.8

1

Q-value

Figure 6: An example of changing the weight of electrostatic interaction in our method for a target involving eukaryotic dimethylallyltransferase interacting with tRNA (PDBID: 3EPH): 51 (a) without scoring for electrostatic interactions, (b) normal weight and (c) 50-fold weight for electrostatic interactions. The top images are the highest-scoring decoys generated by our method. The protein fixed at the center is shown as a surface, colored according to its electrostatic potential (blue: positive and red: negative). The centroids of the predicted RNAs are shown as spheres, colored according to their ligand-RMSD from a native structure (a smaller RMSD is shown in dark green). The middle figures show the score-RMSD dependencies of the 3,600 decoys: the y-axis represents the Z-score of the ZDOCK score, the x-axis represents the ligand-RMSD. The bottom figures show the score-Q-value dependencies of the 3,600 decoys: the y-axis represents the Z-score of the ZDOCK score, the x-axis represents the Q-value. In the middle and bottom figures, near-native structures are plotted in red and non near-native decoys in black. dsRNAs were located adjacent to a positively charged surface of the NS1 protein when using the 50-fold weight, whereas the dsRNAs were distant from the positively charged surface when using the normal weight. This suggests that the electrostatic interaction between the dsRNA and the NS1 protein is mainly responsible for complex formation in this case. On the other hand, Figure 6 shows another example of our sampling results, when the normal and 50-fold weights provide the same rank. In this result, for a target involving eukaryote dimethylallyltransferase interacting with tRNA (PDBID: 3EPH), 51 most predicted tRNAs were located at the indented surface of the protein. In fact, we manually checked these structures and found the tRNA tightly fitted this 17 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

indented protein surface (data not shown). This suggests that shape complementarity is important for binding in this case. In Supplementary Figure S12, we analyzed the effect of electrostatics according to the two types of RNA (tRNA and Double Helix; these are abundant in the dataset). Interestingly, the average rank for double helices is more sensitive to changes in the weight of electrostatic interactions, in comparison with the rank for tRNAs. Especially, normal or smaller weights provide worse average rank for double helices, in comparison with tRNAs. This result suggests that the effect of electrostatic interactions depends on the types of RNA.

3.5 Investigation of the shape complementary score effect Previously, it is shown that the shape complementary score is different in these docking programs, which may effects the prediction accuracy significantly. To evaluate the effect of the shape complementary score on the prediction of RNA-protein complexes, we firstly surveyed how the values of van der Waals (VDW) radii for RNA atoms which may affect the prediction accuracy, because the proper values of van der Waals radii are needed for precise representation of molecule shape and calculation of the shape complementarity score in docking simulation. However, the values of VDW radii for 4 types of nucleotides (adenine, guanine, cytosine and uracil) are not supplied in original ZDOCK. If an RNA-protein docking calculation is carried out using original ZDOCK, 1.90 Å is routinely assigned for the VDW radii of all atoms of RNA.In this study, we provided the VDW radii set for each atom of RNA by reference to the CHARMM19 force field (see details in Material and Methods). To investigate how the VDW radii effects the prediction accuracy, the prediction performances were compared for ZDOCK only using the shape complementary score (i.e. omitting the electorostatic potential and statistical potential) with and without our VDW radii (Figure 7, Supplementary Figures S13 and S14). In all datasets, success rates of ZDOCK with our VDW radii set were improved compared to those from ZDOCK with default VDW radii, for example, in Figure 7, success rate for a top scoring structure is 25.0% (our VDW radii), 20.8% (ZDOCK default), whereas the number of near native structures did not change dramatically. These results 18 ACS Paragon Plus Environment

Page 18 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(a) 1

0.8 Success rate

Page 19 of 29

0.6 our VDW radii

0.4

0.7

ZDOCK (default)

0.35

0.2

0 0

0 0

1200

2400

50

100 3600

Number of decoys

(b)

ZDOCK (default)

our VDW radii

0

5 10 15 20 25 30 Average number of near-native structures

Figure 7: Comparison of prediction performances between ZDOCK using only shape complementary score with only our VDW radii and ZDOCK default (without using our parameters), for dataset1, which includes 72 targets of RNA-protein complexes. 33 (a) Success rates for 3,600 generated decoys. Success rates for top100-scoring decoys are shown in an inset. Success rate for a top scoring structure: our VDW radii (25.0%), ZDOCK default (20.8%). (b) Average numbers of near-native structures (ligand-RMSD of predicted complexes less than 10 Å from a native RNA-protein complex) in 3,600 decoys. indicate that the proper values of VDW radii for biomolecules are required for accurate predictions. And, surprinsingly, prediction accuracy using only shape complementary score is relatively high. For example, success rate for a top scoring structure is 29.2% in our proposed method and 25.0% in case of using only shape complementary score. Next, we compared the results using only shape complementary score with those of the other two programs for the RNA-protein datasets used in this study. The performance of ZDOCK with only the shape complementary score (Figure 7, Supplementary Figures S13 and S14) was still better than that of GRAMM or FTdock (Figure 1, Supplementary Figures S3 and S4) for all datasets. For example, in dataset1, success rate for a top scoring structure is 25.0% (ZDOCK only using the shape complementary score with 19 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

our vdW radii), 20.8% (ZDOCK only using the shape complementary score with default vdW radii), GRAMM (2.8%) and FTdock (11.1%). The results showed the shape complementary score also played the important role in our prediction accuracy. Especially, the better performance was showed by ZDOCK only using the shape complementary score comparing with other two prediction programs, which suggests the shape complementary score in original ZDOCK is quite well developed.

3.6 Investigation of the statistical potential effect Lastly, we investigate the effect of the statistical potential on the prediction results. Our method was benchmarked with and without the statistical potential (See Materials and Methods section for the details of statistical potentials). The statistical potential affected the prediction accuracy differently in each datasets. For dataset1, the success rate and the number of near-native structures were similar in the presence and absence of our statistical potential (Supplementary Figure S15). For dataset2, the number of near-native structures decreased in case of not using statistical potential, although the success rate was similar in both cases (Supplementary Figure S16). For dataset3, the success rate for 3,600 decoys increased without the statistical potential, on the other hands, the number of near-native structures was similar in both cases (Supplementary Figure S17). These results suggest that the effect of the statistical potential depends on the situation, and it is difficult to discuss the reason.

3.7 Future directions Recently, several studies have introduced novel rescoring methods 9,12,52,53 to discriminate the nearnative structures from a large number of decoys generated by existing sampling methods which were not specifically developed for RNA-protein docking. We have shown that our method provides more effective sampling for RNA-protein docking. In the future, our sampling method will be combined with these sophisticated rescoring methods using initial structures generated by our method as input to improve the overall prediction accuracy of the RNA-protein docking. 20 ACS Paragon Plus Environment

Page 20 of 29

Page 21 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

4 Conclusion In this study, we have proposed the integration of physico-chemical interaction modeled by force field for molecular dynamics (MD) simulation into the scoring function of ZDOCK, one of the best rigid body docking software for protein-protein complexes, to enable accurate prediction of threedimensional structures of RNA-protein complexes. Our comprehensive benchmarking showed that the proposed method had higher success rates and produced more near-native structures than the existing methods. In our scoring function, the electrostatic interaction modeled by force field and the shape complementarity score are particularly effective for improving the prediction accuracy. Overall, this study provides fundamental insights into the strategies of rigid body docking for predicting the three-dimensional structures of RNA-protein complexes. And, our study shows current force field is quite accurate as well as other study. 27–29

Funding This work was supported by MEXT KAKENHI Grant Numbers 25240044 (Grant-in-Aid for Scientific Research (A) to KA, in part), 24680031 (Grant-in-Aid for Young Scientists (A) to MH, in part) and 40415774 (Grant-in-Aid for Young Scientists (B) to TK, in part) and a Grant-in-Aid for Scientific Research on Innovative Areas (221S0002) from the Japanese Ministry of Education, Culture, Sports, Science and Technology.

Acknowledgement Computational docking was performed using the super computer system provided by the National Institute of Genetics (NIG), Research Organization of Information and Systems (ROIS), Japan.

Supporting Information Available Our proposed method integrates information about the Van der Waals radii, the partial charge of atoms and the index numbers for the statistical potential into ZDOCK through a uniCHARMM 21 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

file. Four variations of uniCHARMM files named uniCHARMM-XXXXX (where XXXXX corresponds to the name of force field) are provided to use our method with ZDOCK. This material is available free of charge via the Internet at http://pubs.acs.org/.

References (1) Hafner, M.; Landthaler, M.; Burger, L.; Khorshid, M.; Hausser, J.; Berninger, P.; Rothballer, A.; Ascano, M.; Jungkamp, A. C.; Munschauer, M.; Ulrich, A.; Wardle, G. S.; Dewell, S.; Zavolan, M.; Tuschl, T. Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 2010, 141, 129–141. (2) Baltz, A. G.; Munschauer, M.; Schwanhausser, B.; Vasile, A.; Murakawa, Y.; Schueler, M.; Youngs, N.; Penfold-Brown, D.; Drew, K.; Milek, M.; Wyler, E.; Bonneau, R.; Selbach, M.; Dieterich, C.; Landthaler, M. The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell 2012, 46, 674–690. (3) Castello, A.; Fischer, B.; Eichelbaum, K.; Horos, R.; Beckmann, B. M.; Strein, C.; Davey, N. E.; Humphreys, D. T.; Preiss, T.; Steinmetz, L. M.; Krijgsveld, J.; Hentze, M. W. Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell 2012, 149, 1393–1406. (4) Ray, D.; Kazan, H.; Cook, K. B.; Weirauch, M. T.; Najafabadi, H. S.; Li, X.; Gueroussov, S.; Albu, M.; Zheng, H.; Yang, A.; Na, H.; Irimia, M.; Matzat, L. H.; Dale, R. K.; Smith, S. A.; Yarosh, C. A.; Kelly, S. M.; Nabet, B.; Mecenas, D.; Li, W.; Laishram, R. S.; Qiao, M.; Lipshitz, H. D.; Piano, F.; Corbett, A. H.; Carstens, R. P.; Frey, B. J.; Anderson, R. A.; Lynch, K. W.; Penalva, L. O.; Lei, E. P.; Fraser, A. G.; Blencowe, B. J.; Morris, Q. D.; Hughes, T. R. A compendium of RNA-binding motifs for decoding gene regulation. Nature 2013, 499, 172–177. (5) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.;

22 ACS Paragon Plus Environment

Page 22 of 29

Page 23 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235– 242. (6) Tuszynska, I.; Matelska, D.; Magnus, M.; Chojnowski, G.; Kasprzak, J. M.; Kozlowski, L. P.; Dunin-Horkawicz, S.; Bujnicki, J. M. Computational modeling of protein-RNA complex structures. Methods 2014, 65, 310–319. (7) Puton, T.; Kozlowski, L.; Tuszynska, I.; Rother, K.; Bujnicki, J. M. Computational methods for prediction of protein-RNA interactions. J. Struct. Biol. 2012, 179, 261–268. (8) Perez-Cano, L.; Fernandez-Recio, J. Optimal protein-RNA area, OPRA: a propensity-based method to identify RNA-binding sites on proteins. Proteins 2010, 78, 25–35. (9) Tuszynska, I.; Bujnicki, J. M. DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinformatics 2011, 12, 348. (10) Setny, P.; Zacharias, M. A coarse-grained force field for Protein-RNA docking. Nucleic Acids Res. 2011, 39, 9118–9129. (11) Li, C. H.; Cao, L. B.; Su, J. G.; Yang, Y. X.; Wang, C. X. A new residue-nucleotide propensity potential with structural information considered for discriminating protein-RNA docking decoys. Proteins 2012, 80, 14–24. (12) Huang, S. Y.; Zou, X. A knowledge-based scoring function for protein-RNA interactions derived from a statistical mechanics-based iterative method. Nucleic Acids Res. 2014, 42, e55. (13) Gabb, H. A.; Jackson, R. M.; Sternberg, M. J. Modelling protein docking using shape complementarity, electrostatics and biochemical information. J. Mol. Biol. 1997, 272, 106–120. (14) Allers, J.; Shamoo, Y. Structure-based analysis of protein-RNA interactions using the program ENTANGLE. J. Mol. Biol. 2001, 311, 75–86.

23 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(15) Bahadur, R. P.; Zacharias, M.; Janin, J. Dissecting protein-RNA recognition sites. Nucleic Acids Res. 2008, 36, 2705–2716. (16) Ellis, J. J.; Broom, M.; Jones, S. Protein-RNA interactions: structural analysis and functional classes. Proteins 2007, 66, 903–911. (17) Sonavane, S.; Chakrabarti, P. Cavities in protein-DNA and protein-RNA interfaces. Nucleic Acids Res. 2009, 37, 4613–4620. (18) Gupta, A.; Gribskov, M. The role of RNA sequence and structure in RNA–protein interactions. J. Mol. Biol. 2011, 409, 574–587. (19) Iwakiri, J.; Tateishi, H.; Chakraborty, A.; Patil, P.; Kenmochi, N. Dissecting the protein-RNA interface: the role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res. 2012, 40, 3299–3306. (20) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 1995, 117, 5179–5197. (21) Duan, Y.; Wu, C.; Chowdhury, S.; Lee, M. C.; Xiong, G.; Zhang, W.; Yang, R.; Cieplak, P.; Luo, R.; Lee, T.; Caldwell, J.; Wang, J.; Kollman, P. A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 2003, 24, 1999–2012. (22) Lee, M. C.; Duan, Y. Distinguish protein decoys by using a scoring function based on a new AMBER force field, short molecular dynamics simulations, and the generalized born solvent model. Proteins 2004, 55, 620–634. (23) Neria, E.; Fischer, S.; Karplus, M. Simulation of activation free energies in molecular systems. J. Chem. Phys. 1996, 105, 1902. 24 ACS Paragon Plus Environment

Page 24 of 29

Page 25 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(24) Nilsson, L.; Karplus, M. Empirical Energy Functions for Energy Minimizations and Dynamics of Nucleic Acids. J. Comput. Chem. 1986, 7, 591–616. (25) Reiher, W., III. Theoretical Studies of Hydrogen Bonding. Ph.D. Thesis, Department of Chemistry, Harvard University, Cambridge, MA, USA 1985, (26) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; WiorkiewiczKuczera, J.; Yin, D.; Karplus, M. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 1998, 102, 3586–3616. (27) Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Shaw, D. E. How fast-folding proteins fold. Science 2011, 334, 517–520. (28) Shan, Y.; Kim, E. T.; Eastwood, M. P.; Dror, R. O.; Seeliger, M. A.; Shaw, D. E. How does a drug molecule find its target binding site? J. Am. Chem. Soc. 2011, 133, 9181–9183. (29) Sakuraba, S.; Asai, K.; Kameda, T. Predicting RNA Duplex Dimerization Free-Energy Changes upon Mutations Using Molecular Dynamics Simulations. J. Phys. Chem. Lett. 2015, 6, 4348–4351. (30) Mintseris, J.; Pierce, B.; Wiehe, K.; Anderson, R.; Chen, R.; Weng, Z. Integrating statistical pair potentials into protein complex prediction. Proteins 2007, 69, 511–520. (31) Perez-Cano, L.; Jimenez-Garcia, B.; Fernandez-Recio, J. A protein-RNA docking benchmark (II): extended set from experimental and homology modeling data. Proteins 2012, 80, 1872– 1882. (32) Barik, A.; C, N.; P, M.; Bahadur, R. P. A protein-RNA docking benchmark (I): nonredundant cases. Proteins 2012, 80, 1866–1871. 25 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(33) Huang, S. Y.; Zou, X. A nonredundant structure dataset for benchmarking protein-RNA computational docking. J. Comput. Chem. 2013, 34, 311–318. (34) Chen, R.; Weng, Z. Docking unbound proteins using shape complementarity, desolvation, and electrostatics. Proteins 2002, 47, 281–294. (35) Katchalski-Katzir, E.; Shariv, I.; Eisenstein, M.; Friesem, A. A.; Aflalo, C.; Vakser, I. A. Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 2195–2199. (36) Hwang, H.; Vreven, T.; Janin, J.; Weng, Z. Protein-protein docking benchmark version 4.0. Proteins 2010, 78, 3111–3114. (37) Hwang, H.; Vreven, T.; Pierce, B. G.; Hung, J. H.; Weng, Z. Performance of ZDOCK and ZRANK in CAPRI rounds 13-19. Proteins 2010, 78, 3104–3110. (38) Pierce, B. G.; Hourai, Y.; Weng, Z. Accelerating protein docking in ZDOCK using an advanced 3D convolution library. PLoS ONE 2011, 6, e24657. (39) Perez-Cano, L.; Solernou, A.; Pons, C.; Fernandez-Recio, J. Structural prediction of proteinRNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput. 2010, 293–301. (40) Parisien, M.; Wang, X.; Perdrizet, G.; Lamphear, C.; Fierke, C. A.; Maheshwari, K. C.; Wilde, M. J.; Sosnick, T. R.; Pan, T. Discovering RNA-protein interactome by using chemical context profiling of the RNA-protein interface. Cell Rep. 2013, 3, 1703–1713. (41) Zhang, C.; Vasmatzis, G.; Cornette, J. L.; DeLisi, C. Determination of atomic desolvation energies from the structures of crystallized proteins. J. Mol. Biol. 1997, 267, 707–726. (42) Chen, R.; Li, L.; Weng, Z. ZDOCK: an initial-stage protein-docking algorithm. Proteins 2003, 52, 80–87.

26 ACS Paragon Plus Environment

Page 26 of 29

Page 27 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(43) Larkin, M. A.; Blackshields, G.; Brown, N. P.; Chenna, R.; McGettigan, P. A.; McWilliam, H.; Valentin, F.; Wallace, I. M.; Wilm, A.; Lopez, R.; Thompson, J. D.; Gibson, T. J.; Higgins, D. G. Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23, 2947–2948. (44) McLachlan, A. D. Rapid comparison of protein structures. Acta Crystallogr., Sect. A: Cryst. Phys., Diffr., Theor. Gen. Crystallogr. 1982, 38, 871–873. (45) Huang, Y.; Liu, S.; Guo, D.; Li, L.; Xiao, Y. A novel protocol for three-dimensional structure prediction of RNA-protein complexes. Sci. Rep. 2013, 3, 1887. (46) Clementi, C.; Nymeyer, H.; Onuchic, J. N. Topological and energetic factors: what determines the structural details of the transition state ensemble and "en-route" intermediates for protein folding? An investigation for small globular proteins. J. Mol. Biol. 2000, 298, 937– 953. (47) Koga, N.; Takada, S. Roles of native topology and chain-length scaling in protein folding: a simulation study with a G¯o-like model. J. Mol. Biol. 2001, 313, 171–180. (48) Kameda, T. Importance of sequence specificity for predicting protein folding pathways: perturbed Gaussian chain model. Proteins 2003, 53, 616–628. (49) Chen, R.; Weng, Z. A novel shape complementarity scoring function for protein-protein docking. Proteins 2003, 51, 397–408. (50) Cheng, A.; Wong, S. M.; Yuan, Y. A. Structural basis for dsRNA recognition by NS1 protein of influenza A virus. Cell Res. 2009, 19, 187–195. (51) Zhou, C.; Huang, R. H. Crystallographic snapshots of eukaryotic dimethylallyltransferase acting on tRNA: insight into tRNA recognition and reaction mechanism. Proc. Natl. Acad. Sci. U.S.A. 2008, 105, 16142–16147.

27 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(52) Zhao, H.; Yang, Y.; Zhou, Y. Structure-based prediction of RNA-binding domains and RNAbinding sites and application to structural genomics targets. Nucleic Acids Res. 2011, 39, 3017–3025. (53) Guilhot-Gaudeffroy, A.; Froidevaux, C.; Aze, J.; Bernauer, J. Protein-RNA complexes and efficient automatic docking: expanding RosettaDock possibilities. PLoS ONE 2014, 9, e108928.

28 ACS Paragon Plus Environment

Page 28 of 29

Page 29 of 29

1

0.8

Success rate

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

0.6

0.4 Proposed (AMBER or CHARMM) 0.2

Existing docking software

0 0

1200

2400

3600

Number of decoys

Table of Contents Graphic

29 ACS Paragon Plus Environment