Protein-RNA Docking Using ICM - Journal of Chemical Theory and

Jul 17, 2018 - In this work, we considered three recently developed data sets of protein-RNA complexes to evaluate and improve the performance of the ...
1 downloads 0 Views 2MB Size
Subscriber access provided by UNIV OF DURHAM

Structure Prediction

Protein-RNA docking using ICM Yelena A. Arnautova, Ruben Abagyan, and Maxim Totrov J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.8b00293 • Publication Date (Web): 17 Jul 2018 Downloaded from http://pubs.acs.org on July 21, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

1

Protein-RNA docking using ICM Yelena A. Arnautova1, Ruben Abagyan2 and Maxim Totrov1* 1

Molsoft L.L.C., 11199 Sorrento Valley Road, S209, San Diego, CA 92121

2

Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego,

La Jolla, CA, 92093 ABSTRACT Protein-RNA interactions play an important role in many biological processes. Computational methods such as docking have been developed to complement existing biophysical and structural biology techniques. Computational prediction of protein-RNA complex structures includes two steps: generating candidate structures from the individual protein and RNA parts and scoring the generated poses to pick out the correct one. In this work, we considered three recently developed datasets of protein-RNA complexes to evaluate and improve the performance of FFT-based rigid-body docking algorithm implemented in the ICM package. An electrostatic term describing interactions between negatively-charged phosphate groups and positively charged protein residues was added to the energy function used during the docking step to take into account the greater role that electrostatic interactions play in protein-RNA complexes. Next, the docking results were used to optimize a scoring function including van der Waals, electrostatic and solvation terms. This optimization yielded a much smaller weight for the solvation term indicating that solvation energy may be less important for the scoring of protein-RNA structures. Re-scoring of the generated poses with the new scoring function led to much higher success rates while pose clustering by contact fingerprints produced further improvements, achieving a success rate of 0.66 for the 100 top structures. INTRODUCTION Protein-nucleic acid interactions play an important role in many biological processes, including gene regulation, transcription, splicing and translation. Recently discovered genetic regulation and editing mechanisms such as siRNA, miRNA and CRISPR all involve specific protein-RNA complexes. Understanding factors governing these interactions is crucial for practical applications in medicine, bioengineering and biotechnology. Study of protein-RNA interactions is greatly facilitated by structural information. The number of protein-RNA complexes deposited in the Protein Data Bank1 (PDB) is much smaller compared to protein or RNA structures because obtaining high resolution structures of protein-RNA complexes is a

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 28

2 more difficult task than experimental study of their separate components. This makes development of computational methods to complement current structural biology and biophysical efforts a very important goal. A few computational methods to predict 3D structure of a protein-RNA complex from protein and RNA components have been developed.2 These methods rely on rigid-body docking of RNA molecules to a protein receptor and subsequent scoring of the generated poses using a specially designed scoring function. Results of docking simulations reported for several benchmark datasets of protein-RNA complexes3, as well as addition of protein-RNA complexes to the list of targets for the Critical Assessment of Prediction of Interactions (CAPRI) exercise4, showed that although existing methods are powerful tools, accurate prediction of these systems remains a challenging task.5 Algorithms for the prediction of macromolecular complexes typically consist of two steps: initial sampling and re-scoring. Vast numbers of poses need to be evaluated during initial sampling, therefore fast simplified energy functions must be used during this stage. A more accurate and well-balanced (i.e., including all major contributions with appropriate weights) scoring function is introduced at the re-scoring stage. Some of the software tools developed originally for protein-protein docking, such as FTdock,6 GRAMM,7 PIPER,8 HADDOCK,9 RosettaDock,2b HDOCK10, PatchDock,11 ATTRACT,12 pyDOCK,13 DOT2,14 ClusPro,15 PEPSIDock16 were adapted for protein-RNA docking. However, it was observed4 that these programs often fail to generate native-like poses during the sampling stage because of incomplete sampling of the conformational space and deficiencies of their internal scoring functions2a, which were not designed for protein-RNA docking. Rigid-body docking schemes typically apply some form of ‘soft’ docking to accommodate flexibility of the partner molecules. Plasticity of RNA has a different character compared to that of a protein and may need different ‘softness’ parameters. Protein-RNA systems also differ from the protein-protein complexes in that protein-RNA interfaces often contain a large number of charged groups (positively charged protein residues and negatively charged phosphate groups), and therefore electrostatic interactions between them play larger role in pre-orientation of components17 and the formation of the final complex. To account for these effects, some of the existing docking methods were modified2a, 18 by adding an electrostatic energy term to the energy function used during sampling. Thus, Iwakiri et al.2a used ZDOCK, a popular docking software, combined with several physics-based all-atom force fields to study the effects of electrostatic interaction and shape complementary between protein and RNA on docking sampling. Results obtained for a large benchmark dataset revealed a significant improvement in prediction accuracy by adding an electrostatic term, compared with existing docking programs. The second part of a protein-RNA docking protocol is the re-scoring of generated poses to improve ranking of the near-native ones. Scoring functions may include physics-based, knowledge-based (statistical),19 and geometric complementarity terms as well as their combinations.2a, 18 Only a few potentials specifically designed to rank protein–RNA docking poses have been developed. For example, a combined function, composed of the shape complementarity score, electrostatics and van der Waals energy terms showed good discrimination of the near-native docking poses in bound docking18 (i.e., docking using protein and RNA conformations taken from the corresponding protein-RNA complex). It was found that geometry-based terms and electrostatics were the most important terms for the scoring of

ACS Paragon Plus Environment

Page 3 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

3 protein-RNA poses as compared to binding propensities and desolvation. When applied to the more realistic unbound docking, the scoring success depended strongly on the quantity and quality of the native-like solutions generated. At present, constructing accurate scoring functions is still not a well solved problem, and the performance of the scoring functions usually depends on the benchmarks used. It was also suggested18 that the higher flexibility of nucleic acids and the difficulty in recognizing their interaction surfaces due to less specific interactions makes protein-RNA docking more challenging. Despite the importance of protein-RNA interactions to many applications, the number of reported studies on the development of protein-RNA docking and scoring methods is still relatively small. This prompted us to apply our FFT-based docking method implemented in the ICM program to protein-RNA docking. ICM20 is an integrated molecular modeling and bioinformatics program, which includes a broad range of algorithms. It has been demonstrated to perform well when applied to such molecular modeling problems as protein loop modeling,21 ligand docking,22 virtual ligand screening20b, 23 (VLS) and homology modeling.24 The goal of this work is three-fold: (1) to evaluate and improve performance of our docking method for the protein-RNA complexes by adding an electrostatic term to the energy function employed during the initial FFT-based search; (2) to optimize a physics-based scoring function used for re-ranking and selection of the final predictions, and (3) to study the effect of clustering on the performance of the method. For comparison, we have also evaluated our docking and scoring methods on the protein-protein docking benchmark 4.0.25 Results of these simulations allowed us to further optimize a scoring function used for selecting correct protein-protein poses, and to identify energy terms that are relevant for protein-protein and protein-RNA complexes. MATERIALS AND METHODS Data sets Three benchmark data sets of RNA-protein complexes were used to evaluate performance of the ICM-based docking protocol and to compare it with other existing methods. All three sets (Table 1, Table S1 of Supporting Information) were derived from PDB1 and include structures determined by X-ray as well as NMR (the first model in NMR ensemble was used as a benchmark). The first data set (set1) was developed by Huang et al.3b and consists of 72 complexes, comprising 52 RNA−protein complexes with unbound structures of both protein and RNA (unbound−unbound targets), and 20 complexes with unbound structures of protein and bound structures of RNA (unbound−bound targets). It should be mentioned that the unboundunbound subset includes only one truly unbound complex, viz. the one with both protein and RNA conformations without bound partners in the corresponding experimental structures. Another 14 unbound−unbound and unbound-bound targets can be categorized as pseudounbound since their RNA parts are bound to proteins that have less than 30% sequence identity with respect to those in the reference complex structures. The second data set (set2) published by Perez-Cano et al.3a contains 71 experimental complexes including 9 unbound−unbound targets and 62 unbound−bound targets. All cases in set2 have available unbound protein structure, and include five cases with available unbound RNA structure, four cases with a pseudo‐unbound RNA structure, and 62 cases with the bound RNA form. In the case of RNA molecules, pseudo‐unbound RNA structures, i.e., those bound to a protein that had less than 35% sequence identity with respect to that in the reference complex structure, were included. The third data set by Barik et al.3c (set3) consists of 45 complexes comprising 9 truly unbound−unbound and 36

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 28

4 unbound−bound targets. The benchmark sets cover all major functional categories and contain complexes of different degree of difficulty for docking (protein and RNA flexibility). Table 1. Success rate for N lowest scoring poses generated for protein-RNA complexes. Total number of poses generated for each complex was 10000. ωel = 0 Ncompla) set1 set2 set3

72 (52) 71(9) 45(9)

Nprotb) 382/314 449/339 409/398

N

Irmsd, Åc) 2.08/1.47 2.68/1.66 2.37/1.92

10

100

3600

10000

0.48 0.29 0.22

0.65 0.44 0.33

0.85 0.74 0.64

0.89 0.80 0.78

meanRank

rmsdbest, Å

16 48 83

4.5 6.0 6.1

a)

number of complexes (number of unbound-unbound complexes); b) average/median number of residues in protein part of the complexes; c) average/median interface RMSD The protein-protein docking benchmark version 4.025 (Table S2, Supporting Information) was selected to study the role of electrostatics in protein-protein interactions and to compare the performance of our docking protocol in the protein-RNA vs. protein-protein docking. The dataset contains 228 non-redundant protein–protein complexes for which the complex structure and the constituent unbound structures are available. The nuclear magnetic resonance (NMR) structures for complexes were not included, however, structures of the unbound forms of the proteins determined by NMR were allowed. The benchmark consists of three groups of complexes, ‘rigid’, ‘medium’, and ‘difficult’, classified according to the interface Root Mean Square Deviation (RMSD) between the bound and the unbound forms of the binding partners. Complexes with larger interface RMSD's can be expected to be more difficult to predict. There are 154 ‘rigid’, 41 ‘medium’, and 33 ‘difficult’ complexes in the benchmark version 4.0. Multi-stage docking protocol We used the FFT-based docking method implemented in the ICM program version 3.8626 for the global rigid body sampling, with and without electrostatics. 1. FFT-based translation search and initial scoring function. Fast Fourier Transform is a wellestablished approach in the field of protein-protein docking, allowing dramatic acceleration of sampling for the translational degrees of freedom6-7. In some implementations27 rotational search is also FFT-accelerated to a varying extent. To take advantage of the FFT acceleration of the translational search, the scoring function must be represented in the form of a convolution of two functions Φ(r) and Ψ(r): E(rt)=∫∫∫Φ(r)Ψ(r-rt)dr

(1)

where Φ(r) represents and depends only on the atomic coordinates of the first macromolecule, and Ψ(r) of the second one. Provided such a representation, one can apply the theorem that convolution in the direct space corresponds to the multiplication in the reciprocal (Fouriertransform) space and calculate E(rt) via transformation (by FFT) of Φ(r) and Ψ(r) calculated on a grid for the two interacting macromolecules, multiplication of the two transforms, and a backtransformation (again by FFT). In principle, any pairwise energy function of the form qiqjF(rij) could be represented as a convolution using the corresponding potential qiF(r-ri) as Φ(r) and a delta-function qjδ(r-rj) as Ψ(r). Practically, it is preferable to use well-behaved, smoother functions that have quickly converging Fourier expansions so that relatively coarse-grain grids can be used to save computational time and memory. On the other hand, given that rigid-body

ACS Paragon Plus Environment

Page 5 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

5 docking is inherently inexact due to the lack of induced fit of the partner macromolecules, it is unnecessary and even undesirable to adhere exactly to the functional forms used in molecular mechanics force-field terms. Thus, simplified interaction potentials that capture coarse-grain behavior of the system can be used instead. Indeed, many of the FFT-based docking algorithms use shape complementarity potentials which reflect purely geometric fit of the two macromolecular surfaces. We chose to represent in the FFT search the aspects of the macromolecular interactions that generally persist even at the coarse-grain low resolution: ‘soft’ van der Waals interaction reflecting shape complementarity (even though some of it can be lost due to the lack of induced fit) and lipophilicity. These interactions are represented via two terms. Firstly, steric exclusion is taken into account using the following functional form of Φster(r) and Ψster(r):

Φster(r) = Σ wi ⋅ min(max(Evw(r,ri),0),1) Ψster(r) = Σ wj ⋅ exp(-(r-rj)2/λ2)

(2) (3)

where summations are over all atoms of the first (index i) or the second (index j) molecule (here and thereafter ‘molecule’ may also mean multi-chain assembly when appropriate). Evw(r,ri) is the standard force-field (ECEPP/328) interaction energy between atom i located at ri and a probe carbon atom located at r. Thus Φster is the van der Waals potential truncated to include only repulsive portion up to 1 kcal/mol. Weights wi are used to modulate contributions of side-chain atoms versus backbone. Weights of 0.33 were used for potentially highly mobile atoms of side chains (those past gamma positions) while weight of 1 was used for other side-chain and all backbone atoms. Gaussian function Ψster had width parameter λ=1 Å. To represent lipophilic interactions and van der Waals attractions we use Φlipo(r) and Ψlipo(r): Φlipo(r) = Σ wi ⋅ min(max(Evw(r,ri),-5),0) (4) 2 2 Ψlipo(r) = Σ wj ⋅ exp(-(r-rj) /λ ) (5) where summations are performed only over carbon and sulfur atoms of one (i) or the other (j) molecule. Thus Φlipo is exploiting the attractive part of the van der Waals potential calculated only for lipophilic atoms. 2. Systematic rotational search. Rotational degrees of freedom of the system of two interacting macromolecules was explored using systematic search. To minimize the number of grid potential calculations and forward-transforms, we applied rotations to both interacting partners: one of the molecules (or sub-assemblies) was rotated to explore globally the entire three-dimensional Euler angle (φ/ψ/θ) space, while the second molecule (or sub-assembly) was rotated by fine steps only within a small segment of the rotational space close to the initial position. The first type of rotational sampling used a polyhedron to set two angles (φ/ψ) and evenly distributed values in the 0-360° range to set the third angle (θ). The second type of sampling used angle values on a simple three-dimensional grid, i.e. iδ, jδ, kδ; where indices i,j,k go through the –n, n range and δ is the sampling step. While this approach results in a somewhat uneven sampling density, its advantage is that relatively small number of potential grids have to be calculated and forwardtransformed, e.g., for the sampling level used in the benchmarks 256 sets of grids were generated for one molecule and 125 for the other (vide infra for details), to search 256*125 = 32000 rotations. 3. Search levels. Our current implementation of the algorithm allows three levels of search thoroughness – coarse, medium and fine, although the benchmarking results reported herein were

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 28

6 all obtained with the ‘fine’ level. The coarse precision level uses an icosahedron (i.e. 12 φ/ψ pairs) and 72° θ angle step for global rotations on one side as well as a 3x3x3 fine grid with δ = 30° step on the other side. The medium precision level uses a dodecahedron (20 φ/ψ pairs), 60° θ step and a 5x5x5 fine grid with δ = 15°. The fine precision level uses a dodecahedronicosahedron hybrid polyhedron (32 φ/ψ pairs), 45° θ step and a 5x5x5 fine grid with δ = 9°. 4. Identification and accumulation of score minima. The Fourier inverse-transformation results in a 3D grid map of scores for all translations r(k,l,m), where k,l,m are grid node indexes. We next identify the local minima of the discrete function energy score E(k,l,m), using a simple algorithm – the value at each node is compared to the six neighbors along coordinate axes and 12 neighbors along diagonals in coordinate planes; a node is considered to be in a local minimum if the value is lower than all 18 neighbor values. For all identified minima, a simple electrostatic term is added, the minima are sorted and a subset (100 lowest, or all identified, whichever is less) is retained in each translation search and accumulated for further evaluation. 5. Addition of a simple electrostatic term. Electrostatics, in particular longer-range interactions of charged groups, is another term that plays significant role in molecular recognition at sub-atomic resolution. We chose to add a simple form of an electrostatic term to the FFT scoring function after initial identification of the minima via FFT convolution. The justification for this approach is that the long-range electrostatics is a relatively smooth, slowly varying term. Thus, it mostly changes relative depths of minima rather than creating or eliminating them. While electrostatics may have more drastic variations at a shorter range, considering again the coarse-grain nature of rigid body docking, we assume that these short-range effects cannot be modeled accurately until resolution close to the atomic-level is achieved (perhaps via refinement at later stages) and should be smoothened out. Also, at this stage we only consider electrostatic interactions of the charged (protonated basic and deprotonated acidic) groups. Thus, the electrostatic score term was calculated as: ∑∑ 

  = ,

∙ ,  

(6)

where the summations are over the atoms representing centers of the charged groups, e.g. Cγ of aspartate, Cδ of glutamate, Cζ of arginine, Nζ of lysine and P of the phosphate. Truncation at short distances (3 Å) reduces noise from the close contacts due to inexact fit. Simple distancedependent dielectric constant ε = ε0Rij electrostatic solvation screening approximation is used. The term was added to the score with a weight ωel which was optimized on a subset of the benchmark cases. Constant ε0 was set to 1. and, therefore, the weight of electrostatic term reflects the effective dielectric constant and Coulomb constant. It is possible to include the electrostatic term into the primary FFT scoring function directly. However, introduction of a third term would increase memory requirements and FFT calculations time by ~50%. In our tests, calculating the term only for the local minima adds ~14% to the computing time and the increase in memory requirements is negligible. Solution re-evaluation/re-ranking using a solvation scoring function We have previously found that a scoring function with an optimized Solvent Accessible Surface Area (SASA)-based solvation term can significantly improve ranking of the near-native docking solutions generated by Monte-Carlo sampling.29 Therefore, in the present work we also re-evaluated a fixed number (10000) of top solutions generated in the initial stages of our new protocol with a score that consists of the van der Waals, electrostatics, and solvation terms:

ACS Paragon Plus Environment

Page 7 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

7  = , +  ∙  +  ∙ 

(7)

where , is the combined van der Waals and hydrogen bonding terms calculated according to the ECEPP/328 force-field, Eel is the electrostatic term, and Scoresf (surface) is the solvation term. Pairwise van der Waals interactions were truncated at 0.25 kcal/mol to ‘soften’ the potential. The pairwise electrostatics term, Eel, is based on the Coulombic electrostatic energy with a distance-dependent dielectric constant (ε = 4r) and SChEM30 charge scaling to represent electrostatic screening. Partial atomic charges from the ECEPP/3 force field were used for the protein part of a complex while van der Waals parameters and charges for RNA atoms were taken from the AMBER force field.31 The solvation term, Scoresf, was computed with the atomic solvation parameters taken from32 and includes contributions of the polar, aliphatic and aromatic buried surfaces:  = !" # ∙ !" # +  $! ∙  $! +  #" ∙  #" (8) The original weights (see the "Initial" column of Table 2) of the scoring function (Eq. 7 and 8) were previously optimized using protein-protein complexes.29 Table 2. Weights of the scoring function from Eq. 3 optimized using different target functions Scoring function Target function wvw wel wsf (aliphatic) wsf (aromatic) wsf (polar)

Initialc) 1.00 2.00 1.27 6.26 2.30

Model 1a) Model 2b) sqrt3 sqrt3 All data Mean/RMSDd) All data Mean/RMSd) 1.00 1.00 1.00 ± 0.00 1.00±0.00 1.48 1.72 1.26 ± 0.20 1.55±0.71 0.51 0.00 0.41 ± 0.16 0.19±0.30 0.51 1.70 0.41 ± 0.16 0.95±0.87 0.51 0.41 0.41 ± 0.16 0.50±0.56

a)

two-parameter scoring function four-parameter scoring function c) original weights b)

Performance evaluation The ICM FFT docking protocol was used to generate 10,000 poses for a target RNA−protein or protein-protein complex. Ligand-RMSD was proposed in CAPRI blind proteinprotein docking evaluation experiments33 and is commonly used in the field to measure the deviation of the predicted pose from a corresponding native structure. The ligand-RMSD of a pose with respect to the experimental complex structure was calculated as follows. For the protein-RNA complexes protein chain(s) were considered as a receptor and RNA chain(s) as a ligand, while for the protein-protein complexes, one part of the complex (specified in the proteinprotein benchmark version 4.0) was a receptor and another one a ligand. First, ligand residues within 10 Å distance of the receptor-ligand interface in the native complex were identified. Next, for each complex, receptor sequences of the docked poses and the native structures were aligned to find the equivalent residues. For these receptor residues, heavy atoms of the receptor in the docking pose were superimposed onto those of the native structure. Then, the ligand-RMSD between the pose and the native structure was calculated using all heavy atoms of the ligand residues in the vicinity of the native interface identified earlier. Poses with ligand-RMSDs less than 10 Å were defined as "native-like” structures. All 10,000 generated poses were ranked according to their docking score. The performance of the

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 28

8 method was evaluated using a success rate defined as the percentage of complexes for which the docking software predicts at least one native-like pose within a given number of top-scoring poses. Success rates were calculated for all 10,000 poses retained in FFT stage, unless stated otherwise. To compare performance of the different energy functions used during docking for an entire dataset, we calculated geometric mean rank of native-like poses: %&'(&') = *+ ,

∑1 23 -.# -/ 0 4

5,

(9)

where ranki is a rank of the top-scoring native-like pose for a given complex and N is the total number of complexes in a dataset. We also calculated the average number of native-like poses generated per-complex. Success of rigid body docking strongly depends on how similar the conformations of components of a complex are to their structures in the unbound state. To estimate the degree of flexibility of either receptor or ligand parts of a complex, the RMSD between a given part (receptor or ligand) of the complex and a corresponding molecule(s) in unbound state was computed. Thus, to compare receptor conformations in bound and unbound states, residues from the bound structure that are within 10 Å radius of the interface were selected. These residues were superimposed onto corresponding residues in the unbound structure and the RMSDrec for all heavy atoms belonging to these residues was computed. The same procedure was performed for the ligand part of the complex (RMSDlig). The overall extent of induced fit was evaluated as the quadratic average RMSD of the two parts: (678.:. = ;

B C B ?@A DE

(10)

F

Optimization of the scoring function Optimization of the scoring function (Eq. 7 and 8) was carried out by minimizing a target function F. Two forms of the target function were considered:

F= F=

N train



min Rankinative

(11)

min Rankinative

(12)

i =1 N train



3

i =1

where minRanki is the rank of the native-like pose with the lowest score for a given complex. Ntrain is the number of complexes in a training set. The nonlinear form of the target functions results in a reduced relative contribution of the high ranks, therefore allowing optimization to focus on improving the ranks of lower-ranking poses. This emphasis on low-ranking poses is stronger when the cubic root functional form is used, and less pronounced with the square root, therefore, we chose to use the cubic root target function for parameter optimization. The geometric mean is another possiblility, but we found that its optimization results in a very strong focus on the ‘easy’ portion of the training set and no ranking improvement for more ‘difficult’ complexes. Optimization consisted of 12 runs carried out for different training sets that included docking poses generated for 70% of complexes from a given dataset selected randomly. The remaining 30% of complexes were used as the test data. Several optimization runs with the

ACS Paragon Plus Environment

Page 9 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

9 different training/test sets allowed us to establish whether the scoring function contained the optimal number of parameters to accurately describe binding interaction and whether the optimization would lead to overtraining of the scoring function. For each of the training sets, ten optimization runs were carried out starting from random weights (Eq. 7, 8) generated on the [0,3] grid. Each of these runs included 10 minimizations, starting from the parameters corresponding to the lowest value of the target function reached during the previous minimization. The Simplex method was used for minimizing F. The step size, maximum number of iterations and tolerance parameters were set to 0.5, 1000, and 0.001, respectively. Pose clustering To reduce the redundancy of the sets of solutions, we investigated clustering in the space of simple binary Contact Fingerprints (CFPs). CFPs were calculated as bitsets, where each bit determines whether a pair of residues (residue/base pair in case of protein/nucleic acid complex) are within 6 Å of each other, as calculated using their Cα (or P) atom positions. The fingerprints had the initial length n*m (n and m are sizes of macromolecules in residues and bases respectively) and were compressed (or extended) to fixed 2048-bit wide sets using a modular hashing function. Dissimilarity of any two poses was calculated as the Tanimoto distance34 between their CFPs. Use of CFPs and Tanimoto distances instead of a direct geometric measure such as RMSD allows very rapid calculation of a complete distance matrix for large sets of poses. It also naturally focuses pose comparison on similarities or differences in the interface rather than potentially large swings of remote non-interacting portions of the two macromolecules that may have little effect at the interface. The hierarchical tree clustering algorithm with un-weighted pair group average (UPGMA) was applied to join two branches. A cutoff value of 0.5, determined previously from protein-protein docking simulations, was used for clustering docking poses for protein-protein complexes. The optimal clustering distance cutoff for the protein-RNA complexes was determined by testing several values (0.3, 0.4 and 0.5). RESULTS AND DISCUSSION Role of electrostatic interactions in protein-RNA docking Several studies have shown2a, 35 that electrostatic interactions play an important role in protein-RNA docking. While our scoring function already included electrostatic interactions, the initial FFT-based search only considered steric and lipophilic interactions, which could lead to the loss of poses stabilized largely by electrostatics. To take electrostatic interactions into account in our docking protocol, an electrostatic term was added to the energy function used to rank and accumulate the minima in the FFT-based search. Docking simulations were carried out for the complexes from set1 using the modified function with different weights, ωel, of the electrostatic term. Results of these calculations are summarized in Table 3. It is clear that the addition of electrostatics noticeably improves docking predictions. Thus, the overall (among 10,000 poses retained on FFT stage) success rate increased from 0.72 for ωel = 0 to 0.79 for ωel = 1. The best success rate (Fig. 1 for the top 1,000 poses and Fig. S1 for 10,000 poses), 0.89, was obtained for ωel = 5. Further increase of ωel, to 10 or 40, decreased the success rate. Furthermore, meanRank and the average number of near-native structures obtained with ωel = 5 are better than those obtained without the electrostatic term (Table 3). It should be noted that the docking

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 28

10 protocol is efficient at generating native-like poses as indicated by the high average number of native-like poses (~90-100) obtained for all values of ωel. Figure 2 shows an overlay of the experimental and the top scoring docking poses generated using ωel = 0 and ωel = 5 for a zinc finger-RNA complex (PDB ID 1un6). No native-like poses were found with ωel = 0, while a native-like structure with RMSD=3.85 Å was located as the top-scoring pose for ωel = 5. Based on the highest success rate, ωel = 5 was adopted for all other protein-RNA docking simulations described in this work. Table 3. Results of protein-RNA docking obtained for complexes from set1 using different weight of the electrostatic term Nnativea) SuccessRate meanRank rmsdbest, Å ωel 0 1 5 10 15 20 30 40 a)

0.75 0.79 0.89 0.83 0.83 0.81 0.82 0.79

19 17 17 12 14 13 9 9

4.5 4.6 4.5 4.5 4.8 4.7 4.7 4.5

39 58 93 98 101 99 93 90

average number of native-like poses in 10000 poses generated.

ACS Paragon Plus Environment

Page 11 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

11 Figure 1. Success rate of the FFT docking protocol for set1 (72 complexes) for the top 1,000 poses. Weight of the electrostatic term, ωel, was set to 0 (red curve) and 5 (blue curve).

Figure 2. An overlay of the experimental structure of a zinc-finger-RNA complex (PDB ID1un6) and top scoring docking poses generated with ωel = 0. (orange) and 5. (magenta), respectively, using the unbound structure of the protein (PDB ID 2hgh, RMSDprotein=3.85 Å) and bound structure of RNA. Since the electrostatic term added to the docking energy function describes interactions between negatively charged phosphate groups of an RNA residues and positively charged protein residues, it may have stronger effect on the complexes with RNA molecules that form predominantly "backbone" (i.e., through phosphate groups) contacts with their protein partners. To test this hypothesis, set1, set2 and set3 were combined and two contact areas were computed for each protein-RNA complex (bound state) of the resulting set. One area, SAphosph, was for contacts between atoms of the phosphate groups and the protein heavy atoms and another one, SAbase, was for contacts between the heavy atoms of the RNA bases and the protein. All the complexes were then divided into two groups, "phosph" and "base", based on what type of contact area was larger for a given complex (complexes with SAphosph- SAbase 6 Å. At the same time, we found a small number of complexes with low (