Article pubs.acs.org/jcim

Cite This: J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

CoDockPP: A Multistage Approach for Global and Site-Specific Protein−Protein Docking

Ren Kong,†,⊥ Feng Wang,‡,⊥ Jian Zhang,§ Fengfei Wang,† and Shan Chang*,†

† Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
‡ School of Information Science & Engineering, Changzhou University, Changzhou 213164, China
§ Department of Pathophysiology, Key Laboratory of Cell Differentiation and Apoptosis of National Ministry of Education, Shanghai Jiao Tong University, School of Medicine, Shanghai 200025, China

*S Supporting Information

ABSTRACT: Protein−protein docking is an effective approach for studying the molecular mechanisms of essential biological processes mediated by complex protein−protein interactions. The fast Fourier transform (FFT) correlation approach strikes a good balance between exhaustive global sampling and computational efficiency for protein−protein docking. However, it is difficult to integrate a precise knowledge-based scoring function and site constraint information into the FFT-based approach. New docking strategies that combine global sampling with precise scoring are strongly needed. We propose a multistage protein−protein docking strategy called CoDockPP. This program takes full advantage of the sampling efficiency of the FFT-based method to choose valid ligand protein poses with good surface complementarity. The retained poses are transformed to real Cartesian space for the implementation of site constraints and atomic scoring. Site constraints and a rapid table lookup scoring are applied to gradually reduce the candidate poses to a tractable number. To enhance the accuracy of docking prediction, the local sampling points around the best fast-scoring poses are expanded, and these neighboring poses are then evaluated by the precise knowledge-based scoring function. In tests on the protein−protein docking benchmark 5.0, CoDockPP remarkably improves the success rate and hit count in both ab initio docking and site-specific docking, especially in difficult cases. The server is free and open to all users with no login requirement at http://codockpp.schanglab.org.cn.



Received: May 31, 2019
Published: July 5, 2019
DOI: 10.1021/acs.jcim.9b00445
© XXXX American Chemical Society

INTRODUCTION

Protein−protein interactions and recognition play important roles in many biological processes, such as protein expression regulation, signal transduction, cell-cycle control, and immune response.1−3 However, it is still difficult to determine a protein−protein complex structure through the experimental methods of structural biology.4 Molecular docking is an effective approach for predicting the complex structures of biological macromolecules.5−7 Motivated by the Critical Assessment of PRedicted Interactions (CAPRI) experiments, many groups have established state-of-the-art methods for protein−protein docking.8−10

Unlike protein−small molecule recognition, which occurs at specific binding pockets,11,12 protein−protein interactions usually occur on protein surfaces and involve a much larger interaction area.13−15 Therefore, global sampling is necessary for protein−protein docking. The fast Fourier transform (FFT) correlation approach evaluates the complementarity of two proteins extremely fast and samples the entire conformational space globally and systematically.16 FFT-based docking programs and web servers include GRAMM,17 DOT,18 HEX,19 FTDOCK,16 ZDOCK,20 PIPER,21 F2Dock,22 HDOCK,23 MDockPP,24 SDOCK,25 and FRODOCK.26 Early FFT-based programs like DOT,18 FTDOCK,16 and HEX19 incorporated only shape complementarity and electrostatic complementarity for fast evaluation of binding modes. These programs were also used by newer FFT-based methods, such as pyDock27 and PEPSIDock,28 for initial sampling. Desolvation effects were subsequently included in FFT-based programs like ZDOCK,20 FRODOCK,26 and SDOCK.25 Furthermore, pairwise knowledge-based contact potentials were successfully incorporated in FFT-based programs like PIPER/CLUSPRO.21 In addition, the fast manifold Fourier transform29 and GPU implementations30 were applied to FFT-based programs to enhance the sampling speed significantly. These FFT-based methods have achieved good performance in the CAPRI experiments.

In spite of this significant progress, FFT-based methods still need further development because of their limitations.31,32 First, since any extra atomic pairwise scoring function, such as a pairwise contact potential, has to be defined as a correlation function term, it is difficult to include a precise distance-dependent scoring function within the FFT approach.33 Accordingly, external scoring functions are usually applied after the FFT sampling, but decoupling the scoring from the sampling may lead to a loss of accuracy. Second, it is not easy for FFT-based docking programs to consider site constraint information in the sampling process, which is important for the accuracy and efficiency of conformation sampling.34 Third, the scoring functions of FFT-based docking programs are usually derived from native structures and are very sensitive to small changes in the atomic coordinates.31 As a result, these docking programs work extremely well for docking bound structures but fail in realistic unbound docking.

Here, we propose a multistage protein−protein docking strategy called CoDockPP to overcome the above limitations. A distance-dependent knowledge-based scoring function was trained on the observed atomic pair distribution function with both native structures and near-native structures to make it more robust to conformational changes. A hierarchical docking strategy was designed to improve accuracy and efficiency. An FFT-based method was used to systematically evaluate shape complementarity, and the conformation sampling was biased toward regions with good surface complementarity. The retained conformations were first filtered by site constraints, if such information was provided, and then further evaluated by the newly developed knowledge-based scoring function. The performance of CoDockPP was tested on benchmark 5.0 and compared with widely used docking programs such as ZDOCK and RosettaDock.

MATERIALS AND METHODS

Knowledge-Based Scoring Function. We scanned the Protein Data Bank (PDB) to obtain the protein−protein complexes for our training set. We queried all X-ray crystal structures with a resolution better than 2.5 Å to identify those PDB entries that contain only dimeric protein structures without RNA or DNA chains. The number of residues in the proteins ranges from 20 to 2000. To avoid redundancy, if two complexes had more than 30% sequence identity, only the one with the better resolution was kept. In addition, during the preparation of the data sets we removed entries having more than 100 missing atoms or more than 5 severe atomic clashes (a clash exists if the distance between a pair of atoms is less than 2.0 Å). To avoid introducing biases into the docking tests, we also used a sequence identity of 30% to remove the complexes that overlapped between the training set and the test set of protein−protein docking benchmark 5.0.35 The query yielded a total of 975 bound structures. The PDB codes of the complex structures in the training set are listed in Table S1.

The FTDock program was applied to sample the training set.16 We performed 100 independent bound−bound docking runs for each complex structure in the training data set. Each docking run randomly generated 1000 putative ligand protein poses, so a total of 100 × 1000 = 100 000 conformations were generated for each protein−protein complex. We randomly selected 5000 samples from these 100 000 conformations, including at least 10 near-native conformations. Then, a distance-dependent knowledge-based scoring function was trained with the statistical mechanics-based iterative method.36,37 In the previous method, the experimentally observed pair distribution function is defined as

$$g_{ij}^{\mathrm{obs}}(r) = \frac{\rho_{ij}^{\mathrm{obs}}(r)}{\rho_{ij,\mathrm{bulk}}^{\mathrm{obs}}} \tag{1}$$

where $\rho_{ij}^{\mathrm{obs}}(r)$ and $\rho_{ij,\mathrm{bulk}}^{\mathrm{obs}}$ are the densities of the $ij$ interatom pairs occurring in a spherical shell of radius from $r - \Delta r/2$ to $r + \Delta r/2$ and in a reference sphere of radius $R_{\max}$, respectively. We set the bin size $\Delta r$ to 0.25 Å based on the grid spacing of Autodock38,39 and the radius of the reference sphere $R_{\max}$ to 15 Å. The atomic pair densities $\rho_{ij}^{\mathrm{obs}}(r)$ and $\rho_{ij,\mathrm{bulk}}^{\mathrm{obs}}$ are calculated as

$$\rho_{ij}^{\mathrm{obs}}(r) = \frac{1}{M}\sum_{m=1}^{M}\frac{n_{ij}^{m}(r)}{4\pi r^{2}\Delta r} \quad \text{and} \quad \rho_{ij,\mathrm{bulk}}^{\mathrm{obs}} = \frac{1}{M}\sum_{m=1}^{M}\frac{N_{ij}^{m}}{V(R_{\max})} \tag{2}$$

where $M$ is the number of protein−protein complexes in the training data set, $n_{ij}^{m}(r)$ is the number of atom pairs $ij$ in the spherical shell for the $m$th native complex structure, $V(R_{\max})$ is the volume of the reference sphere and equals $4\pi R_{\max}^{3}/3$, and $N_{ij}^{m}$ is the total number of atom pairs $ij$ in the reference sphere for the $m$th native complex structure and equals $\sum_{r=0}^{R_{\max}} n_{ij}^{m}(r)$.

Previous studies have found that the exact native structure is hardly ever sampled in the process of docking.31,40 The native structure is one of the representative modes in the near-native conformational ensemble,41−43 and the goal of a docking program is to discriminate the near-native conformations, rather than only the native structure, from the incorrect ones. Therefore, unlike the previous method, the near-native conformations with L_RMSD < 2.5 Å were also treated as part of a flexible "native" conformational ensemble. The new atomic pair densities $\rho_{ij}^{\mathrm{obs}}(r)$ and $\rho_{ij,\mathrm{bulk}}^{\mathrm{obs}}$ are then calculated from these near-native conformational ensembles (i.e., the experimental native structure and the near-native modes) as

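$$\rho_{ij}^{\mathrm{obs}}(r) = \frac{1}{M L_{n}}\sum_{m=1}^{M}\sum_{l=1}^{L_{n}}\frac{n_{ij}^{ml}(r)}{4\pi r^{2}\Delta r} \quad \text{and} \quad \rho_{ij,\mathrm{bulk}}^{\mathrm{obs}} = \frac{1}{M L_{n}}\sum_{m=1}^{M}\sum_{l=1}^{L_{n}}\frac{N_{ij}^{ml}}{V(R_{\max})} \tag{3}$$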

where $L_{n}$ is the number of native and near-native modes, and $n_{ij}^{ml}(r)$ and $N_{ij}^{ml}$ are the numbers of atom pairs $ij$ in the spherical shell and in the reference sphere, respectively, for the $l$th near-native mode of the $m$th protein−protein complex. Following the previous method,36 we can calculate the initial potentials $\{u_{ij}^{(0)}(r)\}$ based on the improved $g_{ij}^{\mathrm{obs}}(r)$. An iterative process is used to optimize the pairwise potentials as

$$u_{ij}^{(1)}(r) = u_{ij}^{(0)}(r) + \Delta u_{ij}^{(0)}(r) = u_{ij}^{(0)}(r) + \lambda k_{B}T\left[\ln g_{ij}^{(0)}(r) - \ln g_{ij}^{\mathrm{obs}}(r)\right] \tag{4}$$

Using the rigorously derived effective pairwise potentials $\{u_{ij}(r)\}$, the knowledge-based scoring function for protein−protein docking is as follows

$$\mathrm{score} = \sum_{ij} u_{ij}(r) \tag{5}$$
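As an illustration of eqs 1−5, the following minimal Python sketch computes the observed pair distribution function for one atom-pair type from binned pair counts, applies a single iterative update, and scores a pose by table lookup. It is a sketch under stated assumptions, not the CoDockPP implementation: the function names, the array layout (one row per native or near-native mode), the step size `lam`, the unit choice `kT = 1`, and the small `eps` guard against empty bins are all hypothetical. Only the bin width of 0.25 Å and the 15 Å reference radius come from the text.

```python
import numpy as np

DELTA_R = 0.25                  # bin width in angstroms (from the text)
R_MAX = 15.0                    # reference-sphere radius in angstroms (from the text)
N_BINS = int(R_MAX / DELTA_R)   # 60 distance bins


def observed_pair_distribution(counts):
    """Eqs 1 and 3 for one atom-pair type ij.

    counts: array of shape (n_modes, N_BINS); each row holds n_ij(r) for one
    native or near-native mode (M * L_n rows in total). Layout is an assumption.
    """
    counts = np.asarray(counts, dtype=float)
    r = (np.arange(N_BINS) + 0.5) * DELTA_R            # bin centres
    shell_volume = 4.0 * np.pi * r**2 * DELTA_R
    rho_r = counts.mean(axis=0) / shell_volume         # average shell density
    sphere_volume = 4.0 * np.pi * R_MAX**3 / 3.0
    rho_bulk = counts.sum(axis=1).mean() / sphere_volume
    return rho_r / rho_bulk


def update_potential(u, g_pred, g_obs, lam=0.5, kT=1.0, eps=1e-8):
    """One step of eq 4; lam, kT, and eps are illustrative choices."""
    return u + lam * kT * (np.log(g_pred + eps) - np.log(g_obs + eps))


def score_pose(u_table, pair_types, distances):
    """Eq 5 by table lookup: sum u_ij(r) over atom pairs closer than R_MAX.

    u_table: array (n_pair_types, N_BINS); pair_types: integer type index per
    atom pair; distances: pair distances in angstroms.
    """
    pair_types = np.asarray(pair_types)
    distances = np.asarray(distances, dtype=float)
    inside = distances < R_MAX
    bins = (distances[inside] / DELTA_R).astype(int)
    return float(u_table[pair_types[inside], bins].sum())
```

With counts proportional to the shell volume (a uniform pair density), `observed_pair_distribution` returns values close to 1 in every bin, and `update_potential` leaves the potential unchanged when the predicted and observed distributions agree, which is the fixed point that the iteration in eq 4 aims for.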

Site Constraints. Biological information, especially site information, is very helpful for improving the sampling efficiency of a docking program.44−47 Site constraints are a set of atom-pair constraints that evaluate whether a residue interacts with the partner protein (receptor protein or ligand protein). More specifically, if we set a site constraint on a particular residue of the receptor protein, the site constraint consists of the distance constraints from the C-alpha of the defined residue to the C-alpha atoms of all residues in the ligand protein.34 For convenience of discussion, we choose the two closest residues on the interface as the two site constraint residues: one is located on the receptor protein interface and the other on the ligand protein interface. Interface residues are defined as pairs of residues, one from each protein monomer, whose C-alpha atoms lie within 10 Å of each other in the complex.48,49 When the two site constraints are treated as ambiguous, a conformation is retained if at least one of the sites lies on the interface of the receptor protein or ligand protein. These constraints are denoted as ambiguous constraints. When users require both sites to be satisfied, a conformation is retained only if both sites lie on the interface. These constraints are denoted as multiple constraints. Both the ambiguous constraints and the multiple constraints are tested in this work. To ensure a fair comparative evaluation, the conformations of RosettaDock were filtered with the same site constraint criterion as in the CoDockPP program.

Docking Protocol. We propose a multistage docking protocol called CoDockPP, which integrates shape complementarity, a knowledge-based scoring function, and site constraints. Figure 1 shows the flowchart of the docking protocol. In the first step, a 3D fast Fourier transform (FFT) is used to evaluate the shape complementarity. FFTW3 (ref 50) is used in the docking program to accelerate the computation of the FFT. An angle interval of 15° is used for rotational sampling, and a spacing of 1.2 Å is adopted for the FFT-based translational search. These rotations and translations produce approximately $10^{10}$ ligand protein poses. Previous studies have found that FFT-based methods systematically sample more than $10^{9}$ conformations but retain only $10^{3}$−$10^{4}$ structures for rescoring.31 Such strategies decouple the sampling from the scoring and are likely to eliminate possible near-native conformations from the small retained set. To overcome this limitation, our docking strategy retains many more conformations for rescoring. In the docking protocol, the initial FFT-based shape complementarity step eliminates only the regions with core clashes or without skin overlap. The sampling is therefore biased toward the limited region where skin−skin overlaps occur, and about $10^{8}$ ligand protein poses are retained for the second step.
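The FFT-based translational search described above can be sketched in a few lines of NumPy. For one fixed ligand rotation, a single forward/inverse FFT pair scores every grid translation at once. This is a generic illustration of FFT correlation docking, not the CoDockPP code: the grid size, the toy cube shapes, and the grid weights (1 for skin, −15 for receptor core, 1 for ligand) are assumptions chosen for the example, and NumPy's FFT stands in for FFTW3.

```python
import numpy as np


def fft_translation_scores(receptor_grid, ligand_grid):
    """Score all circular translations t of the ligand in one pass:

        score[t] = sum_x receptor[x] * ligand[x - t]

    computed with the correlation theorem instead of an explicit loop.
    """
    r_hat = np.fft.fftn(receptor_grid)
    l_hat = np.fft.fftn(ligand_grid)
    return np.fft.ifftn(r_hat * np.conj(l_hat)).real


def toy_grids(n=16):
    """Hypothetical grids: a receptor cube with a one-cell skin and a
    penalized core, and a small ligand cube (all weights are assumptions)."""
    receptor = np.zeros((n, n, n))
    receptor[4:10, 4:10, 4:10] = 1.0     # skin layer rewards contact
    receptor[5:9, 5:9, 5:9] = -15.0      # core penalizes clashes
    ligand = np.zeros((n, n, n))
    ligand[0:3, 0:3, 0:3] = 1.0
    return receptor, ligand


receptor, ligand = toy_grids()
scores = fft_translation_scores(receptor, ligand)

# Keep only translations with net skin overlap and no dominating core clash,
# mirroring the idea of discarding clashing or non-contacting regions.
retained = np.argwhere(scores > 0)
```

In a full search this calculation would be repeated for every sampled ligand rotation, and the retained translations, which are ordinary Cartesian offsets, would be passed on to the constraint and scoring stages.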

Figure 1. Flowchart of the multistage protein−protein docking protocol. The docking protocol starts with a global search over the entire rotational/translational space using the FFT-based method. In the medium and final stages, the retained conformations with good shape complementarity are filtered by the site constraints and evaluated by the knowledge-based scoring function.
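The site-constraint filter in the medium stage of Figure 1 can be illustrated with a short sketch that operates directly on Cartesian C-alpha coordinates. The function names and the array layout are hypothetical, and reusing the 10 Å interface cutoff as the constraint distance is an assumption based on the interface definition given under Site Constraints; the ambiguous and multiple modes follow the definitions in the text.

```python
import numpy as np

INTERFACE_CUTOFF = 10.0  # angstroms; interface definition from the text


def site_on_interface(site_ca, partner_ca, cutoff=INTERFACE_CUTOFF):
    """True if the site residue's C-alpha lies within `cutoff` of any
    C-alpha atom of the partner protein."""
    distances = np.linalg.norm(np.asarray(partner_ca) - np.asarray(site_ca), axis=1)
    return bool((distances <= cutoff).any())


def passes_constraints(rec_site_ca, lig_site_ca, rec_ca, lig_ca, mode="ambiguous"):
    """Filter one docked pose.

    rec_site_ca / lig_site_ca: C-alpha coordinates of the constrained residue
    on the receptor / ligand; rec_ca / lig_ca: (n, 3) arrays of all C-alpha
    coordinates, with the ligand already placed in the candidate pose.
    """
    rec_ok = site_on_interface(rec_site_ca, lig_ca)
    lig_ok = site_on_interface(lig_site_ca, rec_ca)
    if mode == "ambiguous":      # at least one site on the interface
        return rec_ok or lig_ok
    if mode == "multiple":       # both sites on the interface
        return rec_ok and lig_ok
    raise ValueError(f"unknown constraint mode: {mode}")
```

Because the check needs only pairwise distances, it can be applied to each retained pose before any atomic scoring, which is why the multiple-constraint mode discards more poses than the ambiguous mode.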

In the second step, if site constraints are provided, the conformations are filtered based on the constraints. Because the retained ligand protein poses are in Cartesian space, these conformations can be filtered by the distance constraints on the C-alpha atoms. Next, the knowledge-based scoring function is applied to score and sort the conformations by using a rapid table lookup method.32 To locally optimize the ligand protein poses, the grid is further subdivided for the best fast-scoring poses (∼$10^{5}$), and the sampling is expanded to 27 cubic grid points around each translation centroid.22 Approximately $10^{6}$ conformations are then evaluated by precise trilinear interpolation of the knowledge-based scoring function.51 Finally, the top 10 000 ranked binding modes are clustered with an L_RMSD cutoff of 3.0 Å. In the site-specific docking test, the docking modes generated are usually less diverse than those of global docking, so the L_RMSD cutoff for clustering is set to 2.0 Å. The top 1000 clustered binding modes are collected for comparison with the other docking programs, to which the same clustering method and cutoff are applied.

Docking Performance Evaluation. The protein−protein benchmark 5.0 (ref 35) was used in this work to test the docking program extensively. Because unbound structures reflect realistic docking conditions, the docking performance evaluation in this work is based on the results of unbound docking. The performances of different docking programs are evaluated by success rate and hit count, which are commonly used in the evaluation of protein−protein docking programs.52,53 A hit is defined as a binding mode with acceptable accuracy according to the CAPRI criteria.36,52 Given the number of predictions NP for a specific complex system, if at least one hit can be found within NP, the case is defined as a successful docking. The success rate is the percentage of successfully docked complexes in the data set for a specific NP. The hit count is the average number of hits per complex within NP. Similar to the previous study,25 NPs of $10^{0}$, $10^{0.5}$, $10^{1}$, $10^{1.5}$, $10^{2}$, $10^{2.5}$, and $10^{3}$ are chosen for the comparison, which correspond to 1, 3, 10, 31, 100, 316, and 1000 predictions, respectively. The success rates are denoted as SR1, SR3, SR10, SR31, SR100, SR316, and SR1000, respectively. When CoDockPP is compared with ZDOCK, we further define the average increase of success rates for CoDockPP as

$$\mathrm{AISR} = \frac{1}{7}\sum_{i}\frac{\mathrm{SR}_{i}(\mathrm{CoDockPP}) - \mathrm{SR}_{i}(\mathrm{ZDOCK})}{\mathrm{SR}_{i}(\mathrm{ZDOCK})} \tag{6}$$

where $i$ traverses the 7 NPs. Similarly, the hit counts are denoted as HC1, HC3, HC10, HC31, HC100, HC316, and HC1000, respectively. The average increase of hit counts for CoDockPP is defined as

$$\mathrm{AIHC} = \frac{1}{7}\sum_{i}\frac{\mathrm{HC}_{i}(\mathrm{CoDockPP}) - \mathrm{HC}_{i}(\mathrm{ZDOCK})}{\mathrm{HC}_{i}(\mathrm{ZDOCK})} \tag{7}$$

AISR and AIHC can likewise be used to compare the average increase of success rates and hit counts for any two docking programs.

RESULTS

Convergence of Scoring Function Training. Based on the 5000 binding modes for each of the 975 protein−protein complexes in the training set (Table S1), we trained the knowledge-based scoring function for protein−protein docking by using the statistical mechanics-based iterative method.36 To show the effectiveness of the iterative method in deriving the pair potentials, Figure 2 plots the success rate as a function of the iterative step. The success rate gradually approaches 99% as the iteration proceeds, indicating that most of the near-native binding modes of the protein−protein complexes in the training set are found after the iterative training. The second function plotted in the figure is the average L_RMSD between the predicted binding modes and the native modes. The average L_RMSD is reduced to around 2 Å after the iterative training. The resulting scoring function was included in the CoDockPP program and tested on the unbound protein−protein complexes of benchmark 5.0, as shown in the following.

Figure 2. Success rate and the average L_RMSD of the predicted binding modes as functions of the iterative step. The dashed line stands for the success rate of 100%.

Global Docking Test. In the global docking test, the CoDockPP program is compared with ZDOCK3.0.2 (ref 54) on the unbound cases of protein−protein benchmark 5.0. The evaluation results are shown in Figure 3 and Table S2. For the 230 protein−protein complexes in benchmark 5.0, the success rates at NP = 1, 10, 100, and 1000 are 9.1%, 29.1%, 51.3%, and 80.4% for ZDOCK and 13.9%, 32.2%, 57.8%, and 80.0% for CoDockPP, respectively (Figure 3A). The success rate of ZDOCK tested in this study is consistent with previous tests.6,35 Notably, the average increase of success rates AISR for CoDockPP is about 14.4% compared with ZDOCK. As shown in Figure 3B, the hit counts at NP = 1, 10, 100, and 1000 are about 0.09, 0.69, 3.1, and 11.5 for ZDOCK and 0.14, 0.93, 4.2, and 17.1 for CoDockPP, respectively. The average increase of hit counts AIHC for CoDockPP is about 37.5% compared with ZDOCK. Therefore, CoDockPP is more robust in the unbound test and finds more hit structures than ZDOCK.

Figure 3. Success rate and hit count comparisons for global unbound docking test of 230 protein−protein complexes in benchmark 5.0. (A) Docking success rates of ZDOCK and CoDockPP. (B) Hit counts of ZDOCK and CoDockPP.

To investigate the effects of conformational changes, we also calculated the success rates and hit counts of ZDOCK and CoDockPP on the three categories of rigid-body, medium difficulty, and difficult cases in the benchmark (Figure S1 and Table S2). The rigid-body category has 151 cases, which account for more than 65% of the benchmark; accordingly, the curves for the rigid-body category and for the whole benchmark are similar. For the medium difficulty category, the success rates of CoDockPP are higher than those of ZDOCK at NP = 10 and 100 but slightly lower at NP = 1000. Notably, the accuracies of CoDockPP are remarkably higher than those of ZDOCK in the difficult cases. For the 34 difficult cases, the success rates at NP = 1, 10, 100, and 1000 are about 0.0%, 11.8%, 35.3%, and 58.8% for ZDOCK and 5.9%, 20.6%, 38.2%, and 64.7% for CoDockPP, respectively. The average increase of success rates AISR for CoDockPP is about 42.9% compared with ZDOCK in the difficult category. For the difficult cases, it is hard to predict the correct binding poses based on the unbound structures. However, for the seven difficult cases 1EER, 1ZLI, 2IDO, 2O3B, 3F1P, 3L89, and 4GAM, CoDockPP obtains acceptable-quality predictions in the top 10. In particular, for the cases of 1EER, 2IDO, 3F1P, and 3L89, CoDockPP achieves medium-quality (L_RMSD < 5

Å) predictions in the top 10. As shown in Figure 4, the L_RMSDs are 3.61, 4.45, 2.80, and 4.34 Å, respectively. The knowledge-based scoring function of CoDockPP includes much more information on near-native structures in the observed pair distribution function, which makes it more robust for the difficult cases and yields much higher success rates than ZDOCK. As in the test on the whole benchmark, the hit counts of CoDockPP are remarkably increased in comparison to ZDOCK for all three categories.

Figure 4. Medium-accuracy predictions of four difficult unbound cases (1EER, 2IDO, 3F1P, and 3L89) achieved by CoDockPP in the top 10. The receptor protein and ligand protein of the crystal structure are colored yellow and cyan. The predicted ligand protein structure is colored pink. Protein structures were depicted using Chimera.55 The L_RMSDs of 1EER (A), 2IDO (B), 3F1P (C), and 3L89 (D) are 3.61, 4.45, 2.80, and 4.34 Å, respectively.

In the default setting of CoDockPP, we retained the conformations with good surface complementarity (about $10^{8}$), which allows more than 10 000 ligand protein binding modes to be kept in each rotation. For convenience of comparison, we denote this setting as CoDockPPDefault. To reveal the influence of keeping different numbers of ligand protein binding modes in the initial FFT-based search, we also performed the global docking tests of CoDockPP while keeping 10, 100, and 1000 binding modes in each rotation. These three tests are denoted as CoDockPP10, CoDockPP100, and CoDockPP1000, respectively. We also compared these tests under different L_RMSD criteria (3.0, 5.0, and 10.0 Å). As shown in Figure 5A, for 3.0 and 5.0 Å, the success case counts of CoDockPPDefault are clearly higher than those of the other three tests, but the success case counts are similar at the 10.0 Å criterion. This shows that CoDockPPDefault retains many more high-quality binding modes in the initial FFT-based search.

Figure 5. Histograms of the success case count (A), the final success rate (B), and hit count (C) for the default setting of CoDockPP (Default) and other docking tests of keeping different binding modes (10, 100, 1000) in each rotation. The histograms from left to right are the predictions with the L_RMSD criteria of 3.0, 5.0, and 10.0 Å, respectively. In part A, colors indicate the success case counts of 3.0 (red), 5.0 (orange), and 10.0 Å (yellow), respectively. In parts B and C, the performance of ZDOCK is also added for comparison, and colors indicate the results at NP = 10 (purple), 100 (green), and 1000 (yellow), respectively.

However, does this strategy also benefit the final success rate and hit count? We further compared the success rates and hit counts of these docking tests. As shown in Figure 5B, the success rates follow the same trend as the success case counts in Figure 5A. For 3.0 and 5.0 Å, the success rates of CoDockPPDefault are also higher than those of the other three tests. Moreover, for all L_RMSD criteria (3.0, 5.0, and 10.0 Å), CoDockPPDefault obtains more hits than the other three tests (see Figure 5C). This shows that the default strategy of CoDockPP, which keeps more ligand protein binding modes, improves the final success rate and hit count. Figure 5B and C also show the performance of ZDOCK under the different L_RMSD criteria. Its success rate and hit count lie between those of CoDockPP10 and CoDockPP100. This implies that other FFT-based programs, such as ZDOCK, may be improved if these programs retain more binding modes in the initial FFT-

based searching. Nevertheless, keeping more structures will also increase the running time and create challenges for scoring functions.

Site-Specific Docking Test. ZDOCK does not use site constraints in the docking search, although it can block nonsite residues in the interface.54 RosettaDock56 uses site constraints in the global docking process, so we compared the site-restrained global docking tests between CoDockPP and RosettaDock in Rosetta3.7. The site constraints include ambiguous constraints and multiple constraints, which can be set as AmbiguousConstraint and MultiConstraint in RosettaDock, respectively.

For the site-specific docking tests, we compared the CoDockPP program with RosettaDock on the unbound cases of protein−protein benchmark 5.0. The evaluation results are shown in Figure 6 and Table S3. For the 230 unbound cases in the ambiguous constraints test, the success rates at NP = 1, 10, 100, and 1000 are about 2.2%, 5.2%, 24.3%, and 67.4% for RosettaDock and 14.8%, 34.8%, 61.3%, and 86.1% for CoDockPP, respectively (Figure 6A). The success rates of CoDockPP with the ambiguous constraints are slightly higher than those of the global docking of CoDockPP, but they are much higher than those of RosettaDock with the ambiguous constraints. Similarly, the success rates of CoDockPP with the multiple constraints are also much higher than those of RosettaDock. For the 230 unbound cases in the multiple constraints test, the success rates at NP = 1, 10, 100, and 1000 are about 10.4%, 30.0%, 70.0%, and 96.5% for RosettaDock and 27.8%, 58.7%, 87.4%, and 97.4% for CoDockPP, respectively. Accordingly, the average increase of success rates AISR for CoDockPP compared with RosettaDock is 323.8% in the ambiguous constraints test and 65.1% in the multiple constraints test. In the ambiguous constraints test, the hit counts at NP = 1, 10, 100, and 1000 are about 0.02, 0.07, 0.37, and 2.05 for RosettaDock and 0.15, 1.12, 5.68, and 27.4 for CoDockPP, respectively (Figure 6B). In the multiple constraints test, the hit counts at NP = 1, 10, 100, and 1000 are about 0.10, 0.63, 4.27, and 27.45 for RosettaDock and 0.28, 2.24, 13.80, and 78.09 for CoDockPP, respectively. Even with the two-site multiple constraints, the success rates and hit counts of RosettaDock are only similar to those of CoDockPP with ambiguous constraints. RosettaDock is widely used for refinement and ranking in the CAPRI experiment,56 but it is not as good as the FFT-based methods in global searching. Therefore, the sampling method of RosettaDock may reduce its success rates and hit counts in the site-restrained global docking tests. Remarkably, the success rate at NP = 10 is close to 60.0% for CoDockPP in the multiple constraints test, which shows that the CoDockPP program can predict accurate binding modes for most cases within the top 10 when users provide some information about the binding sites.

Figure 6. Success rate and hit count comparisons for site-specific unbound docking test of 230 protein−protein complexes in benchmark 5.0. (A) Docking success rates of RosettaDock and CoDockPP in ambiguous and multiple constraint tests. (B) Hit counts of RosettaDock and CoDockPP in ambiguous and multiple constraint tests. AC denotes the ambiguous constraint docking test, and MC denotes the multiple constraint docking test.

As in the comparison for global docking, the three categories of rigid-body, medium difficulty, and difficult cases are also investigated in the site-specific docking test (Figure S2 and Table S3). The curves of the three categories show trends similar to those of the whole benchmark. Notably, for the rigid-body and medium difficulty categories, the success rates of CoDockPP both reach 100% at NP = 1000 in the multiple constraints test. This shows that the CoDockPP program can recover accurate binding modes for all of the low-flexibility systems within the top 1000 when adequate site residue information is provided.

Computational Efficiency. Computational efficiency is very important in developing a protein−protein docking program. Some state-of-the-art docking programs like SwarmDock57 and HADDOCK58 are also capable of performing global docking, but their computational cost is high. We tested the running time of CoDockPP for each unbound case on a single core of an AMD Opteron 6386 CPU with a clock speed of 2.8 GHz on a Linux x86_64 cluster. As shown in Figure 7, the medians of the running time of the CoDockPP program are about 81 min for the global docking test, 24 min for the

Figure 7. Boxplot statistics analysis for the running times of three different docking tests of CoDockPP on the 230 unbound cases of benchmark 5.0. GD denotes the global docking test, AC denotes the ambiguous constraint docking test, and MC denotes the multiple constraints docking test. The medians are shown by the thick lines, boxes show the range from the first (Q1) to third (Q3) quartile, and whiskers extend to the most extreme data point within 1.5 times the interquartile range of Q1 and Q3. F

DOI: 10.1021/acs.jcim.9b00445 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

constrains can even decrease the running times. Only portions of candidate ligand protein poses need to be evaluated by the precise knowledge-base scoring function, so the computational cost of CoDockPP is actually reduced. In the recent CASP13-CAPRI Assembly prediction challenge, our CoDock group ranked no. 8 in the CASP +CAPRI groups and no. 4 in the CAPRI groups (see http:// predictioncenter.org/casp13/zscores_multimer.cgi). Especially for six hard targets, the CoDock group ranked no. 1 in the CASP+CAPRI groups, which also confirms the robust docking performances of CoDockPP. Considering its robust predictive performance, the CoDockPP program is a good alternative for ab initio docking as well as site-specific docking.

two site ambiguous constraints docking test, and 16 min for the two site multiple constraints docking test on the 230 unbound cases of benchmark 5.0. Although this computational performance is slower than ZDOCK54 (11 min), it is faster than SwarmDock57 (36 h) and some unimproved FFT-based programs6 (∼100 min).



DISCUSSION From the above results and comparisons, the CoDockPP program shows better performance in both the global docking test and the site-specific docking test. In the global docking test, CoDockPP is compared with ZDOCK3.0.2,54 because of its robust performance in previous assessments. In a recent independent comprehensive assessment of current exhaustive docking programs,6 ZDOCK3.0.2,54 SDOCK,25 and PIPER/ ClusPro21 yielded the relatively higher success rates with 30.7%, 22.7%, and 21.0% for the top 10 predictions at benchmark 4.0. In the new cases of benchmark 5.0,35 the performance of ZDOCK3.0.2 was also better than those of HADDOCK58 and pyDock.27 In our global docking test, the average increase of success rates AISR for CoDockPP is about 14.4% compared with ZDOCK. Especially in the difficult category, the success rate of CoDockPP for top 10 predictions is 20.6%, which is remarkably higher than the 11.8% of ZDOCK. This implies that the scoring function and docking protocol of CoDockPP are more robust for the conformational changes. Different from correlation-type scoring of other FFTbased docking programs, CoDockPP uses the precise knowledge-base scoring function to evaluate a large but tractable number of the candidate poses in the Cartesian space directly. Therefore, CoDockPP obtains much more near-native structures in comparison to ZDOCK, and the average increase of hit counts AIHC for CoDockPP is about 37.5% compared with ZDOCK. In addition, the scoring training sets of previous docking programs include the same complexes from benchmark 4.0 or 5.0, which may cause the potential risk of overfitting.6 The training set of CoDockPP is collected by our criteria and removes the overlapped complexes of benchmark 5.0 by sequence identity of 30%, so the possible bias of CoDockPP might be smaller than other docking programs. 
With a priori information about the binding site, some non-FFT docking programs such as HADDOCK58 and RosettaDock56 use site constraint predictions to drive the docking. In the site-specific docking test, CoDockPP is compared with RosettaDock because it can use site constraints in the global Monte Carlo docking process.44 In FFT-based docking programs, however, a site constraint requires a new correlation function term, and the additional Fourier transform increases the running time of the docking program.34 Thus, most FFT-based docking programs do not implement site constraints through a correlation function term. Notably, ClusPro adds site constraints after its FFT sampling and thereby improves docking performance.34 Similarly, because the candidate ligand protein poses retained after the FFT-based shape complementarity stage are in Cartesian space, CoDockPP can filter these conformations directly by the site constraints. In the two site-specific docking tests, the average increases of success rates (AISR) for CoDockPP are 323.8% and 65.1% compared with RosettaDock, respectively. This implies that adding site constraints to FFT-based methods can remarkably improve docking performance, and the result appears to be better than RosettaDock’s site-specific docking by global Monte Carlo sampling. Meanwhile, as shown in Figure 7, providing site



CONCLUSION

In this work, we have presented CoDockPP, a multistage protein−protein docking protocol that provides a framework for both ab initio protein−protein docking and site-specific docking. The protocol applies the efficient FFT-based method to systematically evaluate shape complementarity and focuses on the ligand protein poses with good surface complementarity. These conformations are retained in Cartesian space, and the number of poses is gradually reduced to a tractable range. The hierarchical strategy used in CoDockPP facilitates the use of a precise knowledge-based scoring function and of site constraint information to improve docking accuracy. CoDockPP shows higher success rates and many more hits than ZDOCK in global docking, and it predicts much more accurate binding modes than RosettaDock in site-specific docking. The first version of CoDockPP improves docking accuracy and efficiency, which makes it a good choice for protein−protein complex prediction. Future developments will include further optimization of the scoring function and integration with structure refinement. The CoDockPP server is scheduled to be updated annually and is open to all users at http://codockpp.schanglab.org.cn.
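The idea of applying site constraints to poses that are already in Cartesian space can be sketched as follows. This is a minimal illustration under stated assumptions, not the CoDockPP implementation: the function names, the pose representation (a list of ligand atom coordinates that have already been transformed), and the 8.0 Å distance cutoff are all hypothetical choices made for the example.

```python
import math


def min_distance(points_a, points_b):
    """Smallest Euclidean distance between two sets of 3D points."""
    return min(math.dist(a, b) for a in points_a for b in points_b)


def satisfies_site(receptor_site, ligand_atoms, cutoff=8.0):
    """True if any ligand atom lies within cutoff of a receptor site atom."""
    return min_distance(receptor_site, ligand_atoms) <= cutoff


def filter_poses(receptor_site, poses, cutoff=8.0):
    """Keep only the poses that contact the specified receptor site."""
    return [pose for pose in poses if satisfies_site(receptor_site, pose, cutoff)]


# Toy data: one receptor site atom at the origin and two candidate poses.
site = [(0.0, 0.0, 0.0)]
near_pose = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]  # closest atom is 5 A away
far_pose = [(20.0, 0.0, 0.0), (25.0, 0.0, 0.0)]  # closest atom is 20 A away

kept = filter_poses(site, [near_pose, far_pose])
print(len(kept))  # prints 1
```

Because each pose is an explicit set of coordinates, a check of this kind is a plain distance test and needs no additional Fourier transform, which is consistent with the motivation given for filtering after the FFT stage.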



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.9b00445.



Figure S1. Success rate and hit count comparisons for a global unbound docking test of three categories in benchmark 5.0.
Figure S2. Success rate and hit count comparisons for a site-specific unbound docking test of three categories in benchmark 5.0.
Table S1. Training set of 975 protein−protein complexes used for deriving the knowledge-based scoring function.
Table S2. Success rate and hit count comparisons for the global unbound docking test.
Table S3. Success rate and hit count comparisons for the site-specific unbound docking test.
(PDF)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Ren Kong: 0000-0001-9010-1750
Feng Wang: 0000-0002-0275-8267
Jian Zhang: 0000-0002-6558-791X
Fengfei Wang: 0000-0003-3423-211X
Shan Chang: 0000-0001-7169-9398

Author Contributions

⊥R.K. and F.W. contributed equally to this work. S.C. conceived the idea and supervised the study. R.K., F.W., and S.C. implemented the docking protocol, built the CoDockPP server, and performed the tests of the docking program. J.Z. prepared the training set and test set. F.W. and S.C. designed the scoring function and analyzed the results. R.K., F.W., and S.C. wrote the manuscript. All authors read and approved the final manuscript.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (Grant No. 81603152), the Industry−Academia Cooperation Innovation Fund Project of Jiangsu Province (Grant Nos. BY2016030-06 and BY2016030-11), the Six Talent Peaks Project in Jiangsu Province (Grant No. 2016XYDXXJS-020), and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (Grant No. U1501501). We thank Shengyou Huang and Xiaoqin Zou (University of Missouri) for providing parts of the initial scripts for training the scoring function.

ABBREVIATIONS

FFT, fast Fourier transform; CAPRI, Critical Assessment of PRediction of Interactions; PDB, Protein Data Bank; L_RMSD, ligand protein root-mean-square deviations

REFERENCES

(1) Huttlin, E. L.; Bruckner, R. J.; Paulo, J. A.; Cannon, J. R.; Ting, L.; Baltier, K.; Colby, G.; Gebreab, F.; Gygi, M. P.; Parzen, H.; Szpyt, J.; Tam, S.; Zarraga, G.; Pontano-Vaites, L.; Swarup, S.; White, A. E.; Schweppe, D. K.; Rad, R.; Erickson, B. K.; Obar, R. A.; Guruharsha, K. G.; Li, K.; Artavanis-Tsakonas, S.; Gygi, S. P.; Harper, J. W. Architecture of the Human Interactome Defines Protein Communities and Disease Networks. Nature 2017, 545, 505−509.
(2) Li, X.-H.; Chavali, P. L.; Babu, M. M. Capturing Dynamic Protein Interactions. Science 2018, 359, 1105−1106.
(3) Jiang, H.; Deng, R.; Yang, X.; Shang, J.; Lu, S.; Zhao, Y.; Song, K.; Liu, X.; Zhang, Q.; Chen, Y.; Chinn, Y. E.; Wu, G.; Li, J.; Chen, G.; Yu, J.; Zhang, J. Peptidomimetic Inhibitors of APC−Asef Interaction Block Colorectal Cancer Migration. Nat. Chem. Biol. 2017, 13, 994−1001.
(4) Russell, R. B.; Alber, F.; Aloy, P.; Davis, F. P.; Korkin, D.; Pichaud, M.; Topf, M.; Sali, A. A Structural Perspective on Protein-Protein Interactions. Curr. Opin. Struct. Biol. 2004, 14, 313−324.
(5) Smith, G. R.; Sternberg, M. J. E. Prediction of Protein-Protein Interactions by Docking Methods. Curr. Opin. Struct. Biol. 2002, 12, 28−35.
(6) Huang, S.-Y. Exploring the Potential of Global Protein-Protein Docking: An Overview and Critical Assessment of Current Programs for Automatic Ab Initio Docking. Drug Discovery Today 2015, 20, 969−977.
(7) Halperin, I.; Ma, B.; Wolfson, H.; Nussinov, R. Principles of Docking: An Overview of Search Algorithms and a Guide to Scoring Functions. Proteins: Struct., Funct., Genet. 2002, 47, 409−443.
(8) Janin, J.; Henrick, K.; Moult, J.; Ten Eyck, L.; Sternberg, M. J. E.; Vajda, S.; Vakser, I.; Wodak, S. J. CAPRI: A Critical Assessment of PRedicted Interactions. Proteins: Struct., Funct., Genet. 2003, 52, 2−9.
(9) Lensink, M. F.; Wodak, S. J. Docking, Scoring, and Affinity Prediction in CAPRI. Proteins: Struct., Funct., Genet. 2013, 81, 2082−2095.
(10) Lensink, M. F.; Velankar, S.; Baek, M.; Heo, L.; Seok, C.; Wodak, S. J. The Challenge of Modeling Protein Assemblies: The CASP12-CAPRI Experiment. Proteins: Struct., Funct., Genet. 2018, 86, 257−273.
(11) Morris, G. M.; Huey, R.; Lindstrom, W.; Sanner, M. F.; Belew, R. K.; Goodsell, D. S.; Olson, A. J. AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. J. Comput. Chem. 2009, 30, 2785−2791.
(12) Wang, Z.; Sun, H.; Yao, X.; Li, D.; Xu, L.; Li, Y.; Tian, S.; Hou, T. Comprehensive Evaluation of Ten Docking Programs on a Diverse Set of Protein-Ligand Complexes: The Prediction Accuracy of Sampling Power and Scoring Power. Phys. Chem. Chem. Phys. 2016, 18, 12964−12975.
(13) Park, H.; Lee, H.; Seok, C. High-Resolution Protein-Protein Docking by Global Optimization: Recent Advances and Future Challenges. Curr. Opin. Struct. Biol. 2015, 35, 24−31.
(14) Vakser, I. A. Protein-Protein Interfaces are Special. Structure 2004, 12, 910−912.
(15) Lazar, T.; Guharoy, M.; Schad, E.; Tompa, P. Unique Physicochemical Patterns of Residues in Protein-Protein Interfaces. J. Chem. Inf. Model. 2018, 58, 2164−2173.
(16) Katchalski-Katzir, E.; Shariv, I.; Eisenstein, M.; Friesem, A. A.; Aflalo, C.; Vakser, I. A. Molecular Surface Recognition: Determination of Geometric Fit between Proteins and Their Ligands by Correlation Techniques. Proc. Natl. Acad. Sci. U. S. A. 1992, 89, 2195−2199.
(17) Tovchigrechko, A.; Vakser, I. A. GRAMM-X Public Web Server for Protein-Protein Docking. Nucleic Acids Res. 2006, 34, W310−W314.
(18) Mandell, J. G.; Roberts, V. A.; Pique, M. E.; Kotlovyi, V.; Mitchell, J. C.; Nelson, E.; Tsigelny, I.; Ten Eyck, L. F. Protein Docking Using Continuum Electrostatics and Geometric Fit. Protein Eng., Des. Sel. 2001, 14, 105−113.
(19) Ritchie, D. W. Evaluation of Protein Docking Predictions Using Hex 3.1 in CAPRI Rounds 1 and 2. Proteins: Struct., Funct., Genet. 2003, 52, 98−106.
(20) Chen, R.; Li, L.; Weng, Z. ZDOCK: An Initial-Stage Protein-Docking Algorithm. Proteins: Struct., Funct., Genet. 2003, 52, 80−87.
(21) Kozakov, D.; Brenke, R.; Comeau, S. R.; Vajda, S. PIPER: An FFT-Based Protein Docking Program with Pairwise Potentials. Proteins: Struct., Funct., Genet. 2006, 65, 392−406.
(22) Bajaj, C.; Chowdhury, R.; Siddavanahalli, V. F2Dock: Fast Fourier Protein-Protein Docking. IEEE/ACM Trans. Comput. Biol. Bioinf. 2011, 8, 45−58.
(23) Yan, Y.; Zhang, D.; Zhou, P.; Li, B.; Huang, S.-Y. HDOCK: A Web Server for Protein-Protein and Protein-DNA/RNA Docking Based on a Hybrid Strategy. Nucleic Acids Res. 2017, 45, W365−W373.
(24) Huang, S.-Y.; Zou, X. MDockPP: A Hierarchical Approach for Protein-Protein Docking and Its Application to CAPRI Rounds 15−19. Proteins: Struct., Funct., Genet. 2010, 78, 3096−3103.
(25) Zhang, C.; Lai, L. SDOCK: A Global Protein-Protein Docking Program Using Stepwise Force-Field Potentials. J. Comput. Chem. 2011, 32, 2598−2612.
(26) Garzon, J. I.; López-Blanco, J. R.; Pons, C.; Kovacs, J.; Abagyan, R.; Fernandez-Recio, J.; Chacon, P. FRODOCK: A New Approach for Fast Rotational Protein-Protein Docking. Bioinformatics 2009, 25, 2544−2551.
(27) Cheng, T.; Blundell, T.; Fernandez-Recio, J. PyDock: Electrostatics and Desolvation for Effective Scoring of Rigid-Body Protein-Protein Docking. Proteins: Struct., Funct., Genet. 2007, 68, 503−515.
(28) Neveu, E.; Ritchie, D. W.; Popov, P.; Grudinin, S. PEPSI-Dock: A Detailed Data-Driven Protein-Protein Interaction Potential Accelerated by Polar Fourier Correlation. Bioinformatics 2016, 32, i693−i701.
(29) Padhorny, D.; Kazennov, A.; Zerbe, B. S.; Porter, K. A.; Xia, B.; Mottarella, S. E.; Kholodov, Y.; Ritchie, D. W.; Vajda, S.; Kozakov, D. Protein-Protein Docking by Fast Generalized Fourier Transforms on 5D Rotational Manifolds. Proc. Natl. Acad. Sci. U. S. A. 2016, 113, E4286−E4293.
(30) Ohue, M.; Shimoda, T.; Suzuki, S.; Matsuzaki, Y.; Ishida, T.; Akiyama, Y. MEGADOCK 4.0: An Ultra-High-Performance Protein-Protein Docking Software for Heterogeneous Supercomputers. Bioinformatics 2014, 30, 3281−3283.
(31) Vajda, S.; Hall, D. R.; Kozakov, D. Sampling and Scoring: A Marriage Made in Heaven. Proteins: Struct., Funct., Genet. 2013, 81, 1874−1884.
(32) Hogues, H.; Gaudreault, F.; Corbeil, C. R.; Deprez, C.; Sulea, T.; Purisima, E. O. ProPOSE: Direct Exhaustive Protein-Protein Docking with Side Chain Flexibility. J. Chem. Theory Comput. 2018, 14, 4938−4947.
(33) Jiménez-García, B.; Roel-Touris, J.; Romero-Durana, M.; Vidal, M.; Jiménez-González, D.; Fernández-Recio, J. LightDock: A New Multi-Scale Approach to Protein-Protein Docking. Bioinformatics 2018, 34, 49−55.
(34) Xia, B.; Vajda, S.; Kozakov, D. Accounting for Pairwise Distance Restraints in FFT-Based Protein-Protein Docking. Bioinformatics 2016, 32, 3342−3344.
(35) Vreven, T.; Moal, I. H.; Vangone, A.; Pierce, B. G.; Kastritis, P. L.; Torchala, M.; Chaleil, R.; Jiménez-García, B.; Bates, P. A.; Fernandez-Recio, J.; Bonvin, A. M. J. J.; Weng, Z. Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J. Mol. Biol. 2015, 427, 3031−3041.
(36) Huang, S.-Y.; Zou, X. An Iterative Knowledge-Based Scoring Function for Protein-Protein Recognition. Proteins: Struct., Funct., Genet. 2008, 72, 557−579.
(37) Huang, S.-Y.; Zou, X. An Iterative Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions: II. Validation of the Scoring Function. J. Comput. Chem. 2006, 27, 1876−1882.
(38) Kulys, J.; Ziemys, A. A Role of Proton Transfer in Peroxidase-Catalyzed Process Elucidated by Substrates Docking Calculations. BMC Struct. Biol. 2001, 1, 3.
(39) Soares, T. A.; Goodsell, D. S.; Briggs, J. M.; Ferreira, R.; Olson, A. J. Docking of 4-Oxalocrotonate Tautomerase Substrates: Implications for the Catalytic Mechanism. Biopolymers 1999, 50, 319−328.
(40) Pallara, C.; Rueda, M.; Abagyan, R.; Fernández-Recio, J. Conformational Heterogeneity of Unbound Proteins Enhances Recognition in Protein-Protein Encounters. J. Chem. Theory Comput. 2016, 12, 3236−3249.
(41) Pons, C.; Fenwick, R. B.; Esteban-Martín, S.; Salvatella, X.; Fernandez-Recio, J. Validated Conformational Ensembles are Key for the Successful Prediction of Protein Complexes. J. Chem. Theory Comput. 2013, 9, 1830−1837.
(42) Grünberg, R.; Leckner, J.; Nilges, M. Complementarity of Structure Ensembles in Protein-Protein Binding. Structure 2004, 12, 2125−2136.
(43) Popov, P.; Grudinin, S. Knowledge of Native Protein-Protein Interfaces is Sufficient to Construct Predictive Models for the Selection of Binding Candidates. J. Chem. Inf. Model. 2015, 55, 2242−2255.
(44) Chaudhury, S.; Sircar, A.; Sivasubramanian, A.; Berrondo, M.; Gray, J. J. Incorporating Biochemical Information and Backbone Flexibility in RosettaDock for CAPRI Rounds 6−12. Proteins: Struct., Funct., Genet. 2007, 69, 793−800.
(45) Li, L.; Huang, Y. Z.; Xiao, Y. How to Use Not-Always-Reliable Binding Site Information in Protein-Protein Docking Prediction. PLoS One 2013, 8, No. e75936.
(46) Gong, X.; Wang, P.; Yang, F.; Chang, S.; Liu, B.; He, H.; Cao, L.; Xu, X.; Li, C.; Chen, W.; Wang, C. Protein-Protein Docking with Binding Site Patch Prediction and Network-Based Terms Enhanced Combinatorial Scoring. Proteins: Struct., Funct., Genet. 2010, 78, 3150−3155.
(47) Jiménez-García, B.; Pons, C.; Fernández-Recio, J. pyDockWEB: A Web Server for Rigid-Body Protein-Protein Docking Using Electrostatics and Desolvation Scoring. Bioinformatics 2013, 29, 1698−1699.
(48) Pan, A. C.; Jacobson, D.; Yatsenko, K.; Sritharan, D.; Weinreich, T. M.; Shaw, D. E. Atomic-Level Characterization of Protein-Protein Association. Proc. Natl. Acad. Sci. U. S. A. 2019, 116, 4244−4249.
(49) Hwang, H.; Petrey, D.; Honig, B. A Hybrid Method for Protein-Protein Interface Prediction. Protein Sci. 2016, 25, 159−165.
(50) Frigo, M.; Johnson, S. G. The Design and Implementation of FFTW3. Proc. IEEE 2005, 93, 216−231.
(51) Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R. K.; Olson, A. J. Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function. J. Comput. Chem. 1998, 19, 1639−1662.
(52) Janin, J. Assessing Predictions of Protein-Protein Interaction: The CAPRI Experiment. Protein Sci. 2005, 14, 278−283.
(53) Janin, J. Protein-Protein Docking Tested in Blind Predictions: The CAPRI Experiment. Mol. BioSyst. 2010, 6, 2351−2362.
(54) Pierce, B. G.; Hourai, Y.; Weng, Z. Accelerating Protein Docking in ZDOCK Using an Advanced 3D Convolution Library. PLoS One 2011, 6, No. e24657.
(55) Pettersen, E. F.; Goddard, T. D.; Huang, C. C.; Couch, G. S.; Greenblatt, D. M.; Meng, E. C.; Ferrin, T. E. UCSF Chimera–A Visualization System for Exploratory Research and Analysis. J. Comput. Chem. 2004, 25, 1605−1612.
(56) Lyskov, S.; Gray, J. J. The RosettaDock Server for Local Protein-Protein Docking. Nucleic Acids Res. 2008, 36, W233−W238.
(57) Torchala, M.; Moal, I. H.; Chaleil, R. A. G.; Fernandez-Recio, J.; Bates, P. A. SwarmDock: A Server for Flexible Protein-Protein Docking. Bioinformatics 2013, 29, 807−809.
(58) Dominguez, C.; Boelens, R.; Bonvin, A. M. J. J. HADDOCK: A Protein-Protein Docking Approach Based on Biochemical or Biophysical Information. J. Am. Chem. Soc. 2003, 125, 1731−1737.
