Protein Loop Structure Prediction Using Conformational Space

Apr 11, 2017 - ... https://cdn.mathjax.org/mathjax/contrib/a11y/accessibility-menu.js .... with the conformational space annealing (CSA) global optimi...
0 downloads 0 Views 5MB Size
Article pubs.acs.org/jcim

Protein Loop Structure Prediction Using Conformational Space Annealing Seungryong Heo,†,‡ Juyong Lee,§ Keehyoung Joo,†,∥ Hang-Cheol Shin,‡ and Jooyoung Lee*,†,⊥ †

Center for in Silico Protein Science, ∥Center for Advanced Computation, and ⊥School of Computational Sciences, Korea Institute for Advanced Study, Seoul 02455, Korea ‡ School of Systems Biomedical Science, Soongsil University, Seoul 06978, Korea § Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland 20892, United States S Supporting Information *

ABSTRACT: We have developed a protein loop structure prediction method by combining a new energy function, which we call EPLM (energy for protein loop modeling), with the conformational space annealing (CSA) global optimization algorithm. The energy function includes stereochemistry, dynamic fragment assembly, distance-scaled finite ideal gas reference (DFIRE), and generalized orientation- and distancedependent terms. For the conformational search of loop structures, we used the CSA algorithm, which has been quite successful in dealing with various hard global optimization problems. We assessed the performance of EPLM with two widely used loop-decoy sets, Jacobson and RAPPER, and compared the results against the DFIRE potential. The accuracy of model selection from a pool of loop decoys as well as de novo loop modeling starting from randomly generated structures was examined separately. For the selection of a nativelike structure from a decoy set, EPLM was more accurate than DFIRE in the case of the Jacobson set and had similar accuracy in the case of the RAPPER set. In terms of sampling more nativelike loop structures, EPLM outperformed EDFIRE for both decoy sets. This new approach equipped with EPLM and CSA can serve as the state-of-the-art de novo loop modeling method. over the years; five such methods are quickly described here. (1) Yelena et al. developed an internal coordinate mechanics force field (ICMFF) based on the ECEPP/05 force field11 and combined it with a solvent-accessible surface-area solvation model. A conformational search was then performed with the biased probability Monte Carlo (BPMC) method.12,13 (2) Kevin et al. introduced a fast protein loop conformational sampling approach based on the continuum configurational bias Monte Carlo (CCBMC) method14 using distance and torsion angle restraints to guide the sequential atom replacement of the loop structure.15 (3) Velin et al. devised the LOOPER method, which performs conformational sampling in the backbone torsion-angle space and uses CHARMM 16 for energy minimization and scoring.17 (4) Rosetta is one of the most widely used protein loop modeling programs.18−20 Rosetta extracts homologous fragments from a protein structure database and evaluates initial fragments using a scoring function. These fragments are then randomly assembled by a Monte Carlo-simulated annealing search method.21 (5) The protein local optimization program (PLOP) uses a rotamer

1. INTRODUCTION The protein loop is the most flexible substructure in a protein. This loop flexibility often plays an important role in the biological function of a protein1−4 as it affects the protein folding process,5 the protein−protein interaction,6,7 and the protein activity.8,9 Proteins cannot fold into a compact structure without loop regions, and an exposed loop on a protein surface can play an important role as a binding or activation site. As a result, homologous proteins often have different functions derived from different loop structures.10 The structural diversity of loop structures occurs as a consequence of insertion, deletion, or substitution of amino acid sequences during evolution. Although protein loop structure prediction is important for studying the biological functions of proteins, it remains a challenging task due to the sequence variability of the loop and the lack of homologous loop templates in the database. Existing loop modeling approaches can be categorized into two groups: ab initio (de novo) and knowledge-based methods. An ab initio method performs a conformational search guided by a given energy function to obtain the most thermodynamically stable structure. For protein loop modeling in particular, various ab initio methods have been suggested © XXXX American Chemical Society

Received: December 8, 2016 Published: April 11, 2017 A

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling library for a conformational search in dihedral angle space.22 Rosetta and PLOP use fragment and rotamer libraries to reduce the size of the conformational space and to shorten the computational time. In general, ab initio loop prediction methods are heavily influenced by both the efficiency of the sampling method and the accuracy of the scoring function. If a near-native conformation cannot be found in the sampling step, ranking the loop candidates would be meaningless. Similarly, if the scoring function fails to identify near-native conformations among given/sampled loop candidates, we cannot obtain good prediction results even when near-native structures are present in the decoys or during sampling. A knowledge-based method is an approach that identifies a nativelike loop structure from the protein data bank (PDB) database based on the sequence homology. In this approach, the quality of a loop model depends on the success of the homology detection using its associated database.23,24 As the size of the protein structure database increases, the knowledgebased approach becomes more powerful. Three such databasebased methods are quickly described here: (1) FREAD25 uses sequence similarity measured by environment-specific substitution scores and the distance between two anchoring residues of the loop. (2) In the loop in proteins (LIP) database developed by Michalsky et al., loop candidates are selected based on the local geometry among four anchor atoms and ranked by a scoring function.26 (3) LoopWeaver,27 introduced by Daniel et al., selects fragments from a database of known structures and uses the weighted multidimensional scaling (WMDS) method28 to maintain the shape of the loop obtained from the database while avoiding steric clashes with the rest of the protein structure. In this study, we devised a new energy function for ab initio protein loop modeling, EPLM, and compared its performance with the DFIRE potential, which is probably the most widely used all-atom statistical potential for protein model quality assessment. We assessed our method using two loop decoy sets: Jacobson and RAPPER.29,30 The assessment of the new energy function was performed in two ways. First, we evaluated the efficiency of the energy function to select the most nativelike structure from given decoys. Second, we investigated the utility of the energy function to search for nativelike structures starting from randomly generated loop structures. For the conformational sampling of the loop structure, we used the conformational space annealing (CSA) method, which has been quite successfully applied to various hard combinatorial optimization problems.32−36

E DFAnei = wDFAnei(polar)E DFAnei(polar) + wDFAnei(nonpolar) E DFAnei(nonpolar) + wDFAnei(aromatic)E DFAnei(aromatic)

E DFIRE,spline = wDFIRE,spline(polar)E DFIRE,spline(polar) + wDFIRE,spline(nonpolar)EDFIRE,spline(nonpolar) E hbond = whbond(backbone − backbone)E hbond(backbone − backbone) + whbond(backbone − side chain)E hbond(backbone − side chain) + whbond(side chain − side chain)E hbond(side chain − side chain)

Evdw is the van der Waals energy term adopted from the CHARMM22 force field.37 Estereochem is the basic local stereochemistry term derived from the MODELER energy function using only the sequence information on the target loop.38 A fragment library was constructed to generate three dynamic fragment assembly (DFA) terms in eq 1. EDFAdist and EDFAang are the distance and pseudodihedral angle restraint terms of its corresponding 9-residue fragment. EDFAnei is the restraint term for the preferred packing environment by the number of neighboring Cα atoms of the center Cα atom in the fragment. When generating the fragment library, we excluded homologous templates, which have more than 30% sequence identity with the query protein sequence. Details on EDFA can be found elsewhere.40,41 EDFIRE,spline is the all-atom statistical potential energy smoothed by cubic spline.42 Original DFIRE is discrete and was used to perform the single-point energy calculation of a given conformation. We interpolated the potential by using the cubic spline method to make it differentiable. Ehbond is the hydrogen bond energy term,43 and EGOAP is the generalized orientation and distance-dependent allatom statistical potential (GOAP).44 The weight values of energy components in eq 1 are identical to those used for protein structure prediction of CASP10 targets of the PMS server,31 where additional templatedependent terms were used for template-based modeling (TBM) targets (wDFAdist = 1.60, wDFAang = 1.60, wDFAnei(polar) = 1.0, w DFAnei(nonpolar) = 3.20, w DFAnei(aromatic) = 1.60, w D F I R E , s p l i n e ( p o l a r ) = 2.5, w D F I R E , s p l i n e ( n o n p o l a r ) = 2.5, whbond(backbone−backbone) = 17.0, whbone(backbone−side chain) = 4.0, whbond(side chain−side chain) = 1.0, wGOAP = 0.001). We note that three separate wDFAnei values were used depending on the type of amino acid in the center of the fragment. Also, three whbond values were used as indicated depending on the hydrogen bond forming atoms. These weight values were obtained by using 27 small single-domain TBM targets of CASP9. We note that, because of its mathematical complexity, EGOAP was not used during the energy minimization step and only its single-point energy evaluation was added in eq 1 at the end of the energy minimization. Because the protein loop modeling can be considered as a small ab inito modeling, we first considered eq 1 as the energy function to be used in loop modeling where the rest of the protein structure is treated as a rigid body object. In addition to eq 1, we also considered five additional energy functions where each of the last five energy terms in eq 1 was removed in turn. This was designed to estimate the contribution of these five energy terms separately. From this test, we found that EDFAnei and Ehbond contributed negatively to the energy function for desired loop modeling/selection. Finally, on the basis of these experiments, we constructed a modified energy function for

2. MATERIALS AND METHODS 2.1. Energy Function. For protein loop modeling, several energy functions were tested in this study. First, we considered the energy function we used for ab initio targets of CASP10, ECASP10ab. CASP (critical assessment of protein structure prediction) is a community-wide experiment for protein structure prediction taking place every two years since 1994.39 The energy function ECASP10ab is shown below. ECASP10ab = Evdw + Estereochem + wGOAPEGOAP + wDFAdistE DFAdist + wDFAangE DFAang + E DFAnei + E DFIRE,spline + E hbond

(1) B

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling protein loop modeling, EPLM, where the two negatively contributing terms (EDFAnei and Ehbond) were excluded.

conformations were obtained. See refs 32−36 for details regarding CSA. After the CSA sampling, the lowest energy conformations of ECASP10ab and EDFIRE were selected, and their RMSD values were then compared. The correlation coefficient (CC) between the total energy and RMSD is an important measure for accessing the performance of the energy function. However, it should be noted that CC strongly depends on the existence of outliers. For example, the CC of the loop target 1aaj in the Jacobson set was 0.833 but had decreased to 0.592 after high-energy outliers were removed. On the other hand, the CC of the loop target 1fkf improved from 0.231 to 0.843 after filtering out outliers (see Figure S1). Thus, to obtain reliable results, we filtered out high-energy outliers among the 100 final bank conformations whose energies were higher than the average of the maximum and minimum energies of the bank. 2.4. Sampling Loop Structures by Conformational Space Annealing. To obtain the initial structures of CSA sampling for loop modeling, we randomly positioned all atoms in the loop region within the sphere that was located and formed between the backbone carboxyl atom of the N-terminal anchor residue and the backbone amide atom of the C-terminal anchor residue. We generated a set of 50 random conformations that were subsequently minimized by local energy minimization. We called this set the “first bank”. For local energy minimization, a limited memory BFGS quasinewton nonlinear optimization routine (LBFGS)48−50 was used. During the CSA procedure, only the atoms in the loop region were allowed to move; the rest of the protein remained fixed. From the first bank, we selected 30 conformations as seed conformations for the generation of trial conformations. For each seed conformation, we generated 30 trial conformations, t, via crossover operation in the Cartesian coordinate followed by energy minimization. All energy-minimized trial conformations were compared with the 50 bank structures in terms of energy and similarity. For each t, the closest conformation to t in the bank, c, was identified by measuring the pairwise distance between t and the rest of the bank conformations. In this study, we used the sum of the absolute values of torsion angle differences between the two loops as the distance measure. If the distance was lower than the cutoff distance, Dcut, t was considered to be similar to c. In this case, if the energy of t was lower than that of c, t replaced c. Otherwise, the trial conformation was discarded. If t was not similar to c, but its energy was lower than that of the highest energy conformation in the bank, b, t replaced b. The value of Dcut was initially set to half the average pair distance in the first bank. The value of Dcut was slowly reduced until it reached Dave/5. Finally, we obtained 50 distinct low-energy conformations that we called the “final bank”. We repeated this procedure to obtain a total of 100 low energy “final bank” conformations. 2.5. Clustering. After the CSA sampling, the k-means clustering procedure was applied to reduce the complexity of the final bank conformations. In the CASP experiment, up to five structures were allowed for model submission for each target. Hence, we tried to estimate the quality of loop modeling by selecting up to five loop structures. For this, we grouped 100 final bank conformations into five clusters and then selected the lowest energy conformation from each cluster. The unsuperimposed all heavy-atom RMSD values between the native loop structure and final bank conformations were used as the distance measure for clustering.

E PLM = Evdw + Estereochem + wGOAPEGOAP + EDFIRE,spline + wDFAdistE DFAdist + wDFAangE DFAang

(2)

In addition to ECASP10ab and EPLM, we also considered the EDFIRE energy function for benchmarking, which consists of the differentiable DFIRE term, the basic stereochemistry term (Estereochem), and the van der Waals term (Evdw) as E DFIRE = Evdw + Estereochem + EDFIRE,spline

(3)

The structure of a protein used in this study is divided into two regions: flexible (loop) and fixed (nonloop). The potential energies of given conformations are calculated for the interactions between flexible and flexible regions and flexible and fixed regions. However, in the case of DFAdist and DFAang, we considered only the interactions between flexible (loop) regions. This is not to create any conflicts between the local structural propensity of selected fragments extending the fixed (nonloop) part of the protein and the actual fixed (nonloop) structure of the protein. 2.2. Data Sets. Jacobson Decoy Set. The Jacobson decoy set, which consists of 756 loop targets (loop sizes ranging from 4 to 12 amino acid residues), was used to benchmark the effectiveness of energy functions (ECASP10ab, EPLM, EDFIRE, and others) for loop modeling. One half of the Jacobson set (set A) was used as the test set to estimate the contribution of each energy term in eq 1, and the other half (set B) was used as the benchmark set. For each loop target, 50−1000 loop decoys were provided with their structural variation covering a wide spectrum ranging from nativelike to random structures. For all targets, MODEL0 refers to the “native” loop structure, MODEL1 to the single energy-minimized native loop structure using OPLS/SGB, and MODEL 2 to the native structure with optimized loop side-chains prior to the minimization.45 Because these three structures were either identical or quite similar to the native structure, they were removed from the decoys in this study. RAPPER Decoy Set. The RAPPER set46 was generated by sampling backbone φ/ψ angles from fine-grained residuespecific φ/ψ propensity tables, and the ω dihedral angle was kept fixed at 180°. Special attention was paid to avoid clashed atoms in the loop region.47 The RAPPER set consisted of 385 loop targets with loop sizes ranging from 2 to 12 residues. Each loop target contained 1000 decoys. Because the decoys of the RAPPER set did not contain native loops, the whole set was used as a benchmark set. 2.3. Assessment Scheme. ECASP10ab (or EPLM) was compared with DFIRE (or EDFIRE) in two ways. First, the ECASP10ab and DFIRE scores of all decoys were calculated. Then, we selected the lowest energy conformation by each potential and compared their RMSD (root mean squared deviation) values. To calculate the RMSD value to the native structure, we used all backbone heavy atoms. Second, to identify how well a potential function was correlated with the quality of the model structure, we performed an extensive conformational sampling of the loop region by using CSA starting from randomly generated loop structures. During CSA, 50 first bank conformations (random structures followed by single energy minimization) evolve to 50 final bank conformations, and with an additional 50 conformations added, a total of 100 final C

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 1. (a) RMSDs (Å) of the lowest-energy decoys selected by DFIRE, ECASP10ab, dDFIRE, and GOAP are shown for Jacobson set A as a function of the loop size. The average and lowest RMSDs for all decoys are also shown. RMSD was calculated using all backbone heavy atoms without performing superimposition between a decoy and its native structure. All RMSDs are averaged over the total number of loop targets of a given loop size. (b) RMSDs of the lowest energy first bank conformations from CSA calculations are shown. The average and the lowest RMSDs of first bank conformations are also shown. For the CSA calculation, two energy functions, ECASP10ab and EDFIRE, were used. (c) Results for final bank conformations from CSA calculations are shown. (d) RMSDs of the lowest energy final bank conformations are shown for seven functions listed in the legend. On the basis of the results of the five energy functions where one energy term was removed from ECASP10ab one at a time, EPLM was constructed (see the text). (e) The first benchmark result of EPLM on Jacobson set B by selecting the lowest energy decoys is shown along with the results of the four other energy functions listed in the legend. (f) RMSDs of the lowest energy final bank conformations from CSA calculations are shown for EDFIRE, ECASP10ab, and EPLM.

2.6. RMSD Calculations. We used RMSD to evaluate the accuracy of the loop structure prediction. To calculate RMSD

of a loop structure, we used all backbone heavy atoms (N, Cα, C, O). Because we fixed the nonloop part of the protein D

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

and GOAP. In this result, EPLM shows the best result out of the five energy functions. Figure 1(f) represents the average RMSD value of the lowest energy final bank conformation in the CSA sampling. EPLM results were improved by 10−20% in a similar fashion. Detailed average RMSD values of the whole Jacobson decoy set are listed in Table S1. EPLM worked better than EDFIRE and ECASP10ab for all loop sizes for searching more nativelike conformations. The scatter plot of RMSD values of loop models predicted by EPLM and EDFIRE is shown in Figure S2(a). Here, we compare EPLM against EDFIRE in the two worst loop modeling cases 1amp (10-residue loop) and 1rcf_2 (11-residue loop). The RMSD values of the lowest EDFIRE conformations of 1amp and 1rcf_2 were 0.94 and 0.54 Å, respectively, whereas those of the lowest EPLM conformations were 3.93 and 4.68 Å, respectively. In the case of 1amp, its 10-residue loop consists of five hydrophilic residues (T181, N182, Y183, S186, Q188), three hydrophobic residues (G185, A187, V190), and two charged residues (K184, D189); the EPLM model was more exposed than the crystal structure. In the case of 1rcf_2, its 11residue loop consists of four hydrophilic residues (T122, Y125, N128, S129), three hydrophobic residues (G124, G127, A132), and four charged residues (D123, D126, D129, K131); the EPLM model was not properly bent toward the inside of the protein, which led to the poor prediction result (see Figure S3). Figure 2 shows an example of a successful prediction using EPLM: 1kuh (12-residue loop size). Although the crystal

structure during the CSA sampling of the loop, the loop region RMSD was calculated without performing superimposition between loop structures.

3. RESULTS Jacobson Set A. We gathered the results for the first half of the Jacobson decoy set (Jacobson set A) by initially finding the average RMSD values of the lowest energy decoys by ECASP10ab: DFIRE, dDFIRE, and GOAP (see Figure 1(a)). Here, dDFIRE51 is a dipolar DFIRE that treats polar atoms separately from nonpolar atoms. The average RMSD values of the lowest ECASP10ab conformations are, on average, lower than those of the lowest statistical potentials. This result shows that ECASP10ab works slightly better than three statistical potentials on average in terms of selecting more nativelike loop conformations from Jacobson set A. Next, we performed extensive conformational sampling of the loops starting from random loop structures using CSA with ECASP10ab and EDFIRE separately. The average RMSD values of the first bank conformations of CSA, which were generated in a random fashion and then locally energy minimized, are shown in Figure 1(b). Although all starting loop structures were generated randomly, their average RMSD values were all below 9 Å due to the chain connectivity to the nonloop part of the protein. When we compared the decoy loop structures from the Jacobson set with the first bank structures, we observed that the average RMSD values of Jacobson decoys were significantly less than those of the first bank conformations, illustrating that Jacobson decoys are biased toward their native loop structures. This is a huge contrast to the case of the RAPPER set where the decoys were of similar quality to the first bank conformations (see below). The average RMSD values of the final bank conformations are shown in Figure 1(c). For the first bank conformations, the lowest energy conformations selected by ECASP10ab were of higher RMSD values than those selected by EDFIRE. However, for the final bank conformations, the results show that model selection by ECASP10ab was better than that by EDFIRE for loops of 9 residues and longer, whereas the opposite was true for loops of 8 residues and shorter. This suggests that ECASP10ab is more efficient than EDFIRE for longer loops. To evaluate the contribution of five of the energy terms (EDFIRE,spline, EDFAdis, EDFAang, EDFAnei, and Ehbond) in eq 1 separately, we considered modified energy functions where each of the five terms was removed from ECASP10ab in the CSA sampling procedure. Figure 1(d) shows the results. The average RMSD value of the lowest energy conformations became worse when each of EDFAdis, EDFAang, and EDFIRE,spline energy terms were taken out of ECASP10ab one at a time. However, when the EDFAnei or Ehbond energy term was removed one at a time, the average RMSD value improved. These results suggested that EDFAdis, EDFAang, and EDFIRE,spline work better for loop modeling whereas EDFAnei and Ehbond work against it. On the basis of this observation, we decided to exclude EDFAnei and Ehbond from ECASP10ab to define EPLM, where PLM stands for protein loop modeling. Jacobson Set B. We used the second half of the Jacobson decoy set (Jacobson set B) as the first benchmark set for EPLM. Figure 1(e) shows the result, where the average RMSD value was improved by 5−30% in screening decoys for all loop sizes by replacing ECASP10ab with EPLM. In addition, EPLM appears to select more nativelike decoy structures than DFIRE, dDFIRE,

Figure 2. Loop modeling result of 1kuh (12-residue loop size) from Jacobson decoy set B. The loop structures generated using EPLM and EDFIRE are shown in yellow and cyan, respectively, and the native loop structure is shown in magenta. The gray part represents the non-loopfixed substructure of the protein.

structure of 1kuh is exposed, EDFIRE predicted a buried loop structure. In many cases, we observed that EDFIRE prefers more densely packed conformations than EPLM. Thus, we investigated the solvent accessibility (SA) of the loop structures with the lowest EPLM, of the loop structures with the lowest EDFIRE, and of native loops using the DSSP program.52 Figure 3 shows three-way comparisons of the SA of the lowest energy structures among EPLM-generated loops, EDFIRE-generated loops, and native loops. As obviously noted in the figures, SA between native and EPLM-generated loops correlates reasonably well with correlation coefficients of 0.943. On the other hand, SA between native and EDFIRE-generated loops were systematically reduced with correlation coefficients of 0.873. This result E

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 3. Three-way comparison of solvent accessibility (SA) of all loop targets for Jacobson decoy set B: (a) lowest EPLM vs lowest EDFIRE, (b) native vs lowest EDFIRE, and (c) native vs lowest EPLM. Linear fits are shown by green lines with their slopes indicated. We observed a close agreement of SA between loops selected by EPLM and native loops.

Figure 4. RMSDs (Å) of the lowest energy decoys selected by DFIRE, ECASP10ab, EPLM, dDFIRE, and GOAP are shown for the RAPPER set as a function of the loop size. The average and lowest RMSDs of all decoys are also shown. RMSD was calculated using all heavy atoms without performing superimposition between a decoy and its native structure. All RMSDs are averaged over the total number of loop targets of a given loop size. (b) RMSDs of the lowest energy first bank conformations from CSA calculations are shown. The average and lowest RMSDs of first bank conformations are also shown. For the CSA calculation, three energy functions EPLM, ECASP10ab, and EDFIRE, were used. We observe that the average RMSD values of decoys are similar to those of average first bank conformations suggesting that RAPPER decoys were generated in a random fashion. This should be contrasted to the case of Jacobson set (see Figure 1(a and b)). (c) RMSDs of the lowest energy final bank conformations from CSA calculations are shown for EDFIRE, ECASP10ab, and EPLM. The average and lowest RMSDs are also shown.

F

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling shows that EDFIRE favors more densely packed structures than EPLM, which led to poorer loop modeling by EDFIRE on average. RAPPER Decoy Set. The whole RAPPER decoy set was used as the second benchmark set for testing the ECASP10ab, EDFIRE, and EPLM energy functions. For decoys, we also included the results of DFIRE, dDFIRE, and GOAP. We observed that, in selecting more nativelike structures from decoys, EPLM performed marginally better than the other statistical potentials tested. In the case of 12-residue-long loops, the result of ECASP10ab was worse than that of DFIRE. The worse performance of ECASP10ab for 12-residue-long loops was mainly due to the failures of two targets (1tca and 451c), which showed significantly higher RMSD values than those of the DFIRE results (for 1tca, the DFIRE/ECASP10ab result was 4.459/11.487 Å; for 45lc, the DFIRE/ECASP10ab result was 4.041/13.365 Å). Unlike the results of the Jacobson set, even when the EPLM energy function was used, the results of the lowest energy decoy selection test were similar to those of ECASP10ab (Figure 4(a)). As in the Jacobson decoy set, we performed energy calculations and the lowest energy comparison with dDFIRE and GOAP energy functions. The result was not much different with the other three energy function results (ECASP10ab, EDFIRE, EPLM). The reason for this is because the RAPPER decoy set has more distantly related structures to the native structure than does the Jacobson decoy set. When the ab initio loop modeling was performed, the overall average RMSD values of the first bank conformations selected by ECASP10ab were higher than those by EDFIRE (Figure 4(b)). The average RMSD values of the final bank conformations selected by ECASP10ab were consistently lower than those by EDFIRE for loops of sizes 9−12 residues, whereas the results were mixed for smaller loops (Figure 4(c)). These results show that the loop conformation sampled using ECASP10ab can lead to more accurate results for longer loops than by using EDFIRE. When E PLM was used to select the lowest energy conformation among the first bank conformations, the average RMSD value was lowered by 40−57% for all loop sizes. However, when ECASP10ab was used among the final bank conformations, the average RMSD value was lowered by 1− 32% for all loop sizes (Figure 4(b and c)). From Figure 4(c), we observed that the improvement of EPLM over that of ECASP10ab was minimal for 10-residue loops 1nfp and 1onc being the only targets responsible. For 1nfp, the lowest energy conformation by EPLM had an RMSD value of 11.84 Å, and that by ECASP10ab was 8.14 Å. (EDFIRE also failed with an RMSD value of 10.43 Å.) For 1onc, the lowest energy conformation by EPLM had an RMSD value of 5.51 Å, and that by ECASP10ab was 1.83 Å. (EDFIRE also failed with an RMSD value of 6.06 Å.) Table S2 shows the average RMSD value of each loop size in detail. EPLM worked better than EDFIRE for all loop sizes when searching more nativelike conformations. The scatter plot of RMSD values of loop models predicted by EPLM and EDFIRE is shown in Figure S2(b). The worst loop modeling cases with EPLM, as compared to EDFIRE, for 12-residue loops were 154l and 1srp (see Figure S4). The RMSD values of the lowest EDFIRE conformations of 154l and 1srp were 2.90 and 1.02 Å, respectively, and those of the lowest EPLM conformations were 4.16 and 3.00 Å, respectively. In the case of 154l, the fourth and fifth residues (red-colored S156 and Y157) of the loop are tilted to the inside of the protein so that the EPLM model structure was predicted differently with the crystal structure. In the case of 1srp, the fourth, fifth, and sixth residues (red-

colored G311, G312, and L313) of the loop are twisted conversely, which led to the poor prediction result. Figure 5 shows an example of accurately predicted 12residue-long loop structures by EPLM (451c, H16∼Y27). Similar

Figure 5. Loop modeling result of 451c loop target (12-residue loop size) from the RAPPER set. Loop structures predicted by EPLM and EDFIRE are yellow and cyan, respectively, and the native structure is magenta. The gray color represents the fixed substructure of the protein.

to the case of 1kuh (12-residue loops) in the Jacobson decoy set discussed above, EDFIRE predicted a more buried loop structure as the lowest energy structure, far worse than the structure predicted by EPLM (RMSD by EPLM was 0.84 Å; RMSD by EDFIRE was 8.03 Å). Figure 6 shows three-way comparisons of SA of the lowest energy structures among EPLMgenerated, DFIRE-generated, and native loops. SA between native and EPLM-generated loops correlates reasonably well with correlation coefficients of 0.932. On the other hand, SA between native and EDFIRE-generated loops were systematically reduced with correlation coefficients of 0.839. This result again shows that EDFIRE favors more densely packed structures than does EPLM, which led to poorer loop modeling by EDFIRE on average. Sampling and Accuracy of Energy Issues. Separating the sampling and accuracy issues of the energy function used for sampling is a very important problem but hard to properly execute. When dealing with a given set of decoys generated by others (such as Jacobson set B or the RAPPER set), we have no control of either the robustness of the sampling or the identity of the energy function used. For this reason, we examined CSAgenerated lowest energy EPLM loops along with nativeminimized EPLM loops plus minimized decoys. For each loop target of Jacobson set B, we compared the EPLM value of the lowest CSA-generated structure and those of the nativeminimized structure and minimized decoys. As shown in Figure S5(A), for all cases without an exception, the lowest CSAgenerated structures were of lower EPLM values with an average energy gap of ∼103.8 kcal/mol (range of 1.4−505.0 kcal/mol). For the RAPPER set, the result was similar, as shown in Figure S5(B), with an average energy gap of ∼94.1 kcal/mol ranging (range of 2.7−350.5). This clearly demonstrates that the CSA sampling is quite robust for generating more optimal loops in terms of EPLM than native or decoy loops. Therefore, although EPLM is shown to be more accurate for loop modeling than G

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 6. Three-way comparison of solvent accessibility (SA) of all loop targets for the RAPPER decoy set: (a) lowest EPLM vs lowest EDFIRE, (b) native vs lowest EDFIRE, (c) native vs lowest EPLM. Linear fits are shown by green lines, and their slopes are indicated. We observed close agreement of SA between loops selected by EPLM and native loops.

EDFIRE or ECASP10ab, in this study, EPLM is also not ideal. On the other hand, when we compared the EPLM values of either the native loops or native-minimized loops against the lowest corresponding counterparts from the decoys, approximately 30 and 74% of the native loops and approximately 10 and 27% of the native-minimized loops were of lower EPLM values than the lowest EPLM loops of Jacobson set B and the RAPPER set, respectively. We believe this was due to the inefficient coverage of the provided decoys. Correlation Coefficient (CC) between Energy and RMSD. We calculated the average CC between energies and RMSD values by using the final bank conformations from the CSA sampling after filtering out high-energy outliers. For Jacobson loop targets, the average CC of EPLM (0.672) was higher than that of EDFIRE (0.535) (Figure 7(a) and Table S3). The results of RAPPER loop targets also showed similar results (Figure 7(b) and Table S3). The average CC of EPLM (0.724) was higher than that of EDFIRE (0.553). These results show that nativelike loop structures can be sampled with EPLM more than with EDFIRE. Success Rate. We evaluated the quality of the loop modeling methods by calculating the success rates. Here, the success rate is defined as the fraction of loop targets for which the lowest energy modeled structures are of an RMSD below a certain cutoff value. The results are shown in Figure 8 using five RMSD cutoff values from 1.0 to 5.0 Å, where the results of EPLM and EDFIRE are shown by blue filled symbols and open red symbols, respectively. For Jacobson set B with a cutoff value of 1.0 Å, the results were mixed. However, in all the other cases, including the RAPPER set, EPLM outperformed EDFIRE in terms of the success rate. Clustering of Bank Conformations. Even though EPLM showed good performance for loop structure prediction, the lowest energy conformation did not always correspond to the lowest RMSD conformation for a given decoy set. To alleviate this problem, we performed a clustering53 of the final bank conformations by using the k-means clustering algorithm based on their pairwise RMSD values. The final bank conformations were divided into five clusters; we then chose the lowest energy conformation from each cluster. The results of the best of these five lowest energy conformations are shown in Figure 9. The RMSD results were improved between 7% (for 4- and 12residue loops) and 29% (for 11-residue loops) in the Jacobson set and between 3% (for 3-residue loops) and 25% (for 10- and

Figure 7. Comparison of the average correlation coefficient for (a) Jacobson set B and (b) the RAPPER set. High energy outliers were filtered out before the calculation.

12-residue loops) in the RAPPER set. The average RMSD value was improved by approximatey 15% by considering the five loop models. H

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 9. Clustering results of the final bank conformations for (a) Jacobson set B and (b) the RAPPER set. The best out of five refers to the best loop structure out of the five lowest energy loops selected from five clusters.

Figure 8. Success rate for different RMSD cutoffs from 1.0 to 5.0 Å for (a) Jacobson decoy set B and (b) the RAPPER decoy set.

4. DISCUSSION In this paper, we have tested the performance of a newly introduced energy function, EPLM, for protein loop modeling. The assessment results on the Jacobson and RAPPER decoy sets show significant improvements in both identifying a more nativelike loop conformation among decoys and modeling the loop structure starting from random structures. For the ab initio modeling of a loop, the average RMSD values of the lowest EPLM conformations were lower than those of EDFIRE for all loop sizes. Furthermore, there were no decoy models that had lower total energy values than the lowest energy bank conformations, suggesting that the CSA sampling and the minimization procedures were performed properly. For identifying more nativelike decoy loop structures, EPLM clearly outperformed EDFIRE on the Jacobson target set, whereas it was less decisive on the RAPPER target set. This difference in the performances according to decoy set can be explained by the varying characteristics of the two decoy sets. The average RMSD values of the RAPPER decoys are higher than those of the Jacobson decoys (see Figures 1(a) and 4(a)). That is, the Jacobson decoy set was biased toward more nativelike structures than was the RAPPER decoy set. In addition, we observed that most decoys of the RAPPER set had unfavorable van der Waals energy values, meaning that the RAPPER decoys

were of atomic steric clashes, whereas this was not the case for the Jacobson set. Consequently, the correlation coefficient and the selection accuracy by EPLM were relatively poor for the RAPPER set. This limitation was somewhat resolved by carrying out the CSA calculations where more nativelike structures had been identified.

5. CONCLUSIONS Protein loop structures play important roles in various biological phenomena; consequently, accurate loop structure prediction is required in many studies. Here, we investigated a newly developed energy function, EPLM, for accurate protein loop structure determination. By carrying out benchmarking tests on two popular loop decoy sets, we have shown that EPLM can identify and generate more nativelike loop structures. Although many challenges still remain in protein loop modeling, the current study can serve as a useful tool and benchmarking result for further development in the study of protein loop modeling. I

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling



(12) Abagyan, R.; Totrov, M. Biased Probability Monte Carlo Conformational Searches and Electrostatic Calculations for Peptides and Proteins. J. Mol. Biol. 1994, 235, 983−1002. (13) Arnautova, Y. A.; Abagyan, R. A.; Totrov, M. Development of a New Physics-Based Internal Coordinate Mechanics Force Field and Its Application to Protein Loop Modeling. Proteins: Struct., Funct., Genet. 2011, 79, 477−98. (14) Frenkel, D.; Mooij, G. C. A. M.; Smit, B. Novel Scheme to Study Structural and Thermal Properties of Continuously Deformable Molecules. J. Phys.: Condens. Matter 1992, 4, 3053−3076. (15) Zhu, K.; Day, T. Ab initio Structure Prediction of the Antibody Hypervariable H3 Loop. Proteins: Struct., Funct., Genet. 2013, 81, 1081−1089. (16) Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J. Comput. Chem. 1983, 4, 187−217. (17) Spassov, V. Z.; Flook, P. K.; Yan, L. LOOPER: A Molecular Mechanics-Based Algorithm for Protein Loop Prediction. Protein Eng., Des. Sel. 2008, 21, 91−100. (18) Jamroz, M.; Kolinski, A. Modeling of Loops in Proteins: A Multimethod Approach. BMC Struct. Biol. 2010, 10, 5. (19) Gray, J. J.; Moughon, S.; Wang, C.; Schueler-Furman, O.; Kuhlman, B.; Rohl, C. A.; Baker, D. Protein − Protein Docking with Simulations Optimization of Rigid-Body Displacement and Side-Chain Conformations. J. Mol. Biol. 2003, 331, 281−299. (20) Schueler-Furman, O.; Wang, C.; Bradley, P.; Misura, K.; Baker, D. Progress in Modeling of Protein Structures and Interactions. Science 2005, 310, 638−642. (21) Rohl, C. A.; Strauss, C. E.; Chivian, D.; Baker, D. Modeling Structurally Variable Regions in Homologous Proteins with Rosetta. Proteins: Struct., Funct., Genet. 2004, 55, 656−677. (22) Zhu, K.; Pincus, D.; Zhao, S.; Friesner, R. A. Long Loop Prediction using the Protein Local Optimization Program. Proteins: Struct., Funct., Genet. 2006, 65, 438−452. (23) Deane, C. M.; Blundell, T. L. CODA: A Combined Algorithm for Predicting the Structurally Variable Regions of Protein Models. Protein Sci. 2001, 10, 599−612. (24) Peng, H. P.; Yang, A. S. Modeling Protein Loops with Knowledge-Based Prediction of Sequence-Structure Alignment. Bioinformatics 2007, 23, 2836−2842. (25) Choi, Y.; Deane, C. M. FREAD Revisited: Accurate Loop Structure Prediction using a Database Search Algorithm. Proteins: Struct., Funct., Genet. 2010, 78, 1431−40. (26) Michalsky, E.; Goede, A.; Preissner, R. Loops In Proteins (LIP)a Comprehensive Loop Database for Homology Modeling. Protein Eng., Des. Sel. 2003, 16, 979−985. (27) Holtby, D.; Li, S. C.; Li, M. LoopWeaver-Loop Modeling by the Weighted Scaling of Verified Proteins. J. Comput. Biol. 2013, 20, 212− 223. (28) De, L. J. Applications of Convex Analysis to Multidimensional Scaling; North Holland Publishing Company, 1977; pp 133−146. (29) Jacobson, M. P.; Pincus, D. L.; Day, T. J. F.; Rapp, C. S.; Honig, B.; Shaw, D. E.; Friesner, R. A. A Hierarchical Approach to All-Atom Protein Loop Prediction. Proteins: Struct., Funct., Genet. 2004, 55, 351− 367. (30) Fiser, A.; Do, R. K. G.; Sali, A. Modeling of Loops in Protein Structures. Protein Sci. 2000, 9, 1753−1773. (31) Joo, K.; Lee, J.; Sim, S.; Lee, S. Y.; Lee, K.; Heo, S.; Lee, I. H.; Lee, S. J.; Lee, J. Protein Structure Modeling for CASP 10 by Multiple Layers of Global Optimization. Proteins: Struct., Funct., Genet. 2014, 82, 188−195. (32) Lee, J.; Scheraga, H. A.; Rackovsky, S. New Optimization Method for Conformational Energy Calculations on Polypeptides: Conformational Space Annealing. J. Comput. Chem. 1997, 18, 1222− 1232. (33) Lee, J.; Gross, S. P.; Lee, J. Modularity Optimization by Conformational Space Annealing. Phys. Rev. E 2012, 85, 056702−1−5.

ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00742. Average RMSDs and correlation coefficients, pairwise ttests, scatter plots, and loop modeling results (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Jooyoung Lee: 0000-0002-4432-6163 Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2008-0061987). Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We thank Dr. Yoonjoo Choi for valuable discussions and Andrew Brooks for his critical reading of and suggestions for this manuscript. We thank the Korea Institute for Advanced Study for providing computing resources (KIAS Center for Advanced Computation).



REFERENCES

(1) Xiang, B. Q.; Jia, Z.; Xiao, F. X.; Zhou, K.; Liu, P.; Wei, Q. The Role of Loop 7 in Mediation Calcineurin Regulation. Protein Eng., Des. Sel. 2003, 16, 795−798. (2) Shi, L.; Javitch, J. A. The Second Extracellular Loop of the Dopamine D2 Receptor Lines the Binding-Site Crevice. Proc. Natl. Acad. Sci. U. S. A. 2004, 101, 440−445. (3) Maity, H.; Rumbley, J. N.; Englander, S. W. Functional Role of a Protein Foldon - An Omega-Loop Foldon Controls the Alkaline Transition in Ferricytochrome C. Proteins: Struct., Funct., Genet. 2006, 63, 349−355. (4) Valdez, C. E.; Sparta, M.; Alexandrova, A. N. The Role of the Flexible L43-S54 Protein Loop in the CcrA Metallo-â-Lactamase in Binding Structurally Dissimilar â-Lactam Antibiotics. J. Chem. Theory Comput. 2013, 9, 730−737. (5) Viguera, A.; Serrano, L. Different Folding Transition States May Result in the Same Native Structure. Nat. Struct. Biol. 1997, 4, 939− 946. (6) Gautam, J. K.; Ashish; Comeau, L. D.; Krueger, J. K.; Smith, M. F., Jr. Structural and Functional Evidence for the Eole of the TLR2 DD Loop in TLR1/TLR2 Heterodimerization and Signaling. J. Biol. Chem. 2006, 281, 30132−30142. (7) Bastard, K.; Prevost, C.; Zacharias, M. Accounting for Loop Flexibility During Protein-Protein Docking. Proteins: Struct., Funct., Genet. 2006, 62, 956−969. (8) Miura, T.; Nishinaka, T.; Terada, T. Importance of the SubstrateBinding Loop Region of Human Monomeric Carbonyl Reductases in Catalysis and Coenzyme Binding. Life Sci. 2009, 85, 303−308. (9) Wang, Y.; Berlow, R. B.; Loria, J. P. Role of Loop-Loop Interactions in Coordinating Motions and Enzymatic Function in Triosephosphate Isomerase. Biochemistry 2009, 48, 4548−4556. (10) Liu, J.; Tan, H.; Rost, B. Loopy Proteins Appear Conserved in Evolution. J. Mol. Biol. 2002, 322, 53−64. (11) Arnautova, Y. A.; Jagielska, A.; Scheraga, H. A. A New Force Field (ECEPP-05) for Peptides, Proteins, and Organic Molecules. J. Phys. Chem. B 2006, 110, 5025−5044. J

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling (34) Okur; Miller, B. T.; Joo, K.; Lee, J.; Brooks, B. R. Generating Reservoir Conformations for Replica Exchange through the Use of the Conformational Space Annealing Method. J. Chem. Theory Comput. 2013, 9, 1115−1124. (35) Shin, W.; Heo, L.; Lee, J.; Ko, J.; Seok, C.; Lee, J. LigDockCSA: Protein-Ligand Docking Using Conformational Space Annealing. J. Comput. Chem. 2011, 32, 3226−3232. (36) Joo, K.; Lee, J.; Kim, I.; Lee, S.; Lee, J. Multiple Sequence Alignment by Conformational Space Annealing. Biophys. J. 2008, 95, 4813−4819. (37) MacKerell, A. D. J.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M. All-Atom Empirical Potential for Molecular Modeling and Dynamics Studies of Proteins. J. Phys. Chem. B 1998, 102, 3586−3616. (38) Sali, A.; Blundell, T. L. Comparative Protein Modelling by Satisfaction of Spatial Restraints. J. Mol. Biol. 1993, 234, 779−815. (39) Moult, J.; Pedersen, J. T.; Judson, R.; Fidelis, K. A Large-Scale Experiment to Assess Protein Structure Prediction Methods. Proteins: Struct., Funct., Genet. 1995, 23, ii−iv. (40) Sasaki, T.; Sasai, M. A Coarse-Grained Langevin Molecular Dynamics pAproach to Protein Structure Reproduction. Chem. Phys. Lett. 2005, 402, 102−106. (41) Lee, J.; Lee, J.; Sasaki, T. N.; Sasai, M.; Seok, C.; Lee, J. De novo Protein Structure Prediction by Dynamic Fragment Assembly and Conformational Space Annealing. Proteins: Struct., Funct., Genet. 2011, 79, 2403−2417. (42) Zhang, C.; Liu, S.; Zhou, Y. Accurate and Efficient Loop Selections by the DFIRE-based All-Atom Statistical Potential. Protein Sci. 2004, 13, 391−399. (43) Kortemme, T.; Morozov, A. V.; Baker, D. An OrientationDependent Hydrogen Bonding Potential Improves Prediction of Specificity and Structure for Proteins and Protein-Protein Complexes. J. Mol. Biol. 2003, 326, 1239−1259. (44) Zhou, H.; Skolnick, J. GOAP: A Generalized Orientation Dependent, All-Atom Statistical Potential for Protein Structure Prediction. Biophys. J. 2011, 101, 2043−2052. (45) Jacobson, M. P.; Pincus, D. L.; Rapp, C. S.; Day, T. J. F.; Honig, B.; Shaw, D. E.; Friesner, R. A. A Hierarchical Approach to All-Atom Loop Prediction. Proteins: Struct., Funct., Genet. 2004, 55, 351−367. (46) DePristo, M. A.; de Bakker, P. I. W.; Lovell, S. C.; Blundell, T. L. Ab initio Construction of Polypeptide Fragments: Efficient Generation of Accurate, Representative Ensembles. Proteins: Struct., Funct., Genet. 2003, 51, 41−55. (47) de Bakker, P. I. W.; DePristo, M. A.; Burke, D. F.; Blundell, T. L. Ab initio Construction of Polypeptide Fragments: Accuracy of Loop Decoy Discrimination by an All-Atom Statistical Potential and the AMBER Force Field with the Generalized Born Solvation Model. Proteins: Struct., Funct., Genet. 2003, 51, 21−40. (48) Nocedal, J. Updating Quasi-Newton Matrices with Limited Storage. Math. Comput. 1980, 35, 773−782. (49) Liu, D. C.; Nocedal, J. On the Limited Memory BFGS Method for Large Scale Optimization. Math. Program 1989, 45, 503−528. (50) Nocedal, J.; Wright, S. J. Numerical Optimization. SpringerVerlag: New York. 1999, Section 9.1. (51) Yang, Y.; Zhou, Y. Specific Interactions for Ab initio Folding of Protein Terminal Regions with Secondary Structures. Proteins: Struct., Funct., Genet. 2008, 72, 793−803. (52) Kabsch, W.; Sander, C. Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 1983, 22, 2577−2637. (53) MacQueen, J. B. Some Methods for Classification and Analysis of Multivariate Observations. Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1967, 1, 281−297.

K

DOI: 10.1021/acs.jcim.6b00742 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX