SURPASS Low-Resolution Coarse-Grained Protein Modeling

Oct 9, 2017 - On the other hand, when starting from template-derived and even quite inaccurate initial models, but with generally correct topology, CG...
0 downloads 12 Views 8MB Size
This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

Article pubs.acs.org/JCTC

Cite This: J. Chem. Theory Comput. 2017, 13, 5766-5779

SURPASS Low-Resolution Coarse-Grained Protein Modeling Aleksandra E. Dawid,† Dominik Gront,† and Andrzej Kolinski*,† †

Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland S Supporting Information *

ABSTRACT: Coarse-grained modeling of biomolecules has a very important role in molecular biology. In this work we present a novel SURPASS (Single United Residue per Pre-Averaged Secondary Structure fragment) model of proteins that can be an interesting alternative for existing coarse-grained models. The design of the model is unique and strongly supported by the statistical analysis of structural regularities characteristic for protein systems. Coarse-graining of protein chain structures assumes a single center of interactions per residue and accounts for preaveraged effects of four adjacent residue fragments. Knowledge-based statistical potentials encode complex interaction patterns of these fragments. Using the Replica Exchange Monte Carlo sampling scheme and a generic version of the SURPASS force field we performed test simulations of a representative set of single-domain globular proteins. The method samples a significant part of conformational space and reproduces protein structures, including native-like, with surprisingly good accuracy. Future extension of the SURPASS model on large biomacromolecular systems is briefly discussed.

1. INTRODUCTION Molecular modeling is very important in the investigation of protein structure, dynamics, and interactions.1 Classical molecular dynamics (MD) simulations are commonly used in structural biology, rational drug design, and other areas of life science, although their range of applicability is limited by the time scale and complexity of biomacromolecular processes. The “holey grail” of theoretical molecular biology is to predict threedimensional native structures of proteins and the mechanism of their assembly based on their amino acid sequences.2,3 Using an MD dedicated supercomputer it is now possible to simulate the entire folding pathway of small and rapidly folding proteins.4 The time scales of assembly for larger proteins, association with other biomacromolecules, protein aggregation, and many other biological phenomena are usually several orders of magnitude longer. This is the reason for the increasing role of coarsegrained (CG) modeling approaches.5 Some of the most successful CG models are targeted onto structure prediction, and when combined with appropriate bioinformatics tools, such as Rosetta6 or I-Tasser,7 they could be very efficient. Other models are more universal and enable not only structure prediction but also studies of protein dynamics.8,9 Typical examples of such medium-resolution CG models are UNRES10 or CABS.11 In this class of CG models a single amino acid residue is usually replaced by 2, 3, or 4 united atoms, representing a larger number of atoms forming the real structure. Such a level of coarse-graining provides computational speed-up of about 3 orders of magnitude in comparison with all-atom simulations, still allowing realistic reconstruction of the atomistic details at selected points of simulation trajectories.12 This way it is possible to build quite effective multiscale modeling strategies. Medium-resolution CG models significantly expand the time scale and system size of molecular modeling; however, as it was observed in successive CASP (critical assessment of protein © 2017 American Chemical Society

structure prediction methods) experiments, they struggle with de novo (without homologous templates) modeling of larger structures.13,14 With increasing size and topological complexity of protein structures the necessary time scale of protein folding simulations simply grows very rapidly. On the other hand, when starting from template-derived and even quite inaccurate initial models, but with generally correct topology, CG methods usually generate good resolution of final models.5 Modeling protein complexes and their interactions with other biomacromolecules is even more demanding.5 Therefore, to expand the range of de novo modeling of protein structure and dynamics, an efficient tool is needed for very fast simulations of low-resolution structures.15 This is one of the reasons why models such as SURPASS (single united residue per pre-averaged secondary structure fragment) proposed here can be very useful as part of multiscale molecular modeling schemes. In such a scheme SURPASS simulations can provide a collection of protein-like low-resolution starting structures, and these could be for instance used as an input to replica exchange simulations with an intermediate resolution CG model (for example, CABS). Intermediate resolution structures can be finally subjected to all-atom reconstruction16 and MD refinement/scoring simulations.17 Efficient reconstruction and refinement of coarse-grained structures can significantly benefit from the good knowledge of structural regularities characteristic for proteins.5,18,19 Several multiscale strategies based on these two levels of resolution have been recently explored.5,20 SURPASS is a low-resolution model. Models of similar resolution,21 although computationally more expensive, have proven to be quite productive in the studies of protein structure and near-native dynamics.22−25 The low-resolution of the SURPASS model makes it rather impractical as a basic tool for Received: June 20, 2017 Published: October 9, 2017 5766

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation

Figure 1. Schematic illustration of the projection from all-atom to coarse-grained SURPASS representation for two protein fragments: helical (left) and β-strand (right). Panel A illustrates an idealized “ribbon diagram” (see ref 26 for an overview), while panel B represents the all-atom structure of example fragments. Panel C shows the corresponding α-carbon chains. Panel D illustrates the idea of the four-residue averaging of α-carbon positions, leading to the SURPASS pseudoresidue representation. The resulting SURPASS chains (green for helical fragments and blue for the beta strand) superimposed onto the original α-carbon traces are shown in the last row (panel E). Note the different radii of the pseudoresidues representing helical and beta fragments reflecting different thicknesses of such fragments in real (atomistic) structures.

structure. The number of pseudoresidues representing protein structure corresponds to protein size and is equal to N − 3, where N is the length of an amino acid sequence. The concept of SURPASS representation is explained in Figure 1. The main idea behind the model is based on unique generalization of the local geometry of a polypeptide chain. Namely, positions of pseudoresidues are defined by averaging the coordinates of the four consecutive alpha carbons along the chain. These fourresidue fragments are replaced by a single center of interactions. Due to the single residue frame shift each α-carbon in the modeled chain (except for the chain ends) contributes to the definition of four consecutive SURPASS pseudoresidues. The choice of four-residue averaging is crucial for the geometry of the model. In contrast to other short fragments of different lengths, only the four-residue averaging leads to an almost linear shape of the SURPASS fragments representing helices or beta strands (see Figure 1). This feature of the model results in very simple and effective sampling schemes. Depending on the assumed definitions of interactions between the SURPASS united residues, different sampling strategies can be used, including Molecular Dynamics protocols, Monte Carlo methods, or various combined sampling schemes. The SURPASS representation assumes three types of pseudoatoms depending on secondary structure assignment (or prediction) of the averaged fragments of the modeled chain: a. pseudoatom H (for helical fragments) localized in the centers of mass of four consecutive α-carbons representing helical (HHHH) or almost helical (HHHC, CHHH, EHHH, HHHE) fragments in the atomistic models

structure prediction. Nevertheless, the model can provide large sets of low-resolution protein-like structures that can be used as starting sets for more accurate methods. Due to its computational efficiency SURPASS can be used for modeling long-time dynamics and large-scale structural transitions in protein systems that are significantly bigger than those tractable by the coarse-grained modeling tools of higher resolution.5 The rest of the paper is organized as follows. In the Methods section we present the coarse-grained SURPASS representations of protein chains, and then we discuss the knowledgebased force field and finally the sampling scheme used in simulations. The description of the force field is divided into two basic sections. The first one describes the generic part of the force field that can be applied to a variety of protein systems. It consists of potentials enhancing regular secondary structures, other short-range (along the sequence) interactions, model of hydrogen bonds, and potentials that control the local packing of pseudoresidues. The full set of parameters of this minimalistic force field is available on our homepage: http:// biocomp.chem.uw.edu.pl/tools/surpass. The second section of the force field description presents a simple patch used in simulations of single-domain globular proteins. The Results and Discussion section describes Monte Carlo simulations of a small but representative set of globular proteins of various sizes. The main conclusions and expected future applications of the SURPASS models are summarized in Conclusions.

2. METHODS 2.1. Representation of Protein Structure. SURPASS is a low-resolution deeply coarse-grained model of protein 5767

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation

that appear to be appropriate for a broad class of protein systems (single-domain, multidomain, globular, and membrane proteins etc.). In the next section a simple modification, or a patch, of the interaction model targeted onto single-domain globular proteins is presented. We believe that the generic force field described here can be useful in many applications and very easily implemented by other researchers in SURPASS-like or related coarse-grained models. Deriving reliable and universal statistical potentials requires the use of a suitably selected subset of protein structures from the Protein Data Bank (PDB). In this work we used the PISCES subset.30 This database contains 4600 polypeptide chains with lengths of 20 to 1193 amino acid residues, which are representative and nonredundant subsets of all kinds of known protein structures. The resolution of PISCES structures is not worse than 1.6 Å, and the similarity of sequences is not greater than 60%. A detailed specification is provided in the section SI1 of Supporting Information. Model of Secondary Structure. Distances and angles between atoms close along the sequence in polypeptide chains are highly restricted due to various short-range interactions.31 The deficiencies of atomic details in a strongly simplified and preaveraged SURPASS chain may result in incorrect local geometry of the structure. To avoid this, it is necessary to transfer the structural regularities of the atomistic models onto a corresponding restriction imposed onto united atoms. Planar Angle Restrictions. Planar angles between three consecutive united atoms in the SURPASS model correspond to the spatial relationship between six all-atom residues along the sequence. Due to the averaging of structure fragments the angle between the ith, (i + 1)th, and (i + 2)th united atom in helices and β-strands is close to 180°, which means that both types of regular structures are represented as almost collinear straight sticks (see Figure 3). Secondary structure elements require specific distance restrictions along the model chain (see potential R13 in the section Short-Range Interactions).

b. pseudoatom S (β-strand fragments) representing the centers of mass of EEEE, EEEC, or CEEE fragments c. pseudoatom C (coil-like) representing the center mass of four consecutive alpha carbons for all the remaining secondary structure combinations (H, E, and C). This definition of secondary structure in the SURPASS model leads to slight shortening (∼2.0 Å) of both helices and β-strands compared to the original structures and slight compressing of loop fragments. A continuous fragment of SURPASS helical pseudoatoms has a form of a straight, rigid, and tightly packed stick with partially overlapping beads representing united residues. United residues for β-strands represent centers of more separated α-carbons. Therefore, the distance between two consecutive type S or C pseudoresidues is almost twice as long as the distance in helical fragments. The simple “ball” protein visualization in this representation resembles the well-known “ribbon diagrams” (see Figure 2).

Figure 2. Visualization of protein structure (PDB code: 2gb1, chain A). Upper panel: all-atom representation by “ribbon diagram”; lower panel: SURPASS “ball-and-stick” model. Green, α-helix; cyan, loops; blue, β-sheet.

The coarse-grained representation of protein structure by the SURPASS model implies several directions for designing its force field. We opt for the knowledge-based approach and specific definitions of the statistical potentials describing interactions between united residues. Solvent is treated in an implicit fashion, and its effects (water with other small molecules or a membrane environment for transmembrane proteins) are included directly in the statistical potentials that describe interactions between the united residues. This is a strategy proven to be quite productive in other, higher resolution coarse-grained protein models.5,11,13,27 2.2. Generic Terms of the SURPASS Force Field. Designing and derivation of knowledge-based force fields of coarse-grained models is always a key point for model performance.5,28,29 In this paper we describe and test only the generic terms that are directly sequence-independent, while a more general model of interactions will be presented in a separate report. In the minimalistic force field presented here, the sequence effects are encoded only nondirectly via secondary structure assignment. First we describe the basic generic terms

Figure 3. α-helical and β-strand type fragments. In the SURPASS model angular regularities observed in Cα traces are translated into simple distance preferences encoded in the R13 potential.

Short-Range Interactions. It is assumed that the local geometry of a secondary structure fragment is to a large extent encoded in the distance distribution of SURPASS pseudoatoms. This applies to pseudoresidues at a distance of n = 1, 2, 3, 4 positions along the chain. All generic terms R12, R13, R14, and R15 are prepared in six variants (HH, EE, CC, HE, HC, EC) depending on the secondary structure assignments for pairs of residues located at key positions (see Chart 1 in section 5768

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation

Figure 4. Plots of statistical potentials of short-range interactions for various secondary structure assignments (listed in the legend on the right) for pairs of residues: A, R12; B, R13; C, R14; D, R15.

smaller than expected are penalized with energy proportional to the product of helix size and deviation from the expected value. For more details see the Table 1 in section SI3 of Supporting Information (see Parameter Nos. 5.7−5.9). Model of Hydrogen Bonds. In the SURPASS model hydrogen bonds between residues close to each other along the chain, such as those in the helices, are treated in an implicit manner by angular preferences for secondary structure.34 Hydrogen bonds between residues that are distant in the sequence, especially in extended structure fragments, are modeled more directly. As a result of averaging the secondary structure fragments, regardless of the mutual directions of the β-strands, the pattern of connections is very regular, and interacting pseudoresidues are regularly packed (see Figure 6). The organization of space packing of the extended secondary structure elements in SURPASS representation is extremely important for enforcing protein-like topology. The imposition of restrictions has to not only ensure regularity of the pattern but also avoid the formation of nonphysical pseudobonds between orthogonal beta sheets. Therefore, the formation of model hydrogen bonds depends on the fulfillment of a few simple geometrical conditions: a. The length of the model hydrogen bond is in a range of 3.8 Å to 6.0 Å, and the most probable length is 4.65 Å (statistically the most frequent distance; see Figure 7A). b. The maximum number of connections for each pseudoresidue in the β-strand is 2, to maintain consistency and to avoid redundancy. If there are more potential candidates for hydrogen bond formation, the best two are chosen according to the following angular criterion: • a hydrogen bond should be perpendicular to the main chain of both interacting β-strands, and the

SI2 of Supporting Information). The secondary structure assignment can be taken from the structural database or predicted using appropriate bioinformatics tools. The distribution of the distance (R12) between ith and (i + 1)th beads shows that the average density of residues along helices (1.6 Å) is higher than in loops (2.2 Å) and more than double that of the β-strand (3.3 Å). Potentials R13 and R14 control not only the distance between united residues but also the planar and dihedral angles between corresponding united residues. All short-range interactions have been implemented in the force field as a potential of the mean field (PMF)32 using a onedimensional kernel density estimator (KDE)33 as a method for defining the density of empirical distribution. Energy function depends not only on the KDE value at the observed distance d but also on the probability of the secondary structure type of the first and the second interacting pseudoresidues (see Figure 4) (in simulations described in the present report values of 0 or 1 were used). To avoid nonphysical stretching of model chains an energy penalty increasing with square of the distance is added. The mathematical equations and the energy plot are shown in Table 1, Term 5, while the description of the individual parameters can be found in Table 1 in section SI3 of Supporting Information (see Parameter Nos. 5.1−5.6). In contrast to beta strands, the local distance potential does not control properly the stiffness of helices, and, therefore, helices require a somewhat different approach. Helices are the most compact secondary structure elements. This feature is even more evident in the SURPASS representation, where the helix looks like a stiff stick formed by densely packed and overlapping beads. Results of the statistical analysis of a set of representative protein structures show specific correlations between the size and the length of model helices (see Figure 5). These data helped us to derive a simple potential enhancing the helix stiffness of SURPASS chains. Lengths of helical fragments 5769

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation Table 1. Summary of Energy Terms in the SURPASS Basic Force Field

Figure 6. Hydrogen bond definition in the SURPASS β-sheet. (A) Main concept of the hydrogen bond scheme: a hydrogen bond can be formed only between the ith and jth pseudoresidues located in two different β-strands. (B) Pattern of hydrogen bonds in the β-sheet from protein 2gb1, chain A structure in SURPASS representation.

Figure 5. Correlation between the number of SURPASS pseudoresidues in a helix and its length measured in Å (X and Y axes, respectively).

permitted angle range is from 70° to 115° (see Figure 7B); • the maximum allowable twist of the β sheet, measured as the planar angle between the main chains of two adjacent β-strands, is not greater than 55° (or cos(α) greater than 0.574; see Figure 7C); • for a pseudoresidue that forms two hydrogen bonds (with two different β-strands), the planar angle between these bonds must be greater than

125°, and 180° is the best orientation (see Figure 7D). The statistics of hydrogen bond length for known protein structures in SURPASS representation are given by normal distribution. The strength of these bonds is only distancedependent, and the energy minima correspond to the optimal bond length, d = 4.65 Å. The choice of constant C in eq 4 in Table 1 leads to an energy minimum of −1.0 and function 5770

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation

Figure 7. Statistical analysis of the geometry of the model hydrogen bond. (A) Length of hydrogen pseudobonds extracted from the RDF of distance between ith and jth pseudoresidues in two beta strands. (B) Angle between two β-strands connected by a hydrogen bond: “single” if the ith residue participates in a single hydrogen bond, “double” if it has connections with two different β-strands. (C) Twist of the β-sheet measured as a planar angle between the main chains of two adjacent β-strands. (D) Angle between two hydrogen bonds of three connecting β-strands.

flattening at zero. In summary, the model hydrogen bonds are formed only if the bond length is greater than dMIN = 3.8 Å and less than dMAX = 6.0 Å and if all the angular conditions are satisfied. The exact description of these parameters can be found in the Table 1, section SI3, of Supporting Information (see Parameter Nos. 4.1−4.5). The application of the above definition of hydrogen bonds into the SURPASS representation of native structures reproduces quite accurately the majority of corresponding bonds observed in real atomistic structures (see Figure 8A).

proportional to the number of hydrogen bonds in real fullatom β-sheets. Coefficients of the linear relationship between the length of a protein chain and the number of hydrogen bonds in the β-sheets in both cases are similar. This confirms the validity of the chosen definition of pseudohydrogen bonds. 2.3. Local Packing Control. Generic Local Repulsive Interactions. Coarse-grained models, particularly with lowresolution and highly simplified representation, have a tendency to form excessively compressed structures. This is mostly due to the lack of atomic details, especially the lack of any explicit trace of side chains. To prevent the excessive and nonphysical collapse of a structure during the simulation, generic local repulsions are needed. Using the statistics derived from deeply buried elements of protein structures from the database, we estimated the number of neighboring SURPASS pseudoatoms in the central part of the model globules. Counts did not include ±3 subsequential neighbors along the chain. It has been estimated that, within a radius of 6 Å around each deeply buried type E residue, there are no more than 6 other residues, 2 for the type H pseudoresidue and 4 for a residue in the loop (see Figure 9). The cutoff of 6 Å appears to be an optimal choice since, in contrast to significantly larger values of cut-offs, it properly differentiates the packing patterns of various secondary structure elements. The energy function describing local repulsion interactions is an energy penalty and increases with the square of excess of neighbors in the sphere according to Formula 2 in Table 1 (see also Table 1, Parameter Nos. 2.1−2.2 in section SI3, of Supporting Information). Hard-Core Repulsions. To avoid steric clashes between pseudoresidues distant in sequence but close in space, the excluded volume cutoff is needed. This component of longrange interactions has a form of hard-core repulsion between

Figure 8. Statistics of main chain hydrogen bonds in proteins. (A) Relationship between the length of extended structures and the number of hydrogen bonds observed in all-atom structures and in their SURPASS representations (red and blue, respectively). (B) Fraction of hydrogen bond reconstruction by the SURPASS model.

Consistent reduction of the number of observed native hydrogen bonds in the SURPASS model by almost 30% (in comparison to all-atom counts) is directly related to the strict averaging of pseudobonds in this model (see Figure 8B). However, our statistics prove that the number of hydrogen bonds in the SURPASS extended structures is directly

Figure 9. (A) Definitions of burial conditions for single-domain globular proteins: blue, exposed pseudoresidues; green, first coordination zone in the residues contacting exposed residues; and red, buried center of a globule, where the number of contacts within the 6 Å radius of each (red) united residue is controlled by repulsive interactions. Observed number of neighbors for fully buried united residues depending on their secondary structure type: B, variants for helical; C, extended; D, coil. 5771

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation

a hydrogen bond (3.8 Å). During simulations, all moves causing hard-core overlaps have been prohibited. Such an approach ensures efficient sampling by avoiding nonphysical crumpled geometry. Soft-Core Attraction. The number of noncovalent interactions for pairs of pseudoresidues is relatively unrestricted, and, therefore, estimation of the nonbinding terms typically scales with the square of the system size. For the SURPASS model it is not a difficult problem, because the number of all possible centers of interaction is very small. However, not all long-range interactions are equally important, and we decided to reward only specific distances between pseudoresidues, depending on their secondary structure type. Furthermore, contacts are not counted for pseudoresidues from the same element of a secondary structure (e.g., helix or β-strand), including β-strands from the same sheets or barrels. In addition, all contacts involving loop-assigned united residues are ignored. All lower cut-offs have been defined as the fifth percentile of the radial distribution function (RDF) of distance for pseudoresidue pairs, and the upper cut-offs correspond to the maximum of this function (see Figure 10). Such defined pairwise contacts are rewarded with low fixed soft attraction energy (Table 1, Parameters Nos. 3.2−3.3 in section SI3, of Supporting Information). 2.4. Generic Force Field for Single-Domain Globular Proteins. In the present work we describe the generic version of the SURPASS force field. A more complete force field for a broad spectrum of protein systems will need an important extension to include sequence-specific binary interactions between the model pseudoresidues. In this report we focus on a simple case of single-domain globular proteins. To test the efficiency of the model its generic force field has been supported by a simple patch that enables modeling singledomain protein-like systems. To force the SURPASS chain to fold into globular topology, we used a simple centrosymmetric potential. Its purpose is to maintain a sufficiently high degree of packing of pseudoresidues in the protein core. Statistical analysis revealed that for a typical globule the optimal quantity of pseudoresidues densely packed in a sphere around the structure’s center of mass is 50%. The choice of a 50% sphere is a good compromise between the requirement of the close packing of central fragments in globular proteins and acceptance of overall various shapes (except for highly elongated structures) of single domains. The radius of this sphere is proportional to the square root of system size (see Figure 11). Centrosymmetric potential is a linear energy penalty for all deviations (excess or deficit) from the estimated number of pseudoresidues within a sphere of radius R50 (see Table 1, Term 1). When the number of residues in the central

pseudoresidues (see Table 1, Term 3), with the cutoff distances given in Table 1, Parameter No. 3.1 in section SI3, of Supporting Information. All parameters have been defined as the first percentile of the radial distribution function (RDF) of distance for pseudoresidue pairs depending on their type of secondary structure (see Figure 10). The only exception is

Figure 10. Radial distribution function for pairs of united residues of various secondary structure types. The first percentile of RDF is the cutoff distance for hard-core repulsions and the range between the fifth percentile, and the maximum of distribution is the distance for soft attractive interactions. Both parts of the potential are sequenceindependent.

variant EE (both pseudoresidues in a β-sheet) for which the cutoff distance is equal to the shortest distance required to form

Figure 11. Generic centrosymmetric attractive force for single-domain globular proteins. (A) Dark blue residues, within a sphere of radius R50, illustrate dense packing in central fragments of globular protein. (B) Statistics for pseudoradii of spheres containing 50% of pseudoresidues representing native-like structures. (C) Radii of spheres containing various fractions of residues as a function of protein chain length. 5772

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation Table 2. Summary of Contributions of Terms in SURPASS Total Energy short-range centro-symmetric

local repulsive

soft-core attraction

hydrogen bond

R12

R13

R14

R15

helix stiffness

1

1

1

10

1

1

1

5

5

Figure 12. SURPASS simulation of a 56 residue α+β protein immunoglobulin binding domain (PDB code: 2gb1). Panel A shows a schematic illustration of the path of protein folding observed during the simulations for a selected replica. Panels B and E show plots of total energy versus time, for a single and all 12 replicas (in different colors), respectively. Examples of RMSD (from native) and replica temperature for selected replica are shown in panels C and D. Plot F shows summarized results of simulations for a single replica. The best native-like structure observed in simulations, superimposed onto SURPASS representation of the native structure, is shown in the last panel.

sphere drops below 50%, a penalty is imposed, slowly increasing with the square root of the number of lacking residues. On the contrary, when packing density in the sphere exceeds 60% of all residues, a rapidly increasing penalty is implemented for each additional excessive residue. The details of formula 1 from Table 1 are given in Table 1 of Supporting Information (Parameters Nos. 1.1−1.4 in Section SI3). This simple patch imposed onto the generic force field of SURPASS proves to be sufficient to prevent the total collapse of the structure or long-time sampling of low-density unfolded states. Total energy for single-domain globular proteins (see Table 2) using the generic SURPASS model is defined by a combination of the described statistical potentials (summarized in Table 1). Specific weights of particular components were optimized in a long series of simulations. The final scaling appears to be close to optimal. The important interactions for stabilizing local geometry (including short-range R15, helix stiffness, or hydrogen bond in extended structure fragments)

required significantly higher scaling (by a factor of 5−10), while the other contributions were of a constant magnitude of 1.0. 2.5. Sampling Scheme. In principle, the SURPASS representation and model of interactions can be used in various simulation protocols, including molecular dynamics (after minor modifications of force field equations into continuous forms), Monte Carlo, and other modeling techniques. In the present work, the simulation process was controlled by the replica exchange Monte Carlo dynamics scheme (REMC),35 which provides very efficient searching of the conformational space and relatively fast localization of global minima. The term Monte Carlo dynamics means that we used only very local random modifications of chain conformation. Short-distance random moves of single SURPASS pseudoresidues were used in the simulations only. Obviously, the algorithm can be significantly accelerated when somewhat larger fragments (2−5 units) are used in properly designed local moves. Monte Carlo simulations that use only 5773

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation

Figure 13. SURPASS simulation of a 103 residue α+β hypothetical YfjZ domain (PDB code: 2ea9). Panel A shows a schematic illustration of the path of protein folding observed during the simulations for a selected replica. Panels B and E show plots of total energy versus time, for a single and all 12 replicas (in different colors), respectively. Examples of RMSD (from native) and replica temperature for selected replica are shown in panels C and D. Plot F shows summarized results of simulations for a single replica. The best native-like structure observed in simulations, superimposed onto SURPASS representation of the native structure, is shown in the last panel.

by the number of Monte Carlo steps and required from a couple of hours of real simulation time for a single replica of a smaller protein to tens of hours for the largest chains. This computational cost, already very low, can still be reduced by an order of magnitude in the future with an optimized version of the program. At the moment, the simulation algorithm has a structure that allows easy and straightforward changes of the model of dynamics, details of interaction patterns, specific formulas of the potential, etc. It makes model testing very easy, albeit at an additional cost not necessary for the final algorithm that will be available in the near feature as an easy to use web server. In addition, as noted before, some modifications of the model of Monte Carlo dynamics by “smart” local moves can speed up the sampling process significantly. The results of simulations are summarized in Tables 3−5. The first two columns in these tables show the PDB code and protein size defined by the number of amino acid residues. The third column lists the range of the model dimensionless (reduced) temperature of 12 replicas. The fourth column shows the efficiency of simulations and provides the average number of transitions between short-range and fully random conformations of the modeled chains detected in a single replica. These numbers indicate very efficient sampling of the entire conformational space for small and middle-size proteins. For

local random moves provide realistic pictures of the long-time dynamics of modeled systems. To ensure high efficiency of the exchange algorithm, the starting temperatures for replicas were fine-tuned. Exchanges only between the nearest neighbors were allowed, and each replica visited all the possible temperatures (see panels D in Figures 12−14).

3. RESULTS AND DISCUSSION The generic SURPASS model presented above has been tested on a representative set of single-domain globular proteins. The set contained various types of globular proteins, including 7 helical proteins, 9 mostly β-sheet, and 8 mixed alpha/beta proteins. In the test simulations presented here, the secondary structure assignments, required by the model, were taken directly from the PDB database. These proteins were of various length, containing from 56 to 334 amino acid residues. Replica exchange Monte Carlo simulations were performed with 12 replicas for each tested protein. Replicas were distributed in a properly selected temperature range with increments increasing with growing temperature and with the expected folding transition temperature (roughly estimated from preliminary REMC runs) near the middle replicas. The starting structures of all replicas had fully expanded conformation of model chains. All simulations were performed within the same time counted 5774

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation

Figure 14. SURPASS simulation of a mostly alpha 105 residue N-terminal domain of the delta subunit of the F1F0-ATP synthase domain (PDB code: 1abv). Panel A shows a schematic illustration of the path of protein folding observed during the simulations for a selected replica. Panels B and E show plots of total energy versus time, for a single and all 12 replicas (in different colors), respectively. Examples of RMSD (from native) and replica temperature for selected replica are shown in panels C and D. Plot F shows summarized results of simulations for a single replica. The best native-like structure observed in simulations, superimposed onto SURPASS representation of the native structure, is shown in the last panel.

Table 3. Summary of Results for α Proteins RMSD

TMscore

PDB ID

size

temp.

transitions

original

annealing

1i2t, A 1eo0, A 3qf2, A 1abv, A 1aep, A 1ls4, A 1r0d, A

61 77 90 105 153 164 194

1.4−2.6 1.4−2.6 1.4−2.6 1.5−2.6 1.6−2.8 1.7−3.0 1.7−3.0

∼20 ∼15 ∼10 ∼7 ∼3 ∼4 ∼5

0.84 2.81 3.72 2.44 3.68 4.21 3.29

0.78 2.57 3.20 2.32 3.30 3.62 2.87

the largest systems, the sampling is not fully complete, although single native-like folds and lower accuracy properly folded fragments are also observed. The two RMSD columns need an additional comment. RMSD (root-mean-square deviation, in angstroms) means the distance between the SURPASS representation of the native (PDB) structures and the best models observed in the trajectories from a single simulation. In all cases, with the exception of the largest β-type protein, the best structures adopted correct native-like topology. In the specific case of 3ned, a 231 residue complex β-type fold, fragments of the best structures were distorted by inaccurate topology. In other cases, surprisingly accurate native-like

best model 0.71 0.55 0.58 0.58 0.68 0.54 0.75

(100%) (99%) (85%) (93%) (89%) (75%) (93%)

% of models

architecture

0.93% 0.09% 0.12% 0.27% 27.40% 38.84% 25.40%

orthogonal bundle up-and-down orthogonal bundle 5-helix bundle 5-helix bundle 5-helix bundle 5-helix bundle

structures were observed. Additional refinement (annealing) of these structures by a single replica low temperature SURPASS simulations slightly improved accuracy of the best structures. Columns 7−8 provide additional data that characterize the modeling accuracy using TMscore for selected structure fragments.36 These data prove that even in more difficult cases the simulations generate structure fragments similar to the native-like reference structure. This is summarized in column 8 (% of models) by the fraction of all models from the entire trajectory of a protein for which reasonably accurate models (RMSD smaller than 4 Å) cover at least 60-residue continuous fragments. 5775

DOI: 10.1021/acs.jctc.7b00642 J. Chem. Theory Comput. 2017, 13, 5766−5779

Article

Journal of Chemical Theory and Computation Table 4. Summary of Results for β Proteins RMSD

TMscore

PDB ID

size

temp.

transitions

original

annealing

1mhn, A 1jic, A 1yez, A 1nte, A 1ten, A 2i5f, A 1bww, A 3kff, A 3ned, A

59 62 68 82 89 99 109 152 231

1.3−2.4 1.3−2.4 1.3−2.4 1.3−2.4 1.4−2.6 1.4−2.6 1.4−2.6 1.6−2.8 1.7−3.0

∼12 ∼12 ∼13 ∼7 ∼6 ∼5 ∼3 ∼2 -

2.73 3.49 3.66 5.15 4.81 3.64 6.41 8.35 12.85

2.29 3.29 3.16 4.46 4.22 3.04 4.99 7.19 -

best model 0.55 0.46 0.44 0.49 0.54 0.52 0.43 0.41 0.31

(98%) (93%) (91%) (72%) (95%) (85%) (65%) (52%) (32%)

% of models

architecture

0.49% 0.02% 0.01% 0.01% 1.15% 0.43% 0.01% 0.04% 0.05%

roll β barrel β barrel roll sandwich roll sandwich β barrel β barrel

Table 5. Summary of Results for α+β Proteins RMSD

TMscore

PDB ID

size

temp.

transitions

original

annealing

2gb1, A 3zke, A 2ea9, A 2j5a, A 3vsy, A 2k54, A 3ck2, A 3zfc, A

56 85 103 106 120 123 174 334

1.3−2.6 1.4−2.6 1.4−2.6 1.4−2.6 1.4−2.6 1.4−2.6 1.7−3.0 1.8−3.2

∼14 ∼10 ∼6 ∼6 ∼3 ∼4 ∼3 ∼1

1.65 2.46 3.39 5.33 7.49 6.95 8.60 15.60

1.55 2.13 2.88 4.67 7.30 6.60 8.05 -

best model 0.72 0.64 0.60 0.49 0.36 0.44 0.36 0.31

(100%) (100%) (95%) (80%) (49%) (65%) (42%) (31%)

% of models 3.88% 1.40% 6.10% 0.06%