Subscriber access provided by LAURENTIAN UNIV
Article
SURPASS low-resolution coarse-grained protein modeling Aleksandra Elzbieta Dawid, Dominik Gront, and Andrzej Kolinski J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.7b00642 • Publication Date (Web): 09 Oct 2017 Downloaded from http://pubs.acs.org on October 11, 2017
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
SURPASS low-resolution coarse-grained protein modeling Aleksandra E. Dawid †, Dominik Gront †, Andrzej Kolinski †* † Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland KEYWORDS: protein modeling, coarse-grained models, reduced models, empirical force field, knowledge-based potential, de novo protein folding
ABSTRACT: Coarse-grained modeling of biomolecules has a very important role in molecular biology. In this work we present a novel SURPASS (Single United Residue per Pre-Averaged Secondary Structure fragment) model of proteins that can be an interesting alternative for existing coarse-grained models. The design of the model is unique and strongly supported by the statistical analysis of structural regularities characteristic for protein systems. Coarse-graining of protein chain structures assumes a single center of interactions per residue and accounts for preaveraged effects of four adjacent residue fragments. Knowledge-based statistical potentials encode complex interaction patterns of these fragments. Using the Replica Exchange Monte Carlo sampling scheme and a generic version of the SURPASS force field we performed test simulations of a representative set of single-domain globular proteins. The method samples
ACS Paragon Plus Environment
1
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 2 of 43
significant part of conformational space and reproduces protein structures, including native-like, with surprisingly good accuracy. Future extension of the SURPASS model on large biomacromolecular systems is briefly discussed.
1. INTRODUCTION
Molecular modeling is very important in the investigation of protein structure, dynamics and interactions.1 Classical molecular dynamics (MD) simulations are commonly used in structural biology, rational drug design and other areas of life science, although their range of applicability is limited by the time scale and complexity of biomacromolecular processes. The “holy grail” of theoretical molecular biology is to predict three-dimensional native structures of proteins and the mechanism of its assembly based on its amino acid sequence.2,3 Using an MD dedicated supercomputer it is now possible to simulate the entire folding pathway of small and rapidly folding proteins.4 The time scales of assembly for larger proteins, association with other biomacromolecules, protein aggregation, and many other biological phenomena are usually several orders of magnitude longer. This is the reason for the increasing role of coarse-grained (CG) modeling approaches.5 Some of the most successful CG models are targeted onto structure prediction, and when combined with appropriate bioinformatics tools, such as Rosetta6 or ITasser7, they could be very efficient. Other models are more universal, and enable not only structure prediction but also studies of protein dynamics.8,9 Typical examples of such mediumresolution CG models are UNRES10 or CABS11. In this class of CG models a single amino acid residue is usually replaced by 2, 3 or 4 united atoms, representing a larger number of atoms forming the real structure. Such a level of coarse-graining provides computational speed-up of about 3 orders of magnitude in comparison with all-atom simulations, still allowing realistic
ACS Paragon Plus Environment
2
Page 3 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
reconstruction of the atomistic details at selected points of simulation trajectories.12 This way it is possible to build quite effective multiscale modeling strategies. Medium-resolution CG models significantly expand the time scale and system size of molecular modeling; however, as it was observed in successive CASP (Critical Assessment of protein Structure Prediction methods) experiments, they struggle with de novo (without homologous templates) modeling of larger structures.13,14 With increasing size and topological complexity of protein structures the necessary time scale of protein folding simulations simply grows very rapidly. On the other hand, when starting from template-derived and even quite inaccurate initial models, but with generally correct topology, CG methods usually generate good resolution of final models.5 Modeling protein complexes and their interactions with other biomacromolecules is even more demanding.5 Therefore, to expand the range of de novo modeling of protein structure and dynamics, an efficient tool is needed for very fast simulations of low-resolution structures.15 This is one of the reasons why models such as SURPASS (Single United Residue per Pre-Averaged Secondary Structure fragment) proposed here can be very useful as part of multiscale molecular modeling schemes. In such a scheme SURPASS simulations can provide a collection of protein-like low-resolution starting structures, and these could be for instance used as an input to replica exchange simulations with an intermediate resolution CG model (for example CABS). Intermediate resolution structures can be finally subjected to all-atom reconstruction16 and MD refinement/scoring simulations17. Efficient reconstruction and refinement of coarse-grained structures can significantly benefit from the good knowledge of structural regularities characteristic for proteins.5,18,19 Several multiscale strategies based on these two levels of resolution have been recently explored.5,20
ACS Paragon Plus Environment
3
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 4 of 43
SURPASS is a low-resolution model. Models of similar resolution21, although computationally more expensive, have proven to be quite productive in the studies of protein structure and nearnative dynamics.22-25 The low-resolution of the SURPASS model makes it rather impractical as a basic tool for structure prediction. Nevertheless, the model can provide large sets of lowresolution protein-like structures that can be used as starting sets for more accurate methods. Due to its computational efficiency SURPASS can be used for modeling long-time dynamics and large-scale structural transitions in protein systems that are significantly bigger than those tractable by the coarse-grained modeling tools of higher resolution.5 The rest of the paper is organized as follows. In the Methods section we present the coarsegrained SURPASS representations of protein chains, then we discuss the knowledge-based force field and finally the sampling scheme used in simulations. The description of the force field is divided into two basic sections. The first one describes the generic part of the force field that can be applied to a variety of protein systems. It consists of potentials enhancing regular secondary structures, other short-range (along the sequence) interactions, model of hydrogen bonds and potentials that control the local packing of pseudoresidues. The full set of parameters of this minimalistic
force
field
is
available
on
our
homepage:
http://biocomp.chem.uw.edu.pl/tools/surpass. The second section of the force field description presents a simple patch used in simulations of single-domain globular proteins. The Results and Discussion sections describe Monte Carlo simulations of a small but representative set of globular proteins of various sizes. Main conclusions and expected future applications of the SURPASS models are summarized in Conclusions. 2. METHODS
ACS Paragon Plus Environment
4
Page 5 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
2.1.
Representation of protein structure. SURPASS is a low-resolution deeply coarse-
grained model of protein structure. The number of pseudoresidues representing protein structure corresponds to protein size and is equal to N-3, where N is the length of an amino acid sequence. The concept of SURPASS representation is explained in Figure 1. The main idea behind the model is based on unique generalization of the local geometry of a polypeptide chain. Namely, positions of pseudoresidues are defined by averaging the coordinates of the four consecutive alpha carbons along the chain. These four-residue fragments are replaced by a single center of interactions. Due to the single residue frame shift each alpha carbon in the modeled chain (except for the chain ends) contributes to the definition of four consecutive SURPASS pseudoresidues. The choice of four-residue averaging is crucial for the geometry of the model. In contrast to other short fragments of different lengths, only the four-residue averaging leads to an almost linear shape of the SURPASS fragments representing helices or beta strands (see Figure 1). This feature of the model results in very simple and effective sampling schemes. Depending on the assumed definitions of interactions between the SURPASS united residues, different sampling strategies can be used, including Molecular Dynamics protocols, Monte Carlo methods or various combined sampling schemes.
ACS Paragon Plus Environment
5
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 6 of 43
Figure 1. Schematic illustration of the projection from all-atom to coarse-grained SURPASS representation for two protein fragments: helical (left) and β-strand (right). Panel A illustrates an idealized “ribbon diagram” (see ref 26 for an overview), while panel B represents the all-atom structure of example fragments. Panel C shows the corresponding α-carbon chains. Panel D illustrates the idea of the four-residue averaging of alpha carbon positions, leading to the SURPASS pseudoresidue representation. The resulting SURPASS chains (green for helical fragments and blue for the beta strand) superimposed onto the original alpha-carbon traces are
ACS Paragon Plus Environment
6
Page 7 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
shown in the last row (panel E). Note the different radii of the pseudoresidues representing helical and beta fragments reflecting different thicknesses of such fragments in real (atomistic) structures. The SURPASS representation assumes three types of pseudoatoms depending on secondary structure assignment (or prediction) of the averaged fragments of the modeled chain: a. pseudoatom H (for helical fragments) localized in the centers of mass of four consecutive α-carbons representing helical (HHHH) or almost helical (HHHC, CHHH, EHHH, HHHE) fragments in the atomistic models; b. pseudoatom S (β-strand fragments) representing the centers of mass of EEEE, EEEC or
CEEE fragments; c. pseudoatom C (coil-like) representing the center mass of four consecutive alpha carbons for all the remaining secondary structure combinations (H, E and C). This definition of secondary structure in the SURPASS model leads to slight shortening (~ 2.0Å) of both helices and β-strands compared to the original structures, and slight compressing of loop fragments. A continuous fragment of SURPASS helical pseudoatoms has a form of a straight, rigid and tightly packed stick with partially overlapping beads representing united residues. United residues for β-strands represent centers of more separated α-carbons. Therefore, the distance between two consecutive type S or C pseudoresidues is almost twice as long as the distance in helical fragments. The simple “ball” protein visualization in this representation resembles the well-known “ribbon diagrams” (see Figure 2). The coarse-grained representation of protein structure by the SURPASS model implies several directions for designing its force field. We opt for the knowledge-based approach and specific definitions of the statistical potentials describing interactions between united residues. Solvent is treated in an implicit
ACS Paragon Plus Environment
7
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 8 of 43
fashion and its effects (water with other small molecules or a membrane environment for transmembrane proteins) are included directly in the statistical potentials that describe interactions between the united residues. This is a strategy proven to be quite productive in other, higher resolution coarse-grained protein models.5,11,13,27
Figure 2. Visualization of protein structure (PDB code: 2gb1, chain A). Upper panel: all-atom representation by “ribbon diagram”; lower panel: SURPASS “ball-and-stick” model. Green – αhelix; cyan – loops; blue – β-sheet. 2.2.
Generic terms of the SURPASS force field. Designing and derivation of knowledge-
based force fields of coarse-grained models is always a key point for model performance.5,28,29 In this paper we describe and test only the generic terms that are directly sequence-independent, while a more general model of interactions will be presented in a separate report. In the minimalistic force field presented here, the sequence effects are encoded only non-directly via
ACS Paragon Plus Environment
8
Page 9 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
secondary structure assignment. First we describe the basic generic terms that appear to be appropriate for a broad class of protein systems (single-domain, multi-domain, globular and membrane proteins, etc.). In the next section a simple modification, or a patch, of the interaction model targeted onto single-domain globular proteins is presented. We believe that the generic force field described here can be useful in many applications and very easily implemented by other researchers in SURPASS-like or related coarse-grained models. Deriving reliable and universal statistical potentials requires the use of a suitably selected subset of protein structures from the Protein Data Bank (PDB). In this work we used the PISCES subset.30 This database contains 4600 polypeptide chains with lengths of 20 to 1193 amino acid residues, which are representative and non-redundant subsets of all kinds of known protein structures. The resolution of PISCES structures is not worse than 1.6Å and the similarity of sequences is not greater than 60%. A detailed specification is provided in the section SI1 of Supporting Information. Model of secondary structure. Distances and angles between atoms close along the sequence in polypeptide chains are highly restricted due to various short-range interactions.31 The deficiencies of atomic details in a strongly simplified and pre-averaged SURPASS chain may result in incorrect local geometry of the structure. To avoid this, it is necessary to transfer the structural regularities of the atomistic models onto a corresponding restriction imposed onto united atoms. Planar angle restrictions. Planar angles between three consecutive united atoms in the SURPASS model correspond to the spatial relationship between six all-atom residues along the sequence. Due to the averaging of structure fragments the angle between the i-th and i+2th united atom in helices and β-strands is close to 180°, which means that both types of regular
ACS Paragon Plus Environment
9
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 10 of 43
structures are represented as almost collinear straight sticks (see Figure 3). Secondary structure elements require specific distance restrictions along the model chain (see potential R13 in the section Short-range interactions).
Figure 3. α-helical and β-strand type fragments. In the SURPASS model angular regularities observed in Cα traces are translated into simple distance preferences encoded in the R13 potential. Short-range interactions. It is assumed that the local geometry of a secondary structure fragment is to a large extent encoded in the distance distribution of SURPASS pseudoatoms. This applies to pseudoresidues at a distance of n = 1, 2, 3, 4 positions along the chain. All generic terms: R12, R13, R14 and R15 are prepared in six variants (HH, EE, CC, HE, HC, EC) depending on the secondary structure assignments for pairs of residues located at key positions (see Chart 1 in section SI2 of Supporting Information). The secondary structure assignment can be taken from the structural database or predicted using appropriate bioinformatics tools. The
ACS Paragon Plus Environment
10
Page 11 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
distribution of the distance (R12) between i-th and i+1th beads shows that the average density of residues along helices (1.6Å) is higher than in loops (2.2Å) and more than double that of the βstrand (3.3Å). Potentials R13 and R14 control not only the distance between united residues but also the planar and dihedral angles between corresponding united residues. All short-range interactions have been implemented in the force field as a potential of the mean field (PMF)32 using a one-dimensional kernel density estimator (KDE)33 as a method for defining the density of empirical distribution. Energy function depends not only on the KDE value at the observed distance d, but also on the probability of the secondary structure type of the first and the second interacting pseudoresidues (see Figure 4) (in simulations described in the present report values of 0 or 1 were used). To avoid nonphysical stretching of model chains an energy penalty increasing with square of the distance is added. The mathematical equations and the energy plot are shown in Table 1, Term 5, while the description of the individual parameters can be found in Table 1 in section SI3 of Supporting Information (see Parameter No. 5.1-5.6).
ACS Paragon Plus Environment
11
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 12 of 43
Figure 4. Plots of statistical potentials of short-range interactions for various secondary structure assignments (listed in the legend on the right) for pairs of residues: A – R12, B – R13, C – R14, D – R15. In contrast to beta strands, the local distance potential does not control properly the stiffness of helices and, therefore, helices require a somewhat different approach. Helices are the most compact secondary structure elements. This feature is even more evident in the SURPASS representation, where the helix looks like a stiff stick formed by densely packed and overlapping
ACS Paragon Plus Environment
12
Page 13 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
beads. Results of the statistical analysis of a set of representative protein structures show specific correlations between the size and length of model helices (see Figure 5). These data helped us to derive a simple potential enhancing the helix stiffness of SURPASS chains. Lengths of helical fragments smaller than expected are penalized with energy proportional to the product of helix size and deviation from the expected value. For more details see the Table 1 in section SI3 of Supporting Information (see Parameter No. 5.7-5.9).
Figure 5. Correlation between the number of SURPASS pseudoresidues in a helix and its length measured in Å (X and Y axis, respectively). Model of hydrogen bonds. In the SURPASS model hydrogen bonds between residues close to each other along the chain, such as those in the helices, are treated in an implicit manner by angular preferences for secondary structure.34 Hydrogen bonds between residues that are distant in the sequence, especially in extended structure fragments, are modeled more directly. As a result of averaging the secondary structure fragments, regardless of the mutual directions of the β-strands, the pattern of connections is very regular, and interacting pseudoresidues are regularly packed (see Figure 6).
ACS Paragon Plus Environment
13
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 14 of 43
Figure 6. Hydrogen bond definition in the SURPASS β-sheet: A – main concept of the hydrogen bond scheme: a hydrogen bond can be formed only between i-th and j-th pseudoresidues located in two different β-strands; B – pattern of hydrogen bonds in the β-sheet from protein 2gb1, chain A structure in SURPASS representation. The organization of space packing of the extended secondary structure elements in SURPASS representation is extremely important for enforcing protein-like topology. The imposition of restrictions not only has to ensure regularity of the pattern, but also avoid the formation of nonphysical pseudobonds between orthogonal beta sheets. Therefore, the formation of model hydrogen bonds depends on the fulfillment of a few simple geometrical conditions: a. The length of the model hydrogen bond is in a range of 3.8Å to 6.0Å, and the most probable length is 4.65Å (statistically the most frequent distance, see Figure 7A). b. The maximum number of connections for each pseudoresidue in the β-strand is 2, to maintain consistency and to avoid redundancy. If there are more potential candidates for hydrogen bond formation, the best two are chosen according to the following angular criterion:
a hydrogen bond should be perpendicular to the main chain of both interacting βstrands and the permitted angle range is from 70˚ to 115˚ (see Figure 7B);
ACS Paragon Plus Environment
14
Page 15 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
the maximum allowable twist of the beta sheet, measured as the planar angle between the main chains of two adjacent β-strands, is not greater than 55˚ (or cos(α) greater than 0.574, see Figure 7C);
for a pseudoresidue that forms two hydrogen bonds (with two different β-strands), the planar angle between these bonds must be greater than 125˚, and 180˚ is the best orientation (see Figure 7D).
Figure 7. Statistical analysis of the geometry of the model hydrogen bond: A – length of hydrogen pseudobonds extracted from the RDF of distance between i-th and j-th pseudoresidues in two beta strands; B – angle between two β-strands connected by a hydrogen bond: “single” if i-th residue participates in a single hydrogen bond, “double” if it has connections with two different β-strands; C – twist of the β-sheet measured as a planar angle between the main chains of two adjacent β-strands; D – angle between two hydrogen bonds of three connecting β-strands. The statistics of hydrogen bond length for known protein structures in SURPASS representation are given by normal distribution. The strength of these bonds is only distancedependent and the energy minima correspond to the optimal bond length, d = 4.65Å. The choice of constant C in Equation 4 in Table 1 leads to an energy minimum of -1.0 and function flattening at zero. In summary, the model hydrogen bonds are formed only if the bond length is greater than dMIN = 3.8Å and less than dMAX = 6.0Å, and if all the angular conditions are satisfied.
ACS Paragon Plus Environment
15
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 16 of 43
The exact description of these parameters can be found in the Table 1, section SI3 of Supporting Information (see Parameter No. 4.1-4.5). The application of the above definition of hydrogen bonds into the SURPASS representation of native structures reproduces quite accurately the majority of corresponding bonds observed in real atomistic structures (see Figure 8A). Consistent reduction of the number of observed native hydrogen bonds in the SURPASS model by almost 30% (in comparison to all-atom counts) is directly related to the strict averaging of pseudobonds in this model (see Figure 8B). However, our statistics prove that the number of hydrogen bonds in the SURPASS extended structures is directly proportional to the number of hydrogen bonds in real full-atom β-sheets. Coefficients of the linear relationship between the length of a protein chain and the number of hydrogen bonds in the β-sheets in both cases are similar. This confirms the validity of the chosen definition of pseudohydrogen bonds.
Figure 8. Statistics of main chain hydrogen bonds in proteins: A – relationship between the length of extended structures and the number of hydrogen bonds observed in all-atom structures and in their SURPASS representations (red and blue, respectively); B – fraction of hydrogen bond reconstruction by the SURPASS model. 2.3.
Local packing control
ACS Paragon Plus Environment
16
Page 17 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
Generic local repulsive interactions. Coarse-grained models, particularly with low-resolution and highly simplified representation, have a tendency to form excessively compressed structures. This is mostly due to the lack of atomic details, especially the lack of any explicit trace of side chains. To prevent the excessive and non-physical collapse of a structure during the simulation, generic local repulsions are needed. Using the statistics derived from deeply buried elements of protein structures from the database, we estimated the number of neighboring SURPASS pseudoatoms in the central part of the model globules. Counts did not include +/-3 sub-sequential neighbors along the chain. It has been estimated that within a radius of 6Å around each deeply buried type E residue, there are no more than 6 other residues, 2 for the type H pseudoresidue and 4 for a residue in the loop (see Figure 9). The cut-off of 6Å appears to be an optimal choice since, in contrast to significantly larger values of cut-offs, it properly differentiates the packing patterns of various secondary structure elements. The energy function describing local repulsion interactions is an energy penalty and increases with the square of excess of neighbors in the sphere according to Formula 2 in Table 1 (see also Table 1, Parameter No. 2.1-2.2 in section SI3 of Supporting Information).
Figure 9. A – definitions of burial conditions for single-domain globular proteins: blue – exposed pseudoresidues, green – first coordination zone in the residues contacting exposed residues, red – buried center of a globule, where the number of contacts within the 6Å radius of
ACS Paragon Plus Environment
17
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 18 of 43
each (red) united residue is controlled by repulsive interactions. Observed number of neighbors for fully buried united residues depending on their secondary structure type: B – variants for helical; C – extended; D – coil. Hard-core repulsions. To avoid steric clashes between pseudoresidues distant in sequence but close in space, the excluded volume cut-off is needed. This component of long-range interactions has a form of hard-core repulsion between pseudoresidues (see Table 1, Term 3), with the cutoff distances given in Table 1, Parameter No. 3.1 in section SI3 of Supporting Information. All parameters have been defined as the first percentile of the radial distribution function (RDF) of distance for pseudoresidue pairs depending on their type of secondary structure (see Figure 10). The only exception is variant EE (both pseudoresidues in a β-sheet) for which the cut-off distance is equal to the shortest distance required to form a hydrogen bond (3.8Å). During simulations, all moves causing hard-core overlaps have been prohibited. Such an approach ensures efficient sampling by avoiding non-physical crumpled geometry. Soft-core attraction. The number of non-covalent interactions for pairs of pseudoresidues is relatively unrestricted and, therefore, estimation of the non-binding terms typically scales with the square of system size. For the SURPASS model it is not a difficult problem, because the number of all possible centers of interaction is very small. However, not all long-range interactions are equally important, and we decided to reward only specific distances between pseudoresidues, depending on their secondary structure type. Furthermore, contacts are not counted for pseudoresidues from the same element of a secondary structure (e.g. helix or βstrand), including β-strands from the same sheets or barrels. In addition, all contacts involving loop-assigned united residues are ignored. All lower cut-offs have been defined as the fifth percentile of the radial distribution function (RDF) of distance for pseudoresidue pairs and the
ACS Paragon Plus Environment
18
Page 19 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
upper cut-offs correspond to the maximum of this function (see Figure 10). Such defined pairwise contacts are rewarded with low fixed soft attraction energy (Table 1, Parameters No. 3.2-3.3 in section SI3 of Supporting Information).
Figure 10. Radial distribution function for pairs of united residues of various secondary structure types. The first percentile of RDF is the cut-off distance for hard-core repulsions and the range between the fifth percentile and maximum of distribution is the distance for soft attractive interactions. Both parts of potential are sequence-independent.
ACS Paragon Plus Environment
19
Local Repulsive
ACS Paragon Plus Environment Hydrogen bond interactions between pseudoresidues from two different β-strands
Short-range (along the chain) interactions between pseudoresidues Secondary structure dependent
4 Hydrogen Bond
5 Interactions
Short-range
Contact energy for pseudoresidues
Penalty energy for too dense packing of pseudoresidues in the 6Å vicinity of pseudoresidues in the buried region
3 Attraction
Soft-Core
2 Interactions
1
Penalty for deviations from expected number of residues Centrosymmetric in the R50 central sphere of single-domain globular proteins (only)
Description
Formula
Energy Function
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
No. Term
Journal of Chemical Theory and Computation Page 20 of 43
Table 1. Summary of energy terms in the SURPASS basic force field
20
Page 21 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
2.4.
Generic force field for single-domain globular proteins. In the present work we
describe the generic version of the SURPASS force field. A more complete force field for a broad spectrum of protein systems will need an important extension to include sequence-specific binary interactions between the model pseudoresidues. In this report we focus on a simple case of single-domain globular proteins. To test the efficiency of the model its generic force field has been supported by a simple patch that enables modeling single-domain protein-like systems. To force the SURPASS chain to fold into globular topology, we used a simple centrosymmetric potential. Its purpose is to maintain a sufficiently high degree of packing of pseudoresidues in the protein core. Statistical analysis revealed that for a typical globule the optimal quantity of pseudoresidues densely packed in a sphere around the structure’s center of mass is 50%. The choice of a 50% sphere is a good compromise between the requirement of the close packing of central fragments in globular proteins and acceptance of overall various shapes (except for highly elongated structures) of single domains. The radius of this sphere is proportional to the square root of system size (see Figure 11). Centrosymmetric potential is a linear energy penalty for all deviations (excess or deficit) from the estimated number of pseudoresidues within a sphere of radius R50 (see Table 1, Term 1). When the number of residues in the central sphere drops below 50%, a penalty is imposed, slowly increasing with the square root of the number of lacking residues. On the contrary, when packing density in the sphere exceeds 60% of all residues, a rapidly increasing penalty is implemented for each additional excessive residue. The details of formula 1 from Table 1 are given in Table 1 of Supporting Information (Parameters No. 1.1-1.4 in section SI3). This simple patch imposed onto the generic force field of SURPASS proves to be sufficient to prevent the total collapse of the structure or long-time sampling of lowdensity unfolded states.
ACS Paragon Plus Environment
21
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 22 of 43
Figure 11. Generic centrosymmetric attractive force for single-domain globular proteins: A – dark blue residues, within a sphere of radius R50, illustrate dense packing in central fragments of globular protein; B – statistics for pseudoradii of spheres containing 50% of pseudoresidues representing native-like structures; C – radii of spheres containing various fractions of residues as a function of protein chain length. Total energy for single-domain globular proteins (see Table 2) using the generic SURPASS model is defined by a combination of the described statistical potentials (summarized in Table 1). Specific weights of particular components were optimized in a long series of simulations. The final scaling appears to be close to optimal. The important interactions for stabilizing local geometry (including short-range R15, helix stiffness or hydrogen bond in extended structure fragments) required significantly higher scaling (by a factor of 5-10), while the other contributions were of a constant magnitude of 1.0. Table 2. Summary of contributions of terms in SURPASS total energy CENTROSYMMETRIC
1 2.5.
LOCAL REPULSIVE
1
SHORT-RANGE
SOFT-CORE ATTRACTION
HYDROGEN BOND
R12
R13
R14
R15
HELIX STIFFNESS
1
10
1
1
1
5
5
Sampling scheme. In principle, the SURPASS representation and model of interactions
can be used in various simulation protocols, including Molecular Dynamics (after minor
ACS Paragon Plus Environment
22
Page 23 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
modifications of force field equations into continuous forms), Monte Carlo and other modeling techniques. In the present work, the simulation process was controlled by the Replica Exchange Monte Carlo dynamics scheme (REMC)35, which provides very efficient searching of the conformational space and relatively fast localization of global minima. The term Monte Carlo dynamics means that we used only very local random modifications of chain conformation. Short-distance random moves of single SURPASS pseudoresidues were used in the simulations only. Obviously, the algorithm can be significantly accelerated when somewhat larger fragments (2-5 units) are used in properly designed local moves. Monte Carlo simulations that use only local random moves provide realistic pictures of the long-time dynamics of modeled systems. To ensure high efficiency of the exchange algorithm, the starting temperatures for replicas were fine-tuned. Exchanges only between the nearest neighbors were allowed and each replica visited all the possible temperatures (see panels D in Figures 12-14).
3. RESULTS AND DISCUSSION
The generic SURPASS model presented above has been tested on a representative set of single-domain globular proteins. The set contained various types of globular proteins, including 7 helical proteins, 9 mostly β-sheet, and 8 mixed alpha/beta proteins. In the test simulations presented here, the secondary structure assignments, required by the model, were taken directly from the PDB database. These proteins were of various length, containing from 56 to 334 amino acid residues. Replica exchange Monte Carlo simulations were performed with 12 replicas for each tested protein. Replicas were distributed in a properly selected temperature range with increments increasing with growing temperature and with the expected folding transition temperature (roughly estimated from preliminary REMC runs) near the middle replicas. The
ACS Paragon Plus Environment
23
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 24 of 43
starting structures of all replicas had fully expanded conformation of model chains. All simulations were performed within the same time counted by the number of Monte Carlo steps and required from a couple of hours of real simulation time for a single replica of a smaller protein to tens of hours for the largest chains. This computational cost, already very low, can still be reduced by an order of magnitude in the future with an optimized version of the program. At the moment, the simulation algorithm has a structure that allows easy and straightforward changes of the model of dynamics, details of interaction patterns, specific formulas of the potential, etc. It makes model testing very easy, albeit at an additional cost not necessary for the final algorithm that will be available in the near feature as an easy to use web server. In addition, as noted before, some modifications of the model of Monte Carlo dynamics by “smart” local moves can speed up the sampling process significantly. The results of simulations are summarized in Tables 3-5. The first two columns in these Tables show the PDB code and protein size defined by the number of amino acid residues. The third column lists the range of the model dimensionless (reduced) temperature of 12 replicas. The fourth column shows the efficiency of simulations and provides the average number of transitions between short-range and fully random conformations of the modeled chains detected in a single replica. These numbers indicate very efficient sampling of the entire conformational space for small and middle-size proteins. For the largest systems, the sampling is not fully complete, although single native-like folds and lower accuracy properly folded fragments are also observed. The two RMSD columns need an additional comment. RMSD (Root-MeanSquare-Deviation, in Angstroms) means the distance between the SURPASS representation of the native (PDB) structures and the best models observed in the trajectories from a single simulation. In all cases, with the exception of the largest β-type protein, the best structures
ACS Paragon Plus Environment
24
Page 25 of 43
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Theory and Computation
adopted correct native-like topology. In the specific case of 3ned, a 231 residue complex β-type fold, fragments of the best structures were distorted by inaccurate topology. In other cases, surprisingly accurate native-like structures were observed. Additional refinement (annealing) of these structures by single replica low temperature SURPASS simulations slightly improved accuracy of the best structures. Columns 7-8 provide additional data that characterize the modeling accuracy using TMscore for selected structure fragments.36 These data prove that even in more difficult cases the simulations generate structure fragments similar to the native-like reference structure. This is summarized in column 8 (% of models) by the fraction of all models from the entire trajectory of a protein for which reasonably accurate models (RMSD smaller than 4 Å) cover at least 60-residue continuous fragments. Table 3. Summary of results for α proteins RMSD
TMscore
SIZE
TEMP.
TRANSITIONS
original
annealing
best model
% of models
1i2t, A
61
1.4-2.6
~20
0.84
0.78
0.71(100%)
0.93%
orthogonal bundle
1eo0, A
77
1.4-2.6
~ 15
2.81
2.57
0.55 (99%)
0.09%
up-and-down
3qf2, A
90
1.4-2.6
~10
3.72
3.20
0.58 (85%)
0.12%
orthogonal bundle
1abv, A
105
1.5-2.6
~7
2.44
2.32
0.58 (93%)
0.27%
5-helix bundle
1aep, A
153
1.6-2.8
~3
3.68
3.30
0.68 (89%)
27.40%
5-helix bundle
1ls4, A
164
1.7-3.0
~4
4.21
3.62
0.54 (75%)
38.84%
5-helix bundle
1r0d, A
194
1.7-3.0
~5
3.29
2.87
0.75 (93%)
25.40%
5-helix bundle
PDB ID
ARCHITECTURE
Table 4. Summary of results for β proteins RMSD
TMscore
SIZE
TEMP.
TRANSITIONS
original
annealing
best model
% of models
1mhn, A
59
1.3-2.4
~12
2.73
2.29
0.55 (98%)
0.49%
roll
1jic, A
62
1.3-2.4
~12
3.49
3.29
0.46 (93%)
0,02%
beta barrel
1yez, A
68
1.3-2.4
~13
3.66
3.16
0.44 (91%)
0.01%
beta barrel
1nte, A
82
1.3-2.4
~7
5.15
4.46
0.49 (72%)
0.01%
roll
1ten, A
89
1.4-2.6
~6
4.81
4.22
0.54 (95%)
1.15%
sandwich
2i5f, A
99
1.4-2.6
~5
3.64
3.04
0.52 (85%)
0.43%
roll
PDB ID
ACS Paragon Plus Environment
ARCHITECTURE
25
Journal of Chemical Theory and Computation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 26 of 43
1bww, A 109
1.4-2.6
~3
6.41
4.99
0.43 (65%)
0.01%
sandwich
3kff, A 152
1.6-2.8
~2
8.35
7.19
0.41 (52%)
0.04%
beta barrel
3ned, A 231
1.7-3.0
---
12.85
------
0.31 (32%)
0.05%
beta barrel
Table 5. Summary of results for α+β proteins RMSD
TMscore
SIZE
TEMP.
TRANSITIONS
original
annealing
best model
% of models
2gb1, A
56
1.3-2.6
~14
1.65
1.55
0.72 (100%)
3.88%
roll
3zke, A
85
1.4-2.6
~10
2.46
2.13
0.64 (100%)
1.40%
2-layer sandwich
2ea9, A
103
1.4-2.6
~6
3.39
2.88
0.60 (95%)
6.10%
2-layer sandwich
2j5a, A
106
1.4-2.6
~6
5.33
4.67
0.49 (80%)
0.06%
2-layer sandwich
3vsy, A
120
1.4-2.6
~3
7.49
7.30
0.36 (49%)