
Article Cite This: J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

pubs.acs.org/JCTC

Coarse-Grained Modeling of the Interplay between Secondary Structure Propensities and Protein Fold Assembly

Aleksandra E. Dawid,† Dominik Gront,† and Andrzej Kolinski*,†

†Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland

ABSTRACT: We recently developed a new coarse-grained model of protein structure and dynamics [Dawid et al. J. Chem. Theory Comput. 2017, 13(11), 5766−5779]. The model assumes a single-bead representation of amino acid residues, in which the positions of such united residues are defined by the centers of mass of fragments of four amino acids. Replica exchange Monte Carlo sampling of the model chains provided good pictures of the modeled structures and their dynamics. In its generic form the statistical knowledge-based force field of the model is designed for single-domain globular proteins. Sequence-specific interactions are defined by three-letter secondary structure data. In the present work we demonstrate that different assignments and/or predictions of secondary structure are usually sufficient to enforce cooperative formation of native-like folds of SURPASS chains for the majority of single-domain globular proteins. Simulations of native-like structure assembly for a representative set of globular proteins have shown that the accuracy of the secondary structure data is usually not crucial for model performance, although some specific errors can strongly distort the obtained three-dimensional structures.

1. INTRODUCTION
Numerous genomic projects provide a plethora of protein sequences. For many reasons the number of experimentally solved three-dimensional protein structures is much smaller. Knowledge of protein structures, and also of their dynamic properties and interactions with other biomolecules, is essential for describing protein function, rational drug design, elucidation of complex evolutionary relationships, and other key problems of molecular biology, biophysics, and bioinformatics. De novo (starting just from sequence information) theoretical prediction of a three-dimensional protein structure is a very challenging task whose solution is still far from satisfactory.1−3 Only for a subset of relatively small proteins is it now possible to predict the structure computationally using various molecular dynamics (MD) modeling tools.4−6 Well-established coarse-grained models enable simulations of somewhat larger proteins, although, in spite of quite spectacular successes, the resolution and fidelity of the predicted structures are usually lower.1−3,7,8 Fortunately, about 80% of the protein families deposited in the Pfam database have well-defined representatives that are already structurally characterized.9,10 Since protein sequences diverged much faster during evolution than their three-dimensional structures, the knowledge of such structural representatives of homologous proteins opens up enormous possibilities for structure prediction by means of various strategies of comparative modeling. For close homologues that show a high level of sequence similarity with their structural templates, comparative modeling is quite easy, and excellent bioinformatics tools are available. In contrast, comparative modeling for distant homology targets can be very difficult

and requires an elaborate combination of bioinformatics and molecular modeling.1,3,8,11 Determination (experimental and/or computational) of three-dimensional structures of single protein molecules is, however, only a small part of the structural biology of proteins. We also need some knowledge of protein folding mechanisms, the dynamics of protein molecules, and their interactions.3 Moreover, proteins may form many complexes with other proteins, nucleic acids, and other biomolecules. The structures and dynamic properties of such complexes are significantly more challenging for experimental studies and have now become an important direction in advancing computational modeling methodology.3,8,12−14 Due to the time scale and system size of biomacromolecules and their complexes, developing multiscale integrative modeling strategies may play a crucial role. This is one of the main reasons for designing efficient low-resolution modeling methods that enable simulations of large systems and are open to integration with more accurate modeling tools. Recently we proposed a new low-resolution model of protein structure and dynamics.7 SURPASS assumes deep coarse-graining of the protein representation and is based on a knowledge-based model of interactions and Monte Carlo dynamics sampling schemes. It provides new opportunities for quite efficient modeling of long-time processes and simulations of relatively large structures.15−17 SURPASS appears to successfully address some limitations of the existing deeply coarse-grained models.15,16,18

Received: December 12, 2017 Published: February 27, 2018

DOI: 10.1021/acs.jctc.7b01242 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

2. METHODS
2.1. SURPASS Model. The coarse-grained modeling method used in this work has recently been described in great detail.7 Here we outline its major features, including the protein representation, the model of interactions, and the sampling strategy. SURPASS assumes a united-residue representation of pre-averaged secondary structure fragments. Beads, representing single united residues, are positioned at the centers of mass of main chain fragments comprising four consecutive amino acid units (see Figure 1B). This way regular fragments (helices

In the past, several medium-resolution coarse-grained protein models were developed that enabled quite efficient simulation of protein structure and dynamics.3,13,14,19−23 Typical medium-resolution models, such as ROSETTA,24 CABS,25 UNRES,26 or MARTINI,27 usually use a few (2−4) united residues per amino acid and are targeted at studies of single proteins. Additionally, MARTINI27 is a powerful tool for studies of biomembranes and biomembrane proteins. Many higher resolution models (ROSETTA,24 OPEP,28 AAWSEM,23 and others3) use an all-atom representation and a coarse-grained force field. While de novo structure prediction is possible for small proteins, modeling of large conformational transitions in all but the smallest protein systems still remains beyond the reach of all-atom or medium-resolution coarse-grained molecular dynamics.3 Nevertheless, coarse-grained, medium-resolution simulations have proved to be very efficient in difficult cases of comparative modeling, where simulations of large molecules can be restricted to limited regions of conformational space, defined by fragmentary templates extracted from structures of homologous/analogous proteins.29 At the opposite extreme, various simple exact models of protein chains have been quite successfully used in theoretical studies of the general properties of protein-like systems.30,31 The apparent gap between the medium-resolution protein models and the simple exact protein-like models3 was the main reason for designing the intermediate-resolution SURPASS tool.7 Due to its reasonable accuracy and high sampling efficiency, the SURPASS model appears to be an interesting alternative to existing medium-resolution and very crude protein-like models. The first tests of the model were targeted at single-domain globular proteins. Secondary structure assignment was the only sequence-dependent input data.
Using secondary structure annotations taken from the headers of PDB files we performed replica exchange MC simulations for a small set of globular proteins.7 It was shown that the model efficiently samples all the important regions of conformational space and reproduces native-like structures with surprisingly good accuracy for such a level of coarse-graining. Therefore, SURPASS simulations appear to be a very good tool for the fast delivery of representative sets of initial structures for higher resolution simulations and can thereby serve as an initial stage for a variety of multiscale integrative modeling schemes. In the present work we examine the effects of secondary structure (SS) data on protein dynamics and fold assembly in the SURPASS simulations. Various assignments (PDB32 versus DSSP33) and predictions of secondary structure (PSIPRED34), including random assignments, are tested.35 The results of these tests are crucial for the future extension of the knowledge-based schemes of interaction used in the initial SURPASS simulations to larger multidomain protein systems. The secondary structure data, now in the classical three-letter code (H: helix, E: strand, C: coil), can be extended in the future to other types of secondary structure. SS data are crucial for the sequence-specific part of the knowledge-based force field of the generic version of the SURPASS model.7 A more complete sequence-specific force field will also contain contact energies and longer distance restraints between united residues, derived from structures of homologous proteins or locally analogous fragments of known protein structures. This will open up several possibilities for efficient integrative modeling of large systems,3,12 including de novo and distant homology comparative modeling of proteins and protein complexes.
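To make the comparison of secondary structure inputs concrete, the following minimal Python sketch computes a per-residue identity between two three-letter SS strings (H, E, C). It is an illustration only: the function name ss_identity and the definition of identity as the percentage of positions carrying the same letter are assumptions made here, not a procedure stated in the text.

```python
def ss_identity(reference: str, other: str) -> float:
    """Percentage of positions where two three-letter SS strings agree.

    Assumption (not from the paper): identity is the plain per-residue
    match fraction between equal-length strings over the alphabet H/E/C.
    """
    if len(reference) != len(other):
        raise ValueError("SS strings must have the same length")
    if not reference:
        raise ValueError("SS strings must not be empty")
    allowed = set("HEC")
    if not (set(reference) <= allowed and set(other) <= allowed):
        raise ValueError("SS strings may contain only H, E, and C")
    # Count positions where both assignments give the same state.
    matches = sum(a == b for a, b in zip(reference, other))
    return 100.0 * matches / len(reference)


if __name__ == "__main__":
    pdb_like = "CHHHHCCEEEC"          # hypothetical reference assignment
    all_coil = "C" * len(pdb_like)    # coil-only assignment of the same length
    print(f"{ss_identity(pdb_like, all_coil):.1f}%")
```

Under this definition, a coil-only string scores exactly the coil fraction of the reference assignment, which gives a natural lower baseline against which assigned or predicted SS strings can be judged.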

Figure 1. Main concepts of the SURPASS model. (A) All-atom reference structure of the 2hsh protein (chain A) with the superimposed protein surface in gray. (B) Coarse-graining steps for a helical fragment: the picture in the middle shows an α-carbon trace with four darker α-carbons defining the center of a SURPASS united residue. United residues are partially overlapping, as shown at the bottom of panel (B). (C) The 2hsh protein structure in the SURPASS representation. Helices are colored green, β-sheets are colored blue, and loops are colored cyan. The black circle indicates a sphere containing half of the united residues of a globular protein. (D) The β-sheet fragment of protein 2hsh with model hydrogen bonds (see the original SURPASS paper7).
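The coarse-graining step of panel (B) can be sketched in a few lines of Python. This is a hedged illustration, not the SURPASS implementation: it assumes that each bead is the unweighted average of four consecutive α-carbon positions and that the four-residue window slides by one residue (which would make neighboring united residues overlap). The name surpass_like_beads and the window parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def surpass_like_beads(ca_coords: Sequence[Vec3], window: int = 4) -> List[Vec3]:
    """Average each run of `window` consecutive alpha-carbon positions.

    Assumptions (not confirmed by the text): equal weights for all four
    positions and a window that advances by one residue at a time.
    """
    if window < 1:
        raise ValueError("window must be positive")
    beads: List[Vec3] = []
    # A chain of N residues yields N - window + 1 overlapping fragments.
    for start in range(len(ca_coords) - window + 1):
        fragment = ca_coords[start:start + window]
        x = sum(p[0] for p in fragment) / window
        y = sum(p[1] for p in fragment) / window
        z = sum(p[2] for p in fragment) / window
        beads.append((x, y, z))
    return beads


if __name__ == "__main__":
    # Six hypothetical alpha-carbons placed on a straight line along x.
    trace = [(float(i), 0.0, 0.0) for i in range(6)]
    for bead in surpass_like_beads(trace):
        print(bead)
```

Averaging over four consecutive positions smooths out the local zigzag or helical turn of the α-carbon trace, which is why regular fragments look nearly straight in this representation.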

or β-sheets) of modeled proteins adopt almost straight-line shapes. The united residues are spherical, with an excluded volume that depends on the expected local geometry of the modeled protein. Helical united residues are the thickest (neighbors along the chain may partially overlap) and beads forming β-strands are the thinnest, while coil beads have an average size. The virtual bonds are flexible, and their length oscillates around the expected average values characteristic of specific secondary structure fragments. Statistical knowledge-based potentials control the local geometry and short-range interactions of united residues. These potentials encode various structural regularities that characterize the local packing of amino acids in native-like structures of globular proteins, including the distance between beads along the chain, preferred contact distances (between helices, β-strands, etc.), contact angles, etc. The angular restrictions enforced by the main chain hydrogen bonds are also modeled by the specific statistical, distance and angle


Table 1. Results for α Proteins^a

                 PDB identity [%]
PDB ID   size    DSSP    PSIPRED   COIL
1oks      53     86.8    86.8      11.3
1i2t      61     86.9    86.9       6.6
1eo0      77     89.6    79.2      20.8
1x3o      80     87.5    82.5      22.5
3qf2      90     87.8    81.1      17.8
1cy5      92     84.8    81.5       7.6
1abv     105     96.2    94.3      24.8
1tqg     105     93.3    87.6       9.5
1g7d     106     90.6    86.8      25.5
1bkr     108    100.0    82.4      38.0
1nze     112     91.0    92.0      12.5
1a7d     118     99.2    88.9      28.8
1nzn     122     90.2    80.3       9.0
1k40     126     92.9    84.9       6.3
1orj     126     91.3    88.9      14.3
1cpq     129     97.7    93.8      29.5
1wn0     131     90.0    90.0      11.5
1edu     149     90.6    87.2      18.8
1aep     153     92.8    77.8      24.8
1ls4     164     93.9    82.9      17.0
1r0d     194     94.3    77.3       7.7

transitions [int] and near-native [%]

PDB 879 92 107 35 0 24 43 107 74 17 74 120 14 80 105 107 120 0 60 54 12

3.95% 0.02% 1.44% 0.08% 0.00% 0.01% 0.15% 14.46% 2.08% 0.01% 16.13% 0.63% 3.99% 54.87% 30.71% 2.66% 3.88% 0.00% 31.56% 15.82%