Article pubs.acs.org/JPCB
Can Simple Interaction Models Explain Sequence-Dependent Effects in Peptide Homodimerization? David J. Smith and M. Scott Shell* Department of Chemical Engineering, University of California, Santa Barbara, Santa Barbara, California 93106, United States S Supporting Information *
ABSTRACT: The development of rapid methods to explain and predict peptide interactions, aggregation, and self-assembly has become important to understanding amyloid disease pathology, the shelf stability of peptide therapeutics, and the design of novel peptide materials. Although experimental aggregation databases have been used to develop correlative and statistical models, molecular simulations offer atomic-level details that potentially provide greater physical insight and allow one to single out the most explanatory simple models. Here, we outline one such approach using a case study that develops homodimerization models for serineglycine peptides with various hydrophobic leucine mutations. Using detailed all-atom simulations, we calculate reference dimerization free energy profiles and binding constants for a small peptide library. We then use statistical methods to systematically assess whether simple interaction models, which do not require expensive simulations and free energy calculation, can capture them. Surprisingly, some combinations of a few simple scaling laws well recapitulate the detailed, all-atom results with high accuracy. Specifically, we find that a recently proposed phenomenological hydrophobic force law and coarse measures of entropic effects in binding offer particularly high explanatory power, underscoring the physical relevance to association that these driving forces can play.
1. INTRODUCTION Peptides are exciting candidates for a wide range of applications in materials engineering.1−7 They offer routes to highly controlled and functional nanoscaffolds for tissue engineering and regenerative medicine,8,9 glues for wound healing and industrial adhesives,10−12 biosensors,13−16 drug penetration enhancers,17,18 antibacterials,19 and more. They are attractive for modulating interactions at surfaces, which has led to roles in protective coatings,12 composite materials,20,21 and nonfouling22 applications. The advantage of peptides includes their sequence tunability and the ability to implement diverse chemistries through simple decoration and manufacturing processes,23−25 leading to a wide range of nanostructures3,4,9,23,24,26 and responsive capabilities.3,5,27 Peptides are also attractive for biocompatibility, sustainability, and environmental factors and for the opportunity to parse and exploit known peptide and protein data sets with affinity binding and other biochemical toolkits.25 An understanding of the sequence determinants in peptide assembly also has implications for protein folding and misfolding and the control of protein aggregation and solubility (stability) in biotechnology and biotherapeutics.28−30 Peptides and their higher-order assemblies are implicated via aggregation in at least 20 major diseases.31,32 Still, however, the interaction of peptides with each other and surface substrates is not completely understood in the fundamental theoretical sense.12,33,34 Although many efforts have sought to predict sequencespecific aggregation and self-assembly propensity using © 2017 American Chemical Society
bioinformatics approaches, models based not on statistics but on physiochemical interactions potentially offer broader predictive power and transferability to novel sequences. Here, we pursue such an approach that uses detailed all-atom free energy calculations of peptide dimerization to motivate reductionist binding models based on simple interaction laws and scaling principles. We show that statistical methods can be used to infer which structural and molecular-thermodynamic metrics are crucial to the detailed free energies. Bioinformatics approaches have produced a popular number of statistical routes to understanding how peptides and proteins hierarchically interact to produce higher-order assemblies. The most common response variable in these models is the aggregation propensity (AP), an often binary outcome associated with the formation (or lack thereof) of amyloid fibrils that entails a universal cross-β structure with peptide monomers assembling into β-strands perpendicular to the fibril axis.35,36 Such bioinformatics models seek structure−property correlations for peptide and protein aggregation as a function of physicochemical, sequence composition, thermodynamic, and even germline-dependent predictors. Of particular interest is the determination of “hot spots”, the minimal functional units believed to be aggregation-prone or responsible for the formation of stable nanoscale assemblies.37−39 Classical and Received: April 4, 2017 Revised: May 23, 2017 Published: May 24, 2017 5928
DOI: 10.1021/acs.jpcb.7b03186 J. Phys. Chem. B 2017, 121, 5928−5943
Article
The Journal of Physical Chemistry B
specific models,61 they also frequently have difficulty sorting proteins with the same sequence composition but a different arrangement.55 Moreover, there is a growing list of systems that challenge the classic correlative picture in which charged residues reduce AP, whereas hydrophobic residues increase it.47,62,63 Counterexamples include (1) numerous five- to eightresidue peptide fragments that require an X(Y)nZ primary structure to form aggregates, where X and Z are charged residues and Y is hydrophobic (e.g., KTVIIE);64−66 (2) uncapped (zwitterionic) diphenylalanine fragments that form higher yields of nanotubes than their capped (solely hydrophobic) counterparts;67 and (3) short hexameric peptides that have a higher AP when the net charge is ±1 rather than net neutral or multivalent.66,68 Physics-based approaches have offered additional fundamental insights into peptide aggregation and assembly through detailed interaction models. These approaches often draw largescale inferences about high-order peptide and protein assemblies from small-scale studies of lower-order assemblies such as monomers, dimers, and trimers. This is partially legitimized by experiments that correlate aggregation rates with simple single-peptide physicochemical properties such as secondary structure preference, hydrophobicity, and net charge.47,69,70 Various brute-force molecular dynamics (MD), enhanced sampling temperature replica exchange molecular dynamics (T-REMD), 71 bias-exchange metadynamics (BEMD),72 and parallel well-tempered metadynamics73 studies have offered insights into the structural and thermodynamic behaviors of many specific peptide systems.74,75 These studies have emphasized the role of electrostatics,76,77 hydrogen bonding networks,76 aromatic residues and π-stacking,76 and a stable hydrophobic core76 in the thermodynamics of small peptide oligomers.78−80 Because they produce the entire ensembles of states, physics-based approaches are able to capture important thermodynamic metrics such as backbone entropies. Previous work by Jeon and Shell has shed light on experimental findings66 as to why monovalent hexapeptides of the KTVIIE template form more ordered aggregates than their zwitterionic (net neutral) and divalent counterpartsspecifically, entropy was found to stabilize monovalent trimers because of the degeneracy of possible strand alignments.74 Oligomer polymorphism studies have also suggested the importance of conformational entropy in assembly.81,82 Still, other simulations have shown, in the study of higher-order aggregates, that the octamer state of KLVFFAE is stabilized relative to lower-order aggregates.83 A particularly interesting direction is the synergistic use of bioinformatic and physics-based approaches. Bioinformatics metrics can be computed from biophysical simulation results that contain details of the collective ensemble of states. For example, TANGO uses molecular structures and force fields in a statistical mechanical fashion to predict AP in proteins.70 Other studies have used tools that predict aggregation-prone regions based on the spatial AP, a measure of dynamic exposure to hydrophobic patches that is computed via molecular simulations.84 Because of this effort, metrics such as conformational entropy have more recently been built into bioinformatics approaches.85 Lin and Shell used enhanced sampling MD simulations of peptide monomers and logistic regression modeling with various molecular thermodynamic parameters of both amyloidogenic and nonamyloidogenic sequences, coupled with experimental data, to achieve 70 to 80% accuracy in aggregation predictions. They found that aggregation was
recent model examples include PASTA and BETASCAN, which both predict protein AP via the energy of β-pairing calculations;40−42 SALSA, which locates protein regions with a high propensity for β-strand structure;43 AGGRESCAN, which predicts scores for aggregation hot spots using AP values for each of the 20 natural amino acids from in vivo experimental data;44 Zyggregator;45 and RosettaProfile,46 among many others. Recently, AbAmyloid has been touted as the second of its kind in antibody AP. It has achieved predictive accuracies of 83.10 and 83.33% within and across germlines, respectively,41 through the use of a decision tree method with amino acid composition, germline-dependent predictors, and physicochemical properties such as the dipeptide composition, composition of polar and nonpolar residues, amphiphilicity, hydrophobicity, reverse turns, helical structure, and the number of ubiquitylation sites, among others. Such bioinformatics approaches have provided insights into the determinants of peptide interactions, aggregation, and assembly. Chiti et al. noted a simple linear correlation among aggregation rates and hydrophobicity, secondary structure propensity, and net charge among peptide variants with single-point mutations.47 In a follow-up, additional physicochemical and thermodynamic (e.g., pH, ionic strength, and concentration) parameters were included, where hydrophobicity and β-strand hydrophobic/philic sequence patterning were still found to be the strongest predictors.48 Crucially, these models as well as experimental work have produced insight into hot spots. In the first study of hexapeptide units from islet amyloid polypeptide (IAPP), Tenidis et al. showed that the NFGAIL fragment is amyloidogenic and particularly that phenylalanine or a similar aromatic residue is essential through the provision of both hydrophobicity and directional and ordered Π−Π stacking.49 Thompson et al. assessed fibril determinants through 3D mapping of each hexapeptide unit in amyloidogenic yeast prion Sup35 onto an ensemble of templates from the crystal structure of NNQQNY.46 Gazit and co-workers have further shown that peptides as short as five and four amino acids (e.g., DFNKF and DFNK) can form nanoscale fibrils with aromatic moieties.50 Their work suggests that aromatic residue groups contain the highest amyloidogenic potential through the mutation of aromatics for nonaromatic aliphatics as well as aromatics of lower aliphaticity.51−53 Dobson and co-workers also noted the importance of aromatics.54 This is perhaps most evident in peptides as small as diphenylalanine (FF), which are known to form structured and ultrastrong nanotubes,24,55 and F and Y are among the smallest hydrogelators.8 Studies have shown that dipeptide and tripeptide motifs contain the necessary information to form ordered nanoscale assemblies,56 and FF is one of the most studied motifs in tripeptide assemblies.56−58 Still, bioinformatics approaches can be limited in their ability to inform fundamental principles. Inherently, these models are convoluted with multiscale molecular physicochemical and macroscopic thermodynamic parameters that are obtained through a mixture of methods (i.e., experiments, simulations, and phenomenologies), leading to interdependent, correlated metrics.28,59 The reasoning for this top-down tactic is generally holistic in that the overall aggregation or assembly behavior cannot be determined from the sum of its parts. Furthermore, many bioinformatics models are trained on amyloidogenic peptide and protein sequence families and therefore are often unable to sort between amyloidogenic and nonamyloidogenic proteins of similar sequences.60 With the exception of position5929
DOI: 10.1021/acs.jpcb.7b03186 J. Phys. Chem. B 2017, 121, 5928−5943
Article
The Journal of Physical Chemistry B correlated with the hydrophobic solvent-accessible surface area (SASA) and the number of charged amino acid residues, consistent with the more recent experimental findings.66,85 In the present study, we pursue a hybrid approach that uses statistical methods to identify simple physical scaling laws that capture sequence-specific effects on peptide dimerization. Rather than use tabulated experimental information as training data, we instead compute reference interaction free energies using detailed, extensive all-atom calculations. In turn, this allows the calculation and analysis of a large number of molecular structural and thermodynamic metrics that are selfconsistent with the computed dimerization free energies. (In other words, both binding free energies and metrics are measured for the same peptide model and force field.) To make reasonable the scope of this initial study, we specifically focus on a model hydrophilic peptide backbone onto which we systematically introduce hydrophobic residues. Hydrophobic interactions (HIs) are arguably the most important driving force in biological folding and self-assembly; therefore, there is great interest in the development of simple models that can capture them. Hydrophobic peptides are implicated in a number of interesting phenomena, perhaps most significantly in amyloidogenesis, where they are believed to be the most important predictor of AP.25 There has been great progress in understanding the origin of HIs, but it still remains a challenge to determine their strength in fluctuating, conformationally plastic molecules such as peptides. HIs show distinct signatures for small molecules and extended macroscopic surfaces, and recent work has attempted to explain the hydrophobic driving forces via local water density fluctuations around solutes,86 tetrahedrality,87 hydrogen bonding or coordination,87−91 and other interfacial water order parameters.92 Yet, a crucial question is how HIs contribute in strength and range to effective interaction energies in the complex setting of peptide assemblies. Here, we address several basic questions concerned with moving beyond sequence composition models and targeting key thermodynamic and structural signatures of peptide homodimerization. First, can the contributions of individual peptide residues be decomposed into pair interactions, or are there cooperative or anticooperative effects that cannot be neglected? Second, do direct binding potential energies or subsets thereof (e.g., van der Waals and hydrogen bonding contributions) capture the homodimerization behaviors? Third, are entropic and free-energetic contributions important and essential? Finally, are the properties of the peptides in the unbound (nondimer) state relevant to the overall binding thermodynamics? To answer these, we formulate a simulation study to capture detailed atomistic interactions and calculate physically relevant metrics for evaluation in simple statistical models.
Simultaneously, the models themselves offer insights into how to predict peptide−peptide HIs with higher fidelity. We start with an uncapped (zwitterionic) short serine-glycine scaffold (SG)4, with a total of eight residues per peptide. The scaffold, because of its flexibility and solubility,93 offers a useful starting point as it is frequently used in experimental studies as a protease-resistant linker motif in recombinant proteins.94−98 We then introduce serine-to-leucine mutations to systematically determine the structural and thermodynamic effect of the number and location of hydrophobic residues. By leaving unchanged all glycine residues, we also largely preserve the conformational flexibility of the backbone and thus do not assess its effect on thermodynamic properties at present. We choose representative sequences that involve single (LGSGSGSG, SGLGSGSG), double (LGLGSGSG, LGSGLGSG, LGSGSGLG, SGLGLGSG), triple (LGLGLGSG, LGLGSGLG), and quadruple (LGLGLGLG) leucine mutations. We hypothesize that the binding free energy should correlate with the ability of the peptide dimer to form a hydrophobic cluster as well as the ability to retain a backbone hydrogen-bonding network. 2.1. Model and Simulation Details. To characterize the extent of hydrophobic binding through all-atom simulations, we use the AMBER ff96 all-atom force field with an implicit solvent model99−101 whose combination has been shown to successfully fold a variety of small peptides with both α- and βsecondary structures.102−104 The implicit solvent model is that of Onufriev, Bashford, and Case,100 using a generalized Born (GB) approximation to the Poisson−Boltzmann equation and an additional hydrophobic-solvent-accessible surface-area term (GBSA). The use of an implicit solvent model is primarily due to the computational expense of the UREMD approach (described below), which otherwise would require prohibitively expensive systems with large numbers of water molecules to accommodate the entire span of the reaction coordinate, and a much larger number of simulation replicas. Certainly, the implicit solvent does not capture detailed hydration structure (including, for example, solvent shells and solvent-separated ion pairs) nor some subtleties of HIs associated with these properties. However, here we are interested in the impact of hydrophobicity on the dimerization affinity as the sequence is mutated, which the implicit model is likely to capture. This study also serves to illustrate a statistical approach to extracting simple models from molecular simulations, and in that sense, it is essential that the models allow reliable and high-precision calculations of underlying free energies that are self-consistent with simulation-calculated order parameters. (In other words, both free energies and parameters are calculated for models using the same force field.) We leave the investigation of a more detailed solvation model to a future study. Umbrella replica exchange molecular dynamics (UREMD) is an enhanced sampling strategy that combines replica exchange with a two-dimensional extended ensemble in temperature and a reaction coordinate of interest. This approach explicitly samples configuration space along the reaction coordinate and uses high-temperature sampling at select reaction coordinate values to prevent trapping in local free energy minima. UREMD boasts high statistical precision and has been particularly advantageous in elucidating the origins of cooperative and anticooperative behavior in peptide interactions and assembly.105,106 For all simulations, we use the UREMD technique in order to calculate high-precision free energies of dimerization.
2. METHODS Our main objective is to determine the role that simple molecular interaction models can play in the context of peptide binding. Specifically, this work investigates the effect of singlepoint nonaromatic hydrophobic mutations on the dimerization of two identical peptides. Starting from extensive all-atom simulations, we seek to characterize the molecular interactions responsible for the thermodynamic driving forces and subsequently formulate simple theoretical models that describe the precise effect of hydrophobic mutations on binding. 5930
DOI: 10.1021/acs.jpcb.7b03186 J. Phys. Chem. B 2017, 121, 5928−5943
Article
The Journal of Physical Chemistry B ΔG bind = −kBT ln(C 0KA )
Here, we use for the reaction coordinate the separation distance between the geometric centroids of two associating peptides. For this study, replicas consist of specific combinations of temperature (ranging from 270 to 500 K) and peptide−peptide centroid separation (from 3.0 to 55.0 Å). Temperatures are maintained with the Andersen thermostat with massive collisions after each swap round,107 and reaction coordinate sampling is constrained using soft bottom harmonic restraints for umbrella potentials. Temperature cascades are constructed at the first and last umbrella separations, and other umbrella separations exist only at the lowest temperature for a total of 47 temperature−umbrella replicas for each peptide run. Each UREMD run lasts 20 ns per replica, and only the last 10 ns is used for analysis. For the short peptides studied here, we found that this amount of sampling was sufficient to converge the calculated potential of mean force curves (PMFs). Exchanges between replicas are attempted five times per 1 ps cycle based on a Metropolis-like criterion. Distances are spaced more closely at lower values of the reaction coordinate to enable swap acceptance ratios of at least 20%. All data is analyzed at 270 K using the weighted-histogram analysis method (WHAM).108 This temperature was shown to well stabilize the structures of a variety of small peptides and proteins for the force field.102−104 We use an in-house routine that consists of a wrapper code for the AMBER9 molecular dynamics package.109 2.2. Free Energy Calculations and Binding Thermodynamics. From the simulations, we construct PMFs or free energy profiles along the separation distance d between the Nterminus of the chain one α-carbon and the C-terminus of the chain two α-carbon. This reaction coordinate allows for a smooth signature of binding that appears to track association into the final dimeric structure in a continuous manner. To normalize the peptide−peptide PMFs, we shift the free energies to zero at a noninteracting reference state at around 35 Å. The resulting PMFs are used to calculate binding constants and binding free energies for each dimer system. Theoretical and practical details of these calculations can be found in refs 106 and 110−113. The binding constant is defined as KA =
[AB] [A][B]
where ΔGbind is the binding free energy, kB is the Boltzmann constant, T is temperature, and C0 is a standard state volume scale that determines the reference point of the free energy.110 For a PMF with a significant minimum, the binding free energy generally scales closely with the well depth, particularly if the curvature around it is high. In this study where our main interest is to compare different sequences, we examine a relative binding free energy with respect to a reference sequence, thereby eliminating the need to define a standard state ⎛ KA,2 ⎞ ⎟⎟ ΔΔG bind,1 → 2 = −kBT ln⎜⎜ ⎝ KA,1 ⎠
1 2
∫d
dup low
ΔX ≡ Xbound − X unbound
(5)
where X is the observable of interest and Xbound and Xunbound are the values of this observable in the bound and unbound states, respectively. In most cases, the values are calculated as averages over the last 10 ns of simulation. The bound state is an average over all simulation data where the peptide−peptide separation is within 1 Å of the free energy global minimum (defined for all peptides at 5.2 Å), and the unbound state is an average over all data where the separation is within 1 Å of the free energy tail (the unbound state, set at 31 Å). A second Δ convention is used to describe the shift in the binding metric relative to the (SG)4 scaffold. Binding metrics that take the form of eq 5 include the following energetic metrics: (3) ΔΔEbind, the energy of binding relative to the (SG)4 scaffold. (4) ΔΔEHBOND, an approximate hydrogen bonding energy of binding, where EHBOND = ∑i