Generating Intrinsically Disordered Protein Conformational Ensembles

Publication Date (Web): September 11, 2018 ... Our analysis of these peptides shows that the combination of the database and the Markov algorithm yiel...
0 downloads 0 Views 1024KB Size
Subscriber access provided by University of South Dakota

B: Biophysics; Physical Chemistry of Biological Systems and Biomolecules

Generating Intrinsically Disordered Protein Conformational Ensembles from a Database of Ramachandran-Space Pair Residue Probabilities Using a Markov Chain Robert I. Cukier J. Phys. Chem. B, Just Accepted Manuscript • DOI: 10.1021/acs.jpcb.8b05797 • Publication Date (Web): 11 Sep 2018 Downloaded from http://pubs.acs.org on September 12, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

Generating Intrinsically Disordered Protein Conformational Ensembles from a Database of Ramachandran-space Pair Residue Probabilities using a Markov Chain

Robert I. Cukier* Department of Chemistry Michigan State University, East Lansing, MI 48824-1322

*Corresponding author: Professor R. I. Cukier Department of Chemistry Michigan State University E. Lansing, MI 48224-1322 [email protected] Phone: 517-353-1170 Fax: 517-353-1793

1 ACS Paragon Plus Environment

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

ABSTRACT Intrinsically disordered proteins (IDPs), involved in regulatory pathways and cell signaling, sample a range of conformations. Constructing structural ensembles of IDPs is a difficult task for both experiment and simulation. In this work, we produce potential IDP ensembles using an existing database of pair residue phi and psi angle probabilities chosen from turn, coil and bend parts of sequences from the Protein Data Bank (PDB). For all residue pair types, a kmeansbased discretization is used to create a set of rotamers and their probabilities in this pair Ramachandran space. For a given sequence, a Markov-based probabilistic algorithm is used to create Ramachandran space Database-Markov ensembles that are converted to Cartesian coordinates of the backbone atoms. From these Cartesian coordinates and phi and psi dihedral angles of a sequence, various observables: the radius of gyration and shape parameters, the distance probability distribution that is related to the SAXS intensity, atom-atom contact percentages, local structural information, NMR three-bond J couplings, CA chemical shifts and residual dipolar couplings are evaluated. A benchmark set of ensembles for sixteen residue, regular sequences is constructed and used to validate the method and to explore the implications of the database for some of the above mentioned observables. Then, we examine a set of nonapeptides of the form EGAAXAASS where X denotes residues of different character. These peptides were studied by NMR, and subsequent MD simulations were carried out using various force fields to find which one best agrees with the NMR data. Our analysis of these peptides shows that the combination of the database and the Markov algorithm yields ensembles that agree very well with the NMR and MD results for the above listed observables. Thus, this Database-Markov method is a promising method to generate IDP conformational ensembles.

2 ACS Paragon Plus Environment

Page 2 of 52

Page 3 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

I. INTRODUCTION Intrinsically disordered proteins (IDPs) function by a disorder-to-order transition induced by interaction with a ligand or another protein. They are prominent actors in regulatory pathways and in cell signaling.1-15 An entropy penalty is incurred in this disorder-to-order transition that must be made up by enthalpic stabilization. Assessing the magnitude of this entropy penalty is key in view of the marginal stability of proteins.16-18 Obtaining this entropy relies on the construction of conformational ensembles; a difficult task when going beyond the assumption of “random coil” behavior. Early on it was recognized that disordered proteins are not random coils.19-23 Rather, they may exhibit substantial residual order, leading to a reduction of this entropy penalty. Numerous experiments and simulations have been carried out to explore the nature of IDP ensembles.24 The experimental work typically uses small-angle X-ray scattering (SAX)25-27, NMR28-31, and Förster Resonance Energy Transfer (FRET)32, often in combination with simulation, to construct IDP ensembles and obtain their observables. There are varieties of molecular dynamics (MD) and more coarse-grained simulation methods that produce ensembles. For MD, there are two main problems in obtaining reliable conformational ensembles: force-field (FF) quality and slow sampling.33-34 FFs have mainly been parameterized with the extensive data available on folded proteins. With the requisite small MD time-step, special methods are needed to overcome potential surface barriers. Both FF construction and sampling are especially difficult tasks for IDP simulations.35-40 In this work, we will construct IDP conformational ensembles based on a database of protein backbone dihedral conformations obtained by Shapovalov and Dunbrack and co3 ACS Paragon Plus Environment

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

workers.41

42

The database41 provides, for each pair of residue types XY ∈ ( amino acids ) , the

probabilities of sampling the phi and psi dihedral angles of the residue pairs. To make our approach a practical program, in view of the exponentially large state space of a given sequence, the Shapovalov-Dunbrack database that provides pair probabilities in continuous phi and psi space is discretized. As will be shown, the database pair probabilities do exhibit good rotameric properties, and we use a kmeans approach to cluster each pair probability into discrete regions of the Ramachandran (phi, psi) plane. For a given sequence, ensembles will be obtained by using these residue-specific pair probabilities to construct sequence realizations based on a Markov chain. They will be referred to as Database-Markov ensembles. With these dihedral ensembles a conversion from Ramachandran space to Cartesian Coordinates can be made and various observables of the ensemble obtained. Residue-specific pair probabilities have been used before to obtain properties of IDP-like ensembles.23 In that work, the pair probabilities were discretized and converted to an energy function that was used to generate conformational ensembles with a Monte Carlo algorithm. Radius of gyration (RG) and residual dipolar coupling (RDC) predictions were compared with experimental data for a variety of proteins. To compare with experimental and simulation metrics, we shall focus on structure via: the probability distributions p(RG) of the radius of gyration (RG), along with other measures of ensemble shape, the structure factor p(R) (whose Fourier transform gives the SAXS spectrum)2627

, the end to end distance (measured by e.g. FRET32, 43-45), and correlations among some of

these quantities. Furthermore, where appropriate, hydrogen bond and other local structural

4 ACS Paragon Plus Environment

Page 4 of 52

Page 5 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

information will be evaluated, as well as NMR three-bond 3JHNα residue scalar couplings46-47 , RDCs and CA residue chemical shifts.48 Two classes of sequences are explored in this first application of the method. First, sixteen residue sequences consisting of mixtures of the form (XY)8 and (X)8(Y)8, with some residue types having high likelihood to be found in IDPs.49-50 Metrics for accurate sampling of the realizations will be established. And, the relation between the various Ramachandran plane residue propensities and their implications for some of the above metrics will be explored. Second, a series of nonapeptides of the form EGAAXAASS with X= [G, W, I, D, V, P] chosen from different residue classes will be considered. These peptides have been investigated by NMR spectroscopy51 and by MD simulation.52 The latter study was carried out to evaluate and benchmark re-parameterized MD force fields both for the residues and the water models for improving simulations of IDPs. The plan of the remainder of this paper is as follows. Section II presents the methodology of converting the Shapovalov-Dunbrack database to the desired discretized pair residue probabilities. The Markov chain ensemble construction for a sequence, based on these probabilities, is outlined and the conversion between Ramachandran phi/psi space for each residue and the corresponding Cartesian coordinates representing the sequence backbone is given. The various geometric and other observables that follow from the Ramachandran and Cartesian coordinates are defined here. Section III presents our results on the sixteen-residuelong peptides and the nonapeptides. A comparison of our nonapeptide results with those obtained from Flexible-Meccano53, a method that also uses a database approach is carried out.

5 ACS Paragon Plus Environment

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 52

Section IV summarizes the method and results and discusses its strengths, deficiencies and prospects of this Markov method.

2. METHODS 2.1 The Shapovalov-Dunbrack database.41 The Shapovalov-Dunbrack database provides pair residue probabilities for all XY (X,Y ∈ A, … Y; Z) with A, … ,Y the 20 standard amino acids and Z corresponding to cis proline. Four sets of databases are given. The one used here is designated as NDRD_TCB and incorporates data from turn, coil and bridge regions of sequences deposited in the PDB. There are 44,112 residues in this database. As noted in their analysis of the four databases, the NDRD_TCB set does the best job in a loop prediction benchmark set. For each of the 21 * 21 = 441 residue pairs a nonparametric statistical methodology was used to obtain the pair probabilities on phi/psi scales of 10 degree bins. For Markov purposes, what would be most useful are the conditional probabilities

p ( x2 , R2 | x1 , R1 ) ≡ p ( x2 , R2 ; x1 , R1 ) / p ( x1 , R1 ) ,

(2.1)

with Ri a residue of type i and xi the ith bin for either phi or psi. But, only the conditional probabilities p ( x2 , R2 | R1 ) , with the normalizations

∑ p(x ,R

x2∈{ X 2}

2

2

| R1 ) = 1

(2.2)

are available in the database. In the absence of specific knowledge54, use of a Lagrange multiplier on x2 to maximize p ( x2 , R2 | x1 , R1 ) subject to its norm constraint of eq. (2.2) shows that the Markov conditional probabilities are 6 ACS Paragon Plus Environment

Page 7 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

p ( x2 , R2 | x1 , R1 ) = p ( x2 , R2 | R1 ) .

(2.3)

2.2 Discretization of the Database using Kmeans. To discretize these conditional probabilities, we use the binned data to create, for each XY residue pair, a “trajectory” where the number of instances of each bin is proportional to its database probability. A simple peak finder routine was written to identify the number of peaks and their locations in each XY trajectory, using a cutoff factor of 10 relative to the maximal peak value, except for X or Y=Glycine, where a factor 20 was used. These peak and location data, along with the corresponding trajectories, were used in a special-purpose kmeans routine to cluster the pair conditional data, and obtain the discretized conditional probabilities that will be the Markov chain ingredients. The kmeans program uses a previously developed “compositional clustering” method55 in which the phi and psi dihedral angles of each pair are jointly clustered. The clustering is done in “Fisher” coordinates56 that represent each angle in terms of its sine and cosine, in order to provide a proper metric in the periodic dihedral space. Knowledge of the number and approximate, initial locations of the centroids, from the peak finder program, provided excellent kmeans initial guesses. Some experimentation with the number of clusters (from using different numbers of peaks in the peak finder program) and evaluation of a quality-of-fit measure in the kmeans program shows that our clustering is accurate in the following sense. For each conditional probability pair we evaluated the root mean square (rms) deviation between each cluster’s centroid and all the trajectory points contributing to that cluster. Making sure that this rms deviation is small for the number and centroid locations in Fisher space of each pair provides a good set of centroids and associated conditionals to represent the database. The four-dimensional 7 ACS Paragon Plus Environment

The Journal of Physical Chemistry

Ramachandran Fisher space rms deviations are in the range 0.07-0.25. In this space the largest deviation would be 2 2 , thus verifying that the phi and psi angles that form this space are tightly clustered around their respective centroids. The database of discretized conditional probabilities for each phi/psi centroid is given in Table S1 of the SI. Examination of this table shows that the number of clusters (rotamers) varies from 1 for, e.g., XY=PV to 11 for, e.g., XY=NF, with the largest number of XY conditionals with 5, 6 and 7 rotamers.

PMF

180

Type (phi,psi) 6.0 5.0

120

α1 (–70, –30)

4.0 3.0

60

α2 (–100,5)

2.0

psi

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1.0

0

PPII (–70,140)

0.0

-60

β (–130,145)

-120 -180 -180

αL (80,0) -120

-60

0

60

120

G (80,-170)

180

phi Figure 1. PMF (ϕ ,ψ ) ≡ − log  condProb (ϕ ,ψ )  obtained from all 441 conditional probabilities XY by kmeans discretization. As the database conditionals are p ( x2 , R2 | R1 ) , the dihedral angles plotted are for x2 ≡ (ϕ ,ψ ) . Ramachandran plane high population (low PMF) regions are defined in the right side table.

8 ACS Paragon Plus Environment

Page 8 of 52

Page 9 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

Figure 1 presents the potential of mean force (PMF) of a summary over all XY pair kmeans data of the Ramachandran space occupation of the various residue pairs. Our type designations come in part from extensive work on classifying PDB-based protein structures in Ramachandran space.57 That database is dominated by folded proteins with definite secondary structure. Nevertheless, constraints arising from steric requirements still restrain the Ramachandran populations from the database to regions that are prominent in the PDB. What changes, of course, is the weighting in Ramachandran space. Examination of the probabilities underlying the PMF plot shows that PPII is most prominent, followed by α1, β~α2, and αL in descending population order. A comparison with previous work38 (see their Figure 3) that consists of non-Pro, non-Gly residues that are not in helix or sheet secondary structure in the TOP500 of the PDB, shows a similar PMF pattern. Also displayed In that figure are Ramachandran space PMFs for various Amber and CHARMM FFs, modifications to them to pertain to IDPs, as well as the ABSINTH FF58 that was parameterized for IDPs. In the Discussion Section, a comparison of this PMF plot with those arising from IDP-oriented MD simulations will be given. 2.3 Markov Chain Construction. With the conditional probabilities determined, a Markov chain59 corresponding to a given sequence R1R2 ,..., RN can be generated. The chain probabilities, in row stochastic order, p ( x1 , R1 , x2 , R2 ,..., xN RN ) are

p ( x1 , R1 , x2 , R2 ,..., xN RN ) = p ( xN , RN | RN −1 ) p ( xN −1 , RN −1 | RN −2 ) ... p ( x2 , R2 | R1 ) p ( x1 , R1 ) (2.4)

9 ACS Paragon Plus Environment

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Note that our use of a Markov process is not to iterate it to a fixed point. Rather, as Eq. (2.4) shows, the Markov chain produces realizations of the sequence from the pair conditional probabilities. For convenience, we set the first residue to have unit probability – it is not relevant structurally. To generate realizations (an ensemble of conformers) of this chain we use a random number generator to sample each conditional probability of the chain according to the numbers and values of the conditionals constructed in Section 2.2 and given in Table S1 of the SI, and construct the chain according to eq. (2.4). Note that for a chain of N residues and P rotamers per residue pair there are (N-1)P possible states. For example, with N=16 and P=6 (using an average 6 rotamers/residue pair) there are ~5X1011 possible states. Sampling realizations of this state space with adequate statistics requires many trials. Fortunately, for the realistic cases, where there are a limited number of high population states, those states can be sampled sufficiently for not too long chains if an efficient method of generating samples is available. We previously devised an efficient method in the context of dichotomic Markov chains.60 For the current purpose a generalized program was written to incorporate any number of rotamers per residue pair. To take advantage of the feature that the chain probabilities fall off as the chain grows, only sub-chains whose probability is greater than some fraction of the maximum possible chain probability are accepted. Typically, a factor of ten is used for accepting chains. Once a set of chain realizations (states) with accepted probabilities are generated, an ensemble of states is generated where the number of times a particular state appears is proportional to its probability. It is useful to create multiple sets of realizations of the chain and use these sets to verify that a sufficient number of samples are generated to produce an 10 ACS Paragon Plus Environment

Page 10 of 52

Page 11 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

accurate ensemble. These accuracy tests are described in the Results section. The output is an ensemble in dihedral space that can be used to generate backbone Cartesian coordinates (PDBs) as we now describe. 2.4 Conversion of Dihedral Ensembles to Cartesian Coordinates. With the ensemble of phi and psi angles defining the dihedral space representation of the sequence, properties such as NMR 3JHNHα couplings that rely on the phi angles of the sequence can be obtained. To obtain other observables such as a radius of gyration, and for structural visualization, the dihedrals need to be converted to Cartesian coordinates. A program that does so was written. It assumes a fixed, idealized planar geometry for the peptide plane, along with standard peptide bond lengths and angles. Other than that, the method is general, and can incorporate (small) random variations of these parameters if desired. The first three atoms are in a fixed configuration and the Cartesian coordinates of each residue are then obtained from the list of dihedral values. Because the Markov chain generation is pairwise there is no guarantee that there will be no chain overlaps. Thus, when such chains are generated if one or more atom(i)-atom(i+n) for n=3, 4, … distances are greater than 2.5 Å such conformers are excluded from the ensemble. The fractions of chain exclusions will be given in the relevant sections of the Results. 2.5 Chain Observables. Measures of chain dimension and shape can be obtained by diagonalizing the gyration tensor (inertia tensor), for each member of the ensemble. The three gyration tensor eigenvalues provide conventional shape parameters61-63 for each ensemble member that can be averaged and histogramed. We evaluate the following observables. Geometric-based measures:

11 ACS Paragon Plus Environment

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

RG , the radius of gyration, and its probability distribution p ( RG ) indicating the overall sequence dimension. ∆ , the asphericity, and its probability distribution p ( ∆ ) measuring the anisotropy of the

shape (range 0 to 1, from sphere to rod). S, a measure of the principle axis direction of deviation from spherical geometry (range – 1/4 to 2, from oblate to prolate shape). Other geometric-based measures:

p ( R ) , the distance distribution function between all pairs of atom sites, whose Fourier transform, the scattering intensity, is proportional to the SAXS intensity. 26-27, 64 EtoE, the end-to-end average and its probability distribution p ( EtoE ) that provides another indication of how compact-to-extended the sequence is, that can be related to various fluorescence-based, e.g. FRET44 32, 43, 45 measurements. Secondary structure analysis based on the specifics of the database: A two-dimensional Ramachandran plane histogram of the sequence occupation probabilities over the ensemble is generated. The grid spacing used is 100 to match the database. The same peak finder routine as in Section 2.2 is used and these probabilities along with sufficient neighbor cell probabilities of the peaks are added to obtain the total probability around each peak. That provides peak locations and their probabilities for each sequence. “Hydrogen bond” (HB) analysis carried out by: 12 ACS Paragon Plus Environment

Page 12 of 52

Page 13 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

Evaluating the proportions in the ensembles of distances between N(res i)-O(res i+4) less than 3.5 Å for 1-4 HBs, and distances between N(res i)-O(res i+3) less than 3.5 Å for 1-3 HBs. The quotes on hydrogen bond indicate that the N-H-O angles are not restricted so as to indicate a true hydrogen bond. But the 1-3 HB definition turns out to be a good indication of PPII structure.65-66 We find that, for this work, there are essentially no 1-4 HBs, but many 1-3 distances that do correspond in some cases to PPII helix geometry. NMR analysis based on: 3

JHNHα three bond couplings that report on the phi angles’ sampling in the ensembles,

using the revised Bax47 values, rather than the standard Karplus46 values, though they hardly are different from the originally-based values. Averages and standard deviations over the ensembles’ data are evaluated. CA chemical shifts evaluated using the shiftx248 program that combines information based on sequence and structure (the backbone atom Cartesian coordinates generated in Section 2.4). The program is used in “NMR” mode that reports residue CA chemical shifts averaged over all the ensemble samples. Literature data is often reported as secondary chemical shifts, the difference between the given sequence’s experimental and “random coil” chemical shift values. There are a number of sets of random coil values67-70 that have been evaluated based on a number of methodologies. In this work, the CA random coil chemical shifts provided by Kjaergaard et al.67 are used. They have been obtained at neutral pH and are corrected mainly for protonation states relative to the Schwarzinger et al. values.69 This correction is significant for the Asp (residue X in PEP_D) and the Glu residue in all these peptides. 13 ACS Paragon Plus Environment

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 52

Residual Dipolar Couplings. To be able to obtain RDCs with our generated N, CA, C, O backbone atom positions requires addition of two hydrogens: the N-H amide hydrogen and the CA-H hydrogen. The Pymol program71 was used to add them. The RDCs are evaluated from the expression72

3 2 pq pq  pq pq  2 D pq = Dmax  Aa ( 3cos θ − 1) + Ar 2 sin θ cos 2ϕ  pq Dmax = −γ pγ q hµ0 4π rpq3

(2.5)

where the axial Aa and rhomboid Ar components in the frame of the principle axes of the Saupe alignment tensor components, S11, S22 and S33, are

Aa = S11

Ar =

(2.6)

2 ( S33 − S 22 ) 3

and the angles are certain projections of a given pq bond vector spanning atoms p and q on pq these axes. The constant Dmax depends on the atom p and q gyromagnetic ratios γ p and γ q

and bond distances rpq . Obtaining by computation the Saupe tensor can be accomplished in various ways.72-76 Best suited to our scheme is the Almond and Axelsen73 method approximately relating the Saupe tensor to the protein’s Gyration tensor. The Gyration tensor is 1 T= Na

Na

∑(x k =1

k

− x ) ⊗ ( xk − x

)

(2.7)

with the x k − x the Cartesian coordinate deviations of atom k of the N a atoms from their centroids x , for a given configuration. It is a real, symmetric, positive definite matrix 14 ACS Paragon Plus Environment

Page 15 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

whose eigenvalues λi ( i = 1, 2,3) , with the definition

ρi = λi and the ordering,

ρ1 ≥ ρ 2 ≥ ρ3 , are related to the diagonalized Saupe tensor according to 1 ( ρ 2 + ρ3 ) 2 1 S 22 = ρ 2 − ( ρ3 + ρ1 ) 2 1 S33 = ρ3 − ( ρ1 + ρ 2 ) 2 S11 = ρ1 −

(2.8)

Note that for ρ1 = ρ 2 = ρ3 (spherical symmetry) Aa = Ar = 0 , and for ρ1 > ρ 2 = ρ3 (axial symmetry) Ar = 0 . Thus, this RDC scheme consists of, for each conformation: obtain and diagonalize the Gyration tensor and, from its eigenvalues, evaluate Aa and Ar . Then, for each residue, construct e.g. the atoms p and q amide N-H bond vector and project it onto the S11 and S33 directions of the Saupe principle axes to obtain, respectively, angles θ pq and

ϕ pq . Finally, average these RDCs over the conformer ensemble. In this way, the above equations provide a set of relative RDCs values. A primitive error analysis was carried out by, for each θ pq , assuming a ± 10 degree error in

θ pq for a given N-H bond vector’s projection onto the Saupe S11 direction. A similar analysis for the ϕ pq lead to somewhat smaller error, so the error bounds we present are conservative but still provide a good estimate of what to expect.

3. RESULTS 3.1 Geometric Observables for the Four Main Ramachandran Populations.

We first

examine the geometric implications of sequences that correspond to the four principle regions

15 ACS Paragon Plus Environment

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

of occupation of the Ramachandran plane, given in Figure 1. Conformations for four sequences of length sixteen were (non-probabilistically) constructed using the angles in Table 1, their phi/psi angles converted to Cartesian coordinates, and the various considered geometric measures evaluated, as outlined in the Methods. Table 1 provides the radius of gyration (RG), shape parameter Δ that reflects the spherical (Δ =0) to rod like (Δ=1) character, and S, that tracks the principle axis’ direction of deviation from spherical geometry, from oblate (S=–1/4) to prolate (S=2), and the end-to-end distance. Table 1. Geometric characteristics of sixteen residue sequences with all angles set to the four regions of the Ramachandran plane given in the first column as defined in Figure 1. Δ (a) S (a) EtoE (Å) (a) Type RG (Å) (a) α1 (–70, –30) 7.80474 0.818272 1.48019 25.6454 α2 (–100,5) 9.16953 0.848232 1.56169 29.1586 PPII (–70,140) 13.5933 0.96747 1.90306 42.4963 β (–130,145) 15.8964 0.983315 1.95013 51.836 (a)

RG, Δ, S and EtoE are defined in Section 2.5. There is a definite trend from α1 to α2 to PPII to β of increasing all four quantities, RG, Δ, S

and EtoE, indicating a transition from more compact to more extended conformers. It should be noted that while the radius of gyration and the end-to-end distance both essentially double, the Δ parameter is always towards the rod-like (Δ=1) value. Thus, the conformations of all these sequences are quite extended. This statement finds confirmation in the corresponding plot of p(R), the distance distribution function, whose Fourier transform is proportional to the SAXS scattering function as often expressed as a Kratky plot.26-27 Figure 2 displays the p(R) plots for the above sequences. The data show the characteristic behavior of distortion from a bellshaped27, 64 profile appropriate to a spherical distribution with an increasingly extended tail for the more rod-like distributions. 16 ACS Paragon Plus Environment

Page 16 of 52

Page 17 of 52

α1 α2

0.16

PPII

β

0.12 p(R)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

0.08

0.04

0.00 0

10

20 30 R (Å)

40

50

Figure 2. The distance distribution functions p(R) for 16 residue sequences based on the four regions of the Ramachandran plane, as defined in Table 1. They show an increasing tail as the distribution becomes more rod-like. 3.2 Some characteristic 16 Residue Sequences. Before a comparison with experimental data, it is worth examining properties of some sequences that are quite regular in character, for a variety of residue types that emphasize different regions of the Ramachandran plane, as indicated in Figure 1. These sequences will exhibit a diversity in number of ensemble members and Ramachandran occupation and geometric properties, arising from the different number of rotamers for each doublet, their probabilities and Ramachandran plane sampling. To that end we construct 16 residue ensembles of the forms (X)8(Y)8 and (XY)8 with X ≠ Y . These sequences were first used to validate the methodology; in particular, to verify that the stochastically-generated ensembles accurately reflect the desired sequence probabilities. Table 2 lists the number of possible states (the last column) for each sequence. The range is very large; from 78,125 for PIPI to 3,938,980,639,167 for FDFD, reflecting the number of conditionals for each pair of residues. By only accepting states of high probability the accuracy of the stochastic generation can in principle be maintained. To monitor the quality of successful

17 ACS Paragon Plus Environment

The Journal of Physical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

ensemble construction our first criterion is that the sequence maximum probability (the product of the largest conditional probabilities for each residue pair in the sequence) is realized for all the sequences tried, and that the ensemble will contain 9 samples of maximum probability down to (the many) states with 1 sample. This cut off in accepted probabilities was chosen to reflect roughly one order of magnitude of conformation population, to exclude very low probability conformers that would hardly be measured experimentally. Use of a twoorders-of-magnitude difference in accepted sequence probabilities does not change the measurable properties significantly. An ensemble with this difference between the largest and smallest probabilities would most likely be beyond what could be captured experimentally. Our second criterion was to see if multiple, independent runs of a given sequence would produce reliable results for each run. A test on the sequence (EK)8 led to typical results. Ten independent runs, each run with 100 million trials produced, for the radius of gyration, RG=12.3812±0.0094 Å. Thus, the fluctuations over the independent runs are negligible. And, this RG agrees with the RG for (EK)8 with one run with 10 million trials, as shown in Table 2. A summary of some properties of the tested sequences is given in Table 2. The examples were picked in part to create ensembles that explore the consequences of the Ramachandran plane conditional probabilities (listed in Table 1) on sequence properties. Examination of the conditional probabilities listed in Table S1 of the SI show that some favor (higher probability) regions of the Ramachandran plane over other regions. Of course, the consequences to a particular sequence rely on all the conditionals that contribute to that sequence in some complex manner.

18 ACS Paragon Plus Environment

Page 18 of 52

Page 19 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Physical Chemistry

Table 2. Geometric and Ensemble Information on some sequences of the forms (X)8(Y)8 and (XY)8 with X ≠ Y . (P=Pro, I=Ile, E=Glu, K=Lys, H=His, F=Phe, T=Thr, W=Trp, D=Asp) Seq (a) PIPI EKEK EEKK RREE RERE HTHT FDFD HHTT WDWD (a)

RG(Å)(b) 13.40 12.39 11.72 11.39 10.73 9.15 9.01 8.94 8.70

Δ(b) 0.86 0.76 0.71 0.66 0.60 0.50 0.54 0.57 0.55

S(b) 1.59 1.30 1.17 1.05 0.89 0.68 0.75 0.82 0.78

α(c) 0.04 0.10 0.13 0.19 0.30 0.39 0.57 0.66 0.66

β(c) 0.20 0.11 0.08 0.16 0.14 0.31 0.15 0.13 0.28

PPII(c) 0.77 0.79 0.78 0.65 0.56 0.21 0.28 0.17 0.06

Ens size(d) 1865 9189 10111 72916 183816 766051 297172 54692 462856

States (e) 7.8e4 2.2e10 1.2e10 1.6e10 2.2e10 6.8e11 3.9e12 9.1e10 2.7e8

The sequences (X)8(Y)8 and (XY)8 with X ≠ Y are 16 residues long and, for notational

convenience, abbreviated as indicated. (b)

RG, Δ , and S are defined in Section 2.5.

(c)

The definitions of Ramachandran occupation are α (psi–90⁰, psi