A Hybrid Monte Carlo Scheme for Multibackbone Protein Design

To achieve Boltzmann sampling of sequences and conformations, we use a hybrid Monte Carlo (MC) scheme: a trial hop between backbone models is followed...
2 downloads 0 Views 5MB Size
Subscriber access provided by CORNELL UNIVERSITY LIBRARY

Article

A hybrid Monte Carlo scheme for multibackbone protein design Karen Druart, Julien Bigot, Edouard Audit, and Thomas Simonson J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.6b00421 • Publication Date (Web): 24 Oct 2016 Downloaded from http://pubs.acs.org on October 26, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

A hybrid Monte Carlo scheme for multibackbone protein design Karen Druart,†,‡ Julien Bigot,‡ Edouard Audit,∗,‡ and Thomas Simonson∗,† †Laboratoire de Biochimie (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France ‡Maison de la Simulation, CEA, CNRS, Univ. Paris-Sud, UVSQ, Universit´e Paris-Saclay, 91191 Gif-sur-Yvette, France E-mail: [email protected]; [email protected]

Abstract Multistate protein design explores sidechain mutations with the backbone allowed to sample a small, predetermined library of conformations. To achieve Boltzmann sampling of sequences and conformations, we use a hybrid Monte Carlo (MC) scheme: a trial hop between backbone models is followed by a short MC segment where sidechain rotamers adjust to the new backbone, before applying a Metropolis-like acceptance test. The theoretical form and a practical approximation for the acceptance test are derived. We then compute backbone conformational free energies for two SH2 and SH3 proteins using different routes and protocols, and verify that for simple test problems, the free energy behaves like a state function, a hallmark of Boltzmann sampling.

Key words: molecular modelling, Proteus package, free energy simulation, tyrosyl-tRNA synthetase 1 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Introduction

Computational protein design is an important emerging technique, both for engineering and understanding protein structure and function. 1–7 Many applications seek to obtain protein variants that have a specific preference for a particular conformation and possibly a particular ligand. For such applications, multistate design methods have been proposed. 8–13 They typically use a small, predetermined library of backbone conformations. This is in contrast to methods where the backbone is fully-flexible and can deform continuously during the simulation. 14–19 One multistate design strategy is to explore sequence space separately for each backbone conformation in turn, then combine the results in some way; for example, by retaining sequences that favor the desired backbone conformation and discarding sequences that favor others. 9 Another recent strategy is to explore sequence space for all the backbone variants at once, by combining residue-residue distances from all the variants into a single, effective energy function. 8 Dead End Elimination methods are another route and have been extended to multi-state design. 20–23 Here, we present a multistate method that samples sequences and backbone variants simultaneously, through a hybrid Monte Carlo (MC) scheme that differs from a recent proposal 19 and generalizes slightly an earlier one. 24 The MC move set includes hops from one backbone conformation to another, within a small, predetermined library. To allow such a hop, before evaluating it, we perform a short MC segment where sidechain rotamers and possibly types can change, thus enforcing a limited structural relaxation before we decide to accept or reject the move. This strategy can work as a heuristic; indeed, a related heuristic method was proposed recently and applied successfully to ligand binding site design; 19 see below. Here, however, we go further and seek a move acceptance scheme that will lead to a Boltzmann sampling of sequences and conformations. Specifically, sequences, backbone conformations, and sidechain conformations should all be sampled with the correct relative Boltzmann weights. For this, we use the theoretical expression for the hybrid move probability, which involves a sum over pathways in configuration space that connect the original 2 ACS Paragon Plus Environment

Page 2 of 47

Page 3 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

state and the partly-relaxed target state. 24 We propose an approximation to compute this expression that takes into account a subset of the connecting pathways, and is more rigorous than the one proposed previously, 24 which considered only a single pathway. We combine the new approximation with an assumption of detailed balance (see Theoretical framework) between specific sets of forward and backward pathways, leading to a practical computation scheme, which we have implemented in a protein design framework. To test the nature of the statistical sampling, we consider four realistic test systems, based on an SH2 and an SH3 protein, where part of the system can explore 6–10 backbone conformations, one sidechain position can explore multiple residue types, and all the protein sidechains can explore rotamers from a discrete library. For these systems, we perform several kinds of free energy calculation. This problem size (several thousand available rotamers) is moderate by the standards of computational design, 25 but much larger than the problems used to test the earlier hybrid scheme. 24 To ensure adequate sampling, we use long Monte Carlo simulations and, for some of the tests, a replica exchange approach that runs multiple simulations at different temperatures in parallel. For most of the tests, we also use importance sampling techniques that are common in the field of biomolecular free energy simulations. 26,27 A first, “naive” free energy method obtains free energy differences directly from the populations of two states of interest (e.g., two backbone conformations). The second method uses importance sampling, as it gradually “titrates in” a particular backbone conformation by varying a bias potential. A third method is analogous to the metadynamics importance sampling method, 28 and gradually equalizes the populations of all backbone conformations by iteratively introducing an adaptive bias potential. We verify that the results are independent of the particular computational route, and the computed free energy function behaves like a state function, a hallmark of Boltzmann sampling and true free energies. Both the theoretical framework for hybrid MC and the free energy methods are described in the next section. As an application to a slightly larger problem, we consider the tyrosyl-tRNA synthetase

3 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

enzyme. We redesign a signature sequence motif, which is part of a flexible, active site loop, and is known as the “KMSKS” motif, since its sequence is KMSKS in some organisms. Four of the five signature positions are allowed to mutate; there are ten backbone variants, and all proteins positions can explore a discrete rotamer library, for a total of several thousand available rotamers. This problem size is typical of many computational design applications (even though much larger applications are also common). The design identifies sequence variants that favor, respectively, a closed or an open loop conformation. This could lead in principle to protein variants that have altered activity and/or reaction rates. Our test systems include zero, one, or four positions that can mutate, plus ∼100 positions that can change rotamers and 10-30 that can change their backbone conformation. All the mutations, rotamer changes, and backbone changes occur in the same simulation, and lead to a sampling that appears consistent with a Boltzmann distribution. This is very encouraging, since even when there are no mutating positions, the large number of possible rotamer changes makes the test systems challenging. More generally, we expect our method should work well for problems of about the same complexity as the TyrRS application: a dozen backbone conformations, a few thousand available rotamers, and 5–10 mutating sites. This problem size is typical of many CPD applications, especially in the area of protein-ligand binding. For larger applications, more testing is needed. The present hybrid MC approach could also be applicable to other problems, such as ligand specificity design, as well as problems beyond protein design, where Boltzmann sampling is important and where there is a natural separation between slower and faster degrees of freedom, 29,30 analogous to the backbone/sidechain degrees of freedom here.

4 ACS Paragon Plus Environment

Page 4 of 47

Page 5 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

2 2.1

Theoretical framework MC exploration of sidechain rotamers and types

We consider a polypeptide of L amino acids. We assume that each amino acid i can adopt a few different types t, t0 , ...., leading to different possible sequences S. There are two classes of structures: folded and unfolded. For the folded protein, all the sequences share a small library of possible conformations for the polypeptide backbone; for each one, the sidechains can each explore a few discrete conformations r, r0 , ... called rotamers (around 10 per sidechain type t). The energy of any particular conformation depends on the sequence, the backbone conformation, and the particular set of rotamers. The structure of the unfolded form is not specified, but its energy Euf is known and has the simple additive form:

Euf (S) =

L X

Euf (ti ).

(1)

i=1

We perform a Monte Carlo exploration, whose goal is to generate a Markov chain of states, 31–33 such that the states are populated according to a Boltzmann distribution. The simulation system explicitly includes one copy of the folded protein, whose sequence and conformation can fluctuate. One possible MC move is to change a rotamer ri at one particular position in the folded protein; the energy change is ∆Eon = E(...ti , ri0 ...)−E(...ti , ri ...), where the subscripts o, n refer to the old and new rotamer states. Another possible elementary move is a mutation: we modify the sidechain type ti → t0i at a chosen position i in the folded protein, assigning a particular rotamer ri0 to the new sidechain. At the same time, we perform the reverse mutation in the unfolded protein, t0i → ti . The corresponding energy change has the form:

∆Eon = ∆Ef − ∆Euf = (Ef (...t0i , ri0 ...) − Ef (...ti , ri ...)) − (Euf (t0i ) − Euf (ti ))

(2)

∆Eon measures the stability change due to the mutation (for the given set of rotamers). 5 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 47

This MC procedure mimics a situation where for each folded sequence sampled during a simulation, there is also an unfolded copy present. With this interpretation, mutation moves can be seen as unfolding one copy of the protein and refolding another, and thus taking place in conformational space, so that ordinary statistical mechanics apply. For now, we assume only rotamer/type changes are possible. Let α(o → n) be the probability to select a move between two states o and n; let acc(o → n) be the probability to accept the move. The overall probability of the move is

π(o → n) = α(o → n) acc(o → n)

(3)

The classic Metropolis scheme 31–33 chooses the acceptance probability acc as

acc(o → n) = min (1, exp(−β∆Eon ))

(4)

where β = 1/kT is the inverse temperature. This leads to acc(o → n) = exp(−β∆Eon ) acc(n → o)

(5)

For rotamer and type changes, we may assume 25 that in the limit of a long simulation, the following condition of detailed balance 32,33 is obeyed:

N (o)π(o → n) = N (n)π(n → o),

(6)

where N (o) is the equilibrium population of state o (respectively, n). Eq. (6) expresses the condition that the number of o → n and n → o transitions observed in a long simulation are the same. Detailed balance is verified in the limit of infinite simulation length, subject

6 ACS Paragon Plus Environment

Page 7 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

to rather weak conditions concerning the move set, detailed elsewhere. 25,33,34 We then have α(o → n) N (n) π(o → n) α(o → n)acc(o → n) = = = e−β∆Eon N (o) π(n → o) α(n → o)acc(n → o) α(n → o)

(7)

If the probabilities to select moves are symmetric, α(o → n) = α(n → o), we are left with the Boltzmann distribution where the states are populated (at equilibrium) according to their Boltzmann factors e−βE . If the probabilities to select moves are not symmetric, a slight modification of the acceptance probability is needed.

2.2

Hybrid backbone moves

We now consider a third type of move, which involves a change in the backbone conformation. An individual move is schematized in Fig. 1 and Table 1. We assume there are several possible backbone geometries, which differ in the conformation of all or part of the protein, such as a flexible loop (see test systems in Fig. 2, where the flexible backbone part in each system is colored black). We use a “hybrid” move scheme that generalizes slightly one proposed earlier. 24 The initial state o will now also be denoted b, 0, where b refers to the backbone conformation and 0 to the initial set of rotamers. A backbone move has two stages: (1) switch to a new backbone conformation b0 , with sidechain types/rotamers unchanged; (2) perform N steps of Monte Carlo in sidechain type/rotamer space. Stage (2) allows the sidechains to adjust to the new backbone geometry. Sidechain mutations may or may not be allowed during this stage, depending on the application. In all cases, we simply speak of “rotamer relaxation”. In practice, only sidechains within or near the flexible backbone region will be allowed to relax. At this point, the system has reached a trial state n, also denoted b0 , N . We then apply a modified Metropolis test, accepting or rejecting the entire two-stage move, based on energy changes that have occurred during stages 1 and 2 (as detailed below). The rotameric MC segment (stage 2) follows a particular b0 , 0 → b0 , N pathway that we

7 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 47

refer to as the “generating path” Pg (it generates the trial state n; Fig. 1). The probability to follow this path is

α(Pg ) = π(0 → 1)π(1 → 2) · · · π(N − 1 → N ),

(8)

where each term π(i → i + 1) corresponds to one step in the rotameric MC segment. For an accepted step, π is given by Eq. (3). For a rejected step, we adopt the convention that i + 1 and i have the same structure, i + 1 ≡ i, and π(i → i + 1) is the move rejection probability: def

π(i → i + 1 ≡ i) = α(i → j) (1 − acc(i → j)) = π(i → i|j)

(9)

The notation on the left does not indicate which exact trial state j was rejected, even though it affects the precise value of π(i → i + 1); a more explicit notation is introduced on the right. For simplicity, in Eq. (8) and below, we mostly use the simpler notation π(i → i + 1) for rejected moves, with the understanding that when i + 1 ≡ i, π involves a specific target state j and a non-acceptance probability 1 − acc(i → j). The rejected move probabilities will actually cancel out of our final Metropolis test, as shown below. To compute the probability α(b, 0 → b0 , N ) to select the trial state b0 , N , we recognize that while the actual MC segment followed the generating pathway Pg , many other b0 , 0 → b0 , N pathways exist that could also have been taken. The trial move probability involves a summation over all these paths. For the individual steps along a pathway P, we continue to denote the probabilities π(i → i + 1), so that the dependency on P is not made explicit. The probability of choosing the b0 , N trial state then has the form:

α(b, 0 → b0 , N ) = α(b → b0 )α(b0 , 0 → b0 , N )

8 ACS Paragon Plus Environment

(10)

Page 9 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

where α(b → b0 ) is the probability to choose the particular b → b0 backbone hop and

α(b0 , 0 → b0 , N ) =

X

π(0 → 1)π(1 → 2) · · · π(N − 1 → N )

(11)

P

Here, the sum is over all the rotamer relaxation paths P that connect state b0 , 0 to state b0 , N . The number of such paths is enormous, roughly on the order of (10M )N , with M the number of residues allowed to relax along the paths (typically, ∼10 residues within or near the mobile backbone region). Effectively, the move selection probability (Eqs. 10-11) has the form of a path integral, 35 where each path connects the high energy b0 , 0 state to the (partly) relaxed state b0 , N and is weighted by its probability. For each b0 , 0 → b0 , N pathway P there is an inverse b, N → b, 0 pathway Q = P, where the b0 → b change occurs first, then the rotameric pathway P is followed backwards, N → 0. The collection or “bundle” {P} of forward paths thus has a corresponding bundle of inverted paths, {Q}. Nilmeier & Jacobson 24 proposed to assume that detailed balance is verified for these sets of pathways. This means that in the stationary state, the flux along the forward bundle {P} matches that along the inverted bundle {Q}. This can be expressed in the form: N (b0 , N ) acc(b, 0 → b0 , N ) α(b, 0 → b0 , N ) = × N (b, 0) acc(b0 , N → b, 0) α(b0 , N → b, 0)

(12)

We would like to choose acc so that this ratio is equal to e−β∆Eon (where ∆Eon = En − Eo ), giving Boltzmann statistics. We assume for simplicity the backbone hop probabilities are symmetrical, α(b → b0 ) = α(b0 → b). The acceptance probabilities should then obey P π(0 → 1) · · · π(N − 1 → N ) α(b0 , 0 → b0 , N ) acc(b0 , N → b, 0) −β∆Eon e = = PP 0 acc(b, 0 → b , N ) α(b, N → b, 0) Q π(N → N − 1) · · · π(1 → 0)

(13)

Notice that the rotamer changes i ↔ i + 1 in the numerator and denominator on the right occur in two different contexts: b0 (numerator) and b (denominator). We see that acceptance probabilities that lead to Boltzmann statistics can be written explicitly in a closed form,

9 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

albeit one involving a sum over many relaxation paths. In practice, the collection of pathways above cannot be used directly and an adjustment is needed. Indeed, in extensive test calculations on our SH2 protein, we found that for a given generating MC pathway Pg , about 90% of the rotameric moves are rejected, and for most of these (over 90%), the corresponding moves i + 1 → j on the inverse pathway have a negative energy change and would thus be accepted. As a result, the corresponding step rejection probability π(i + 1 → i ≡ i + 1) = π(i + 1 → i|j) and the overall probability α(P) of the inverse path are both zero, so that the ratio in Eq. (13) is not defined. This reflects a preference for rotamers closer to the initial b, 0 state that evidently persists throughout the generating path (and its inverse). To avoid this difficulty, we will systematically replace the original generating path with a simplified or “culled” version, which only contains the accepted moves of the original pathway. From now on, we will refer to the original pathway as the “primitive” generating pathway and, unless otherwise mentioned, we will use “generating pathway” to designate the simplified version, with the rejected moves culled out. To keep the notations simple, we will continue to write the path probabilities in the numerator and denominator of Eq. (13) as products of N terms, e.g., π(0 → 1) · · · π(N − 1 → N ), with the convention that for a rejected move on the primitive path, π(i → i + 1 ≡ i|j) = π(i + 1 → i ≡ i + 1), so that these contributions cancel out harmlessly from the ratio in (13). Several comments can be made. First, the pathways that appear on the right of (13), connecting b0 , 0 to b0 , N and b, N to b, 0 tend to restore the rotamers to a more favorable state after a backbone hop. Approximations for the variation of the energy along such a pathway could perhaps be obtained from linear response theory. 36 Second, if all the relaxation steps were downhill in both the direct and inverted pathways, the fraction on the right of (13) would be unity; making this approximation in all cases could be used as a simple heuristic. We would then simply run N rotamer relaxation steps and accept or reject the final configuration based on the overall energy change, ∆Eon . Third, a special case is when there is only one

10 ACS Paragon Plus Environment

Page 10 of 47

Page 11 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

rotamer relaxation step, N = 1; each sum on the right then reduces to a single pathway, for which the probabilities π(0 → 1), π(1 → 0) are known. Fourth, another important special case is when N is very large, so that the rotamer relaxation phases behave as long, equilibrium simulations, which obey Boltzmann statistics (they have converged to their invariant, stationary probability distribution). In that case, the total probability for the b0 , 0 → b0 , N transition no longer depends on the starting point, and is just the probability P (N | b0 ) of the final state. Eq. (13) becomes 

acc(b0 , N → b, 0) acc(b, 0 → b0 , N )



= eβ∆Eon

N →∞

P (N | b0 ) = eβ∆Eon e−β∆Eon = 1 P (0 | b)

(14)

The upper and lower probabilities on the right of (14) are proportional to e−βEo and e−βEn , respectively, with the same normalization constant, so the righthand side is equal to unity. We can then choose the forward and backward acceptance probabilities equal to unity: all mutation moves are accepted. With this large-N scheme, we effectively propose each move with a probability proportional to the Boltzmann weight of the final state: we get Boltzmann statistics thanks to the complex (and very expensive) move proposal scheme, not the acceptance scheme, contrary to the usual Metropolis idea.

2.3

“Single Path” and “Permuted Paths” approximations

A “Single Path” approximation (SPA) was proposed by Nilmeier & Jacobson, 24 where the sum over paths in Eq. (11) is approximated by the contribution of the single, “generating” path. We would then have: acc(b0 , N → b, 0) SP A β∆Eon π(0 → 1)π(1 → 2) · · · π(N − 1 → N ) = e × acc(b, 0 → b0 , N ) π(N → N − 1) · · · π(2 → 1)π(1 → 0)

(15)

The righthand side of (15) can easily be computed in practice. 24 When we perform the generating MC segment, we simply record the product of π values that occur along the P

Q

accepted steps and multiply them to obtain α(b0 , 0 → b0 , N ) and α(b, N → b, 0) (where Q is 11 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

the inverted version of P, taking place in the context of backbone b). Another way to view the SPA approximation is to assume there is detailed balance P

between each particular forward path b, 0 → b0 , 0 → b0 , N and its inverted, backward path Q

b0 , N → b, N → b, 0. Indeed, starting from this “super-detailed” balance assumption, we easily obtain (15). Notice that above, we assumed flux balance between bundles of forward and backward pathways, not individual pathways. Here, we introduce a new, multi-path approximation, as follows. For a particular, trial, backbone move, we perform as above an initial, generating, rotameric MC segment that defines the target trial state n. At the end of this segment, there are m residues that have actually changed rotamers, compared to the starting configuration o. As alternative paths from o to n, we then consider “permutations” of the (simplified) generating path, where the same m rotamer changes occur, but in different orders, as schematized in Fig. 1. There are m! such paths, and we either sample all of them (for small m) or a set, large number P of them, chosen randomly. Notice that the m “changed” positions do not include positions that change to a new rotamer during the relaxation segment but change back later in the segment, so that their final rotamer is the same as in o. For each permuted path P, the probabilities π(i → i + 1) are different from those in the generating path, because the rotamer changes take place in a slightly different rotameric context; however, they are easily computed. We assume that the sum over these permuted paths is a good estimator of the entire path integral that appears in the numerator of Eq. (13). Finally, for the reverse, n → o transformation, we adopt the corresponding inverted paths P, where the same rotamer changes are made, in the reverse order and the opposite direction, in the context of b, instead of b0 . We assume that the sum over the inverted generating path and its permuted variants is a good estimator of the path integral that appears in the denominator of Eq. (13). We refer to this new scheme as the “Permuted Path” approximation, or PPA.

12 ACS Paragon Plus Environment

Page 12 of 47

Page 13 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

2.4

Multi-backbone free energy estimation

To test the nature of the statistical ensemble sampled with SPA or PPA, we assume the simulation obeys Boltzmann statistics, we compute various free energy differences using several methods, and we verify that the results are independent of the method used: a hallmark of Boltzmann statistics and true free energies. In particular, we compare three methods to estimate the relative free energies Gi of individual backbone conformations i within a small, finite set of conformations that form our library of backbone geometries. The first method is a “naive” one: we run multi-backbone MC, then compute the conformational free energy difference Gi − Gj between two backbones bi , bj from the ratio of their fractional populations, fi , fj : Gi − Gj = −kT ln

fi fj

(16)

The multi-backbone MC can be performed with the protein sequence fixed or variable. If it is variable, the free energies being compared will correspond to one particular sequence S among those that are sampled; this can be denoted Gi = G(bi |S), Gj = G(bj |S). The second free energy method consists in “titrating in” one particular backbone conformation, say bi , by adding a bias energy δ to the system whenever it occupies this conformation. Starting from a large positive δ and decreasing it gradually, we go from a situation where conformation bi is never populated to one where only bi is populated. If the simulations obey Boltzmann statistics, we can apply the usual relations between populations and free energies. In particular, let fi (δ) be the fractional population of backbone conformation bi for a particular δ value and let δi∗ be the δ value where fi = 0.5 (the titration midpoint). We then have fn = where α =

β ln 10

1 1+

∗ e−β(δ−δi )

=

1 1+

∗ 10−α(δ−δi )

(17)

is referred to as the Hill coefficient (following the acid/base titration litera-

ture; 37 α = 0.734 (kcal/mol)−1 at room temperature). The free energy differences between

13 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 47

backbone conformations bi , bj (without any bias term) are then ∗

Gi − Gj = kT ln

1 + e−βδi ∗ 1 + e−βδj

(18)

The third free energy method is analogous to the metadynamics approach for MD simulations. 28 Over the course of a long MC simulation, we gradually increment the individual backbone conformational energies by a bias potential, until all the backbone conformations are equally populated. The biases added to each conformation then report on the relative backbone free energies for the unbaised system; specifically, the free energy of each backbone conformation bi is just Gi = −

X

δi (t),

(19)

t

where the sum is over simulation time and δi (t) is the bias energy added to each backbone’s energy at each time. In practice, after an amount δi has been added at a particular time, we run the simulation for an interval T before adding another bias energy. At the end of each interval T , bias is only added to one backbone: the one that was most populated during the last interval. The magnitude of the bias increment δi (t) decreases with time; implementation details are given in the next section.

3 3.1

Computational methods Test systems and backbone libraries

We consider four test systems, constructed from two small proteins: an SH2 domain (PDB code 1A81) and an SH3 domain (PDB code 1CKA). For the SH2 domain, the flexible backbone region includes the loop 196–203, with either the wildtype sequence ARDNNSY (wtSH2) or a simplified sequence with fewer rotamers, AASTSAA (siSH2). For the SH3 domain, the flexible region is either the 140–157 loop (loSH3) or a decapeptide ligand (peSH3). For each system a 500 ps molecular dynamics simulation was run at 300 K using the Amber 14 ACS Paragon Plus Environment

Page 15 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

ff99SB force field and a Generalized Born implicit solvent model, with a solute dielectric constant of 4. 38 The protein was held fixed except for the flexible loop or peptide. Six loop conformations were drawn from each trajectory such that the rms deviations between their conformations were between about 1 and 3 ˚ A; the complete set of deviations between models is given in Table 2. Notice that only a small part of the protein backbone structure differs between models: either a flexible loop or a peptide ligand. The test proteins are shown in Fig. 2. We also considered the tyrosyl-tRNA synthetase enzyme from Escherichia coli (PDB code 1VBM). The flexible region is a 31-residue loop in the active site, the so-called KMSKS loop, 39 encompassing residues 225–258. The signature motif in E. coli is

235 KFGKT239 .

The loop was modelled in either a closed conformation (homologous to the 1H3E Thermus thermophilus crystal structure) or a partly-open conformation, close to the 1VBM E. coli crystal structure. However, the loop is one residue shorter than the E. coli one and three shorter than the Thermus thermophilus one. This loop version excludes a few residues that are very divergent between the two enzymes, and was chosen to allow a more reliable modelling of the two loop conformations. The final loop includes the E. coli residues 225–258 other than Arg230. We included a conserved, buried water molecule in the model, as detailed elsewhere. 40 Using an MD protocol, as above, we obtained 10 backbone conformations, 5 partly open and with no ligand, 5 closed and with the tyrosine substrate as the ligand. When present, the ligand has its own set of rotamers, as described elsewhere. 41 Rms deviations between the loop conformations are given in Table 2. In the MC simulations below, for each test system, the different backbone conformations were assumed to have equal energies. To implement this, we ran a 10 million step Monte Carlo simulation separately for each backbone geometry and averaged the energy over the last 100,000 steps. The average energy for each backbone geometry was then subtracted out of the energy function for future simulations.

15 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

3.2

Page 16 of 47

Effective energy function and energy matrix for MC

The first stage in our protein design procedure 42–44 is to calculate the interaction energies between all sidechain pairs, between sidechains and the backbone, and between the protein and any ligands, considering all allowed backbone geometries, sidechain types and the discrete rotamer library of Tuffery et al. (enhanced by allowing several orientations for SH and OH groups). 45,46 To alleviate the rotamer approximation, each sidechain pair is initially positioned using two library rotamers, then 15 steps of conjugate-gradient energy minimization are performed, where only the pair of interest can move and the energy is limited to the interaction within the pair, plus the pair’s interaction with the backbone and solvent. 43,44 The minimization is done under restraints that maintain the two sidechain close to the library rotamers, while removing the largest steric overlaps. The final interaction energy is stored in an “energy matrix” for later use. 42 With this approach, a given rotamer at a given position I is not quite unique, but is represented by a series of proxies, all very similar, all belonging to the same local energy well, and with different proxies standing in for different matrix elements IJ, IK, .... Once established, the energy matrix is kept fixed throughout the Monte Carlo exploration. 44,46 The energy calculations are done using a modified version of the Xplor program, 44,47 using the following effective energy function:

E = Ebonds + Eangles + Edihe + Eimpr + Evdw + ECoul + Esolv

(20)

The first four terms in (20) are associated with covalent interactions in the protein, including bonds, angles, dihedrals, and improper dihedrals; the next two respresent van der Waals and Coulombic interactions between non-bonded atoms. These six terms were were taken from the Amber ff99SB empirical energy function, 48 slightly modified for protein design, with the original backbone charges replaced by a unified set, obtained by averaging over all amino acid types and adjusted slightly to make the backbone portion of each amino acid neutral. 49 The

16 ACS Paragon Plus Environment

Page 17 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

last term on the right, Esolv , represents the contribution of solvent. We used a “Generalized Born + Surface Area”, or GBSA implicit solvent model: 50

Esolv = EGB + Esurf

1 = 2



1 1 − W P

X

 −1/2 X 2 2 qi qj rij + ai aj e−rij /4ai aj + σi Ai

ij

(21)

i

Here, W = 80, P = 4 are the solvent and protein dielectric constants; rij is the distance between atoms i, j and ai is the “solvation radius” of atom i. 38,50 Ai is the exposed solvent accessible surface area of atom i; σi is a parameter that reflects each atom’s preference to be exposed or hidden from solvent. The solute atoms were divided into 4 groups with the following σi values (cal/mol/˚ A2 ): unpolar (-5), aromatic (-12), polar (-8) and ionic (-9). Hydrogen atoms were assigned a surface coefficient of zero. Surface areas were computed by the Lee and Richards algorithm, 51 implemented in the XPLOR program, 47 using a 1.5 ˚ A probe radius. In the GB energy term, the atomic solvation radius ai approximates the distance from i to the protein surface and is a function of the coordinates of all the protein atoms. The particular ai form corresponds to a GB variant we call GB/HCT, after its original authors, 50 with model parameters optimized for use with the Amber force field. 38 Since ai depends on the coordinates of all the solute atoms, 50 an additional approximation is needed to make the GB energy term pairwise additive and define the energy matrix. We use a “Native Environment Approximation”, or NEA, where the solvation radii ai of a particular group (backbone, sidechain or ligand) are computed ahead of time, with the rest of the system having its native sequence and conformation. 37,44 The surface energy contribution Esurf is not pairwise additive either, because in a protein structure, surface area buried by one sidechain may also be buried by another. To make this energy pairwise, Street et al proposed a simple procedure. 52 The buried surface of a sidechain is computed by summing over the neighboring sidechain and backbone groups. For each neighboring group, the contact area with the sidechain of interest is computed,

17 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

independently of other surrounding groups. The contact areas are then summed. To avoid overcounting of buried surface area, a scaling factor is applied to the contact areas involving buried sidechains. Previous work showed that a scaling factor of 0.65 works well. 38,46 For systems where the sequence was allowed to fluctuate (TyrRS, siSH2), the energy function also includes an unfolded state contribution (Eq. 2); the Euf (ti ) values employed are given in Supplementary Material. The TyrRS simulation also included a bias energy, that restrained the signature sequence to be close to the native value KFGKT. The bias energy had the form c(S − Srand )2 , where c = 1 kcal/mol, S is the Blosum40 similarity score relative to the native sequence, and Srand is the score relative to a random sequence (with equal probabilities for all amino acid types).

3.3

MC and Replica Exchange MC setup

The MC moves include backbone moves, individual sidechain mutations, and individual rotamer changes. At each MC step, one position attempted a rotamer change; one pair of positions also attempted a pair of rotamer changes. When mutations were allowed, the probability to attempt a mutation at one random position was 0.1 at each step (one mutation every ten steps on average). The probability to choose as a trial move a backbone hop was 0.1 at each step. The hybrid backbone moves use N relaxation steps; values of N between 10 and 100 were used, depending on the test. Residues allowed to move during the relaxation segment were those in the flexible region plus those whose (unsigned) interaction energy with any flexible residue was greater than 3 kcal/mol (for at least one rotamer combination). During the rotamer relaxation segments, no pairwise rotamer changes were allowed, only single position changes. Simulations were done at room temperature. Some runs used a Replica Exchange MC method, 25 with 8 replicas and thermal energies of kT = 3, 2, 1.333, 0.888, 0.592, 0.395, 0.263, and 0.175 kcal/mol. A conformation swap between a random pair of replicas (at neighboring temperatures) was attempted every 2,500,000 steps (one random pair per trial), or every 10,000 steps for TyrRS. Overall REMC trajectory lengths 18 ACS Paragon Plus Environment

Page 18 of 47

Page 19 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

were 250,000,000 steps per replica, or 50,000,000 steps for TyrRS. The plain MC simulations lasted 10,000,000 steps. In the titration calculations, one MC simulation was done for each value δ of the bias energy; the individual δ values were 1 kcal/mol apart. In the metadynamics calculations, the individual simulation segments lasted 1,000,000 steps. A bias energy δ(t) was added to the most populated backbone at the end of each segment. At the beginning of the procedure, the bias added was always 1 kcal/mol. Once all the backbones were populated, 0.7

the added bias gradually decreased, δ = exp(−0.1n ), where n is the MC segment number. Convergence was achieved within n ≤ 50 segments. For the Permuted Path calculations with different path numbers P , let m be the number of residues that underwent a rotamer change in a given generating path. If m! ≤ P , we used all the permuted paths. If m! was greater than P , we used P randomly chosen, distinct path permutations. Values of P were usually P =100; some tests used P =500 or P =1000.

4 4.1

Results Rotamer relaxation pathways

We consider first the energy relaxation ∆Er that occurs along each rotamer MC segment following a trial backbone hop. We consider the test system wtSH2, simulated with N =20 or 50 relaxation steps and P = 100 path permutations. For a large collection of accepted backbone moves, we report in Fig. 3 the energy variation along the first, generating path b0 , 0 → b0 , N . The average variation h∆Er (n)i of the energy with the step number n is shown, as well as a few individual relaxation segments. A 2D histogram is also shown for the pair (∆Er , m), where m is the number of residues that change rotamers during the relaxation pathway. The individual relaxation curves are diverse, noisy, and non-monotonic. The average behavior is more simple, with a smooth, monotonic decrease. With N =20 relaxation steps, the mean relaxation energy along the direct (respectively, inverted) pathway is about 4 kcal/mol (respectively, 14 kcal/mol); with N =50, it is about 11 kcal/mol (respectively, 6 19 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 47

kcal/mol). With N =20, there are about 5 rotamer changes on average; with N =50, there are about 10. We consider next the move acceptance probabilities in the SPA and PPA schemes. SPA and PPA differ in the estimation of the two sums over paths (or path integrals) in Eq. (13), whose ratio we denote f (P ); to compare the SPA and PPA acceptance probabilities, we focus on another quantity r(P ), defined as a ratio of ratios: P π(0 → 1) · · · π(N − 1 → N ) f (P ) = PP Q π(N → N − 1) · · · π(1 → 0) f (1) r(P ) = f (P )

(22)

Fig. 4 shows a histogram of r(P ) for P = 100 and N = 50. The distribution has a single large peak at r=1, with weak tails on either side. Thus, the SPA and PPA methods lead to similar acceptance probabilities in most cases. However, with P =1, there are rejected moves that would have been accepted with P =100 (see the gray band on the left edge of the righthand panel in Fig. 4).

4.2

Backbone free energies

We consider four test systems (Fig. 2), constructed from two small proteins: an SH2 domain (PDB code 1A81) and an SH3 domain (PDB code 1CKA). For each of them, there are six backbone geometries that differ in the position of either a surface loop or a peptide ligand (peSH3). The SH2 surface loop has either the wildtype sequence ARDNNSY (wtSH2) or a simplified sequence AASTSAA (siSH2), with fewer rotameric states. We consider both the Single and Permuted Path approximations (SPA, PPA), with relaxation segments of length N between 10, 20, 50, or 100, and permuted path numbers P = 100, 500, or 1000. The differences between simulation protocols will affect the nature of the statistical sampling, which may or may not approach our goal of Boltzmann sampling. To test whether Boltzmann sampling is achieved, we compute several estimators of the backbone conformational free 20 ACS Paragon Plus Environment

Page 21 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

energies. While they are all expected to be accurate under Boltzmann sampling, they differ substantially in the routes they follow between backbone states and the manner in which they sum up the contributions from the individual conformations or microstates of the system. If the microstates are sampled according to a Boltzmann distribution, the different free energy estimators will behave as state functions, with different computational routes giving essentially the same free energy values. Specifically, we use the naive free energy method (Eq. 16) with several protocols and, for some of the systems, the titration and/or metadynamics methods. For the SH3 domain (loSH3 and peSH3 tests), when using the naive free energy method, Replica Exchange MC with eight replicas (see Methods) was needed to obtain adequate sampling. With a single replica, transiitions between backbone were too rare to provide accurate averages. For wtSH2, siSH2, and peSH3, simulation times (per 10 million MC steps) vary in a roughly linear way when N increases from 10 to 100, from 0.3 to 2 hours when P =1 and from 0.5 to about 24 hours when P = 100 or 500. For loSH3, with its large flexible loop, simulation times are longer: 10 hours with P =1 and N =50. Our REMC implementation (loSH3 and peSH3) uses an OpenMP parallelization on multi-core machines, so that the effective speed is essentially the same as for plain MC (but eight cores are used instead of one). Results are shown in Fig. 5. Notice that for clarity, most of the free energy curves use the same backbone conformation as a reference; however, each curve is in fact only defined up to an additive constant. Selected free energy values are also reported in Table 3; these values are adjusted by subtracting out the average of the six backbone free energies; this adjustment minimizes the rms deviations between pairs of protocols. For loSH3, only the SPA was used; agreement between methods is good, with the N =10 protocol leading to somewhat larger values. For the other systems, the SPA and PPA methods were compared, and gave somewhat different results. For siSH2, the SPA/PPA rms differences are about 0.08 kcal/mol (depending on N ), compared to a mean |δGi | value of 0.38 kcal/mol (Table 3). For wtSH2, with N = 10 or 20, the rms difference between

21 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

SPA and PPA is about 0.10 kcal/mol, a 20% relative difference. For N = 50 or 100, the rms differences are a bit larger, about 0.20 kcal/mol. For peSH3, with N =20, we have an SPA/PPA rms difference of 0.21 kcal/mol, compared to a mean |δGi | value of 0.80 kcal/mol, a 28% relative difference. With N =50, we obtain the largest SPA/PPA rms difference by far, 0.48 kcal/mol (Table 3). The PPA results themselves are in good mutual agreement, despite using a range of N and P values and (for wtSH2) three different free energy methods. The largest PPA/PPA rms difference is 0.22 kcal/mol for peSH3 (N =20 vs. N =50). The protocols for wtSH2 are represented in the lower panel of Fig. 5 as pseudo-particles, separated by a spatial distance proportional to their rms free energy differences. The PPA protocols form a compact group, while the SPA protocols are more spread out. The largest deviation in the figure, between the upper left protocol (naive method, N =100, SPA), and the lowermost protocol (naive method, N =10, P =1000, marked with an asterisk) is 0.28 kcal/mol. The SPA protocols with the naive or titration/metadynamics protocols segregate towards the upper left and right, respectively; a lesser segregation is seen for the PPA protocols. Illustrative titration curves are shown in Fig. 6 for wtSH2 (N =20, P =100) and peSH3 (N =50, P =1). In Table 4, we report the full set of Hill’s coefficients α (Eq. 17) obtained with the different protocols, since this quantity has a specific value in the case where Boltzmann sampling is obeyed: α = 1/(kT ln 10) = 0.734 (kcal/mol)−1 at the simulation temperature. Good agreement with the theoretical value is obtained, especially with P = 100. The good agreement between our three free energy methods might seem a trivial result, telling us our simulations are long enough and not much else. In fact, it indicates that our sums over microstates are state functions, as expected for free energies. This is a highly nontrivial result and a stringent test of the hybrid MC method, since in statistical mechanics, arbitrary functions of the microstate energies are very unlikely to be state functions.

22 ACS Paragon Plus Environment

Page 22 of 47

Page 23 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

4.3

Simulations with backbone and sequence changes

We now consider a more complex situation, where both the protein backbone and its sequence can vary. Specifically, for siSH2, we allow one amino acid in the flexible backbone region to adopt either the Thr or Phe types. We use two different approaches to compute the relative free energies of the various backbone geometries with either type. The simplest approach uses a single simulation, where backbone and sequence vary freely; the relative free energies are deduced directly from the populations (Eq. 16). The second uses two separate simulations to compare, respectively, a pair of backbone geometries and a pair of sidechain types (“dual simulation” approach). The pairwise free energy comparisons are represented in Fig. 7 in the form of a thermodynamic cycle. 26 There are six backbone geometries, denoted A–F; free energy comparisons are shown for three illustrative pairs (BD, AE, DF). We compare six MC protocols that use P = 1 (SPA) or 100 (PPA), with N = 10, 20, or 50. In Fig. 7, the upper panel shows the thermodynamic cycle, with free energies for the BD backbone pair using N =50, P =1 or 100 paths, and “dual” or “single” simulation approaches. In the single simulation, either all six backbone geometries are allowed (“single6” protocol) or just the B and D geometries are allowed (single2 protocol). These four approaches give results for each leg of the cycle within ±0.05 kcal/mol of each other. Going from the upper left to the lower right state, and following either the gray or the black routes around the cycle, gives two different free energy estimates, shown in the middle of the cycle. For all four simulation approaches, the two routes give results within ±0.05 kcal/mol of each other (compare the upper and lower lines); these differences define a cycle closure error. The agreement between simulation approaches suggests the hybrid MC method is at least moderately robust with respect to its details. The small closure errors suggests that the free energy function behaves like a state function, the hallmark of a true free energy. This in turn suggests that microstates are sampled according to a Boltzmann distribution. The lower panel of Fig. 7 provides a more detailed picture, with results for the BD, AE, and DF backbone pairs, with the same four simulation approaches and N = 10, 20, or 50. 23 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The corresponding cycle closure errors are also shown (gray dots). The free energies δGij between backbones are estimated to be about 0.2–0.3 kcal/mol for AE and DF, and about 0.5–0.8 kcal/mol for BD. The differences between protocols are moderate. Results with N = 20 and 50 and all four simulation approaches are almost all within 0.2 kcal/mol of each other, and usually 0.1 kcal/mol. The closure errors are at most 0.1 kcal/mol, and typically only 10 or 20% of the corresponding δGij .

4.4

Simulating an enzyme active site

An important application of multistate protein design is to identify protein variants that provide specificity for one particular ligand or conformation. As an illustration, we apply our hybrid MC scheme to the tyrosyl-tRNA synthetase enzyme (TyrRS). Its active site includes a flexible loop, known as the “KMSKS loop”, since it contains a signature motif whose sequence is KMSKS in some organisms. 39 This loop has an open conformation when there is no bound substrate and a closed conformation when the tyrosine and ATP substrates are bound (Fig. 8). Here, by allowing the sequence and structure to fluctuate freely, we will identify protein variants that specifically favor one conformation over the other. We use a library of 10 loop conformations, 5 of which are open and 5 closed. Rms differences between the conformations are given in Table 2. In the closed conformations, the amino acid substrate Tyr is bound; in the open conformations, there is no ligand (Tyr has moved into bulk solution). The flexible region includes 31 amino acids, with the signature motif near the center. In the Escherichia coli enzyme studied here, the signature sequence is 235 KFGKT239 . We allow the four residues other than Gly to mutate freely, subject to a restraining umbrella potential that biases their sequence to stay close to the experimental one (see Methods). During the rotamer relaxation segments, mutations of the active positions are also allowed. The results are summarized in Fig. 8. The few most populated sequences sampled during the simulation are shown, with the corresponding backbone state indicated. Overall, the open/closed conformations are sampled 14%/86% of the time, respectively. These popula24 ACS Paragon Plus Environment

Page 24 of 47

Page 25 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

tions correspond to a small open/closed free energy difference, about 1 kcal/mol. This is consistent with the experimental observation that both conformations are observed in room temperature crystal structures 39 (although the precise experimental free energies are not known). Sequences sampled are also represented as two logos, and compared to experimental sequences from the Pfam database. 53 Only the positions of the signature sequence motif are shown. The upper (respectively, lower) logo recapitulates sequences sampled when the loop is open (closed). For the four variable positions, the simulated sequences have a mean identity with the Pfam sequences of, respectively, 21% (open) and 43% (closed). We recall that the simulations included a weak bias towards the experimental Pfam alignment; if the bias is halved, the mean identities drop to 7% (open state) and 22% (closed state). Here, however, our focus is the dependency of the sequence on the conformational state, which is not affected by the bias (which does not depend on the conformation). The second Lys in the signature motif, Lys238, is known to interact with the phosphate group of the tyrosyladenylate reaction intermediate 39 and thought to stabilize the transition state of the enzyme reaction. Here, even though the ligand is Tyr and not tyrosyl-adenylate, this Lys is mostly preserved in the closed state sequences, or mutated to the homologous Arg, also positively charged. In the open state, it is changed to Asp, with a very small Arg population; Asp is also present in the closed state but with a low population. The native Phe236 is mostly preserved in the closed state, and replaced by Met in the open state; both types are common in the experimental sequences. Overall, and partly thanks to the biasing potential, the open and closed sequences are moderately homologous to the experimental sequences and to each other. Several sidechain types distinctly favor one conformation over the other: Arg/Lys at position 235 and Asp at position 238 favor the open state; Ala/Ser at position 235, Phe at position 236, and Arg/Lys at position 238 all favor the closed state. Thus, the simulations reveal several possible mutations that could specifically stabilize one conformation or the other, a typical goal in multistate protein design. Further testing could be done, in principle, using molecular dynamics simulations with explicit solvent.

25 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

5

Concluding discussion

The simplest method for multistate protein design is to run separate calculations for each backbone model. The results can then be combined in various ways; for example, sequences sampled with a given backbone model could be weighted according to the energy of that particular model. If the backbone models have the same energy, for example, all the sequences would be weighted equally. However, with separate simulations for each backbone, the relative entropies of the different backbone models are not available, and so one cannot assign rigorous statistical weights to the backbone models or sequences. Methods like Dead End Elimination that identify the Global Minimum Energy Conformation (GMEC) have been extended to multistate design, 20–23 but only reveal energy differences, not free energy differences and statistical weights. In contrast, the method proposed here allows us to obtain the free energy differences between backbone models, either for a single sequence or a collection of sequences. This in turn allows us to predict sequence populations in a precise and formally rigorous way. In our test systems, while the backbone energies were all equal by construction (see Methods), the backbone free energies varied over a range of 1–2 kcal/mol, corresponding to a range of relative populations of about 1–50. While free energies are not needed for all design problems, our method should be complementary to earlier multistate approaches. To sample all the backbone models effectively, we used a hybrid MC scheme, proposed earlier, where 10–100 MC steps were done to relax rotamers in and near the mobile backbone region before accepting or rejecting a backbone hop. 24 Indeed, with a naive method that applies the Metropolis test immediately after a trial backbone hop, the move acceptance rate was essentially zero, because of steric overlaps between sidechains on the new backbone. Other MC schemes are of course possible that might be effective; for example, one could randomize all the rotamers in the mobile region at the same time one makes a backbone hop. Presumably, this would occasionally lead to an accepted move, whenever the new, random rotamers happened to “fit” in the new backbone arrangement. However, we expect 26 ACS Paragon Plus Environment

Page 26 of 47

Page 27 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

the more sophisticated, hybrid scheme to be much more efficient. Indeed, we were able to extensively sample a diverse collection of backbone models within a few hours of simulation time for all the test systems, occasionally using Replica Exchange Monte Carlo when the simple MC sampling was not sufficient. This is consistent with the large relaxation energies observed during the rotamer relaxation segments: eg, 11 kcal/mol on average for wtSH2 using N =50, about 20 times the thermal energy kT . The proposed hybrid MC method can be used as a heuristic, since it allowed a range of backbone conformations to be visited effectively in the tests above. For example, a metadynamics-like importance sampling led to equal sampling of all the backbone models in one test. Indeed, a related heuristic method was proposed recently by Kortemme and coworkers. 19 With that method, a small backbone displacement (of the backrub type) is followed by a heuristic optimization of sidechain rotamers, before applying a Metropolis test to accept or reject the whole sequence. The method was applied successfully to ligand binding site design, but the nature of the statistical ensemble was not tested. Here, we have pursued a more ambitious goal, and examined whether the conformations and sequences visited with our hybrid MC were populated according to a Boltzmann distribution. The main assumption that is being tested, in fact, is the assumption of detailed balance P

between bundles of simplified forward pathways, of the form b, 0 → b0 , 0 → b0 , N and the P

corresponding bundles of inverted paths, of the form b0 , N → b, N → b, 0. As a practical test, our main criterion is that the computed free energies should be independent of the pathway taken between the states being compared. Indeed, while the true free energy is a state function, it seems very unlikely that arbitrary sums over microstates would have this property. A secondary criterion is their robustness with respect to protocol details (N , P values). With the titration protocols, we can also compare the computed Hill’s coefficient to the theoretical value, a weaker but still useful test. As summarized above, these criteria are rather well obeyed, despite noise and moderate differences between protocols. Overall, it appears that the present hybrid MC method gives a reasonable approximation of Boltzmann

27 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

sampling and free energy differences. In addition to providing effective sampling of backbones and a good approximation to various free energy differences, we expect the hybrid MC idea to have other uses, such as the design of ligand specificity. Consider a protein with two competing ligands, like tyrosine in the TyrRS example above, competing with a Tyr analog such as phenylalanine. Suppose each one has its own set of ligand backbone conformations that put it either inside the binding pocket or in bulk solution (i.e., the shift into bulk solution is viewed as a backbone hop). We could explicitly penalize sequences that favor bound Tyr over Phe, by applying a negative weight to selected terms in the energy function; namely the Tyr:protein interaction energy terms. With plain MC, this negative weighting would lead the system towards sequences and rotamers where Tyr sits in the binding pocket but where there is steric overlap between ligand and protein (the negative weight favors large positive energies). With hybrid MC, when Tyr hops into the pocket, surrounding rotamers could be relaxed with a positive interaction energy, so that steric overlaps are removed before the sequence and conformation are evaluated. This would only require a slight modification of the above method. The method can also be seen as a special case of a general hybrid MC framework, 54 which could give ideas for still other applications. To generate a good library of backbone conformations, we applied a simple, plausible, MD method. The resulting rms deviations between models, about 1–3 ˚ A, are comparable to the typical rms fluctuations of flexible loops in proteins. Other methods to generate backbone models could also be used, such as the kinematic closure method tested recently. 10 Here, because of the way the models were generated, we made the hypothesis that they should have similar energies. If their relative free energies are known or specifically computed with some other method (such as molecular dynamics free energy simulations), the corresponding free energies can easily be taken into account in a given application. Similarly, if we know or expect the backbone conformations to have very different energies and/or free energies, we can still perform MC sampling with a modified energy function, as above, in order to

28 ACS Paragon Plus Environment

Page 28 of 47

Page 29 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

explore all of them efficiently. This is simply another variety of importance sampling.

Acknowledgements Extensive discussions with Georgios Archontis (U. Cyprus), Gersende Fort and Carl Graham (Centre de Math´ematiques Appliqu´ees, Ecole Polytechnique) are gratefully acknowledged.

References (1) Dantas, G.; Kuhlman, B.; Callender, D.; Wong, M.; Baker, D. A Large Test of Computational Protein Design: Folding and Stability of Nine Completely Redesigned Globular Proteins. J. Mol. Biol. 2003, 332, 449–460. (2) Butterfoss, G. L.; Kuhlman, B. Computer-based design of novel protein structures. Ann. Rev. Biophys. Biomolec. Struct. 2006, 35, 49–65. (3) Lippow, S. M.; Tidor, B. Progress in computational Protein Design. Curr. Opin. Biotech. 2007, 18, 305–311. (4) Saven, J. G. Computational protein design: engineering molecular diversity, nonnatural enzymes, nonbiological cofactor complexes, and membrane proteins. Curr. Opin. Chem. Biol. 2011, 15, 452–457. (5) Feldmeier, K.; Hoecker, B. Computational protein design of ligand binding and catalysis. Curr. Opin. Chem. Biol. 2013, 17, 929–933. (6) Tinberg, C. E.; Khare, S. D.; Dou, J.; Doyle, L.; Nelson, J. W.; Schena, A.; Jankowski, W.; Kalodimos, C. G.; Johnsson, K.; Stoddard, B. L.; Baker, D. Computational design of ligandbinding proteins with high affinity and selectivity. Nature 2013, 501, 212–218. (7) Nisonoff, P. G. H. M.; Donald, B. R. Algorithms for Protein Design. Curr. Opin. Struct. Biol. 2016, 39, 16–26.

29 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(8) Fung, H. K.; Floudas, C. A.; Taylor, M. S.; Zhang, L.; Morikis, D. Towards full-sequence de novo protein design with flexible templates for human beta-defensin-2. Biophys. J. 2008, 94, 584–599. (9) Havranek, J. J. Specificity in Computational Protein Design. J. Biol. Chem. 2010, 285, 31095– 31099. (10) Babor, M.; Mandell, D. J.; Kortemme, T. Assessment of flexible backbone protein design methods for sequence library prediction in the therapeutic antibody Herceptin-HER2 interface. Prot. Sci. 2011, 20, 1082–1089. (11) Leaver-Kay, A.; Jacak, R.; Stranges, B.; Kuhlman, B. A generic program for multistate protein design. PLoS One 2011, 6, e20937. (12) Davey, J. A.; Chica, R. A. Multistate Approaches in Computational Protein Design. Prot. Sci. 2012, 21, 1241–1252. (13) Sharabi, O.; Erijman, A.; Shifman, J. M. Methods in Enzymology; Elsevier, Amsterdam, 2013; Vol. 523; pp 41–59. (14) Saunders, C. T.; Baker, D. Recapitulation of protein family divergence using flexible backbone protein design. J. Mol. Biol. 2005, 346, 631–644. (15) Smith, C. A.; Kortemme, T. Backrub-like backbone simulation recapitulates natural protein conformational variability and improves mutant sidechain prediction. J. Mol. Biol. 2008, 380, 742–756. (16) Mandell, D. J.; Kortemme, T. Backbone flexibility in computational protein design. Curr. Opin. Biotech. 2009, 20, 420–428. (17) Huang, P. S.; Ban, Y. E.; Richter, F.; Andre, I.; Vernon, R.; Schief, W. R.; Baker, D. RosettaRemodel: a generalized framework for flexible backbone protein design. PLoS One 2011, 6, e24109.

30 ACS Paragon Plus Environment

Page 30 of 47

Page 31 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(18) Ollikainen, N.; Smith, C. A.; Fraser, J. S.; Kortemme, T. Flexible backbone sampling methods to model and design protein alternative conformations. Methods Enzym. 2013, 523, 61–85. (19) Ollikainen, N.; de Jong, R. M.; Kortemme, T. Coupling Protein Side-Chain and Backbone Flexibility Improves the Re-design of Protein-Ligand Specificity. PLoS Comp. Bio. 2015, 1, e1004335. (20) Yanover, C.; Fromer, M.; Shifman, J. Dead-End Elimination for Multistate Protein Design. J. Comput. Chem. 2007, 28, 2122–2129. (21) Georgiev, I.; Donald, B. R. Dead-End Elimination with Backbone Flexibility. Bioinf. 2007, 23, i185–194. (22) Hallen, M. A.; Reedy, D. A.; Donald, B. R. Dead-end elimination with perturbations (DEEPer): a provable protein design algorithm with continuous sidechain and backbone flexibility. Proteins 2012, 81, 18–39. (23) Hallen, M. A.; Donald, B. R. COMETS (Constrained Optimization of Multistate Energies by Tree Search): A Provable and Efficient Protein Design Algorithm to Optimize Binding Affinity and Specificity with Respect to Sequence. J. Comp. Biol. 2016, 23, 311–321. (24) Nilmeier, J.; Jacobson, M. P. Monte Carlo sampling with hierarchical moves sets: POSH Monte Carlo. J. Chem. Theory Comput. 2009, 5, 1968–1984. (25) Mignon, D.; Simonson, T. Comparing three stochastic search algorithms for computational protein design: Monte Carlo, Replica Exchange Monte Carlo, and a multistart, steepestdescent heuristic. J. Comput. Chem. 2016, 37, 1781–1793. (26) Chipot, C.; Pohorille, A. Free energy calculations: theory and applications in chemistry and biology; Springer Verlag, N.Y., 2007. (27) Lelievre, T.; andMathias Rousset, G. S. Free Energy Computations: A Mathematical Perspective; World Scientific, Singapore, 2010.

31 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(28) Laio, A.; Gervasio, F. Metadynamics: a method to simulate rare events and reconstruct the free energy in biophysics, chemistry and material science. Rep. Prog. Phys. 2008, 71, 126601. (29) Chen, B.; Siepmann, J. I. Monte Carlo algorithms for simulating systems with adiabatic separation of electronic and nuclear degrees of freedom. Theoret. Chem. Acc. 1999, 103, 87– 104. (30) Chen, B.; Potoff, J. J.; Siepmann, J. I. Adiabatic nuclear and electronic sampling Monte Carlo simulations in the Gibbs ensemble: application to polarizable force fields for water. J. Phys. Chem. A 2000, 104, 2378–2390. (31) Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092. (32) Frenkel, D.; Smit, B. Understanding molecular simulation, Chapter 3 ; Academic Press, New York, 1996. (33) Grimmett, G. R.; Stirzaker, D. R. Probability and random processes; Oxford University Press, 2001. (34) Norris, J. R. Markov chains; Cambridge University Press, 1988. (35) Itzykson, C.; Drouffe, J. M. Th´eorie statistique du champ; CNRS Editions, 1989. (36) Landau, L.; Lifschitz, E. Statistical Mechanics; Pergamon Press, New York, 1980. (37) Polydorides, S.; Simonson, T. Monte Carlo simulations of proteins at constant pH with generalized Born solvent, flexible sidechains, and an effective dielectric boundary. J. Comput. Chem. 2013, 34, 2742–2756. (38) Lopes, A.; Aleksandrov, A.; Bathelt, C.; Archontis, G.; Simonson, T. Computational sidechain placement and protein mutagenesis with implicit solvent models. Proteins 2007, 67, 853–867. (39) Kobayashi, T.; Takimura, T.; Sekine, R.; Vincent, K.; Kamata, K.; Sakamoto, K.; Nishimura, S.; Yokoyama, S. Structural Snapshots of the KMSKS Loop Rearrangement for

32 ACS Paragon Plus Environment

Page 32 of 47

Page 33 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Amino Acid Activation by Bacterial Tyrosyl-tRNA Synthetase. J. Mol. Biol. 2005, 346, 105– 117. (40) Simonson, T.; Ye-Lehmann, S.; Palmai, Z.; Amara, N.; Bigan, E.; Wydau, S.; Druart, K.; Moch, C.; Plateau, P. The stereospecificity of tyrosyl-tRNA synthetase is robust towards active site mutations. Proteins 2016, 84, 240–253. (41) Druart, K.; Palmai, Z.; Omarjee, E.; Simonson, T. Protein:ligand binding free energies: a stringent test for computational protein design. J. Comput. Chem. 2016, 37, 404–415. (42) Dahiyat, B. I.; Mayo, S. L. De novo protein design: fully automated sequence selection. Science 1997, 278, 82–87. (43) Schmidt am Busch, M.; Lopes, A.; Mignon, D.; Simonson, T. Computational protein design: software implementation, parameter optimization, and performance of a simple model. J. Comput. Chem. 2008, 29, 1092–1102. (44) Simonson, T. Protein:ligand recognition:

simple models for electrostatic effects. Curr.

Pharma. Design 2013, 19, 4241–4256. (45) Tuffery, P.; Etchebest, C.; Hazout, S.; Lavery, R. A New Approach to the Rapid Determination of Protein Side Chain Conformations. J. Biomol. Struct. Dyn. 1991, 8, 1267. (46) Gaillard, T.; Simonson, T. Pairwise Decomposition of an MMGBSA Energy Function for Computational Protein Design. J. Comput. Chem. 2014, 35, 1371–1387. (47) Br¨ unger, A. T. X-plor version 3.1, A System for X-ray crystallography and NMR; Yale University Press, New Haven, 1992. (48) Cornell, W.; Cieplak, P.; Bayly, C.; Gould, I.; Merz, K.; Ferguson, D.; Spellmeyer, D.; Fox, T.; Caldwell, J.; Kollman, P. A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. J. Am. Chem. Soc. 1995, 117, 5179–5197.

33 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(49) Aleksandrov, A.; Polydorides, S.; Archontis, G.; Simonson, T. Predicting the Acid/Base Behavior of Proteins: A Constant-pH Monte Carlo Approach with Generalized Born Solvent. J. Phys. Chem. B 2010, 114, 10634–10648. (50) Hawkins, G. D.; Cramer, C.; Truhlar, D. Pairwise descreening of solute charges from a dielectric medium. Chem. Phys. Lett. 1995, 246, 122–129. (51) Lee, B.; Richards, F. The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 1971, 55, 379–400. (52) Street, A. G.; Mayo, S. Pairwise calculation of protein solvent-accessible surface areas. Folding and Design 1998, 3, 253–258. (53) Finn, R. D.; Tate, J.; Mistry, J.; Coggill, P. C.; Sammut, S. J.; Hotz, H. R.; Ceric, G.; Forslund, K.; Eddy, S. R.; Sonnhammer, E. L.; Bateman, A. The Pfam protein families database. Nucl. Acids Res. 2008, 36, D281–D288. (54) Nilmeier, J.; Crooks, G. E.; Minh, D. D. L.; Chodera, J. D. Nonequilibrium candidate Monte Carlo is an efficient tool for equilibrium simulation. Proc. Natl. Acad. Sci. USA 2011, 108, E1009–1018. (55) DeLano, W. L. The PyMOL molecular graphics system. DeLano Scientific, San Carlos, CA, USA, 2002.

34 ACS Paragon Plus Environment

Page 34 of 47

Page 35 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Table 1: A step-by-step flowchart describing a hybrid Monte Carlo move 1) Select a trial backbone hop b → b0 2) Run a short MC segment to relax sidechain rotamers and possibly types (N steps); collect energy statistics for accepted moves. The state at the end of this “generating path” is b0 , N . 3) Create additional, “permuted” paths by scrambling the order of the rotamer changes seen in the first, generating path. Collect energy statistics for these paths (no new calculations are needed; energy changes are read out of the energy matrix). 4) Based on the energy statistics collected, construct the approximate acceptance probability for the overall hybrid move (Eqs. 10, 11, 13). 5) Apply the acceptance probability within a Metropolis test to accept or reject the overall hybrid move.

Table 2: Rms deviations between backbone models (˚ A) peSH3 ↓ A B C D E F

A 1.4 1.3 1.1 1.6 1.7

SH2 C D 0.8 1.4 1.4 2.0 0.7 1.1 0.9 0.9 1.5 1.5 1.0 1.5 1.7 1.6 closed C D E 3.5 4.0 4.1 2.6 3.0 3.1 1.3 1.9 1.6 B 1.1

E 1.5 2.3 1.3 0.8

F 1.5 1.8 1.4 1.9 2.0

B 1.3

C 2.1 1.3

loSH3 D 2.9 2.2 1.3

E 2.5 1.9 1.3 1.2

F 2.7 1.9 1.4 1.6 1.3

1.4

open TyrRS B F G H I J A 2.0 4.6 5.4 7.1 8.1 7.7 B 5.1 5.8 7.4 8.3 7.9 C 5.4 6.0 7.9 9.1 8.4 D 5.5 6.0 7.9 9.0 8.3 E 6.2 6.8 8.5 9.3 8.6 F 1.9 4.5 6.3 5.6 G 3.3 5.7 5.1 H 3.7 3.5 I 1.9 Rms deviations for backbone atoms. The upper-left matrix contains the results for both peSH3 (lower triangle) and SH2 (upper triangle). The bottom part of the Table corresponds to TyrRS.

35 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 36 of 47

Table 3: Relative backbone free energies with selected protocols system wtSH2 wtSH2 wtSH2 wtSH2 wtSH2 wtSH2 wtSH2 wtSH2 siSH2 siSH2 siSH2 siSH2 siSH2 siSH2 peSH3 peSH3 peSH3 peSH3 peSH3

Method Naive Naive Titration Titration Naive Naive Titration MetaDyn Naive Naive Titration Titration Titration Titration Naive(mw) Naive(mw) MetaDyn Titration Titration

P 1 1 1 1 100 100 100 1 1 1 1 1 100 100 1 100 1 1 1

N 50 100 50 100 50 100 50 50 20 50 20 50 20 50 50 50 50 50 100

-0.36 -0.29 -0.37 -0.28 -0.54 -0.49 -0.51 -0.37 -0.33 -0.36 -0.33 -0.36 -0.32 -0.31 -0.84 -0.15 -0.82 -0.82 -0.71

free energies -0.39 0.08 -0.37 0.10 -0.35 -0.09 -0.39 -0.15 -0.43 0.26 -0.42 0.28 -0.43 0.07 -0.45 0.05 -0.25 -0.10 -0.17 -0.01 -0.24 -0.10 -0.20 -0.03 -0.34 -0.18 -0.29 -0.13 -0.65 -0.37 -0.54 -0.47 -0.53 -0.25 -0.55 -0.42 -0.56 -0.39

(kcal/mol) -0.04 0.43 -0.12 0.49 0.19 0.32 0.15 0.51 0.22 0.22 0.19 0.21 0.35 0.14 0.44 0.14 -0.05 0.28 -0.07 0.19 -0.05 0.30 -0.08 0.27 -0.05 0.39 -0.08 0.35 0.45 -0.01 0.27 0.31 0.61 -0.03 0.44 0.23 0.47 0.06

0.30 0.17 0.32 0.17 0.28 0.26 0.36 0.21 0.44 0.41 0.44 0.40 0.50 0.44 1.42 0.57 1.01 1.10 1.12

Table 4: Hill coefficients from backbone titrations system wtSH2 wtSH2 wtSH2 wtSH2 wtSH2 wtSH2 siSH2 siSH2 siSH2 siSH2 siSH2 siSH2 peSH3 peSH3

A 0.67 0.64 0.61 0.70 0.66 0.56 0.72 0.73 0.74 0.72 0.66 0.63 0.65 0.60

B 0.67 0.69 0.66 0.72 0.67 0.60 0.73 0.73 0.72 0.73 0.65 0.63 0.72 0.65

C 0.71 0.69 0.69 0.73 0.66 0.56 0.72 0.74 0.74 0.74 0.67 0.62 0.76 0.79

D 0.70 0.67 0.64 0.72 0.69 0.61 0.71 0.70 0.70 0.73 0.65 0.61 0.65 0.76

E 0.74 0.67 0.67 0.71 0.66 0.59 0.70 0.66 0.66 0.69 0.62 0.59 0.79 0.73

F 0.69 0.70 0.69 0.69 0.66 0.59 0.71 0.71 0.70 0.70 0.64 0.61 0.81 0.68

mean 0.70 0.68 0.66 0.71 0.67 0.58 0.71 0.71 0.71 0.72 0.65 0.61 0.73 0.70

error 0.03 0.05 0.07 0.02 0.06 0.15 0.02 0.02 0.02 0.01 0.08 0.12 0.00 0.03

P 1 1 1 100 100 100 1 1 1 100 100 100 1 1

N 20 50 100 20 50 100 20 50 100 20 50 100 50 100

Hill coefficients α for the titration of each backbone model A–F, for each system and protocol. For each system, the value closest to the theoretical one, 0.734 (kcal/mol)−1 , is in bold face. The error is the mean difference from the theoretical value.

36 ACS Paragon Plus Environment

Page 37 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure captions 1. Hybrid scheme for backbone moves. The initial backbone is schematized in blue, with a few sidechains shown (thin lines); after the initial hop to a new trial conformation (green), the sidechain rotamers are relaxed through N MC steps, schematized as short arrows; step 2 is a rejected step; for the others, the arrow colors refer to different sidechains that undergo rotamer changes at each step. Once this primitive generating path is done, rejected steps are removed and permuted paths are considered, where the same rotamer changes are made in different orders, as shown. See also the nonmathematical flowchart in Table 1. 2. Test systems. Each protein is shown in a ribbon view; the flexible backbone region is colored black; the library of backbone conformations is shown in each case. Note that for peSH3, the flexible region corresponds to a decapeptide ligand. For wtSH2 and siSH2, the protein structure and backbone conformations are the same. 3. Upper panel: Histograms of relaxation energies and number of rotamer changes for wtSH2, simulated with P =100 paths of length N =20 or 50. Bottom panels: Relaxation energies along the direct and inverted rotameric MC relaxation paths for wtSH2, simulated with P =100 paths of length N =20 (middle panels) or 50 (bottom). Results correspond in each case to energy variation along the generating paths for 1500 accepted backbone moves. The average variation is shown (black line) along with illustrative individual moves (gray lines). 4. Histogram of acceptance ratios r(P ) vs. relaxation energy (kcal/mol) for wtSH2 with P =100 and N =50. Results correspond to 1500 accepted backbone moves. The righthand plot excludes most of the large r=1 peak, so that the weak population of the other r values can be seen. 5. Upper panel: Backbone free energies for the four test systems using different methods 37 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(with an arbitrary offset for each system, for clarity). Free energies from the naive method are shown as solid lines; titration results are dashed; metadynamics results use fine dashes. Gray results use the Single Path Approximation (P =1), black results use the Permuted Paths Approximation, with P = 100, 500, or 1000. Different lengths N are used for the rotamer relaxation segment; only a few of the N values are indicated. Lower panel: wtSH2 protocols are represented by pseudo-particles, whose distances are equal to the rms deviation between the corresponding free energy sets. SPA results are blue; PPA are red; N values are indicated for some of the protocols. Results obtained with titration or metadynamics are labelled t or m. Stereo view, created with Pymol. 55 6. Backbone titration curves for wtSH2 and peSH3. For each backbone geometry, the population fraction is shown as a function of the bias contribution δ that is added to that backbone’s energy. 7. Upper panel: Both the backbone geometry and one sidechain’s type are varied, forming a thermodynamic cycle. Free energy changes computed for wtSH2 using four different methods are shown for each leg of the cycle (values separated by slashes, in kcal/mol). The total free energy values are also shown (in the middle) going from the upper left to the lower right corner, along either the gray legs (upper values) or the black legs (lower values). The good agreement between the gray and black routes supports the idea that the simulations sample microstates with Boltzmann statistics so that the resulting free energy is a state function, as it should be. Lower panel: A more complete view of the same simulations: results are shown for three pairs of backbone geometries, using 12 free energy protocols. The upper, black or circled dots are the free energy differences between the upper left and lower right of the cycle (averaged over the two routes); different symbols are used for the three backbone pairs (AE, BD, and DF). The lower, gray dots are the unsigned differences between

38 ACS Paragon Plus Environment

Page 38 of 47

Page 39 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

the two routes around the cycle, connecting the upper left and lower right states. The nature of each free energy protocol is indicated at the top: P =1 (SPA) or 100 (PPA); “single” means that a single simulation is used, where both the backbone geometry and the sidechain type fluctuate; either 2 geometries (single2) or all 6 backbone geometries (single6) are allowed. “Dual” means that separate simulations are used to compare the sidechain types (with either backbone), then to compare the backbones (with either sidechain type). The N values were 10, 20, or 50, as indicated for two of the protocols (“single2, single6, P =1”). 8. (A) The TyrRS system (cross-eyed stereo). Only one monomer of the dimeric TyrRS is shown; the other one is far from the active site region simulated here. The open and closed conformations of the flexible KMSKS loop are shown (in beige and red, respectively), as well as the Tyr ligand (spheres). (B) A few of the individual sequences sampled the most frequently during the 50 million MC steps. Only the five positions of the signature KMSKS motif are shown. (C) The entire set of sequences sampled during the TyrRS simulations are shown be in the form of two logos, and compared to experimental sequences from the Pfam database. 53 Sequences sampled with the loop open/closed make up the corresponding logos. The height of each letter at each logo position represents the frequency of the corresponding amino acid type at that position during the 50 million MC steps, within the corresponding subset of open/closed backbone states. The relative abundancies of the open/closed states are not shown but are indicated in the main text.

39 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 40 of 47

trial move b,0

Backbone  hop

Rotamer relaxation 1     2            N

b',0

b',N

Primitive generating path

Simplified generating path Some permuted paths

Figure 1: Hybrid move scheme.

peSH3

loSH3

SH2

Figure 2: Test systems.

40 ACS Paragon Plus Environment

Generating paths, N=50 Relaxation energy Er

Generating paths, N=20

number of rotamer changes

number of rotamer changes

Direct paths, N=20

Inverted paths, N=20 Relaxation energy Er(n)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

MC step number n

MC step number n

Direct paths, N=50

Inverted paths, N=50 Relaxation energy Er(n)

Page 41 of 47

MC step number n

MC step number n

Figure 3: Relaxation energies for rotameric MC segments.

41 ACS Paragon Plus Environment

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 42 of 47

Relaxation energy Er

Journal of Chemical Theory and Computation

r(P=100)

r(P=100)

Figure 4: Histogram of acceptance probability ratio r(100) vs. relaxation energy (kcal/mol).

42 ACS Paragon Plus Environment

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

free energy (kcal/mol)

Page 43 of 47

siSH2

peSH3

50

100

20 20

10

50

wtSH2 100 10

50

loSH3 P=1 P100

backbone conformation 100t

100t

100

100 50t

50t

50

50 20t

20

20t 50m

50t 20t

10

20

50m 50t 20t

10

100t

100t

10

10 20

20

10* Figure 5: Backbone free energies.

43 ACS Paragon Plus Environment

10*

Journal of Chemical Theory and Computation

fraction

wtSH2 P=100 paths N=20 steps

δ (kcal/mol) peSH3 P=1 path N=50 steps

fraction

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 6: Backbone titration curves.

44 ACS Paragon Plus Environment

Page 44 of 47

Page 45 of 47

Backbone change

Mutation 0.03/-0.01/-0.06/0.03 0.70 0.67 0.71 0.77

0.54 0.49 0.56 0.59

0.57/0.48/0.50/0.62 0.49/0.47/0.49/0.62

-0.21/-0.20/-0.22/-0.15

free energy (kcal/mol)

dual P=1

BD

single2 P=1

10 20

10 50

AE

(all 6 backbones) single6 P=1 P=100

20 50

DF

cycle closure error

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

computational protocol Figure 7: Thermodynamic cycle for backbone and sidechain changes.

45 ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

(A)

(B)

(C)

expt

open

open

235                236               237              238                239

closed

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 46 of 47

235               236                237             238                239

closed Figure 8: Structure, sequences and sequence logos for tyrosyl-tRNA synthetase.

46 ACS Paragon Plus Environment

Page 47 of 47

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

A hybrid Monte Carlo method for multibackbone protein design Karen Druart, Julien Bigot, Edouard Audit & Thomas Simonson Table of contents graphic

b,0

Backbone hop

b',0

Rotamer relaxation 1 2 N

b',N

SH3 SH3

SH2

47 ACS Paragon Plus Environment