Complete coupled binding-folding pathway of the intrinsically

10 Jul 2018 - Complete coupled binding-folding pathway of the intrinsically disordered transcription factor protein Brinker revealed by molecular dyna...
0 downloads 0 Views 2MB Size
Subscriber access provided by TUFTS UNIV

Article

Complete coupled binding-folding pathway of the intrinsically disordered transcription factor protein Brinker revealed by molecular dynamics simulations and Markov state modeling Andrew P. Collins, and Peter Anderson Biochemistry, Just Accepted Manuscript • DOI: 10.1021/acs.biochem.8b00441 • Publication Date (Web): 10 Jul 2018 Downloaded from http://pubs.acs.org on July 10, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

Introduction

Complete coupled bindingfolding pathway of the intrinsically disordered transcription factor protein Brinker revealed by molecular dynamics simulations and Markov state modeling Andrew P. Collins and Peter C. Anderson* Department of Physical Sciences, University of Washington, Bothell, Washington, United States ABSTRACT: Intrinsically disordered proteins (IDPs) are a large class of proteins that lack stable structures in solution, existing instead as dynamic conformational ensembles. To perform their biological functions, many IDPs bind to other proteins or nucleic acids. Although IDPs are unstructured in solution, when they interact with binding partners they fold into defined three-dimensional structures via coupled bindingfolding processes. Since it frequently underlies IDP function, the mechanisms of this coupled binding-folding process are of great interest. However, given the flexibility inherent to IDPs and the sparse populations of intermediate states, it is difficult to reveal binding-folding pathways at atomic resolution using experimental methods. Computer simulations are another tool for studying these pathways at high resolution. Accordingly, we have applied 40 µs of unbiased molecular dynamics simulations and Markov state modeling to map the complete binding-folding pathway of a model IDP, the 59-residue Cterminal portion of the DNA binding domain of Drosophila melanogaster transcription factor Brinker (BrkDBD). Our modeling indicates that BrkDBD binds to its cognate DNA and folds in ~50 µs by an induced fit mechanism, acquiring most of its stable secondary and tertiary structure only after it reaches the final binding site on the DNA. The protein follows numerous pathways en route to its bound and folded conformation, occasionally becoming stuck in kinetic traps. Each binding-folding pathway involves weakly bound, increasingly folded intermediate states located at different sites on the DNA surface. These findings agree with experimental data and provide additional insight into the BrkDBD folding mechanism and kinetics.

Unlike natively folded proteins, intrinsically disordered proteins (IDPs) do not form stable secondary or tertiary structures when they are unbound in solution1-4 (see Ref. 5 for a recent review). Rather, they occur as dynamic ensembles of rapidly interconverting conformations with high structural heterogeneity5. Despite their lack of persistent structure, IDPs are biologically active and abundant, with proteins consisting entirely or largely of regions of intrinsic disorder accounting for ~30% of eukaryotic proteomes6. IDPs have a broad range of cellular functions, frequently acting as hubs in proteinprotein interaction networks and cell signaling pathways7-12. Moreover, they are overrepresented in numerous human disease pathways, such as cancer, diabetes, amyloidoses, and cardiovascular and neurodegenerative diseases13-16. Given that IDPs do not have stable structures when they are free in solution yet are biologically active, they challenge the previously held paradigm that protein structure determines function. To perform their biological functions, many IDPs must interact with binding partners, including other proteins and nucleic acids. In many cases, binding to the partner is coupled to IDP folding, and the binding event is associated with a protein transition from disorder to order that often allows highspecificity, low-affinity interactions15. The structures of several complexes between folded IDPs and their partners have been solved in recent years and are catalogued in the pE-DB database (pE-DB)17. However, considering the large number of IDPs found in eukaryotic genomes, surprisingly few complex structures have been determined experimentally. Since many IDPs interact with binding partners, their coupled binding and folding reactions are central to their function. Nevertheless, many aspects of the mechanisms and dynamics of coupled binding and folding remain poorly understood. Among the most heavily debated aspects is the question of whether binding or folding occurs first18,19. Two extreme, limiting cases of coupled binding and folding are induced fit (also known as folding-after-binding), where the unfolded IDP binds to its partner and then folds, and conformational selection, where the partner binds to one or more conformations from a small population of pre-folded IDP that is in equilibrium with the unfolded state18. It has been shown for some proteins that these two limiting models are overly simplistic and that the proteins exhibit a combination of the two20. Additional aspects of IDP folding that require further investigation include how the kinetics of the coupled binding-folding reaction is controlled, how IDPs interact with their binding partners in the absence of well-structured binding sites, and how they perform specific binding21. Greater insight into these aspects of coupled binding and folding reactions could be gained if complete binding-folding pathways, including metastable states and kinetic rate constants, could be mapped at atomic resolution. The information thus produced would not only advance our understanding of how many IDPs function, but it could also lead to practical applications, including protein engineering22 and drug discovery by yielding novel drug targets14,16,23. Elucidating the full mechanisms of coupled binding and folding of IDPs requires following the binding and folding in time and characterizing the conformational changes that occur along the binding-folding pathway, ideally at high spatial and time resolution. Several experimental studies have been published that provide information about transition state structures and kinetics of binding-folding pathways24-26. Unfortunately,

1 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

due to a combination of the conformational flexibility of IDPs and the complexity of their binding-folding pathways, characterizing these pathways at atomic resolution by experimental methods is highly challenging27,28. Molecular dynamics (MD) simulations offer a complementary approach to probe the complex details of IDP folding pathways at atomic resolution over the femtosecond to millisecond timescales (reviewed in Ref. 29). Several recent MD studies have been published which model the pathways, transition state structures, and kinetics of IDP binding and folding, including, for instance, transcription factor c-myb binding to CREB binding protein30, Drosophila melanogaster Brinker nuclear repressor binding to DNA31, calmodulin binding domain binding to calmodulin32, and p53 binding to MDM233. However, as in the case of experimental methods, the flexibility of IDPs, their often large size, and the large number of degrees of freedom involved in binding pathways make simulating coupled binding and folding likewise challenging. To overcome these challenges and to increase computational efficiency, most previously published simulations have relied on coarse-graining methods such as Gō models, implicit solvent methods, and biasing/targeted MD techniques. While allowing gains in computational efficiency, these approaches, unfortunately, have considerable drawbacks, as they compromise the level of atomic resolution and detail (in the case of coarsegrained models), potentially model solvent effects less accurately (in the case of implicit solvent methods), or involve applying user-defined, often arbitrary collective variables or reaction coordinates (in the case of biasing). Another efficiency-enhancing simulation technique that has been used in prior studies is to model the IDP unbinding and/or unfolding process using either high temperature or biasing methods. However, these techniques make the assumption that the unbinding/unfolding and the binding/folding pathways are the reverse of one another and further introduce the potential distortions caused by high temperature or user-defined collective variables. In principle, a preferable modeling approach, though costly, would be to simulate the complete binding-folding pathways of IDPs using many unbiased MD simulations in explicit solvent at full atomic resolution and realistic temperatures, starting from discrete conformations of unbound, unfolded protein. The resulting ensemble of MD trajectories could subsequently be analyzed using techniques such as Markov state models (MSM)34-36 to identify the most relevant metastable states in the pathway(s) and compute rate constants for transitions between metastable states, which are essential features for characterizing binding-folding mechanisms. To the best of our knowledge, few prior modeling studies of IDP folding have implemented such an approach33,37. Accordingly, in this study we map the full binding-folding pathway of a model IDP by combining 40 µs of atomistic, unbiased MD simulations and Markov state modeling. Our model protein is the 59-residue C-terminal portion of the DNA-binding domain of Drosophila melanogaster nuclear repressor Brinker (BrkDBD). The binding-folding pathway for this protein remains poorly understood, and experimental data have not clarified whether folding occurs via induced fit or conformational selection. A prior NMR study has demonstrated that BrkDBD is unfolded and highly flexible throughout the entire backbone at room temperature in the absence of DNA38. In contrast, when it binds to a cognate DNA at ambient temperatures, BrkDBD undergoes a transition to a completely folded structure consisting of four α-helices that form a helix-

Page 2 of 20

turn-helix DNA recognition motif (Figure 1). It thus undergoes an extreme DNA-induced folding transition from a globally unfolded, flexible apo state38. Although a previous MD study31 has provided valuable insights into BrkDBD folding, this study used relatively short simulations to probe unfolding pathways at high temperature (498 K), did not directly simulate the binding or folding process, and did not report metastable states.

Figure 1. Sequence and DNA-bound structure of BrkDBD protein. (A) Amino acid sequence of BrkDBD. The four α-helices are highlighted in red. (B) NMR structure of BrkDBD bound to a double-stranded 12-nucleotide GC-rich omb-enhancer region DNA of Drosophila melanogaster (PDB ID 2GLO). The DNA 39 sequence corresponds to an in vivo binding site for BrkDBD . Sticks represent protein side chains that form hydrogen bonds with the DNA. The protein-binding region of the DNA is colored cyan. Helical and non-helical protein residues are colored blue and red, respectively.

We present here the set of metastable intermediate states that BrkDBD occupies as it journeys from its unbound, unfolded state to its bound, fully folded state under equilibrium conditions at 300 K, as well as the time scales for the transitions between these metastable states. BrkDBD follows several distinct binding-folding pathways with varying degrees of net flux, but each pathway is consistent with an induced fit mechanism in which the protein remains unfolded until it has formed a complex with the target DNA. Once the protein has loosely associated with the DNA, it explores the surface of the DNA, transiting through a series of weakly bound intermediate states featuring increasing degrees of secondary structure content. The protein acquires the majority of the secondary and tertiary structure of the fully folded conformation only after it reaches the experimentally observed binding site on DNA. Moreover, we show that BrkDBD spends a significant quantity of time stuck in kinetic trap states that contribute little

2 ACS Paragon Plus Environment

Page 3 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

to the overall binding-folding flux. The methods that we use to characterize the binding-folding pathway of BrkDBD can be applied in studies of other IDPs.

Methods General Molecular Dynamics Simulation Parameters. All MD simulations were performed using the Gromacs 4.6 simulation package40,41 in conjunction with the AMBER99SB force field42,43 and TIP3P water solvent44, applying 2 fs time steps, particle mesh Ewald periodic boundary conditions45, and 10 Å non-bonded cutoffs. The AMBER99SB force field was originally developed to address the issue of overstabilization of αhelices that had been a drawback of several previous AMBER force fields46. Benchmarking studies against NMR data have demonstrated that its secondary structure predictions are among the more reliable for available force fields, but that it still has a slight tendency to overpopulate α-helices46,47. Bonds were constrained using the LINCS algorithm48. Temperature coupling was carried out using the velocity-rescaling thermostat of Bussi et al49. Simulations in the NPT ensemble used the Parrinello-Rahman pressure coupling scheme50,51 with a barostat relaxation time of 2.0 ps at a pressure of 1 atm. All simulations included a 10 mM NaCl concentration. Replica-Exchange Molecular Dynamics to Generate Distinct Starting Conformations of Unfolded, Unbound BrkDBD. The only available experimental structure of the 59residue BrkDBD protein38 captures the protein in its completely folded state bound to its target DNA, the omb-enhancer region of Drosophila melanogaster, a double-stranded DNA containing 12 nucleotides per strand. Thus, prior to modeling the binding-folding pathway, we generated a diverse set of starting conformations of the unfolded, unbound protein located at different points on its free energy landscape. We sampled unfolded conformations by applying replica-exchange molecular dynamics52 (REMD) simulations of the unbound protein over a wide range of temperatures, using 32 replica temperatures spaced evenly over a range of 295-410 K. The temperatures were distributed exponentially such that the expected exchange probability was 20%, as calculated using an online temperature generator53 for REMD simulations (http://folding.bmc.uu.se/remd/). The starting structure of unbound BrkDBD was taken as the first conformer of the solution NMR structure38 (PDB ID 2GLO) of folded BrkBDB bound to its DNA target. The DNA was removed, and the resulting unbound protein structure was minimized in vacuo by 1000 steps of steepest-descent minimization. The minimized structure was solvated in a dodecahedral box containing 3800 TIP3P water molecules44 such that all protein atoms were located ≥ 11 Å from the box edge. The overall system charge was neutralized by adding 8 Cl– ions, and the system was energy minimized by 1000 steps of steepest-descent minimization. Solvent molecules and ions were gradually heated from 200 K to 300 K in the NVT ensemble over the course of 5 ns, while position restraints of 10 kJ mol–1 Å–2 were applied to all protein atoms. The solvent was then equilibrated in the NPT ensemble at 300 K for 5 ns; the same position restraints on all protein atoms were maintained. Finally, 32 replicas of the solvent-equilibrated system were heated (or cooled) from 300 K to each of the 32 REMD temperatures (ranging from 295-410 K) in the NPT ensemble over 5 ns without applying position restraints, followed by 5 ns of unre-

strained simulation at the final temperatures. Finally, a 50-ns production REMD simulation was performed in parallel for each replica using randomly assigned initial atomic velocities taken from a Maxwell distribution at the relevant temperature. Replica exchange was attempted once every 500 steps (1 ps). Structures were saved for analysis every 1 ps. The replicas at higher temperatures often yielded completely unfolded protein conformations, and these unfolded conformations occasionally were exchanged with the trajectory at 300 K. To extract a set of dissimilar conformations for the BrkDBD replica at 300 K, the protein conformations from the 300 K replica trajectory were clustered into 10 clusters based on the root-mean-square deviation (RMSD) of the Cα and Cβ atoms relative to the first conformer of the solution NMR structure. Clustering was performed using the k-means algorithm in the R statistical package (http://www.R-project.org/). Each cluster was verified to correspond to an unfolded state by analyzing the average fraction of native intramolecular contacts (defined as intramolecular contacts present in the folded protein in the NMR structure) involving Cα/Cβ atoms observed in all the cluster snapshots, using a contact cutoff distance of 5 Å. Each cluster had an average of < 5% native intramolecular contacts formed. Centroid structures from the 10 clusters were extracted from the 300 K trajectory, minimized in the water solvent by 1000 steps of steepest-descent minimization, and used as starting conformations in the subsequent set of binding-pathway simulations. Generating Starting Conformation of Unbound Target DNA. To generate conformations of the double-stranded DNA corresponding to its unbound state, the DNA dimer was extracted from the first conformer of the solution NMR structure of the BrkDBD-DNA complex. The apo DNA was solvated in a dodecahedral box containing 6500 TIP3P water molecules, with all DNA atoms located ≥ 11 Å from the box edge. The system was neutralized by adding 22 Na+ ions and minimized using 1000 steps of steepest-descent minimization. Equilibration at 300 K was performed as described for apo BrkDBD. A 100-ns production simulation at 300 K was performed, and trajectory snapshots were saved for analysis every 5 ps. The DNA conformations from the last 80 ns of the production simulation were clustered into three clusters based on the RMSD of the C4' atoms relative to the first conformer of the NMR structure. The three cluster centroid structures were minimized in solvent by 1000 steps of steepest-descent minimization and served as starting DNA conformations in the set of bindingpathway simulations. Simulations of Coupled Binding-Folding Pathways. We performed 67 successive batches of production MD simulations of the BrkDBD–DNA system in the NPT ensemble at 300 K. Each batch included 30 individual 20-ns simulations, yielding 600 ns per batch and 40 µs of aggregate simulation time. Trajectory snapshots were saved for analysis every 100 ps. Starting conformations for the first batch of simulations were generated by combining each of the 10 unfolded centroid structures from the REMD simulations of apo BrkDBD with each of the 3 cluster centroids from the unbound DNA simulation, yielding 30 combinations of BrkDBD and DNA starting conformations. For each combination, the BrkDBD was randomly placed ~40 Å from the DNA, and the system was solvated in an orthorhombic box containing 13,000-15,000 TIP3P

3 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

water molecules such that all protein and DNA atoms were located ≥ 14 Å from the box edge. The resulting box dimensions were 98 Å x 74 Å x 62 Å, which corresponds to a protein concentration of ~4 mM. The system charge was neutralized by adding 14 Na+ counterions. Each system was minimized and equilibrated at 300 K using the same protocol as described previously for the REMD simulations. Starting conformations for all subsequent MD simulation batches were selected using the fluctuation amplification of specific traits (FAST) goal-oriented sampling method54. The FAST method performs enhanced sampling of conformational space by generating successive batches of simulations such that the starting points for each batch are chosen from the set of all previously sampled conformations based on a reward function. This reward function estimates the probability that simulations that are started from different conformations will discover new conformations that minimize or maximize a certain metric of interest, e.g. the RMSD relative to a reference structure. In this study, the molecular property, φ, that we sought to minimize is the RMSD of BrkDBD Cα and Cβ atoms relative to the completely bound and folded conformation of the protein in the first conformer of the NMR structure after aligning the DNA to its NMR conformation. Using this metric has the advantage of taking into account both the distance between the protein and its binding site on the DNA and the extent to which the protein backbone and side chain conformations differ from those of the folded conformation. When the protein is unbound, the former metric predominates; when the protein has reached its binding site on the DNA, the latter metric predominates. Starting conformations for each batch of simulations (other than the first batch) were chosen using the following algorithm54: (i) Cluster all MD simulation conformations sampled so far into 10 discrete states and select the cluster centroids as representative state structures. Ten was chosen as the number of clusters in order to yield an average of three simulations started from each representative structure in the upcoming batch of 30 simulations (step iii). (ii) For each of the 10 representative state structures, calculate a reward function

r (i) = φ (i)+ αψ (i) φ where i is the state,

φ (i) is

(1)

a directed component favoring

states that optimize the structural property φ of interest (in this case the Cα/Cβ-RMSD relative to the NMR structure), ψ (i) is an undirected component that favors states that have been poorly sampled so far compared to other states, and α is a parameter controlling the relative importance of the directed and undirected terms. For a property that one wishes to minimize (here RMSD) the value of φ (i) is given as

φ (i) =

φmax − φ (i) φmax − φmin

(2)

Page 4 of 20

where φ(i) is the value of φ for state i, and φmax and φmin are the maximum and minimum values of the property of interest, respectively, over all previously sampled conformations. The value of ψ (i) , which is applied to favor poorly sampled states, is computed as

ψ (i) =

Cmax − Ci Cmax − Cmin

(3)

where Ci is the number of observations of state i, and Cmax and Cmin are the maximum and minimum number of observations of any state, respectively. A value of α = 1 was chosen in order to give equal weighting to the directed and undirected terms. Previous work has shown that α values ranging from 0.5 to 1.5 often give similar results54. (iii) Start a new batch of 30 simulations, using as starting conformations the cluster representatives from the ten clusters in step (i), such that the number of simulations launched from each state is proportional to the state’s reward function. Initial atomic velocities were randomly assigned from a Maxwell distribution at a temperature of 300 K for each simulation. (iv) Repeat steps (i)-(iii) until a desired cumulative sampling time has been reached. A total of 67 batches were performed, generating 40 µs of aggregate simulation time. Supplementary Simulations of Apo, Unfolded and Bound, Folded BrkDBD. We performed multiple additional simulations of apo, unfolded BrkDBD and bound, folded BrkDBD at 300 K in order to improve sampling of these two end states. Starting protein conformations for the apo unfolded simulations were the same as those used in the first batch of bindingfolding simulations. For each of the 10 starting conformations, three separate 30-ns simulations were performed using randomly assigned initial atomic velocities, generating 30 simulations. Seven subsequent batches of simulations were launched sequentially; batches consisted of 30 separate 30-ns simulations. Each batch’s starting conformations were selected from all previous apo-state simulations using the FAST method. The FAST metric that we sought to maximize was the Cα/CβRMSD of BrkDBD relative to the experimental bound conformation, which was chosen in order to enhance sampling of maximally unfolded conformations. Altogether these 8 batches of simulations yielded 7.2 µs of additional simulation time for the apo, unfolded state. For the bound, folded BrkDBD state, we performed 50 simulations each of length 20 ns, starting from the first conformer of the NMR structure, thus yielding 1 µs of additional bound-state simulation time. In all simulations, trajectory snapshots were saved for analysis every 100 ps. Calculation of Conformational Entropy to Assess Conformational Variation. To quantify the conformational flexibility of BrkDBD, we computed the conformational entropy values of each main chain φ and ψ dihedral angle and of each side chain χ1 dihedral angle. For each MD frame, a given dihedral angle is assigned to a bin of width 5˚, with bins ranging from – 180˚ to 180˚. The conformational entropy of the angle is calculated using the relationship

4 ACS Paragon Plus Environment

Page 5 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

S = −kB ∑ pi ln( pi ) i

(4)

where kB denotes the Boltzmann constant, and pi denotes the overall probability that the angle is within the ith bin. Estimation of Effective Mutual Repulsion Between Positively Charged Residues. Mutual repulsion between a set of positively charged residues (R71, R75, K76, H80, R81, R82, and K86) may contribute to the persistently unfolded state of BrkDBD when it is free in solution55. Accordingly, we estimated the effective electrostatic repulsion, which accounts for collective solvent effects, between each pair of residues in this set over the supplementary 7.2 µs of MD simulations for the apo unfolded protein. After removing the water solvent and counterions from the simulation trajectories, the simulations were rerun using the rerun flag of the Gromacs mdrun command. During the simulation reruns, the implicit GBSA solvent model55,56 was employed, and residue-specific energies were computed by using the positively charged residues as energy groups. The g_energy tool of Gromacs was applied to the rerun trajectories to measure the Coulombic interaction energy between each pair of positive residues. For purposes of comparison, we also estimated the effective repulsion between pairs of the same positive residues that would occur if BrkDBD were hypothetically to adopt its final folded conformation in the absence of DNA. The trajectories for the 1 µs of supplementary MD time of the bound, folded conformation were stripped of DNA, water solvent, and counterions, and the trajectories were likewise rerun using implicit GBSA solvent model. Coulombic interaction energies between residues were calculated as described for the unfolded-state simulations. Markov State Modeling of BrkDBD Binding-Folding Pathways. In the last decade, Markov state models (MSM) have become more commonly used to analyze large volumes of MD simulation data and to extract information about metastable states and long-timescale kinetics from many short MD simulation trajectories34,36,57-60. We applied the MSMBuilder2 software package59 to generate a Markov state model based on ~400,000 trajectory frames from the 67 batches of production MD simulations. The hybrid k-centers/k-medoids clustering method59 was used to cluster the conformations into 1000 microstates based on the Cα-/Cβ-atom RMSD between BrkDBD and its folded, bound conformation in the NMR structure. An MSM was constructed based on the transitions between microstates, as described in ref. 59. Briefly, the number of transitions between microstates at an interval of a certain lag time τwas counted to yield a raw count matrix C, and a maximum likelihood estimator (MLE)59 was used to compute the transition matrix T from C. In order to compare MSMs generated by different estimators, we also constructed an MSM using the row-normalization estimator method described and reviewed recently by Zimmerman and coworkers61. Briefly, in the row-normalization estimator method, a pseudocount is added to all elements of the raw count matrix C; the pseudocount is defined as 1/n, where n is the number of states. Each row of the modified C matrix is then row-normalized to yield Tnormalize, from which an MSM is built. All reported results in this work are based on the MSM generated by the MLE method. The Markov time, the timescale at which the model is Markovian, is revealed by examining the implied timescales at

different values of τ. The implied timescale corresponding to τ can be calculated as

κ (τ ) =

−τ ln [ µ (τ )]

(5)

where κ(τ) is the implied timescale and µ(τ) is an eigenvalue of the transition matrix T(). If the model is Markovian at τ, the implied timescales should remain constant when using longer lag times. The minimum lag time yielding a Markovian model was determined as 15 ns, and this lag time was used to build the microstate MSM. In order to simplify interpretation of the MSM, we coarsegrained the 1000-microstate MSM to yield a MSM containing 70 macrostates. A suitable number of macrostates was estimated using the Bayesian agglomerative clustering engine (BACE)62. Visual inspection of the plot of the computed Bayes factor as a function of number of macrostates yielded 70 as an appropriate choice for the latter. All subsequent analyses of the binding-folding pathway were performed using the 70-macrostate model. The built-in tools of MSMBuilder2 were used to calculate the populations (stationary probabilities) of metastable states, mean first passage times (MFPT) between states, and the flux of individual pathways leading from the set of unbound, unfolded metastable states (set U) of BrkDBD to the completely bound and folded state (state F). Pathways were computed by decomposing total flux using transition path theory (TPT)35. Set F was defined as the set of all clusters that have Cα/Cβ-RMSD < 4 Å relative to the first conformer of the NMR structure of the BrkDBD-DNA complex. Set U was defined as the set of all clusters that have < 20% of the intraprotein hydrogen bonds that are observed in set F. Transition path theory was also used to calculate the committor probability,

qi+ , of each intermediate state (states other than

those in

), where

qi+ is the probability, when being in

state i, that the system will reach state F next rather than set U63-65. This committor probability, denoted as pfold, is the probability that the protein in state i will continue to the bound, folded state rather than return to the unbound, unfolded state. Characterizing Order of Structural Events. The probability of a certain event E (e.g. “helix 2 and helix 3 form concurrently”) occurring in the binding-folding pathway can be computed by analyzing each individual pathway Pi that leads from set U to state F in the MSM. The relative probability pi of pathway Pi is determined by the individual flux fi of Pi:

pi = fi / ∑ f j j

(6) where the sum is over all pathways leading from U to F. The probability of event E is then given by

p(E) = ∑ piδ i (E) i

(7) where δi(E) = 1 if E occurs in pathway Pi and 0 otherwise.

5 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Structural Analysis of MSM Macrostates. The SaveStructures tool of MSMBuilder2 was used to save 500 randomly selected conformations from each of the 70 states of the MSM macrostate model. Average values and uncertainties in structural metrics of interest were calculated for these conformations using built-in Gromacs analysis tools. Since macrostate conformational ensembles can be highly heterogeneous, we computed a metric, termed Q, that quantifies the extent to which the BrkDBD protein structure adopts the final folded-state structure without applying artificial thresholds for contact distances66. For each macrostate, a vector c(x) indexed by x = (i, j) is defined, where c(x) denotes contacts between Cα atoms of protein residues i and j. The entries of c(x) are non-integer values between 0 and 1 that represent the fraction of conformations sampled from each macrostate for which the Cα atoms of residues i and j are within 8 Å of one another. The Q-value is given by

Q=

c • c nat c nat • c nat (8)

where cnat is the c(x) vector for the final bound, folded BrkDBD macrostate. The Q-value for the final bound, folded state is 1, and smaller values of Q correspond to lower degrees of structural similarity between this state and intermediate macrostates. Protein-DNA Binding Affinities of MSM Macrostates. The rescoring function of AutoDock Vina67 was used to estimate the Gibbs free energy of protein binding to DNA (∆Gbind) for all 500 conformations that were randomly sampled from each macrostate. The 95% confidence interval of ∆Gbind was estimated by 1000 rounds of bootstrapping. Atomic charges for BrkDBD and the DNA were generated automatically using the prepare_ligand4.py and prepare_receptor4.py tools, respectively, contained in the AutoDock Tools package68. Identifying Kinetic Trap States. Kinetic traps are metastable states that have significant populations but contribute little to the overall folding flux35. We identified a set of intermediate metastable states in the macrostate MSM that together account for ~20% of the total population but are not present in any of the pathways contributing to the top 70% of the net flux between the unfolded set of states U and the folded state F. One common approach to identifying kinetic traps is to remove substates from a MSM and to observe the resulting change in MFPT to the folded state F35,69. Accordingly, to estimate the effect of these states on the folding rate, we removed them from the MSM and recalculated the MFPT between U and F. To control for the effect of removing states K from the MSM, we performed 1000 control rounds in which a set of randomly chosen macrostates (other than those belonging to ) having the same size as set K were removed from the MSM. In each control round, the MFPT between U and F was computed. The effects on the MFPT of removing the random sets of states were compared to the effect of removing states K. Uncertainty Calculations For Observables Derived from MSM. To estimate 80% confidence intervals for implied time-

Page 6 of 20

scales, populations of metastable states, and MFPTs, 100 rounds of bootstrapping were performed. During each round, a new microstate and a new macrostate MSM were constructed using the same number of states as used previously, and populations and MFPTs were computed for each bootstrapped macrostate MSM. Images. All molecular images were produced using UCSF Chimera70.

Results and Discussion In this section, we validate our MSMs of the BrkDBD binding-folding pathway. We then characterize the structural diversity and dynamics of BrkDBD in its free state. Finally, we describe the complete pathways that BrkDBD follows as it binds to a 12-nucleotide DNA double helix (omb-enhancer region DNA of Drosophila melanogaster) and subsequently folds, including intermediate states, the order of protein secondary structure formation, initial binding sites on the DNA, the role of fly-casting in forming encounter complexes, kinetic traps, and the relative degrees of induced fit versus conformational selection binding mechanisms. In the following discussion, we use the term free state to refer to BrkDBD in its completely unbound state, which is free in solution, and the term bound state to refer to BrkDBD when it is bound and completely folded at the experimentally observed binding site on DNA. States occurring between the free and bound states are referred to as intermediate states. Unless indicated otherwise, all reported uncertainties are 95% confidence intervals estimated by bootstrapping. MSM Validation for Simulations of BrkDBD Coupled Binding-Folding. Over the past decade MSM methods have proven highly useful for modeling conformational dynamics of biomolecules in long-timescale processes such as conformational transitions, ligand binding and unbinding, and protein folding, providing a powerful complement to experiment (reviewed in Ref. 71). MSMs model the conformational dynamics of biological molecules and other polymers as transitions between sets of metastable states34-36,57. They offer an efficient strategy to sample transitions between these states by allowing a large number of short parallel MD trajectories to be combined into a single kinetic network model that describes long-timescale dynamics and equilibrium properties35. This advantage often makes it unnecessary to conduct longtimescale simulations. For these reasons, we generated a 1000microstate MSM to map the complete binding-folding pathway of BrkDBD at 300 K, including metastable states along the pathway and kinetics of transitions between the states. The MSM was built using the 40 µs of MD simulations of the binding-folding process. To ease visualization, the 1000microstate MSM was coarse-grained into a MSM containing 70 macrostates. The calculated MSMs are Markovian in nature. In MSM validation, a requirement for Markovian behavior is that the implied timescales remain constant at different lag times. For both the microstate and the macrostate MSMs, the implied timescales level off after a lag time of 15 ns (Figure 2), suggesting that they demonstrate Markovian behavior when the lag time is at least 15 ns. For this reason we selected 15 ns as the lag time for constructing both MSMs.

6 ACS Paragon Plus Environment

Page 7 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry this definition, rather than a definition based strictly on the NMR structure, to take into account inherent protein flexibility in the folded state.

Figure 3. Representative conformations and main chain flexibility of BrkDBD in the absence of DNA. Centroid structures of the twelve most highly populated clusters from 7 µs of unbound-state MD simulations are depicted. Residues are colored by root-meansquare fluctuation (RMSF) of Cα atoms.

Figure 2. Implied timescales for Markov state models (MSM) of BrkDBD binding to DNA. (A) Implied timescales for 1000microstate MSM. (B) Implied timescales for 70-macrostate MSM. Error bars represent 80% confidence intervals calculated by bootstrapping.

When Free in Solution, BrkDBD Adopts a Heterogeneous Set of Flexible Unfolded Conformations with Low Overall Secondary Structure Content but Strong Helical Propensity for Helix 2. To characterize the free state of BrkDBD, we performed 7 µs of MD simulations of apo BrkDBD at 300 K, starting from a diverse set of unbound, unfolded conformations. For comparison to the bound state, we also performed 1 µs of MD simulations of the protein-DNA complex taken from the NMR structure. Over the course of the 7 µs of unbound-state simulations, BrkDBD remains unfolded (Figure 3). The Cα-RMSD values of the protein relative to its bound conformation observed by NMR range from 6 Å to 16 Å, with an average value of 11±3 Å. Moreover, since the protein remains unfolded in the free state, it contains an average of only 8±3% of the bound-state intramolecular contacts. Here, bound-state intramolecular contacts are defined as pairs of protein non-hydrogen atoms at least four residues apart that are separated by ≤ 5 Å for at least 80% of the 1 µs simulation time of the bound state. We use

Although BrkDBD never adopts its bound-state conformation in the absence of DNA, the majority of the protein remains compact. The protein has an average radius of gyration of 13±2 Å and 12±1 Å in the free and in the bound states, respectively. Only the N-terminal residues 43-50 of free BrkDBD occasionally take on an extended conformation. This overall compactness is enabled in part by extensive intramolecular hydrogen bonding throughout the structure. The protein forms on average 40 intramolecular hydrogen bonds in each MD frame, although nearly all of these hydrogen bonds are formed between different residue pairs than the residue pairs for the bound-state hydrogen bonds. In the free state, secondary structure is largely absent from BrkDBD. Of the four α-helices that BrkDBD forms when bound to DNA, only the second helix (residues 70-77) has a propensity to form when BrkDBD is free (Figure 3). This helix is completely formed for 36% of the total free-state simulation time. In contrast, helix 1 (residues 50-62) and helix 3 (residues 81-89) are never completely formed in the free state, but instead are on average only 14% and 15% formed at any given time. Similarly, helix 4 (residues 91-97) forms completely for only 4% of the simulation time and is on average 14% formed. The observed propensity for helix 2 formation in unbound BrkDBD (36% of unbound population) agrees reasonably well with NMR studies72 that report a ~20% population of folded helix 2. Moreover, in 29% of the simulation frames of free BrkDBD, residues 63-69 form a helix that corresponds to an extended helix 1 structure observed by NMR72. However, in the folded state, residues 63-69 form a loop. Consequently, the region around C66, a residue shown experimentally to be crucial for folding72, must change its conformation from helical to a loop during the folding event. NMR studies have demonstrated that mutation of C66 abolishes BrkDBD folding, putati-

7 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

vely because a key hydrogen bond between the C66 and D63 side chains greatly stabilizes the helix1-helix2 loop in the folded conformation72. Consistenly with this observation, the simulations show that C66 fails to form hydrogen bonds with the surrounding residues 63-69 when BrkDBD is free but forms a stable hydrogen bond with D63 when the protein is in its final folded conformation at the DNA binding site. When BrkDBD is free, it has much greater conformational flexibility than in the bound state. The average root-meansquare fluctuations (RMSF) of all Cα atoms in the free and bound states are respectively 7.9±0.3 Å and 0.9±0.1 Å. We also determined the average per-residue conformational entropy values for all φ and ψ dihedral angles to quantify the main chain conformational flexibility. For the free and bound states, the average per-residue main chain conformational entropy values are 5.7 cal/mol•K and 3.7 cal/mol•K, respectively. Similarly, the average per-residue conformational entropy values over all χ1 side chain dihedral angles are 2.7 cal/mol•K for free BrkDBD and 2.0 cal/mol•K for the bound state. These entropy values further indicate the significantly greater conformational freedom not only of the protein main chain, but also of the side chains, when BrkDBD is free in solution. Similarly to many other IDPs, BrkDBD has a significantly greater solvent-exposed hydrophobic surface area when it is free than when it is bound to its target. In the free state, the average hydrophobic surface area exposed to solvent is 1900±90 Å2. However, if the protein adopted the bound conformation while free in solution, the corresponding surface area would be only 1480±70 Å. These differences between the free and bound states of BrkDBD are consistent with those of other IDPs. Most experimentally characterized IDPs are highly flexible, structurally diverse, and lack stable, well-defined secondary structures when unbound. Mutual Effective Repulsion Between Positively Charged Residues Contributes to BrkDBD Unfolding. BrkDBD appears to remain unfolded when free in solution at least partially due to mutual repulsion between a large set of positively charged residues (R71, R75, K76, H80, R81, R82, and K86) on the surface of the protein. In the bound state, all of these residues except K76 are clustered within the DNA-protein binding interface and form extensive hydrogen bonding and electrostatic interactions with the DNA that help drive complex formation. However, it has been speculated that the high positive charge density generated by these residues leads to repulsion between them when BrkDBD is free and that this repulsion keeps the protein unfolded in the absence of DNA38. In our MD simulations, the mutual repulsion between R71, R75, K76, H80, R81, R82, and K86 is, indeed, significantly relieved when BrkDBD remains unfolded compared to in a hypothetical scenario where it folds without binding to DNA (Table S1). We computed average Coulombic interaction energies between each pair of these positive residues for the 7 µs of unbound-state simulations and the corresponding values if the protein were to adopt its folded conformation in the apo

Page 8 of 20

state without the stabilizing presence of DNA. The average pairwise Coulombic interaction energies of these residue pairs for the unfolded and for the hypothetical apo folded protein conformations are 11 kcal/mol and 27 kcal/mol, respectively. Thus, when the protein remains unfolded, there is an average of 16 kcal/mol less repulsion between each positive residue pair than there would be if the protein were to fold while unbound. For a few residue pairs, there is a particularly drastic decrease in repulsion when the protein remains unfolded. For example, the Coulombic repulsions between R71 and R75 and between R82 and K86 are respectively 37 kcal/mol and 28 kcal/mol lower in the unfolded state than in the hypothetical apo folded state. The observed decrease in interresidue repulsion is also reflected in greater distances between the set of positively charged residues in the unfolded state. The Cα atoms of R71, R75, K76, H80, R81, R82, and K86 are on average ~3 Å farther apart from one another when BrkDBD is free than in the bound state (Table S1). Certain pairs of amino acids, including R71-R81 and R71-K86, increase their separation by ~8 Å in the free state, allowing their mutual repulsion to diminish. Taken together, the computed decrease in Coulombic interactions between the set of positively charged residues when BrkDBD is unfolded and unbound supports the previous speculation that mutual repulsion may be a driving force for the protein to remain unfolded when free in solution. Future studies of apo BrkDBD folding in which these residues were replaced by neutral residues could, in principle, be used to probe this hypothesis further. BrkDBD Follows Multiple Diverse Pathways En Route From its Free, Unfolded State to its Bound, Folded State. In order to characterize the coupled binding and folding mechanism of BrkDBD, we conducted 40 µs of unbiased MD simulations starting from a diverse set of unbound, unfolded conformations. Markov state modeling was performed on the aggregate set of simulation trajectories to cluster the simulated phase space into kinetically related metastable states. Folded states were defined as the set of all clusters having a Cα/CβRMSD less than 4 Å relative to the folded conformation in the BrkDBD-DNA NMR structure. Unfolded states were defined as the set of all clusters having less than 20% of the intramolecular hydrogen bonds that are present in the folded, bound state. The folding flux between the unfolded and folded states and folding committor probabilities (pfold) were calculated as described in Methods. To view the sequence of events during the binding and folding process, the folding flux was decomposed into individual pathways using transition path theory. BrkDBD follows multiple binding-folding pathways (Figure 4). The pathways differ significantly in terms of initial encounter complexes, binding locations of the intermediate states on the DNA, order of secondary structure formation, and mean first-passage times (MFPTs) between states.

8 ACS Paragon Plus Environment

Page 9 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

Figure 4. BrkDBD binding-folding pathways from the 70-macrostate MSM. (Left) The five networks with the highest folding flux. Blue numbers indicate state numbers within the MSM. The numbers on the left indicate committor probabilities (pfold), and the thickness of arrows indicates the relative flux of folding trajectories between each pair of states. Depicted conformations are centroid structures of the corresponding states. Numbers next to arrows give the mean first-passage time (MFPT) in units of µs between connected states and 80% confidence interval limits (small numbers). The protein-binding region of the DNA is colored cyan. Protein residues that are helical and non-helical in the folded state are colored blue and red, respectively. (Lower right) Trap states that contribute only minor folding flux but have significant stationary probabilities and slow down folding by ~40%.

The five binding-folding pathways with the highest individual fluxes account for 71% of the total folding flux (Figure 5). These pathways comprise a total of eleven intermediate metastable states with heterogeneous conformations, extent of folding, extent of secondary structure formation,

and conformational flexibility (Table S2). (See Movie S1 for a depiction of BrkDBD as it transits through the highestflux pathway.)

9 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 20

Figure 5. Individual top five BrkDBD binding-folding pathways from the 70-macrostate MSM. The top five pathways account for 71% of the total folding flux. The numbers on bottom indicate pathway numbers and flux for each pathway. Blue numbers indicate state numbers within the MSM. DNA and protein residue coloring is as described in Figure 4.

For the five highest-flux pathways there are two initial free metastable states (numbered 24 and 43) in which the protein is unfolded yet compact. The mean Cα-RMSD of the protein in these free states relative to the NMR structure is 9 Å. In states 24 and 43 the protein lacks stable secondary structure, possessing 0% and 2% of its total bound-state αhelical content, respectively. In ~30% of BrkDBD conformations in state 43, the protein adopts a short (3-4 residues) portion of helix 2, but the other three helices are entirely absent in all conformations. However, BrkDBD in state 24

completely lacks all its bound-state helices, possessing instead a 4-residue helix at residues 63-66 that the bound protein never adopts. Similarly to its behavior in the 1 µs of supplemental simulation time for free BrkDBD, the protein remains compact in states 24 and 43, having a ~12 Å mean radius of gyration in each state. Among our aggregate raw MD trajectory data, we observed 143 total unique binding events between BrkDBD and DNA. A binding event was taken to occur if the minimum protein-DNA distance was ≥ 20 Å at some time in the

10 ACS Paragon Plus Environment

Page 11 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

trajectory and ≤ 4 Å at a later time in the trajectory. This number of observed binding events suggests that an adequate quantity has been sampled to provide robust statistics for the analysis of the binding pathways. (See Movies S2 and S3 for examples of binding events observed in the raw MD trajectories.) In the case of our simulation data, the observed coupled binding-folding pathways do not depend on the estimator that we used to construct the MSM from the count matrix C. We built MSMs using both the maximum likelihood estimator method and the simpler row-normalization method, and both models yielded the same set of top pathways (accounting for ≥ 70% total flux) from the unfolded to the folded state. However, the MSMs yielded slightly different MFPTs between states (Table S3). BrkDBD Forms Weakly Bound, Disordered Initial Encounter Complexes Far From its Final Binding Site on DNA. In all binding pathways BrkDBD forms an initial encounter complex with DNA at a position on the DNA that is distant from the final binding site. The final binding site in the NMR structure and in the MD simulations is in the DNA major groove at the position of the GGCGTC recognition sequence (Figures 4-5). In pathways I, II and V, which together account for 47% of the total folding flux, the protein binds initially at the minor groove on the opposite side of the double helix relative to the final binding site. States 55 and 46 contain the encounter complex for pathways I/II and pathway V, respectively. For the state 46 encounter complex, residues 43-48 of the N-terminus bind principally to the nonrecognition-sequence DNA strand in the minor groove (Figure 6). The R46 side chain forms two hydrogen bonds with this strand, and the R45 main chain nitrogen forms a hydrogen bond with the terminal C of the GGCGTC recognition sequence.

Figure 6. Encounter complexes between BrkDBD and DNA for the top five binding-folding pathways. Protein side chains mediating hydrogen bonds with the DNA are shown as sticks. DNA and protein residue coloring is as described in Figure 4.

For the state 55 encounter complex, protein residues 4546 and 62-71 of BrkDBD contact the non-recognitionsequence DNA strand in the minor groove. Seven hydrogen bonds form between the DNA and the R45, R46, and R71 side chains. Moreover, in the encounter complex of pathways III and IV (represented by state 54), protein residues 64-83 and 91-92 contact the DNA at the 5’ end of the strand containing the recognition sequence and the 3’ end of the opposite strand. This complex involves three hydrogen bonds between the protein and DNA, two of which are mediated by the R71 side chain and one of which is mediated by the R95 side chain. Interestingly, each encounter complex in the five highest-flux pathways contains only one short 4residue helix over residues 63-66. This helix disappears by the end of the folding pathway. Residues 63-66 represent an extension immediately after the residues that make up helix 1 in the bound state. Each of the three initial encounter complexes has pfold < 0.50, indicating that there is a greater probability that the protein will dissociate from the DNA rather than continue through the folding pathway. Although the helices observed in the bound state of BrkDBD are absent in the encounter complexes, the protein initially binds the DNA mainly in regions that subsequently fold into the bound-state helices. In 51% of the folding flux, the residues corresponding to helix 2 alone are the first re-

11 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

gion of the protein to contact the DNA, whereas the residues of helices 2-3 together and helices 2-4 together form the initial contacts in 12% and 26% of the folding flux, respectively. In the remaining 11% of the folding flux (pathway V) the encounter complex is mediated by residues 43-48, which never develop secondary structure. Pathway V is the only pathway in which BrkDBD uses a fly-casting mechanism to bind initially to the DNA. Represented by state 46, the fly-casting encounter complex features a long, extended N-terminal segment consisting of residues 43-51, of which R45 and R46 hydrogen bond to the DNA. For all other pathways the compact protein core (residues 52-101) binds to the DNA. Each encounter complex involves relatively weak binding between BrkDBD and DNA. The estimated average values of ∆Gbind for protein binding to DNA in states 46, 54, and 55 are –4±3 kcal/mol, –8±4 kcal/mol, and –9±3 kcal/mol, respectively. In contrast, the estimated ∆Gbind in the final bound state is –23±3 kcal/mol. This weak binding reflects in part the small number of hydrogen bonds between the protein and DNA in the encounter complexes.

Page 12 of 20

states 31 and 30, states 31 and 57, and states 11 and 30 all exceed 30 µs (Figure 4) in part because in each transition a relatively large quantity of helical content (10-13 residues) is formed (Figure 7).

BrkDBD Diffuses Across the DNA Surface Before it Reaches its Final, High-Affinity Binding Site. After BrkDBD initially contacts the DNA, it spends an average time of ~35 µs wandering among intermediate states located at different locations on the DNA before it discovers its final binding site in the major groove in states 30 and 57. As the protein traverses the DNA surface, residues 70-75 (which form the majority of helix 2 in the bound state) remain consistently in contact with the DNA, while residues 81-85 (which form half of bound-state helix 3) and residues 43-45 at the N-terminus also occasionally contact the DNA. In each of the intermediate states before the final binding site is reached, the protein forms an average of only 5 hydrogen bonds with the DNA, compared to 18 hydrogen bonds in the bound state. Moreover, the average binding interface area between protein and DNA in these intermediate states is ~500 Å2, which is less than half the area of the binding interface (1300 Å2) in the bound state. This combination of low degrees of hydrogen bonding and small binding interface areas leads to weak protein-DNA complexes that have ∆Gbind less than half the magnitude of the final bound-state ∆Gbind (Table S2). Despite its relatively low binding affinities, however, the protein becomes much less flexible once it has associated with the DNA. Whereas the average Cα-RMSF over all protein residues is ~8 Å in the two metastable unbound states 24 and 43, once the protein is on the DNA surface, the combination of hydrogen bonding and electrostatic interactions with the DNA lead to Cα-RMSF values below 4.0 Å for all intermediate states. Weak association between the protein and DNA allows the protein to remain mobile, preventing it from remaining bound at any intermediate location for longer than 9 µs before it reaches its final binding site. The MFPTs between intermediate states at different sites on the DNA range widely. Some states, for example, states 50 and 31, are separated by a short MFPT of only 1 µs, while other states, including states 46 and 54, are separated by a longer MFPT of up to 9 µs. In contrast, the largest MFPTs overall occur for the three transitions in which the protein reaches its final binding site in states 30 and 57. The MFPTs for the transitions between

12 ACS Paragon Plus Environment

Page 13 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

Figure 7. Helix formation along BrkDBD binding-folding pathways. Probabilities of residues adopting an α-helical conformation within each macrostate are indicated. Macrostates are listed in the approximate order in which they occur in the pathways; macrostates in which the protein has reached its final binding site are listed in italics. Residue stretches corresponding to helices 1-4 in the NMR structure of the folded state are enclosed between vertical lines.

The observation that BrkDBD moves along the DNA until a high-affinity binding site has been recognized agrees with NMR data for BrkDBD folding72 and is also observed experimentally for the folding pathways of several other IDPs, including the HOXD9 homeodomain73,74. These experimental data indicate that when the protein binds strongly to its DNA partner, it loses conformational degrees of freedom such that it does not explore any further unfolded or partially unfolded conformations at other sites on the binding partner. Once the final binding site is discovered in states 30 and 57, pfold > 0.50. There is thus a >50% likelihood that the protein will remain bound and reach the folded state rather than dissociate from the DNA. Of particular importance, it is not until BrkDBD reaches the final binding site that a large overall conformational change takes place that makes the protein more structurally similar to its fully folded conformation. To quantify the extent of overall structural similarity between each state and the fully bound, folded state, we calculated Q for all states. Q-values range continuously between 0 and 1, where larger Q-values correspond to higher fractions of Cα-Cα contacts shared in common between a given state and the fully folded state. For all macrostates in which the protein has not yet reached the final binding site, Q-values are below 0.20 (Figure 8). After BrkDBD diffuses onto the binding site, Q approximately doubles and further increases gradually until the protein attains state 0, the penultimate state. A large increase in Q (∆Q ~ 0.4) accompanies the transition from state 0 to the fully folded state, showing that in the final step alone of the folding process nearly half of the folded-state contacts form.

Figure 8. Q-values versus pfold (committor) values for metastable states in the top five binding-folding pathways. Q-values indicate the extent to which bound-state Cα-Cα contacts are formed within BrkDBD. MSM state numbers are indicated beside each data point. Data points for states in which BrkDBD has reached its final binding site are colored red.

BrkDBD Lacks Stable Secondary Structure Until it Reaches its Final Binding Site on DNA. In all of the intermediate states in which BrkDBD has not reached its final binding site on the DNA, the protein has minimal secondary structure, containing ≤ 5 helical residues on average (Figure

13 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

7). Moreover, ~50% of the helical residues in these intermediate states occur in residues 63-67, which are not helical in the folded, bound state. Larger quantities of helical content are adopted only after the protein reaches its final binding site in states 30 and 57. In these states, BrkDBD has an average of 14 and 16 total helical residues, respectively, corresponding to 35% and 41% of the helical content occurring in the bound state. Similarly, states 0 and 19, the only remaining intermediate states before the protein folding pathway completes, have averages of 15 and 14 total helical residues, representing 39% and 35% of bound-state helical content. In the final step of the folding pathway, as the protein transitions from state 0 to the fully folded conformation, 21 additional residues adopt helical structure, accounting for the remaining bound-state helical content. This adoption of more than half the final helical content during a single state transition contributes to the large MFPT (13 µs) between state 0 and the folded conformation. Once the folded conformation is reached, the protein has 36 helical residues on average, corresponding to 97% of the helical content observed in the NMR structure. In each of the five highest-flux folding pathways, helices 2-4 begin to form concurrently (Figures 5 and 7). In pathway I, helices 2-4 begin to form during the transition from state 31 to state 57, while in pathways II-V they begin to form during the transition to state 30. By the time BrkDBD reaches state 57 in pathway I, helices 2-4 are 70%, 60%, and 52% formed, respectively, whereas helix 1 has not begun to form at all. Similarly, when state 30 is reached in pathways II-V, helices 2-4 are all ~50-60% formed, yet helix 1 is completely missing. In all pathways, helices 2-4 remain unformed until the final folded state is reached. In contrast to the other helices, helix 1 begins to form very late in the pathway, not appearing until the penultimate state 0. A Set of Kinetic Traps Reduces the Overall BrkDBD Folding Rate by ~40%. The MFPT between the free states 24 and 43 and the final bound, folded state is 49 µs. The corresponding rate constant kfold for the complete BrkDBD binding-folding process, as given by 1/MFPT, is 2.0•105 s–1. The 70-macrostate MSM contains three metastable states that together have a stationary probability of 17% but contribute a very small (7%) folding flux. These states are hence defined as kinetic trap states. In the trap states, the protein binds weakly (average ∆Gbind of –9±3 kcal/mol) at different sites on the DNA surface and lacks secondary structure except a partially formed helix 1 (Figure 4). When the trap states are collectively removed from the MSM, the computed kfold increases by ~40% to 2.8•105 s–1. This increase shows that the protein gets stuck in at least one of the traps for a large amount of time and that the traps significantly slow down folding. BrkDBD Folding Occurs Exclusively Via an Induced Fit Model. Two widely used models for protein-ligand binding are the induced fit and conformational selection models. Each model attempts to describe how a protein transitions from an unbound conformation to a bound conformation in complex with a ligand. In the induced fit model75,76, the ligand binds initially to the protein in the protein’s unbound conformation, and this binding event induces the protein to transition to the bound state. In contrast, in the conformational selection model, the intrinsic dynamics of the protein

Page 14 of 20

allow it to transition continually between a stable unbound conformation and an unstable bound conformation before the ligand binds76-78. After the protein has transitioned to the unstable bound conformation, the ligand binds and stabilizes the bound conformation. Depending on the concentrations of the protein and ligand, differing relative contributions of the two binding mechanisms can be observed79. For any proteinligand binding event, the corresponding binding model is of fundamental importance. In the present system, the DNA molecule is treated as the ligand. We characterize the folding model of BrkDBD by evaluating the percentage of bound-state interatomic contacts that are formed within the protein as a function of the minimum distance between the protein and the DNA (Figure 9). The percentage of bound-state contacts that are formed is used to measure the extent to which the protein is folded. As described previously, bound-state contacts are defined as pairs of non-hydrogen protein atoms at least four residues apart in sequence space that are within 5 Å of one another for at least 80% of the 1 µs of supplementary simulation time for the bound state.

Figure 9. BrkDBD folds exclusively via an induced fit model at the simulated protein concentration of ~4 mM. For each sampled conformation in the binding-folding pathway, the percentage of bound-state intramolecular protein contacts is plotted versus the minimum distance between the protein and DNA. The vertical dashed line corresponds to 5 Å intermolecular distance, the cutoff distance for the protein’s being bound to DNA.

The BrkDBD folding mechanism follows exclusively an induced fit model at the simulated protein concentration of ~4 mM, which lies within the range of common intracellular concentrations of proteins having BrkDBD’s sequence length for a wide variety of organisms80. If the conformational selection model contributed to the binding mechanism, we would expect that the aggregate simulation data would contain a set of conformations in which the protein is folded or nearly folded while still free in solution. However, whenever the protein is not bound to the DNA surface (> 5 Å minimum distance), it forms less than 20% of its bound-state intramolecular contacts. High levels of bound-state contacts

14 ACS Paragon Plus Environment

Page 15 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

are formed only when the protein has associated with the DNA. Additionally, even after BrkDBD initially contacts the DNA, no notable bound-state structural content is adopted until the protein reaches its final binding site (Figure 8). Consequently, in the MSM the protein follows an induced fit mechanism in which it begins to fold after it has contacted the DNA but completes the majority of its folding only after it has reached the final binding site. Modeled BrkDBD Binding-Folding Pathway Agrees Largely With Experimental Data. Many properties of BrkDBD folding that our simulation set reveals agree well with experimental observations. Here we present a brief summary of observed agreement between our modeling results and experimental results, as well as a few discrepancies and potential sources of these discrepancies. Moreover, we comment briefly on agreement between our findings and those of the only previously published computational study of BrkDBD folding. NMR studies and our MD simulations indicate that BrkDBD is highly unfolded when free in solution and exists as a heterogeneous, dynamic ensemble of conformers. Although the protein largely lacks secondary structure when free, helix 2 has a moderate propensity to form. In the simulations there is a 36% probability that helix 2 is completely formed in the absence of bound DNA, while NMR studies report a ~20% probability72. However, in the simulations helices 3 and 4 have a low likelihood of being formed, while NMR experiments detect helical propensity for helices 3 and 4 in addition to helix 2. This quantitative discrepancy could result from insufficient MD sampling and/or the necessarily limited set of initial unfolded protein conformations. Additionally, the simulations and experimental studies38 both strongly suggest that electrostatic repulsion between the positively charged residues R71, R75, K76, H80, R81, R82, and K86 make folding in the unbound state highly unfavorable. Experimental observations have also indirectly suggested a fly-casting model in BrkDBD binding to DNA, and our simulations clearly indicate that fly-casting does, indeed, play a role. Since NMR studies have shown that BrkDBD is unfolded when unbound, it has previously been hypothesized that the protein approaches and binds to its DNA target with a large capture radius by a fly-casting mechanism72. Nevertheless, to our knowledge, no experimental data have definitively supported this hypothesis. According to our MSM, in 11% of the folding flux BrkDBD approaches the DNA following a fly-casting mechanism in which the extended Nterminus initially contacts the non-recognition-sequence DNA strand. The weak initial contacts thus formed by the Nterminus subsequently “reel” the protein onto the DNA. In the remaining 89% of the folding flux, fly-casting does not take place, as the protein lacks an extended capture radius in the initial encounter complexes and adopts instead a compact conformation whose radius of gyration is only ~1 Å larger than that of the folded conformation. In addition, the simulations agree with experimental data72 suggesting that principally helices 2-4 mediate the initial contacts between BrkDBD and DNA. In contrast to experimental data, however, in the simulations helices 3-4 are not preformed before binding. Our simulation data and experimental findings both reveal that BrkDBD is highly mobile on the DNA until it has reached its final binding site. NMR data suggest that the

protein moves across DNA by either sliding or hopping along the molecular surface and that the N-terminal part of helix 3 probes for specific contacts72. The MSM presented here likewise shows that the protein spends ~35 µs exploring multiple locations on the DNA before it settles in its final binding site. As the protein wanders across the DNA, it binds relatively weakly, enabling it to move more readily. According to the MSM, the translocation mechanism involves not only helix 3, as suggested by NMR, but also helix 2 and the loop between helices 2 and 3, implying that larger swathes of the protein structure are involved in the search for the binding site. Experimental studies have not distinguished between induced fit and conformational selection binding models for BrkDBD. Nevertheless, simulations of BrkDBD unfolding were applied in a previous computational study31 to probe the binding model. The previous study reported that the RMSD between the protein and its final folded conformation decreases as the protein approaches the DNA, consistent with the induced fit model. Similarly, our MSM, which directly models the binding and folding process, shows that BrkDBD binds exclusively via an induced fit model under the simulated system conditions, as shown by a low quantity of formed bound-state contacts whenever the protein is free in solution. Finally, the timescale of diffusion of BrkDBD along the DNA observed in our MSM is similar to that observed experimentally for certain other transcription factors. To our knowledge, no kinetic studies of BrkDBD diffusion along the DNA have been published. Nevertheless, experimental studies have characterized DNA diffusion rates for several other transcription factors, including E. coli purR81. The purR transcription factor travels along DNA over distances of 10–100 base pairs with MFPTs of ~1-100 µs, which are comparable to those observed here for BrkDBD, which range from 1 µs to 38 µs.

BrkDBD-DNA is a Multibody System. The presence of two macromolecules in BrkDBD-DNA binding-folding makes the binding mechanism of this system a typical multibody process. Several recent advances in creating MSMs for multibody processes have been published. In the automatic state partitioning for multibody systems method (APM)82, the presence of different time scales in a multibody process (e.g. the fast time scales of free ligand diffusion before the ligand binds to a receptor) is addressed by directly incorporating dynamics into geometric clustering when identifying metastable states. This method has been shown to afford great enhancements in computational efficiency of MSM construction and in model accuracy82. Another advancement in MSMs for multibody processes involves the use of hidden Markov models (HMMs)83. HMMs are built on k-means clustering and find a coarse-grained kinetic model by grouping fast mixing states into metastable sets, and estimating transition rates between them. This method has recently been applied to model the complete protein-protein association kinetics for barnase and its inhibitor barstar83. Future work could involve applying the APM method and/or HMMs to assess how they may affect the discovered binding-folding pathways for the BrkDBD-DNA system.

15 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 20

ABBREVIATIONS

Conclusion MD simulations of the BrkDBD coupled binding-folding pathway indicate that BrkDBD binds to its cognate DNA on the microsecond timescale via an induced fit mechanism. Like other IDPs, BrkDBD is unfolded and lacks stable secondary structure in the absence of DNA. BrkDBD binds to DNA both by colliding with the DNA when the protein is in a compact, unfolded conformation and by a fly-casting model, in which the protein has a large capture radius due to its extended N-terminus. After its initial binding event, the protein transits through multiple metastable intermediate states involving weak, transient complexes with the DNA. Before the protein reaches its final binding site on DNA, it is highly mobile, moving readily across the DNA surface. Once the protein reaches its final, high-affinity binding site, it completes its folding, acquiring the majority of its secondary and tertiary structure. Previous experimental studies have established that BrkDBD represents an extreme case of coupled bindingfolding and that the protein remains unfolded when it is free in solution. Our simulations and MSM elucidate for the first time the complete binding-folding pathways that BrkDBD follows in order to adopt its bound, folded conformation as observed by NMR, including structures of intermediate states at atomistic resolution and timescales that separate metastable states. We anticipate that the methodology employed in this study can be readily applied to other systems, enriching understanding of IDP binding-folding pathways.

ASSOCIATED CONTENT Supporting Information The Supporting Information is available free of charge on the ACS Publications website. Table of mutual repulsions between positively charged residues; table of physical and structural properties of MSM macrostates in top five binding-folding pathways (PDF); table comparing MFPTs for MSMs calculated using different estimators; movie depicting top binding-folding pathway from MSM; movies depicting protein binding from raw MD trajectories

AUTHOR INFORMATION Corresponding Author * E-mail: [email protected]. Phone: 425-352-3281

Author Contributions The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

ACKNOWLEDGMENT This work was facilitated by the use of advanced computational, storage, and networking infrastructure provided by the Hyak supercomputer system at the University of Washington. Startup research funding for PCA from University of Washington Bothell was used to fund this work.

IDP, intrinsically disordered protein; MD, molecular dynamics; MSM, Markov state modeling; BrkDBD, Brinker transcription factor DNA-binding domain; RMSD, root-mean-square deviation; RMSF, root-mean-square fluctuation; MFPT, mean-firstpassage time.

REFERENCES (1) Dyson, H. J. (2011) Expanding the proteome: disordered and alternatively folded proteins. Q. Rev. Biophys. 44, 1–52. (2) Uversky, V. N. (2011) Multitude of binding modes attainable by intrinsically disordered proteins: a portrait gallery of disorder-based complexes. Chem. Soc. Rev. 40, 1623–1634. (3) Wright, P. E., and Dyson, H. J. (2009) Linking folding and binding. Curr. Opin. Struct. Biol. 19, 31–38. (4) Tompa, P. (2012) Intrinsically disordered proteins: a 10-year recap. Trends Biochem. Sci. 37, 509–516. (5) Uversky, V. N. (2016) Dancing protein clouds: the strange biology and chaotic physics of intrinsically disordered proteins. J. Biol. Chem. 291, 6681–6688. (6) Peng, Z., Mizianty, M. J., and Kurgan, L. (2014) Genome-scale prediction of proteins with long intrinsically disordered regions. Proteins 82, 145–158. (7) Patil, A., and Nakamura, H. (2006) Disordered domains and high surface charge confer hubs with the ability to interact with multiple proteins in interaction networks. FEBS Lett. 580, 2041–2045. (8) Ekman, D., Light, S., Björklund, A. K., and Elofsson, A. (2006) What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol. 7, R45. (9) Dosztányi, Z., Chen, J., Dunker, A. K., Simon, I., and Tompa, P. (2006) Disorder and sequence repeats in hub proteins and their implications for network evolution. J. Proteome Res. 5, 2985– 2995. (10) Singh, G. P., Ganapathi, M., Sandhu, K. S., and Dash, D. (2006) Intrinsic unstructuredness and abundance of PEST motifs in eukaryotic proteomes. Proteins 62, 309–315. (11) Haynes, C., Oldfield, C. J., Fi, F., Klitgord, N., ME, C., Radivojac, P., Uversky, V. N., Vidal, M., and Iakoucheva, L. M. (2006) Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput. Biol. 2, e100. (12) Dunker, A. K., Cortese, M. S., Romero, P., Iakoucheva, L. M., and Uversky, V. N. (2005) Flexible nets: the roles of intrinsic disorder in protein interaction networks. FEBS J. 272, 5129–

16 ACS Paragon Plus Environment

Page 17 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

5148. (13) Xie, H., Vucetic, S., Iakoucheva, L. M., Oldfield, C. J., Dunker, A. K., Obradovic, Z., and Uversky, V. N. (2007) Functional anthology of protein disorder. 3. Ligands, post-translational modifications, and diseases associated with intrinsically disordered proteins. J. Proteome Res. 6, 1917–1932. (14) Uversky, V. N., Oldfield, C. J., and Dunker, A. K. (2008) Intrinsically disordered proteins in human diseases: Introducing the D2 concept. Annu. Rev. Biophys. 37, 215–246. (15) Uversky, V. N. (2013) Intrinsic disorderbased protein interactions and their modulators. Curr. Pharm. Des. 19, 4191–4213. (16) Metallo, S. (2010) Intrinsically disordered proteins are potential drug targets. Curr. Opin. Struct. Biol. 14, 481–488. (17) Varadi, M., Kosol, S., Lebrun, P., Valentini, E., Blackledge, M., Dunker, A. K., Felli, I. C., Forman-Kay, J. D., Kriwacki, R. W., Pierattelli, R., Sussman, J., Svergun, D. I., Uversky, V. N., Vendruscolo, M., Wishart, D., Wright, P. E., and Tompa, P. (2014) pE-DB: a database of structural ensembles of intrinsically disordered and of unfolded proteins. Nucleic Acids Res. 42, D326– D335. (18) Kiefhaber, T., Bachmann, A., and Jensen, K. S. (2012) Dynamics and mechanisms of coupled protein folding and binding reactions. Curr. Opin. Struct. Biol. 22, 21–29. (19) Shammas, S. L., Crabtree, M. D., Dahal, L., Wicky, B., and Clarke, J. (2016) Insights into coupled folding and binding mechanisms from kinetic studies. J. Biol. Chem. 291, 6689–6695. (20) Csermely, P., Palotai, R., and Nussinov, R. (2010) Induced fit, conformational selection and independent dynamic segments: an extended view of binding events. Trends Biochem. Sci. 35, 539–546. (21) Mollica, L., Bessa, L. M., Hanoulle, X., Ringkjøbing Jensen, M., Blackledge, M., and Schneider, R. (2016) Binding mechanisms of intrinsically disordered proteins: Theory, simulation, and experiment. Front. Mol. Biosci. 3, 52. (22) Santner, A. A., Croy, C. H., Vasanwala, F. H., Uversky, V. N., Van, Y., and Dunker, A. K. (2012) Sweeping away protein aggregation with entropic bristles: intrinsically disordered protein fusions enhance soluble expression. Biochemistry 51, 7250–7262. (23) Cheng, Y., LeGall, T., Oldfield, C. J., Mueller, J. P., Van, Y., Romero, P., Cortese, M. S., Uversky, V. N., and Dunker, A. K. (2006) Rational drug design via intrinsically disordered protein. TRENDS Biotechnol. 24, 435–442. (24) Zhang, Z., Mazouchi, A., Chong, A., Forman-

Kay, J., and Gradinaru, C. (2014) The conformations of the DrkN SH3 domain studied by single molecule fluorescence spectroscopy. Biophys. J. 106, 50a. (25) Sugase, K., Dyson, H. J., and Wright, P. E. (2007) Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature 447, 1021–1025. (26) Lacy, E. R., Filippov, I., Lewis, W. S., Otieno, S., Xiao, L., Weiss, S., Hengst, L., and Kriwacki, R. W. (2004) p27 binds cyclin-CDK complexes through a sequential mechanism involving binding-induced protein folding. Nat. Struct. Mol. Biol. 11, 358–364. (27) Cordeiro, T. N., Herranz-Trillo, F., Urbanek, A., Estaña, A., Cortés, J., Sibille, N., and Bernadó, P. (2017) Small-angle scattering studies of intrinsically disordered proteins and their complexes. Curr. Opin. Struct. Biol. 42, 15–23. (28) Kikhney, A. G., and Svergun, D. I. (2015) A practical guide to small angle X-ray scattering (SAXS) of flexible and intrinsically disordered proteins. FEBS Lett. 589, 2570–2577. (29) Baker, C. M., and Best, R. B. (2014) Insights into the binding of intrinsically disordered proteins from molecular dynamics simulations. WIREs Comput. Mol. Sci. 4, 182–198. (30) Ithuralde, R. E., Roitberg, A. E., and Turjanski, A. G. (2016) Structured and unstructured binding of an intrinsically disordered protein as revealed by atomistic simulations. J. Am. Chem. Soc. 138, 8742–8751. (31) Qin, F., Jiang, Y., Chen, Y., Wu, M., Yan, G., Ye, W., Li, Y., Zhang, J., and Chen, H.-F. (2011) Conformational selection or induced fit for Brinker and DNA recognition. Phys. Chem. Chem. Phys. 13, 1407–1412. (32) Zhang, Y., Tan, H., Chen, G., and Jia, Z. (2010) Investigating the disorder-order transition of calmodulin binding domain upon binding of calmodulin using molecular dynamics simulation. J. Mol. Recognit. 23, 360–368. (33) Zhou, G., Pantelopulos, G. A., Mukherjee, S., and Voelz, V. A. (2017) Bridging microscopic and macroscopic mechanisms of p53-MDM2 binding with kinetic network models. Biophys. J. 113, 785–793. (34) Noé, F., and Fischer, S. (2008) Transition networks for modeling the kinetics of conformational change in macromolecules. Curr. Opin. Struct. Biol. 18, 154–162. (35) Noé, F., Schütte, C., Vanden-Eijnden, E., Reich, L., and Weikl, T. R. (2009) Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc. Natl. Acad. Sci. U.S.A. 106, 19011–19016. (36) Bowman, G. R., Huang, X., and Pande, V. S.

17 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(2009) Using generalized ensemble simulations and Markov state models to identify conformational states. Methods 49, 197–201. (37) Qiao, Q., Qi, R., Wei, G., and Huang, X. (2016) Dynamics of the conformational transitions during the dimerization of an intrinsically disordered peptide: a case study on the human islet amyloid polypeptide. Phys. Chem. Chem. Phys. 18, 29892–29904. (38) Cordier, F., Hartmann, B., Rogowski, M., Affolter, M., and Grzesiek, S. (2006) DNA recognition by the Brinker repressor–an extreme case of coupling between binding and folding. J. Mol. Biol. 361, 659–672. (39) Sivasankaran, R., Vigano, M. A., Muller, B., Affolter, M., and Basler, K. (2000) Direct transcriptional control of the Dpp target omb by the DNA binding protein Brinker. EMBO J. 19, 6161–6172. (40) Berendsen, H. J., van der Spoel, D., and van Drunen, R. (1995) GROMACS: A messagepassing parallel molecular dynamics implementation. Comput. Phys. Commun. 91, 43–56. (41) Hess, B., Kutzner, C., van der Spoel, D., and Lindahl, E. (2008) GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4, 435–447. (42) Hornak, V., Okur, A., Rizzo, R. C., and Simmerling, C. (2006) HIV-1 protease flaps spontaneously open and reclose in molecular dynamics simulations. Proc. Natl. Acad. Sci. U.S.A. 103, 915–920. (43) Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., and Simmerling, C. (2006) Comparison of multiple AMBER force fields and development of improved protein backbone parameters. Proteins 65, 712–725. (44) Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W., and Klein, M. L. (1983) Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935. (45) Darden, T., York, D., and Pedersen, L. (1993) Particle mesh Ewald - An N log(N) method for Ewald sums in large systems. J. Chem. Phys. 98, 10089–10092. (46) Wickstrom, L., Okur, A., and Simmerling, C. (2009) Evaluating the performance of the ff99SB force field based on NMR coupling data. Biophys. J. 97, 853–856. (47) Best, R. B., Buchete, N.-V., and Hummer, G. (2008) Are current molecular dynamics force fields too helical? Biophys. J. 95, L07–L09. (48) Hess, B. (2008) P-LINCS: A parallel linear constraint solver for molecular simulation. J. Chem. Theory Comput. 4, 116–122. (49) Bussi, G., Donadio, D., and Parrinello, M. (2007) Canonical sampling through velocity

Page 18 of 20

rescaling. J. Chem. Phys. 126, 014101. (50) Parinello, M., and Rahman, A. (1981) Polymorphic transitions in single crystals: a new molecular dynamics method. J. Appl. Phys. 52, 7182–7190. (51) Nosé, S., and Klein, M. L. (1983) Constant pressure molecular dynamics for molecular systems. Mol. Phys. 50, 1055–1076. (52) Sugita, Y., and Okamoto, Y. (1999) Replicaexchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314, 141–151. (53) Patriksson, A., and van der Spoel, D. (2008) A temperature predictor for parallel tempering simulations. Phys. Chem. Chem. Phys. 10, 2073– 2077. (54) Zimmerman, M. I., and Bowman, G. R. (2015) FAST conformational searches by balancing exploration/exploitation trade-offs. J. Chem. Theory Comput. 11, 5747–5757. (55) Still, W. C., Tempczyk, A., Hawley, R. C., and Hendrickson, T. (1990) Semianalytic treatment of solvation for molecular mechanics and dynamics. J. Am. Chem. Soc. 112, 6127–6128. (56) Hawkins, G. D., Cramer, C. J., and Truhlar, D. G. (1995) Pairwise solute descreening of solute charges from a dielectric medium. Chem. Phys. Lett. 246, 122–129. (57) Prinz, J.-H., Keller, B., and Noé, F. (2011) Probing molecular kinetics with Markov models: metastable states, transition pathways and spectroscopic observables. Phys. Chem. Chem. Phys. 13, 16912–16. (58) Bowman, G. R., Beauchamp, K. A., Boxer, G., and Pande, V. S. (2009) Progress and challenges in the automated construction of Markov state models for full protein systems. J. Chem. Phys. 131, 124101. (59) Beauchamp, K. A., Bowman, G. R., Lane, T. J., Maibaum, L., Haque, I. S., and Pande, V. S. (2011) MSMBuilder2: Modeling conformational dynamics at the picosecond to millisecond scale. J. Chem. Theory Comput. 7, 3412–3419. (60) Senne, M., Trendelkamp-Schroer, B., Mey, A. S. J. S., Schuette, C., and Noé, F. (2012) EMMA: A software package for Markov state model building and analysis. J. Chem. Theory Comput. 8, 2223–2238. (61) Zimmerman, M. I., Porter, J. R., Sun, X., Silva, R. R., and Bowman, G. R. Choice of adaptive sampling strategy impacts state discovery, transition probabilities, and the apparent mechanism of conformational changes. arXiv:1805.04616 [qbio.BM]. (62) Bowman, G. R. (2012) Improved coarsegraining of Markov state models via explicit consideration of statistical uncertainty. J. Chem. Phys. 137, 134111.

18 ACS Paragon Plus Environment

Page 19 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Biochemistry

(63) Weinan, E., and Vanden-Eijnden, E. (2006) Toward a theory of transition paths. J. Stat. Phys. 123, 503–523. (64) Du, R., Pande, V. S., Grosberg, A. Y., and Tanaka, T. (1998) On the transition coordinate for protein folding. J. Chem. Phys. 108, 334–350. (65) Bolhuis, P. G., Chandler, D., Dellago, C., and Geissler, P. L. (2002) Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu. Rev. Phys. Chem. 53, 291–318. (66) Voelz, V. A., Bowman, G. R., Beauchamp, K. A., and Pande, V. S. (2010) Molecular simulation of ab initio protein folding for a millisecond folder NTL9 (1-39). J. Am. Chem. Soc. 132, 1526–1528. (67) Trott, O., and Olson, A. J. (2009) AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455–461. (68) Morris, G. M., Huey, R., Lindstrom, W., Sanner, M. F., Belew, R. K., Goodsell, D. S., and Olson, A. J. (2009) AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30, 2785–2791. (69) Savol, A. J., and Chennubhotla, C. S. (2014) Quantifying the sources of kinetic frustration in folding simulations of small proteins. J. Chem. Theory Comput. 10, 2964–2974. (70) Petterson, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C., and Ferrin, T. E. (2004) UCSF Chimera–a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605–1612. (71) Bowman, G. R., Pande, V. S., and Noé, F. (Eds.). (2014) An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation. Springer, Heidelberg. (72) Gentner, M. (2013) Folding of the transcription factor Brinker and interactions of the bacterial second messenger c-di-GMP studied by NMR. Ph.D. Thesis, University of Basel. (73) Iwahara, J., Zweckstetter, M., and Clore, G. M. (2006) NMR structural and kinetic characterization of a homeodomain diffusing and hopping on nonspecific DNA. Proc. Natl. Acad. Sci. U.S.A.

103, 15062–15067. (74) Iwahara, J., and Clore, G. M. (2006) Detecting transient intermediates in macromolecular binding by paramagnetic NMR. Nature 440, 1227–1230. (75) Koshland, D. E. (1958) Application of a theory of enzyme specificity to protein synthesis. Proc. Natl. Acad. Sci. U.S.A. 44, 98–104. (76) Changeux, J. P., and Edelstein, S. (2011) Conformational selection or induced fit? 50 years of debate resolved. F. Biol. Rep. 3, 19. (77) Monod, J., Wyman, J., and Changeux, J. P. (1965) On the nature of allosteric transitions: A plausible model. J. Mol. Biol. 12, 88–118. (78) Rubin, M. M., and Changeux, J. P. (1966) On the nature of allosteric transitions: Implications of non-exclusive ligand binding. J. Mol. Biol. 21, 265–274. (79) Hammes, G. G., Chang, Y. C., and Oas, T. G. (2009) Conformational selection or induced-fit: A flux description of reaction mechanism. Proc. Natl. Acad. Sci. U.S.A. 106, 13737–13741. (80) Milo, R. (2013) What is the total number of protein molecules per cell volume? A cell to rethink some published values. Bioessays 35, 1050–1055. (81) Slutsky, M., and Mirny, L. A. (2004) Kinetics of protein-DNA interactions: Facilitated target location in sequence-dependent potential. Biophys. J. 87, 4021–4035. (82) Sheong, F. K., Silva, D.-A., Meng, L., Zhao, Y., and Huang, X. (2015) Automatic state partitioning for multibody systems (APM): An efficient algorithm for constructing Markov state models to elucidate conformational dynamics of multibody systems. J. Chem. Theory Comput. 11, 17–27. (83) Plattner, N., Doerr, S., De Fabritiis, G., and Noé, F. (2017) Complete protein-protein association kinetics in atomic detail revealed by molecular dynamics simulations and Markov modelling. Nat. Chem. 9, 1005–1011.

19 ACS Paragon Plus Environment

Biochemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 20

Insert Table of Contents artwork here

ACS Paragon Plus Environment

20