Using Dimensionality Reduction to Systematically Expand

Sep 2, 2016 - One of the approaches to improve our ability to characterize biologically important processes and to map out an underlying free energy l...
4 downloads 11 Views 6MB Size
Article pubs.acs.org/JCTC

Using Dimensionality Reduction to Systematically Expand Conformational Sampling of Intrinsically Disordered Peptides Oleksandra Kukharenko,* Kevin Sawade, Jakob Steuer, and Christine Peter* Theoretical Chemistry, University of Konstanz, 78547 Konstanz, Germany ABSTRACT: One of the approaches to improve our ability to characterize biologically important processes and to map out an underlying free energy landscape is to direct MD simulations to explore molecular conformational phase space faster. Intrinsically disordered systems with shallow free energy landscapes of a huge number of metastable minima pose a particular challenge in this regard. Both characterization of the often ill-defined conformational states as well as the assessment of the degree of convergence of phase space exploration are problematic. We have used a multidimensional scaling-like embedding (sketch-map) to describe the energetically accessible regions of phase space for a peptide fragment of the intrinsically disordered protein α-synuclein. Using sketch-map coordinates from a short initial simulation, we guided additional MD simulations to efficiently expand sampling of the conformational space. The sketch-map projections are very well suited to detect rare but possibly functionally relevant events, metastable intermediates, and transition states in the vast amount of data.

1. INTRODUCTION Intrinsically disordered proteins and peptides (IDPs) are biomolecules that lack well-defined secondary and tertiary structure. The great interest in this class of systems is motivated by an increasing awareness of their biological relevance.1,2 For example, several proteins that are considered intrinsically unstructured in solution, such as α-synuclein, polyQ, τ-protein, and Aβ, are involved in neurodegenerative diseases, such as Parkinson’s disease and Alzheimer’s syndrome.3,4 This is presumably due to their propensity for self-aggregation and fibril formation but possibly also due to their interaction with biomembranes.5 Often, these pathophysiologically relevant phenomena are accompanied by conformational transitions where the proteins change from their disordered state into a structurally more defined β-sheet-rich or α-helical state. IDPs pose a challenge for many experimental methods that are traditionally used to study proteins (NMR,6,7 IR,8 CD, SAXS,9 FRET,10 etc.) and probe well-defined structures or secondary structures.11,12 For example, NMR spectroscopy is an enormously powerful technique to obtain ensembleaveraged information about the conformations in solution, thus reducing the conformational ambiguity and better defining the free-energy landscapes of IDPs, even at amino acid resolution. It is also highly suitable for investigating dynamics. However, the characteristic time scales of IDP conformational transitions, the investigation of individual molecules, or the molecular mechanisms controlling protein folding upon binding remain a challenge.7 Here, atomistic molecular dynamics (MD) simulations can be a substantial supplement.13,14 Assuming ergodicity, MD produces a properly weighted statistical ensemble (e.g., canonical) of conformations © 2016 American Chemical Society

and a free energy landscape that can be used to characterize the states and transitions of disordered proteins.15,16 The ergodicity assumption, however, is problematic in the case of IDPs. They are characterized by a comparatively shallow free energy landscape with multiple local minima lacking a dominant global minimum, resulting in a multitude of low free energy states. Despite comparatively low energy barriers, their dynamics for transitions between different metastable states can be slow because of the enormity of conformational space. For obtaining a convergent, statistically relevant picture of the dynamics, i.e., an accurate representation of the Boltzmann-weighted distribution of conformations, extensive sampling is required. A number of methods has been developed to enhance the sampling ability of classical MD simulations. In one class of techniques, a simulation bias is added to force the interesting, rarer events to occur more often and to promote folding/ unfolding. Such methods include simulated annealing,17 replica exchange molecular dynamics (REMD),18 or replica exchange with solute scaling,19,20 which will improve the sampling but not actively drive the system away from already explored regions of phase space. Other methods such as reconnaissance metadynamics,21 local elevation,22 NMR-guided metadynamics,23 and so forth24 explicitly “push” the system from already known states. These latter methods typically require a priori definition of suitable reaction coordinates. The simulation bias forces the value of some particularly interesting collective variables (CVs) to change more rapidly than it otherwise would. Although this is more or less straightforward in the case Received: May 18, 2016 Published: September 2, 2016 4726

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734

Article

Journal of Chemical Theory and Computation

processing cores/32 threads with 12800 CPUh per μs simulation time. 2.2. Sketch-Map. Dimensionality reduction is a key element in identifying the regions of conformational phase space the system has visited and in subsequently grouping similar structural motives in the trajectories (clustering) and vice versa in determining sparse (low probability) regions. Direct clustering of high dimensional data usually faces a number of challenges: the concept of distance becomes less precise as the number of dimensions grows, meaning the notion of nearest and farthest points becomes meaningless and the data become sparse (known as “the curse of dimensionality”); not all dimensions are uncorrelated (i.e., they do not add any new information to the description), and it is harder to identify similarities in systems with many characteristics (dimensions). Therefore, for clustering of high dimensional data, usually some dimensionality reduction is used, either by projecting the data to low dimensional space or by clustering of subspaces of high dimensional space (simple examples are clustering based on RMSD, binning the Ramachandran plot for small peptides, etc.). Here, to reduce the number of dimensions, we used the sketch-map algorithm,29 which can project a highly nonlinear conformation space. The high dimensional coordinates that accurately characterize the conformations can be internal distances, Ramachandran angles, or other sets of collective variables. The algorithm is based on metric multidimensional scaling, and its advantage is that it focuses on an intermediate range of similarities/dissimilarities. It disregards information that corresponds to thermal fluctuations around (meta)stable structures as well as exact reproduction of large values of dissimilarities caused by the high dimensionality of space. It generates a set of low dimensional projections from a set of high dimensional landmark points (a small subset of the simulated data) by minimizing the function of dissimilarities of distances between these points

of a well-defined folded structure, the choice of appropriate CVs is not obvious in the case of partially ordered systems. A different category of methods, such as Markov State Models (MSM),25 directional milestoning,26 forward-flux sampling,27 known as state-based methods, combine information from shorter simulations between various locations in phase space to obtain global information, such as reaction paths and rates, instead of relying on the statistical analysis of long trajectories. As such, they may be more advantageous in systems with many metastable minima and multiple paths.24 However, they require the initial set of conformations to cover a wide area of the energy landscape. For shallow free energy landscapes with a vast amount of metastable states in particular, this can be problematic. For such systems, a need for sampling algorithms arises that neither rely on prior knowledge of the states the system visits nor on predefined reaction coordinates. In this paper, we propose a sampling scheme that uses a multidimensional scaling-like (MDS) embedding (sketchmap),28,29 to characterize and effectively explore the conformational space of IDPs. Sketch-map is a dimensionality reduction algorithm that generates a low-dimensional projection of a highly nonlinear high-dimensional conformational space. Using a fragment of the IDP α-synuclein as an example, we will demonstrate that the characteristics of the algorithm are particularly well-suited for ID systems with shallow conformational free energy landscapes. We use sketch-map coordinates of low-populated regions of conformational space with a large structural deviation from already sampled free energy minima to drive the exploration of conformational space. The relation of the proposed method to algorithms, such as diffusion-mapdirected MD30,31 and the free energy-guided sampling,32 will be discussed below. We will also show that the multidimensional scaling-like (MDS) embedding of simulation trajectories of IDPs allows for observing transitional pathways and identifying transitional states.

χ2 =

2. METHODS 2.1. System and Simulation Details. To illustrate the method that will be explained in section 2.3, we chose a 22residue long peptide fragment (42−63) of α-synuclein, which, as suggested in ref 33, includes a four-amino acid linker important for fibril formation. Its sequence in one-letter code is SKTKEGVVHGVATVAEKTKEQV. Because α-synuclein is highly interface active with conformational transitions that are driven by interface interactions, the peptide was placed at a water/vacuum interface. The presence of the interface further slows conformational dynamics, which pose additional sampling challenges and makes the chosen system and setup an ideal test case for our algorithm. All simulations were carried out on an atomistic level using the GROMACS 4.6.5 program package,34 SPC water,35,36 and the GROMOS 54A7 force field.37 Electrostatic interactions were handled by the particle mesh Ewald method (PME)38,39 with a real space cutoff distance of 1.0 nm and a van der Waals cutoff distance of 1.4 nm. A standard molecular dynamics simulation was performed using the leapfrog integrator with 2 fs integration steps and a velocity-rescaling algorithm40 for the temperature coupling at 300 K. No pressure coupling algorithm was introduced to ensure a stable interface. The system was prepared with the peptide immersed in the water phase close to the surface. The peptide moved to the interface and stayed there during the simulation with only very brief intervals of (partial) immersion. Simulations were performed on 16

∑ [s(R ij , A , B) − s(rij , a , b)]2 (1)

i≠j

where Rij and rij are dissimilarities (e.g., Euclidean distances) between high dimensional landmark points Xi and Xj and their low-dimensional projections xi and xj, respectively; s(r,a,b) is the following sigmoid function s(r , a , b) = 1 − (1 + (2a / b − 1)(r /σ )a )−b / a

(2)

which transforms all distances to values between zero and one. Selection of sigmoid function parameters σ, A, B, a, and b (which should save only important features of a phase space) is done by analyzing the distance distribution between high dimensional points. Using the projection of the initial landmark projections of all other points from high dimensional space, X, are found by iteratively minimizing a strongly nonlinear fitness function N

σ 2(x) =

∑ [s(R i(X), A , B) − s(ri(x), a , b)]2

→ min

i=1

(3)

where Ri(X) is the dissimilarity between X and ith landmark point, and ri(x) is the distance between point x and the projection of the ith landmark. 2.3. Expansion. The sketch-map projection generates dense basins from structurally close conformations. Therefore, rarely visited distinctly different conformations as well as conforma4727

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734

Article

Journal of Chemical Theory and Computation

Figure 1. Illustration of the application of the sketch-map algorithm to the initial MD simulation of an α-synuclein fragment (at the first step of the expansion algorithm, see section 3.1). (a) Coloring of the projected data points according to the fraction of residues adopting α-helix and β-sheet secondary structure elements. The projection allows segregating helical and β-sheet conformational states as well as unfolded motifs. The insets show how sketch-map groups together structurally similar, but conformationally different, elements. (b) The same projection as in (a) with probability density-based coloring. Black dots refer to the starting points for the expansion from the sparsely populated regions. Numbers correspond to the time step in the initial simulation (in multiples of 10 ps).

Steps 4 and 5 can be run simultaneously (projections can be calculated on the fly). There is a possibility of using certain stopping criteria at step 4. A simulation can be terminated in two cases: (a) if it falls back to already explored states or (b) if it has formed a new state and stays there for a predefined amount of time. Then, the method is divided into two stages: exploring and refining (with the help of other enhanced sampling techniques or without, as shown in refs 30−32). For the present case, we decided to run many parallel, comparatively long simulations without introducing any bias. This allows for directly using the obtained data for further analysis, e.g., to extract kinetic properties afterward using methods such as MSM and to estimate free energies for the states of interest.

tions that lie on transition paths between basins can be easily identified. By selecting starting conformations from the sparsely populated regions, we can increase the probability that the system explores so far unvisited regions of phase space. It was suggested29 that the sketch-map algorithm will generate meaningful coordinates even if the already mapped trajectory does not visit all of the energetically accessible parts of conformational space, so all of the additional simulations can be projected using the initially obtained and projected small subset of representative points (landmarks). The landmark points do not have to necessarily be present in all simulations. They can be generated separately and selected from initial simulations, and the set can also be extended during the broadening of conformational space. (Note: If selected landmarks poorly represent new conformational space, then either additional landmarks from new projections or totally new landmarks can be selected from all available trajectories.) The expansion scheme can be summarized in the following algorithm: (1) Run initial MD simulation(s) from the starting conformation(s). (2) Select landmark points from the simulation(s) and perform dimensionality reduction using the sketch-map algorithm. Project all other points from the simulation and obtain a low-dimensional representation of the free energy surface as a function of sketch-map coordinates. (3) Select N random points from the low density regions of the obtained projection. (4) Run new MD simulations from the selected conformations with new initial velocities and a short (a few picoseconds) position restraining of the peptide. (5) Make projections of new simulations using landmark points from step 2. Repeat step 2 with additional landmarks if sampling appears insufficient. (6) Repeat steps 3−5 until the sampled phase space converges: new states are not obtained.

3. RESULTS 3.1. Applying the Algorithm. As an initial step, we performed a 1 μs simulation of the 22-residue α-synuclein fragment starting from a helical conformation. For analysis, we selected frames every 10 ps (in total, 1 × 105 snapshots). Sketch-map coordinates were generated by selecting 2000 landmark points from the trajectory using a farthest point strategy. The full set of pseudotorsion angles between Cα atoms was used as the high-dimensional collective variables, resulting in 20 dimensions. Euclidean distances (taking into account the periodicity of torsional angle values) were used as a dissimilarity measure in high and low dimensional space. The optimal twodimensional projection was obtained with the following parameters for the sigmoid function σ = 5, A = 20, B = 6, a = 2, and b = 6 (see eq 2). The obtained projection is shown in Figure 1a. The figure nicely demonstrates one difficulty when one simulates IPDs. How does one assess convergence, i.e., exhaustive/sufficient exploration of phase space for a system that does not possess well-defined structures but is not entirely structureless either? How does one know that one has found all relevant states? For the (patho)physiological function of these 4728

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734

Article

Journal of Chemical Theory and Computation

Figure 2. Configurational phase space visited by the α-synuclein peptide fragment after applying the expansion algorithm (colored according to the Boltzmann inverted probability distribution). Free energy minima are shown as solid points and their representative structures are plotted as small inset figures. Numbering of the minima corresponds to their population. Triangles mark the free energy minima of the initial simulation. The plot in the upper right corner shows a comparison of the conformational space of the initial simulation (black) with the one obtained after the application of the algorithm (gray).

systems, it may be essential to capture certain low populated states of partially formed structures. Some structures may be relevant in conformational-selection scenarios when the systems undergo transitions to β-sheets in amyloid aggregates or to α-helices when interacting with membranes. The method suggested here can help to address these problems. Moreover, analysis of some clusters grouped together by sketch-map due to structural similarity (see snapshots in Figure 1a) suggests that neither a conventional clustering analysis based on atom positional root-mean-square deviations nor a secondarystructure-based analysis would easily have identified all those states. From the initial simulation shown in Figure 1a, we started the expansion. From the low density regions, we randomly selected 100 new starting points (Figure 1b) for the expansion in step 4. The new simulations amounted to a total of 10 μs, and again, timeframes were selected every 10 ps for the projection (i.e., a total of 11 × 105 analyzed snapshots). The same landmarks and parameters of the sigmoid function were used. After the analysis of the obtained projection, step 2 was repeated: 2000 new landmarks were selected from all obtained trajectories, and a new dimensionality reduction was conducted (with the same sigmoid function parameters). A further extension was prepared with 100 new simulations and an aggregated simulation time of 3 μs. The collection of landmark points was extended to 2500. The final projection as well as the evolution of the sampled configurational phase space can be seen in Figure 2. The projection in the inset nicely shows that the system has effectively explored new regions of

conformational space and sampled new structures that had not been visited during the initial simulation. Note that because of the bias in selecting conformations from sparse regions for the extension with the main aim of exploring the accessible configurational phase space, this final projection is in fact not a properly weighted depiction of the free energy landscape of the system. Nevertheless, it is useful to display the sampled landscape in terms of Boltzmann inverted probabilities for identifying the centers of local free energy minima and to define structures or states for subsequent analysis. The figure also shows representative structures of local free energy minima (with numbering indicating their population). Minima in the free energy surface were automatically detected by binning the two-dimensional projection to rectangles (cells) and selecting iteratively local maxima (most populated cells separated by change in the density gradient). Vice versa, bins that have only one point in them and are surrounded by empty bins are called low density or sparse regions. From the latter, new starting configurations for step 3 are selected. During the expansion, the size of the projection usually increases, but the size of the bins is kept constant, only their number increases. It is possible to use different definitions for dense/sparse regions, e.g., introducing some cutoff distance for points and counting neighbors within this cutoff. For minima shown in Figure 2, we analyzed which simulation trajectories contributed to them and whether minima had been repeatedly visited by several separate trajectories. This allows a first, cautious assessment of the convergence of the approach. For illustrating several scenarios that can occur during the 4729

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734

Article

Journal of Chemical Theory and Computation

Figure 4a nicely illustrates this by showing a short trajectory fragment in which a conformational transition occurs, as can be seen both in the (zoomed in) sketch-map projection and the conventional secondary structure plot. In such a case, the distribution of all (low dimensional) distances between pairs of conformations is bimodal, and the integral of the distribution exhibits distinct steps. Such distinct steps in the integral over the distance distribution indicate that a transition between states has happened. Transition points can be identified as those points where the difference in the distances to the previous and the subsequent points (as a function of time) is significant. Figure 4b presents a second example of a successful detection of an important event that might easily have been missed among a large amount of data. We have identified the formation of a full α-helix, which happened only once in 14 μs of simulation time. In addition, by using not only density information but also the coloring based on the secondary structure as an additional description for each point (coloring scheme is the same as in Figure 1a), one can identify regions of interest, even if they are poorly populated, and automatically select new simulations that contribute to those regions (although they were started in different locations of phase space). It is straightforward to use the basins of the sketch-map and run transition path sampling simulations if one is interested in this specific event for the current IDP. 3.3. Pushing the Algorithm. So far, the algorithm has been shown to work and to harness sketch-map to automatically assess and improve the sampling of conformational phase space of IDPs and to detect events even if these are only spurious folding events into partially formed, short-lived structures. However, the simulation data (a total of 14 μs of simulation time) presented so far do not yet conclusively show that the expansion algorithm, if brutally used for expansion only, is indeed significantly more efficient than brute-force MD. In the following example, the same α-synuclein peptide fragment had been prepared in a β-sheet starting conformation, immersed in water, and moved to a water/vacuum interface. After 100 ns of simulation time, it turns out that this starting conformation leads to a kinetically trapped strand-loop-strand structure at the interface with frequently forming and loosening β-sheets. In 1 μs MD simulation time, no significant conformational change could be observed (see Figure 5a). We used the expansion algorithm with stopping criteria to compare how much of the conformational space could be recovered by the β-sheet simulation at the same simulation time and whether the same metastable states as in the previous 14 μs simulation (starting from an α-helix) are visited during the expansion. The algorithm was applied in the following way: The first 100 ns of the simulation (bulk water phase) were fully excluded from the analysis. The following 100 ns were considered as an initial simulation (i.e., as step 1 of the algorithm). This means all new simulations were conducted at the water/vacuum interface. After projecting the initial simulation, 500 landmarks were selected. During the expansion steps, this number was increased to 1500 points. Sigmoid function parameters were changed as the distance distribution profile was changing while new conformations were explored. At step 3, 20 random frames were chosen from sparsely populated regions (which are defined by binning the projection as explained in section 3.1), and new simulations were started. When the new simulations reached 5 ns of simulation time, stopping criteria were applied

expansion, the trajectories that contributed to minima 1, 3, and 5 and the relative time they spent there are shown in Figure 3.

Figure 3. Illustration of how different free energy minima are “visited” during the initial simulation and the expansion simulations. Minima numbers correspond to Figure 2. The initial (1 μs) trajectory is marked as “initial”, and additional trajectories are labelled according to the time step at which their starting configuration had been taken from the initial trajectory (in multiples of 10 ps). Red dots show which time frames the trajectories spent in the respective state, and black solid lines indicate the full time span of the short trajectory relative to the initial simulation.

The deepest (first) minimum was present at the initial simulation and was extensively populated during the expansion stage by five more trajectories (which became “trapped” in this minimum). It would have been possible to use the stopping criteria and prevent the simulations from falling back into this minimum, thus making the exploration stage more efficient, but because we plan to subsequently use the data of this system for kinetic analysis, we decided not to use any stopping criteria here. The third minimum was populated by the initial simulation as well as by several new ones, but in this case, simulations were able to “escape” this metastable state. There are also minima that are populated only by new simulations (e.g., the fifth deepest minimum; see Figure 3). 3.2. Zoom In and Automated Event Detection. The free energy landscape of IDPs appears qualitatively inverted with respect to that of structured proteins.23 Meaning that the deepest minima will be populated by unstructured motifs, and partially folded conformations may appear as transitional states with relatively similar free energies and rapidly interconnect among each other. Detection of these events is important because they may be responsible for signaling and regulatory functions of proteins and possibly important in view of mechanistic studies of environment-induced conformations.41 The advantage of using sketch-map coordinates for objects such as IDPs is the possibility of using other descriptors (e.g., time, temperature of a replica, some energetic quantity, etc.) to pick small (but highly relevant details), which are easily overlooked. By automatically analyzing distance distributions in the short trajectories (or trajectory fragments), it is possible to detect metastable and transition states. 4730

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734

Article

Journal of Chemical Theory and Computation

Figure 4. Two examples of the automated detection of conformational transitions in short trajectories by analyzing the distribution of distances between pairs of (projected) structures. Upper panels: zoom in to the sketch-map projection of the short trajectory (with green or gray lines connecting the data points in time) with representative structures. Middle panels: Secondary structure plot (DSSP) of the respective trajectory. Bottom panels: Distribution of distances between all pairs of structures (after projection) and integral over these distributions. Example (b) shows the states leading toward the formation and breaking of a full α-helix, an extremely rare event. The separate substates are not easily identifiable on the secondary structure plot. Coloring of projected data points in (b) according to α-helical content.

to each of them with the following termination conditions: if a simulation falls to an already explored state (here we call a state any number of bins having more than a fixed amount of points) and stays there for 1 ns or if the simulation forms a new state but does not exit it during 0.2 ns, it should be terminated, and a new simulation had to be started from a new random point selected from sparse regions of the joint projection of the initial simulation and all expansion runs. If the stopping conditions were not fulfilled, the simulation was continued and checked again each 5 ns. We stopped the expansion when we reached a total of 900 ns of simulation time. Comparison of the projections of an initial 1 μs simulation (bulk phase excluded) with 1 μs aggregated simulations obtained using the algorithm (composed of 100 ns initial simulation and the 900 ns expansion) are shown in Figure 5. It is evident that during the same simulation time we were able to cover a much larger area of conformational space. By comparing this simulation with the one described in section 3.1, we see that, in this comparatively short simulation time, we were able to “visit” most of the deepest minima detected in 14 μs of simulation (Figure 5c). We would like to point out that Figure 5c could be misleading with respect to the continuity of trajectories. The blue points are a projection of a new simulation into a sketchmap of a previous one, i.e., the blue trajectories have not been well-represented by the landmarks used to span the space. Therefore, this figure cannot necessarily be used as an indication that sketch-map disrupts trajectories. It is only

meant as proof that the drastic expansion method had succeeded at “blasting” the system out of the deadlock structure (hairpin) at the interface; therefore, we did not resketch-map the two ensembles together (which would have avoided the strong impression of discontinuity and allowed for observing traces of small conformational transitions in the sketch-map projection). In general, projections to low dimensional space are not necessarily continuous even though, by construction, the trajectories generated by the dynamics are continuous and the expansion steps start in regions of conformation space that had been visited by a previous trajectory. It should be noted that, because of its focus on preserving intermediate distances, the sketch-map algorithm has been found to be well-suited for representing conformational transitions and continuity in phase space.29

4. DISCUSSION The previous sections show that sketch-map is excellently suited to reduce the dimensionality of conformational space and to project and group data in the case of intrinsically disordered systems with shallow free energy landscapes. Although other valuable methods for dimensionality reduction exist, we expect sketch-map to be particularly suitable for our expansion scheme compared to other noniterative linear methods, such as principal component analysis,42 time-lagged independent component analysis (TICA),43 or other nonlinear methods (locally linear embedding,44 Isomap,45 and diffusion maps46,47). For example, TICA uses the kinetic information of 4731

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734

Article

Journal of Chemical Theory and Computation

Figure 5. Comparison of plain MD and expansion simulations with stopping criteria. (a) RMSD plot of the 1.1 μs simulation with a β-sheet conformation as a starting structure. The blue dashed lines indicate the transition of the peptide from the bulk water phase to the water/vacuum interface. The part of the simulation that was used at the first step of the expansion algorithm is found within the blue and red dashed lines. (b) Conformational space obtained after the application of the expansion algorithm and plain MD. Gray points: 900 ns of expansion simulations; black points: 1 μs of plain MD simulation; red points: 100 ns of initial simulation (identical for both simulations), which were used for the expansion. (c) Expansion simulation (1 μs; blue points) projected on the conformational space (gray points) of the previous 14 μs simulation (see section 3.1). Green triangles mark the energy minima shown in Figure 2. Areas that are populated by plain 1 μs MD from (a) are marked with red circles.

dimensionality reduction. This might be challenging in terms of computational time with the growth of trajectory length. In refs 30 and 31, the locally scaled diffusion map was used to select starting conformations for new simulations in the exploring stage based on diffusion coordinates calculated as eigenfunctions of the kernel, which is derived from the root-mean-square displacement (RMSD) between all conformations and “local scale”. Determining the last parameter requires a Boltzmanndistributed set of molecular configurations and the performance of MDS for each conformation within some neighborhoods.49 In ref 32, an MSM was used to guide the exploration stage. At each step of clustering, RMSD calculations were conducted to define Markov states and a new starting point using the mean first passage time of obtained MSMs.

the system for extracting the slowest collective motion. This is a good choice for systems with a small number of well-defined metastable states, but all fast transitions will be inseparable, at least at the small number of dimensions. For the present (IDP) system, we found that to obtain a reasonable projection using TICA one needs more than 10 dimensions. In low dimensional space with significantly more dimensions than two, identifying truly sparse regions with a limited amount of data points is difficult. Diffusion map was successfully used for the exploration of conformational space for several systems with a limited number of states/metastable intermediates,30,31,48 but it requires calculating the distance matrix (and its eigenvalues) between all frames in all trajectories at each extension step to perform 4732

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734

Article

Journal of Chemical Theory and Computation

adaptively, resulting in an optimal redistribution to evaluate sparsity and find new points from which to expand. This distinguishes the method from other projection methods, where the a priori decision on the coordinates of the projection is a limitation (especially if one does not know where the system will eventually go). Here, the coordinates are optimally chosen depending on the current state of sampling. The algorithm is particularly suited for disordered systems with many metastable, partially structured states. It was successfully applied to a 22-residues peptide fragment of αsynuclein at a water/vacuum interface. This is a challenging system because of the vast number of conformational states and extremely long-lived metastable states populated by unstructured or partly structured motifs stabilized by the interface. At the same time, folded conformations, which are relevant for comparison with experimental measurements, are rare and unstable and cannot be easily distinguished by most methods relying on the dynamics of the system. It can effectively distinguish between different conformational motifs and separate them in two-dimensional space using values from high dimensional conformational descriptors, such as the torsion angles between Cα atoms. Projections of new data from the ongoing simulation can be done on the fly. Exploration of the accessible phase space can be very efficiently guided. Obtained data can be directly used for further analysis, such as MSM, and does not require further refinement or an unbiasing stage.

For an IDP system with an extremely high number of metastable states, using RMSD as the main dissimilarity criterion is problematic, the here presented MDS-based algorithm may have advantages. Applying the above-mentioned methods may also be challenging because there may not be a significant difference in the eigenvalues, meaning there is no time scale separation, which is important for MSM or big structural differences determined by a diffusion map. Additionally, sketch-map uses only a small number of reference structures (landmarks) to produce good out-of-sample embedding. It is worth mentioning that these coordinates can be used as collective variables for constructing bias potentials to enhance sampling.50 The proposed scheme can be embedded in automatic adaptive sampling packages, e.g., HTMD.51 At present, we have used MDS-based characterization only for expansion and exploration. We envision this method as a very first step in a more quantitative process toward a proper free energy landscape. Here, it merely explores the conformational “landscape” (heuristically and qualitatively). During this process, the landscape is dynamically clustered into states, and transitions are recorded. Thus, the expansion scheme is a search scheme rather than a sampling scheme, but it can be extended to become a sampling scheme by combining with TPS-like or forward-flux methods27 with sketch-map coordinates as the reaction coordinates to obtain more quantitative estimates of energy differences between states determined by our expansion scheme. A more rigorous use of the method to generate free energy landscapes, using reweighting algorithms such as TRAM,52 is planned.48,52−54 With the method proposed here, we were trying to address a particularly formidable challenge in the simulation of IDPs caused by their high conformational heterogeneity. We need to stress that, given the limitation of atomistic force fields, comparison with experimental data is crucial to assess the physical relevance of the simulated data. Here, we have refrained from interpreting the results with regard to the particular system and force field used. Previous force field comparison studies have demonstrated a high sensitivity of IDP conformational ensembles to differences between the force fields.55,56 There is a substantial interest in obtaining accurate ensembles of IDPs using all-atom simulations. We believe that our algorithm can also be very efficiently applied to a qualitative comparison of conformational phase space sampling of IDPs in different force fields.55,57 Furthermore, the data of the simulated ensembles can then be used for a quantitative comparison with experimental methods that give access to, for example, distance distributions such as NMR or EPR spectroscopy.



AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected]. *E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We thank I. Kevrekidis for very helpful discussions. This work was supported by a fellowship of the Zukunftskolleg of the University of Konstanz to O.K. and by the SFB969 funded from the German Science Foundation. Simulations were performed on the computational resource bwUniCluster funded by the Ministry of Science, Research and the Arts Baden-Württemberg and the Universities of the State of Baden-Württemberg, Germany, within the framework program bwHPC.



REFERENCES

(1) Tompa, P. Trends Biochem. Sci. 2002, 27, 527−533. (2) Dyson, H. J.; Wright, P. E. Nat. Rev. Mol. Cell Biol. 2005, 6, 197− 208. (3) Uversky, V. N.; Oldfield, C. J.; Dunker, A. K. Annu. Rev. Biophys. 2008, 37, 215−246. (4) Goedert, M. Science 2015, 349, 601. (5) Uversky, V. N.; Dunker, A. K. In Protein and Peptide Folding, Misfolding, and Non-Folding; Schweitzer-Stenner, R., Ed.; John Wiley and Sons, Inc.: Hoboken, NJ, 2012; pp 1−54. (6) Jensen, M. R.; Ruigrok, R. W.; Blackledge, M. Curr. Opin. Struct. Biol. 2013, 23, 426−435. (7) Jensen, M. R.; Zweckstetter, M.; Huang, J.; Blackledge, M. Chem. Rev. 2014, 114, 6632−6660. (8) Natalello, A.; Ami, D.; Doglia, S. M. In Intrinsically Disordered Protein Analysis: Vol. 1, Methods and Experimental Tools; Uversky, N. V., Dunker, K. A., Eds.; Humana Press: Totowa, NJ, 2012; Chapter Fourier Transform Infrared Spectroscopy of Intrinsically Disordered Proteins: Measurement Procedures and Data Analyses, pp 229−244.

5. CONCLUSIONS We proposed an expansion scheme to efficiently explore conformational space, which uses the sketch-map dimensionality reduction algorithm. An advantage compared to many other methods is that one neither needs to have a general knowledge of the states nor a reaction coordinate. The algorithm is best suited if one knows rather little of the conformations the system might eventually adopt and if it is very complex, i.e., while the sampling is expanding the picture of the conformational landscape of the system, which changes drastically. It allows for projecting newly sampled areas to the known landscape to assess the expansion. At the same time, because the algorithm allows resketching whenever necessary, the two coordinates that the ensemble is projected to change 4733

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734

Article

Journal of Chemical Theory and Computation (9) Bernado, P.; Svergun, D. I. Mol. BioSyst. 2012, 8, 151−167. (10) Hofmann, H.; Soranno, A.; Borgia, A.; Gast, K.; Nettels, D.; Schuler, B. Proc. Natl. Acad. Sci. U. S. A. 2012, 109, 16155−16160. (11) Uversky, V., Longhi, S., Eds. Instrumental Analysis of Intrinsically Disordered Proteins: Assessing Structure and Conformation; John Wiley and Sons, Inc.: Hoboken, NJ, 2010. (12) Uversky, N. V., Dunker, K. A., Eds. Intrinsically Disordered Protein Analysis: Vol. 1, Methods and Experimental Tools; Humana Press: Totowa, NJ, 2012. (13) Knott, M.; Best, R. B. PLoS Comput. Biol. 2012, 8, 1−10. (14) Burger, V. M.; Gurry, T.; Stultz, C. M. Polymers 2014, 6, 2684. (15) Stanley, N.; Esteban-Martín, S.; De Fabritiis, G. D. Prog. Biophys. Mol. Biol. 2015, 119, 47−52. (16) Qiao, Q.; Bowman, G. R.; Huang, X. J. Am. Chem. Soc. 2013, 135, 16092−16101. (17) Bernardi, R. C.; Melo, M. C.; Schulten, K. Biochim. Biophys. Acta, Gen. Subj. 2015, 1850, 872−877. (18) Sugita, Y.; Okamoto, Y. Chem. Phys. Lett. 1999, 314, 141−151. (19) Liu, P.; Kim, B.; Friesner, R. A.; Berne, B. J. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 13749−13754. (20) Wang, L.; Friesner, R. A.; Berne, B. J. J. Phys. Chem. B 2011, 115, 9431−9438. (21) Tribello, G. A.; Ceriotti, M.; Parrinello, M. Proc. Natl. Acad. Sci. U. S. A. 2010, 107, 17509−17514. (22) Huber, T.; Torda, A. E.; van Gunsteren, W. F. J. Comput.-Aided Mol. Des. 1994, 8, 695−708. (23) Granata, D.; Baftizadeh, F.; Habchi, J.; Galvagnion, C.; De Simone, A.; Camilloni, C.; Laio, A.; Vendruscolo, M. Sci. Rep. 2015, 5, 15449. (24) Rohrdanz, M. A.; Zheng, W.; Clementi, C. Annu. Rev. Phys. Chem. 2013, 64, 295−316. (25) Bowman, G. R., Pande, V. S., Noé, F., Eds. An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation; Springer Netherlands: Dordrecht, 2014; Vol. 797. (26) Májek, P.; Elber, R. J. Chem. Theory Comput. 2010, 6, 1805− 1817. (27) Allen, R. J.; Valeriani, C.; ten Wolde, P. R. J. Phys.: Condens. Matter 2009, 21, 463102. (28) Ceriotti, M.; Tribello, G. A.; Parrinello, M. Proc. Natl. Acad. Sci. U. S. A. 2011, 108, 13023−13028. (29) Ceriotti, M.; Tribello, G. A.; Parrinello, M. J. Chem. Theory Comput. 2013, 9, 1521−1532. (30) Zheng, W.; Rohrdanz, M. A.; Clementi, C. J. Phys. Chem. B 2013, 117, 12769−12776. (31) Preto, J.; Clementi, C. Phys. Chem. Chem. Phys. 2014, 16, 19181−19191. (32) Zhou, T.; Caflisch, A. J. Chem. Theory Comput. 2012, 8, 2134− 2140. (33) Shvadchak, V. V.; Subramaniam, V. Biochemistry 2014, 53, 279− 281. (34) Hess, B.; Kutzner, C.; van der Spoel, D.; Lindahl, E. J. Chem. Theory Comput. 2008, 4, 435−447. (35) Berendsen, H. J. C.; Postma, J. P. M.; van Gunsteren, W. F.; Hermans, J. Intermol. Forces 1981, 14, 331−342. (36) Berendsen, H. J. C.; Grigera, J. R.; Straatsma, T. P. J. Phys. Chem. 1987, 91, 6269−6271. (37) Schmid, N.; Eichenberger, A. P.; Choutko, A.; Riniker, S.; Winger, M.; Mark, A. E.; van Gunsteren, W. F. Eur. Biophys. J. 2011, 40, 843−856. (38) Darden, T.; York, D.; Pedersen, L. J. Chem. Phys. 1993, 98, 10089−10092. (39) Essmann, U.; Perera, L.; Berkowitz, M. L.; Darden, T.; Lee, H.; Pedersen, L. G. J. Chem. Phys. 1995, 103, 8577−8593. (40) Bussi, G.; Donadio, D.; Parrinello, M. J. Chem. Phys. 2007, 126, 014101. (41) Dalgicdir, C.; Globisch, C.; Peter, C.; Sayar, M. PLoS Comput. Biol. 2015, 11, 1−24. (42) Pearson, K. Philos. Mag. 1901, 2, 559−572.

(43) Pérez-Hernández, G.; Paul, F.; Giorgino, T.; De Fabritiis, G.; Noé, F. J. Chem. Phys. 2013, 139, 015102. (44) Roweis, S. T.; Saul, L. K. Science 2000, 290, 2323−2326. (45) Tenenbaum, J. B.; Silva, V. d.; Langford, J. C. Science 2000, 290, 2319−2323. (46) Nadler, B.; Lafon, S.; Coifman, R. R.; Kevrekidis, I. G. Appl. Comput. Harmon. Anal. 2006, 21, 113−127. (47) Kim, S. B.; Dsilva, C. J.; Kevrekidis, I. G.; Debenedetti, P. G. J. Chem. Phys. 2015, 142, 085101. (48) Das, P.; Frewen, T. A.; Kevrekidis, I. G.; Clementi, C. In Coping with Complexity: Model Reduction and Data Analysis; Gorban, N. A., Roose, D., Eds.; Springer: Berlin, Heidelberg, 2011; Chapter Think Globally, Move Locally: Coarse Graining of Effective Free Energy Surfaces, pp 113−131. (49) Rohrdanz, M. A.; Zheng, W.; Maggioni, M.; Clementi, C. J. Chem. Phys. 2011, 134, 124116. (50) Tribello, G. A.; Ceriotti, M.; Parrinello, M. Proc. Natl. Acad. Sci. U. S. A. 2012, 109, 5196−5201. (51) Doerr, S.; Harvey, M. J.; Noé, F.; De Fabritiis, G. J. Chem. Theory Comput. 2016, 12, 1845−1852. (52) Wu, H.; Paul, F.; Wehmeyer, C.; Noé, F. Proc. Natl. Acad. Sci. U. S. A. 2016, 113, E3221−E3230. (53) Hummer, G.; Kevrekidis, I. G. J. Chem. Phys. 2003, 118, 10762− 10773. (54) Mey, A. S. J. S.; Wu, H.; Noé, F. Phys. Rev. X 2014, 4, 041018. (55) Henriques, J.; Cragnell, C.; Skepö, M. J. Chem. Theory Comput. 2015, 11, 3420−3431. (56) Rauscher, S.; Gapsys, V.; Gajda, M. J.; Zweckstetter, M.; de Groot, B. L.; Grubmüller, H. J. Chem. Theory Comput. 2015, 11, 5513− 5524. (57) Best, R. B.; Mittal, J. Proteins: Struct., Funct., Genet. 2011, 79, 1318−1328.

4734

DOI: 10.1021/acs.jctc.6b00503 J. Chem. Theory Comput. 2016, 12, 4726−4734