pubs.acs.org/JPCL
Hidden Complexity of Protein Free-Energy Landscapes Revealed by Principal Component Analysis by Parts Abhinav Jain,† Rainer Hegger,‡ and Gerhard Stock*,† †
Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany, and ‡Institute of Physical and Theoretical Chemistry, Goethe University, 60438 Frankfurt, Germany
ABSTRACT A method to analyze molecular dynamics (MD) simulations of protein folding is proposed, which is based on a principal component analysis (PCA) of the protein's backbone dihedral angles. The protein is first partitioned into parts (e.g., its secondary structure elements), for each of which a separate PCA is performed. On the basis of these PCAs, the free-energy landscapes of each part are constructed and used to determine their metastable conformational states. In a second step, the various states of each protein part are employed to construct a product basis, which now represent the metastable conformational states of the full protein. Adopting extensive MD simulations of the villin headpiece by Pande and co-workers, it is shown that this “PCA by parts” allows us to characterizes the free-energy landscape of the protein with unprecedented detail. SECTION Macromolecules, Soft Matter
lassical molecular dynamics (MD) simulations using all-atoms force fields and explicit solvent are able to capture the structure, dynamics, and function of biomolecules in microscopic detail.1 Generating huge amounts of data, however, these simulations pose the problem of analyzing the data set obtained. To extract the essential information from the MD trajectories, dimensionality reduction methods are commonly employed which reduce the complex and highly correlated biomolecular motion from 3N atomic coordinates to a few collective degrees of freedom.2-17 In particular, principal component analysis (PCA) has been applied for this purpose.2-9 By diagonalizing the covariance matrix of the atomic coordinates, it has been shown that a large part of the system's fluctuations can be described in terms of a few PCA eigenvectors or principal components (PCs) Vi. The resulting low-dimensional representation of the dynamics can then be used to construct the free-energy landscape ΔG(V) = -kBT ln P(V), where P is the probability distribution of the molecular system along the PCs V = {V1, V2, ...}. Characterized by its minima (which represent the metastable conformational states of the systems) and its barriers (which connect these states), the energy landscape allows us to account for the pathways and their kinetics occurring in a biomolecular process.18-20 Recent simulations have revealed that the free-energy landscape of flexible small peptides and nucleic acids can be quite complex, showing numerous metastable conformational states.7,11-15 On the other hand, various computational studies of proteins have given a comparatively simple picture of their energy landscape.8-10 Although this may be expected at low energies where the protein fluctuates around its native state, it appears surprising at higher energies where the protein can reversibly fold and unfold. A protein consists of
several secondary structure segments such as R-helices, β-sheets, and turns, which need to be formed in the folding process. Assuming that the energy landscape of each part is already complex, one would naively expect that the complete landscape is even more involved. It has been discussed that this “hidden complexity” of the protein free-energy landscape may be caused by choosing a too small dimensionality of the landscape9,17,21 and by using Cartesian coordinates, which may suffer from the mixing of internal and overall motion.7 In this work, we propose a conceptionally simple method, termed PCA by parts, which is capable of resolving a protein energy landscape to any desired degree of detail. Given a MD trajectory of a biomolecular system, the idea is to first partition the molecule into parts (e.g., the secondary structure elements of a protein) and to perform a separate PCA for the coordinates of each part. From the PCA, the energy landscape of each part can be constructed and used to determine its metastable conformational states. In a second step, we employ the various states of each protein part to construct a product basis which characterizes the energy landscape of the full protein. To demonstrate the potential of the method, we adopt a fast folding variant of the villin headpiece subdomain (HP-35 NleNle).22-28 In this work, we use extensive (about 350 μs in total) simulations of HP-35 in explicit solvent, which were carried out by Pande and co-workers on the Folding@ home distributed computing platform.27 Figure 1 shows the structure of the native state of HP-35 as well as the starting structure used in the folding MD simulations of HP-35.
C
r 2010 American Chemical Society
Received Date: August 2, 2010 Accepted Date: September 1, 2010 Published on Web Date: September 07, 2010
2769
DOI: 10.1021/jz101069e |J. Phys. Chem. Lett. 2010, 1, 2769–2773
pubs.acs.org/JPCL
coordinates for the PCA. However, the energy landscape of this dihedral angle PCA (dPCA) shown in Figure 2B is similarly diffuse and structureless, as in the case of the cPCA. Even worse, it is found that the dPCA converges quite slowly with the number of included PCs; one needs 15 components to account for 60% of the system's fluctuations. A closer examination reveals that this effect is caused by significant nonlinear correlations between the individual PCs, which reflects the fact that the motion of adjacent backbone dihedral angles is necessarily correlated in a protein with relatively rigid secondary structures.30 Hence, many PCs are needed to account for the system's conformational dynamics, which hampers a conclusive low-dimensional representation of the free-energy landscape. The main idea of the PCA by parts (pPCA) is based on the assumption that each individual part of the molecule shows a structured free-energy landscape. In principle, this can be always enforced by making the parts smaller. Here, we partition HP-35 into five parts (Figure 1), helix-1 (residues 4-10), turn-1 (residues 11-14), helix-2 (residues 15-19), turn-2 (residues 20-22), and helix-3 (residues 23-32). The residues 1-3 and 33-35 at the termini of the protein are not considered in the analysis. Next, we perform a dPCA individually for each part. Interestingly, only a few (2-4) PCs are needed to cover more that 60% of the fluctuations of each individual part. That is, the dPCA works well for the secondary structure parts but not so for the complete protein (which requires many PCs). As shown in Figure 3, the resulting freeenergy landscapes of the secondary structure parts are wellresolved and exhibit numerous well-defined minima. Employing k-means clustering (see SI), we obtain 9, 4, 6, 2, and 7 clusters for helix-1, turn-1, helix-2, turn-2, and helix-3, respectively. Their position in the energy landscape (Figure 3) reveals that for the most part, the clusters coincide with the free-energy minima and therefore represent metastable conformational states. For later reference, we label the states of each part by a letter code, U for unfolded, Ij ( j = 1, ..., 5) for intermediates, Nj (j = 1, 2) for nearly native, and F for folded. On the basis of the clustering of the five parts, we now construct a time series C(t) from the MD trajectory of HP-35, where C = {Ci} is a five-dimensional vector, with Ci denoting the conformational state of part i. For example, at time t = 0, we start in the unfolded state shown in Figure 1B, C(0) = (UUUFI4), and at long times, a typical trajectory will end up in the folded state C(¥) = (FFFFF). In this way, each different combination of the states of the secondary structure parts corresponds to a specific and unique conformation of the complete protein. We note that this second step of the pPCA is crucial because it restores all structural correlations between the various parts. Although 9 4 6 2 7 = 3024 combinations are possible in the chosen clustering, only 1197 combinations are sampled by the MD trajectories due to conformational and energetic constraints of HP-35. Even much fewer states are sampled, notably; for example, the 10 most sampled states already cover half of the overall population (see Table 1 in SI). The fast convergence of the cumulative population with the number of pPCA states is depicted in Figure 4A, which also shows the metastability of
Figure 1. (A) Native state of the villin headpiece subdomain HP35. (B) Starting structure of the folding MD simulations of HP-35.27
Figure 2. Two-dimensional representations of the free-energy landscape ΔG(V1,V2) (in units of kBT) of HP-35 as obtained by the cPCA (A), dPCA (B), and pPCA (C and D). The latter show the 20 most metastable conformational states (1-10 in C and 11-20 in D) using the eigenvectors of the overall dPCA in (B).
(See the Supporting Information (SI) for details on the computational methods.) To get an overall impression of the free-energy landscape of HP-35, we first performed a standard PCA of the complete protein using the Cartesian coordinates of the backbone atoms (cPCA). The first four PCs clearly exhibit non-Gaussian distributions and contain ∼60% of the total fluctuations of the system (see Figure S1 in the SI). The resulting cPCA energy landscape along the first two PCs is shown in Figure 2A. It reveals the dominant native state of the system as well as a few metastable conformational states, which show up as purple-colored minima in the figure. Since the energy landscape is rather diffuse, though, only a few conformational states can be identified from an analysis of the cPCA landscape. As discussed by Mu et al.,7 this lack of structure may be caused by the mixing of internal and overall motion in Cartesian coordinates. This is because the standard procedure to separate these motions ; the superposition of the MD structures with a reference structure ; breaks down for folding molecules, where it is not possible to unambiguously define a single reference structure.29 To overcome this problem, it was proposed7 to use sine/ cosine-transformed dihedral angles instead of Cartesian
r 2010 American Chemical Society
2770
DOI: 10.1021/jz101069e |J. Phys. Chem. Lett. 2010, 1, 2769–2773
pubs.acs.org/JPCL
Figure 3. (Top) Free-energy landscapes ΔG(V1,V2) (in units of kBT)obtained from a dPCA of the various secondary structure parts of HP-35. (Bottom) The k-means clustering of the conformational states in these landscapes.
One can think of numerous applications of the pPCA outlined above. First, it represents a versatile method to generate conformational microstates from a MD trajectory. They can be used, for example, for the construction of Markov state models,14-16 protein folding networks,11,12 and invariant free-energy profiles.13 More generally speaking, the pPCA states can be employed as a, in principle, complete mathematical basis to represent various aspects of the molecular system. A prime example is the construction of the folding pathways of HP-35, which reveals that in most (but not all) cases, helix-2 and helix-3 form before helix-1 (manuscript in preparation). Our results support the findings of Bowman and Pande31 that the connectivity of states increases as one moves from the unfolded states to the native basin. Moreover, one can characterize the time evolution along some arbitrary reaction coordinate in terms of the pPCA states visited. As an illustration, Figure 4B shows the cumulative number of pPCA states visited by the ensemble of all trajectories plotted against the cumulative simulation time. The figure reveals that the individual trajectories may be quite different with respect to the number and nature of states that they visit. Moreover, the steady increase of the cumulative number of states indicates that the MD sampling of the folding structures of HP-35 is not yet converged on a 10 μs time scale. The sampling improves significantly when we include all ∼400 trajectories of Ensign et al.,27 which start from nine different unfolded initial structures. In conclusion, we have outlined a conceptually simple and, in practice, effective method to characterize the folding freeenergy landscape of proteins. We showed that a standard PCA of the complete protein can be problematic because of the mixing of the overall and internal motion (if Cartesian coordinates are used7) and because of nonlinear correlation between the individual PCs (if dihedral angle coordinates are used30). In the pPCA, we partition the protein in various parts, and perform a dihedral angle PCA for each part (which does not suffer from nonlinear correlation). The resulting structural resolution achieved depends on the partitioning (which may range from single residues, via secondary structures adopted here, to even protein domains) and the number of states
Figure 4. (A) Cumulative population probability (red) and metastability (i.e., the probability that a trajectory remains in this state within a lag time of 100 ns) of the pPCA states. (B) Cumulative number of pPCA states visited by the ensemble of all trajectories with the starting structure in Figure 1B, plotted against the cumulative simulation time. Each bin of the histogram accounts for the number of pPCA states visited by a specific trajectory of the ensemble.
these states (given by the diagonal elements of the transition matrix; see SI). As may be expected, most of the highly populated states are quite stable as well. As a further illustration, Figure 2C and D show the free-energy landscape constructed from the 20 most metastable pPCA states (where the trajectory segments during which the system is in these states were projected on the eigenvectors of the overall dPCA). Hence, the pPCA provides a means to identify with unprecedented resolution the metastable conformational states, which may be hidden in a conventional PCA.
r 2010 American Chemical Society
2771
DOI: 10.1021/jz101069e |J. Phys. Chem. Lett. 2010, 1, 2769–2773
pubs.acs.org/JPCL
considered in each clustering. We note that the method is in the spirit of the zipping and assembly approach of Dill and coworkers,32 in which proteins are folded in silico by first dividing up the protein sequence into several segments which are simulated separately and second joining the various segments to a simulation of the complete protein. Applied to the folding of HP-35, it has been shown that the pPCA allows us to identify the thermally populated metastable conformational states of the protein. Finally, we wish to stress that the divide and conquer strategy of the pPCA is readily applied to large molecules since the most time consuming step (i.e., the identification of the conformational states of all parts) scales linearly with the size of the system.
(10)
(11)
(12)
(13)
(14)
SUPPORTING INFORMATION AVAILABLE All details on the computational methods, cumulative fluctuations of all PCAs (Figure S1), and definition of the 25 most populated states of the pPCA (Table S1). This material is available free of charge via the Internet at http://pubs.acs.org.
(15)
AUTHOR INFORMATION
(16)
Corresponding Author: *To whom correspondence should be addressed. E-mail: stock@ physik.uni-freiburg.de.
(17)
ACKNOWLEDGMENT We thank Vijay Pande for providing the MD trajectories of HP-35 used in this study and Phuong Nguyen and Alexandros Altis for helpful discussions.
(18)
(19)
REFERENCES (1)
(2)
(3) (4) (5) (6)
(7)
(8)
(9)
van Gunsteren, W. F.; Bakowies, D.; Baron, R.; Chandrasekhar, I.; Christen, M.; Daura, X.; Gee, P.; Geerke, D. P.; Gl€ attli, A.; H€ unenberger, P. H.; Kastenholz, M. A.; Oostenbrink, C.; Schenk, M.; Trzesniak, D.; van der Vegt, N. F. A.; Yu., H. B. Biomolecular Modelling: Goals, Problems, Perspectives. Angew. Chem., Int. Ed. 2007, 45, 4064–4092. Ichiye, T.; Karplus., M. Collective Motions in Proteins: A Covariance Analysis of Atomic Fluctuations in Molecular Dynamics and Normal Mode Simulations. Proteins 1991, 11, 205–217. Garcia, A. E. Large-Amplitude Nonlinear Motions in Proteins. Phys. Rev. Lett. 1992, 68, 2696–2699. Amadei, A.; Linssen, A. B. M.; Berendsen., H. J. C. Essential Dynamics of Proteins. Proteins 1993, 17, 412–425. Kitao, A.; G, N. Investigating Protein Dynamics in Collective Coordinate Space. Curr. Opin. Struct. Biol. 1999, 9, 164–169. de Groot, B. L.; Daura, X.; Mark, A. E.; Grubm€ uller, H. Essential Dynamics of Reversible Peptide Folding: MemoryFree Conformational Dynamics Governed by Internal Hydrogen Bonds. J. Mol. Biol. 2001, 309, 299–313. Mu, Y.; Nguyen, P. H.; Stock, G. Energy Landscape of a Small Peptide Revealed by Dihedral Angle Principal Component Analysis. Proteins 2005, 58, 45. Lange, O. F.; Grubm€ uller, H. Can Principal Components Yield a Dimension Reduced Description of Protein Dynamics on Long Time Scales. J. Phys. Chem. B 2006, 110, 22842–22852. Maisuradze, G. G.; Liwo, A.; Scheraga, H. A. How Adequate Are One- And Two-Dimensional Free Energy Landscapes for Protein Folding Dynamics? Phys. Rev. Lett. 2009, 102, 238102.
r 2010 American Chemical Society
(20) (21)
(22)
(23)
(24)
(25)
(26)
(27)
(28)
2772
Das, P.; Moll, M.; Stamati, H.; Kavraki, L. E.; Clementi, C. LowDimensional, Free-Energy Landscapes of Protein-Folding Reactions by Nonlinear Dimensionality Reduction. Proc. Natl. Acad. Sci. U.S.A. 2006, 103, 9885–9890. Gfeller, G.; De Los Rios, P.; Caflisch, A.; Rao, F. Complex Network Analysis of Free-Energy Landscapes. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 1817–1822. Muff, S.; Caflisch, A. Kinetic Analysis of Molecular Dynamics Simulations Reveals Changes in the Denatured State and Switch of Folding Pathways upon Single-Point Mutation of a Beta-Sheet Miniprotein. Proteins 2008, 70, 1185–1195. Krivov, S. V.; Karplus, M. One-Dimensional Free-Energy Profiles of Complex Systems: Progress Variables That Preserve the Barriers. J. Phys. Chem. B 2006, 110, 12689–12698. Noe, F.; Horenko, I.; Sch€ utte, C.; Smith, J. C. Hierarchical Analysis of Conformational Dynamics in Biomolecules: Transition Networks of Metastable States. J. Chem. Phys. 2007, 126, 155102. Chodera, J. D.; Singhal, N.; Pande, V. S.; Dill, K. A.; Swope, W. C. Automatic Discovery of Metastable States for the Construction of Markov Models of Macromolecular Dynamics. J. Chem. Phys. 2007, 126, 155101. No e, F.; Sch€ utte, C.; Vanden-Eijnden, E.; Reich, L.; Weikl, T. Constructing the Full Ensemble of Folding Pathways from Short off-Equilibrium Simulations. Proc. Natl. Acad. Sci. U.S.A. 2009, 106, 19011–19016. Krivov, S. V.; Karplus, M. Hidden Complexity of Free Energy Surfaces for Peptide (Protein) Folding. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 14766–14770. Onuchic, J. N.; Schulten, Z. L.; Wolynes, P. G. Theory of Protein Folding: The Energy Landscape Perspective. Annu. Rev. Phys. Chem. 1997, 48, 545–600. Dill, K. A.; Chan, H. S. From Levinthal to Pathways to Funnels: The “New View” of Protein Folding Kinetics. Nat. Struct. Biol. 1997, 4, 10–19. Wales, D. J. Energy Landscapes; Cambridge University Press: Cambridge, U.K., 2003. Altis, A.; Otten, M.; Nguyen, P. H.; Hegger, R.; Stock, G. Construction of the Free Energy Landscape of Biomolecules via Dihedral Angle Principal Component Analysis. J. Chem. Phys. 2008, 128, 245102. Duan, Y.; Kollman, P. A. Pathways to a Protein Folding Intermediate Observed in a 1-Microsecond Simulation in Aqueous Solution. Science 1998, 282, 740–744. Snow, C. D.; Nguyen, H.; Pande, V. S.; Gruebele, M. Absolute Comaprison of Simulated and Experimental Protein Folding Dynamics. Nature 2002, 420, 102. Kubelka, J.; Eaton, W. A.; Hofrichter, J. Experimental Tests of Villin Subdomain Folding Simulations. J. Mol. Biol. 2003, 329, 625–630. Fern andez, A.; yi Shen, M.; Colubri, A.; Sosnick, T. R.; Berry, R. S.; Freed, K. F. Large-Scale Context in Protein Folding: Villin Headpiece. Biochem. 2003, 42, 664–671. Lei, H.; Wu, C.; Liu, H.; Duan, Y. Folding Free-Energy Landscape of Villin Headpiece Subdomain from Molecular Dynamics Simulations. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 4925–4930. Ensign, D. L.; Kasson, P. M.; Pande, V. S. Heterogeneity Even at the Speed Limit of Folding: Large-Scale Molecular Dynamics Study of a Fast-Folding Variant of the Villin Headpiece. J. Mol. Biol. 2007, 374, 806–816. Rajan, A.; Freddolino, P. L.; Schulten, K. Going beyond Clustering in Md Trajectory Analysis: An Application to Villin Headpiece Folding. PLoS One 2010, 5, e9890.
DOI: 10.1021/jz101069e |J. Phys. Chem. Lett. 2010, 1, 2769–2773
pubs.acs.org/JPCL
(29)
(30)
(31) (32)
H€ unenberger, P. H.; Mark, A. E.; van Gunsteren, W. F. Fluctuation and Cross-Correlation Analysis of Protein Motions Observed in Nanosecond Molecular Dynamics Simulations. J. Mol. Biol. 1995, 252, 492–503. Omori, S.; Fuchigami, S.; Ikeguchi, M.; Kidera, A. Latent Dynamics of a Protein Molecule Observed in Dihedral Angle Space. J. Chem. Phys. 2010, 132, 115103. Bowman, G. R.; Pande, V. S. Protein Folded States Are Kinetic Hubs. Proc. Natl. Acad. Sci. U.S.A. 2010, 107, 10890–10895. Ozkan, S. B.; Wu, G. A.; Chodera, J. D.; Dill, K. A. Protein Folding by Zipping and Assembly. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 11987–11992.
r 2010 American Chemical Society
2773
DOI: 10.1021/jz101069e |J. Phys. Chem. Lett. 2010, 1, 2769–2773