J. Phys. Chem. B 2003, 107, 3606-3612
Sequential Collapse Folding Pathway of β-Lactoglobulin: Parallel Pathways and Non-Native Secondary Structure
Fernando Bergasa-Caceres*,† and Herschel A. Rabitz
Department of Chemistry, Princeton University, Princeton, New Jersey 08544
Received: January 23, 2003
In this paper, the sequential collapse model (SCM) for revealing protein folding pathways is applied to bovine β-lactoglobulin. The results for the dominant folding pathway are in general agreement with recent experimental data. An analysis based on the SCM suggests that, in addition to the dominant early intermediate specified by the best predicted primary contact, there are coexisting populations of early intermediates defined by the formation of less energetically favorable primary contacts. The issue of parallel folding pathways defined by primary contacts with different formation propensities is discussed. The main primary contact is energetically dominant with respect to the additional primary contacts. However, it is possible that the additional population of contacts might contribute to the non-native helical structure observed to accumulate in the early folding stages if there were kinetic constraints along the dominant pathway that led to an accumulation of misfolded early intermediates. A mutation experiment is suggested to reveal whether any non-native secondary structure contributes to the observed helical excess due to the additional folding pathways defined by the nondominant primary contacts. The relevance of these results to larger issues of protein folding is also discussed.
* To whom correspondence should be addressed. E-mail: Fbergasa@Endesa.Es. Phone: 34 91 213 97 73. Fax: 34 91 213 13 58.
† Current address: Endesa, C/Príncipe de Vergara 187, 28002 Madrid, Spain.

1. Introduction
Significant progress has been made in elucidating the folding mechanisms of globular proteins in recent years. One of the most influential proposals made to guide the development of theoretical efforts has been the hierarchical principle, which postulates that protein tertiary structures reflect the sequential accumulation of native-like secondary structure elements forming before non-local contacts defining the overall protein topology.1-3 This principle underlies folding models such as hierarchical condensation4 and framework.5 Other models such as diffusion-collision6 incorporate the hierarchical principle through the inclusion of secondary structure propensities; this inclusion is not fundamental to diffusion-collision, which could remain a viable folding model even if the hierarchical principle were shown to be wrong. The hierarchical principle is not essential to the operational models developed within the funnel context either, because calculations within the energy landscape theory of protein folding do not decisively hinge on secondary structure propensities, although they can be taken into account wherever necessary.7-12

The sequential collapse model (SCM) is a recent theoretical proposal that strives to explain how the primary sequence of relatively small-sized (∼100-150 amino acids) proteins controls their multistate folding pathway from the random coil state to the native structure.13 In the SCM, the earliest intermediates along the folding pathway are defined by the balance between the entropic cost of forming protein loops (i.e., the topology of the intermediate states) and the stabilization energy gained through the formation of hydrophobic contacts.13 The same factors account for the attainment of the native topology when the protein core undergoes hydrophobic collapse.14 Other interactions such as secondary structure propensities, salt bridges, and hydrogen bonds could also be taken into account within the SCM, but such an inclusion has not been found to be necessary at the level of resolution sought for the folding pathways studied to date. Because of this flexibility, the SCM belongs to the category of models for which the hierarchical principle is not a fundamental element. Without considering secondary structure propensities, the SCM has been able to account for the main elements of the multistate folding pathway of a number of proteins at low resolution, including cytochrome c,13 apomyoglobin,13 barnase,13 hen lysozyme,14 α-lactalbumin,14 and apoleghemoglobin.15

This paper is devoted to the study of the SCM folding pathway of β-lactoglobulin. In β-lactoglobulin, it has been observed that non-native α-helical structure accumulates on the submillisecond time scale, suggesting that the folding pathway of β-lactoglobulin involves the formation of non-native α-helical structures.16-18 The formation of non-native secondary structure apparently violates the hierarchical principle. The observation that the folding of β-lactoglobulin might be non-hierarchical has led to intense experimental16-19 and theoretical20 efforts to determine the nature of the intermediate states and the relevance of the non-native secondary structure observed to form early along the folding pathway.
Consideration of the related matter of an α ↔ β transition has received attention,21 including the possible relationship of such conformational interconversion with the pathogenesis of prion-related diseases.22,23

In this paper, the SCM will be applied to obtain predictions for the folding pathway of β-lactoglobulin, and the results will be shown to be consistent with recent experimental data on the early stages of the folding pathway of β-lactoglobulin.19 The calculation method will follow that previously applied to other proteins.13-15 The results suggest that β-lactoglobulin mostly folds through the dominant pathway leading to the native structure, lending indirect support to the presence of an α ↔ β interconversion mechanism along the main folding pathway. The results also suggest that there are smaller additional populations
10.1021/jp030087l CCC: $25.00 © 2003 American Chemical Society Published on Web 03/25/2003
Folding Pathway of β-Lactoglobulin
J. Phys. Chem. B, Vol. 107, No. 15, 2003 3607
of early intermediates folding through parallel pathways. Then, it is possible that the accumulation of non-native secondary structure in the early stages of the folding pathway of β-lactoglobulin might be at least partially accounted for by the existence of parallel folding pathways that do not lead to the native structure, in addition to the dominant pathway that does lead to the native structure. On the basis of these results, mutagenesis experiments are also suggested that could help clarify the matter. The consistency of the SCM with laboratory data lends further support to the assumption made in the SCM that the early stages of the folding pathway of globular proteins can be nonhierarchical. However, if the accumulation of non-native secondary structure were due to the presence of one or several frustrated additional folding pathways, then the presence of non-native secondary structure in the early folding stages of β-lactoglobulin would not constitute by itself evidence against the hierarchical principle.

2. Model
The SCM is outlined in full detail elsewhere.13 Here, only a brief review appropriate for the specific goals of this paper is presented. In the SCM, the free energy of formation of a successful contact is written as
$$\Delta G_{\mathrm{cont}} = \Delta G_{\mathrm{loop}} + \Delta G_{\mathrm{int}} \tag{1}$$
where ∆Gloop is the free-energy change associated with loop closure and is expected to be positive because loop formation defines a state that has fewer conformational possibilities than an open protein chain, thus inducing a large entropic loss ∆Sloop. The term ∆Gint represents all of the interactions that help to stabilize the contact, including hydrophobic interactions, van der Waals interactions, hydrogen bonds, disulfide bonds, and salt bridges,25 generally with ∆Gint < 0. Hydrophobic interactions have been postulated to constitute the main driving force of the folding process.26 Also, the hydrophobic effect has been observed to be dominant with respect to the secondary formation propensity.27 Thus, it is reasonable to write eq 1 as
$$\Delta G_{\mathrm{cont}} \approx \Delta G_{\mathrm{loop}} + \Delta G_{\mathrm{hyd}} \tag{2}$$
where ∆Ghyd is the free-energy change associated with the burial of hydrophobic groups and the gain of entropy in the solvent and is expected to be negative.23 For the proteins studied within SCM to date,13-15 it was shown that employing eq 2 suffices to determine the sequence of early contact-forming events. This theoretical observation is consistent with the hypothesis that a combination of entropic considerations and the hydrophobic effect is adequate to establish the native topology of proteins, which has been suggested in the context of observations dealing with the nature of protein-folding intermediate states.28 It is also consistent with recent experimental evidence suggesting that the formation of hydrophobic clusters in non-native form might play a significant role in determining the sequence of events along the folding pathway.29 This observation is significant because the native topology has been shown to be an important determinant of folding rates and thus probably of the dynamics of the entire folding process.30,31 It also suggests that the hierarchical principle might not be a universal rule. In the SCM, it was predicted that the dominant earliest contact-forming event in the main folding pathway of apomyoglobin involves an interaction between segments located in helices B and G in the native structure.13 The segment included in helix B in the native structure observed to be involved in the earliest folding events seems not to be fully native-like in this early stage.32
Upon the formation of a contact in a protein of N amino acids, three regions are distinguished within the SCM: the contact region c of length nc, the open connecting loop l of length nl, and the free ends or tails of combined length nt = N - (nc + nl). Up to a constant, ∆Gloop can be written as13
$$\Delta G_{\mathrm{loop}} \approx -kT\left[\, n_l \ln(f_l/f_0) + n_c \ln(f_c/f_0) - \tfrac{3}{2}\ln n_l \,\right] \tag{3}$$
where fi represents the conformational freedom of the amino acids in a given region i (i.e., i = l, c) of the protein, and the conformational freedom f0 of the amino acids in the free ends of the protein is taken to be the same as that of the amino acids in the random coil. ∆Gloop can be shown to have two minima13 as a function of loop length nl: a deeper primary one at 65 ≤ nl ≤ 85 amino acids, called the optimal loop length nop, and a shallower one at nl ≈ 3-4 amino acids. The shallow minimum represents the shortest length over which the protein chain can reverse its direction. Because a few amino acids are required to form a stable contact, we define a minimal loop to be generated by a protein contact between two segments of nc,min ≈ 3-5 amino acids linked by a turn of ∼4 amino acids. Thus, the minimal loop size is nmin ≈ nl + nc,min = 10-14 amino acids. In proteins of ∼100-150 amino acids, within the SCM, most of the contacts will form at nmin because no more than one or two primary contacts may form.14 This behavior is consistent with experimental evidence showing that short-range contacts predominate over long-range ones in protein structures.33 It is not an easy task to determine the difference ∆∆Gcont = ∆Gcont,op - ∆Gcont,min between the free energy of formation of a contact at nop and at nmin. A heuristic argument suffices, however, to show that the formation of initial contacts at nop should be strongly favored over the formation of contacts at nmin. For equivalent contact regions, ∆∆Gcont becomes
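$$\Delta\Delta G_{\mathrm{cont}} = \Delta\Delta G_{\mathrm{loop}} \approx kT\left[\, n_l \ln(f_l/f_{\mathrm{op}}) + \tfrac{3}{2}\ln(n_{\mathrm{op}}/n_l) \,\right] \tag{4}$$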
where fop ≈ f0. The conformational freedom f can be approximated to be proportional to the available volume per amino acid side chain in the loop defined by the contact (f ∝ Vloop^γ/nl, where γ is a constant and γ > 1 to account for the fact that there are several degrees of freedom in most side chains). The overall loop volume Vloop can then be approximated to evolve as Vloop ∝ nl^3, so the conformational freedom per amino acid scales as f ∝ nl^2. Taking nl = 4 amino acids and nop = 75 amino acids, we have kT nl ln(fl/fop) ≈ -23γkT and kT(3/2) ln(nop/nl) = 4.4kT. Thus, |nl ln(fl/fop)| ≫ |(3/2) ln(nop/nl)|, ∆∆Gloop ≪ 0, and the formation of initial contacts at nop should be strongly favored over formation of contacts at nmin.

On the basis of the formalism developed above, in the SCM, a detailed analysis13 argues that for proteins that are sufficiently long (∼100 amino acids) the folding pathway is likely initiated by the formation of a contact between segments located at the optimal distance nop of 65-85 amino acids. The formation of this initial contact, referred to as the primary contact, nucleates a burst phase leading to a multistate folding pathway that includes an intermediate state with many of the properties of a molten globule34 and is therefore referred to as the molten globule-like intermediate state (MGLIS).13 Depending on the location along the primary sequence of the hydrophobic amino acids, one or more primary contacts might be possible for a given sequence. Several populations of initial contacts may form if the free energy of formation is enough to stabilize distinct
contacts formed by different pairs of hydrophobic segments located at nop along the sequence. In apoleghemoglobin, it was found that there are two energetically equivalent primary contacts that probably initiate parallel folding pathways leading to the native structure.15 However, the formation of more than one primary contact does not necessarily imply the existence of more than one folding pathway leading to the native structure, as there might be constraints further downstream along a folding pathway that prevent it from reaching the native structure.

The formation of the initial contact is most likely followed by the folding of the residues located outside the primary loop, referred to as the tails of the protein. The next step in the SCM is a hydrophobic collapse of the protein core in which the native topology of the primary loop is established.14 The collapse within the SCM is followed by an optimization subphase governed by the activation barriers generated by the need to exclude water from the interior of the protein core to establish fully the interactions that stabilize the native structure.

Proteins that are shorter than 65-85 amino acids, henceforth referred to as short proteins in the SCM, must fold through the formation of loops close to the minimal size because their short length precludes the existence of an initial long-range contact. Because the formation of individual minimal loops is entropically much less favorable than the formation of optimal loops, the SCM suggests that small proteins fold through the simultaneous cooperative formation of several minimal loops in a process resembling a hydrophobic collapse.
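The statement that minimal loops are entropically much less favorable than optimal loops rests on the heuristic estimate of eq 4. As a quick numerical check, the following Python sketch evaluates the two terms of eq 4 in units of kT for nl = 4 and nop = 75. It assumes, as in the text, that f scales as nl^2, and it treats γ as a plain multiplicative prefactor (which is what the quoted value of -23γkT suggests). The function name and its parameters are illustrative only and do not come from the SCM papers.

```python
import math

def ddg_loop_terms(n_l=4, n_op=75, gamma=1.0):
    """Return the two terms of eq 4 and their sum, all in units of kT.

    Assumptions (for illustration only): f is proportional to n**2, so
    f_l / f_op = (n_l / n_op)**2, and gamma enters as a plain prefactor.
    """
    # n_l * ln(f_l / f_op): conformational-freedom term
    conformational = gamma * n_l * math.log((n_l / n_op) ** 2)
    # (3/2) * ln(n_op / n_l): chain-closure term
    chain_closure = 1.5 * math.log(n_op / n_l)
    return conformational, chain_closure, conformational + chain_closure

conf_term, closure_term, total = ddg_loop_terms()
print(f"n_l ln(f_l/f_op)    = {conf_term:7.2f} kT")
print(f"(3/2) ln(n_op/n_l)  = {closure_term:7.2f} kT")
print(f"ddG_loop            = {total:7.2f} kT")
```

With γ = 1 the first term comes out close to -23 kT and the second close to 4.4 kT, in line with the values quoted for eq 4, so the sum is strongly negative.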
Such a mechanism would be consistent with experimental evidence that shows that small proteins tend to fold through a two-state collapse-like transition.30 There also may be long proteins falling outside this model in which a stable primary contact does not form and for which the folding mechanism should be similar to that of short proteins. The SCM is in the same spirit as recent theoretical efforts to develop simple models that are able to describe the intermediates along the folding pathway in the context of the so-called “funnel” view of the folding process.7-12,35,36 The SCM, however, shares much of the “old” view of the folding process in which the protein descends toward the free-energy minimum through few intermediate steps.37 Although the SCM does not preclude the possibility that a protein might fold through several parallel pathways,13 it suggests that there is a strong preference for those pathways that minimize the conformational entropy loss upon the formation of primary contacts. Also, because nop is relatively large, the number of possible primary contacts (i.e., the nucleation event that initiates the folding pathway) for a given sequence is likely to be small. The “new” view embodied in the funnel picture postulates instead a large number of intermediates, especially in the early folding stages. However, these two views need not be antagonistic because recent theoretical results show that some folding pathways might be strongly statistically preferred in a free-energy landscape that allows for a multiplicity of folding pathways.38 In the next section, the SCM folding pathway of β-lactoglobulin is calculated. The calculation does not make use of the hierarchical principle, relying only on partition coefficients (i.e., hydrophobicity) to determine the sequence of events along the folding pathway. 2.1. Calculational Method. 
In this paper, we closely follow the method introduced previously13 to determine the SCM folding pathway of apomyoglobin, cytochrome c, barnase, ribonuclease A, hen lysozyme, α-lactalbumin,14 and apoleghemoglobin.15 The procedure is presented below. The primary contact is determined by the minimum value of ∆Ghyd for segments of five amino acids located 65 to 85 residues apart along the sequence. Since the identification of the primary contact is determined by the hydrophobicity of the segments forming the contact, polarity values obtained from the Fauchère-Pliska scale39 were assigned to each residue. To summarize the procedure, the hydrophobicity Pk of each residue is added over a contact segment of five amino acids centered at residue i, resulting in a polarity of Pi. To determine the best contact, the Pi value of a segment centered at residue i is added to the Pj value of a segment centered at residue j, that is, 65 to 85 residues away from i, to give a contact propensity of Pij = Pi + Pj. The ij pair along the sequence separated by 65 to 85 residues that produces the highest value of Pij is selected as the primary contact. Differences in Pij of ∼0.4 reflect differences in ∆Ghyd of ∼kT.39

For the purpose of determining the activation barriers Hj governing the formation of native structure in the cooperative collapse, the amino acids L, W, F, V, M, and I were assigned hydrophobicity values from the Fauchère-Pliska scale. All other amino acids were considered to be non-hydrophobic and were assigned a hydrophobicity of zero. For the calculations, a segment size of 15 amino acids was chosen. This length is equivalent to that of a minimal loop13 and is long enough that even the largest possible secondary structure elements, the omega loops,40 could be detected in the cooperative collapse sequence. The results were seen to be robust for windows between 13 and 17 amino acids, where a result was considered to be robust if any displacement in the location of the minima of Hj upon a change in window size was less than ∼1/4(window size) ≈ 3-4 amino acids. A more refined model should consider the native-like topology established at the time of collapse (i.e., the loops defined upon collapse)14 to determine the optimal window sizes to be considered for the calculation of Hj.
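The primary-contact search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the two-letter toy scale, and the toy sequence are hypothetical, and the separation is measured between segment centers, which is an assumption about how "65 to 85 residues apart" is counted. A real calculation would pass the Fauchère-Pliska values as `scale`.

```python
def segment_scores(seq, scale, window=5):
    """Sum per-residue values over every full window.

    Returns a dict mapping the 1-based center residue to the window sum.
    Assumes an odd window size.
    """
    half = window // 2
    values = [scale[aa] for aa in seq]
    return {
        center + 1: sum(values[center - half:center + half + 1])
        for center in range(half, len(seq) - half)
    }

def rank_primary_contacts(seq, scale, window=5, min_sep=65, max_sep=85, top=5):
    """Rank segment pairs (i, j) by P_ij = P_i + P_j, highest first."""
    p = segment_scores(seq, scale, window)
    pairs = [
        (p[i] + p[j], i, j)
        for i in p
        for j in range(i + min_sep, i + max_sep + 1)  # centers 65-85 apart
        if j in p
    ]
    pairs.sort(key=lambda item: item[0], reverse=True)
    return pairs[:top]

# Toy input (hypothetical: not a real protein and not a real hydrophobicity scale).
toy_scale = {"G": 0.0, "L": 1.0}
toy_seq = "G" * 10 + "L" * 5 + "G" * 65 + "L" * 5 + "G" * 10

best_pij, i, j = rank_primary_contacts(toy_seq, toy_scale)[0]
print(best_pij, i, j)
```

A raw ranking of this kind also returns windows shifted by one or two residues from the best pair, so near-duplicates would need to be merged by hand before listing distinct contacts. The same `segment_scores` helper, called with `window=15` and a scale that assigns zero to every residue other than L, W, F, V, M, and I, gives an Hj-style profile.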
The mechanism by which the topology of the primary loop is established upon collapse remains an open issue within the SCM. The hydrophobicities Pk of 15 consecutive residues centered at residue j are summed, resulting in a hydrophobicity value of Hj. The Hj's are calculated for all possible segments of 15 amino acids along the protein sequence. To determine the sequence of folding events, the 15 amino acid segments with the lowest Hj's that do not overlap with each other are sequentially chosen. These segments are assumed to reach their native structure in increasing order of Hj because the activation energy Ej for each protein segment to exclude water fully in the SCM is taken as being proportional to its hydrophobicity represented by Hj (i.e., a large Hj value means that more water needs to be excluded upon structure optimization).

The calculation method, employing only partition coefficients, is consistent with the underlying assumption in the SCM that secondary structure propensities do not play a critical role in determining the early sequence of contact-forming events along the folding pathway.13 However, secondary structure propensities could be included if deemed necessary, for example, by adding a term reflecting the secondary structure free energy of formation to the contact-formation propensity Pij. This additional term could be derived from experimental data to be fully consistent with the chosen hydrophobicity coefficients.27

3. Results
In this section, results are presented for the folding pathway of bovine β-lactoglobulin and compared with existing experimental and theoretical data. Bovine β-lactoglobulin is a 162-residue protein of known structure. The protein contains a large proportion of β structure, including a nine-strand antiparallel
TABLE 1: Five Best Predicted Primary Contacts for β-Lactoglobulin and Their Associated Contact-Formation Propensity Pij^a

segments defining the contact    Pij
19-23 on 103-107                 13.1
29-33 on 103-107                 12.4
54-58 on 119-123                 12.0
38-42 on 103-107                 11.6
78-82 on 143-147                 11.4

a Differences in Pij of ∼0.4 reflect differences in ∆G of ∼kT.
β-barrel. It also contains a major α-helix and four short helical segments.41 Its folding pathway has been probed through time-resolved CD spectroscopy,16-18 and hydrogen exchange rates for over 80 residues are known.18 The early stages of the folding pathway of β-lactoglobulin have recently been probed on the submillisecond time scale19 by a combination of ultrarapid mixing techniques with fluorescence detection42,43 and hydrogen exchange.44-46

3.1. Primary Contacts of β-Lactoglobulin. The predicted primary contacts and their associated Pij values are shown in Table 1. The best predicted primary contact (i.e., the dominant primary contact) for β-lactoglobulin is established between segments 19-23 and 103-107 with Pij = 13.1. Residues 19-23 are included in the βA strand, and 103-107 are included in the βF strand in the native structure. The next best contact is established between segments 29-33 (or 28-32) and 103-107 with Pij = 12.4. The third best contact is established between segments 54-58 and 119-123 with Pij = 12.0. The fourth best primary contact is established between segments 38-42 and 103-107 with Pij = 11.6. The fifth best contact is established between segments 78-82 and 143-147 with Pij = 11.4, energetically equivalent at this level of resolution to contact [38-42, 103-107]. All other contacts are significantly less stable, Pij ≤ 11 (i.e., a difference in free energy with respect to the best predicted contact ≥ 5kT), which makes it likely that they can already be discarded as viable folding initiation points for a relevant population of molecules. The differences in Pij between the best and the additional contacts are significant, ∆Pij,1-2 ≈ 0.7 and ∆Pij,1-5 ≈ 1.7, reflecting free energy differences of ∆Gij,1-2 ≈ 1.7kT and ∆Gij,1-5 ≈ 4.2kT, respectively.
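The conversion from Pij differences to free-energy differences can be reproduced directly from the Table 1 values, using the stated rule that a difference of ∼0.4 in Pij corresponds to ∼kT. The Python sketch below does this; the function name is illustrative, and the Boltzmann factor printed alongside each gap is an added assumption of simple equilibrium weighting, included only to give a feel for the magnitudes and not taken from the SCM analysis.

```python
import math

P_PER_KT = 0.4  # a P_ij difference of ~0.4 corresponds to ~1 kT (rule stated in the text)

# P_ij values copied from Table 1.
TABLE_1 = {
    "19-23 on 103-107": 13.1,
    "29-33 on 103-107": 12.4,
    "54-58 on 119-123": 12.0,
    "38-42 on 103-107": 11.6,
    "78-82 on 143-147": 11.4,
}

def gaps_in_kt(pij_by_contact, p_per_kt=P_PER_KT):
    """Free-energy gap of each contact relative to the best one, in units of kT."""
    best = max(pij_by_contact.values())
    return {name: (best - p) / p_per_kt for name, p in pij_by_contact.items()}

for name, gap in gaps_in_kt(TABLE_1).items():
    # exp(-gap) is an illustrative equilibrium weight (our assumption, see lead-in).
    print(f"{name}: dG = {gap:.2f} kT, exp(-dG/kT) = {math.exp(-gap):.3f}")
```

The gaps for the second and fifth contacts come out at 1.75 kT and 4.25 kT, which correspond to the approximate values of 1.7kT and 4.2kT quoted above.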
However, the Pij's for the additional contacts are comparable to the contact-formation propensities associated with the dominant primary contact leading to the dominant native folding pathway in several proteins studied within the SCM.13-15 The existence of several possible primary contacts with high Pij's suggests that at least the second, and maybe the third to fifth best primary contacts in β-lactoglobulin might define significant additional early populations of partially folded molecules that are different from those defined by the dominant primary contact. These populations could be significant if there were kinetic constraints along the dominant folding pathway that led to an accumulation of the additional intermediates defined by the non-dominant primary contacts.

Tryptophan quenching shows that at least one of the two tryptophan residues, W19 and W61 of β-lactoglobulin, is buried early along the folding pathway.19 Residue W19 lies in the dominant primary contact, and the SCM predicts that it is buried early along the folding pathway. W61 is included in the open primary-loop region in all but the fifth early intermediate. The SCM predicts that it should be significantly more solvent-exposed in the early folding stages than W19. W61 is highly quenched under native conditions,47 and its quenching takes place late along the folding pathway.19 Burst phase labeling experiments show that native hydrogen bonding within the
segments included in the dominant primary contact takes place within the earliest resolved stages, on a time scale of ∼1.8 ms.15 Leu 103, Leu 104, Phe 105, and Met 107, included in the primary contact, show very high proton occupancies (above 1) after 5 s of refolding both from a denatured state in 40% TFE and pH 3 and from a denatured state in 6 M Gdn-HCl and pH 3.18 Tyr 102, adjacent to the primary contact, also shows very high proton occupancy in the same experiment.18 Only Val 123, Val 118, and Met 140 show comparable proton occupancies in the same experiment.18 Also, Leu 103 and Leu 104 show the highest proton occupancies after 5 s of refolding from a denatured state in 6 M Gdn-HCl and pH 3.18 Protection within the 19-23 segment is, however, lower than within the 103-107 region in the burst-phase labeling experiment.19 It is possible that, because the βB strand remains unfolded within the primary loop of the dominant pathway, the formation of native hydrogen bonds of segment 19-23 takes place at a slower rate than that of the 103-107 segment. Most of the 19-23 segment in the predicted dominant primary contact is included in region 12-21. It has been hypothesized that the 12-21 region has persistent non-native secondary structure in the early folding stages because the adjacent region that becomes the βB strand in the native structure appears unfolded at this stage.19 The SCM does not require that a primary contact have native-like secondary structure, as it assumes that hydrophobic interactions are sufficient to determine the location of the contact. Thus, in the SCM, it is possible that segment 19-23 remains non-native for a longer time than segment 103-107, even though they are closely associated in a hydrophobic contact.
The region between the two segments included in the dominant primary contact shows much smaller early protection factors in the same experiment.19 No amino acids in segment 29-33 included with segment 103-107 in the second best predicted primary contact were probed in the burst phase labeling experiment. Segment 119-123, included in the third best primary contact and forming most of βH in the native structure, is seen to reach its native structure in the earliest resolved time scales in the burst phase labeling experiment. Probes in segment 54-58 also included in the third best primary contact and forming most of βC in the native structure did not show high protection factors in the same experiment.19 Segment 119-123 is included in the C-terminal tail defined by the dominant primary contact, and the SCM suggests that it should fold early, following the dominant pathway. Probes in segment 38-42, defining with segment 103-107 the fourth best primary contact and partially included in βB, do not show high protection factors in the burst phase labeling experiment.19 Burst phase probes in segment 143-147 included in the fifth best primary contact do not show high protection factors. Most amino acids within segment 78-82 that were partially included in βE were also not probed.19 In the next two sections, the remaining folding phases of the dominant and the two subsequent best SCM folding pathways of β-lactoglobulin will be described. Then, in section 3.5, the issue of non-native secondary structure accumulation in the early folding stages will be analyzed in connection with the additional folding pathways defined by the parallel primary contacts described above. 3.2. MGLIS of the Dominant Folding Pathway of β-Lactoglobulin. The predicted dominant folding pathway is described in Figure 1. The MGLISs associated with the dominant and the next best three primary contacts are described in Table 2. 
The dominant primary contact [19-23, 103-107] implies an MGLIS folded region comprising the N terminus of the protein chain
Figure 1. SCM dominant folding pathway for β-lactoglobulin from the unfolded state U. Boxes indicate the regions that are likely to form topologically native-like contacts in each phase. The windows considered for the optimization subphase were 15 amino acids long. A more refined model could consider the native-like topology established at the time of collapse (i.e., the loops defined upon collapse14) in order to determine the optimal window size to be considered in the optimization subphase. Region 88-102 has been observed to enter the folding pathway at an earlier stage,19 possibly because of interactions with the compact region of the MGLIS. (See the text.)
TABLE 2: Folded Regions Included in the Molten Globule-like Intermediate States (MGLIS) Associated with the Five Best Primary Contacts^a

primary contact       regions in MGLIS      consistency with experimental data13
[19-23, 103-107]      1-19 and 103-162      consistent
[29-33, 103-107]      1-32 and 103-162      consistent
[54-58, 119-123]      1-53 and 119-162      not fully consistent
[38-42, 103-107]      1-42 and 103-162      consistent
[78-82, 143-147]      1-77 and 143-162      not fully consistent

a An MGLIS is considered to be consistent with the burst phase labeling data19 if the folded regions included in it show significant native structure in the earliest resolved time scale.
and segment 19-23 along with the C terminus and segment 103-107. This result is consistent with experimental observations showing that the regions included in the dominant MGLIS of β-lactoglobulin are the only ones to exhibit persistent native secondary structure on the millisecond time scale, whereas the rest of the chain remains largely unprotected except for the 90-100 region that also shows a high degree of early protection.19 The best primary contact is established between segments located 84 amino acids apart, close to the upper limit of nop. Then, it is possible that region 90-100 including βF could fold before the hydrophobic collapse of the primary loop, interacting with βG and βH included in the folded region of the MGLIS, defining an open fluctuating primary loop somewhat shorter than the one defined initially by the dominant primary contact [19-23, 103-107]. The MGLIS predicted by the SCM includes long-range interactions between the N- and C-terminal regions of the protein. An alternative explanation of the experimental results would be to propose that native-like structure is attained independently in both regions.19 The issue of whether global or local interactions take place first along the folding pathway of β-lactoglobulin will be settled only when better time resolution for the folding pathway is available48 and would constitute a strong test for the SCM.

3.3. Optimization Subphase of the Dominant Folding Pathway of β-Lactoglobulin. According to the calculated Hj values shown in Figure 2, the regions comprising and around residues 69-70 (including βD in the native structure), residues
Figure 2. Relative hydrophobicity Hj versus the center of the 15 amino acid segments for β-lactoglobulin. The plot shows that inside the primary loop the native structure should be attained sequentially along the dominant folding pathway by the region centered on residues 70, 44, and 95. The region around residue 95 has been observed to enter the folding pathway at an earlier stage,19 possibly because of interactions with the compact region of the MGLIS. Hj is equivalent for the regions centered on residues 44 and 95. Windows shorter than 15 amino acids at the ends of the plot were set equal to zero.
40-48 (including βB), and residue 95 (including βF) inside the primary loop should sequentially attain their native structure before the rest of the primary loop. The early folding of the region around residue 95 has already been discussed above. Observed folding rates in a proton-exchange experiment for refolding times of 5 ms to 1 s for the region around residues 69-70 are on average slightly higher than those observed for the region around residues 40-48.19 However, more detailed experimental information is needed to determine whether the sequence of events suggested here correctly represents the late folding stages of β-lactoglobulin.

3.4. MGLISs of the Additional Primary Contacts. The predicted additional MGLISs are described in Table 2. The second best predicted primary contact between segments 29-33 and 103-107 implies an additional MGLIS formed by folded regions comprising the N terminus of the protein chain and segment 29-33 along with the C terminus from segment 103-107, which is very similar to the MGLIS defined by the dominant primary contact. Moreover, there are no burst phase probes within residues 25 to 32. Thus, on the basis of available data, the location of the compact regions of this MGLIS is also generally consistent with experimental observations and could significantly contribute to the earliest population of native-like folding molecules.19 The third best predicted primary contact between segments 54-58 and 119-123 implies an additional MGLIS formed by folded regions comprising the N terminus of the protein chain and segment 54-58 along with the C terminus from segment 119-123. The location of the compact regions of this MGLIS is not fully consistent with experimental observations. Probes in region 40-58 do not show significant native secondary structure within the millisecond time scale,19 as would be expected if the [54-58, 119-123] contact led to a significant population of native-like MGLIS.
This result suggests that any population of molecules folding through a pathway defined by the primary contact [54-58, 119-123] either is very small or contains significant non-native structure. The fourth best predicted primary contact between segments 38-42 and 103-107 implies an additional MGLIS formed by folded regions comprising the N terminus of the protein chain and segment 38-42 along with the C terminus from segment 103-107. There are no burst-phase probes in region 25-39, whereas probes 40-42 do not show significant native secondary structure in the earliest resolved folding stages. The location of the compact region of this MGLIS is possibly consistent with experimental observations, but more experimental information is needed to settle the matter fully. The fifth best predicted primary contact [78-82, 142-145] implies an additional MGLIS formed by folded regions comprising the N terminus of the protein chain up to and including segment 78-82 along with the C terminus from segment 142-145. The fifth best primary contact [78-82, 142-145] defines an MGLIS that is not fully consistent with experimental evidence because probes in region 40-82 included in the N-terminal tail appear not to show significant native-like structure early along the folding pathway.18 This result suggests that any population of molecules folding through a pathway defined by the primary contact [78-82, 142-145] either is very small or contains significant non-native structure.
To summarize, the set of MGLISs associated with the four additional primary contacts can be divided into two subsets. The first subset includes the MGLISs defined by contacts [29-33, 103-107] and [38-42, 103-107]; these are generally similar to the dominant MGLIS and generally consistent with experimental evidence. The second subset includes the MGLISs defined by contacts [54-58, 119-123] and [78-82, 142-145], which are broadly distinct from the dominant MGLIS and inconsistent with experimental evidence, suggesting that the associated MGLISs are not significantly populated or are non-native. The apparent inconsistency between experimental evidence and the MGLISs defined by the third and fifth best primary contacts could reflect two different things: (a) the associated folding pathways do not significantly contribute to the overall population of folding molecules, or (b) the associated folding pathways are not native-like. In the next section, a two-step mutation experiment is proposed that would help determine whether any early intermediates defined by the formation of the [54-58, 119-123] and [78-82, 142-145] primary contacts are responsible for the non-native secondary structure observed experimentally. 3.5. Role of Non-Native Secondary Structure and a Proposed Distinguishing Mutation Experiment. 
The first detectable intermediate state along the folding pathway of β-lactoglobulin displays a helical content of ∼20%, significantly higher than the 11% present in the native structure.12 It has been suggested that the best way to reconcile the experimental findings with the percentage of secondary structure in the native structure is to postulate the existence of an α ↔ β interconversion mechanism that takes place along a single native folding pathway.18,19 Another possibility is that there might be more than one folding pathway.18 It is possible that the non-native helical structure observed to form in the earliest detected intermediates might be due to the presence of misfolding populations of molecules that form at least one of the primary contacts [54-58, 119-123], [78-82, 142-145], [29-33, 103-107], and [38-42, 103-107] possibly defining non-native MGLISs. It is not possible to establish firmly within the current form of the SCM whether primary contacts [54-58, 119-123], [78-82, 142-145], [29-33, 103-107], and [38-42, 103-107] and any adjacent structure that might form in the associated tails are responsible for the non-native secondary structure present in the earliest detected intermediates of β-lactoglobulin. To be able to do so, the SCM would have to include microscopic interactions fully in a consistent way, evolving closer to a full native structure predictor. However, the results suggest a relatively simple mutation experiment that should be able to test the relevance of the early folding events defined by primary contacts [54-58, 119-123], [78-82, 142-145], [29-33, 103-107], and [38-42, 103-107] for the formation of the non-native secondary structure. 3.6. Mutation of the Additional Primary Contacts. To investigate the relevance of primary contact [54-58, 119-123]
for the formation of non-native secondary structure, one or more point mutations might be made in native molecules in one or both of the predicted third best contact regions 54-58 and 119-123. These mutations should replace one or more hydrophobic amino acids by polar ones to lower the contact-formation propensity (i.e., the hydrophobicity) significantly below the lowest Pij of the other contacts (i.e., Pij < 11). There are two possible outcomes for this experiment: (a) the mutated population shows significantly decreased non-native secondary structure in the early folding stages, and (b) non-native secondary structure does not decrease in the early folding stages. The experiment can then be repeated for the [78-82, 142-145] primary contact. For primary contacts [29-33, 103-107] and [38-42, 103-107], the mutations must be made in the 29-33 and 38-42 segments because mutations in the 103-107 segment would affect the contact-formation propensity of the dominant primary contact. The SCM predicts that outcome (a) in either of the two experiments would suggest that the associated MGLIS is responsible for the accumulation of non-native structure in the early folding stages of native β-lactoglobulin. Also, outcome (a) for either of the two experiments and the subsequent blocking of the early folding event that it defines could have an enhancing effect on the overall folding rate because there would be no need to unfold the associated non-native population in order to enter the native folding pathway. If any observed decrease in non-native secondary structure due to mutations of the additional primary contacts were insufficient to explain the experimentally observed helical excess, such an outcome would lend strong support to the presence of an α → β interconversion mechanism along the dominant pathway. 
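The segment-selection rule in the mutation design above (avoid mutating any segment shared with the dominant primary contact) can be written down as a small sketch. The contact list is taken from the text; the only segment treated as protected is 103-107, which the text states belongs to the dominant contact. The helper names `overlaps` and `mutable_segments` are hypothetical, and the sketch says nothing about which residues to replace or how Pij would change.

```python
from typing import List, Set, Tuple

Segment = Tuple[int, int]          # (first residue, last residue), inclusive
Contact = Tuple[Segment, Segment]  # the two segments forming a primary contact

# Additional primary contacts as listed in the text.
ADDITIONAL_CONTACTS: List[Contact] = [
    ((29, 33), (103, 107)),
    ((54, 58), (119, 123)),
    ((38, 42), (103, 107)),
    ((78, 82), (142, 145)),
]

# Segment that the text says is shared with the dominant primary contact.
PROTECTED: Set[Segment] = {(103, 107)}


def overlaps(a: Segment, b: Segment) -> bool:
    """True if two inclusive residue ranges share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def mutable_segments(contact: Contact, protected: Set[Segment]) -> List[Segment]:
    """Segments of `contact` that can be mutated without touching any
    protected segment (here, the dominant-contact segment)."""
    return [seg for seg in contact
            if not any(overlaps(seg, p) for p in protected)]


if __name__ == "__main__":
    for contact in ADDITIONAL_CONTACTS:
        print(contact, "->", mutable_segments(contact, PROTECTED))
```

For the two contacts that involve segment 103-107, only the N-terminal partner (29-33 or 38-42) is returned, matching the constraint stated in the text; for the other two contacts, either or both segments remain candidates.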
Also, if the outcome of either of the two experiments is (a), it becomes interesting to investigate the structure of the associated MGLIS to understand fully the sequence of early folding events. 4. Conclusions In this paper, the SCM was applied to explore the folding pathway of β-lactoglobulin, and the results were found to be in general agreement with recent experimental findings. The SCM suggests that, in addition to the early native folding pathway intermediate, there might be coexisting populations of molecules partially folding through non-native pathways, although these populations are likely to be small relative to that of the dominant early intermediate. It is therefore possible, albeit unlikely, that the accumulation of non-native secondary structure observed in the early stages of the folding pathway might be due to the presence of the additional pathways. If this hypothesis were confirmed, it would call into question the need to postulate an α ↔ β interconversion mechanism to explain the experimental findings. Full resolution of this issue will probably have to await further experimental results, and mutation experiments were considered to help clarify the matter. The calculations did not include any consideration of secondary structure propensities, thus suggesting that the issue of whether early intermediates in protein folding have native or non-native secondary structure is relatively unimportant as long as the contacts formed define topologically native-like intermediates. Although this conclusion is sustained within the SCM by results obtained for a significant number of globular proteins, all of these proteins are of similar length and are postulated to share the same folding mechanism. It is possible that several folding modes may exist, and thus the results presented in this paper do not rule out the possibility that the hierarchical principle might be fundamental to understanding the folding mechanism of other globular proteins. 
Also, the presence of frustrated folding pathways would lend support
to the view that the determinants of the folded structure are to be found not only in the relative contributions to the stability of the native structure but also in the folding events along the available folding pathways. These are important issues that will be clarified only when experimental data for more proteins become available.

Acknowledgment. H.A.R. acknowledges support from the ARO. F.B.C. thanks Dirk Dittmer for critical reading of the manuscript.

References and Notes
(1) Kim, P. S.; Baldwin, R. L. Annu. Rev. Biochem. 1990, 59, 631.
(2) Baldwin, R. L.; Rose, G. D. Trends Biochem. Sci. 1999, 24, 26.
(3) Baldwin, R. L.; Rose, G. D. Trends Biochem. Sci. 1999, 24, 77.
(4) Rose, G. D. J. Mol. Biol. 1979, 124, 447.
(5) Ptitsin, O. B.; Rashin, A. A. Biophys. Chem. 1975, 3, 1.
(6) Karplus, M.; Weaver, D. L. Nature 1976, 260, 404.
(7) Sali, A.; Shakhnovich, E.; Karplus, M. Nature 1994, 369, 248.
(8) Bryngelson, J. D.; Onuchic, J. N.; Socci, N. D.; Wolynes, P. G. Proteins: Struct., Funct., Genet. 1995, 21, 167.
(9) Sheinerman, F. B.; Brooks, C. L., III Proteins: Struct., Funct., Genet. 1997, 29, 193.
(10) Alm, E.; Baker, D. Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 11305.
(11) Onuchic, J. N.; Nymeyer, H.; Garcia, A. E.; Chaine, J.; Socci, N. D. Adv. Protein Chem. 2000, 53, 87.
(12) Portman, J. J.; Takada, S.; Wolynes, P. G. J. Chem. Phys. 2001, 114, 5082.
(13) Bergasa-Caceres, F.; Ronneberg, T. A.; Rabitz, H. A. J. Phys. Chem. B 1999, 103, 9749.
(14) Bergasa-Caceres, F.; Rabitz, H. A. J. Phys. Chem. B 2001, 105, 2874.
(15) Bergasa-Caceres, F.; Rabitz, H. A. J. Phys. Chem. B 2002, 106, 4818.
(16) Hamada, D.; Segawa, S.-I.; Goto, Y. Nat. Struct. Biol. 1996, 3, 868.
(17) Kuwajima, K.; Yamada, H.; Sugai, S. J. J. Mol. Biol. 1996, 264, 806.
(18) Forge, V.; Hoshino, M.; Kuwata, K.; Arai, M.; Kuwajima, K.; Batt, C. A.; Goto, Y. J. Mol. Biol. 2000, 296, 1039.
(19) Kuwata, K.; Shastry, R.; Cheng, H.; Hoshino, M.; Batt, C. A.; Roder, H. Nat. Struct. Biol. 2001, 8, 151.
(20) Chikenji, G.; Kikuchi, M. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 14273.
(21) Abkevich, V. I.; Gutin, A. M.; Shakhnovich, E. I. Proteins: Struct., Funct., Genet. 1998, 31, 335.
(22) Prusiner, S. B. Science 1997, 278, 245.
(23) Jackson, G. S.; Hosszu, L. L. P.; Power, A.; Hill, A. F.; Kenney, J.; Saibil, H.; Craven, C. J.; Waltho, J. P.; Clarke, A. R.; Collinge, J. Science 1999, 283, 1935.
(24) Levinthal, C. J. Chim. Phys. 1968, 65, 44.
(25) Dill, K. A. Biochemistry 1990, 29, 7123.
(26) Kauzmann, W. Adv. Protein Chem. 1959, 14, 1.
(27) O’Neil, K. T.; DeGrado, W. F. Science 1990, 250, 646.
(28) Gillespie, J. R.; Shortle, D. J. Mol. Biol. 1997, 268, 170.
(29) Klein-Seetharaman, J.; Oikawa, M.; Grimshaw, S. B.; Wirmer, J.; Duchart, E.; Ueda, T.; Imoto, T.; Smith, L. J.; Dobson, C. M.; Schwalbe, H. Science 2002, 295, 1719.
(30) Baker, D. Nature 2000, 405, 39.
(31) Plaxco, K. W.; Simons, K. T.; Ruczinski, I.; Baker, D. Biochemistry 2000, 39, 11177.
(32) Ha, J. H.; Loh, S. N. Nat. Struct. Biol. 1998, 5, 730.
(33) Schulz, G. E.; Schirmer, R. H. Principles of Protein Structure; Springer-Verlag: New York, 1979; pp 84-85.
(34) Kuwajima, K. Proteins: Struct., Funct., Genet. 1989, 6, 87.
(35) Shoemaker, B. A.; Wolynes, P. G. J. Mol. Biol. 1999, 287, 657.
(36) Shoemaker, B. A.; Wang, J.; Wolynes, P. G. J. Mol. Biol. 1999, 287, 675.
(37) Levinthal, C. J. J. Chim. Phys. 1968, 65, 44.
(38) Lazaridis, T.; Karplus, M. Science 1997, 278, 1928.
(39) Fauchère, J. L.; Pliska, V. Eur. J. Med. Chem. 1983, 18, 369.
(40) Leszczynski, J.; Rose, G. D. Science 1986, 234, 849.
(41) Brownlow, S.; Morais Cabral, J. H.; Cooper, R.; Flower, D. R.; Yewdall, S. J.; Polikarpov, I.; North, A. C.; Sawyer, L. Structure 1997, 5, 481.
(42) Shastry, M. C. R.; Luck, S. D.; Roder, H. Biophys. J. 1998, 74, 2714.
(43) Shastry, M. C. R.; Roder, H. Nat. Struct. Biol. 1998, 5, 385.
(44) Roder, H.; Elove, G. A.; Englander, S. W. Nature 1988, 335, 700.
(45) Udgaonkar, J. B.; Baldwin, R. L. Nature 1988, 335, 694.
(46) Gladwin, S. T.; Evans, P. A. Folding Des. 1996, 1, 407.
(47) Cho, Y.; Batt, C. A.; Sawyer, L. J. Biol. Chem. 1994, 269, 11102.
(48) Ballew, R. M.; Sabelko, J.; Gruebele, M. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 5759.