Probability of Observing a Number of Unfolding Events while

Jul 16, 2014 - censoring time Tc is the time when the experiment is interrupted after the last observed event (Tc > Tkmax). Any possible subsequent ev...
0 downloads 0 Views 853KB Size
Letter pubs.acs.org/Langmuir

Probability of Observing a Number of Unfolding Events while Stretching Polyproteins Rodolfo I. Hermans* London Centre for Nanotechnology, and Department of Physics and Astronomy, University College London, London WC1E 6BT, United Kingdom ABSTRACT: The mechanical stretching of single polyproteins is an emerging tool for the study of protein (un)folding, chemical catalysis and polymer physics at the single molecule level. The observed processes, i.e., unfolding or reduction events, are typically considered to be stochastic and by nature are susceptible to be censored by the finite duration of the experiment. A formal analytical and experimental description on the number of observed events under various conditions of practical interest is developed. Rules of thumb are provided to define an optimal experiment protocol duration. Finally, a methodology is described to accurately estimate the real number of stretched molecules based on the number of observed unfolding events. The model-free numerical analysis applied to experimental data confirms that poly-ubiquitin binds at a random position both to the substrate and to the pulling probe.



unfold, yet the number of such observed unfolding events varies from case to case and rarely reaches the total of 12 engineered modules. Figure 1a shows an example of the raw data obtained by AFM-SMFS in constant-force mode, featuring only kmax = 7 unfolding events at times {T1,T2,...,T7}, in an experiment that lasted around 1 second. Figure 1b shows the population distribution of the number of observed unfolding events in each single poly-protein chain for an unfiltered data set consisting of 1198 unfolding polyprotein chains and a total of 1741 unfolding events. The number of traces observed to unfold kmax events seems to decrease approximately by 22% for each increment in kmax. This observed distribution is explained with a simple approach that does not require assuming a particular model for the unfolding kinetics and relies only on the assumptions that the unfolding events are independent and identically distributed.

INTRODUCTION Atomic force spectroscopy-based single-molecule force spectroscopy technique (AFM-SMFS) (described in detail elsewhere1−4) allows calibrated mechanical stretching and monitoring of individual polymer molecules such as sugars, DNA or proteins. For more than a decade this tool has been used for the study of the folding and unfolding mechanism of proteins at the single molecule level, and lately it has emerged as the means to test chemical catalysis at the level of a localized single disulfide bond.5,6 In order to obtain a suitable “fingerprint,” i.e., unambiguous independent evidence of the controlled condition of the intended sample, protein samples are often engineered to contain multiple linked identical repeats of the structure of interest.7 The kinetics and mechanics of these polyproteins is observed to be identical to monomers but data from polyproteins is considered to be less likely to be perturbed by spurious surface interactions.8 It has been observed experimentally that for a polyprotein designed to contain N identical modules, the number of recorded unfolding events is prevailingly smaller than N, but to the author’s knowledge there has not been a thorough analytical explanation for this phenomena. The following analysis explain the experimentally observed distribution of events by considering the stochastic nature of three relevant processes that define the experiment: the bonding of the sample, the unfolding event, and the spontaneous (or programmed) termination of the experiment. The analysis is valid in a broad scope, and the experimental data is provided as a representative example.





EXPERIMENTAL OBSERVATION

Polyprotein chains, engineered to contain exactly Nmax = 12 identical domains of the protein ubiquitin (Ubi12), are stretched by AFM at a constant force of 110 pN.9 The stretching force weakens the tertiary structure of all “protein modules” exposed to force allowing them to © 2014 American Chemical Society

PROBABILITY MODEL

Let’s assume a chain of N identical modules and define t = 0 when the stretching force is first applied. The unfolding events occur between time 0 and t with probability Pe(t). The experiment is interrupted stochastically by spontaneous detachment of the sample with probability density pc(t) or by design at a known time. The experimental measurables are the observed unfolding times Tk with 0 ≤ T1 ≤ T2 ≤,...,Tkmax. The censoring time Tc is the time when the experiment is interrupted after the last observed event (Tc > Tkmax). Any possible subsequent event after Tc is not measurable, constituting a type-2 right censored system.10 Now we obtain the analytical expression of the probability for a kth order statistics11 to be the last one observed before Received: March 27, 2014 Revised: July 2, 2014 Published: July 16, 2014 8650

dx.doi.org/10.1021/la501161p | Langmuir 2014, 30, 8650−8655

Langmuir

Letter

time t. The probability of a series of k modules to unfold before the time t is (Pe(t))k. These k modules can be chosen in N k different ways. For the kth to be the last event, no other events should happen after the interruption of the experiment. Then the probability that any k modules unfolded before t and the remaining N − k did not, is given by

( )

P(Tkmax t |N ) =

N! Pe(t )k (1 − Pe(t ))N − k k ! (N − k )!

Now we consider that the experiment is interrupted between time t and t + dt with probability pc(t) dt, now the probability that the kth is the last observed event at any time is P(k|N ) = =

∫0



pc (t )P(Tkmax t |N ) dt

N! k ! (N − k )!

∫0



pc (t )Pe(t )k (1 − Pe(t ))N − k dt (1)

If algebraic expressions for pc(t) and Pe(t) are available, eq 1 could be integrated analytically. For example, a trivial case of practical interest is the idealized Arrhenius kinetics when the unfolding rate is a single exponential Pe(t) = 1 − e−αt and experiment is only interrupted by design at time td, and pc(t) = δ(t − td). By making k = N, we see that the probability of observing the last event out of N is

Figure 1. (a) Example of the raw data in an AFM-SMFS experiment. End-to-end length of the polyprotein sample features extension events at times {T1,T2,...,T7}, each one corresponding to the unfolding of an individual protein module. The experiment is interrupted shortly after 1 second, so any further events are “censored”. (b) Number of observed unfolding events in each single poly-protein chain. Despite the fact that the poly-ubiquitin chain is designed to contain 12 identical modules, the population of observed long chains is much smaller than short ones. Except for traces with zero or one unfolding events, the trend of observed number of events versus chain length can be empirically modeled with the function (0.78)k, implying that the population of observed events decreases by 22% for each extra observed unfolding module.

PA(N |N ) = (1 − e−αtd)N

(2)

Solving for td allows the design of an experimental protocol that extends in time enough to observe a desired proportion of the last event.12 Another example of interest is the case where the experiment is interrupted by the spontaneous detachment of the sample from the substrate or the cantilever tip. If we assume the sample attachment also follows Arrhenius kinetics, then pc(t) = −βe−βt in eq 1 we obtain

Table 1. For a Given Number of Stretched Modules N, Any Number of Events from Zero to N Can Be Observed Depending on the Probability That the Events Are Censoreda N k

0 1 2 3 4 5 6 7 8 9 10 11 12

0

1

2

3

4

5

6

7

8

9

10

11

12

1.000 0 0 0 0 0 0 0 0 0 0 0 0

0.313 0.687 0 0 0 0 0 0 0 0 0 0 0

0.163 0.298 0.538 0 0 0 0 0 0 0 0 0 0

0.098 0.195 0.252 0.454 0 0 0 0 0 0 0 0 0

0.064 0.139 0.182 0.215 0.401 0 0 0 0 0 0 0 0

0.043 0.103 0.143 0.161 0.188 0.363 0 0 0 0 0 0 0

0.030 0.077 0.115 0.133 0.141 0.169 0.335 0 0 0 0 0 0

0.022 0.059 0.093 0.113 0.120 0.125 0.156 0.313 0 0 0 0 0

0.016 0.046 0.076 0.096 0.105 0.108 0.113 0.146 0.294 0 0 0 0

0.012 0.036 0.062 0.082 0.093 0.097 0.098 0.103 0.138 0.279 0 0 0

0.009 0.029 0.052 0.070 0.082 0.087 0.089 0.089 0.095 0.132 0.266 0 0

0.007 0.023 0.043 0.061 0.072 0.079 0.081 0.081 0.082 0.089 0.127 0.254

0.005 0.019 0.036 0.052 0.064 0.071 0.074 0.075 0.075 0.076 0.084 0.124 0.244

Each element of matrix Mk,n corresponds to P(kmax|N), the probability of observing k out of N events for a given censored data set. The values are calculated numerically by integrating eq 1 where all probabilities are directly estimated from the experimental populations of unfolding and detachment events. This matrix Mk,n transforms the probability of picking N modules to the probability of observing k events.

a

8651

dx.doi.org/10.1021/la501161p | Langmuir 2014, 30, 8650−8655

Langmuir

Letter

Figure 2. Probability of picking N (left) or observing k (right) protein modules under three different scenarios. Protein grabbed from (a,b) the ends (always), (c,d) one end fixed and one random position, and (e,f) two random places in the chain.

( (

Γ n−k+ β n! pβ (k|N ) = α (n − k )! Γ n + 1 +

β α

) β α)

{T1,T2,...,Tkmax} which allow to estimate Po(t) and the censoring times Tc that defines Pc(t). All the observed events times and censoring times are respectively aggregated to estimate their probability functions. The cumulative distribution function (CDF) is calculated directly by an interpolating formula.

(3)

where Γ(z) is Euler’s Gamma function. Particularly exemplifying is the case when α = β and the probability of observing the last event is pβ(N|N) = (1 + N)−1, that is, if the detachment rates are comparable to the unfolding rate, observing the unfolding of the last module exposed to force is a rare event. It is feasible and desirable to avoid any model assumptions and integrate eq 1 numerically by estimating pc(t) and Pe(t) directly from the raw experimental data. Whereas censoring is independent of the unfolding time and pc(t) can be directly estimated only from the set of observed censoring times, the unfolding events are dependent on not been censored, and Pe(t) is not a direct observable but must me estimated from both the observed unfolding times and the censoring times. Therefore, the probability of observing an event is equal to the probability of the event happening and not been censored, that is Po(t) = Pe(t) (1 − Pc(t)).

⎛ t − tk ⎞ P(̂ t ) = a⎜k(t ) + ⎟ tk + 1 − tk ⎠ ⎝

(4)

where k(t) is the rank of the sample tk, such that t ∈ [tk,tk+1], and a is a constant such that lim P ̂ (t) = 1. The probability t →∞

density function (PDF) is calculated by a Gaussian kernel density estimation.13 Similar calculations for CDF and PDF are already implemented in the simple-to-use function SmoothKernelDistribution in version 8 of Wolfram’s Mathematica.14 The probability P(k|N) is calculated numerically by integrating eq 1 using the estimated Pe(t) = Po(t)/(1 − Pc(t)). The calculated values are given in Table 1. The values of P(kmax|N) in Table 1 are arranged in a matrix Mk,n for algebraic convenience. A consequence of censoring is that observing kmax events is possible from traces with any value of N in the range Nmax ≥ N ≥ kmax. Consequently, the overall probability of observing kmax events P(kmax) is given by



RESULTS Let’s assume the experimental data has been analyzed and compiled in R sets containing the unfolding event times 8652

dx.doi.org/10.1021/la501161p | Langmuir 2014, 30, 8650−8655

Langmuir

Letter

Figure 3. Comparison between observed distribution of chain length with model of random attachment places. (a) The distribution of experimentally observed unfolding events (red) is similar to the distribution of observable events scenario with random attachment places and the observed censoring (blue). (b) The real distribution of modules before censoring is calculated from eq 7, and the result (red) is remarkably similar to the model of random attachment places (blue). Nmax

P(k max ) =



P(k max|N )P(N )

N =1

Pn⃗ = M−k ,1nPk⃗ (5)

In order to compensate for the censored data in the first two bins of Figure 1b, an extrapolation of the count values for kmax = 0 and kmax = 1 is used to match the observed trend using an exponential fit. [Notice that if the two fist bins in Figure 1b were not corrected then the estimated population of N would contain negative values.] The corrected histogram is plotted with red bars in Figure 3a. For comparison in the same figure, the blue bars indicate the distribution expected when both ends of the polyprotein are picked randomly (Scenario 3). The calculated population of N is shown in Figure 3b in red bars, together with the distribution expected when both ends of the polyprotein are picked randomly in blue bars. Figure 3a,b confirms the similarity of the experimental data and the expected distribution for random binding of the polyprotein. From this result, we can conclude that proteins are anchored in a completely random process, both to the substrate and to the cantilever. Even if any anchor position is equally probable, the fact that there is only one way to pick all 12 modules but several more more ways to pick a smaller number of modules explains the distribution Pn⃗ of the number N of stretched modules. The population of observed events Pk⃗ is further reduced by censoring.

Defining the vectors Pk⃗ and Pn⃗ such that their components are the probabilities P(kmax) and P(N), respectively: Pk⃗ = {P(k max = 1), P(k max = 2), ..., P(k max = Nmax )} Pn⃗ = {P(N = 1), P(N = 2), ..., P(N = Nmax )}

Using this definition, eq 5 can be written more conveniently in matrix form:

Pk⃗ = Mk , nPn⃗

(7)

(6)

Matrix Mk,n allows one to transform the probability (or population) distributions from the available modules N to the observed modules kmax, and vice versa. Now is possible to investigate the expected population of observed number of events per chain for three different significant scenarios represented in Figure 2a,c,e. 1. All polyprotein chains are deterministically picked from the end linkers, always exposing all 12 modules to force. P(N) = δ(12) (Figure 2a). 2. One of the end linkers is fixed to the substrate, and the cantilever picks randomly the chain at any linker, exposing between 1 and 12 modules with equal probability. P(12 ≥ N > 0) = 1/12 (Figure 2c). 3. Both ends are picked randomly. There are (13 − N) ways to pick N modules, making a total of 12 × 11/2 = 66. P(12 ≥ N > 0) = (13 − N)/66 (Figure 2e). The calculated probability of observing kmax modules is shown in Figure 2b,d,e. Notice that Figure 2e looks remarkably similar to the histogram of experimentally observed kmax in Figure 1b, except for the first two bins corresponding to kmax = 0 and kmax = 1. No other scenario resembles the experimental data. The small number of counts in the first two bins of Figure 1b is explained by the fact that traces with less than two events are rarely saved because of the lack of a characteristic fingerprint to discriminate them from a nonspecific sample.15,16 Given the values in Table 1, it is possible to calculate M−1 k,n , the inverse of Mk,n, to solve the inverse problem and estimate the population distribution of N based on the population of kmax.



DISCUSSION The importance of addressing censoring in order to obtain unbiased estimators has been previously addressed for different single-molecules experiments.17 The use of nonparametric estimators that allow an explicit analytical description of the experimental constrains, such as maximum likelihood is of paramount importance.18,19 Maximum likelihood estimators have been used in order to account for censoring in the analysis of dwell-time or survival-time analysis of poly-Ubiquitin,9 dramatically reducing the bias toward fast events and allowing a more detailed analysis of the distribution of survival times. In light of the present results, a more complete estimator both for unfolding rates and the value of N for each trace could be obtained by a similar procedure but now including a more accurate estimation of the prior probability of picking N. Therefore, the analysis presented is complementary to an optimum nonparametric analysis of protein unfolding dwell time statistics. 8653

dx.doi.org/10.1021/la501161p | Langmuir 2014, 30, 8650−8655

Langmuir

Letter

probabilities in maximum-likelihood estimators for unfolding rates. This conclusion also informs on the details of the attachment of single-protein samples. Similar analysis could be applied both to account for other unobservable events, such as events that happen faster than the instrumental resolution, or to design experimental protocols that minimize censoring. I’m confident that the tools here developed will also be of use to the experimentalist in the process of designing protocols that account for the often ignored stochastic nature of the attachment and detachment of the sample to substrate and probe.

Analysis based on simulated data have neglected crucial experimental constrains and conclude that fitting dwell time histograms provide unbiased estimation of unfolding rates.20 In practice, distribution of censoring times are broad and therefore bias is not avoided by excluding particular regions of a histogram. Cao et. al. rightly conclude that under ideal conditions the probability of unfolding events are independent of the number N of available modules.20 Nevertheless, an accurate estimation of N is still necessary for the analysis of the individual traces. That extends to censored data, provided that the events are equally distributed and memory-less. It must be noted that this is not necessarily the case for fast unfolding rates where dwell times become comparable to the limited response of the feedback loop that keeps the stretching force constant. The procedure here described is a quantification of the probability of censoring data and allows one to transform the observed population of events into the existing population based on a matrix constructed numerically based on experimental data and free of any assumptions on the kinetics of the events. The calculated existing population is consistent with the scenario where both anchoring points are chosen among all the protein modules with equal probability and the probability of picking N modules is P(N) = 2(Nmax − N + 1)/ (Nmax(Nmax − 1)). Therefore, the less likely configuration is to have a protein grabbed from the ends, and the most likely is to grab a single module. In the presence of significant spontaneous detachment of the sample or any other limitation on the experiment duration, the observed distribution is further skewed by censoring the last events, making the observation of the unfolding of all engineered modules even less likely and observing none the most likely outcome. In the presented experimental conditions, the protein sample was attached nonspecifically to the substrate and AFM cantilever tip, and it was consistently concluded that the stretching ends were picked randomly. That scenario is not the one present in experiments performed with successful specific functionalization at the protein ends.21 Current polyprotein designs, including the sample here studied, have been designed to include two cysteine residues on the C-terminus with the idea of providing covalent attachment to a gold substrate. Nevertheless, similar experiments can be successfully replicated in surfaces different from gold.22 The results here presented suggest that C-terminus cysteine are not anchoring the protein directly to the surface from the ends. A more likely scenario is that proteins are entangled in clusters,23 a formation that could be modulated by the presence of cysteine. It remains an open question to provide a detailed mechanism that reconciles these results with the prevailing belief in the necessity of C-terminus cysteine.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This manuscript builds from previous work described in my Ph.D. Thesis. I thank my former Ph.D. advisor Professor Julio M. Fernandez for access to the experimental data. I’m grateful to Dr. Sergi Garcia-Manyes for valuable discussion. The author acknowledges financial support from the Coherent Terahertz Systems Programme Grant, EPSRC EP/J017671/1.



REFERENCES

(1) Fisher, T. E.; Marszalek, P. E.; Fernandez, J. M. Stretching single molecules into novel conformations using the atomic force microscope. Nat. Struct Mol. Biol. 2000, 7, 719−724. (2) Fernandez, J. M.; Li, H. Force-clamp spectroscopy monitors the folding trajectory of a single protein. Science 2004, 303, 1674−1678. (3) Oberhauser, A. F.; Hansma, P. K.; Carrion-Vazquez, M.; Fernandez, J. M. Stepwise unfolding of titin under force-clamp atomic force microscopy. Proc. Natl. Acad. Sci. U. S. A. 2001, 98, 468−472. (4) Baró, A. M., Reifenberger, R. G., Eds. Atomic Force Microscopy in Liquid: Biological Applications; Wiley-VCH: Weinheim, Germany, 2012; p 402. (5) Perez-Jimenez, R.; Inglés-Prieto, A.; Zhao, Z.-M.; SanchezRomero, I.; Alegre-Cebollada, J.; Kosuri, P.; Garcia-Manyes, S.; Kappock, T. J.; Tanokura, M.; Holmgren, A.; Sanchez-Ruiz, J. M.; Gaucher, E. A.; Fernández, J. M. Single-molecule paleoenzymology probes the chemistry of resurrected enzymes. Nat. Struct. Mol. Biol. 2011, 18, 592−596. (6) Garcia-Manyes, S.; K, T.-L.; F, J. M. Contrasting the individual reactive pathways in protein unfolding and disulfide bond reduction observed within a single protein. J. Am. Chem. Soc. 2011, 133, 3104− 3113. (7) Carrion-Vazquez, M.; Oberhauser, A. F.; Fisher, T. E.; Marszalek, P. E.; Li, H.; Fernandez, J. M. Mechanical design of proteins studied by single-molecule force spectroscopy and protein engineering. Prog. Biophys. Mol. Biol. 2000, 74, 63−91. (8) Garcia-Manyes, S.; Brujic, J.; Badilla, C. L.; Fernández, J. M. Force-clamp spectroscopy of single-protein monomers reveals the individual unfolding and folding pathways of i27 and ubiquitin. Biophys. J. 2007, 93, 2436−2446. (9) Brujic, J.; Hermans Z, R. I.; Walther, K. a.; Fernandez, J. M.; Hermans Z, R. I.; Brujić, J. Single-molecule force spectroscopy reveals signatures of glassy dynamics in the energy landscape of ubiquitin. Nat. Phys. 2006, 2, 282−286. (10) Balakrishnan, N. Order Statistics: Theory & Methods; Handbook of Statistics 16; Elsevier Science B.V.: Amsterdam, 1998. (11) David, H. Advances in Distribution Theory, Order Statistics, and Inference; Statistics for Industry and Technology 3; Birkhäuser: Boston, 2006; pp 157−172.



OUTLOOK AND CONCLUSIONS The study of poly-proteins by AFM-SMFS has been identified as a type-2 right censored system and a model-independent method to understand and predict the population of observed unfolding events was provided. The population of modules exposed to force is accurately estimated based on the number of observed events accounting for the unavoidable censoring of the data. Using this method, it is concluded that the experimental data on AFM-SMFS unfolding of Ubi12 stretched at constant 110 pN clearly suggest that the attachment of the protein sample to the substrate and pulling probe is random, and therefore observing the unfolding of all possible modules is highly unlikely. This result is relevant to define prior 8654

dx.doi.org/10.1021/la501161p | Langmuir 2014, 30, 8650−8655

Langmuir

Letter

(12) Hermans, R. I. Experimental study of single protein mechanics and protein rates of unfolding. Ph.D. Thesis, Graduate School of Arts and Sciences, Columbia University, New York, 2010. (13) Gerard, P. D.; Schucany, W. R. Local bandwidth selection for kernel estimation of population densities with line transect sampling. Biometrics 1999, 55, 769−773. (14) Wolfram Research, Inc. Mathematica, version 9.0; Wolfram Research: Champaign, IL, 2013. (15) Fernandez, J. M. Fingerprinting single molecules in vivo. Biophys. J. 2005, 89, 3676−7. (16) Sarkar, A.; Caamano, S.; Fernandez, J. M. The mechanical fingerprint of a parallel polyprotein dimer. Biophys. J. 2007, 92, L36− L38. (17) Koster, D. A.; Wiggins, C. H.; Dekker, N. H. Multiple events on single molecules: Unbiased estimation in single-molecule biophysics. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 1750−1755. (18) Aldrich, J. R.A. Fisher and the making of maximum likelihood 1912−1922. Stat. Sci. 1997, 12, 162−176. (19) Jones, M. C.; Charalambides, C. A.; Koutras, M. V.; Balakrishnan, N. Probability and Statistical Models with Applications; CRC Press: Boca Raton, FL, 2001; p 429. (20) Cao, Y.; Li, H. Single-molecule force-clamp spectroscopy: Dwell time analysis and practical considerations. Langmuir 2011, 27, 1440− 1447. (21) Popa, I.; Berkovich, R.; Alegre-Cebollada, J.; Badilla, C. L.; Rivas-Pardo, J. A.; Taniguchi, Y.; Kawakami, M.; Fernandez, J. M. Nanomechanics of HaloTag tethers. J. Am. Chem. Soc. 2013, 135, 12762−12771. (22) Garcia-Manyes, S.; Dougan, L.; Badilla, C. L.; Brujic, J.; Fernández, J. M. Direct observation of an ensemble of stable collapsed states in the mechanical folding of ubiquitin. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 10534−9 (see Supporting Information).. (23) Valbuena, A.; Oroz, J.; Vera, A. M.; Gimeno, A.; GómezHerrero, J.; Carrión-Vázquez, M. Quasi-simultaneous imaging/pulling analysis of single polyprotein molecules by atomic force microscopy. Rev. Sci. Instrum. 2007, 78, 113707.

8655

dx.doi.org/10.1021/la501161p | Langmuir 2014, 30, 8650−8655