Anal. Chem. 2006, 78, 3203-3207

Application of Capture-Recapture Models to Estimation of Protein Count in MudPIT Experiments

James A. Koziol,*,† Anne C. Feng,† and Jan E. Schnitzer‡

Department of Molecular and Experimental Medicine, The Scripps Research Institute, 10550 North Torrey Pines Road, MEM216, La Jolla, California 92037, and Sidney Kimmel Cancer Center, 10835 Altman Row, San Diego, California 92121

MudPIT is an automated shotgun proteomics approach that enhances the separation of peptides for sequencing by mass spectrometry analysis. We here adapt a mathematical model from ecology, namely, the capture-recapture model with a closed population and time-varying and heterogeneous individual probabilities of capture, to model the number of peptide identifications across the various cycles of a typical MudPIT experiment. In the absence of any prior information on abundance levels, the model can be used to estimate the total number of proteins in the experimental sample. We apply the model to a recent MudPIT-based experiment to estimate the total number of rat lung endothelial cell surface proteins. The model provides some practical guidelines for planning MudPIT experiments.

Yates and colleagues¹ have developed an automated shotgun proteomics approach, termed multidimensional protein identification technology, or MudPIT. This method enhances the separation of peptides for sequencing by mass spectrometry analysis. A fundamental issue with this method concerns experimental design: namely, how many MudPIT experiments might be required in practice to ensure relatively complete coverage of complex peptide mixtures. To address this issue, we introduce a probability model for the number of protein identifications across the various cycles of a typical MudPIT experiment and subsequently use this model to estimate the total number of proteins in the experimental sample subjected to MudPIT analyses, in the absence of any prior information on abundance levels. The probability model is not new; it has been widely used in ecology and biometry (under the rubric of capture-recapture) to estimate the size of populations. The theoretical framework provided by the model affords estimates of biologically relevant parameters and assessments of uncertainty about these estimates.

The fundamental assumptions of the model in the present context derive largely from the experimental paradigm for MudPIT experiments introduced by Yates and colleagues in their seminal work.¹,² The formalism provided by probability model building and assessment, followed by statistical inference, adds rigor and robustness to MudPIT experiments.

† The Scripps Research Institute.
‡ Sidney Kimmel Cancer Center.
(1) Wolters, D. A.; Washburn, M. P.; Yates, J. R., III. Anal. Chem. 2001, 73, 5683-5690.
(2) Liu, H.; Sadygov, R. G.; Yates, J. R., III. Anal. Chem. 2004, 76, 4193-4201.

10.1021/ac051248f CCC: $33.50 Published on Web 03/30/2006

© 2006 American Chemical Society

We give some mathematical details relating to the model in the next section and then discuss its relevance to MudPIT experiments. In the concluding section, we apply this model to findings from a recent experiment by Durr et al.,³ who used multiple MudPIT cycles on endothelial cell plasma membranes isolated from rat lungs to analyze rat lung endothelial cell surface proteins comprehensively.

The Capture-Recapture Model with a Closed Population and Time-Varying and Heterogeneous Individual Probabilities of Capture. We here provide some details relating to the capture-recapture model for closed populations with time-varying and heterogeneous individual capture probabilities, commonly referred to as the $M_{th}$ model in the ecology literature. Readers primarily interested in practical applications of the $M_{th}$ model to MudPIT experiments can skip to the next section. In addition, we refer the interested reader to comprehensive reviews of closed models and applications by Otis et al.,⁴ White et al.,⁵ or Seber.⁶

We first introduce some notation for a typical capture-recapture experiment with a closed population, within the context of a typical MudPIT experiment involving multiple cycles:

$N$ = total population size, that is, the number of distinct proteins in the experimental sample
$t$ = number of independent MudPIT cycles over the experimental period
$p_i$ = frequency of the $i$th protein in the experimental sample, with $\sum_{i=1}^{N} p_i = 1$
$p_{ij}$ = unknown capture or detection probability of the $i$th protein ($i = 1, \ldots, N$) in the $j$th cycle ($j = 1, \ldots, t$)
$n_j$ = number of proteins captured (detected) in the $j$th cycle ($j = 1, \ldots, t$)
$f_k$ = number of proteins captured exactly $k$ times in $t$ cycles, $k = 0, 1, \ldots, t$
$M_j$ = number of previously detected proteins at the start of the $j$th cycle
$M_{t+1}$ = total number of distinct proteins detected in the $t$ cycles

(3) Durr, E.; Yu, J.; Krasinska, K. M.; Carver, L. A.; Yates, J. R.; Testa, J. E.; Oh, P.; Schnitzer, J. E. Nat. Biotechnol. 2004, 22, 985-992.
(4) Otis, D. L.; Burnham, K. P.; White, G. C.; Anderson, D. R. Wildlife Monographs; No. 62; The Wildlife Society: Bethesda, Maryland, 1978.
(5) White, G. C.; Anderson, D. R.; Burnham, K. P.; Otis, D. L. Report LA-8787-NERP; Los Alamos National Laboratory: Los Alamos, New Mexico, 1982.
(6) Seber, G. A. F. The Estimation of Animal Abundance and Related Parameters, 2nd ed.; Griffin: London, 1982.

Analytical Chemistry, Vol. 78, No. 9, May 1, 2006 3203

A convenient representation for the data accruing from a MudPIT experiment is as follows.⁷,⁸ The data consist of an $N \times t$ matrix $X = (X_{ij})$, where $X_{ij} = I[\text{the } i\text{th protein is detected in the } j\text{th cycle}]$, and $I[A]$ is the usual indicator function; that is, $I[A] = 1$ if event $A$ occurs, 0 otherwise. Note that

$$M_{t+1} = \sum_{i=1}^{N} I\Bigl[\sum_{j=1}^{t} X_{ij} \ge 1\Bigr]$$

and

$$f_k = \sum_{i=1}^{N} I\Bigl[\sum_{j=1}^{t} X_{ij} = k\Bigr]$$
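As an illustration of the notation, the following minimal Python sketch computes the summary counts from a small detection matrix. The function name `capture_summary`, the dictionary keys, and the toy matrix are hypothetical and are not taken from the original analysis; only the observed rows of X are assumed to be available, so the computed f[0] is not the true number of undetected proteins.

```python
import numpy as np

def capture_summary(X):
    """Summary counts for a 0/1 detection matrix X (rows: proteins, columns: cycles).

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=int)
    t = X.shape[1]
    n = X.sum(axis=0)                          # n_j: proteins detected in cycle j
    times_seen = X.sum(axis=1)                 # cycles in which each protein was detected
    M_total = int((times_seen >= 1).sum())     # M_{t+1}: distinct proteins detected
    f = np.bincount(times_seen, minlength=t + 1)  # f[k]: proteins detected exactly k times
    # M_j: proteins already detected before cycle j (cumulative detections minus current)
    seen_before = np.cumsum(X, axis=1) - X
    M_before = (seen_before >= 1).sum(axis=0)
    return {"n": n, "f": f, "M_total": M_total, "M_before": M_before}

# Toy data (assumed): 4 observed proteins, t = 3 cycles
X = [[1, 0, 1],
     [0, 1, 0],
     [1, 1, 1],
     [0, 0, 1]]
summary = capture_summary(X)
print(summary)
```

For this toy matrix the sketch gives n = (2, 2, 3), f_1 = 2, f_2 = 1, f_3 = 1, and M_{t+1} = 4. The total number of detections can be counted by cycle or by protein, so the sum of k f_k over k equals the sum of n_j over j (7 in both cases).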

where $k = 0, 1, \ldots, t$. Clearly, only $M_{t+1}$ rows in the data matrix are observed, that is, have nonzero entries: $f_0$ represents the number of unobserved proteins in the overall experiment, with $N = M_{t+1} + f_0$. Parametric approaches to the estimation of the underlying population size $N$ typically involve distributional assumptions to model the capture probabilities $p_{ij}$; hence, they are highly dependent on the validity of these distributional assumptions.⁹ A nonparametric approach that avoids such distributional assumptions was introduced by Chao and colleagues.¹⁰⁻¹² Their methodology is based on the notion of sample coverage.⁹,¹³ Formally, the sample coverage, $C$, is the sum of the underlying frequencies of proteins observed in the overall MudPIT experiment:

$$C = \sum_{i=1}^{N} p_i \, I[\text{the } i\text{th protein is captured at least once}] = \sum_{i=1}^{N} p_i \, I\Bigl[\sum_{j=1}^{t} X_{ij} \ge 1\Bigr]$$
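To make the definition of sample coverage concrete, the following minimal Python sketch evaluates C for a toy population in which the frequencies p_i are known. In a real experiment the p_i and the undetected rows are unknown, which is why C must be estimated. The function name `coverage_summary` and the toy values are hypothetical.

```python
import numpy as np

def coverage_summary(p, X):
    """Return (C, M_total, f0) for known frequencies p and a full N x t detection matrix X.

    Hypothetical helper for illustration only; it assumes all N rows are known,
    including the all-zero rows of undetected proteins.
    """
    p = np.asarray(p, dtype=float)
    X = np.asarray(X, dtype=int)
    if p.shape[0] != X.shape[0]:
        raise ValueError("p and X must describe the same N proteins")
    if not np.isclose(p.sum(), 1.0):
        raise ValueError("frequencies p_i must sum to 1")
    detected = X.sum(axis=1) >= 1      # I[sum_j X_ij >= 1] for each protein
    C = float(p[detected].sum())       # sample coverage
    M_total = int(detected.sum())      # M_{t+1}
    f0 = int((~detected).sum())        # undetected proteins, so N = M_{t+1} + f0
    return C, M_total, f0

# Toy population (assumed): N = 4 proteins, t = 2 cycles
p = [0.5, 0.3, 0.15, 0.05]
X = [[1, 1],
     [1, 0],
     [0, 0],   # never detected
     [0, 1]]
C, M_total, f0 = coverage_summary(p, X)
print(C, M_total, f0)   # C is 0.5 + 0.3 + 0.05
```

In this toy case three of the four proteins are detected, so M_{t+1} = 3 and f_0 = 1, while the coverage is about 0.85 because the single undetected protein accounts for a frequency of 0.15.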

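The quantity $C$ has been well-studied;¹⁴⁻¹⁶ estimation of $N$ follows directly from estimation of $C$. For our purposes, it suffices to summarize Chao’s estimators $\hat{N}$ and $\hat{C}$ of $N$ and $C$, respectively,

$$\hat{N} = \frac{M_{t+1}}{\hat{C}} + \frac{f_1}{\hat{C}} \hat{\gamma}^2$$

where

$$\hat{C} = 1 - \frac{f_1}{\sum_{j=1}^{t} j f_j}$$

$$\hat{\gamma}^2 = \max\Biggl\{ \frac{\hat{N}_0 \sum_{j=2}^{t} j(j-1) f_j}{2 \sum\sum_{j<k} n_j n_k} - 1,\ 0 \Biggr\}$$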