Quantifying the Effect of Competition for Detection between Coeluting

Article pubs.acs.org/jpr

Quantifying the Effect of Competition for Detection between Coeluting Peptides on Detection Probabilities in Mass-Spectrometry-Based Proteomics

Paul Schliekelman* and Shangbin Liu

Department of Statistics, University of Georgia, 204 Statistics Building, Athens, Georgia 30602, United States

ABSTRACT: There are many factors that contribute to the variation in detection probabilities of proteins in LC−MS/MS experiments, and currently little is known about their relative importance. In this study, we analyze the effect of competition for detection between coeluting peptides on peptide detection probability. Using a novel method for estimating peptide detection probabilities, we show that these probabilities can vary by an order of magnitude between peptides that elute from the liquid chromatograph at the same time as many other peptides and those that elute with fewer other peptides. To explore these results, we use a mathematical model to show that competition for detection between peptides is expected to be a major source of missed detections in complex mixtures because there will be many MS/MS scanning intervals that contain more coeluting peptides than can be subjected to MS/MS analysis. Our data and simulation results show that the number of coeluting peptides is a primary determinant of whether a peptide will be detected. In our data, this had a several-fold larger effect on peptide detection probability than did peptide abundance. Furthermore, the distribution of elution times for the most frequently detected peptides was strongly shifted toward values where there were few coeluting peptides, indicating that the number of coeluting peptides is a major determinant of whether a peptide is proteotypic.

KEYWORDS: mass spectrometry (MS), MS-based proteomics, simulation, protein identification, protein abundance, retention time, elution time, LC−MS/MS, liquid chromatograph



INTRODUCTION

Although LC−MS/MS methods have been highly successful in proteomics studies, a typical run will often identify only a small portion of the proteins in a sample. Numerous factors influence peptide identification: protein abundance, proteolytic digestion efficiency, peptide separation and presence of coeluting peptides, peptide ionization efficiency, ion suppression, scanning speed of the mass spectrometer, and dynamic exclusion efficiency. Little is known about the dynamic interaction of these factors or their importance in determining the probability of protein identification. A better understanding of the relative importance of these factors would allow optimization of technology, experimental protocols, and statistical analysis.

Peptide detection probabilities have been investigated in a number of previous studies.1−18 It has been observed that many proteins are repeatedly and consistently identified by a small number of peptides. That is, these peptides have a much higher probability of being detected than other peptides in the protein. Mallick and colleagues11 termed such peptides proteotypic peptides. They defined a proteotypic peptide as one that occurs in at least 50% of protein identifications. They and others4,8,10−12,14,15,17,19 have developed models for predicting highly detectable peptides based on peptide properties. Tang et al.2 defined standard peptide detectability as a peptide’s detection probability in a standard sample with a fixed number of proteins mixed in equal quantities. They proposed that this standard detectability is an intrinsic property of the peptide that is independent of its abundance and developed a model for predicting the standard detectability from its sequence. Li et al.3 developed a method for estimating standard detectability from a complex mixture of varying protein abundance. They found that most peptides in their data had very low detectability, a small portion had detectability near one, and few had intermediate values.

These studies show that there is great variability in detection probability between peptides. However, despite these advances, there is still little known about the sources of variation in peptide detection probabilities. In this study, we quantify one potentially important source of variation in peptide detections: the “competition” for detection between peptides. In a typical LC−MS/MS experiment, a survey scan is followed by a number of MS/MS events on the most intense ions in the MS survey scan. We will refer to the number of MS/MS events per cycle as c. When the protein mixture is complex, there are often more coeluting peptides than the number of MS/MS events. In this instance, a peptide will be detected only if it is among the top c most intense peptides in some sampling interval. When there are substantially more than c coeluting peptides, it is probable that some peptides are never in the top c before their period of elution ends.

Received: February 21, 2013
Published: December 7, 2013
© 2013 American Chemical Society

dx.doi.org/10.1021/pr400034z | J. Proteome Res. 2014, 13, 348−361

Journal of Proteome Research

Article

This paper has two major components. First, we demonstrate a novel method for estimating peptide detection probabilities as a function of their elution time (ET) and abundance. We apply this method to data and show that peptide detection probabilities have only a weak dependence on ET in a simple mixture, but in a complex mixture vary by an order of magnitude between peptides that coelute with many other peptides and peptides that coelute with few other peptides. In the complex mixture, the variation in detection probability due to ET is several-fold higher than the variation due to protein level factors such as abundance.

The second major component of the paper is a mathematical model that we used to study the dynamics of competition for detection between peptides. The goal of the model is to give context to the experimental results and to help determine whether we should expect competition between coeluting peptides to be a major factor in determining peptide detection probability. The model shows that, consistent with our data, this competition is likely to be a major source of missing detections in complex mixtures and that the number of other peptides that a peptide coelutes with is likely a major determinant of its detection probability. Peptides with unusual elution times will be detected with higher probability, while peptides that elute at the same time as many other peptides have a much lower detection probability. The model and data results together allow us to quantify the effect of competition between peptides and show that it is a major determinant of whether peptides are detected, with a magnitude of effect several-fold higher than that of abundance.

METHODS

Data

Data were acquired using an Agilent 1100 Capillary LC system (Palo Alto, CA) using the EXP Stem Trap 2.6 μL cartridge packed with Halo Peptide ES-C18 (Optimize Technologies, Oregon City, OR) along with a 0.2 × 150 mm Halo Peptide ES-C18 capillary column (Advanced Materials Technology, Wilmington, DE). Online MS detection used the Thermo-Fisher LTQ ion trap (San Jose, CA) with a Michrom (Michrom Bioresources, Auburn, CA) captive spray interface. Mobile phases used formic acid, ammonium formate, and acetonitrile from Sigma-Aldrich (St. Louis, MO).

There were two samples used in this study. For the first (henceforth referred to as the simple mixture), a standard tryptic peptide mixture was made, consisting of three proteins: carbonic anhydrase (1 pmol/μL) and apo myoglobin (2 pmol/μL) from MicroSolv Technology (Eatontown, NJ) and transferrin (3 pmol/μL) from Sigma-Aldrich (St. Louis, MO). Mobile phase A consisted of 99.9% water with 0.1% formic acid, and mobile phase B consisted of 99.9% acetonitrile with 0.1% formic acid. Mobile phase B concentration was increased from 2% B to 45% B over 85 min at a flow rate of 4 μL/min. Mass spectrometer settings included an MS1 scan range of m/z 300 to 2000, with the top four most intense precursor ions selected for MS2 acquisition.

The second sample (henceforth referred to as the complex mixture) was a whole cell lysate from procyclic form Trypanosoma brucei. Soluble proteins from T. brucei underwent electrophoresis through a NuPAGE 12% Bis-Tris gel (Invitrogen, Carlsbad, CA). This step was intended to remove undesirable nonprotein contaminants with negligible protein separation. Hence, the entire lane was treated as a single sample. Detailed in-gel digestion procedures can be found in the Supporting Information. Mobile phase A consisted of 99.9% water with 0.1% formic acid and 10 mM ammonium formate. Mobile phase B consisted of 80% acetonitrile with 0.1% formic acid and 10 mM ammonium formate. Mobile phase B concentration was increased from 5 to 50% B over 90 min at a flow rate of 4 μL/min. Mass spectrometer settings included an MS1 scan range of m/z 300 to 2000, with the top eight most intense precursor ions selected for MS2 acquisition.

Data Processing

Data were acquired as Sequest output via Proteome Discoverer (version 1.2; Thermo Scientific). Protein identification settings were as follows: tryptic enzymatic cleavages allowing for up to two missed cleavages, peptide tolerance of 1000 parts-per-million, fragment ion tolerance of 0.6 Da, fixed modification due to carboxyamidomethylation of cysteine (+57 Da), and variable modifications of oxidation of methionine (+16 Da) and deamidation of asparagine or glutamine (+0.98 Da). A 1% FDR threshold calculated from a decoy database was applied to determine high-confidence matches. If a peptide was matched to multiple proteins, then one protein was arbitrarily chosen. This will result in some proteins being assigned incorrect counts of detected peptides, which could possibly result in a small number of proteins being assigned to different abundance categories (see later) than they otherwise would have been.

We took all proteins that were detected with at least one peptide and subjected them to in silico digestion using the Protein Digestion Simulator (http://omics.pnl.gov/software/ProteinDigestionSimulator.php) to produce a list of tryptic peptides for each protein. Posttranslational modifications (PTMs) were ignored. Although PTMs have been shown to affect ET,20 our analysis showed (see Supporting Information) that the variation in ET in our data between different modifications of the same peptide was very low compared with the variation between peptides and thus would have little impact on our results. For the results shown in this paper, we allowed no missed cleavages. We also conducted our analysis with two missed cleavages allowed in the in silico digestion phase. These results were not substantially different (see Supporting Information). We compared the list of tryptic peptides back to the original list of detected peptides and dropped all detected peptides that were not on the tryptic list. This was necessary because we require predicted elution times for each peptide included in the analysis.
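The in silico digestion and filtering step can be illustrated with a minimal Python sketch. It is not the Protein Digestion Simulator itself: the cleavage rule (cut after K or R unless the next residue is P), the function name `tryptic_digest`, and the example sequences are assumptions made for illustration only.

```python
import re


def tryptic_digest(sequence, missed_cleavages=0):
    """Return the tryptic peptides of a protein sequence.

    Assumed rule (not taken from the paper): cleave on the C-terminal side
    of K or R, except when the next residue is P.
    """
    # Zero-width split points after K/R not followed by P (Python 3.7+).
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighboring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Hypothetical protein and detected-peptide list, for illustration only.
theoretical = set(tryptic_digest("MKWVTFISLLR"))
detected = ["WVTFISLLR", "KWVTF"]
# Keep only detected peptides that appear on the theoretical tryptic list.
kept = [p for p in detected if p in theoretical]
```

With `missed_cleavages=0` this corresponds to the "no missed cleavages" setting used for the main results; passing `missed_cleavages=2` would mimic the alternative analysis.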

Predicted Elution Times

The two data sets that we use in this study both include observed ETs for each detected peptide. However, because we will compare detection probabilities as a function of ET, we will also need ETs for peptides that were not detected but that we assume are present in the mixture because their protein is present. Thus, we used the predicted ETs generated by the Normalized Elution Time Prediction utility (http://omics.pnl.gov/software/NETPredictionUtility.php). Of the available ET prediction algorithms, the Kangas/Petritis LC retention time model21 had the highest correlation (0.94) with our observed ETs, and thus we used it. We will refer to these predicted ETs as NET values. These values are used for all analysis involving ETs. Figure S1 in the Supporting Information shows a plot of actual versus predicted ETs.

Method for Estimating Peptide Detection Probabilities

Previous work on peptide detectability has focused on estimating detection probabilities2,3,6 or proteotypic properties4,8,10−12 for individual peptides based on their physicochemical properties. Our goal is different. We wish to estimate the detection probability for peptides as a function of their ET. Thus, we will not estimate detection probabilities for individual peptides, but rather for groups of peptides with similar ET.

Suppose that r replicate MS/MS runs are conducted for the same biological sample and that for each replicate we have a list of detected peptides as produced by programs like Mascot or Sequest. In each replicate, we assume that if a protein was detected at least once by any peptide then all of its tryptic peptides are present in the sample. We will consider two characteristics of peptides in our analysis: abundance and ET. We will characterize the abundance of a protein by the mean count across its peptides of detections in r replicates. We will divide proteins into different abundance classes according to this value and do the analysis separately within each abundance class. We will refer to these as abundance classes, but more generally they would reflect any phenomenon that causes joint variation in detection probability for all of the peptides of a protein. To characterize ET, peptides will be divided into categories according to their NET value. Peptide detection probabilities are estimated for each combination of abundance class and NET class.

If we assume independence in detection probability between peptides within replicates and between replicates, then the number of times that a peptide j is detected has a binomial(r, p(Aj, NETj)) distribution, where p(Aj, NETj) is the detection probability for a peptide with abundance class Aj and NET value NETj. We formulated a maximum likelihood model based on this binomial model and used it to estimate the peptide detection probabilities. Full details are given in the Supporting Information.
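As a rough illustration of the binomial idea, the sketch below pools all peptides in one abundance/NET class, computes the maximum likelihood estimate of a shared detection probability, and finds a support interval by a grid search (values of p whose likelihood is at least 1/10 of the maximum). This is a simplified stand-in rather than the full model from the Supporting Information; the function names and the example counts are hypothetical.

```python
import math


def log_likelihood(p, detections, trials):
    """Binomial log-likelihood, up to an additive constant."""
    return detections * math.log(p) + (trials - detections) * math.log(1.0 - p)


def estimate_detection_probability(counts, r, ratio=10.0, grid=10000):
    """Pooled MLE and support interval for one abundance/NET class.

    counts: number of replicates (0..r) in which each peptide was detected.
    r:      number of replicate runs.
    ratio:  the interval keeps p values with likelihood >= max / ratio.
    """
    detections = sum(counts)
    trials = r * len(counts)
    if not 0 < detections < trials:
        raise ValueError("sketch requires 0 < detections < trials")
    p_hat = detections / trials  # closed-form binomial MLE
    cutoff = log_likelihood(p_hat, detections, trials) - math.log(ratio)
    inside = [i / grid for i in range(1, grid)
              if log_likelihood(i / grid, detections, trials) >= cutoff]
    return p_hat, min(inside), max(inside)


# Hypothetical class: 50 peptides, r = 2 replicates, 12 detections in total.
example_counts = [0] * 40 + [1] * 8 + [2] * 2
p_hat, lower, upper = estimate_detection_probability(example_counts, r=2)
```

Because the sketch ignores the abundance-class structure of the full model, it should not be expected to reproduce the intervals reported in the tables.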

Simulation Model

The goal of the model is to simulate the dynamics of peptide elution in the LC−MS/MS process and how these dynamics affect peptide detection probabilities. The model is fully explained in the Supporting Information. We will outline it here.

Suppose that we have a sample with n proteins. Each protein i has bi peptides and has abundance (number of copies of the protein) ai. To simplify the analysis, we assume complete protein digestion. Thus, all peptides of the protein have the same abundance. Now, suppose that an LC−MS/MS run is being conducted with this sample. The peptides pass through the LC and are eluted at different times. Take the elution time tij for copies of peptide j in protein i as tij ∼ normal(μij, σij). The mean μij and standard deviation σij in elution time for each peptide are determined by the physicochemical properties of the peptide.

After the peptides are ejected from the LC apparatus, they pass into the mass spectrometer (MS). An MS1 scan is conducted every t0 seconds, and the number of copies of each peptide eluting over a period of τ seconds is determined. A peptide is selected for MS2 fragmenting if it is among the top c most intense peptides, where c is the number of MS2 scans. We assume that peptides are identified with 100% accuracy if they are selected for MS2. If a peptide is not in the top c most intense eluting peptides in any MS1 sampling interval, then it is not detected.

We assume that dynamic exclusion is in operation, such that when a peptide is identified it is added to an exclusion list and spectra of that mass will be ignored in future scanning cycles when determining the top c most intense spectra. Peptides in the simulations were assigned masses randomly selected with replacement from the peptide masses in the complex mixture data set. These random masses were then adjusted by adding a small random uniform component to prevent different peptides having identical masses. Once a peptide was detected, its mass was added to an exclusion list. It remained on the exclusion list for a period of Tex seconds. Any peptide mass within ±b Daltons of a mass on the list was ignored for detection. In the simulations shown in this paper, Tex = 30 s and b = 0.5 Da.

The steps of the simulation model are illustrated in Figure 1. We first specify the number of proteins in the sample, the number of peptides of each protein, and the abundances for each protein. Next, we generate the mean (μij) and standard deviation (σij) of elution time for each peptide from an appropriate distribution (see later). The actual elution times for each peptide copy are then randomly generated for each peptide (i, j) from a normal distribution with the mean μij and standard deviation σij. We then loop over the MS1 sampling points t1, t2, ... and determine how many copies of each peptide elute in each MS1 interval. Peptides on the exclusion list are removed from consideration. The remaining peptides are ordered by eluted copy number in that cycle, and the top c most abundant peptides are considered detected. This continues until the final MS1 sampling interval. We do multiple replicates of this process and keep track of detection probabilities for individual peptides or proteins.

Figure 1. Algorithm for the simulation model.
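The simulation loop described above can be sketched in Python as follows. This is an illustrative simplification, not the authors' R program: instead of drawing an elution time for every peptide copy, it uses the expected number of copies eluting in each MS1 interval (from the normal CDF), it counts copies over the whole interval, and it treats a peptide as a candidate only when at least one copy is expected. The function names, the dictionary layout, and the `make_peptides` helper are assumptions.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of normal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def simulate_run(peptides, c=5, t0=1.5, run_length=3600.0, t_ex=30.0, b=0.5):
    """Return the indices of peptides detected in one simulated run.

    peptides: list of dicts with keys 'mu', 'sigma', 'abundance', 'mass'.
    c: MS2 scans per cycle; t0: seconds between MS1 scans;
    t_ex: dynamic exclusion time (s); b: exclusion mass window (Da).
    """
    detected = set()
    exclusion = []  # (mass, expiry time) pairs
    for k in range(int(run_length / t0)):
        start, end = k * t0, (k + 1) * t0
        # Drop exclusion entries that have expired.
        exclusion = [(m, exp) for m, exp in exclusion if exp > start]
        candidates = []
        for idx, p in enumerate(peptides):
            # Expected copies eluting in this interval (assumed simplification).
            copies = p["abundance"] * (normal_cdf(end, p["mu"], p["sigma"])
                                       - normal_cdf(start, p["mu"], p["sigma"]))
            if copies < 1.0:
                continue
            # Skip peptides whose mass is within b of an excluded mass.
            if any(abs(p["mass"] - m) <= b for m, _ in exclusion):
                continue
            candidates.append((copies, idx))
        # The top c most intense candidates are considered detected.
        candidates.sort(reverse=True)
        for _, idx in candidates[:c]:
            detected.add(idx)
            exclusion.append((peptides[idx]["mass"], start + t_ex))
    return detected


def make_peptides(n, mu, sigma=15.0, abundance=1000.0):
    """Hypothetical helper: n equally abundant peptides with the same mean ET."""
    return [{"mu": mu, "sigma": sigma, "abundance": abundance,
             "mass": 500.0 + 2.0 * i} for i in range(n)]


few = simulate_run(make_peptides(3, 100.0), run_length=300.0)
many = simulate_run(make_peptides(400, 100.0), run_length=300.0)
print(len(few), "of 3 detected;", len(many), "of 400 detected")
```

In this toy setting, three coeluting peptides are all selected, whereas 400 peptides sharing one mean ET cannot all be selected because each cycle offers only c MS2 slots.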


This simulation was implemented as an R (R Core Development Team) program. Further details are available in the Supporting Information. Figures S2 and S3 in the Supporting Information show histograms of ET for individual copies of specific peptides. The distributions are quite variable in shape. We have chosen a normal distribution for modeling the ET of individual copies as a compromise, but the ET distributions for many peptides appear very different from a normal distribution. We do not expect that this simplification will affect the broad qualitative conclusions from our model, but it may affect the details.

We note here that the model is only simulating the effect of competition between coeluting peptides on detection. It does not consider the many other causes of missed detections, such as the failure to correctly match the peptide to the database. These other causes of missed detections are important. However, to at least a first-order approximation, they are independent causes, and it is not necessary to model them to model the effect of competition between coeluting peptides. Including them would only complicate the results and not shed any light on the effect that we are interested in. The peptide detection probabilities produced by the simulation model are likely higher than reality because these other effects are not included. However, our focus is the change in detection probability caused by peptide coelution, and for that the model gives valuable insight.

Model Parameters

The scanning interval t0 (the time between MS1 scans) was set to 1.5 s, the length of the MS1 scan was set to 0.2 s, and c (the number of MS2 scans) was set to 5. The total experiment length was set to 60 min. The distribution of the mean peptide ETs (i.e., the μij values) is crucial to our results. We used an empirical distribution based on our data to make this as accurate as possible. That is, the μij values in the simulation model were sampled with replacement from the predicted ETs for the tryptic peptides of detected proteins in the complex mixture (i.e., the whole cell lysate). The standard deviation in ET (σij) was set equal to 15 s for all peptides. This is also based on our data (see Results for details).

RESULTS

The overall goal of this study is to explore how the detection probability of a peptide depends on the number of peptides that coelute with it. We will use a combination of data analysis and simulations to accomplish this. We will work with two data sets, one a more complex mixture and the other a simpler mixture with just three proteins. Each data set has two technical replicates. We will use the data sets to demonstrate key results and the simulations to explain those results. As such, we will intermix the data and simulation results in the following.

Initial Steps of Data Analysis

Our first step is to examine the overall distribution of ETs in the two samples. Because we want to relate ETs to detection probabilities, it is necessary to know something about ETs of both peptides that were detected and peptides that were present in the sample but not detected. A fundamental assumption of our analysis is that if at least one peptide of a protein is detected, then all tryptic peptides of that protein are present in the mixture whether they were detected or not. This assumption allows us to make inferences about undetected peptides.

The more complex mixture data set had 93 detected proteins in two replicates. The simpler mixture had three proteins detected in two replicates. We conducted in silico digestion for each data set using the Protein Digestion Simulator (http://omics.pnl.gov/software/ProteinDigestionSimulator.php) to produce a list of tryptic peptides for each protein. This produced a total of 2659 peptides for the complex mixture and 148 peptides for the simple mixture. Predicted ETs were generated for each theoretical peptide, as described in the Methods. It was necessary to use predicted ETs instead of observed ones so that we would have ETs for all theoretical peptides whether observed or not. Figure 2 shows histograms of the predicted ETs for the two data sets.

Figure 2. Histogram of predicted normalized elution times for the complex (a) and simple (b) mixtures.

On the basis of these histograms, we separated the complex mixture peptides into four categories: normalized elution time (NET) < 0.2, 0.2 < NET < 0.4, 0.4 < NET < 0.6, and NET > 0.6. These categories correspond to high, medium, low, and very low numbers of coeluting peptides. Likewise, we divided the simple mixture into categories of NET < 0.4, 0.4 < NET < 0.6, and NET > 0.6.

Simple Coelution Calculation

Next, we will make a simple calculation to get a rough estimate of the number of peptides coeluting at a given point in time. This calculation is only used for illustrative purposes and is not used in the rest of the paper. Suppose that there are M proteins in the sample. For simplicity, assume that each protein has the same number k of peptides and that each peptide elutes for the same amount of time τ. Take T as the total time of the experiment. Assume that the times at which peptides start eluting are uniformly distributed on the interval (0, T) and that elution times are independent between peptides. The probability that a given peptide is eluting at time t0 is τ/T. Then, the number of peptides eluting at time t0 is distributed as binomial(Mk, τ/T), and the expected number of peptides eluting at a given point in time is Mkτ/T.

We take T as 60 min, a typical length of an LC−MS/MS run. We will take k = 30 peptides per protein, which is the approximate average number of peptides per protein in our data. We should choose τ to be the time period in which most (by some definition) of the copies of a peptide elute. Discussion in the Supporting Information shows that 30−60 s would be a reasonable value for τ. If we take 45 s, then Mkτ/T = M(30)(45)/3600 = 0.375M peptides. This means an average of 37.5, 187.5, and 375 coeluting peptides for 100, 500, and 1000 protein mixtures. Thus, for complex mixtures, the number of coeluting peptides will be much higher than the number that can be detected in a given MS1/MS2 cycle.
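The arithmetic of this back-of-the-envelope calculation can be checked with a few lines of Python. The function name is hypothetical; the standard deviation is simply that of the binomial(Mk, τ/T) distribution stated in the text and is included only to give a sense of spread.

```python
import math


def coeluting_summary(n_proteins, peptides_per_protein, tau, total_time):
    """Mean and standard deviation of the number of coeluting peptides,
    assuming it follows binomial(M * k, tau / T)."""
    n = n_proteins * peptides_per_protein
    p = tau / total_time
    return n * p, math.sqrt(n * p * (1.0 - p))


# k = 30 peptides per protein, tau = 45 s, T = 3600 s, as in the text.
for m in (100, 500, 1000):
    mean, sd = coeluting_summary(m, 30, 45.0, 3600.0)
    print(f"M = {m}: mean = {mean}, sd = {sd:.2f}")
```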


sample complexity and peptide ET. Each protein has 30 peptides and the peptide ETs are sampled from the data set ETs, as described in the Methods. It should be emphasized that these simulations are only modeling the effect of peptide coelution on detection and thus assume that coelution is the only source of missed detections. The peptide detection probability starts high when the sample is very simple and decreases steadily as the number of proteins increases, with a detection probability of ∼0.45 when there are 400 proteins in the sample and ∼0.25 when there are 1000 proteins in the sample. The shape of the curve is near-linear, showing that peptide detection probability always decreases when there are more peptides, regardless of whether the sample is simple or complex. This is due to the effect of competition for detection between peptides. When there are more coeluting peptides, then individual peptides are less likely to be detected. Any increase in the number of peptides makes it more likely that there will be more coeluting peptides in some sampling intervals than can be detected by the MS. This is true even when the number of potential peptide detections by the MS instrument is much larger than the number of peptides in the sample because the peptide ETs may be distributed in such a way that the number of coeluting peptides at particular points in time is more than can be detected. This effect of the distribution of ETs is shown in Figure 4. The top plot shows the distribution of peptide ETs in the simulation. This distribution was generated by sampling with replacement from the predicted ETs of proteins detected in the data. The middle plot shows how the peptide detection probability depends on ET for a mixture with 1000 proteins, and the bottom one shows the same for a mixture with 100 proteins. The fluctuations seen in the detection probability curve are due to random fluctuations in the number of peptides eluting in different ET ranges. 
In the 1000 protein mixture, there is a strong

Balanced against this is the fact that there is a very large number of potential peptide detections in an LC−MS/MS experiment. For the settings used in our data, there was one MS1 scan and five MS2 scans in each cycle of 1.5 s. If dynamic exclusion were perfect, this would mean up to five peptides could be detected per 1.5 s. Suppose that the elution start times for the peptides were uniformly distributed over the 3600 s of the experiment. If there were 400 proteins with 30 peptides each, then an average of 400*30/3600*1.5 = 5 peptides would start eluting per 1.5 s interval. If the peptides were perfectly uniformly spaced over the 3600 s, then all could be detected. If there were 800 proteins, then an average of 10 peptides would start eluting in each 1.5 s interval. Even with perfect dynamic exclusion and perfect spacing of the peptides, only half could be detected. Of course, dynamic exclusion is not perfect and peptides are not perfectly evenly spaces, and so this calculation is too simplistic and will greatly overestimate the number of possible detections. We emphasize again that these calculations are for illustrative purposes only and are not used in the remainder of our results. Simulation Results − Sample Complexity and Distribution of Peptide Detection Probability with ET

The calculations above give some rough intuition about the dynamics of peptide competition for detection between coeluting peptides. We see that there is the potential for there to be many missed detections due to this phenomenon, but it depends on the details of the peptide ET distribution and other factors. Next, we will use computer simulations to examine the relationship between sample complexity and peptide detection probability in more detail. Figure 3 shows simulation results for the relationship between average peptide detection probability and the number of proteins in the samples. For this simulation, we assume that all proteins have the same abundance. This allows us to separate the effects of protein abundance from those of

Figure 3. Simulation results for the relationship between average peptide detection probability and the number of proteins in the samples. The y-axis points are the average proportion of peptides detected over 400 replicate runs of the simulation for sample protein numbers ranging from 100 to 1000. 352

dx.doi.org/10.1021/pr400034z | J. Proteome Res. 2014, 13, 348−361

Journal of Proteome Research

Article

Figure 4. Simulated dependence of detection probability on ET. The top figure shows the empirical distribution of ETs used for the simulation. These were derived from sampling with replacement from the predicted ETs in the data. The middle plot and bottom plots show the peptide detection probability versus for a mixture with 1000 proteins (middle) and 100 proteins (bottom). The points on the curve are the proportion of times in which individual peptides with the ETs shown were detected in 400 replicates.

described in the Methods to estimate the peptide detection probabilities in each ET class. In the results shown next, we did not separate by abundance class. We will show results separated by abundance class later. Table 1 shows the estimated peptide

inverse relationship between the detection probability for a peptide and the number of other peptides eluting at the same time. A large number of peptides elute in the first few minutes of the experiment, with about one-quarter of peptides having ET < 5 min. The detection probability in this ET range is 40 min range. The dependence of the detection probability on ET is much weaker in the simpler mixture with only 100 proteins. The peptide detection probability is 85% or higher over the entire ET range, except for ET < 10 min where there are many coeluting peptides. In the ET < 10 min range, the detection probability reaches a minimum at ∼0.55. With 1/10 the proteins, there are on average 1/10 the number of peptides coeluting in the 100 protein mixture than in the 1000 protein mixture. Thus, the impact of coelution on peptide detection is much less, although it is still significant.

Table 1. Estimated Peptide Detection Probability for the Simple Mixturea NET range



NET < 0.4 0.4 < NET < 0.6 0.6 < NET < 1.1

0.12 [0.001−0.2] 0.26 [0.05−0.4] 0.25 [0.05−0.4]

a

Rows correspond to the different normalized elution time (NET) categories. p̂ is the maximum likelihood estimate for the peptide detection probability. The support intervals (range of p̂ values with likelihood greater than 1/10 of maximum) are shown in brackets.

detection probability as a function of the NET class for the simple mixture. The number shown is the estimated probability that a peptide of the given NET class is detected in a single replicate. The simple mixture has only three proteins. On the basis of our simulations, we would expect that there would be minimal competition for detection between peptides in a mixture this simple. This is observed in the results. The detection probability is high (0.12 to 0.26) and is not significantly different between NET classes. Although the difference was not significant, there was a weak inverse relationship between

Data Results − Sample Complexity and Distribution of Peptide Detection Probability with ET

Next, we will explore whether the data show the patterns seen in the simulations. We use the maximum likelihood method 353

dx.doi.org/10.1021/pr400034z | J. Proteome Res. 2014, 13, 348−361

Journal of Proteome Research

Article

To find further evidence to support the idea that the pattern observed in the first three NET categories is due to coelution dynamics, we calculated the Pearson’s correlation coefficient between peptide detection probability and the average rate at which peptides are eluting in each NET interval. This rate was calculated by dividing the number of peptides in each NET category by the length of the NET category. If we exclude the NET > 0.6 category, then there is a near-perfect linear correlation (Pearson correlation coefficient = −0.9996, p value=0.017) between peptide detection probability and the rate of eluting peptides. Figure S7 in the Supporting Information shows a plot of peptide detection probability versus the average rate of peptide elution. It is interesting to observe that the maximum peptide detection probability is approximately the same for both data sets: 0.26 in the simple mixture and 0.21 in the complex mixture for NET categories with few eluting peptides. This appears to be the baseline detection probability for peptides that are not coeluting with many other peptides. The peptide detection probabilities are substantially higher in the simulation results than in the data. The estimated detection probabilities in the complex mixture range from 0.00445 to 0.21. The detection probabilities in the simulation ranged from 0.037 to 1.0. This may be because the number of proteins in the complex mixture is higher than the 1000 that we assumed in the simulations. It may also be because the simulations do not include varying abundance. In the simulations below with varying abundance, the peptide detection probabilities are much smaller for lower abundance peptides. It should also be emphasized that the simulations are only modeling the effect of peptide coelution

number of coeluting peptides and peptide detection probability. This is as we expect from the simulations, where any change in number of coeluting peptides was seen to change the detection probability. There was a much stronger relationship between detection probability and number of coeluting peptides observed in the complex mixture, as shown in Table 2.

Table 2. Estimated Peptide Detection Probability for the Complex Mixture

NET range          detection probability [support interval]
NET < 0.2          0.0035 [0.002−0.006]
0.2 < NET < 0.4    0.07 [0.06−0.08]
0.4 < NET < 0.6    0.12 [0.11−0.15]
NET > 0.6          0.02 [0.01−0.04]

The estimated peptide detection probability varies from 0.0045 in the NET class 0 to 0.2, which has the highest number of peptides, to 0.21 in the NET class 0.4 to 0.6, which has a low number of peptides. The strong inverse relationship between peptide detection probability and number of coeluting peptides is shown in Figure 5. On the basis of the simulations, this is precisely the pattern that we expect if competition for detection between peptides is a major contributor to the failure of peptides to be detected. This pattern breaks down in the category NET > 0.6, where the peptide detection probability is low at 0.02. The reason for this is unclear. There are only 79 total peptides in this category, of which three were detected. The support interval for the estimate is quite broad, ranging up to 0.04.
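The significance level quoted earlier for the elution-rate correlation (Pearson correlation coefficient = −0.9996, p value = 0.017) can be checked by hand, assuming it comes from the standard t test for a Pearson correlation on the three NET categories that remain once NET > 0.6 is excluded. That assumption is ours; the text does not name the test. With n = 3 the statistic has one degree of freedom, so the t distribution reduces to a standard Cauchy distribution and the two-sided p value has the closed form p = (2/π) arccos|r|. A minimal Python sketch (the function and variable names are illustrative):

```python
import math

def pearson_p_value_n3(r):
    """Two-sided p value for a Pearson correlation computed from exactly 3 points.

    Assumes the usual t test, t = r * sqrt(n - 2) / sqrt(1 - r**2), with
    n - 2 = 1 degree of freedom. The t distribution with 1 df is the standard
    Cauchy distribution, which gives p = (2 / pi) * acos(|r|).
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    return 2.0 * math.acos(abs(r)) / math.pi

# Coefficient as reported in the text (rounded to four decimals).
p_reported = pearson_p_value_n3(-0.9996)

# Bounds on p for any r that rounds to -0.9996.
p_low = pearson_p_value_n3(-0.99965)
p_high = pearson_p_value_n3(-0.99955)

print(f"p at r = -0.9996: {p_reported:.4f}")
print(f"p range compatible with rounding: {p_low:.4f} to {p_high:.4f}")
```

Under this assumption the rounded coefficient gives p ≈ 0.018, and any r that rounds to −0.9996 gives a p value between roughly 0.017 and 0.019, which is consistent with the reported value. The calculation also shows why a near-perfect correlation yields only moderate significance here: three points leave a single degree of freedom.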

Figure 5. Relationship between peptide detection probability and number of coeluting peptides in the complex mixture. The black solid curve is a histogram of the NET values in the Trypanosoma brucei whole cell lysate. The red dashed curve shows the estimated peptide detection frequencies for peptides of the ET ranges shown.


to the ET range 36 min. Most peptides have a detection probability higher than 80%, regardless of protein abundance. There is minimal dependence of detection probability on abundance except for very low abundance peptides, although the dependence appears to be linear in this range. Figure S9 in the Supporting Information shows plots of detection probability versus ET for small ranges of protein abundance. The shapes of the curves are similar to what is observed in Figure 4. Lower abundance peptides have an extended range of ET with near-zero detection probability and only have increased detection probability in the ET range where there are very few coeluting peptides. Higher abundance peptides have low detection probability when they have an ET of ∼5 min, where there are many other eluting peptides. Their detection probability increases quickly as the number of coeluting peptides decreases. Overall, we see a strong dependence of detection probabilities on both abundance and ET. Detection probabilities range from zero to one for all except the highest abundance levels, depending on ET. Likewise, detection probabilities for any ET range from zero to one, depending on abundance.

on detection and thus assume that coelution is the only source of missed detections. Of course, there are many other causes of missed detections, and thus we would expect that the simulation detection probabilities would be higher than reality. The range of variation in detection probability was nearly identical between the simulations and the data, with both having roughly 50-fold variation in detection probability between different NET categories.

Simulation Results: Effect of Protein Abundance on Peptide Detection

Next, we look at interaction between protein abundance, ET, and peptide detection probability. We simulated a sample with 1000 proteins with varying abundances. The abundance is quantified as the number of copies of the protein in the sample. The abundance of each protein was sampled from a gamma distribution with the shape parameter equal to 1 and the scale parameter equal to 200. This gives an abundance distribution with a shape similar to that observed in the data (see Figure S8 in the Supporting Information). Only relative abundances have any impact in our simulations. Thus, the fact that the absolute abundances are low relative to reality is unimportant. Figure 6

Data Results: Effect of Protein Abundance on Peptide Detection

We also considered protein abundance in our data analysis. Variation in detection probability of individual peptides should be due to individual peptide characteristics. Variation in peptide detection probabilities between proteins should be due to protein abundance or other protein level causes. Thus, we characterized the abundance of a protein by the mean (across its peptides) proportion of replicates in which its peptides were detected. We emphasize that although we use the shorthand terminology of “abundance” to refer to this proportion it is really reflecting general protein-level changes in detection probability (see Discussion). Figure S8 shows a histogram of the peptide mean detection proportion for the complex mixture. A total of 54 out of the 93 detected proteins in the complex mixture had an average peptide detection proportion of 0.6

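NET range          low abundance class     medium abundance class    high abundance class
NET < 0.2          0.002 [0.001−0.004]     0.0045 [0.0015−0.0125]    0.02 [0.006−0.06]
0.2 < NET < 0.4    0.03 [0.02−0.03]        0.14 [0.11−0.18]          0.39 [0.31−0.50]
0.4 < NET < 0.6    0.05 [0.03−0.06]        0.22 [0.16−0.30]          0.52 [0.39−0.62]
NET > 0.6          0.006 [0.005−0.03]      0.0001 [0.0001−0.05]      0.17 [0.05−0.45]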

Figure 7. Simulated ET distribution of proteotypic peptides. The top panel is a scatter plot for all peptides in the simulation shown in Figure 6. The bottom panel is a scatterplot of peptides detected in at least 75% of replicates.

Data Results − Proteotypic Peptides

effect of both abundance class and NET. The lowest detection probability is 0.002 for low abundance class peptides with NET < 0.2. The highest detection probability is 0.52 for high abundance class peptides with 0.4 < NET < 0.6. This is a 260-fold range of variation. Detection probabilities change by a factor of 10 over abundance categories within NET categories and by a factor of 25−50 over NET categories within abundance categories. That is, ET had a several-fold larger effect than abundance class.
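The fold ranges quoted above can be recomputed from the point estimates listed above. The short Python sketch below does so for the three NET categories below 0.6. Leaving out NET > 0.6 is our assumption, made because the quoted 260-fold range and the 25−50-fold NET effect are only reproduced without that sparsely populated category. The column order (low, middle, high abundance class) is inferred from the text, and the variable names are illustrative.

```python
# Point estimates of peptide detection probability, copied from the values above.
# Keys are NET categories; tuple entries are (low, middle, high) abundance class.
# The NET > 0.6 row is deliberately left out (an assumption, see the lead-in).
estimates = {
    "NET < 0.2": (0.002, 0.0045, 0.02),
    "0.2 < NET < 0.4": (0.03, 0.14, 0.39),
    "0.4 < NET < 0.6": (0.05, 0.22, 0.52),
}

def fold(values):
    """Ratio of the largest to the smallest value."""
    return max(values) / min(values)

# Effect of abundance class within each NET category.
fold_over_abundance = {net: fold(row) for net, row in estimates.items()}

# Effect of NET category within each abundance class (one column at a time).
columns = list(zip(*estimates.values()))
fold_over_net = {
    name: fold(col) for name, col in zip(("low", "middle", "high"), columns)
}

# Overall range across every cell considered.
overall_fold = fold([p for row in estimates.values() for p in row])

for net, f in fold_over_abundance.items():
    print(f"{net}: {f:.1f}-fold across abundance classes")
for name, f in fold_over_net.items():
    print(f"{name} abundance class: {f:.1f}-fold across NET categories")
print(f"overall: {overall_fold:.0f}-fold")
```

The within-NET ratios come out at about 10 to 13 and the within-abundance ratios at about 25 to 49, matching the quoted factors of 10 and 25−50; the overall ratio is 0.52/0.002 = 260.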

Figure 8 shows histograms of NET for all peptides of proteins that were detected in the complex mixture (top), peptides that were detected once (middle), and peptides that were detected twice (bottom). There is a major shift in the NET distribution between the three plots. The more often peptides are detected, the more their NET distribution is shifted toward values where there are few coeluting peptides. The range NET < 0.2, in which 50% of all peptides lie, is greatly reduced among peptides detected once and is completely absent from peptides detected twice. That is, the most detectable peptides had NET values in the range with few coeluting peptides. This gives strong support for the model prediction that a major determinant of a peptide being proteotypic is that it does not coelute with many other peptides.

Simulation Results − Proteotypic Peptides

Figure 7 shows plots comparing the distribution of elution times as a function of abundance for all peptides in the simulation (top) and for only those that were detected in at least 75% of replicates (bottom). Not surprisingly, peptides of high abundance proteins were detected with high frequency. However, peptides of low abundance were also detected in >75% of replicates if they had abnormal (high) elution times. That is, the simulations show that the peptide ET is a major determinant of whether a peptide is proteotypic. Specifically, peptides with abnormal ET will tend to be proteotypic.

Simulation Results − Uniform ET Distribution

The above results show that the presence of a peak in the ET distribution with many coeluting peptides causes many peptides to not be detected. This raises the question of how much detection probabilities could be increased by changing the ET distribution.


Figure 8. NET distribution of proteotypic peptides. The top plot shows a histogram of the NET distribution for all tryptic peptides of detected proteins, the middle plot shows the histogram for all peptides detected at least once, and the bottom plot shows the histogram for peptides detected twice out of two replicates.

Figure 9. Simulated effect of uniform ET distribution. The simulation shown in this plot was the same as Figure 6, except that ET values were sampled from a uniform (3.4 min, 61.8 min) distribution. The dashed curve shows the running mean of detection probabilities versus abundance from the simulation in Figure 6

Figure 9 shows the impact of replacing the observed ET distribution with a uniform ET distribution. The abundance values for this simulation were identical to those used for the simulations shown in Figure 6. However, the ET values were sampled from a uniform distribution with lower and upper boundaries the same as the first and last ET values observed in the data. The dashed


from the data. We found that this mean ET is a major determinant of the peptide’s detection probability in a complex sample. For low and medium abundance peptides, variation in ET caused detection probabilities to range from near-zero to near-one for fixed abundance, with 1000 proteins in the sample and the model parametrized similarly to the experimental conditions of our data. For high abundance peptides, the detection probabilities ranged from 10 to 100%. Peptides that elute in the range 3 to 10 min, where there is a high number of coeluting peptides, had low detection probability. This detection probability increased as the ET increased and the number of coeluting peptides decreased. Overall, the magnitudes of the effects of ET and abundance in the model were similar. An implication of these results is that ET is likely to be a major factor in causing some peptides to be proteotypic. A peptide that has an ET with few coeluting peptides for a given set of experimental conditions will have a high detection probability relative to other peptides of the protein. It will have a strong tendency to be detected when the protein is detected and thus be classified as proteotypic.

blue curve shows the running mean of detection probabilities. For comparison, the solid red curve shows the running mean of peptide detection probabilities from the simulation in Figure 6. We see that peptide detection probabilities are substantially increased with the uniform ET distribution. At an abundance of 200, the average peptide detection probability is ∼0.2 with the empirical ET distribution and ∼0.4 for the uniform ET distribution. At an abundance of 500, the average peptide detection probabilities are approximately 0.58 and 0.8, respectively. Furthermore, there is far less variation in detection probability with the uniform ET. In the simulation with empirical ETs, many peptides had very low detection probabilities and would rarely be detected unless there were many replicates in the experiment. In contrast, most peptides are detectable at moderate abundance in the case with the uniform ET distribution.



DISCUSSION

It is well-established that peptides vary greatly in their detectability,2−4,11,12 and many factors play a role in determining this variation.22 However, the relative importance of these factors is not well understood. The goal of this study was to quantify the impact of one such factor: the competition for detection among coeluting peptides. The reason for this competition is that the MS/MS scan can only detect a limited number of peptides at a time. Simple calculations (see Results) show that the number of simultaneously eluting peptides will often greatly exceed this number. A peptide will be detected only if its elution intensity is in the top c values in some scanning interval. In a complex mixture, this will not ever occur for many peptides. Using data and simulations, we have shown that competition between peptides for detection is a major determinant of whether peptides will be detected in an LC−MS/MS experiment and that the characteristic ET of a peptide for a given set of experimental conditions plays a major role in determining its detection probability. The size of the effect is several-fold larger than that of abundance and other protein-level factors. Our data showed 10−20 fold variation in peptide detection probability due to protein-level factors, while there was 10−50 fold variation due to ET. Our results show that the most detectable peptides are those that coelute with a small number of other peptides.
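The top-c rule described above can be illustrated with a small deterministic sketch in Python. It is not the simulation used in this study: it ranks peptides by their expected number of eluting copies per scanning interval rather than drawing random copies, and the values of c, the scan length, the intensity threshold, and the example peptides are all hypothetical. Only the normal elution profile and the 15 s (0.25 min) standard deviation are taken from the model's stated assumptions.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def detected_peptides(mean_ets, abundances, c, scan_len, sd=0.25, min_intensity=1.0):
    """Indices of peptides ranked in the top c of at least one scan interval.

    mean_ets      -- mean elution time of each peptide, in minutes
    abundances    -- number of copies of each peptide
    c             -- peptides selected per scan interval (hypothetical value)
    scan_len      -- scan interval length in minutes (hypothetical value)
    sd            -- SD of copy elution times; 0.25 min = 15 s
    min_intensity -- expected copies needed to be a candidate (hypothetical)
    """
    t_start = min(mean_ets) - 4.0 * sd
    t_end = max(mean_ets) + 4.0 * sd
    n_scans = math.ceil((t_end - t_start) / scan_len)
    detected = set()
    for k in range(n_scans):
        lo = t_start + k * scan_len
        hi = lo + scan_len
        candidates = []
        for i, (mu, n) in enumerate(zip(mean_ets, abundances)):
            # Expected number of copies of peptide i eluting in [lo, hi).
            expected = n * (normal_cdf(hi, mu, sd) - normal_cdf(lo, mu, sd))
            if expected >= min_intensity:
                candidates.append((expected, i))
        # Keep the c most intense candidates in this interval.
        candidates.sort(reverse=True)
        detected.update(i for _, i in candidates[:c])
    return detected

# Ten crowded peptides sharing a mean ET of 5 min, plus one lone,
# low-abundance peptide at 30 min (all values hypothetical).
mean_ets = [5.0] * 10 + [30.0]
abundances = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50]
detected = detected_peptides(mean_ets, abundances, c=3, scan_len=0.1)
print(sorted(detected))
```

In this toy case only the three most abundant crowded peptides (indices 0, 1, 2) and the lone peptide (index 10) are selected. The lone peptide is found even though every undetected crowded peptide is more abundant than it, which is the qualitative behavior attributed above to competition for detection.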

Data Results

We introduced a novel method for estimating peptide detection frequencies as a function of specified peptide characteristics. In contrast with previous methods whose goal is identifying proteotypic peptides4,8,10−12 or estimating detection probabilities for individual peptides,2,3,6 the goal of our method is to estimate peptide detection probabilities as a function of ET or other peptide characteristics. Using this method, we determined how peptide detection probabilities vary with ET and abundance class. We used two data sets. One is a very simple mixture with three proteins. The other is a complex mixture from a whole cell lysate from procyclic form Trypanosoma brucei. In the simple mixture, there was not a significant difference in peptide detection probability between different ET ranges, although the point estimate for detection probability did vary by a factor of two between the range NET < 0.4, where there are more coeluting peptides, and the other NET categories, where there are fewer coeluting peptides. In the complex mixture, estimated peptide detection probabilities varied by 260-fold between the peptides with the lowest detection probabilities and the peptides with the highest detection probabilities. There was 25−50 fold variation in detection probability between NET classes within abundance classes and 10-fold variation in detection probability between abundance classes within NET classes. Thus, the effect of ET variation on detection probability was larger than the effect of abundance. There was a strong inverse relationship between the number of coeluting peptides and the detection probability, giving strong support to the model prediction that competition between coeluting peptides should be a major determinant of peptide detection. This conclusion was further supported by results showing that the more replicates in which peptides were detected, the more their NET distribution was shifted toward higher values where there are few coeluting peptides. 
Finally, there was a strong linear correlation between peptide detection probability and the peptide elution rate in an NET interval. The data strongly support the idea that competition between coeluting peptides is a major determinant of whether a peptide is proteotypic. The data have close agreement with the model in several important respects. First is the result that the effect of ET is much smaller but still present in the simple mixture compared to the complex mixture. Second is the result that the detection

Model Results

Little has been previously known about the dynamics of peptide competition in coelution, and thus we developed a mathematical model to explore these dynamics. The question that we wish to answer is, given the conditions of a typical LC−MS/MS experiment, how many peptide detections are missed because there are too many coeluting peptides? It is clear that this effect will decrease peptide detections. The challenge is in quantifying the effect, and in this a mathematical model can be very useful. The dynamics are too complicated for simple intuition but are amenable to a straightforward mathematical treatment. Although the model is simple, it gives important insights that seem well-supported by our data. The model shows that competition for detection between coeluting peptides is likely a major determinant of peptide detection probabilities. That is, given reasonable assumptions about the coelution dynamics, the model shows that in a complex mixture there will be many peptides that are never in the top c values in any MS1 scan and are therefore never detected. In our simulations, the ETs for individual peptide copies follow a normal distribution with mean and variance specific to the peptide. These mean ETs followed an empirical distribution derived


probability for a peptide is inversely related to the number of coeluting peptides. Third is the result that the effects of ET and abundance were of similarly large magnitude in the complex mixture. Fourth is the result that the magnitude of variation in detection probabilities is similar between model and data. This agreement between model and data supports our contention that the model does capture important elements of the coelution dynamics and that the patterns seen in the data are, in fact, due to competition between coeluting peptides.

apparently due to a bias for detection of longer peptides was of a similar magnitude to variation due to abundance. Their results also include a theoretical model that demonstrates, similarly to ours, a complicated relationship between the properties of a peptide and its detectability. Further work is necessary to understand the full range of variation in peptide detection probability. As with any mathematical model, our model is based on assumptions, and the results are only valid to the extent that the assumptions are reasonable. A key point is that we are only modeling the impact of elution time distributions and protein abundance on peptide detection. We have not looked at other factors such as ionization efficiency or the probability that the database search correctly identifies a peptide. If these factors are independent from the elution time distribution, then it is reasonable to model them separately. However, if there are correlations, then this complicates the analysis. In our formulation, the elution distributions are correlated because of their time dependence but independent when conditioned on time (both between different copies of the same peptide and between different peptides). This implies that there is no interaction between the peptide copies in the LC or at any other point in the process. Such interactions could change the dynamics substantially. Our model assumes that ET for peptide copies follows a normal distribution with mean specific to the peptide. Figures S2 and S3 in the Supporting Information show that this assumption is questionable for many peptides. We do not expect that differences in this distribution would have a substantial qualitative effect on our model results, but future work should look into this issue. We assume that the standard deviation in ET is 15 s for all peptides. This was the observed median value, but some individual peptides deviated greatly from this.
Although these deviations from our assumptions are unlikely to have a substantial effect on the broad conclusions of this paper, they are a shortcoming of our model. It should also be noted that there could be biases in the empirical distribution for mean ET because it was based on peptides that belonged to proteins with at least one detected peptide. Thus, for example, if all of the peptides of a protein happen to coelute with many other peptides, then that protein may not be detected and none of its peptides will be part of the ET distribution. Another key question is whether the ET distribution observed in our data is typical. Different experimental platforms and protocols could produce different ET distributions that could either increase or decrease the magnitude of the peptide competition effect. Thus, although our results show that peptide competition is capable of having a major effect on peptide detection probabilities, this may not be the case in all experimental settings. Figure S10 in the Supporting Information shows histograms of the ET distribution for two other data sets.28,29 One resembles a gamma distribution and the other a uniform distribution with tails. Unlike our data, both data sets show regions with few coeluting peptides at both low and high ET values. With these ET distributions, we would expect to see the most detectable peptides occurring at the most extreme low and high ET values.

Caveats

The congruence between model and data gives strong evidence that the model is fundamentally correct and that our interpretation of the patterns observed in the data is correct. However, it is possible that the relationship between detection probability and ET is due to some mechanism other than peptide competition. The number of coeluting peptides is highest at small ETs and decreases monotonically with increasing ET. Thus, there could be some other factor that correlates with ET that is actually driving the observed pattern. It has been previously observed that hydrophobicity is an important factor in determining how well peptides ionize in electrospray ionization (ESI). Peptides with higher hydrophobicity have a higher ESI response23−25 and would therefore be expected to have a higher detection probability. Peptide ET also increases with hydrophobicity. Thus, our results could potentially be explained as a manifestation of this well-known phenomenon. However, there are two pieces of evidence that argue against this. First is the high correlation between the rate of peptide elution and the peptide detection probability (Figure S7 in the Supporting Information). Second, the variation in detection probability with ET was much weaker in the simple mixture than in the complex mixture. This is what we would expect to see if peptide competition is the causative mechanism but not if ESI response is the causative mechanism. This is not to argue that ESI response is not important but that it is likely not the factor driving the observed pattern of peptide detection variation. In this study, we characterized individual proteins’ abundance by the average proportion across peptides of replicates in which their peptides were detected. However, any factor that causes protein-level variation in detection probability will affect this proportion, and thus differences between abundance classes as we defined them could be partially due to other factors.
Furthermore, it has been shown that peptide counts (which are closely related to this proportion) are an imperfect measure of abundance. For example, ref 26 found that changes in peptide count between experimental treatments were generally in the same direction as changes in transcription level as measured by microarrays but often many times lower in magnitude. This indicates that although peptide counts are related to abundance, they cannot be viewed as direct measurements of abundance. Thus, although the average protein abundance almost certainly is different between abundance classes as we defined them, we can say little about either individual proteins in those classes or the magnitude of abundance variation between classes. Although we have shown that ET has an impact on detection probability comparable to abundance, nothing in our results implies that there are not other factors with equal or greater impact. It is interesting to compare our results to those of ref 27. They calculated detectability across multiple replicates for peptides in mixtures of combinatorial libraries made from synthetic peptides. They found that variation in detectability

Improving Peptide Detection

We have shown that the ET distribution has a major impact on peptide detection. The presence of a large peak in the ET distribution where there are many coeluting peptides causes many peptides to have low detection probabilities. The question then arises as to whether we can improve detection probabilities


(4) Sanders, W. S.; Bridges, S. M.; McCarthy, F. M.; Nanduri, B.; Burgess, S. C. Prediction of peptides observable by mass spectrometry applied at the experimental set level. BMC Bioinf. 2007, 8.
(5) Liu, H. B.; Sadygov, R. G.; Yates, J. R. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal. Chem. 2004, 76 (14), 4193−4201.
(6) Alves, P.; Arnold, R. J.; Novotny, M. V.; Radivojac, P.; Reilly, J. P.; Tang, H. Advancement in Protein Inference from Shotgun Proteomics using Peptide Detectability. Pac. Symp. Biocomput., 17th 2007, 12, 409−420.
(7) Kuster, B.; Schirle, M.; Mallick, P.; Aebersold, R. Scoring proteomes with proteotypic peptide probes. Nat. Rev. Mol. Cell Biol. 2005, 6 (7), 577−583.
(8) Webb-Robertson, B. J. M.; Cannon, W. R.; Oehmen, C. S.; Shah, A. R.; Gurumoorthi, V.; Lipton, M. S.; Waters, K. M. A support vector machine model for the prediction of proteotypic peptides for accurate mass and time proteomics. Bioinformatics 2008, 24 (13), 1503−1509.
(9) Eichacker, L. A.; Granvogl, B.; Mirus, O.; Muller, B. C.; Miess, C.; Schleiff, E. Hiding behind hydrophobicity - Transmembrane segments in mass spectrometry. J. Biol. Chem. 2004, 279 (49), 50915−50922.
(10) Blonder, J.; Veenstra, T. D. Computational prediction of proteotypic peptides. Expert Rev. Proteomics 2007, 4 (3), 351−354.
(11) Mallick, P.; Schirle, M.; Chen, S. S.; Flory, M. R.; Lee, H.; Martin, D.; Raught, B.; Schmitt, R.; Werner, T.; Kuster, B.; Aebersold, R. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 2007, 25 (1), 125−131.
(12) Mallick, P.; Schirle, M.; Kuster, B.; Aebersold, R. Predicting proteotypic peptides as probes for quantitative proteomics. Mol. Cell. Proteomics 2006, 5 (10), S317−S317.
(13) Lu, P.; Vogel, C.; Wang, R.; Yao, X.; Marcotte, E. M. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat. Biotechnol. 2007, 25 (1), 117−124.
(14) Le Bihan, T.; Robinson, M. D.; Stewart, I. I.; Figeys, D. Definition and characterization of a ″trypsinosome″ from specific peptide characteristics by nano-HPLC−MS/MS and in silico analysis of complex protein mixtures. J. Proteome Res. 2004, 3 (6), 1138−1148.
(15) Nielsen, M. L.; Savitski, M. M.; Kjeldsen, F.; Zubarev, R. A. Physicochemical properties determining the detection probability of tryptic peptides in Fourier transform mass spectrometry. A correlation study. Anal. Chem. 2004, 76 (19), 5872−5877.
(16) Ethier, M.; Figeys, D. Strategy to design improved proteomic experiments based on statistical analyses of the chemical properties of identified peptides. J. Proteome Res. 2005, 4 (6), 2201−2206.
(17) Fusaro, V. A.; Mani, D. R.; Mesirov, J. P.; Carr, S. A. Prediction of high-responding peptides for targeted protein assays by mass spectrometry. Nat. Biotechnol. 2009, 27 (2), 190−198.
(18) Xu, C. M.; Zhang, J. Y.; Liu, H.; Sun, H. C.; Zhu, Y. P.; Xie, H. W. Advance of Peptide Detectability Prediction on Mass Spectrometry Platform in Proteomics. Chin. J. Anal. Chem. 2010, 38 (2), 286−292.
(19) Alves, G.; Ogurtsov, A. Y.; Yu, Y. K. Assigning statistical significance to proteotypic peptides via database searches. J. Proteomics 2011, 74 (2), 199−211.
(20) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. ModifiComb, a new proteomic tool for mapping substoichiometric post-translational modifications, finding novel types of modifications, and fingerprinting complex protein mixtures. Mol. Cell. Proteomics 2006, 5 (5), 935−48.
(21) Petritis, K.; Kangas, L. J.; Ferguson, P. L.; Anderson, G. A.; Pasa-Tolic, L.; Lipton, M. S.; Auberry, K. J.; Strittmatter, E. F.; Shen, Y. F.; Zhao, R.; Smith, R. D. Use of artificial neural networks for the accurate prediction of peptide liquid chromatography elution times in proteome analyses. Anal. Chem. 2003, 75 (5), 1039−1048.
(22) Barton, S. J.; Whittaker, J. C. Review of Factors That Influence the Abundance of Ions Produced in a Tandem Mass Spectrometer and Statistical Methods for Discovering These Factors. Mass Spectrom. Rev. 2009, 28 (1), 177−187.
(23) Fenn, J. B. Ion Formation from Charged Droplets - Roles of Geometry, Energy, and Time. J. Am. Soc. Mass Spectrom. 1993, 4 (7), 524−535.

by changing the ET distribution. It is beyond the scope of this paper to discuss the technical feasibility of manipulating the ET distribution. However, using simulations we can predict the effect of such manipulation. We simulated the best-case scenario of a uniform ET distribution. Average peptide detection probabilities were increased by a factor of 1.5 to 2 over most of the abundance range. Furthermore, the variation in peptide detection probability was greatly reduced. Whereas with the observed ET distribution many peptides are essentially undetectable except at high abundance, nearly all peptides are detectable at low to moderate abundance with the uniform ET distribution. It will not be feasible to get a perfectly uniform ET distribution, but any flattening out of peaks in the distribution will make at least some peptides much more detectable. Spreading out the time over which peptides elute would decrease coelution and therefore increase peptide detection probabilities even if the shape of the distribution remained the same. Of course, this lengthens the time of the experiment, but this tradeoff could be worthwhile if it reduces the number of replicates necessary to detect proteins reliably.
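The effect of flattening or stretching the ET distribution on crowding can be illustrated by counting, for each peptide, how many other peptides have a mean ET within a fixed window. The Python sketch below is only an illustration and uses mean-ET proximity as a rough proxy for coelution. The peaked distribution is a hypothetical gamma-shaped stand-in for the empirical ET distribution, and the window width, number of peptides, stretch factor, and random seed are arbitrary choices. Only the uniform range of 3.4 to 61.8 min comes from the simulation described in Figure 9.

```python
import bisect
import random
import statistics

def coeluting_counts(mean_ets, window):
    """For each peptide (in order of increasing ET), count the other peptides
    whose mean ET lies within +/- window minutes of its own."""
    ordered = sorted(mean_ets)
    counts = []
    for et in ordered:
        lo = bisect.bisect_left(ordered, et - window)
        hi = bisect.bisect_right(ordered, et + window)
        counts.append(hi - lo - 1)  # subtract 1 to exclude the peptide itself
    return counts

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_peptides = 1000

# Hypothetical peaked ET distribution: gamma with shape 2 and scale 5 min,
# shifted so that elution starts at 3.4 min.
peaked = [3.4 + rng.gammavariate(2.0, 5.0) for _ in range(n_peptides)]

# Uniform ET distribution over the range used in the uniform-ET simulation.
uniform = [rng.uniform(3.4, 61.8) for _ in range(n_peptides)]

# Same shape as the peaked distribution, but with the elution time doubled.
stretched = [2.0 * et for et in peaked]

window = 0.5  # minutes, hypothetical
median_peaked = statistics.median(coeluting_counts(peaked, window))
median_uniform = statistics.median(coeluting_counts(uniform, window))
median_stretched = statistics.median(coeluting_counts(stretched, window))

print("median neighbors, peaked:   ", median_peaked)
print("median neighbors, uniform:  ", median_uniform)
print("median neighbors, stretched:", median_stretched)
```

With these settings the typical peptide has markedly fewer neighbors under the uniform and stretched distributions than under the peaked one, in line with the argument that either flattening the peak or lengthening the elution period reduces competition.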



ASSOCIATED CONTENT

Supporting Information

In-gel digestion procedures. Comparison of predicted and actual peptide retention times. A simple coelution calculation. Method for estimating peptide detection probabilities. Distribution of ETs for individual peptide copies. Model derivation. Estimating peptide detection probabilities with two miscleaves. Relationship between peptide detection probability and peptide density. Distribution of protein abundance in the complex mixture. Simulation of variation in detection probability as a function of ET within abundance. ET distributions for other data sets. ET variation among posttranslational modifications of the same peptide. References. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*Phone: (706) 542-4241. Fax: (706) 542-3391. E-mail: pdschlie@uga.edu.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This study was partially supported by resources and technical expertise from the University of Georgia Advanced Computing Resource Center, a partnership between the Office of the Vice President for Research and the Office of the Vice President for Information Technology.



REFERENCES

(1) Koziol, J. A.; Feng, A. C.; Schnitzer, J. E. Application of capture-recapture models to estimation of protein count in MudPIT experiments. Anal. Chem. 2006, 78 (9), 3203−3207.
(2) Tang, H. X.; Arnold, R. J.; Alves, P.; Xun, Z. Y.; Clemmer, D. E.; Novotny, M. V.; Reilly, J. P.; Radivojac, P. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics 2006, 22 (14), E481−E488.
(3) Li, Y. F.; Arnold, R. J.; Tang, H. X.; Radivojac, P. The Importance of Peptide Detectability for Protein Identification, Quantification, and Experiment Design in MS/MS Proteomics. J. Proteome Res. 2010, 9 (12), 6288−6297.


(24) Gordon, E. F.; Mansoori, B. A.; Carroll, C. F.; Muddiman, D. C. Hydropathic influences on the quantification of equine heart cytochrome c using relative ion abundance measurements by electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry. J. Mass Spectrom. 1999, 34 (10), 1055−1062.
(25) Cech, N. B.; Krone, J. R.; Enke, C. G. Predicting electrospray response from chromatographic retention time. Anal. Chem. 2001, 73 (2), 208−213.
(26) Gao, J.; Opiteck, G. J.; Friedrichs, M. S.; Dongre, A. R.; Hefta, S. A. Changes in the protein expression of yeast as a function of carbon source. J. Proteome Res. 2003, 2 (6), 643−649.
(27) Bohrer, B. C.; Li, Y. F.; Reilly, J. P.; Clemmer, D. E.; DiMarchi, R. D.; Radivojac, P.; Tang, H. X.; Arnold, R. J. Combinatorial Libraries of Synthetic Peptides as a Model for Shotgun Proteomics. Anal. Chem. 2010, 82 (15), 6559−6568.
(28) Petritis, K.; Kangas, L. J.; Yan, B.; Monroe, M. E.; Strittmatter, E. F.; Qian, W. J.; Adkins, J. N.; Moore, R. J.; Xu, Y.; Lipton, M. S.; Ii, D. G. C.; Smith, R. D. Improved peptide elution time prediction for reversed-phase liquid chromatography-MS by incorporating peptide sequence information. Anal. Chem. 2006, 78 (14), 5026−5039.
(29) Pfeifer, N.; Leinenbach, A.; Huber, C. G.; Kohlbacher, O. Statistical learning of peptide retention behavior in chromatographic separations: a new kernel-based approach for computational proteomics. BMC Bioinf. 2007, 8.
