Improving Peptide Identification in Proteome Analysis by a Two-Dimensional Retention Time Filtering Approach

Nico Pfeifer,†,¶ Andreas Leinenbach,§,‡,¶ Christian G. Huber,*,§ and Oliver Kohlbacher*,†

Division for Simulation of Biological Systems, Eberhard Karls University Tübingen, 72076 Tübingen, Germany, Department of Chemistry, Instrumental Analysis and Bioanalysis, Saarland University, 66123 Saarbrücken, Germany, and Department of Molecular Biology, Division of Chemistry, University of Salzburg, 5020 Salzburg, Austria

Received January 24, 2009

The combination of a two-dimensional peptide separation scheme based on reversed-phase and ion-pair reversed-phase HPLC with a computational method to model and predict retention times in both dimensions is described. The algorithm utilizes statistical learning to establish a retention model from about 200 peptide retention times and their corresponding sequences. The application of retention time prediction to the peptides facilitated an increase in true positive peptide identifications upon lowering mass spectrometric scoring thresholds and concomitantly filtering out false positives on the basis of predicted retention times. An approximately 19% increase in the number of peptide identifications at a q-value of 0.01 was achievable in a whole proteome measurement.

Keywords: proteome analysis • two-dimensional separation • tandem mass spectrometry • retention time prediction • statistical learning • Sorangium cellulosum

* To whom correspondence should be addressed. For C.G.H.: e-mail, [email protected]; phone, +43(0)662-8044-5704; fax, +43(0)662-8044-5751. For O.K.: e-mail, [email protected]; phone, +49(0)7071-29-70457; fax, +49(0)7071-29-5152.
† Eberhard Karls University Tübingen. ¶ These authors contributed equally to this work. ‡ Saarland University. § University of Salzburg.
10.1021/pr900064b CCC: $40.75 © 2009 American Chemical Society

Introduction

One major requirement in modern biology is the correct identification and quantification of thousands of proteins in proteomic samples. Because of major improvements in experimental techniques during the last decades, protein identification by mass spectrometry-based techniques1,2 has become practical even for large biological studies such as whole proteome measurements of organisms, tissues, or biological fluids. Nevertheless, the methods for protein identification still require improvement because a significant number of spectra cannot be annotated with acceptable accuracy3 or cannot be annotated at all. In standard identification routines by database searching, every peptide spectrum match is assessed with a score reflecting how well the measured spectrum matches a theoretical spectrum of the peptide sequence.4,5 However, top-scoring peptide sequences are not necessarily the correct sequences. Hence, the top-scoring hits delivered by database searching routines also include a significant number of false positive identifications. To distinguish false positive from correct identifications with a preselected statistical significance, identification routines define a scoring threshold. Since the thresholds do not perfectly separate correct and false identifications, there are still some correct identifications with scores below the threshold. To access these false negative identifications, the mass spectrometric identification threshold values need to be lowered in combination with suitable means for eliminating false positives by other identification filters. Many approaches to address this issue are based on machine learning techniques in which a measured parameter such as the chromatographic retention time6 or the isoelectric point of a peptide is compared to its predicted value. Strittmatter et al.7 incorporated the deviation between observed and predicted retention time into their scoring function. Furthermore, Klammer et al.8 and Pfeifer et al.9 introduced very accurate methods for retention time prediction which required a very small number of training peptides. These retention time prediction models were successfully used to filter out false spectrum identifications. Additionally, other properties of the peptides were used to increase the number of identified spectra. Klammer et al.10 predicted the fragmentation of peptides and utilized the predicted likelihood of a spectrum identification to be correct to improve the identification process. Uwaje et al.11 used a database of measured pairs (peptide, isoelectric point) to improve peptide identification.

Because of the limited peak capacities of only one separation dimension, two-dimensional separations are frequently used for the analysis of complex samples. The most common combination for the separation of highly complex peptide mixtures is strong cation exchange chromatography (SCX) with reversed-phase (RP) or ion-pair reversed-phase (IP-RP) high-performance liquid chromatography (HPLC).12 Toll et al.13 and Delmotte et al.14 demonstrated that peptide separation on reversed-phase stationary phases using different pH and eluent additives is applicable to proteome analysis.
Although the two separation dimensions were not fully orthogonal, the combination led to better “peptide identification yield” compared to the classical combination of SCX- with RP-HPLC.14 This is mainly due to the fact that the fractions collected from the first dimension separation are free of involatile salt and can, after concentration and evaporation of the organic solvent, be injected directly into the second-dimension separation system.

In this work, we significantly extend the applicability of peptide retention prediction9 to whole proteome analysis by incorporating retention time predictors for both separation dimensions. By doing so, it is feasible to incorporate essentially four different peptide properties into an identification scheme, namely, peptide retention in high-pH reversed-phase chromatography, peptide retention in low-pH ion-pair reversed-phase chromatography, intact molecular mass, and fragmentation pattern of a peptide. We build a retention model for the first as well as the second separation dimension and then use predicted and observed retention times to build one filter for the first and another for the second dimension. We show that each filter independently improves the precision of the spectrum identifications, whereas the largest improvement in precision can be achieved by combining the filters. Following this protocol, it is possible to obtain about 35% more spectrum identifications at the same precision for the standard protein mixture analyzed. To show the practicability of this approach for the analysis of whole proteomes, the filtering methods are applied to a whole cell lysate of the bacterium Sorangium cellulosum.

Journal of Proteome Research 2009, 8, 4109–4115. Published on Web 06/03/2009

Experimental Section

Peptide Separation and Mass Spectrometric Detection. The data sets for the standard mixture and the whole digested proteome were generated with emphasis on high reproducibility in terms of retention times using an actively split capillary HPLC system (Ultimate 3000, Dionex, Germering, Germany). Separated peptides were detected and identified by electrospray ionization tandem mass spectrometry (ESI-MS/MS) in an ion trap mass spectrometer (HCTultra PTM Discovery System, Bruker Daltonics, Bremen, Germany). Two different tryptic peptide mixtures were analyzed: a simple protein digest comprising peptides from albumin (220 fmol/µL, bovine serum, Sigma Aldrich, St. Louis, MO) and thyroglobulin (410 fmol/µL, bovine thyroid gland, Fluka, Buchs, Switzerland), as a training and validation data set, and a tryptic digest of a whole protein extract from S. cellulosum (So ce56, digest of 690 µg of protein extract), a soil-dwelling bacterium from the group of myxobacteria. Proteins were digested with trypsin (Promega, Madison, WI) using published protocols.15 The peptide mixtures were separated using an offline two-dimensional HPLC setup as described before.14 We combined reversed-phase (RP) high-performance liquid chromatography (HPLC) at pH 10.0 with micro ion-pair reversed-phase (IP-RP) HPLC at pH 2.1. Finally, the training data set was used to characterize both separation dimensions. In total, 36 fractions of the simple protein digest (fractions 4-39) and 31 fractions from the analysis of S. cellulosum (fractions 14-44) were analyzed in triplicate in the second dimension.

Peptide Identification. The MS/MS spectra of the standard mixture were aligned by the algorithm of Lange et al.16 using standard parameters. This was also done for the MS/MS spectra of S. cellulosum.
We identified the MS/MS spectra using Mascot (version 2.2)4 with one missed cleavage, precursor tolerance 1.3, carboxymethyl as fixed modification and deamidated asparagine or glutamine as well as oxidized methionine as variable modifications. For the standard mixture, we searched against the MSDB database, restricted to Chordata (vertebrates and relatives). For the S. cellulosum spectra, we used an in-house database containing all protein sequences of the organism constructed from the published DNA sequence.17 For both data sets, we also searched the spectra against a reverse version of the database. In this way, we could estimate the false discovery rates (FDRs) of the spectrum identifications, depending on the score. The function which maps a score of a spectrum identification to an FDR is not monotonically decreasing. This means that a certain score $s_1$ can have a lower FDR than another score $s_2$ even though $s_1$ is smaller than $s_2$. The consequence is that one cannot define scoring thresholds by just using the FDR. Therefore, we used the FDRs in the same way as Käll et al.18 to compute q-values. In our application, the particular q-values correspond to the minimal FDR at which a spectrum identification is accepted. The q-value cutoffs can be used directly as filter thresholds as in the work of Käll et al.

The q-values for the spectrum identifications were calculated in the following way. We searched the whole set of spectra against a decoy database which contained the reverse protein sequences of the MSDB database restricted to Chordata. This means that all spectrum identifications which we get from the search are very likely false identifications. Statistically speaking, the hits originate from a null distribution. An approximation of the FDR can be calculated as in ref 18 by calculating the ratio of the number of spectrum identifications to the decoy database to the number of spectrum identifications to the correct MSDB database. Let $d_1, d_2, \ldots, d_n$ be the scores of spectrum identifications to the decoy database and let $s_1, s_2, \ldots, s_n$ be the scores of the spectrum identifications to the normal database.
The FDR of a certain score threshold $t$ can then be approximated by

\[
\mathrm{FDR}(t) \approx \frac{\#\{d_i \mid d_i \ge t \wedge i \ge 1 \wedge i \le n\}}{\#\{s_i \mid s_i \ge t \wedge i \ge 1 \wedge i \le n\}}.
\]

The q-value of a certain spectrum identification is then the smallest FDR at which the spectrum identification is accepted. Therefore,

\[
q\text{-value}(s_i) = \min_{t \le s_i} \mathrm{FDR}(t).
\]

All spectrum identifications corresponding to peptides shorter than six amino acids were filtered out since identifications of shorter length are less reliable and in most cases they cannot be mapped uniquely to protein sequences.

Prediction of Retention Times and Filtering by Retention Times. In supervised machine learning, one normally tries to learn the underlying properties of the training data to be able to come up with accurate predictions for additional data which adhere to the same underlying properties. More formally this means that given the training data $\{(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n) \mid x_i \in X \wedge y_i \in Y\}$, we want to approximate the (mostly unknown) underlying distribution $P(y|x)$ as well as possible. The $x_i$ are often called input sequences and the $y_i$ are often called labels. In our application, the input sequences are the peptide sequences, the labels are the retention times, and we want to be able to predict the retention time of a peptide sequence very accurately in each RT dimension. Since the label is a continuous variable, the prediction problem is a regression problem. Linear approaches for regression, like Least Squares or Multiple Linear Regression, can only exploit linear relations between input sequences. This is why nonlinear methods like Kernelized Least Squares or Support Vector Regression (SVR)19 with nonlinear kernels are used to also detect nonlinear relations. In this study, the retention times were predicted with an improved version of the method introduced in Pfeifer et al.9 It uses ν-SVR and a new kernel function called paired oligo-border kernel (POBK) to train the predictor. All methods are integrated into the open-source framework for mass spectrometry (OpenMS).20 The tools for retention time prediction and filtering are part of the OpenMS Proteomics Pipeline (TOPP).21 The software is available free of charge under the GNU Lesser General Public License (LGPL) at http://www.openms.de and can be installed on Linux, Windows, and MacOS. To explain the extensions elaborated for this work, we give an overview of ν-SVR and the POBK. Then, we describe how the predicted retention times can be used to filter out false identifications.

ν-Support Vector Regression. In ε-SVR,19 one tries to learn a function $f : X \to \mathbb{R}$. In training, there is a tube around the optimal regression curve with distance ε. A predicted label for a training sample is not penalized if the deviation is smaller than ε. This means that training samples which are inside the tube are not penalized. This leads to the following optimization problem for ε-SVR:

\[
\begin{aligned}
\underset{w \in H,\ \xi^{(*)} \in \mathbb{R}^n,\ b \in \mathbb{R}}{\text{minimize}} \quad & F(w, \xi^{(*)}) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\
\text{subject to} \quad & f(x_i) - y_i \le \varepsilon + \xi_i \\
& y_i - f(x_i) \le \varepsilon + \xi_i^* \\
& \xi_i, \xi_i^* \ge 0 \quad \forall i = 1, 2, \ldots, n
\end{aligned}
\]

$w$ is the weight vector and $\xi_i$ and $\xi_i^*$ are slack variables, which are introduced to tolerate errors. One problem with ε-SVR is that one has to set the ε a priori. This is why there exists another variant of SVR called ν-SVR:19

\[
\begin{aligned}
\underset{w \in H,\ \xi^{(*)} \in \mathbb{R}^n,\ b \in \mathbb{R},\ \varepsilon \in \mathbb{R}^+}{\text{minimize}} \quad & F(w, \xi^{(*)}) = \frac{1}{2}\|w\|^2 + C \Big( \nu n \varepsilon + \sum_{i=1}^{n} |y_i - f(x_i)|_{\varepsilon} \Big) \\
\text{subject to} \quad & f(x_i) - y_i \le \varepsilon + \xi_i \\
& y_i - f(x_i) \le \varepsilon + \xi_i^* \\
& \xi_i, \xi_i^* \ge 0 \quad \forall i = 1, 2, \ldots, n
\end{aligned}
\]

The ν is directly related to the capacity of the learning machine and the number of training points allowed to lie outside the tube. Unlike the ε parameter, it therefore does not depend on the scale of the labels. It was shown that one can solve the minimization problem by solving the following maximization problem:

\[
W(\alpha^{(*)}) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) \langle x_i, x_j \rangle
\]

subject to:

\[
\sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \qquad
\alpha_i^{(*)} \in \left[0, \frac{C}{l}\right], \qquad
\sum_{i=1}^{l} (\alpha_i + \alpha_i^*) \le C \cdot \nu
\]

Here, $l$ denotes the number of training samples (written $n$ above).

If one applies a mapping function $\Phi : X \to F$ to the input variables as stated in Schölkopf et al.,19 one can learn a regression function in the feature space. Since computing the inner product $\langle \Phi(x_i), \Phi(x_j) \rangle$ of the mapped feature vectors in feature space can be very time-consuming, a kernel function $k$ can be used instead: $k : X^2 \to Y : k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$, which efficiently computes the results of the inner product in feature space. The kernel matrix of the training samples has to be positive semidefinite. The regression function is learned by maximizing

\[
W(\alpha^{(*)}) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, k(x_i, x_j)
\]

subject to the same constraints as above.

Kernel Function and Filtering. The paired oligo-border kernel (POBK) was introduced in 2007.9 The main advantage of the kernel in application to computational proteomics is that it enables the learning machine to learn chemical properties of the data (e.g., composition, sequence length, hydrophobic regions, etc.) directly from the amino acid sequence. It was shown that very little training data is needed for ν-SVR in combination with the POBK to achieve very accurate retention time prediction models.9 The kernel operates directly on the sequence data, in which every distinct amino acid is considered a separate letter of the alphabet. In this work, we now extend this alphabet to modified amino acids. This means that, for example, a modified methionine with an additional methyl group is treated differently than an oxidized methionine. The method does not rely on any special features because it learns the necessary features for the particular separation process directly from the training data. Therefore, the POBK can be applied to a wide range of problems like separation prediction in strong anion-exchange chromatography and reversed-phase chromatography.9 It can also be used as in our work to learn peptide retention behavior under different pH conditions. A further extension to the method introduced by Pfeifer et al.9 is that one does not have to normalize the retention times to the interval between zero and one. Instead, the aligned retention times can be used directly to train the learning machine. The learned retention time models for each dimension are then used to build a retention time filter for the corresponding dimension.9 The filters are based on a statistical test which measures how likely the peptide under consideration is a true identification. To this end, the measured and the predicted retention times are taken into account, and the user can specify a certain significance level for the filter.

Evaluation of Precision of the Identifications. Precision was measured for different subsets of the spectrum annotations. Precision is defined as the number of true positives (TP) divided by the sum of the number of true positives and the number of false positives (FP). This means that precision = TP/(TP + FP). In our application, the precision is the number of spectra for which the best-scoring annotation is correct divided by the total number of annotated spectra. If there was more than one top hit peptide with the same score for a spectrum, we excluded this spectrum from the evaluation, because it cannot be labeled as true or false positive.

Results and Discussion

Retention Time Prediction at pH 10.0 and pH 2.1. Because fractions of peptides were taken in the first dimension, we can only assign retention windows for peptide elution and take the median of the elution window as the retention time for all peptides contained in a fraction. To show that the method performs well for the prediction of retention times in both dimensions, we performed a nested cross-validation (CV) on a subset of the data with high-quality identifications. All spectrum identifications with a q-value smaller than or equal to 0.1 and a peptide length greater than five residues, whose peptide sequences were substrings of the known protein sequences of the standard mixture, were considered as true positive identifications. If there were several instances of the same spectrum annotation, we calculated the median of the retention times. Before we measured the performance of the retention time prediction models, we measured the reproducibility of the retention times of the identified peptides by calculating the standard deviation of the retention times for each peptide that was identified in several repetitive runs. The average standard deviation for the retention times at pH 10.0 was 1.36 min. In the second dimension, where a retention time represents the exact elution time of a peptide, the average standard deviation of retention times was 8.43 s. The nested CV was performed in the following way: First, the spectrum identifications were split randomly into five partitions. On four of the partitions, we performed a 5-fold CV to find the best parameters of the learning machine (C, ν, and σ). For this purpose, we used $\nu \in \{0.4 \cdot 1.2^i \mid i \in \{0, 1, 2\}\}$ and $\sigma \in \{0.2 \cdot 1.221055^i \mid i \in \{0, 1, \ldots, 21\}\}$. Since it is recommended to have the C values in the range of the maximal label,22 we had C ∈ {0.001, 0.01, 0.1, 1, 10, 100} for the retention times at pH 10.0 and C ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} for the retention times at pH 2.1 since the retention times at pH 10.0 were measured in minutes and the retention times at pH 2.1 were measured in seconds.
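As a rough illustration, the search grids listed above can be enumerated in a few lines of Python. This is only a sketch: it assumes that the trailing i in the printed expressions denotes an exponent and that a full Cartesian grid was searched, and all variable and function names are illustrative rather than OpenMS identifiers.

```python
from itertools import product

# Grids as read from the text, assuming the trailing "i" is an exponent.
nu_grid = [0.4 * 1.2 ** i for i in range(3)]            # i in {0, 1, 2}
sigma_grid = [0.2 * 1.221055 ** i for i in range(22)]   # i in {0, ..., 21}
c_grid_ph10 = [10.0 ** k for k in range(-3, 3)]         # 0.001 ... 100 (labels in minutes)
c_grid_ph2 = [10.0 ** k for k in range(-3, 4)]          # 0.001 ... 1000 (labels in seconds)


def parameter_grid(c_values):
    """Return every (C, nu, sigma) triple, assuming a full Cartesian search."""
    return list(product(c_values, nu_grid, sigma_grid))


grid_ph10 = parameter_grid(c_grid_ph10)
grid_ph2 = parameter_grid(c_grid_ph2)
print(len(grid_ph10), len(grid_ph2))  # 396 462
```

Under these assumptions, each inner 5-fold CV would compare 396 candidate settings at pH 10.0 and 462 at pH 2.1.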
Then, we trained on the four partitions with the best parameters of the 5-fold CV and measured the Pearson correlation between the observed and the predicted retention times on the remaining fifth partition. This was done for every possible combination of the five partitions to get a mean performance. To exclude random effects introduced by the random partitioning of the data, we repeated the calculations five times with different random partitionings. The average Pearson correlation coefficient between predicted and observed retention times is 0.93 at pH 10.0 and 0.98 at pH 2.1. This means that the prediction of retention times works very well for both dimensions. The better performance for the second-dimension separation at pH 2.1 can be explained by the fact that fractions were collected at pH 10.0 only every minute. Although an exact measurement of the retention times in the first dimension would increase the performance of the prediction methods, this is not reasonable for off-line two-dimensional peptide separations.

Elimination of False Identifications by Retention Time Filters. To show the applicability of retention time filters, we conducted the following experiment on the standard mixture. We trained the retention time model on all peptides of the standard mixture yielding spectra with a q-value smaller than or equal to 0.01. This data set contained 223 unique peptides. The retention times of these peptides in both separation dimensions and their corresponding sequences were utilized to perform SVR with the POBK function. Then, we used the trained models to predict retention times for both dimensions for the whole data set, similar to Klammer et al.8 With the two models for retention time prediction, we could build a filter for each dimension as described in the Experimental Section. The filter is based on the confidence intervals in the correlation between predicted and observed retention times. Since the model for the first-dimension separation at pH 10.0 is not as good as the model for the second dimension at pH 2.1 due to the use of 1-min retention time windows, we set the significance level of the retention time filter for the first dimension to 0.01. This means that the probability of filtering out a correct identification is smaller than or equal to 0.01. The significance level for the filter of the second dimension was set to the standard value of 0.05, because the retention times in the second dimension were better defined, allowing a more stringent filtering.

Figure 1. Comparison of precision depending on the q-value of the annotations with and without filtering: This plot shows the precision for various data sets with and without filtering. At every point, all spectrum annotations having a q-value smaller than or equal to the x-axis value are considered.

Since we knew which proteins were in the mixture and could therefore distinguish false positives from true positives, we were able to evaluate the performance of the filtering approach. To this end, we measured the precision as described in the Experimental Section on all spectrum identifications having a q-value smaller than or equal to 0.01 and correspondingly for q-values from 0.02 to 0.5. The precision was measured for the data sets without filtering as well as with one of the filters or with both filters in combination. It can be seen in Figure 1 that each filter improved the precision for every evaluated subset. Furthermore, it can be seen that the combination of both filters leads to the largest improvement in precision. The numbers underlying the figure are collected in Table 1. The complementarity of both filters is further emphasized in Figure 2, which shows the number of correctly annotated spectra with regard to precision. To calculate the underlying values, we took the precision of the different identification sets and evaluated the number of correctly annotated spectra for each data set. It can be seen that both filters improve the number of correctly annotated spectra. Moreover, the largest improvement in the number of correctly annotated spectra can be achieved for a combination of both filters.
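The precision values behind this comparison follow directly from the true and false positive counts in Table 1. The short Python sketch below recomputes them for the row with a q-value threshold of 0.05; the function and dictionary names are illustrative only.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP), as defined in the Experimental Section."""
    return tp / (tp + fp)


# (tp, fp) counts taken from the q-value <= 0.05 row of Table 1.
row_q005 = {
    "unfiltered": (1663, 183),
    "first dimension": (1575, 125),
    "second dimension": (1653, 128),
    "both dimensions": (1567, 96),
}

precisions = {name: precision(tp, fp) for name, (tp, fp) in row_q005.items()}
for name, value in precisions.items():
    print(f"{name:<17} {value:.3f}")
```

The rounded results (0.901, 0.926, 0.928, and 0.942) agree with the tabulated precisions and show the ordering described in the text: each filter alone raises the precision, and the combination raises it most.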
For example, at a precision of 0.94, meaning that 94% of the identifications are correct, we obtained 1567 correctly annotated spectra using both filters compared to 1165 spectra without filtering. This corresponds to a 35% increase in peptide identifications at the same level of precision. Spectrum annotations having a mass spectral q-value smaller than or equal to 0.05 and additional filtering by our two-dimensional retention time filter yield the same precision of 0.94 as all spectrum identifications with a q-value smaller than or equal to 0.01 without filtering.

Table 1. Overview of Precision Depending on q-Value Threshold and Filtering (a)

q-value   unfiltered             filtered (1st dim)     filtered (2nd dim)     filtered (both dims)
threshold tp    fp    precision  tp    fp    precision  tp    fp    precision  tp    fp    precision
0.01      1165  70    0.943      1106  58    0.950      1165  64    0.948      1106  56    0.952
0.02      1345  100   0.931      1279  80    0.941      1342  85    0.940      1277  72    0.947
0.03      1468  130   0.919      1395  99    0.934      1464  101   0.935      1393  83    0.944
0.04      1577  159   0.908      1495  115   0.929      1569  117   0.931      1489  91    0.942
0.05      1663  183   0.901      1575  125   0.926      1653  128   0.928      1567  96    0.942
0.10      1962  393   0.833      1852  239   0.886      1942  221   0.898      1836  158   0.921
0.15      2104  598   0.779      1981  329   0.858      2078  292   0.877      1960  198   0.908
0.20      2230  807   0.734      2102  422   0.833      2198  339   0.866      2076  223   0.903
0.25      2315  1097  0.678      2185  553   0.798      2282  415   0.846      2158  267   0.890
0.30      2408  1360  0.639      2268  677   0.770      2366  475   0.833      2233  303   0.881
0.35      2512  1780  0.585      2367  877   0.730      2466  569   0.813      2328  356   0.867
0.40      2595  2562  0.503      2443  1166  0.677      2542  783   0.765      2401  444   0.844
0.45      2665  3368  0.442      2505  1523  0.622      2606  954   0.732      2458  535   0.821
0.50      2723  4044  0.402      2562  1809  0.586      2663  1132  0.702      2513  619   0.802

(a) This table shows the precision for different subsets of the data. Every row corresponds to one subset. The q-values of the spectrum identifications have to be smaller than or equal to the q-value threshold in the first column. tp stands for (the number of) true positives and fp stands for (the number of) false positives. The precision is defined as tp/(tp + fp).

Figure 2. Comparison of correctly annotated spectra with and without filtering: This plot shows the number of correctly annotated spectra with and without filtering at a certain precision. The points correspond to the different partitions of the data which were evaluated. The numbers underlying this figure can be found in Table 1 (precision vs tp).

To illustrate the filtering capabilities, we plotted the observed retention time against the predicted retention time for the identifications with q-value less than or equal to 0.05. Figure 3 shows the performance of the filter for the separation at pH 10.0. It can be seen that the correlation between observed and predicted retention time is quite good for the correct identifications. The lines represent the 99% confidence intervals for the retention times predicted by our model for peptide separation at pH 10.0 (see Experimental Section for details). Furthermore, one can see that there is a significant number of false identifications which are filtered out only by the filter of the first dimension (crosses without circle), or by the filters in both dimensions (crosses with circle). This effect can also be seen in Figure 4, which demonstrates the performance of the filter for the second-dimension separation at pH 2.1. Because of the measurement of exact retention times in the second separation dimension, the correlation between observed and predicted retention time is considerably better than for the first retention time dimension, resulting in a very narrow band for the 95% confidence interval. The majority of the data points for the correct identifications lie very close to the ideal 45° line in this plot with only few outliers beyond the 95% confidence interval, which clearly demonstrates the high performance of retention time prediction for peptides in ion-pair reversed-phase chromatography using support vector regression and the paired oligo-border kernel function.

Figure 3. Filter performance of separation at pH 10.0: This plot shows observed against predicted retention time (first dimension) for all spectrum identifications having a q-value smaller than or equal to 0.05. The lines show the borders of the filter. Every point which is not between the two lines is filtered out by this filter. Points having an extra circle are also filtered out by the filter of the second dimension (pH 2.1).

Using RT Filters to Improve Identifications in Whole Proteome Analysis. The same protocol as above was applied to the S. cellulosum data to obtain more identifications, keeping the precision at the same level. We did not train on all spectra with a q-value smaller than or equal to 0.01 since our learning method does not require such a large amount of training data.9 Instead, we just used the 600 best-scoring identifications. We then utilized the trained models to predict retention times in both dimensions for the whole data set of mass spectrometrically identified S. cellulosum peptides.
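The way two per-dimension bands such as those in Figures 3 and 4 combine into one two-dimensional filter can be sketched in Python. This is a minimal illustration and not the OpenMS implementation: the exact statistical test is described in ref 9, whereas the sketch simply assumes zero-mean Gaussian residuals, and the residual values and function names are hypothetical. Only the significance levels (0.01 for the first dimension, 0.05 for the second) are taken from the text.

```python
import math
from statistics import NormalDist


def band_half_width(training_residuals, alpha):
    """Half-width of a two-sided (1 - alpha) band around the prediction.

    Assumption (not from the paper): residuals (observed - predicted) are
    zero-mean Gaussian, with the spread estimated from training peptides.
    """
    sigma = math.sqrt(sum(r * r for r in training_residuals) / len(training_residuals))
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) * sigma


def passes(observed, predicted, half_width):
    """Keep an identification whose observed RT lies inside the band."""
    return abs(observed - predicted) <= half_width


# Hypothetical residuals: minutes at pH 10.0, seconds at pH 2.1.
width_dim1 = band_half_width([-2.0, -1.0, 0.0, 1.0, 2.0], alpha=0.01)
width_dim2 = band_half_width([-10.0, -5.0, 0.0, 5.0, 10.0], alpha=0.05)


def passes_both(obs1, pred1, obs2, pred2):
    """Two-dimensional filter: the identification must pass both bands."""
    return passes(obs1, pred1, width_dim1) and passes(obs2, pred2, width_dim2)


print(passes_both(30.0, 31.0, 1200.0, 1205.0))  # True: inside both bands
print(passes_both(30.0, 45.0, 1200.0, 1205.0))  # False: rejected in dimension 1
```

Requiring both tests to pass is what lets the combined filter remove false identifications that only one of the two dimensions flags.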


Figure 4. Filter performance for separation at pH 2.1: This plot shows observed against predicted retention time (second dimension) for all spectrum identifications having a q-value smaller than or equal to 0.05. The lines show the borders of the filter. Every point which is not between the two lines is filtered out by this filter. Points having an extra circle are also filtered out by the filter of the first dimension (pH 10.0).

The study on the standard mixture showed that similar precision is achieved by choosing all spectrum identifications with a q-value smaller than or equal to 0.01 without filtering or choosing all spectrum identifications with a q-value smaller than or equal to 0.05 and filtering by the two retention time filters. Since true and false positive identifications cannot be directly distinguished in the whole proteome data, we evaluated the total number of annotated spectra and the number of identified unique peptides for these two sets of identification parameters. At a q-value of 0.01, we annotated 21 038 spectra which identified 6202 unique peptides, and at a q-value of 0.05 with additional RT-filtering, we annotated 25 347 spectra, which yielded 7115 unique peptide identifications. This represents an increase in the number of successful peptide identifications by 15% without any loss in the precision of peptide identifications. In this evaluation, peptides with the same amino acid sequence but different post-translational modifications were considered to be different peptide identifications. Furthermore, we looked at the overlap of the unique peptide identifications between the two sets. The majority of identifications are part of both sets. Nevertheless, 720 unique peptide identifications occur only in the unfiltered set and 1633 unique peptide identifications occur only in the filtered set. The numbers are plotted as a Venn diagram in Figure 5. To further validate the performance of the two-dimensional filtering approach, we measured the postfiltering q-values similar to Klammer et al.8 For this purpose, we created a decoy database by randomly shuffling the amino acids in each protein sequence. Then, we evaluated the number of annotated spectra at q-values {0.005, 0.01, 0.015,..., 0.1} using the decoy identifications as well as the identifications to the standard database. These identifications were then filtered by the retention time filters to compute postfiltering q-values.
We then evaluated the number of annotated spectra for q-values {0.005, 0.01, 0.015,..., 0.1}. To minimize random effects, the evaluation was repeated five times and the average numbers for the unfiltered as well as the filtered annotations are shown in Figure 6. At q-value 0.01, 21 470 spectra were annotated without filtering, whereas 25 580 spectra were annotated by using the two-dimensional filtering approach. This means that about 19% more spectra were successfully annotated at exactly the same significance threshold.

Figure 5. Increase in unique annotations on the S. cellulosum data set. This plot shows the number of unique spectrum annotations of two sets, for which the precision is estimated to be equal (based on empirical results on the standard mixture). The first set contains all unfiltered annotations with a q-value smaller than or equal to 0.01, and the second set contains all spectrum annotations with a q-value smaller than or equal to 0.05 that are not filtered out by either of the two retention time filters.

Figure 6. Comparison of the number of annotated spectra with and without filtering at the same q-value. This plot shows the number of annotated spectra at a particular q-value for the unfiltered annotations as well as for the identifications filtered by our two-dimensional retention time filter.
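
The decoy-based validation can be illustrated with a small Python sketch. It is not the authors' implementation: the q-value estimate below is a simple decoy/target ratio made monotone by a running minimum, chosen for brevity and not necessarily the estimator used in the study (ref 18). The function names, scores, and filter outcomes are hypothetical, and higher scores are assumed to indicate better matches.

```python
import random


def shuffle_protein(sequence: str, rng: random.Random) -> str:
    """Return a decoy sequence with the same amino acids in random order."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def q_values(scores, is_decoy):
    """Estimate q-values from target/decoy labels (higher score = better).

    Assumption: the FDR at a score threshold is estimated as
    (#decoys / #targets) among hits at or above it; the q-value of a hit is
    the smallest FDR over all thresholds that still include that hit.
    Ties in score are not treated specially in this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    q = [0.0] * len(scores)
    best = float("inf")
    for rank in range(len(order) - 1, -1, -1):  # running minimum from the end
        best = min(best, fdrs[rank])
        q[order[rank]] = best
    return q


def count_targets(q, is_decoy, threshold):
    """Number of target (non-decoy) hits with q-value <= threshold."""
    return sum(1 for qv, d in zip(q, is_decoy) if not d and qv <= threshold)


# Hypothetical toy data: scores, decoy labels, and RT-filter outcomes.
scores = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0]
is_decoy = [False, False, True, False, False, True]
passes_rt = [True, True, False, True, True, True]

decoy = shuffle_protein("MKTAYIAKQR", random.Random(0))

q_before = q_values(scores, is_decoy)

# Postfiltering q-values: re-estimate on the hits that survive the RT filter.
kept = [i for i, ok in enumerate(passes_rt) if ok]
q_after = q_values([scores[i] for i in kept], [is_decoy[i] for i in kept])

n_before = count_targets(q_before, is_decoy, 0.01)
n_after = count_targets(q_after, [is_decoy[i] for i in kept], 0.01)
```

In this toy example the retention time filter removes a high-scoring decoy hit, so more target identifications fall below the same q-value threshold after filtering. That is the effect Figure 6 reports for the real data.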

Conclusions We presented a new approach to increase the number of correctly identified spectra resulting from mass spectrometry experiments by using experimental data that are inherent to the analytical process. In our approach, we build retention time predictors for a two-dimensional chromatographic separation using the retention times of peptides identified with high confidence by tandem mass spectrometry. Thus, no additional calibration using standard mixtures was necessary. The retention time filters were successfully applied to filter out false positive identifications. Moreover, we showed that the scoring threshold can be lowered to recover identifications that were previously false negatives (and thus to obtain more correct spectrum identifications) at the same level of precision. This is accomplished by incorporating the retention time predictors into a two-dimensional filter which eliminates false positive identifications. Therefore, we can achieve the same precision although the mass spectrometric scoring threshold is lower. The method was validated on a standard mixture containing known peptide sequences. Finally, we applied the same method to the whole proteome analysis of the bacterium S. cellulosum. The analysis showed that by using this method we can find about 19% more spectrum identifications at q-value 0.01.
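
The gains quoted for the S. cellulosum data can be rechecked from the counts reported in the text. The short Python sketch below does only that arithmetic. The counts are those given in the article, while the function and variable names are illustrative and not part of the original work.

```python
def relative_increase(before: int, after: int) -> float:
    """Percentage increase from `before` to `after`."""
    return 100.0 * (after - before) / before


# Unique peptides: q <= 0.01 unfiltered vs q <= 0.05 with RT filtering.
peptides_unfiltered, peptides_filtered = 6202, 7115
only_unfiltered, only_filtered = 720, 1633

# Annotated spectra at q-value 0.01 (averages over five decoy runs).
spectra_unfiltered, spectra_filtered = 21470, 25580

peptide_gain = relative_increase(peptides_unfiltered, peptides_filtered)  # about 14.7
spectra_gain = relative_increase(spectra_unfiltered, spectra_filtered)    # about 19.1

# Consistency check for the Venn diagram: both sets must share the same overlap.
shared_from_unfiltered = peptides_unfiltered - only_unfiltered
shared_from_filtered = peptides_filtered - only_filtered
```

Both subtractions give the same overlap of 5482 peptides, so the set sizes and the counts unique to each set are mutually consistent.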

Acknowledgment. We gratefully acknowledge the instrumental support of Arnd Ingendoh and Carsten Baessmann from Bruker Daltonics, Bremen, Germany. The authors thank Yasser A. Elnakady and Rolf Müller from the Department of Pharmaceutical Biotechnology at Saarland University, Germany, for providing the S. cellulosum sample. The authors thank the anonymous reviewers for valuable suggestions and comments.

Supporting Information Available: The spectrum identifications of the standard mixture are given by four different data sets at similar precision (0.94), showing the larger number of spectra which can be annotated with similar precision by using our methods. SuppData1.xls contains all spectrum identifications with q-value equal to or smaller than 0.01. SuppData2.xls contains all spectrum identifications with q-value equal to or smaller than 0.02 that are not filtered out by the filter of the first RT dimension. SuppData3.xls contains all spectrum identifications with q-value equal to or smaller than 0.02 that are not filtered out by the filter of the second RT dimension. SuppData4.xls contains all spectrum identifications with q-value equal to or smaller than 0.05 that are not filtered out by either of the two RT filters. Additionally, the Supporting Information contains SuppTable1.xls, which contains the data from Table 1 together with the true negatives (tn) and false negatives (fn). The fn are correct spectrum identifications which are filtered out by the corresponding filter(s) and the tn are false spectrum identifications, which are filtered out by the corresponding filter(s). This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) Yates, J. R. Mass spectrometry and the age of the proteome. J. Mass Spectrom. 1998, 33, 1–19.
(2) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422, 198–207.
(3) Cannon, W. R.; Taasevigen, D.; Baxter, D. J.; Laskin, J. Evaluation of the influence of amino acid composition on the propensity for collision-induced dissociation of model peptides using molecular dynamics simulations. J. Am. Soc. Mass Spectrom. 2007, 18, 1625–1637.
(4) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
(5) MacCoss, M. J.; Wu, C. C.; Yates, J. R. Probability-based validation of protein identifications using a modified SEQUEST algorithm. Anal. Chem. 2002, 74, 5593–5599.
(6) Bączek, T.; Kaliszan, R. Predictions of peptides' retention times in reversed-phase liquid chromatography as a new supportive tool to improve protein identification in proteomics. Proteomics 2009, 9, 835–847.
(7) Strittmatter, E. F.; Kangas, L. J.; Petritis, K.; Mottaz, H. M.; Anderson, G. A.; Shen, Y.; Jacobs, J. M.; Camp, D. G.; Smith, R. D. Application of peptide LC retention time information in a discriminant function for peptide identification by tandem mass spectrometry. J. Proteome Res. 2004, 3, 760–769.
(8) Klammer, A.; Yi, X.; MacCoss, M.; Noble, W. Improving tandem mass spectrum identification using peptide retention time prediction across diverse chromatography conditions. Anal. Chem. 2007, 79, 6111–6118.
(9) Pfeifer, N.; Leinenbach, A.; Huber, C. G.; Kohlbacher, O. Statistical learning of peptide retention behavior in chromatographic separations: a new kernel-based approach for computational proteomics. BMC Bioinf. 2007, 8, 468.
(10) Klammer, A. A.; Reynolds, S. M.; Bilmes, J. A.; MacCoss, M. J.; Noble, W. S. Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification. Bioinformatics 2008, 24, i348–i356.
(11) Uwaje, N. C.; Mueller, N. S.; Maccarrone, G.; Turck, C. W. Interrogation of MS/MS search data with an pI Filter algorithm to increase protein identification success. Electrophoresis 2007, 28, 1867–1874.
(12) Alpert, A. J.; Andrews, P. C. Cation-exchange chromatography of peptides on poly(2-sulfoethyl aspartamide)-silica. J. Chromatogr. 1988, 443, 85–96.
(13) Toll, H.; Oberacher, H.; Swart, R.; Huber, C. Separation, detection, and identification of peptides by ion-pair reversed-phase high-performance liquid chromatography-electrospray ionization mass spectrometry at high and low pH. J. Chromatogr., A 2005, 1079, 274–286.
(14) Delmotte, N.; Lasaosa, M.; Tholey, A.; Heinzle, E.; Huber, C. Two-dimensional reversed-phase × ion-pair reversed-phase HPLC: An alternative approach to high-resolution peptide separation for shotgun proteome analysis. J. Proteome Res. 2007, 6, 4363–4373.
(15) Schley, C.; Swart, R.; Huber, C. G. Capillary scale monolithic trap column for desalting and preconcentration of peptides and proteins in one- and two-dimensional separations. J. Chromatogr., A 2006, 1136, 210–220.
(16) Lange, E.; Gröpl, C.; Schulz-Trieglaff, O.; Leinenbach, A.; Huber, C.; Reinert, K. A geometric approach for the alignment of liquid chromatography mass spectrometry data. Bioinformatics 2007, 23, i273–i281.
(17) Schneiker, S.; et al. Complete genome sequence of the myxobacterium Sorangium cellulosum. Nat. Biotechnol. 2007, 25, 1281–1289.
(18) Käll, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7, 29–34.
(19) Schölkopf, B.; Smola, A. J.; Williamson, R. C.; Bartlett, P. L. New support vector algorithms. Neural Comput. 2000, 12, 1207–1245.
(20) Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. OpenMS – An open-source software framework for mass spectrometry. BMC Bioinf. 2008, 9, 163.
(21) Kohlbacher, O.; Reinert, K.; Gröpl, C.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Sturm, M. TOPP – the OpenMS proteomics pipeline. Bioinformatics 2007, 23, e191–e197.
(22) Chalimourda, A.; Schölkopf, B.; Smola, A. J. Experimentally optimal ν in support vector regression for different noise models and parameter settings. Neural Networks 2005, 18, 205.


Journal of Proteome Research • Vol. 8, No. 8, 2009