Application of Peptide LC Retention Time Information in a Discriminant Function for Peptide Identification by Tandem Mass Spectrometry

Eric F. Strittmatter, Lars J. Kangas, Konstantinos Petritis, Heather M. Mottaz, Gordon A. Anderson, Yufeng Shen, Jon M. Jacobs, David G. Camp II, and Richard D. Smith*

Biological Sciences Division and Environmental and Molecular Sciences Laboratory, Pacific Northwest National Laboratory, P.O. Box 999, MSIN: K8-98, Richland, Washington 99352

Received January 21, 2004

We describe the application of a previously reported peptide retention time prediction model for reversed-phase liquid chromatography (RPLC) (Petritis et al. Anal. Chem. 2003, 75, 1039) to improve peptide identification. The model uses peptide sequence information to generate a theoretical (predicted) elution time that can be compared with the observed elution time. Using data from a set of known proteins, the retention time parameter was incorporated into a discriminant function for use with tandem mass spectrometry (MS/MS) data analyzed with the peptide/protein identification program SEQUEST. For singly charged ions, the number of confident identifications increased by 12% when the elution time metric was included, compared with using mass spectral data as the sole source of information, in the context of a Drosophila melanogaster database. A 3-4% improvement was obtained for doubly and triply charged ions for the same biological system. Application to the larger Rattus norvegicus (rat) and human proteome databases resulted in an 8-9% overall increase in the number of confident identifications when both the discriminant function and elution time were used. The effect of adding “runner-up” hits (peptide matches that are not the highest scoring for a spectrum) from SEQUEST is also explored, and we find that the number of confident identifications is further increased by 1% when these hits are also considered. Finally, application of the discriminant functions derived in this work to ∼2.2 million spectra from over three hundred LC-MS/MS analyses of peptides from human plasma protein resulted in a 16% increase in confident peptide identifications (9022 vs 7779) using elution time information. Further improvements from the use of elution time information can be expected as both the experimental control of elution time reproducibility and the predictive capability are improved.
Keywords: bioinformatics • proteome • algorithm • accurate mass and time tag • multivariate statistics • capillary liquid-chromatography • retention time • FTICR

Introduction

Complete genome sequences are now available for over 150 organisms.1 The combination of this information with data from evolving approaches that enable measurement of hundreds to thousands of proteins2-4 stands to greatly impact biological and medical research.5,6 Although genomic sequences provide a set of proteins that could potentially be present, information generally desired from proteomics studies includes determination of the abundances of these proteins, where they are located, and their interactions with other biomolecules.7 To obtain a system level understanding of an organism’s proteome, large scale efforts are required to handle, store, and process this information.8-11 Many of these initial efforts have concentrated on cataloging protein identifications, obtaining quantitative information, and increasing the throughput of measurements.12-16 For proteomics, the large datasets generated often comprise results from LC-MS/MS or PAGE-MALDI/MS17,18 analyses.

Journal of Proteome Research 2004, 3, 760-769

Published on Web 07/09/2004

Peptide identification by MS/MS often uses assignments from pattern matching algorithms such as SEQUEST19 and Mascot,20 where increasing score values indicate higher confidence. However, the point of transition from a good score to a poor one is difficult to discern, and the existence of differing filtering criteria indicates that these judgments are often in the “eye of the beholder”.21-24 Recently, there has been significant progress in establishing more robust methods for interpreting MS/MS data that provide a firmer statistical basis for peptide identification.23-27 One such method, developed by Keller et al.,23 uses a discriminant function based on several SEQUEST parameters and Bayesian statistics to create a probabilistic model that is useful for accepting or eliminating possible peptide identifications. However, with mammalian proteomes, a large fraction of the tentative peptide identifications still falls in a gray area wherein the peptide assignment is uncertain, and greater efforts to improve peptide identifications should be beneficial.

10.1021/pr049965y CCC: $27.50 © 2004 American Chemical Society

Peptide Identification by Tandem Mass Spectrometry

To date, there have been several studies demonstrating the usefulness of information obtained from the separation process (e.g., LC) used in conjunction with MS and MS/MS.28-30 We have previously reported development of an LC normalized elution time prediction capability based on peptide sequence that is typically accurate within 5% of the actual elution time.28 Because the information from liquid separations is orthogonal to that from gas-phase MS/MS, this information should augment multivariate discrimination methods. This combination of separation data and MS/MS measurements is promising both for eliminating errant matches and for providing a basis for improved identification of peptides having poor fragmentation efficiencies or patterns that differ from the models used by the software for identification. We have previously shown that the use of elution time criteria greatly increases the confidence of peptide identifications using high mass measurement accuracy (MMA, e.g., ∼1 ppm) MS data (i.e., without the use of MS/MS). This accurate mass and time tag (AMT tag) approach31-33 has been demonstrated with several prokaryotic organisms. Experiments that involve high accuracy MS measurements are different from MS/MS experiments in that they do not require isolation of a single parent ion per cycle, but instead allow the measurement of masses and/or abundance ratios2,34,35 of many different species in the same cycle time. Ultimately, greater MMA and separation measurement accuracy have important implications for proteomics, since these measurements minimize the need for more time-consuming MS/MS measurements. In this study, we report the use of an LC elution time metric that effectively increases the number of accurately identified peptides compared with the number of identifications obtained when only MS/MS information is used. We demonstrate this capability in conjunction with the broadly used SEQUEST program for peptide identification.19 We also show that the use of a discriminant function incorporating LC information yields more confident identifications, providing increases of >30% compared to simpler commonly used criteria.24

Experimental Section

Sample Preparation. A mixture of 17 proteins was digested with trypsin: bovine serum albumin (BSA), bovine-casein (containing four casein proteins), ovalbumin, horse myoglobin, bovine-lactoglobulin, bovine carbonic anhydrase, bovine hemoglobin, lysozyme (from chicken), bovine ubiquitin, bovine catalase, glyceraldehyde-3-phosphate dehydrogenase (rabbit), and bovine ribonuclease 1-A. Stock solutions of each protein were prepared in pH 8.4 NH4HCO3 buffer so that the protein concentration of each solution was 1.0 mg/mL. Aliquots of these stock solutions were combined so that the resulting mixture contained 1 nmol each of BSA and casein and 2 nmol each of the remaining proteins, and the total protein concentration was ∼1.0 µg/µL. Urea, thiourea, and DTT were added to this solution to make their resulting concentrations 7 M, 2 M, and 5 mM, respectively. Reduction and denaturation of the proteins were accomplished by incubating this solution for 30 min at 60 °C. This sample was subsequently diluted 10-fold, and 17 µL of 1 M CaCl2 and 22 µg of Promega trypsin were added to the sample, which was then incubated for 5 h at 37 °C. Solid-phase extraction (Supelco C-18) columns were used to obtain purified peptides. In this study, methods that were effective for the control proteins were also used on proteome samples from human plasma. The plasma proteins were denatured and reduced in a similar manner as the control proteins and subsequently desalted using a PD-10 desalting column (Amersham Pharmacia Biotech, Uppsala, Sweden). The resulting sample was digested and cleaned up with a Supelco C-18 column. Detailed information is given in ref 36. The collection and use of human samples in this study were reviewed by both the Stanford University and Pacific Northwest National Laboratory Institutional Review Boards for human subjects research in accordance with federal regulations (title 45 CFR part 46).

Mass Spectrometry and RPLC Separation. Experiments were performed using a ThermoFinnigan LCQ Duo (San Jose, CA) with a 3 min dynamic exclusion duration. For LCQ data acquisition, our method allowed for a survey scan followed by tandem MS experiments for the three most intense peaks. The RPLC system was constructed in-house and used a 150 µm i.d. × 360 µm o.d. × 65 cm capillary (Polymicro Technologies Inc., Phoenix, AZ) packed with 5 µm Jupiter C18 stationary phase (Phenomenex, Torrance, CA).37 Mobile phase A consisted of 0.05% trifluoroacetic acid (TFA), 0.2% acetic acid in water, and mobile phase B consisted of 0.1% TFA, 90% ACN in water. The exponential gradient mixing of mobile phase A with mobile phase B (flow of 1.8 µL/min) began while maintaining constant pressure (5000 psi) for 20 min following a 10 µL injection of the sample (1.6 µg/µL).38 The human plasma datasets were separated using a 2D LC technique involving an initial strong cation separation (polysulfoethyl aspartamide-bonded silica particles, PolyLC Inc.) followed by C18 RPLC separation. The human plasma datasets were obtained using a different 30 µm i.d. × 85 cm nanoLC column with a 4 cm × 75 µm i.d. solid-phase extraction precolumn used for high efficiency separations. For experiments on this system, an injection volume of 10 µL of a protein solution having a protein concentration of 15 µg/µL was used.
The mobile phase pressure for this higher separation efficiency LC system was 10 000 psi (additional information is given in ref 36).

Data Analysis. MS/MS Data Interpretation. Tandem mass spectra for the mixture of 17 trypsin-digested proteins were generated from 12 separate LC-MS/MS runs, and the resulting data were analyzed using SEQUEST v2.7 (ThermoFinnigan, San Jose, CA). The FASTA database was constructed using sequences from the 17 proteins, porcine trypsin, several keratins, and the predicted 14 340 Drosophila melanogaster proteins (National Center for Biotechnology Information, ftp.ncbi.nlm.nih.gov/genomes), unless otherwise noted. Although the sample does not contain Drosophila melanogaster proteins, SEQUEST does give reasonable scores for matches of some spectra to sequences from this organism due to random overlap between the observed patterns and an expected peptide pattern. Thus, assignments to D. melanogaster were assumed to be errant, with the exception of 31 peptides that overlap with 648 tryptic peptides from the 17-component control proteins. A similar strategy has already been successfully employed in developing other statistical models.27 In some cases, proteins from other organisms (e.g., rat, human) are used in place of Drosophila in the SEQUEST search to provide measures of false positive matches. Peptide hits that originate from the control protein set were accepted as correct identifications, with some exceptions. In contrast, because poorer spectra can be matched to a protein from random correlations and the database contains many more Drosophila than control set entries (14 340 proteins vs 17 proteins), there is a high probability of obtaining random matches to the Drosophila proteins. However, since the parametrization set contains thousands of spectra, the finite probability of obtaining a random control set identification warrants some additional filtering (in some cases, the discriminant approach used is sensitive to errant values in the training set). Four peptide sequences tentatively identified from the protein control set were reassigned as being false positives on the basis of large discrepancies between the actual and expected elution time and lack of corroborating evidence from high mass accuracy LC-FT-ICR analysis. These four sequences amounted to less than 1% of the number of different peptide sequences identified. Short peptides of 0.1 and Xcorr′ > 0.2), and this was done individually for each LC-MS/MS analysis. Once the coefficients of the quadratic relationship are determined, observed NET values can be obtained for all peptide identifications. Using this method, the presence of identical peptides in all runs is not necessary for them to be normalized, since a predicted elution time can be generated for any sequence. Additionally, all NETobs peptide elution times are normalized consistently by this procedure. Although the dynamic exclusion settings limit the repetitive selection of a given peptide for dissociation and MS/MS analysis during one LC separation, on occasion some peptides were identified multiple times (e.g., more abundant peptides or peptides that tend to give wider or tailing peaks).

Journal of Proteome Research • Vol. 3, No. 4, 2004

To minimize variability of peptide NETobs values in such runs, the earliest NET value was used, and subsequent identifications of the same peptide were assigned this NET value.

Construction of Discriminant. A discriminant function is used to judge the quality of peptide identifications; it is a multivariate formula based on all scores that can be attributed to a spectrum assignment (Xcorr, ∆Cn, NET information, etc.). The discriminant is an empirical function, and a controlled set of data is used to obtain values for its coefficients. It is important in this parametrization that the data be partitioned into correct hits and false hits, so that the characteristics of each score component that distinguish these two populations can be learned. Once a discriminant is parametrized, it can be applied to proteomic data and normal production data to eliminate false positive hits. The discriminant function used in this study is based on eqs 2 and 3:

$$\text{discrim score} = \mathrm{sigmoid}(\mathrm{Xcorr}') + \mathrm{sigmoid}(\Delta C_n) + \mathrm{sigmoid}(\ln(\mathrm{RankSp})) + \mathrm{sigmoid}(dM) + \mathrm{sigmoid}(d\mathrm{NET}) + \text{tryptic parameter} \qquad (2)$$

where

$$\mathrm{sigmoid}(x) = \frac{\beta_1}{1 + \exp\left(\dfrac{\beta_2 - x}{\beta_3}\right)} \qquad (3)$$

where β1, β2, and β3 are coefficients that are optimized to give maximum separation between confident identifications to the control set and the supposed errant assignments.23 The dNET parameter is calculated as the absolute value of the difference between the predicted and observed NETs. The tryptic parameter is chosen based on the number of cleavages at R or K residues that produced the peptide under consideration and takes one of three values, since a peptide can arise from a tryptic cleavage at both termini (full tryptic), a tryptic cleavage at one terminus (partial tryptic), or no tryptic cleavages (nontryptic). Because increasing discriminant score values indicate increasing confidence, components that have negative correlations are expected to have β1 < 0.
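As a minimal sketch of how eqs 2 and 3 combine, the following Python evaluates a discriminant score from the five score components plus the tryptic parameter. Every numeric value here (the β coefficients, the tryptic values, and the example scores) is a hypothetical placeholder chosen for illustration, not a fitted coefficient from this work; the function and key names are likewise assumptions.

```python
import math

def sigmoid(x, beta1, beta2, beta3):
    """Eq 3: beta1 / (1 + exp((beta2 - x) / beta3))."""
    return beta1 / (1.0 + math.exp((beta2 - x) / beta3))

# (beta1, beta2, beta3) per score component.
# All values are hypothetical placeholders, not coefficients from the paper.
COEFFS = {
    "xcorr_norm": (1.0, 0.3, 0.1),
    "delta_cn": (2.0, 0.1, 0.05),
    "ln_rank_sp": (-0.5, 1.0, 0.5),
    "dm": (-1.0, 1.0, 0.5),
    "dnet": (-1.5, 0.10, 0.03),  # beta1 < 0: larger dNET lowers the score
}
# Hypothetical tryptic parameter values for the three cleavage states.
TRYPTIC = {"full": 2.0, "partial": 1.0, "none": 0.0}

def discriminant_score(scores, tryptic_state):
    """Eq 2: sum of per-component sigmoids plus the tryptic parameter."""
    total = sum(sigmoid(scores[name], *COEFFS[name]) for name in COEFFS)
    return total + TRYPTIC[tryptic_state]

# Same spectrum scores, differing only in |NETpred - NETobs|.
hit = {"xcorr_norm": 0.45, "delta_cn": 0.25, "ln_rank_sp": 0.0, "dm": 0.3}
close = discriminant_score({**hit, "dnet": 0.02}, "full")
far = discriminant_score({**hit, "dnet": 0.30}, "full")
print(f"dNET 0.02: {close:.2f}   dNET 0.30: {far:.2f}")
```

With a negative β1 for dNET, the hit whose observed elution time is far from the prediction receives the lower score, which is the behavior the text describes for negatively correlated components.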

Results and Discussion

LC-NET Information. We have previously described a neural network based model for peptide retention time determination derived from a large collection of peptides identified from tryptic digests of D. radiodurans and S. oneidensis proteins.28 This model generates a predicted retention time value, NETpred, for a peptide based on its amino acid composition, using a specific set of separation conditions. In this work, the differences between NETpred and NETobs were calculated for all tentative peptide identifications, which included tentative identifications to both control set proteins (the 17 protein mixture) and Drosophila melanogaster proteins (assumed to be errant identifications). After the tabulated peptide identifications were separated according to whether they came from control set or Drosophila melanogaster proteins, histograms of the difference between NETpred and NETobs were determined for both populations (Figure 1, where 0.01 NET ≈ 1.5 min). Note that assignments to the correct control proteins (filled bars) have considerably less error than those to the Drosophila proteins, where


Table 1. Number of TRUE Peptide Identifications to the Control Set of Proteins, and the FALSE Errant Assignments to Drosophila, Based upon the Indicated Criteria^a,b
charge state of peptide    TRUE    FALSE

Figure 1. Error in normalized elution time (NET) value for the control set peptide identifications (filled bars) and those to Drosophila proteins (unfilled bars) for doubly protonated peptides.

90% of the control protein hits had a deviation of 0.10 or less and 50% of these peptide identifications were within 2.5%. Thus, the NETpred - NETobs value is a good metric for peptide identification quality. Additionally, compared with simple criteria, incorporation of NET data can provide a large increase in the confidence of identifications. Because the data in Figure 1 are derived from several LC-MS/MS experiments performed in replicate, several peptides were identified multiple times. Using the different values of NETobs for these peptides, a measure of reproducibility can be obtained by calculating a standard deviation of NETobs for each peptide. The results obtained for the LC-MS/MS analyses indicate that the NETobs values were highly reproducible, with a median per-peptide standard deviation of 0.0034 NET units.

Xcorr, ∆Cn and dNET. Nine LC-MS/MS datasets containing 14 762 spectra were used to develop the NET based discriminant. Of these spectra, 3420 (having 648 unique sequences) have top scoring spectrum assignments to peptides from one of the 17 control proteins. An ideal score criterion would allow these 3420 control set spectrum assignments to be confidently distinguished from the remaining false identifications. Perfect discrimination is impractical for many reasons (data quality, noise, etc.). A baseline measure of performance can be obtained by evaluation using the commonly used Xcorr and ∆Cn criteria of Washburn et al.,24 which results in 2241 nonunique peptide identifications to the control set across all charge states. The data in Table 1 show a false positive rate of ∼5% when using these criteria with the present data. Table 1 also shows the effect of adding a dNET < 0.10 (dNET = |NETpred - NETobs|) constraint for each peptide on the overall number of supposed correct identifications and errant identifications.
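The dNET constraint and the reproducibility measure described above can be sketched in a few lines of Python. This is an illustrative sketch only: the record layout, peptide names, and numbers are invented, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def dnet(net_pred, net_obs):
    """dNET = |NETpred - NETobs|."""
    return abs(net_pred - net_obs)

def filter_by_dnet(hits, max_dnet=0.10):
    """Keep only hits whose dNET is below the threshold (0.10 in the text)."""
    return [h for h in hits if dnet(h["net_pred"], h["net_obs"]) < max_dnet]

def median_net_sd(net_obs_by_peptide):
    """Median of per-peptide standard deviations of NETobs.

    Peptides observed only once are skipped; sample standard deviation
    is assumed here.
    """
    sds = [statistics.stdev(values)
           for values in net_obs_by_peptide.values() if len(values) >= 2]
    return statistics.median(sds)

# Hypothetical example records (not data from this study).
hits = [
    {"peptide": "PEPTIDEA", "net_pred": 0.31, "net_obs": 0.33},
    {"peptide": "PEPTIDEB", "net_pred": 0.62, "net_obs": 0.20},
    {"peptide": "PEPTIDEC", "net_pred": 0.45, "net_obs": 0.40},
]
kept = filter_by_dnet(hits)
print([h["peptide"] for h in kept])

replicates = {"PEPTIDEA": [0.30, 0.30], "PEPTIDEC": [0.50, 0.52]}
print(median_net_sd(replicates))
```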
This new constraint nearly halves the number of incorrect Drosophila identifications, consistent with the goal of obtaining identifications with reduced false positives. When peptide identifications are separated according to the parent charge state, the results from the singly and triply charged species show smaller improvement, demonstrating that the MS/MS spectrum is usually sufficient in the present case for making confident peptide assignments using the indicated Xcorr and ∆Cn criteria. We also examined the number of control identifications as a function of Xcorr. These numbers are shown as both the raw

fully tryptic Xcorr > 1.9, ∆Cn > 0.1
fully tryptic Xcorr > 1.9, ∆Cn > 0.1, dNET < 0.1
fully or partial tryptic Xcorr > 2.2, ∆Cn > 0.1 OR any^c Xcorr > 3.0, ∆Cn > 0.1
fully or partial tryptic Xcorr > 2.2, ∆Cn > 0.1, dNET < 0.1 OR any^c Xcorr > 3.0, ∆Cn > 0.1, dNET < 0.1
fully or partial tryptic Xcorr > 3.75, ∆Cn > 0.1
fully or partial tryptic Xcorr > 3.75, ∆Cn > 0.1, dNET

0.1 criterion on the original data is shown in Figure 2b. The dNET < 0.10 restriction has limited impact on the number of control set identifications; however, the false positive identifications are reduced by almost half. The addition of the ∆Cn restriction greatly reduces the number of false positives above Xcorr = 2.0. These results indicate that many valid peptide identifications may have Xcorr values below 1.9. Similar plots are provided for the doubly (Figure 3a,b) and triply (Figure 3c,d) charged peptides, which show trends similar to those observed for singly charged peptide identifications.

Discriminant Score. The other parameters, dM and ln(RankSp), can also be useful for judging the quality of a match between a peptide and a mass spectrum. Rather than attempting to plot score distributions as a function of each score variable, we have used a discriminant function approach to determine the best combination of Xcorr′, ∆Cn, ln(RankSp), dM, and dNET. The discriminant function (eqs 2 and 3) provides a new value, F, which most effectively separates identifications to the control proteins from the false positives by combining the scores of the five separate score components. A discriminant score was determined for each MS/MS spectrum, and the scores ranged from -3 to a maximum of 7. A distribution was obtained by placing spectra in bins of width 0.2 according to the discriminant score and then counting the number of peptides that had F scores in a particular bin. As in Figures 2 and 3, the fraction correct (probability) plot is shown for F instead of Xcorr.
As suggested by Figure 2, more singly charged peptide identifications can be obtained by lowering the Xcorr and ∆Cn criteria and by allowing for broader enzymatic criteria (Table 2). Using the less restrictive Xcorr > 1.8, ∆Cn > 0.08 thresholds22 and allowing for both full and partial tryptic activity results in a substantial increase in control protein set identifications (618 vs 478), although the rate of false identifications increases from 1% to an excessive 7%; the use of dNET in the


Figure 2. Dependence of peptide identification confidence (singly charged only) with dNET, Xcorr, and ∆Cn. (a) Fraction of protonated peptides assigned to the control protein set plotted with respect to Xcorr (top) obtained by binning the Xcorr values for partially and fully tryptic peptide identifications in 0.2 score intervals. The histogram of the number of tentative ID’s obtained at each score interval for the control proteins (black) and Drosophila proteins (grey). Dotted lines correspond to data filtered with a dNET < 0.10, solid lines are unfiltered with respect to dNET. (b) A histogram of Xcorr values filtered so only tentative identifications with ∆Cn > 0.1 are shown.

Table 2. Comparison of Correctly Identified Peptides (singly charged only) between a Discriminant and Simple Xcorr and ∆Cn Criteria^a

criteria                                                        TRUE   FALSE   % false
full or partial tryptic Xcorr > 1.8, ∆Cn > 0.08                  618      48      7.2%
full or partial tryptic Xcorr > 1.8, ∆Cn > 0.08, dNET < 0.1      587      23      3.8%
discriminant (w/o dNET) > 1.5                                    727      26      3.5%
discriminant (w/o dNET) > 1.6                                    711      21      2.9%
discriminant (w/dNET) > 1.7                                      814      26      3.1%

^a Only top (rank = 1) peptide assignments from SEQUEST were used, and final tally not filtered for unique sequences.
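The "% false" column of Table 2 is consistent with FALSE / (TRUE + FALSE). The short Python check below uses only the counts from the table; the formula is inferred from those counts rather than stated explicitly in the text, and the variable names are illustrative.

```python
# (criteria label, TRUE count, FALSE count) taken from Table 2.
TABLE2 = [
    ("Xcorr > 1.8, dCn > 0.08", 618, 48),
    ("Xcorr > 1.8, dCn > 0.08, dNET < 0.1", 587, 23),
    ("discriminant (w/o dNET) > 1.5", 727, 26),
    ("discriminant (w/o dNET) > 1.6", 711, 21),
    ("discriminant (w/dNET) > 1.7", 814, 26),
]

def percent_false(true_hits, false_hits):
    """Share of accepted identifications that are errant, in percent."""
    return 100.0 * false_hits / (true_hits + false_hits)

rates = [round(percent_false(t, f), 1) for _, t, f in TABLE2]
for (label, _, _), rate in zip(TABLE2, rates):
    print(f"{label:40s} {rate:.1f}%")

# Relative shortfall of the w/o-dNET discriminant (711) versus the
# w/dNET discriminant (814) at a similar false rate.
shortfall_pct = 100.0 * (814 - 711) / 814
print(f"shortfall without dNET: {shortfall_pct:.1f}%")
```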

criteria reduces false identifications to 4%. Using the discriminant function on the singly charged species with an F(w/dNET) > 1.7 cutoff results in 814 control set identifications at an error rate of 3%, an increase of >30% over the simple criteria listed in Table 2. A separate discriminant function was derived by zeroing out the dNET parameter (and β1,dNET terms) and finding the discriminant cutoff that would produce false positives at an ∼3% level. This approach allows one to determine the improvement solely attributable to the dNET parameter. The results listed in Table 2 show that a cutoff of F (no dNET) > 1.6 results in 711 control set identifications and a 3% false rate, which is ∼12% less than with the dNET based function (814). Discriminant functions for higher peptide charge states were also determined, and the F value producing a ∼3% error rate was examined. As for the singly charged species, the effect of


the dNET parameter was examined by including it in eq 2 and subsequently zeroing out the dNET component. The results indicate a 3.5% increase in identifications when dNET is used, amounting to 71 additional peptides for the doubly and triply charged species (Table 3). This modest improvement indicates that, for these predominantly larger peptides, the mass spectra are most often sufficiently rich in information to enable confident identifications. In contrast, for the smaller singly charged species, fragmentation pattern matching must be sensitive to small changes in the number of detected fragment ions since the number of backbone fragments is proportionally smaller. For peptides containing 1.7 for +1 peptides, F > 2.6 for +2 peptides, and F > 2.9 for dNET, a total of 2827 peptide identifications (462 unique sequences) were made from the control set of proteins, and 92 errant or Drosophila identifications (50 unique). This compares with 2653 peptides (431 unique) from the control set and 82 peptides from Drosophila (45 unique) for the w/o NET discriminant. Thus, the overall improvement in using NET information is 6.5% (or 7.4% when comparing unique IDs), but unique IDs have higher false positive rates. Note that peptide assignments in Tables 1 through 3 are not filtered for uniqueness, because standard protein


Figure 3. Dependence of multiply charged peptide identification confidence with respect to dNET, Xcorr, and ∆Cn. Tentative peptide identifications for (a) doubly charged, (b) doubly charged and ∆Cn > 0.1, (c) triply charged, (d) triply charged and ∆Cn > 0.1. As in Figure 2, dotted lines correspond to data filtered with a dNET < 0.10, solid lines are unfiltered with respect to dNET. Only partially and fully tryptic peptide identifications were used.


Table 3. Comparison, as in Table 2, with and without the Inclusion of dNET for Multiply Protonated Peptides^a

criteria                               TRUE   FALSE   % false
doubly charged
  discriminant (w/o dNET) > 2.7        1522      48      3.1%
  discriminant (w/dNET) > 2.6          1578      50      3.1%
triply charged
  discriminant (w/o dNET) > 3.2         420      13      3.0%
  discriminant (w/dNET) > 2.9           435      14      3.1%

^a Only top (rank = 1) peptide assignments from SEQUEST were used.
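Summing the TRUE counts of the discriminant rows in Tables 2 and 3 over the three charge states gives the control set totals quoted in the text. The sketch below uses only those tabulated counts; the dictionary layout and names are illustrative.

```python
# TRUE counts per charge state, from the discriminant rows of Tables 2 and 3.
TRUE_WITH_DNET = {1: 814, 2: 1578, 3: 435}
TRUE_WITHOUT_DNET = {1: 711, 2: 1522, 3: 420}

total_with = sum(TRUE_WITH_DNET.values())
total_without = sum(TRUE_WITHOUT_DNET.values())

# Overall relative gain from including dNET, in percent.
gain_pct = 100.0 * (total_with - total_without) / total_without

# Additional identifications for doubly and triply charged peptides only.
extra_multiply_charged = sum(
    TRUE_WITH_DNET[z] - TRUE_WITHOUT_DNET[z] for z in (2, 3)
)

print(total_with, total_without, f"{gain_pct:.2f}%", extra_multiply_charged)
```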

Figure 4. Test data discriminant score probabilities for all charge states, determined from binning the data in 0.2 score intervals and determining the fraction of true and false identifications. Data are shown for (circle, dotted line) [M + H]1+, (square, solid black line) [M + 2H]2+, and (triangle, gray) [M + 3H]3+ precursor ions.
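The probability curves are described as binning discriminant scores in 0.2 score intervals and taking the fraction of true identifications in each bin. A minimal Python sketch of that binning follows; the scores and true/false labels are made up for illustration, and keying each bin by its lower edge is a presentation choice, not something specified in the text.

```python
import math
from collections import defaultdict

def fraction_true_by_bin(scored_hits, width=0.2):
    """Map each bin's lower edge to the fraction of true hits in that bin.

    scored_hits: iterable of (discriminant_score, is_true) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [true, total]
    for score, is_true in scored_hits:
        idx = math.floor(score / width)
        counts[idx][1] += 1
        if is_true:
            counts[idx][0] += 1
    # round() removes floating-point noise in the bin edge (e.g. 1.6000000000000001)
    return {round(idx * width, 10): true / total
            for idx, (true, total) in sorted(counts.items())}

# Hypothetical scores: True = control set hit, False = errant assignment.
example = [(1.65, True), (1.75, True), (1.71, False), (3.05, True), (-0.5, False)]
probabilities = fraction_true_by_bin(example)
print(probabilities)
```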

digests tend to produce replicate peptide identifications. However, these replicate assignments have a wide range of Xcorr values and by no means represent equivalence in quality (see Supplemental Table 1). In Tables 2 and 3, note that the cutoff values are different for the discriminant with and without inclusion of dNET, yet the percentages of false identifications are nearly equal. This difference is due to the combined effect of zeroing out the dNET term and rescaling the coefficients in eqs 2 and 3 so that the inter-group variance is equal to unity and Σ Fi = 0. This rescaling step in the discriminant calculation is also performed by most alternative algorithms.39-41 An additional but useful step is to derive a function that would output a probability (Figure 4) when given an F value and charge state, as done by Keller et al.23 to provide the Bayesian expression, p(F value|true). However, the increased confidence in utilizing dNET should be fully accounted for in calculating F. An advanced model incorporating additional information from accurate mass measurements as part of the AMT tag approach32,33 is currently being explored.

Coefficients of Discriminant. The coefficient in the numerator of the discriminant expression, β1, provides a rough indication of the importance of that parameter for judging peptide assignment confidence. The parameters are listed separately for each of the independent variables, i.e., β1,Xcorr, β1,∆Cn, β1,dNET, etc. One important observation is that ∆Cn appears to be highly weighted across all charge states in our discriminant expression (see Table 4). This observation means that methods using ∆Cn should be more effective at minimizing false positives compared to methods based on some subset of


Table 4. Coefficients Used in the Discriminant Function (eqs 2 and 3)

charge state           1       2       3
β1,Xcorr′           0.67    1.01    2.04
β1,dm              -1.35   -0.36   -0.30
β1,∆Cn              1.96    3.75    4.30
β1,ln(Rsp)         -0.25   -1.64   -0.92
β1,dnet            -1.71   -1.07   -4.94
Partial tryptic      1.2     0.7     1.2
Full tryptic         2.0     3.2     2.2

the other remaining variables. Uncertainties based on random errors, determined using multiple variable regression42 after solving the discriminant, are estimated to be ±10% or less for all coefficients listed in Table 4. With moderate changes in database size, the coefficients for the discriminant function are not likely to change significantly. We have verified this by performing this parametrization using a larger database and found only insignificant differences. This is reasonable because the ∆Cn and Xcorr parameters are the dominant parameters, and only drastic changes in database complexity would alter this (e.g., a human database with extensive dynamic modifications).

Identification of Lower Ranked Peptide SEQUEST Xcorr Assignments. SEQUEST gives a list of multiple possible peptides and their corresponding Xcorr values for each spectrum, and in the present work the top ranked hit is usually correct. The “runner-up” peptide hits are typically discarded even though they are known to sometimes contain valid peptide identifications.19 Because the tryptic cleavages from which a peptide is derived are not reflected in the Xcorr value, it is reasonable that, for example, second ranked fully tryptic hits should be seriously considered, particularly when the top hit is minimally higher scoring or has no tryptic cleavages. To study the effect of using such runner-up peptides, our data extraction programs were modified to eliminate all spectra previously identified by a sufficiently scoring top ranked peptide assignment. The modified programs were also used to determine ∆Cn values for the lower ranked hits where, for example, ∆Cn(2nd ranked hit) = [Xcorr(2nd) - Xcorr(3rd)]/Xcorr(2nd). Using the criteria listed in Tables 2 and 3 for singly, doubly, and triply charged peptides, 39 additional peptides were identified from the control set along with 13 that originated from false positive proteins.
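A ∆Cn value for a lower ranked hit can be sketched as the relative Xcorr gap to the next hit in the list. The Python below is an illustration under an assumed sign convention: the difference is ordered so that the result is non-negative, in line with the positive mean ∆Cn values reported for runner-up hits. The Xcorr values are hypothetical.

```python
def delta_cn(xcorrs, rank):
    """Relative Xcorr gap between the hit at `rank` (1-based) and the next hit.

    xcorrs must be sorted in descending order, as in a SEQUEST hit list.
    """
    current = xcorrs[rank - 1]
    following = xcorrs[rank]
    return (current - following) / current

# Hypothetical Xcorr values for the top three candidates of one spectrum.
xcorrs = [2.50, 2.40, 2.00]
top_hit = delta_cn(xcorrs, 1)    # gap between 1st and 2nd, relative to 1st
runner_up = delta_cn(xcorrs, 2)  # gap between 2nd and 3rd, relative to 2nd
print(f"top: {top_hit:.3f}  runner-up: {runner_up:.3f}")
```

In this made-up case the top hit is only minimally higher scoring than the runner-up, which is the situation in which the text suggests a second ranked hit deserves consideration.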
To further establish that these “runner-up” hits are indeed correct hits, high accuracy mass measurements of several peptides identified from runner-up identifications were performed, and the calculated mass agreed with the observed mass to ∼1 ppm (Figures S1-3). This indicates that even if the correct sequence is present in the database, pattern matching algorithms are not always successful in assigning it correctly. It is our observation that runner-up hits are more likely to be correct when there is strong sequence similarity between the top and lower ranked hits and/or when the signal-to-noise ratio for the MS/MS spectrum is not high. Because these first runner-up peptide hits generally had low ∆Cn values, true and false hits were not separated as well as for the highest scoring peptides. On average, the ∆Cn values for the 39 runner-up identifications from the control set were 0.06, whereas the 2827 nonunique peptides identified from the control set in Tables 2 and 3 had a mean ∆Cn of 0.29. The NETobs values among these 39 spectra also tend to show good agreement with the NETpred derived for top and runner-up peptides. The top and runner-up hits had high sequence similarity. However, since NETpred is a function of all amino acids that comprise a peptide, small changes in sequence do not impact this value greatly (similar sequences can also give similar mass spectra). Note that the experimental elution times (i.e., NETobs) are different for peptides with small changes in composition and sequence order, and efforts to more accurately reflect this in prediction models are continuing. Because 2827 assignments were made from top ranked SEQUEST assignments and 39 from runner-up spectrum sequence assignments, the small improvement suggests that inclusion of the latter has a limited impact on the final number of identifications. In contrast, when accurate mass information and elution time information are available, more than one possible sequence (for one spectrum) can be considered since these methods have the power to distinguish among many candidates.
Effect of Database Size and Validation. Several LC-MS/MS analysis runs of the protein control set were set aside to test the discriminant function. To test the applicability of the derived functions for analysis on a broad range of organisms, results were generated using three additional eukaryote databases as the source of false positive proteins. These databases are as follows: (1) the Stanford Saccharomyces cerevisiae (yeast) database43 containing ∼5600 ORF entries, (2) the IPI Rattus norvegicus (rat) database containing ∼34 400 ORF entries,44 and (3) the IPI human database containing ∼52 830 ORF entries.44 The yeast, rat, and human databases are 0.4, 2.4, and 3.7 times, respectively, the size of the Drosophila database. To generate SEQUEST results using both the control and eukaryote databases, the sequences for the control protein set were appended to the eukaryote sequence databases, as described with the Drosophila database (see Experimental section).
The false positive identification rates were affected by database size and ranged from 2% for the yeast database to 15% for the human database. These false positive hits do not appear to be purely random occurrences; they generally occur for peptides having high sequence similarity with a peptide from the control protein set. Note that while the MudPIT criteria and the discriminant function produce nearly the same false positive identification rates, the latter results in significantly more identifications (Table 5). Depending on the discriminant parameters used, the increase in identifications ranges from 20% to 40%. Also, from calculations based on the discriminant function, a 5% to 9% (yeast to human) increase in the number of true identifications can be attributed solely to the use of the dNET parameter. From these results, it is apparent that mapping peptide identification confidence to discriminant score should include a measure of database size. We have validated the discriminant functions derived in this work with several hundred LC-MS/MS analyses (approximately 2.2 million spectra) of peptides from human plasma protein (Table 6). In applying the discriminant functions, we assume that the procedure extends to identifications from human SCX-MS/MS analyses, which involve a more complex sample. Comparing the different identification criteria on these datasets showed trends similar to those observed for the control protein datasets, but with a more substantial (16%) increase in unique peptide identifications (9022 vs 7779) using the dNET parameter. Because the plasma proteins are more aptly considered unknowns rather than control proteins, we can only assume that the percent false identification rate is similar to that of the human database in Table 5. This assumption is

Table 5. Effect of Database Size on the Identification Rate Using the Yates Criteria and the Discriminant Function Based Analyses^a

criteria                      DB       TRUE   FALSE   % false
MudPIT^b with < 0.10 dNET     human    1008   176     14.9%
MudPIT                        human    1047   196     15.8%
discriminant^c (w/dNET)       human    1327   231     14.8%
discriminant^c (w/o dNET)     human    1219   203     14.3%
MudPIT with < 0.10 dNET       rat      1022   114     10.0%
MudPIT                        rat      1064   142     11.8%
discriminant^c (w/dNET)       rat      1356   165     10.8%
discriminant^c (w/o dNET)     rat      1253   139     10.0%
MudPIT with < 0.10 dNET       f. fly   1067   23      2.1%
MudPIT                        f. fly   1111   42      3.6%
discriminant^c (w/dNET)       f. fly   1412   40      2.8%
discriminant^c (w/o dNET)     f. fly   1328   34      2.5%
MudPIT with < 0.10 dNET       yeast    1108   21      1.9%
MudPIT                        yeast    1156   47      3.9%
discriminant^c (w/dNET)       yeast    1520   36      2.3%
discriminant^c (w/o dNET)     yeast    1445   28      1.9%

a Only top (rank = 1) peptide assignments from SEQUEST were used, and final tally not filtered for unique sequences. b Criteria correspond to those used in ref 24. c Using criteria in Tables 2 and 3.
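The percentages in Table 5 and the gains quoted in the text can be reproduced with a few lines of arithmetic. The sketch below copies the TRUE and FALSE counts from Table 5; the dictionary keys and function names are labels chosen here for illustration. It assumes that "% false" is FALSE/(TRUE + FALSE), which matches every row of the table.

```python
from typing import Dict, Tuple

# (TRUE, FALSE) counts copied from Table 5; the short keys are illustrative labels.
TABLE5: Dict[str, Dict[str, Tuple[int, int]]] = {
    "human": {"mudpit_dnet": (1008, 176), "mudpit": (1047, 196),
              "disc_dnet": (1327, 231), "disc_no_dnet": (1219, 203)},
    "rat": {"mudpit_dnet": (1022, 114), "mudpit": (1064, 142),
            "disc_dnet": (1356, 165), "disc_no_dnet": (1253, 139)},
    "f. fly": {"mudpit_dnet": (1067, 23), "mudpit": (1111, 42),
               "disc_dnet": (1412, 40), "disc_no_dnet": (1328, 34)},
    "yeast": {"mudpit_dnet": (1108, 21), "mudpit": (1156, 47),
              "disc_dnet": (1520, 36), "disc_no_dnet": (1445, 28)},
}


def percent_false(true_ids: int, false_ids: int) -> float:
    """False identifications as a percentage of all identifications."""
    return 100.0 * false_ids / (true_ids + false_ids)


def percent_gain(new: int, old: int) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


def dnet_gain(db: str) -> float:
    """Gain in true identifications from adding dNET to the discriminant."""
    rows = TABLE5[db]
    return percent_gain(rows["disc_dnet"][0], rows["disc_no_dnet"][0])


for db, rows in TABLE5.items():
    true_ids, false_ids = rows["disc_dnet"]
    print(f"{db:7s} % false = {percent_false(true_ids, false_ids):4.1f}, "
          f"gain from dNET = {dnet_gain(db):4.1f}%")
```

Run on the table values, the gain attributable to dNET grows from the yeast database to the human database, in line with the 5% to 9% range stated in the text.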

Table 6. Number of Different (Unique) Peptides Identified from Human Plasma Using Different Simple and Discriminant Criteria^a

criteria                      peptides identified
MudPIT with < 0.10 dNET       7052
MudPIT                        8415
Discriminant^b (w/dNET)       9022
Discriminant^b (w/o dNET)     7779

a Only top (rank = 1) peptide assignments from SEQUEST were used. b Criteria used are F(using dNET) > 1.7(1+)/2.6(2+)/2.9(3+) and F(no dNET) > 1.6(1+)/2.7(2+)/3.2(3+).
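The "with < 0.10 dNET" rows in Tables 5 and 6 correspond to a simple cut on the disagreement between predicted and observed elution time. A minimal sketch of such a cut is given below, assuming that dNET denotes the absolute difference between NETpred and NETobs; the `Hit` class, the peptide sequences, and the NET values are hypothetical and serve only to show the filtering step.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    peptide: str
    net_pred: float  # predicted normalized elution time
    net_obs: float   # observed normalized elution time


def dnet(hit: Hit) -> float:
    """Absolute difference between predicted and observed NET (assumed definition)."""
    return abs(hit.net_pred - hit.net_obs)


def passes_dnet_cut(hits: Iterable[Hit], max_dnet: float = 0.10) -> List[Hit]:
    """Keep only hits whose dNET is below the threshold."""
    return [hit for hit in hits if dnet(hit) < max_dnet]


# Hypothetical identifications that already passed the mass spectral criteria.
hits = [
    Hit("PEPTIDEK", 0.42, 0.45),    # dNET of about 0.03, kept
    Hit("LSSPATLNSR", 0.30, 0.47),  # dNET of about 0.17, removed
    Hit("AVDELR", 0.18, 0.12),      # dNET of about 0.06, kept
]
kept = passes_dnet_cut(hits)
print([hit.peptide for hit in kept])
```

A hard cut of this kind can only remove identifications, which is why the "with < 0.10 dNET" rows contain fewer peptides than the plain MudPIT rows, whereas the discriminant function uses dNET as one weighted parameter among several.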

Figure 5. Histogram of the NETpred - NETobs values obtained from 9022 different peptides from human plasma using F ≥ (1.7/2.6/2.9).

supported by the error distribution for these 9022 different peptides, which shows good agreement between the predicted and observed NET values (Figure 5) and is similar to the histogram in Figure 1. Although the number of peptide identifications for the plasma set of analyses is strongly influenced by NET information, protein identification tallies show a less marked improvement. Protein identifications were made using the ProteinProphet program,27 which uses probabilities from either our NET or non-NET discriminant and fully treats peptide degeneracy (or multi-ORF peptide) information. The NET discriminant results in a 3-4% increase in protein identifications, depending on the protein grouping and probability level used. The numbers of proteins identified using the two discriminant functions are given as Supporting Information. In this study, peptide LC elution times are used to increase peptide identifications while maintaining similar levels for false identification rates, whereas Jacobs et al.15 recently used elution time information to eliminate questionable identifications. By eliminating all identifications with a dNET > 0.10, the initial set of 1700 proteins determined for the human epithelial cell proteome was reduced to 1574, a reduction of 126 proteins. For 107 of these 126 proteins, poor coverage of the protein was obtained. This finding that protein identifications with lower confidence scores, coverage, etc., can be eliminated using LC elution information is consistent with the present findings using the control protein and plasma datasets.
Reversed Database False Positives. Another approach to determining false positives is to perform an MS/MS search using a FASTA database with the protein sequences reversed.21 This is similar to adding protein sequences to the search database, as performed previously with Drosophila and other organisms, except that the false positive database is similar in size. Peptide identifications from the forward database are assumed true and those from the reverse database are assumed false (if a peptide hit occurs in both database searches, this is not counted as a reverse database hit).
Using the criteria in Table 6 for the NET based discriminant and taking 10 rich LC-MS/MS runs from the human plasma analyses, 126 unique peptides were identified from the reversed IPI human database and 1490 from the forward (normal) database. This 7.8% false positive rate is lower than that found using the criteria in ref 24, for which the reverse database search resulted in 245 unique identifications and the forward search in 1344, a false positive rate of 15.4%. The results are consistent with the results found in Table 6, except the NET discriminant false positive rate is significantly lower.
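The reversed database estimate described above can be sketched in a few lines. This is an illustration under stated assumptions, not the search pipeline used in this work: the function names and the `REV_` prefix are hypothetical, and the rate is taken as reverse hits divided by the sum of reverse and forward hits, which reproduces the 7.8% and 15.4% values quoted in the text.

```python
from typing import Dict, Set


def reverse_database(proteins: Dict[str, str], prefix: str = "REV_") -> Dict[str, str]:
    """Build a reversed-sequence database of the same size as the input."""
    return {prefix + name: sequence[::-1] for name, sequence in proteins.items()}


def rate_from_counts(reverse_hits: int, forward_hits: int) -> float:
    """Percent of identifications that come from the reversed database."""
    total = reverse_hits + forward_hits
    return 100.0 * reverse_hits / total if total else 0.0


def reverse_hit_rate(forward_peptides: Set[str], reverse_peptides: Set[str]) -> float:
    """Same rate from peptide sets; peptides found in both searches are
    not counted as reverse database hits."""
    reverse_only = reverse_peptides - forward_peptides
    return rate_from_counts(len(reverse_only), len(forward_peptides))


# Counts quoted in the text for the 10 human plasma runs.
print(round(rate_from_counts(126, 1490), 1))  # NET based discriminant
print(round(rate_from_counts(245, 1344), 1))  # criteria of ref 24
```

Reversing each sequence keeps the database size and amino acid composition unchanged, which is what makes the reversed search comparable to the forward one.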

Conclusions
Using a set of control proteins, the number of peptide identifications is increased about 6.5% when dNET is used with a Drosophila sized database. The improvement gained by using dNET is greater for human or rat databases; we have consistently observed 8-9% gains in identification rates of the control protein sequences with no increase in the false positive rates. These gains appear to be more prominent in singly charged species than in multiply charged peptides, since the information from elution time augments the limited fragmentation information for small peptides. The discriminant function appears to be a good way to maximize the use of each parameter, especially when the parameters measure orthogonal chemical properties. We have consistently seen gains of 30% or better compared to simple criteria using thresholds. The relative gains from using dNET in the discriminant function are consistent between the control protein datasets and the shotgun SCX/LC-MS/MS experiments on human plasma. This indicates that functions derived from limited control proteins extend well to the shotgun LC-MS/MS experiments explored here. We have used control protein datasets because they allow us to establish a population of accurate and false


identifications that are as pure as possible. Because the control protein datasets do not represent the levels of modified peptides, sequence polymorphism, and other issues such as contamination seen in some proteomic samples, further experiments with more complex control sets or a well characterized proteome would be useful for elucidating their impact. On the basis of our results we have initially found that “runner-up” peptides provide only a modest improvement to the number of control set identifications. However, there are likely additional ways that runner-up matches could be used and validated. In particular, we suggest that if there are two peptide sequences that have reasonably high scores from a particular spectrum, either the final score (or probability) of the top hit should be lowered or a flag should be raised so that additional effort can be made to discriminate between the two choices. This indicates that additional work to incorporate more parameters, such as highly accurate mass measurements and other separations, is needed. In higher mammalian species, errant protein identifications occur often, as indicated by the 17 protein mixture results and previous epithelial proteome results,15 since peptide homologues are present in abundance and pattern matching algorithms have difficulty discerning them. Finally, additional work is needed to improve the quality of the elution time predictive capability, which should result in further substantial increases in the quality or quantity of identifications. The algorithm can be extended beyond the LC separation methods used here to other HPLC systems and to other separation methods and techniques (electrophoresis and ion exchange chromatography). Additionally, protein modifications are expected to have some effect on observed elution time values, and the inclusion of these modifications in an algorithm would be beneficial, especially in eukaryote proteomics.
Searches with large numbers of modifications would also require adjustment of the discriminant parameters, since these searches drastically enlarge the search space.
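The suggestion made in the Conclusions, to raise a flag when two candidate sequences both score reasonably high for the same spectrum, could look roughly like the sketch below. The function name, the score scale, and both thresholds are hypothetical choices for illustration; the text does not specify how such a flag should be defined.

```python
from typing import List, Tuple


def flag_ambiguous(scored_hits: List[Tuple[str, float]],
                   min_score: float, max_gap: float) -> bool:
    """Return True when the two best candidates for one spectrum both score
    at least `min_score` and differ by no more than `max_gap`.

    Both thresholds are assumptions made here, not values from the text.
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[1], reverse=True)
    if len(ranked) < 2:
        return False
    top_score = ranked[0][1]
    second_score = ranked[1][1]
    return second_score >= min_score and (top_score - second_score) <= max_gap


# Hypothetical candidates (sequence, score) for two spectra.
close_pair = [("ELVISLIVESK", 3.2), ("ELVLSLIVESK", 3.1), ("AAAGK", 0.9)]
clear_winner = [("ELVISLIVESK", 3.2), ("AAAGK", 1.5)]
print(flag_ambiguous(close_pair, min_score=2.6, max_gap=0.2))
print(flag_ambiguous(clear_winner, min_score=2.6, max_gap=0.2))
```

A flagged spectrum would then be passed to whatever additional evidence is available, such as accurate mass or elution time, rather than being resolved by score rank alone.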

Acknowledgment. Portions of this work were supported by the National Center for Research Resources (RR 018522) and the Department of Energy Office of Biological and Environmental Research. Pacific Northwest National Laboratory is operated by Battelle for the DOE under contract DE-AC06-76RLO 1830. The authors also would like to acknowledge Drs. Ruedi Aebersold, Alex Nesvizhskii, and Andrew Keller at the Institute for Systems Biology for their permission and assistance in using the Peptide Prophet program.
Supporting Information Available: Tables of peptide and protein identifications from control and plasma protein analyses, and mass spectral information for several peptides. This material is available free of charge via the Internet at http://pubs.acs.org.
References
(1) http://wit.integratedgenomics.com/GOLD/.
(2) Pasa-Tolic, L.; Lipton, M. S.; Masselon, C.; Anderson, G. A.; Shen, Y.; Tolic, N.; Smith, R. D. J. Mass Spectrom. 2002, 37.
(3) Shen, Y.; Tolic, N.; Zhao, R.; Pasa-Tolic, L.; Li, L.; Berger, S. J.; Harkewicz, R.; Anderson, G. A.; Belov, M. E.; Smith, R. D. Anal. Chem. 2001, 73, 3011-3021.
(4) Strittmatter, E. F.; Ferguson, P. L.; Tang, K.; Smith, R. D. J. Am. Soc. Mass Spectrom. 2003, 14, 980-991.
(5) Boguski, M. S.; McIntosh, M. W. Nature 2003, 422, 233-237.
(6) Hanash, S. Nature 2003, 422, 226-232.
(7) Smith, R. D. Comparative and Functional Genomics 2002, 3, 143-150.
(8) Corbin, R. W.; Paliy, P.; Yang, F.; Shabanowitz, J.; Platt, M.; Lyons, C.; Root, K.; McAuliffe, J.; Jordan, M. I.; Kustu, S.; Soupene, E.; Hunt, D. F. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 9232-9237.
(9) Bader, G. D.; Heilbut, A.; Andrews, B.; Tyers, M.; Hughes, T.; Boone, C. Trends Cell Biol. 2003, 13, 344-356.
(10) Kislinger, T.; Emili, A. Curr. Opin. Mol. Therapeut. 2003, 5, 285-293.
(11) Anderson, N. L.; Anderson, N. G. Mol. Cell. Proteomics 2002, 1, 845-867.
(12) Aebersold, R.; Goodlett, D. R. Chem. Rev. 2001, 101, 269-295.
(13) Lipton, M. S.; Pasa-Tolic, L.; Anderson, G. A.; Anderson, D. J.; Auberry, D. L.; Battista, J. R.; Daly, M. J.; Fredrickson, J.; Hixson, K. K.; Kostandarithes, H.; Masselon, C.; Markillie, L. M.; Moore, R.; Romine, M. F.; Shen, Y.; Strittmatter, E.; Tolic, N.; Udseth, H. R.; Venkateswaran, A.; Wong, K. K.; Zhao, R.; Smith, R. D. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 11 049-11 054.
(14) Washburn, M. P.; Ulaszek, R.; Deciu, C.; Schieltz, D.; Yates, J. R. Anal. Chem. 2002, 74, 1650-1657.
(15) Jacobs, J. M.; Mottaz, H. M.; Yu, L. R.; Anderson, D. J.; Moore, R. J.; Chen, W. U.; Auberry, K. J.; Strittmatter, E. F.; Monroe, M. E.; Thrall, B. D.; Camp, D. G.; Smith, R. D. J. Proteome Res. 2003.
(16) Link, A. J.; Hays, L. G.; Carmack, E. B.; Yates, J. R. Electrophoresis 1997, 18, 1314-1334.
(17) Quach, T. T.; Li, N.; Richards, D. P.; Zheng, J.; Keller, B. O.; Li, L. J. Proteome Res. 2003, 2, 543-552.
(18) Aebersold, R.; Mann, M. Nature 2003, 422, 198-207.
(19) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(20) Perkins, D.; Pappin, D.; Creasy, D. London U. Electrophoresis 1999, 20, 3551-67.
(21) Peng, J.; Elias, J.; Thoreen, C.; Licklider, L.; Gygi, S. J. Proteome Res. 2003, 2, 43-50.
(22) Florens, L.; Washburn, M. P.; Raine, J. D.; Anthony, R. M.; Grainger, M.; Hayness, J. D.; Moch, J. K.; Muster, N.; Sacci, J. B.; Tabb, D. L.; Witney, A. A.; Wolters, D.; Wu, Y.; Gardner, M. J.; Holder, A. A.; Sinden, R. E.; Yates, J.; Carucci, D. J. Nature 2002, 419, 520-526.
(23) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.
(24) Washburn, M. P.; Wolters, D.; Yates, J. R. Nat. Biotechnol. 2001, 19, 242-247.
(25) Moore, R. E.; Young, M. K.; Lee, T. D. J. Am. Soc. Mass Spectrom. 2002, 13, 378-386.
(26) Craig, R.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2003, 17, 2310-2316.
(27) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2003, 75, 4646-4658.
(28) Petritis, K.; Kangas, L. J.; Ferguson, P. L.; Anderson, G. A.; Pasa-Tolic, L.; Lipton, M. S.; Auberry, K. J.; Strittmatter, E.; Shen, Y.; Zhao, R.; Smith, R. D. Anal. Chem. 2003, 75, 1039-1048.
(29) Palmblad, M.; Ramstrom, M.; Markides, K.; Hakansson, P.; Bergquist, J. Anal. Chem. 2002, 74, 5826-5830.
(30) Palmblad, M.; Ramstrom, M.; Bailey, C.; McCutchen-Maloney, S. L.; Bergquist, J.; Zeller, L. C. J. Chromatogr. B 2004, 803, 131-135.
(31) Strittmatter, E.; Rodriguez, N.; Smith, R. D. Anal. Chem. 2003, 75, 460-468.
(32) Conrads, T. P.; Anderson, G. A.; Veenstra, T. D.; Pasa-Tolic, L.; Smith, R. D. Anal. Chem. 2000, 72, 3349-3354.
(33) Smith, R. D.; Anderson, G. A.; Lipton, M. S.; Masselon, C.; Pasa-Tolic, L.; Shen, Y.; Udseth, H. R. OMICS 2002, 6, 61-90.
(34) Pasa-Tolic, L.; Jensen, P. K.; Anderson, G. A.; Lipton, M. S.; Peden, K. K.; Martinovic, S.; Tolic, N.; Bruce, J. E.; Smith, R. D. J. Am. Chem. Soc. 1999, 121, 7949-7950.
(35) Belov, M. E.; Anderson, G. A.; Wingerd, M. A.; Udseth, H. R.; Tang, K.; Prior, D. C.; Swanson, K. R.; Buschbach, M. A.; Strittmatter, E. F.; Moore, R. J.; Smith, R. D. J. Am. Soc. Mass Spectrom. 2004, 15, 212-232.
(36) Shen, Y.; Jacobs, J. M.; Camp, D. G.; Fang, R.; Moore, R. J.; Smith, R. D.; Xiao, W.; Davis, R. W.; Tompkins, R. J. Anal. Chem. 2004, 76, 1134-1144.
(37) Shen, Y.; Zhao, R.; Berger, S. J.; Anderson, G. A.; Rodriguez, N.; Smith, R. D. Anal. Chem. 2002, 74, 4235-4249.
(38) Shen, Y.; Zhao, R.; Belov, M. E.; Conrads, T. P.; Anderson, G. A.; Tang, K.; Pasa-Tolic, L.; Veenstra, T. D.; Lipton, M. S.; Smith, R. D. Anal. Chem. 2001, 73, 1766-1775.
(39) MatLab, 6.5; MathWorks Inc.: 2002.
(40) Krzanowski, W. J. Principles of Multivariate Analysis; Oxford University Press: New York, 1988; 563.
(41) The NAG C Library Manual, Mark 7. The Numerical Algorithms Group Ltd: Oxford UK, 2002.
(42) Boggs, P. T.; Donaldson, J. R.; Byrd, R. H.; Schnabel, R. B. ACM Transactions on Mathematical Software (TOMS) 1989, 15, 348-364.
(43) Stanford Yeast Genome. http://www.yeastgenome.org/.
(44) International Protein Index. ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/.

PR049965Y

Journal of Proteome Research • Vol. 3, No. 4, 2004 769