Large Scale Analysis of MASCOT Results Using a Mass Accuracy-Based THreshold (MATH) Effectively Improves Data Interpretation Paul A. Rudnick,*,† Yueju Wang,‡ Erin Evans,† Cheng S. Lee,‡ and Brian M. Balgley† Calibrant Biosystems, 7507 Standish Pl., Rockville, Maryland 20855, and Department of Chemistry and Biochemistry, University of Maryland, College Park, Maryland 20742 Received March 3, 2005
In this report, we take a heuristic approach to studying the effects of mass tolerance settings and database size on the sensitivity and specificity of MASCOT. We also examine the efficacy of the MASCOT Identity Threshold as a discriminator when applied to QqTOF data with an average mass accuracy of 10 ppm or better. As predicted, arbitrarily large mass tolerance settings negatively affect MASCOT’s specificity and, to a lesser degree, its sensitivity. Increased mass tolerances also render the generation of a significance threshold less effective. To study these effects, we used Bayes’ Law to calculate MASCOT’s predictive values. With a relatively small search database (Human IPI), MASCOT had a mean positive predictive value of 0.993 when combined with MASCOT’s Identity Threshold. However, the corresponding negative predictive value, or the probability that an ion was not present given no score or a score below threshold, was reduced as mass tolerances were tightened, and had an average value of 0.717. This value was improved upon by extrapolating an empirical threshold using a reversed database search and a new algorithm to rapidly identify false positive identifications. Using the empirical threshold reduced false negative identifications by an average of 17% while limiting the false positive rate to below 5%; even larger reductions were obtained using mass tolerances approaching two times the actual error of the experimental data. A simple application of this strategy to the analysis of a microdissected glioblastoma multiforme sample analyzed by IEF/LC-MS/MS is reported, as is a description of the tools required to implement a large scale analysis using this alternative approach. Keywords: bioinformatics • MASCOT • data analysis • search algorithms • statistics • IEF/LC-MS/MS • SEQUEST • data standards • biomarkers
Introduction The emerging field of proteomics offers the promise of a new model for biological study. The rapid and sensitive identification of thousands of peptides, and by inference proteins, by various implementations of liquid chromatography combined with tandem mass spectrometry (LC-MS/MS) has become widespread. Techniques for quantifying proteomic experiments are established, and the ability to perform temporal analyses is accelerating.1,2 These experiments allow biologists to view the proteins isolated from a living system en masse, creating a vast potential experimental landscape of study. The foundation of these experiments is the identification of the peptide or protein. If this step is suspect, then all subsequent data are equally suspect. Therefore, it is essential that biologists on the receiving end of the experimental pipeline have a known measure of confidence in the interpreted data. Recent reports have begun to raise key issues relating to the confidence of these identifications.3-5 * To whom correspondence should be addressed. Phone: (301) 424-2320 x23. Fax: (301) 424-2462. E-mail:
[email protected]. † Calibrant Biosystems. ‡ University of Maryland. 10.1021/pr0500509 CCC: $30.25
2005 American Chemical Society
Peptides are ionized in order to enter a mass spectrometer. As the mass spectrometer scans a set mass range, any ions with intensities above a preset threshold will be individually isolated and fragmented (MS/MS). The serial acquisition of fragmentation data takes place over the course of the entire LC elution, during which thousands to hundreds of thousands of MS/MS events occur in a typical run, depending on the type of mass spectrometer used. These data are recorded in real time by a computer, and at the end of the analysis, a software program termed a peak-list generator creates files consisting of the precursor ion mass and intensity, its charge state (if available), and its associated fragment ions and their intensities. This peak list is then submitted to a peptide identification search algorithm. The SEQUEST6 and MASCOT7 search algorithms are currently the two most widely used tools for peptide identification, although numerous others have been developed.8-11 Both SEQUEST and MASCOT take roughly the same approach to making identifications. Although some of the algorithmic details are not published, the generalized scoring schemes have been documented. A database of possible peptide matches is created for each precursor ion. This is done by examining the mass of Journal of Proteome Research 2005, 4, 1353-1360
Published on Web 06/25/2005
the precursor ion, the mass accuracy tolerance setting input by the user, the enzyme specificity and the number of missed enzyme cleavages permitted within the context of the database to be searched. A larger mass tolerance setting will increase the number of possible matches to be searched, as will lowering the required enzyme specificity, increasing the number of allowed missed cleavages, and searching increasingly large databases. Once a search set has been created for a given precursor ion, the algorithm attempts to match the experimental MS/MS data with theoretical fragmentations of each of the possible peptide matches. Scores are assigned to each of the theoretical MS/MS spectra, the top 10 matches are recorded, and the top match is listed in the output seen by the user. The main difference between the two algorithms is that MASCOT uses a probability-based scoring mechanism while SEQUEST, as currently available, does not. There is one major consequence of this difference: SEQUEST users are unable to determine the likelihood that any given peptide match is nonrandom for a given search set. As a result, an arbitrary score for each peptide charge state is used as a cutoff for identification. In practice, the same cutoffs are used regardless of the search parameters. Yet a change in precursor ion mass tolerance, enzyme specificity, allowable missed cleavages, database (organism) and allowed modifications should result in different cutoffs, as each of these parameters will change the probability of a random match. It has recently been noted that even different samples from the same organism, analyzed by the same techniques, can have very different false-positive rates.5 Various solutions have been offered to account for these variables;12-17 however, none seems to be in widespread use.
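To illustrate how the precursor mass tolerance sets the size of the search set described above, the following minimal Python sketch selects candidate peptide masses that fall inside a ppm window. The function names and the list of masses are hypothetical and are not part of MASCOT or SEQUEST; enzyme specificity, missed cleavages, and modifications are ignored for brevity.

```python
from bisect import bisect_left, bisect_right


def ppm_window(precursor_mass, tol_ppm):
    """Return the (low, high) mass bounds for a tolerance given in ppm."""
    delta = precursor_mass * tol_ppm / 1e6
    return precursor_mass - delta, precursor_mass + delta


def candidates(sorted_masses, precursor_mass, tol_ppm):
    """Return the theoretical peptide masses that fall inside the window.

    sorted_masses must be sorted in ascending order.
    """
    low, high = ppm_window(precursor_mass, tol_ppm)
    start = bisect_left(sorted_masses, low)
    stop = bisect_right(sorted_masses, high)
    return sorted_masses[start:stop]


# Hypothetical theoretical peptide masses (Da), for illustration only.
masses = [999.95, 999.99, 1000.00, 1000.02, 1000.10, 1001.20]

narrow = candidates(masses, 1000.00, 30)   # 30 ppm window, +/- 0.03 Da
wide = candidates(masses, 1000.00, 150)    # 150 ppm window, +/- 0.15 Da
print(len(narrow), len(wide))
```

In this toy list the wider window admits more candidates than the narrow one, which is the effect on search set size that the text describes.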
MASCOT, on the other hand, allows users to define a probability threshold above which an identified ion has a default 5% or less probability of occurring by chance alone. This allows users a degree of confidence in the displayed results. However, because the number of peptides meeting the identification threshold will decrease as the search tolerances are reduced, the user cannot know the degree to which sensitivity is compromised when using such a cutoff. The burden is on the user to ensure that all search parameter settings are appropriate to the data in order to maximize both interpretation and data quality. This need becomes more acute as the proteomics community moves toward documented data and data analysis standards. This paper focuses on identifications made from peptides analyzed by ESI-QqTOF-MS/MS using MASCOT as the peptide identification engine. Our current proteome analysis pipeline includes a Waters QTOF micro, which has 1-2 orders of magnitude greater mass accuracy than ion trap instruments. It has previously been noted that higher mass accuracy can improve the discrimination power of database searches.18 With this in mind, and given that MASCOT does not directly consider mass accuracy when comparing theoretical ion masses, we wanted to determine how higher mass accuracy, and the implied data confidence, could be used to improve the quality and quantity of MASCOT identifications. To that end, we asked the following questions: (1) How do database size and mass tolerance settings affect search results and MASCOT’s interpretive models? (2) What is the actual predictive power of MASCOT (i.e., the ability of MASCOT’s probability-based scoring model and threshold scheme to predict the presence or absence of an ion given a positive or negative result)? (3) Can mass accuracy be used to improve sensitivity and specificity
Journal of Proteome Research • Vol. 4, No. 4, 2005
Rudnick et al.
and/or the predictive power of the MASCOT package? To answer these questions, a mixture of known standard proteins and a microdissected sample of human brain tissue from a glioblastoma multiforme patient were analyzed. The standard protein sample, in which true positive peptide identifications can be validated, was used to test the data models. This alternative approach was then applied to a 14-fraction CIEF/LC-MS/MS run of the complex human tissue sample.
Experimental Section Materials and Reagents. Fused-silica capillaries (100 µm i.d./ 375 µm o.d. and 50 µm i.d./375 µm o.d.) were acquired from Polymicro Technologies (Phoenix, AZ). Acetic acid, ammonium hydroxide, dithiothreitol (DTT), iodoacetamide (IAM), bovine serum albumin (A0281), yeast alcohol dehydrogenase (A7011), equine myoglobin (A8673), bovine carbonic anhydrase II (C2522), rabbit glycogen phosphorylase B (P6635), equine cytochrome C (C7752), bovine RNAse A (R5500), and bovine ubiquitin (U6253) were obtained from Sigma (St. Louis, MO). Acetonitrile, ammonium acetate, formic acid, hydroxypropyl cellulose (HPC, average MW 100 000), trifluoroacetic acid (TFA), tris(hydroxymethyl)aminomethane (Tris), and urea were purchased from Fisher Scientific (Pittsburgh, PA). Pharmalyte 3-10 was acquired from Amersham Pharmacia Biotech (Uppsala, Sweden). Sequencing grade trypsin was obtained from Promega (Madison, WI). Human [Glu1] fibrinopeptide B (GFP) was purchased from VWR (West Chester, PA). All solutions were prepared using water purified by a Nanopure II system (Dubuque, IA) and further filtered with a 0.22 µm membrane (Millipore, Billerica, MA). Microdissection and Protein Digestion. Tumor cells from human brain sections were microdissected as previously described.19 The microdissected cells were placed directly into a microcentrifuge tube containing 10 mM Tris-HCl, pH 8.0. No protein extraction or cleanup was performed on the cells. From this stage the cells and standard proteins were treated identically. Proteins were made to 8 M in urea, DTT was added to 10 mg/mL and the solution was incubated at 37 °C for 2 h. IAM was added to 20 mg/mL and the solution was incubated at room temperature for 1 h in the dark. The solution was then diluted 4-fold with 100 mM ammonium acetate, pH 8.0. Trypsin was added at a 1:50 enzyme-to-substrate ratio and the solution was incubated at 37 °C overnight. 
Digestates were desalted and concentrated using a Peptide MacroTrap column (Michrom Bioresources, Auburn, CA) and then lyophilized to dryness using a SpeedVac (Thermo, San Jose, CA) and stored at -20 °C. Isoelectric Focusing, Liquid Chromatography, and Mass Spectrometry. CIEF/LC/MS/MS was performed essentially as previously described.20 Briefly, peptides were reconstituted to 2 mg/mL in 10 mM Tris-HCl, pH 7.6 containing 1% Pharmalyte 3-10, loaded into a HPC-coated fused silica capillary (100 µm i.d. × 80 cm) and focused to 300 V/cm using 0.1 M acetic acid and 0.5% NH4OH as anolyte and catholyte, respectively. Focused peptides were fractionated and each fraction was analyzed by LC-MS/MS. Liquid chromatography was performed using an Ultimate HPLC (Dionex, Sunnyvale, CA) equipped with a nano-flow splitter connected to a pulled-tip fused silica capillary resolving column (50 µm i.d.) packed with 15 cm of 5 µm Zorbax Stable Bond (Agilent, Palo Alto, CA) C18 particles. Peptides were eluted at 200 nl/min using a 5-45% linear acetonitrile gradient over 100 min and electrosprayed into a QTOF micro (Waters, Milford, MA) mass spectrometer.
research articles
Mass Accuracy Based Thresholds
Mass spectra were acquired from 500 to 1900 m/z for 1 s followed by 3 data-dependent MS/MS scans from 50 to 1900 m/z for 3 s each. GFP lock mass was infused at a rate of 300 nl/min and was acquired for 1 s every 2 min throughout the run. Data Analysis. Raw data were lock mass corrected, deisotoped and converted to peak list files by ProteinLynx Global Server 2 (Waters). Among the MASCOT 2.0 (Matrix Science, London, UK) search parameters, only the peptide and fragment mass accuracy tolerances were varied, as indicated for each experiment. Other parameters were as follows: missed cleavages, 1; enzyme, trypsin; peptide charge, (1+,2+,3+); fixed modifications, carbamidomethyl (C); variable modifications, oxidation (M), acetylations (K and N-term). Instrument setting was ESI-QUAD-TOF. All searches were launched from MASCOT Daemon (Matrix Science). Search databases were NR (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA), containing 2 768 002 entries, and Human IPI (2.28) (http://www.ebi.ac.uk/IPI/IPIhuman.html), containing 40 110 entries. IPI human was used to make comparisons between results from a smaller database versus the larger NR database. It was appended with the following entries as true positives: BSA (gi:1351907), yeast Adh1p and Adh2p (gi:6324486, gi:6323961), equine myoglobin (gi:70561), bovine carbonic anhydrase II (gi:41019480, gi:68288), rabbit glycogen phosphorylase (gi:6093713), equine cytochrome C (gi:117995), bovine RNAse A (gi:48429071), bovine ubiquitin (gi:51703340), and ubiquitin activating enzyme (gi:475916), which frequently shows up as a contaminant in preparations of ubiquitin. Porcine trypsin precursor (gi:136429) and 21 human keratin entries were also added so that hits to these proteins would not be counted as false positives. MASCOT flat files were parsed using MP.pm (Calibrant Biosystems, Rockville, MD), a Perl module developed from DBParser21 but rewritten for processing speed and a custom database schema.
Each search or combined multi-fraction search was collected into a single Proteome BiNDR (Calibrant Biosystems) project database (in preparation). These are 12-table MySQL relational databases into which all run, fraction, query, peptide hit, modification and protein hit information from many samples and/or runs can be loaded and dynamically queried. Results databases were queried to generate statistics using the Proteome BiNDR interface or with custom SQL written into Perl scripts. True positive peptides were matched by searching the peptide strings against the list of standard proteins given above. Empirical thresholds were calculated using the program ‘approx_fp’ (Calibrant Biosystems), which loads results from a forward and an equivalent reversed database search into separate results databases in parallel. False positive rates are then determined by counting the number of peptide hits scoring above the indicated threshold in the reversed database search, subtracting equivalent peptides found in the forward search, and dividing by the number of peptide hits above threshold in the forward search. This calculation was adapted for use in searches where the peptide components of the sample are unknown (i.e., true positive rates cannot be determined). Peptides containing fewer than 6 amino acids were not considered when calculating an empirical threshold. Peptides from the reversed database containing residues having mass differences indistinguishable by the instrument (i.e., Q → K and I → L) were allowed to match the forward peptides using either residue. Theoretical pI values for proteins and peptides for results databases were calculated using ‘iep’ or ‘pepstats’ programs developed by the EMBOSS
Figure 1. Mass errors were extracted for each true positive peptide hit from MASCOT searches against both the Human IPI database (black bars) and NR (gray bars) of 1455 tandem mass spectra derived from a mixture of 8 standard proteins. Search parameters for this search were chosen to be outside the known lock mass-corrected accuracy of the instrument: 150 ppm precursor ion mass tolerance and a 0.2 Da fragment ion mass tolerance. All other parameters are described in the Data Analysis section. The average error was 10.7 ± 9.8 ppm for the Human IPI search and 11.7 ± 10.7 ppm against NR. The data were plotted to illustrate the actual error within the dataset from correctly identified peptide ions.
group. All software was developed on Red Hat 9 (LINUX) using Perl (v. 5.8.0) and MySQL (3.23.54). Proteome BiNDR runs on Apache 2.0 and uses PHP 4.2.2 and classes distributed by PEAR (http://pear.php.net).
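The false positive calculation described in the Data Analysis section can be sketched in Python as follows. This is an illustration only, not the approx_fp program itself: the function names and the example hits are hypothetical, and it assumes that a reversed-database hit is excluded when its sequence matches any forward-search peptide after treating I/L and Q/K as equivalent.

```python
def canonical(peptide):
    """Collapse residues treated as indistinguishable (I/L and Q/K)."""
    return peptide.upper().replace("I", "L").replace("Q", "K")


def empirical_fp_rate(forward_hits, reversed_hits, threshold, min_length=6):
    """Estimate the false positive rate at a given score threshold.

    Each hit is a (peptide_sequence, score) pair. Peptides shorter than
    min_length residues are ignored, as in the text.
    """
    # Assumption: compare against all forward peptides, whatever their score.
    forward_seen = {canonical(seq) for seq, _ in forward_hits}

    forward_count = sum(
        1 for seq, score in forward_hits
        if score > threshold and len(seq) >= min_length
    )
    false_count = sum(
        1 for seq, score in reversed_hits
        if score > threshold
        and len(seq) >= min_length
        and canonical(seq) not in forward_seen
    )
    if forward_count == 0:
        return float("nan")  # rate is undefined without forward hits
    return false_count / forward_count


# Hypothetical hits, for illustration only.
forward = [("LVNELTEFAK", 45.0), ("AEFVEVTK", 30.0), ("DLGEEHFK", 33.0),
           ("GACLLPK", 26.0), ("YLYEIAR", 20.0), ("SHCIAEVEK", 18.0)]
reverse = [("KAFETLENVI", 27.0),   # counted as a false positive
           ("AEFVEVTQ", 28.0),     # equivalent to a forward peptide (Q/K)
           ("PKLLC", 40.0),        # shorter than 6 residues
           ("RAIEYLYK", 22.0)]     # below the threshold used here

rate = empirical_fp_rate(forward, reverse, threshold=24.0)
print(rate)
```

Evaluating the same function over a range of thresholds gives the score versus false positive rate points from which an empirical threshold can be estimated.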
Results and Discussion Determining the Actual Mass Error of a Search Set. We first examined our experimental mass accuracy by analyzing the digest of an equimolar mixture of the 8 standard proteins by LC-MS/MS. The raw data were then lock mass corrected, peak listed and searched by MASCOT using search parameters known to be outside the mass accuracy tolerance of a typical experiment. In this case, a 150 ppm peptide mass tolerance and a 0.2 Da fragment mass tolerance were chosen as a starting point. All MASCOT results data were then parsed and loaded into a Proteome BiNDR results database whether or not the MASCOT score was above the Identity Threshold (i.e., all hits were considered significant). Identified peptides were then read out of the database and searched against a file containing all protein sequences contained in the sample, including trypsin and common contaminants. Peptides that matched the true positive sequences were then flagged in the database and their mass differences were summed in 3 ppm bins for searches against the IPI human database and then against NR (Figure 1). From these data, mean errors of 10.7 ± 9.8 ppm for the Human database search and 11.7 ± 10.7 ppm for the NR search were determined for a total of 442 and 327 true positive peptide hits, respectively. In this case, a mass error of 33 ppm would encompass 95% of the true positive peptide hits. Identifying Correct and Incorrect Identifications. Next, the total number of true and false positive peptide hits was plotted for each set of mass tolerances for each of the two search databases (Figure 2). These values represent all scored peptide hits and are plotted to visualize the effects of varied mass tolerances on the sensitivity and specificity of MASCOT’s implementation of the MOWSE scoring algorithm. From these data it can be observed that the mass tolerance settings affect
Figure 2. Counts of true and false peptide hits were plotted for each set of search parameters. The solid bars represent true positive hits, black is a search against the IPI Human database and gray against NR. Speckled bars are false positive peptide hits. All data are collected without respect for a threshold score, simply to show the effects of database size and mass accuracy tolerance settings on the search algorithm.
both the quality and quantity of peptide hits; in particular, increased mass tolerances inflate the number of incorrect matches. Searches against NR at the largest mass tolerances yielded the highest false positive to true positive ratio. Only in the most restrictive search, using a 30 ppm precursor ion tolerance and a 0.2 Da fragment ion tolerance against IPI, is the number of true positive identifications higher than the number of false positives without filtering. Using these accurate parameters also yielded the highest numbers of true positive identifications, without sacrifice. Using parameters common to ion trap instruments (1.5 Da precursor mass tolerance, 0.3 Da fragment mass tolerance), the number of incorrect identifications was more than 3-fold the number of true positive identifications for the large database search, and greater than 2-fold for the smaller database. This presents a significant challenge to any discrimination test. Examining Sensitivity and Specificity. Next, to rapidly visualize the effects of mass tolerance settings and database size on ion scoring, receiver operator characteristic (ROC) curves were plotted for searches against the Human database and NR. These types of plots can be used to show the relationship between sensitivity and specificity for a given statistical test. In most instances, the X-axis would represent probabilities (1 - Specificity) and the Y-axis would report the sensitivity of the test. Here, the actual peptide hit numbers have been plotted for reference as the MASCOT score threshold was lowered from 50 to 1 (Figure 3). The Y-axes have been kept the same for comparison between the IPI Human and NR searches. These results indicate that the MASCOT score provides a good measure of true positives versus false positives. However, searches against larger databases impact both the sensitivity and the specificity of the scoring algorithm. Any event that simultaneously improves both the sensitivity and specificity of the test is an advantage.
In this case, using the appropriate search parameters and database leads to the highest numbers of true positive identifications while limiting false positives. The MASCOT Identity Threshold Applied to a Standard Protein ESI-QTOF Dataset and Theoretical Considerations. Next, we examined the MASCOT Identity Threshold as a guide to discriminate true positive from false positive peptide identifications. This threshold is calculated on the fly for each query,
Figure 3. Receiver Operator Characteristics (ROC) plots for varied mass accuracy tolerance settings searched against the (A) Human IPI or (B) NR database. The search data is ESI-QTOF-MS/MS data from a known mixture of 8 protein standards. Plots were generated by lowering the MASCOT score threshold from 50 to 1.
and is reported by a MASCOT CGI script when a report is viewed. This threshold can also be calculated by the MASCOT flat file parser MP.pm and loaded into BiNDR results databases. The Identity Threshold is calculated according to the following

$$T = 10\,\frac{\ln\left(\dfrac{m\,(0.05)}{T_S}\right)}{\ln(10)}$$

where $m$ is the number of matches within the precursor mass accuracy tolerance window, and $T_S$ is defined in the MASCOT configuration file as the significance threshold. A default value of 0.05 indicates a 5% probability of a match occurring by chance alone. We can calculate the rate of false positive identifications for a given search according to the following

$$R_{FP} = \frac{H_{FP}}{H_{TP} + H_{FP}}$$

where the number of false positive peptide hits ($H_{FP}$) divided by the sum of all hits combined ($H_{TP} + H_{FP}$) equals the rate of false positive identifications. From the total numbers of hits scoring above threshold combined with the data from Figure 2, we can calculate the predictive values for MASCOT. These are the probabilities that MASCOT will identify a peptide given that it was represented in the data, or MASCOT will not identify it, if
Table 1. MASCOT Predictive Values: MASCOT with the Identity Threshold

Mass tolerances              IPI                 NR
(precursor, fragment)        +a       -b         +a       -b
30 ppm, 0.2 Da               0.996    0.551      1.000    0.853
30 ppm, 0.8 Da               0.988    0.744      0.774    0.884
150 ppm, 0.2 Da              0.993    0.658      0.778    0.878
150 ppm, 0.8 Da              0.990    0.767      0.771    0.916
1.5 Da, 0.3 Da               0.991    0.765      0.729    0.921
1.5 Da, 0.8 Da               1.000    0.820      0.806    0.941

a Defined as the probability that a peptide was present in the sample given a score higher than threshold. b Defined as the probability that a peptide was not present in the sample given a score lower than threshold.
it was not represented in the data. This value can be calculated by applying Bayes’ Theorem according to the following

$$P(D_1 \mid T^+) = \frac{P(T^+ \mid D_1)}{P(T^+ \mid D_1) + P(T^+ \mid D_2)}$$

where $P(D_1 \mid T^+)$ is the probability that the peptide was in the data given a MASCOT score higher than threshold. The negative predictive value can also be defined as the following

$$P(D_2 \mid T^-) = \frac{P(T^- \mid D_2)}{P(T^- \mid D_2) + P(T^- \mid D_1)}$$
where $P(D_2 \mid T^-)$ is the probability that the peptide ion was not represented in the data given a score lower than threshold or no score. These values are summarized in Table 1. What can be seen in these data is that MASCOT combined with the Identity Threshold as a statistical discriminator (p < 0.05) does a good job of eliminating most, if not all, false positives. This can be calculated as the complement of the positive predictive value, which averages less than 0.007 for searches against the Human IPI database. However, MASCOT’s ability to eliminate false negatives, measured as the complement of the negative predictive value, is impaired by the high degree of specificity of the algorithm combined with the Identity Threshold, averaging 0.283 for the Human IPI searches. Stated differently, searching this dataset with a 150 ppm precursor mass tolerance and a 0.2 Da fragment ion mass tolerance against the Human IPI database, MASCOT will score a true positive precursor ion below threshold 37% of the time. This effect is exacerbated at decreased mass tolerances, indicating that the MASCOT Identity Threshold may be too conservative with lower noise, higher mass accuracy datasets. This also indicates that the additional true positive hits at increased mass tolerances, while true, are likely arrived at by chance. This provides some insight into why false positive rates are inherently higher for data searched at larger mass tolerances,3 whether due to limitations of the instrument or uncertainty on the part of the user about the proper search settings for that data set. Calculating a Mass Accuracy Based Threshold (MATH). Next, we sought to improve the negative predictive value of a MASCOT search by approximating a new threshold to minimize false negative identifications and maintain a low false positive rate.
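As a worked illustration of the false positive rate and predictive value expressions used in this section, the following Python sketch evaluates them literally as written, that is, from conditional probabilities without separate prior terms. The function names and input values are hypothetical and are not taken from Table 1.

```python
def false_positive_rate(h_tp, h_fp):
    """R_FP = H_FP / (H_TP + H_FP), from hit counts above threshold."""
    return h_fp / (h_tp + h_fp)


def positive_predictive_value(p_pos_given_present, p_pos_given_absent):
    """P(D1|T+) = P(T+|D1) / (P(T+|D1) + P(T+|D2))."""
    return p_pos_given_present / (p_pos_given_present + p_pos_given_absent)


def negative_predictive_value(p_neg_given_absent, p_neg_given_present):
    """P(D2|T-) = P(T-|D2) / (P(T-|D2) + P(T-|D1))."""
    return p_neg_given_absent / (p_neg_given_absent + p_neg_given_present)


# Hypothetical inputs, for illustration only:
# 70% of present peptides score above threshold,
# 0.5% of absent peptides score above threshold.
p_pos_present = 0.70
p_pos_absent = 0.005

ppv = positive_predictive_value(p_pos_present, p_pos_absent)
npv = negative_predictive_value(1 - p_pos_absent, 1 - p_pos_present)
r_fp = false_positive_rate(h_tp=95, h_fp=5)

print(round(ppv, 3), round(npv, 3), r_fp)
```

With these inputs the positive predictive value is high while the negative predictive value is noticeably lower, the same qualitative pattern the text reports for the Identity Threshold.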
Other groups have derived such thresholds by searching against a reversed search database.4,16,17 The assumption in these types of experiments is that any peptide hits resulting from a search against the reversed database that score above threshold are false positives. We added two exceptions to this test: (1) if the peptide identification also occurs in the forward search, it is not counted as a false positive, and (2) if the peptide is less than 6 residues in length, or contains substitutions that are not within the resolving power of the instrument, it is counted as neither false positive nor true positive.9 Reversed MASCOT databases were generated by reversing every
Figure 4. Empirical threshold score according to a calculated false positive rate can be approximated using a simple linear regression. Data from parallel forward and reversed database searches were loaded into BiNDR results databases and used to count the incidence of false positive identifications as the threshold score was lowered from 32 to 16. For this search consisting of 8383 tandem mass spectra, using a 30 PPM precursor ion mass tolerance and a 0.2 Da fragment ion mass tolerance and searched against Human IPI, a MASCOT score of 24.04 represents a predicted false positive rate of 5%.
line of the FASTA sequence file for both Human IPI and NR and writing new files. These sequences were then built into MASCOT searchable databases using the server’s automated utility. Searches against the reversed databases were run using the same search parameters as the forward searches, with the exception of the search database. After both searches were complete, the data files were parsed in parallel into two Proteome BiNDR databases using approx_fp (Experimental Section). As the threshold was lowered between any two MASCOT scores (in this case 32 and 16), this program counted the peptide hits from the reversed search scoring above each threshold and checked them against the peptides tables in the forward search results database, as mentioned above. The false positive rate for a given threshold score can then be calculated by dividing the number of hits to the reversed database by the number of peptide hits from the forward search above the same threshold. The percent false positive can then be plotted versus the MASCOT score for the data set. From these data, a threshold score can be extrapolated for a chosen false positive rate according to a simple linear regression of the exponential plot (Figure 4). This threshold score is then used to parse the forward search dataset as an alternative to the MASCOT Identity Threshold. Next, we tested the effects of an empirical threshold on the same data set, where true and false positive peptides are known, and compared the results with those obtained when the MASCOT Identity Threshold was used as a discriminator. For this analysis, the MASCOT score was subtracted from the Identity
Figure 5. Numbers of true positive peptides plotted as differences from (A) the MASCOT Identity Threshold and (B) the empirical threshold (MATH) score. Black bars are searches against Human IPI and gray bars are searches against NR. A difference less than zero indicates that the peptide hit was above threshold. Differences greater than zero are below threshold and would be false negative.
Threshold for each true positive peptide hit found in the IPI and NR searches at 150 ppm precursor ion mass tolerance and 0.2 Da fragment ion mass tolerance.9 The results were then summed in 4 unit bins and plotted (Figure 5A). In this figure, values greater than zero indicate that the peptide hit did not score above threshold and would be regarded as false negative. Of note in this plot is the large number of peptide hits that fall just outside the MASCOT threshold, in the [0-4] and [4-8] bins, for the IPI search. The analysis was repeated using the empirically determined threshold calculated according to the above strategy for each search (Figure 5B). Using the empirical threshold decreased the number of false negative identifications by 22% for this particular search. Examining the Effects of MASCOT and Empirical Thresholds on a Standard Protein Digest. To evaluate the trade-off between reduced false negatives and increased false positives, peptide hits above the MASCOT Identity Threshold or the empirically determined threshold were plotted (Figure 6). These results indicate that the determination of a run-wide empirical threshold improves the numbers of true positive identifications while limiting the false positive rate to approximately what was predicted from the linear regression. These differences are more pronounced with the IPI Human searches and are most significant at decreased mass tolerances, owing to a filtering of more hits from the candidate theoretical mass spectra by accurate mass. It appears from these searches,
Figure 6. Number of true positive and false positive peptide hits above (A) the MASCOT Identity Threshold and (B) the empirical threshold (MATH). Black bars are searches against Human IPI and gray bars are searches against NR. Solid bars are true positives and speckled bars are false positives.
however, that accurate results from the NR search are severely limited, and neither threshold test evaluates to the expected 5% error rate for false positive identifications. Empirically determining a significance threshold for high mass accuracy data can improve the negative predictive probability of MASCOT. Positive and negative predictive values calculated using the empirical mass accuracy-based threshold (MATH) are listed in Table 2 for comparison to Table 1. For a high mass accuracy search (30 ppm, 0.2 Da) against IPI Human, the negative predictive value is improved by over 10%, while the positive predictive value is lowered by less than 1%. Applying MATH to the Analysis of a Complex Human Tissue Sample. As the last exercise, we analyzed a single run of 14 cIEF fractions derived from a sample of glioblastoma multiforme, a highly aggressive form of brain tumor. This sample was analyzed in the same manner as the standard protein mix with one notable exception: true positive peptides are unknown. Because this is the case, one must cap the false positive rate at an acceptable level and then infer that the largest possible number of true positive identifications will score above threshold. All 14 fractions were searched against the IPI Human database at the 30 ppm, 0.2 Da mass tolerances; this dataset had a mean mass error of ∼8 ppm (data not shown). These parameters were used to test the strength of mass accuracy when applied to a typical MASCOT search of data derived from human tissues. The corresponding reversed database search for the sample was also run and the searches were analyzed using approx_fp to
research articles
Mass Accuracy Based Thresholds
Table 2. MASCOT Predictive Values: MASCOT with a MATH (Empirical Threshold)

mass tolerances           IPI              NR
(precursor, fragment)     + (a)   - (b)    + (a)   - (b)
30 ppm, 0.2 Da            0.988   0.617    0.771   0.844
30 ppm, 0.8 Da            0.962   0.788    0.715   0.902
150 ppm, 0.2 Da           0.962   0.719    0.773   0.887
150 ppm, 0.8 Da           0.950   0.813    0.739   0.929
1.5 Da, 0.3 Da            0.945   0.811    0.711   0.928
1.5 Da, 0.8 Da            0.935   0.874    0.678   0.949

(a) Defined as the probability that a peptide was actually in the sample given a score higher than threshold. (b) Defined as the probability that a peptide was not in the sample given a score lower than threshold.
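The two quantities defined in the footnotes to Table 2 can be illustrated with a short calculation. The following is a minimal sketch of Bayes' law applied to a threshold test; the function names and the counts are hypothetical and are not taken from this study, and the authors' own calculation may differ in detail.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' law.

    sensitivity = P(score >= threshold | peptide present)
    specificity = P(score <  threshold | peptide absent)
    prevalence  = P(peptide present)
    """
    # Total probability of scoring above threshold (true plus false positives).
    p_pos = sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)
    p_neg = 1.0 - p_pos
    ppv = sensitivity * prevalence / p_pos           # P(present | above threshold)
    npv = specificity * (1.0 - prevalence) / p_neg   # P(absent | below threshold)
    return ppv, npv


def predictive_values_from_counts(tp, fp, tn, fn):
    """Same quantities computed directly from confusion-matrix counts."""
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical counts, for illustration only (not data from the paper).
tp, fp, tn, fn = 80, 2, 90, 28
total = tp + fp + tn + fn
sens = tp / (tp + fn)
spec = tn / (tn + fp)
prev = (tp + fn) / total

ppv, npv = predictive_values(sens, spec, prev)
ppv_counts, npv_counts = predictive_values_from_counts(tp, fp, tn, fn)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Both routes give the same answer, which is why a threshold that admits few false positives can still leave a low negative predictive value when many true peptides score below it.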
Table 3. Fourteen Fraction IEF/LC-MS/MS Run of a Glioblastoma Multiforme Microdissection Sample

threshold                    peptide hits   distinct peptides   distinct proteins (a)
identity threshold           2351           1315                966
MATH (empirical threshold)   2909           1621                1100

(a) Distinct IPI Human protein IDs.
provide data to determine an empirical threshold score. For this dataset, which included 8383 tandem mass spectra, the threshold MASCOT score was 24.04 at a 5% false positive rate. The run was then parsed according to the MASCOT Identity Threshold and the empirical threshold score, and the resulting numbers of peptide and protein identifications are listed in Table 3. Of the identifications made with the empirical threshold, approximately 19% of the peptide hits, 19% of the distinct peptides, and 12% of the proteins were additional to those passing the MASCOT Identity Threshold, while false positive identifications were limited to below the approximated 5% rate.
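The idea of choosing a score cutoff from a reversed database search at a fixed false positive rate can be sketched as follows. This is a simplified illustration only: it scans candidate cutoffs directly, whereas the analysis described here used approx_fp and a linear regression whose details are not reproduced. The function name, its parameters, and the scores are hypothetical.

```python
def empirical_threshold(forward_scores, reversed_scores, max_fp_rate=0.05):
    """Return the lowest score cutoff whose estimated false positive rate
    is at or below max_fp_rate, or None if no cutoff qualifies.

    The false positive rate at a cutoff is estimated as
    (reversed-database hits >= cutoff) / (forward-database hits >= cutoff).
    """
    # Candidate cutoffs are the observed forward scores, lowest first,
    # so n_forward is always at least 1 inside the loop.
    for cutoff in sorted(set(forward_scores)):
        n_forward = sum(1 for s in forward_scores if s >= cutoff)
        n_reversed = sum(1 for s in reversed_scores if s >= cutoff)
        if n_reversed / n_forward <= max_fp_rate:
            return cutoff
    return None


# Hypothetical scores, for illustration only (not data from the paper).
forward = [12.0, 18.5, 21.0, 24.5, 26.0, 31.0, 35.5, 40.0, 44.0, 52.0,
           55.0, 58.5, 61.0, 63.0, 66.5, 70.0, 72.0, 75.5, 80.0, 84.0,
           88.0, 90.5]
reversed_hits = [10.0, 14.0, 19.0, 22.5, 25.0]

threshold = empirical_threshold(forward, reversed_hits)
print(threshold)
```

Because the cutoff is derived from the run itself, it reflects the actual search space (database, tolerances, modifications) rather than a fixed significance formula.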
Conclusion
In summary, we have taken advantage of the high mass accuracy of the QTOF mass analyzer to improve the sensitivity of a MASCOT MS/MS search, with controlled impact on specificity, by applying a mass accuracy-based threshold (MATH). Current search programs require the user to input mass tolerance values for the precursor and fragment ions. We have shown that large mass tolerance parameters applied to high mass accuracy datasets negatively impact search results by lowering both specificity and sensitivity when using MASCOT. We observed this by showing the effects of database size and search parameters on the actual numbers of true positive and false positive identifications made by MASCOT without regard to score (Figure 2). These same inaccurate parameters also negatively impact MASCOT’s predictive power when the search results are combined with MASCOT’s Identity Threshold as a discriminator. This effect is related both to the chemical and instrumental noise present in the fragment ion mass spectrum and to the number of possible precursor ion matches for a given mass tolerance. Higher levels of noise will decrease specificity, most obviously when a wider than necessary mass tolerance is applied, because MASCOT uses mass error only to include or exclude ions from consideration when scoring. Searches using low mass accuracy (wide) tolerances differ from searches using high mass accuracy (narrow) tolerances in that the former contain more possible precursor ion matches than the latter. There was a 10-fold difference in the number of possible precursor ion matches between our most and least stringent searches. Both specificity and sensitivity deteriorate as the number of possible precursor ion matches increases. This increase can come from a larger database, less specific or nonspecific enzyme searches, more allowed missed cleavages, variable modifications and, as demonstrated, decreasingly stringent mass tolerances. The source of the increase in the number of possible precursor ions is not relevant, only that the increase leads to a greater probability of a random match. We have demonstrated that data-dependent thresholding based on mass accuracy can significantly improve the negative predictive power of a MASCOT search. Moreover, this approach is applicable to any change in search parameters and will allow for an empirical evaluation of the false positive rate for any type of search. This approach can be applied to either MASCOT or SEQUEST because these programs do not generate true expect values (e-values) when the search is run, but instead use a secondary calculation to place scores within a statistical framework.
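The remark above about random matches can be made concrete with a simple model. As a sketch only, assume that each of N candidate precursor ion matches has the same small, independent chance p of matching by chance; N and p are illustrative symbols, and this is not MASCOT's scoring model. The probability of at least one random match is then

\[
P(\text{at least one random match}) = 1 - (1 - p)^{N} \approx N p \quad \text{for } N p \ll 1 .
\]

Under this assumption, a 10-fold increase in N, such as that between the most and least stringent searches here, raises the chance of a random match roughly 10-fold, whatever the source of the additional candidates.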
Acknowledgment. We thank Zhengping Zhuang at NINDS for supplying the microdissected glioblastoma samples. Support for this work by NCI (CA103086 and CA107988) and NCRR (RR21239) is gratefully acknowledged. Note Added after ASAP Publication. This manuscript was originally published on the Web (06/22/2005) with an additional sentence at the end of the Conclusions section. The version published on the Web 07/22/2005 and in print is correct.
References
(1) Beausoleil, S. A.; Jedrychowski, M.; Schwartz, D.; Elias, J. E.; Villen, J.; Li, J.; Cohn, M. A.; Cantley, L. C.; Gygi, S. P. Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (33), 12130-12135.
(2) Andersen, J. S.; Lam, Y. W.; Leung, A. K.; Ong, S. E.; Lyon, C. E.; Lamond, A. I.; Mann, M. Nucleolar proteome dynamics. Nature 2005, 433 (7021), 77-83.
(3) Olsen, J. V.; Ong, S. E.; Mann, M. Trypsin cleaves exclusively C-terminal to arginine and lysine residues. Mol. Cell. Proteomics 2004, 3 (6), 608-614.
(4) Cargile, B. J.; Bundy, J. L.; Stephenson, J. L., Jr. Potential for false positive identifications from large databases through tandem mass spectrometry. J. Proteome Res. 2004, 3 (5), 1082-1085.
(5) Qian, W. J.; Liu, T.; Monroe, M. E.; Strittmatter, E. F.; Jacobs, J. M.; Kangas, L. J.; Petritis, K.; Camp, D. G., II; Smith, R. D. Probability-Based Evaluation of Peptide and Protein Identifications from Tandem Mass Spectrometry and SEQUEST Analysis: The Human Proteome. J. Proteome Res. 2005, 4 (1), 53-62.
(6) Eng, J. K.; McCormack, A. L.; Yates, J. R., 3rd. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976-989.
(7) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551-3567.
(8) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466-1467.
(9) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3 (5), 958-964.
(10) Bafna, V.; Edwards, N. SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics 2001, 17 (Suppl. 1), S13-21.
(11) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 1999, 6 (3-4), 327-342.
(12) MacCoss, M. J.; Wu, C. C.; Yates, J. R., 3rd. Probability-based validation of protein identifications using a modified SEQUEST algorithm. Anal. Chem. 2002, 74 (21), 5593-5599.
(13) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74 (20), 5383-5392.
(14) Kislinger, T.; Rahman, K.; Radulovic, D.; Cox, B.; Rossant, J.; Emili, A. PRISM, a Generic Large Scale Proteomic Investigation Strategy for Mammals. Mol. Cell. Proteomics 2003, 2 (2), 96-106.
(15) Lopez-Ferrer, D.; Martinez-Bartolome, S.; Villar, M.; Campillos, M.; Martin-Maroto, F.; Vazquez, J. Statistical model for large-scale peptide identification in databases from tandem mass spectra using SEQUEST. Anal. Chem. 2004, 76 (23), 6853-6860.
(16) Moore, R. E.; Young, M. K.; Lee, T. D. Qscore: an algorithm for evaluating SEQUEST database search results. J. Am. Soc. Mass Spectrom. 2002, 13 (4), 378-386.
(17) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2003, 2 (1), 43-50.
(18) Clauser, K. R.; Baker, P.; Burlingame, A. L. Role of accurate mass measurement (±10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem. 1999, 71 (14), 2871-2882.
(19) Furuta, M.; Weil, R. J.; Vortmeyer, A. O.; Huang, S.; Lei, J.; Huang, T. N.; Lee, Y. S.; Bhowmick, D. A.; Lubensky, I. A.; Oldfield, E. H.; Zhuang, Z. Protein patterns and proteins that identify subtypes of glioblastoma multiforme. Oncogene 2004, 23 (40), 6806-6814.
(20) Chen, J.; Balgley, B. M.; DeVoe, D. L.; Lee, C. S. Capillary isoelectric focusing-based multidimensional concentration/separation platform for proteome analysis. Anal. Chem. 2003, 75 (13), 3145-3152.
(21) Yang, X.; Dondeti, V.; Dezube, R.; Maynard, D. M.; Geer, L. Y.; Epstein, J.; Chen, X.; Markey, S. P.; Kowalak, J. A. DBParser: web-based software for shotgun proteomic data analyses. J. Proteome Res. 2004, 3 (5), 1002-1008.
PR0500509