Potential for False Positive Identifications from Large Databases through Tandem Mass Spectrometry
Benjamin J. Cargile, Jonathan L. Bundy, and James L. Stephenson, Jr.*
Mass Spectrometry Research Program, Research Triangle Institute, 3040 Cornwallis Road, Research Triangle Park, North Carolina 27709
Received February 26, 2004
The biomedical research community at large is increasingly employing shotgun proteomics for large-scale identification of proteins from enzymatic digests. Typically, the approach used to identify proteins and peptides from tandem mass spectral data is based on matching experimentally generated tandem mass spectra to the theoretical best match from a protein database. Here, we present the potential difficulties of using such an approach without statistical consideration of the false positive rate, especially when large databases, such as those encountered for eukaryotes, are considered. This is illustrated by searching a dataset generated from a multidimensional separation of a eukaryotic tryptic digest against an in silico generated random protein database; the search produced a significant number of positive matches, even when previously suggested score filtering criteria were used.
Keywords: database searching • peptide identification • proteomics • false positives
* To whom correspondence should be addressed. Phone: (919) 844-0462. Fax: (919) 541-7208. E-mail: [email protected].
Journal of Proteome Research 2004, 3 (5), 1082-1085. Published on Web 08/17/2004. 10.1021/pr049946o. © 2004 American Chemical Society.

Within the past decade, research into proteins and their role in cellular systems has increased at an exponential rate.1 This is due in no small part to revolutionary advances in the field of mass spectrometry. Developments in ionization,2,3 instrumentation,4-6 tandem mass spectrometry,7,8 software,4,9 automation,4,9 and high performance separation techniques10-12 have led to the emergence of high-throughput approaches designed to identify thousands of proteins from a variety of sample types. Typically, tandem mass spectra derived from proteolytic digests of complex protein samples are processed by software algorithms designed to identify the various peptides by returning the “best possible” match to the experimental data from a given protein database.13,14 At this scale, distinguishing between true identifications, false positives,15-18 and false negatives becomes extraordinarily difficult using any manual validation approach. As a result, many biological researchers are forced to determine the identity of proteins with little familiarity with mass spectrometry and limited understanding of database search routines. Here, we present the pitfalls associated with accepting protein identifications at face value, especially from large databases such as those encountered when mammalian proteomes are examined.

An ever increasing number of laboratories with little to no experience with mass spectrometry are acquiring instruments or submitting samples for analysis under the assumption that the results received are unquestionably correct. To compensate for this lack of understanding of how to interpret or filter the results from such software, criteria for the “validation” of protein identifications have been proposed and are believed to verify the quality of the assignment,19 although experimental proof that these criteria eliminate false positives has yet to be demonstrated. If these assumptions are valid, then our research group has identified thousands of proteins from Medusa, a gorgon from Greek mythology, and thus proven that this mythical beast is in fact a real creature disguised as a laboratory rat.

Table 1. Results of Database Search and Filtering of the ‘Medusa’ Proteome
Entries are identifications (peptides/proteins), grouped by search algorithm.

                      SEQUEST                                                        MASCOT(b)
filter criteria       mild stringency(a)  high stringency(a)  reverse Db cutoffs(a)  P = 0.05  P = 0.005  P = 0.0005
1 or more peptides    7575/7069           1408/1377           1/1                    496       52         4
2 or more peptides    1008/493            67/33               0/0                    2         0          0
3 or more peptides    109/42              3/1                 0/0                    0         0          0

(a) Filter criteria: mild-stringency Xcorrs were 1.5, 2.0, and 2.2 for the +1, +2, and +3 charge states, respectively. High-stringency Xcorrs were 2.0, 2.5, and 3.5 for the +1, +2, and +3 charge states, respectively. A ΔCN of 0.08 was used for all experiments. Reverse database cutoffs were determined as the highest reverse database hit for each charge state and were 2.9, 4.5, and 4.25 for the +1, +2, and +3 charge states, respectively.
(b) All MASCOT identification totals are for proteins only.

Table 1 shows the numbers of peptides/proteins identified and the “accepted” criteria for identification used during each search with the commonly used SEQUEST14 and MASCOT13 programs. Even using very stringent scoring cutoffs, approximately 1400 proteins were identified using SEQUEST (1408 peptides assigned to 1377 proteins) and approximately 500 were found with MASCOT. When considering multiple peptides per protein locus, over 30 identifications were observed with SEQUEST and 2 positive hits were obtained with MASCOT. It has been suggested that manual validation of spectra will eliminate these spurious hits,
but Figure 1 shows several high-quality hits that meet many if not all of the validation criteria. Although manual validation could not be used to eliminate all of the false positives, the use of reverse database searching to determine Xcorr cutoff thresholds allowed only one hit and thus seems more robust than manual validation criteria when large databases are considered. It should be noted that the number of false matches here likely represents a minimum value rather than an average one, because the protein sequences were generated randomly, whereas most real proteins have structural and evolutionary constraints on their amino acid sequences that make them much more likely to be partially homologous.

Figure 1. (a-c) These spectra are from the Medusa database and appear to meet most if not all of the manual validation criteria. The spectra show different charge states and illustrate that different peptides containing proline residues can match intense peaks in the spectra. Although manual validation may eliminate a number of the false positive hits, these spectra would not be removed by the authors based on the quality of the match. A search of the R. norvegicus database did not reveal these exact peptides, although potential homologues could exist. Even if the R. norvegicus database did contain these peptides, these protein identifications would still be false positives for this database.

The samples used in this evaluation of false-positive identifications were prepared from rat testis by solubilization in a urea buffer and subsequent digestion with the proteolytic enzyme trypsin. Samples were analyzed using a shotgun-based proteomics approach employing immobilized pH gradient gel electrophoresis of the peptide mixture in the first dimension followed by liquid chromatography tandem mass spectrometry in the second dimension.10,20 A total of 43 liquid chromatography tandem mass spectrometry runs from an LCQ Deca XP Plus, which resulted in ∼120 000 tandem MS spectra, were analyzed with TurboSEQUEST 3.0. DTASelect (provided by the Yates Lab at Scripps Research Institute) was used to sort and filter the data. The “Medusa” database searched was a protein set consisting of 40 000 randomly generated proteins made to approximate the human genome in protein size and amino acid frequencies; these sequences were then reversed and the reversed copies concatenated onto the forward database. A number of spectra were manually validated for single-hit proteins to see if these criteria would remedy the false positive problem. The same data set was used with version 1.8 of the MASCOT software.
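The construction of the “Medusa” database described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' actual procedure: the function names, the `RANDOM_`/`REV_` naming scheme, and the choice to estimate residue frequencies and protein lengths from a small reference set are all hypothetical. The text states only that the random proteins approximated human protein sizes and amino acid frequencies and that reversed sequences were concatenated onto the forward set.

```python
import random
from collections import Counter


def estimate_frequencies(reference):
    """Relative frequency of each residue in the reference sequences."""
    counts = Counter("".join(reference))
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def random_protein(length, freqs, rng):
    """Draw one random sequence of the given length from the frequency table."""
    residues = sorted(freqs)
    weights = [freqs[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


def build_random_database(reference, n_proteins, seed=0):
    """Random forward proteins plus their reversed copies (hypothetical naming)."""
    rng = random.Random(seed)
    freqs = estimate_frequencies(reference)
    lengths = [len(seq) for seq in reference]  # reuse the observed size distribution
    forward = {}
    for i in range(n_proteins):
        forward[f"RANDOM_{i:05d}"] = random_protein(rng.choice(lengths), freqs, rng)
    # Reverse every forward entry and concatenate it onto the forward set.
    database = dict(forward)
    for name, seq in forward.items():
        database[f"REV_{name}"] = seq[::-1]
    return database


# Toy reference set (made up); a real run would read a proteome file instead.
reference = ["MKTAYIAKQRQISFVK", "GSHMLEDPVAW", "MSTNPKPQRKTKRNTNRRPQDVK"]
database = build_random_database(reference, n_proteins=5)
```

Because no sequence in such a database corresponds to a real protein, every identification returned from it is by construction a false positive, which is what makes it useful for measuring how many spurious hits a given set of filter criteria lets through.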
All hits were filtered for redundancy, but ambiguous protein matches were left in the dataset, since each of these still identified a protein from the “Medusa” database. The rat database was deliberately left out of the “Medusa” database in order to test whether an inappropriate or incomplete database would give significant false positive matches. For the pI filtering example, the R. norvegicus database was downloaded from NCBI and the reversed version of the proteins was added to the end of the forward protein database. For all experiments, only fully tryptic peptides were searched because tryptic peptides are considered to give more reliable identifications based on the lower Xcorr cutoff17 needed for identification. Also, a recent publication suggests that partially tryptic peptides are not observed, at least under some experimental conditions.21

The results presented here are not intended to suggest that all mass spectrometry identifications are invalid, or that any one particular identification algorithm works better or worse than another. They are, however, intended to show the potential problems that can arise when users lack the understanding needed to verify protein identifications correctly. If no problem exists, as many users of mass spectrometry (and most instrument companies and software providers) believe, then a series of publications could be forthcoming identifying all of the creatures ever described in Greek mythology. Putting this perplexing situation aside, the more far-reaching problem will be the potentially unproductive research efforts in biomedical and biological research that such misidentifications could lead investigators to pursue.

The best option that can be offered at the moment is for those who use mass spectrometry to identify proteins in shotgun experiments to consider the false positive rates for the data sets that are generated. However, in order to understand what is meant by the false positive rate, a basic understanding of how database identification programs work is required. Most if not all of these programs work by taking the limited amount of data present in a tandem mass spectrum and comparing it to a given set of proteins. This set is usually all the proteins encoded by a given organism’s genome. Thus, the top hit or identified peptide is simply the peptide from that protein database that best matches the tandem mass spectrum, and it is presumed to be the correct or real peptide from which the spectrum was derived. However, given enough random peptide sequences to search through, some of these peptides will match the tandem mass spectrum with a high level of significance.
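The effect of database size described above can be illustrated with a deliberately simplified model. The sketch below assumes that each candidate peptide independently has the same small probability p of exceeding the score threshold by chance, so that the probability of at least one chance match among N candidates is 1 - (1 - p)^N. The independence assumption, the function names, and the numerical values of p and N are illustrative assumptions only and are not taken from the paper; only the figure of roughly 120 000 spectra comes from the text.

```python
import math


def chance_match_probability(p_single, n_candidates):
    """P(at least one of n independent candidates matches by chance)."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if p_single == 1.0:
        return 1.0
    # 1 - (1 - p)^n, evaluated with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_candidates * math.log1p(-p_single))


def expected_false_matches(p_single, n_candidates, n_spectra):
    """Expected number of spectra that receive at least one chance match."""
    return n_spectra * chance_match_probability(p_single, n_candidates)


# Hypothetical per-candidate chance-match probability and candidate counts.
P_SINGLE = 1e-7
for n_candidates in (1_000, 100_000, 10_000_000):
    expected = expected_false_matches(P_SINGLE, n_candidates, n_spectra=120_000)
    print(f"{n_candidates:>10} candidates per spectrum: "
          f"about {expected:.1f} spectra with a chance match")
```

Under these assumptions the expected number of spurious hits grows almost linearly with the number of candidate peptides while p·N is small, which is why a score threshold that is adequate for a small proteome can fail for a large eukaryotic database.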
Such random matches are due in part to the probability that a given set of fragment ions from the spectrum will match a given set of theoretical peptide fragments. The potential for these fragments to match is increased if there is homology between the peptide that was fragmented in the spectrum and the theoretical peptide from the database. Thus, large databases such as those encountered in eukaryotic systems contain enough peptides that a random match with a high degree of significance is likely to occur.

Figure 2. (a) Peptide identification cross-correlation score (Xcorr) plotted against the calculated isoelectric point (pI) of each peptide. The data were obtained by searching the R. norvegicus database in the forward direction. Shown here is the clustering of the peptide identifications in the 3.5-4.5 pI range, corresponding to ‘true’ identifications. Emergence of false positive identifications is observed for Xcorr values below approximately 3.5 with calculated pI values greater than 4.5. (b) In addition to data obtained for the forward search, a reverse database search is performed as described by Peng et al. From this reverse database search, a cutoff can be established to limit false positives, at the expense of leaving false negatives behind. (c) Since pI can be predicted to well within ±0.25 pI units, false positive identifications can easily be determined, particularly at lower Xcorr values. Any potential peptide identifications that have pI values greater than ∼4.7 can be considered false positives. (d) By using pI as an orthogonal filtering criterion, the threshold for false positive identification can be lowered substantially. Here, we are not constrained to tandem mass spectrometry based identifications alone but can employ an independent physicochemical property of the peptide to mine the dataset for false negatives. By lowering the cutoff for false positives, additional peptide identifications (i.e., mined false negatives) are obtained that would otherwise be discarded.

The problem of false positives is not limited to any single database search program or instrument type. As noted above, this is a problem related to the fundamental aspects of both information and finite set theories. This information is provided so that it is not assumed that the problem lies with the SEQUEST program alone. In fact, it could be argued that, because research into false positive rates has focused on the SEQUEST program, the available corrections for this problem are limited to this algorithm exclusively, and similar research will need to be repeated for all other database search programs to remove their false positives. New research tools that help to minimize such spurious matches are in development, such as the reversed database search methodology described by Peng et al.17 This method has been expanded to use alternative information, such as isoelectric point, to remove more false positives while providing greater sensitivity during filtering.10,20 An alternative is to employ a purely probabilistic model13,16,22 to determine the significance of the peptide identification. We would provide a word of caution, though, against using pure probability calculations to ‘validate’
or verify data. Most such calculations do not take into account enough complex biological factors to provide a real measure of the validity of the data. Such factors include divergent evolution, horizontal gene transfer, organism-specific amino acid frequencies, and trends in sequential amino acid pairs. There are also several numerical problems with the probabilistic approach: the masses of amino acid combinations tend to cluster because they are made from the same four elements, and several combinations of amino acids are indistinguishable by mass alone (e.g., glycine-glycine and asparagine at nominal mass 114, and valine-serine and tryptophan at nominal mass 186). We presume these factors to be the reason that, even when MASCOT, a probability-based algorithm, is run at a probability of greater than 99%, a number of proteins from this fictitious database are matched. It is possible that other probability-based methods have accounted for all these potential issues, but the needed complexity makes it doubtful. Since the magnitude of all the aforementioned problems varies from one organism’s proteome to another, the probability calculated for any one database may or may not be applicable in general and will need to be examined on a case-by-case basis.

The other important parameter that must be considered in any protein identification approach is the presence of false negatives. Here, we define a false negative as a valid identification that falls below the threshold set to eliminate an appropriate number of false positives from a given data set. Ideally, we want to mine the dataset for those valid identifications that fall below the cutoff that limits the number of false positives, without increasing the percentage of false positive identifications.10,20,23 Two general approaches can be applied to address this issue. The first uses an orthogonal property of the peptides of interest that can be used independently to verify the identification of a given set of peptides. One such property is the isoelectric point (pI) of the peptide, which complements traditional tandem mass spectrometry based approaches.10,20 By separating peptides using immobilized pH gradient isoelectric focusing, pI can be predicted to within 0.2 pI units.10,24 Therefore, in our example using a protein sample derived from the testis of R. norvegicus that was previously separated on a 3.5-4.5 pH range immobilized pH gradient strip,25 we can assign pI values to the tentative peptide identifications generated via tandem mass spectrometry and the SEQUEST search algorithm (pI generated for each top Xcorr as shown in Figure 2a). It is readily apparent from examining the data that the orthogonal pI approach has merit, given the clustering of potential peptide identifications in the 3.5-4.5 pI range. The next step in the process is to assign a cutoff value for eliminating false positives from the data set.
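One way to picture how a cutoff might be assigned, and how a pI window can change it, is sketched below. It uses the rule given in the Table 1 footnote (the cutoff for each charge state is the highest-scoring reverse database hit) and a 3.5-4.5 pI window widened by a 0.25 pI unit tolerance, as in the Figure 2 caption. The `PeptideHit` record, the function names, the toy sequences and scores, and the decision to recompute the cutoff after discarding hits outside the pI window are assumptions made for illustration; they are not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    sequence: str
    charge: int
    xcorr: float
    predicted_pi: float
    is_reverse: bool  # True if the match came from the reversed half of the database


def in_pi_window(hit, low=3.5, high=4.5, tolerance=0.25):
    """True if the predicted pI is compatible with the focused pH range."""
    return low - tolerance <= hit.predicted_pi <= high + tolerance


def reverse_cutoffs(hits):
    """Highest Xcorr among reverse-database hits, per charge state."""
    cutoffs = {}
    for hit in hits:
        if hit.is_reverse:
            cutoffs[hit.charge] = max(cutoffs.get(hit.charge, 0.0), hit.xcorr)
    return cutoffs


def accepted(hits, cutoffs):
    """Forward hits scoring strictly above the cutoff for their charge state."""
    return [h for h in hits
            if not h.is_reverse and h.xcorr > cutoffs.get(h.charge, 0.0)]


# Toy data (made up): sequence, charge, Xcorr, predicted pI, reverse flag.
hits = [
    PeptideHit("PEPTIDEA", 2, 4.8, 4.1, False),
    PeptideHit("PEPTIDEB", 2, 3.0, 4.0, False),
    PeptideHit("PEPTIDEC", 2, 3.2, 6.9, False),
    PeptideHit("REVHITA", 2, 4.5, 7.2, True),
    PeptideHit("REVHITB", 2, 2.4, 4.2, True),
]

# Cutoff from all reverse hits, ignoring pI.
plain = accepted(hits, reverse_cutoffs(hits))

# Discard hits outside the pI window first, then recompute the cutoff.
windowed = [h for h in hits if in_pi_window(h)]
refined = accepted(windowed, reverse_cutoffs(windowed))
```

In this toy data the highest-scoring reverse hit lies outside the pI window, so removing it lowers the Xcorr cutoff and recovers a lower-scoring forward hit whose pI is consistent with the separation, while the forward hit with an implausible pI stays excluded.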
If we apply the reverse database search approach described by Peng et al.,17 we can determine a cutoff value, as demonstrated in Figure 2b, that eliminates the majority of false positive identifications. However, since we can determine the peptide pI with high accuracy, we can then establish limits beyond which false positives can be eliminated with certainty, as well as examine the data for false negatives that can be mined from the data, as shown in Figure 2c. By eliminating the known false positives that lie outside the appropriate pI range, the cutoff limit for Xcorr can be lowered substantially, as shown in Figure 2d, to include those peptide identifications normally not considered real hits when identification criteria are restricted to tandem mass spectrometry data alone.10,20

In closing, it should be noted that, because of pressure to identify more or the most proteins, there is currently a fine line for the user of mass spectrometry based techniques to walk in order to both minimize the number of false positives and eliminate false negatives. The responsibility for defining these criteria for identification will ultimately reside not only with mass spectrometrists, but with practitioners of bioinformatics and biologists as well.26 Yet such a balance should be established because of the potential benefits mass spectrometry can provide to understanding the mechanisms behind life, which we (and many others) feel will need to be done at the level of the proteome rather than that of the genome. We hope that this report has provided the ‘mirrored shield’ needed to see the Medusa of mass spectrometry and will spur advancement in the tools needed to behead this loathsome beast.
Acknowledgment. The authors wish to acknowledge the Internal Research and Development Program from the Research Triangle Institute for funding of this research. In addition, we wish to recognize James E.H. Powell from Amersham Biosciences for providing the monodisperse polymeric small bead RPC medium column packing (Source 5RPC) for the reverse phase separation work.

References
(1) Aebersold, R.; Mann, M. Nature 2003, 422, 198-207.
(2) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse, C. M. Science 1989, 246, 64-71.
(3) Karas, M.; Hillenkamp, F. Anal. Chem. 1988, 60, 2299-2301.
(4) Quadroni, M.; James, P. Electrophoresis 1999, 20, 664-677.
(5) Jonsson, A. P. Cell. Mol. Life Sci. 2001, 58, 868-884.
(6) Ferguson, P. L.; Smith, R. D. Annu. Rev. Biophys. Biomol. Struct. 2003, 32, 399-424.
(7) Yates, J. R. J. Mass Spectrom. 1998, 33, 1-19.
(8) Hunt, D. F.; Yates, J. R., 3rd; Shabanowitz, J.; Winston, S.; Hauer, C. R. Proc. Natl. Acad. Sci. U.S.A. 1986, 83, 6233-6237.
(9) Yarmush, M. L.; Jayaraman, A. Ann. Rev. Biomed. Eng. 2002, 4, 349-373.
(10) Cargile, B. J.; Talley, D. L.; Stephenson, J. L., Jr. Electrophoresis 2004, 25, 936-945.
(11) Shen, Y.; Smith, R. D. Electrophoresis 2002, 23, 3106-3124.
(12) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd. Nat. Biotechnol. 2001, 19, 242-247.
(13) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(14) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(15) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.
(16) Sadygov, R. G.; Yates, J. R. Anal. Chem. 2003, 75, 3792-3798.
(17) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50.
(18) Moore, R. E.; Young, M. K.; Lee, T. D. J. Am. Soc. Mass Spectrom. 2002, 13, 378-386.
(19) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., 3rd. Nat. Biotechnol. 1999, 17, 676-682.
(20) Cargile, B. J.; Bundy, J. L.; Freeman, T. W.; Stephenson, J. L., Jr. J. Proteome Res. 2004, 3, 112-119.
(21) Olsen, J. V.; Ong, S. E.; Mann, M. Mol. Cell Proteomics 2004, Epub ahead of print.
(22) MacCoss, M. J.; Wu, C. C.; Yates, J. R., 3rd. Anal. Chem. 2002, 74, 5593-5599.
(23) Cargile, B. J.; Stephenson, J. L., Jr. Anal. Chem. 2004, 76, 267-275.
(24) Bjellqvist, B.; Hughes, G. J.; Pasquali, C.; Paquet, N.; Ravier, F.; Sanchez, J. C.; Frutiger, S.; Hochstrasser, D. Electrophoresis 1993, 14, 1023-1031.
(25) Essader, A.; Cargile, B. J.; Bundy, J. L.; Stephenson, J. L., Jr. Proteomics 2004, in press.
(26) Baldwin, M. A. Mol. Cell Proteomics 2004, 3, 1-9.