Artificial Decoy Spectral Libraries for False Discovery Rate Estimation in Spectral Library Searching in Proteomics
Henry Lam,*,† Eric W. Deutsch,‡ and Ruedi Aebersold‡,§,|
Department of Chemical and Biomolecular Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, Institute for Systems Biology, 1441 North 34th Street, Seattle, Washington 98103, Institute of Molecular Systems Biology, ETH Zurich, Switzerland, and Faculty of Science, University of Zurich, Switzerland
Received July 2, 2009
Abstract: The challenge of estimating false discovery rates (FDR) in peptide identification from MS/MS spectra has received increased attention in proteomics. The simple approach of target-decoy searching has become popular with traditional sequence (database) searching methods, but has yet to be practiced in spectral (library) searching, an emerging alternative to sequence searching. We extended this target-decoy searching approach to spectral searching by developing and validating a robust method to generate realistic, but unnatural, decoy spectra. Our method involves randomly shuffling the peptide identification of each reference spectrum in the library, and repositioning each fragment ion peak along the m/z axis to match the fragment ions expected from the shuffled sequence. We show that this method produces decoy spectra that are sufficiently realistic, such that incorrect identifications are equally likely to match real and decoy spectra, a key assumption necessary for decoy counting. This approach has been implemented in the open-source library building software, SpectraST.
Keywords: Spectral libraries • target-decoy searching • decoy counting • false discovery rates
* Corresponding Author: Henry Lam, Department of Chemical and Biomolecular Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. E-mail: [email protected]. Phone: +852-2358-7133. Fax: +852-2358-0054.
† The Hong Kong University of Science and Technology. ‡ Institute for Systems Biology. § Institute of Molecular Systems Biology. | University of Zurich.
10.1021/pr900947u. 2010 American Chemical Society
Introduction
An important goal of proteomics is the systematic identification and quantification of all proteins in a biological system. Liquid chromatography coupled online to mass spectrometry (LC/MS) has become the most effective analytical platform for this purpose, thanks to its high speed, exquisite sensitivity, and ability to handle complex mixtures and diverse samples. In a widely practiced process termed shotgun proteomics, peptides from digested protein mixtures are chromatographically separated, ionized and subsequently mass-resolved in the mass spectrometer. Selected peptide ions are then subjected to controlled fragmentation, yielding tandem mass (MS/MS) spectra that can be used to deduce the peptide sequence computationally.1-3
Unfortunately, the computational step of assigning peptide identifications to MS/MS spectra, commonly performed by sequence database searching,4-8 or more recently by spectral library searching,9-12 is a challenging and sometimes error-prone exercise, for several reasons. First, the spectrum in question may be noisy or riddled with impurity peaks, confusing the search engines. Second, the correct interpretation of the spectrum may not be among the candidates considered by the search engine. Such is the case for peptides with unanticipated post-translational modifications, peptides not contained in sequence databases or spectral libraries, or even nonpeptide species that happen to be selected for fragmentation.
Because of the large scale of these experiments, it is often not practical to determine manually whether each individual identification is correct. Instead, stringent filters based on scores returned by the search engine are applied, and the resulting set of positive identifications is reported with an associated error rate estimate. This measure of the frequency of false identification is referred to as the false discovery rate (FDR), which is defined as the proportion of reported positive identifications that are false.13-15
Two major approaches are commonly used to estimate the FDR of a set of identifications in proteomics.13 In what is called an empirical Bayes approach, exemplified by the software PeptideProphet,16 the observed score distribution (usually bimodal) is modeled as a mixture of two parametric distributions presumed to represent the correct and incorrect identifications, respectively. The FDR of identifications with scores above a given threshold can be computed as the fraction of the mixture density attributable to the incorrect distribution above that threshold.
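In symbols, the mixture-model estimate described above can be sketched as follows. The notation is illustrative only and is not taken from PeptideProphet: f_0 and f_1 are assumed to denote the fitted score densities of incorrect and correct identifications, \pi_0 the mixing proportion of incorrect identifications, and t the score threshold.

```latex
\mathrm{FDR}(t) \;=\;
\frac{\pi_0 \int_t^{\infty} f_0(s)\,ds}
     {\pi_0 \int_t^{\infty} f_0(s)\,ds \;+\; (1-\pi_0)\int_t^{\infty} f_1(s)\,ds}
```

The numerator is the expected fraction of all identifications that are incorrect and score above t; the denominator is the fraction of all identifications scoring above t.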
The biggest weakness of this approach is the need to model accurately the shapes of the correct and incorrect score distributions, which can vary substantially from data set to data set, and from search engine to search engine. Training data sets are used to determine the most suitable shapes to use, but extrapolating them to larger and unrelated data sets often proves problematic.
Journal of Proteome Research 2010, 9, 605–610 (Technical Notes). Published on Web 11/16/2009
An alternative approach, called target-decoy searching, involves introducing answers that are known a priori to be incorrect (decoys) to the search space.17-19 Assuming that incorrect identifications are uniformly distributed in the search space, one can calculate the FDR from the number of positive decoy identifications. In the context of sequence database searching, this is often achieved by concatenating a decoy protein database, typically consisting of reversed or reshuffled sequences from real proteins, to the target database before searching. This approach is simple to understand, easy to implement, and can be generally applied to all data sets and search strategies. More sophisticated machine learning approaches can also take advantage of the presence of decoy identifications to improve discrimination.20 Disadvantages of this approach include the doubling of the search space and search time, and the inability to capture errors associated with partial matches (e.g., incorrect assignment of post-translational modification sites, or matches to an almost identical but incorrect sequence).21,22
While target-decoy searching is well-established in sequence searching, it is not so in spectral searching. In spectral searching, the query spectrum is matched against a library of reference empirical spectra for which the identification is known; a sufficiently good match (high spectral similarity) implies positive identification.9-12 To execute the target-decoy search strategy, therefore, one needs a mechanism to create decoy spectra that meets the following design criteria. First, a decoy spectrum should not be highly similar to a real spectrum from a naturally occurring peptide from the proteome of interest, lest the presence of the decoy spectrum make it difficult to identify this real peptide confidently and cause false negatives. Second, a decoy spectrum must contain realistic spectral features that mimic those in real spectra, such that an incorrect identification has an equal chance of hitting a real or decoy spectrum.
As a counterexample, if the decoy spectrum has fewer peaks than typically seen in real spectra, then it will have a lower chance of matching a query spectrum of pure random noise than a real spectrum does. Third, the decoy generation method should preserve the distribution of real spectra in terms of precursor m/z, number of enzymatic termini, post-translational modifications, and so forth, such that any subdivision of the search space will have the same number of real and decoy spectra as search candidates. This ensures that the decoy fraction is uniform for all queries and simplifies the calculation of FDR from decoy counts.
Furthermore, these requirements cannot be met simply by using as decoys spectra from another organism unrelated to the one sampled in the experiment. This strategy is convenient to implement, and the decoys are guaranteed to be realistic because they are real spectra. However, we argue that this approach has some fatal shortcomings that make it unsuitable for general use. First, it is often not easy, or even possible, to find such a spectral library to use as decoys. Second, identical and homologous peptides are common even between phylogenetically distant species. To safeguard against false negatives, the decoy library has to be stripped of all homologous entries, and this is in itself a computational challenge. Finally, this approach cannot ensure that the same fraction of decoy candidates is considered for all search queries, simply because the precursor m/z distribution of peptides in each library is different, and unpredictably so. This effect is illustrated in Supplementary Figure 1. It complicates the calculation of FDR from decoy counts, because the appropriate multiplication factor is different for every single query. Correctly accounting for this variation, while theoretically possible, negates the advantage of the simplicity of decoy counting methods.
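The two quantities discussed above, the decoy-count FDR estimate and the per-query decoy fraction, can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the text: the library is taken to contain equal numbers of target and decoy entries, the FDR among target hits is estimated as decoys/targets (one common convention, not necessarily the one used in this work), and the function names `decoy_count_fdr` and `decoy_fraction` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def decoy_count_fdr(hits, threshold):
    """Estimate the FDR among target hits scoring at or above `threshold`.

    `hits` is a list of (score, is_decoy) pairs, one per query spectrum.
    Assumes a 1:1 target-decoy library, so each decoy hit stands in for
    one expected false target hit (an assumption of this sketch).
    """
    targets = sum(1 for score, is_decoy in hits
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def decoy_fraction(target_mzs, decoy_mzs, query_mz, tolerance):
    """Fraction of candidates within +/- tolerance of query_mz that are decoys.

    Both m/z lists must be sorted in ascending order.
    Returns None when no candidate falls inside the window.
    """
    def count(sorted_mzs):
        lo = bisect_left(sorted_mzs, query_mz - tolerance)
        hi = bisect_right(sorted_mzs, query_mz + tolerance)
        return hi - lo

    n_targets = count(target_mzs)
    n_decoys = count(decoy_mzs)
    total = n_targets + n_decoys
    return n_decoys / total if total else None


# Hypothetical search results: (score, is_decoy)
hits = [(0.90, False), (0.85, False), (0.80, True),
        (0.70, False), (0.60, True), (0.50, False)]
print(decoy_count_fdr(hits, 0.65))  # 1 decoy / 3 targets above threshold

# Hypothetical precursor m/z values (sorted).
targets = [500.2, 500.9, 612.3, 750.4]
matched_decoys = list(targets)                 # decoys that keep the precursor m/z
foreign_decoys = [480.1, 501.0, 502.5, 760.0]  # e.g., from another organism
print(decoy_fraction(targets, matched_decoys, 750.0, 1.0))  # uniform one half
print(decoy_fraction(targets, foreign_decoys, 750.0, 1.0))  # varies by query
```

When each decoy retains the precursor m/z of the real spectrum it was derived from, the decoy fraction is one half for every query that has candidates, so a single multiplication factor suffices. With a foreign library the fraction depends on the query.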
In view of the above, we have developed a strategy of generating artificial decoy spectra, and demonstrated that it can be used to estimate FDR of spectral search results. This method is implemented in SpectraST,9,23 a spectral library building and searching tool, and made freely available to the community.
Materials and Methods
Software Development. SpectraST9,23 is an open-source software tool designed for spectral library building and searching in proteomic applications. It is bundled with the Trans-Proteomic Pipeline (TPP) software suite,24 whose supporting functionalities make SpectraST readily usable in a complete data analysis workflow, and deployable on both Windows and Linux platforms. A Windows installer and the source code are available at http://tools.proteomecenter.org/software.php. In this paper, we made use of the existing library creation routines in SpectraST and added a new mode of operation to generate decoy spectra by the procedures described below. SpectraST accepts a library consisting of real spectra in SpectraST’s .splib format as input. (Note that all other common library formats, including the .msp format used by the National Institute of Standards and Technology, downloadable from http://peptide.nist.gov/, and X!Hunter’s10 .hlf format, can be converted to .splib by SpectraST.) The decoy spectra are then generated by manipulating the real spectra one by one, and subsequently added to the original spectral library to form a concatenated target-decoy library. All of the library manipulation steps, except those involved in the spectrum prediction method described below, can be accomplished without programming, simply by running SpectraST in library creation mode with suitable options turned on. For detailed user instructions, please refer to http://tools.proteomecenter/wiki/Software:SpectraST/.
Data Sets and Libraries. For this study, we employed a publicly available data set, consisting of 157 runs and over a million MS/MS spectra from Lab 34 in the Human Proteome Organization Plasma Proteome Project (HUPO PPP).25 This large data set of a human plasma sample was acquired on a Thermo Fisher Scientific LTQ (Waltham, MA), following typical sample preparation protocols. For details, please refer to ref 25.
Two spectral libraries are used to test our decoy generation method. The human spectral library contains about 220 000 consensus spectra from various human tissues, compiled by the National Institute of Standards and Technology (NIST) and available for download at http://peptide.nist.gov/ (version 2.0, dated July 11, 2008). The Escherichia coli spectral library contains about 48 000 consensus spectra from E. coli, also compiled by NIST using the same methodology (version 2.0, dated July 11, 2008). The latter library is used to estimate more accurately any bias toward or against the decoy spectra, by ensuring that all identifications from a human data set are necessarily false. For both libraries, we removed all library entries for which the peptide identification is shorter than 7 amino acids (827 and 905 removed from the human and E. coli libraries, respectively), because reshuffled short peptides tend to coincide with real peptides upon decoy generation.17 We also removed all entries in the E. coli library that have an identical counterpart in the human library. Lastly, for efficiency, each spectrum is simplified to at most its 50 most intense peaks, since it was previously demonstrated that this simplification does not significantly impact the performance of spectral searching.23
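The library clean-up steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not SpectraST's implementation: a library entry is represented as a dictionary with a plain `peptide` sequence and a list of `(mz, intensity)` peaks, "identical counterpart" is interpreted as an identical peptide sequence, and the function names `filter_library` and `remove_shared` are hypothetical.

```python
def filter_library(entries, min_length=7, max_peaks=50):
    """Drop short peptides and keep only the most intense peaks per spectrum.

    Each entry is a dict with keys "peptide" (plain amino acid sequence)
    and "peaks" (list of (mz, intensity) tuples). This data layout is an
    assumption of the sketch, not the .splib format.
    """
    kept = []
    for entry in entries:
        if len(entry["peptide"]) < min_length:
            continue  # reshuffled short peptides tend to coincide with real ones
        # Keep the max_peaks most intense peaks, then restore m/z order.
        top = sorted(entry["peaks"], key=lambda p: p[1], reverse=True)[:max_peaks]
        top.sort(key=lambda p: p[0])
        kept.append({**entry, "peaks": top})
    return kept


def remove_shared(entries, reference_entries):
    """Remove entries whose peptide also occurs in reference_entries.

    Identity is judged by peptide sequence only (an assumption); a real
    library may also need to compare charge state and modifications.
    """
    reference_peptides = {e["peptide"] for e in reference_entries}
    return [e for e in entries if e["peptide"] not in reference_peptides]


# Hypothetical toy libraries.
human = [
    {"peptide": "PEPTIDEK", "peaks": [(100.0 + i, float(i)) for i in range(60)]},
    {"peptide": "ACDEFG", "peaks": [(175.1, 10.0)]},  # 6 residues: removed
]
ecoli = [
    {"peptide": "PEPTIDEK", "peaks": [(200.0, 5.0)]},  # shared with human
    {"peptide": "AAAGGGLK", "peaks": [(300.0, 7.0)]},
]

human_clean = filter_library(human)
ecoli_clean = remove_shared(filter_library(ecoli), human_clean)
print(len(human_clean), len(human_clean[0]["peaks"]), len(ecoli_clean))
```

Sorting by intensity and then re-sorting by m/z keeps each simplified spectrum in the usual ascending m/z order.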
Decoy Spectrum Generation. We implemented two distinct methods of decoy spectrum generation. In the m/z-shift method, all fragment ion peaks of each real spectrum are shifted by a fixed amount of +20 Th to form a corresponding decoy spectrum. This is similar to a previously employed strategy to generate spectra that can never be correctly identified: shifting the precursor m/z by some fixed value.17 However, this simple trick will also shift the overall precursor m/z distribution of decoys relative to that of real spectra, and cause uneven decoy fractions among search candidates. Therefore, we created decoy spectra by shifting all fragment ion peaks of a spectrum by some fixed distance on the m/z axis instead. This retains most of the salient features of a real spectrum, but renders any match to it a necessarily incorrect hit.
In the shuffle-and-reposition method, the real spectrum’s identification is shuffled randomly to form a decoy sequence, followed by repositioning of the fragment ion peaks along the m/z axis according to the decoy sequence. More specifically, for each unique peptide sequence, which may be represented by multiple precursor ions of different charge states and/or modifications, a random shuffling of the amino acids is performed. To maintain the number of tryptic termini and missed internal cleavages, the amino acids lysine (K), arginine (R) and proline (P) are not shuffled and keep their original positions in the sequence. The shuffled sequence is checked to make sure it is sufficiently different from the original sequence (