Application of de Novo Sequencing to Large-Scale Complex

This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

Article pubs.acs.org/jpr

Application of de Novo Sequencing to Large-Scale Complex Proteomics Data Sets
Arun Devabhaktuni and Joshua E. Elias*
Department of Chemical & Systems Biology, Stanford University, Stanford, California 94035, United States
Supporting Information

ABSTRACT: Dependent on concise, predefined protein sequence databases, traditional search algorithms perform poorly when analyzing mass spectra derived from wholly uncharacterized protein products. Conversely, de novo peptide sequencing algorithms can interpret mass spectra without relying on reference databases. However, such algorithms have been difficult to apply to complex protein mixtures, in part due to a lack of methods for automatically validating de novo sequencing results. Here, we present novel metrics for benchmarking de novo sequencing algorithm performance on large-scale proteomics data sets and present a method for accurately calibrating false discovery rates on de novo results. We also present a novel algorithm (LADS) that leverages experimentally disambiguated fragmentation spectra to boost sequencing accuracy and sensitivity. LADS improves sequencing accuracy on longer peptides relative to that of other algorithms and improves discriminability of correct and incorrect sequences. Using these advancements, we demonstrate accurate de novo identification of peptide sequences not identifiable using database search-based approaches.

KEYWORDS: mass spectrometry, de novo peptide sequencing, MS/MS, proteomics, large-scale computational analysis



Special Issue: Large-Scale Computational Mass Spectrometry and Multi-Omics
Received: September 14, 2015
Published: January 8, 2016
© 2016 American Chemical Society

INTRODUCTION

The combination of liquid chromatography with tandem mass spectrometry (LC−MS/MS) is unparalleled in its ability to comprehensively and sensitively characterize discrete proteomes.1,2 In addition to rapidly improving instrumentation, proteome analyses owe their success to sophisticated techniques for peptide identification and to robust ways to evaluate their validity. The former identifies which peptide sequence gave rise to an observed tandem mass (MS/MS) spectrum by scoring it against limited sets of theoretical spectra deduced from protein sequence databases.3−5 The latter applies statistical techniques to discriminate high-confidence peptide identifications from the 50−80% of incorrect identifications that database search tools usually generate.6 Despite their clear dominance in proteomics, both kinds of database-dependent algorithms are fundamentally limited: peptides can be confidently identified only if their sequences are known a priori. Thus, database-dependent algorithms are less adept when employed to investigate organisms and systems with poorly defined proteomes.7 Furthermore, post-translational peptide processing events8,9 can produce peptides with no genetic basis and thus wholly unidentifiable by database-dependent approaches. Several strategies have been described to adapt traditional database search strategies toward atypical samples: unanticipated sequence variants or post-translational modifications (PTMs) can be partially addressed by searching databases constructed from matched RNA sequences10,11 or through iterative database searching practices.5,12,13 However, iterative search practices degrade in performance as sequence search spaces increase in size. Furthermore, practical and technical obstacles currently prevent custom sequence databases from being generated for all biological specimens.14,15

De novo sequencing algorithms predate database search algorithms and are uniquely poised to address the limitations of database searches.16−19 These algorithms deduce peptide sequences directly from observed MS/MS spectra, based solely on spacing between fragment ions that is consistent with discrete amino acids. Despite several theoretical advantages, de novo sequencing is used far less than database-driven approaches20 and has thus far been restricted to fairly simple peptide mixtures. De novo methodologies often cannot determine peptide sequences as accurately or sensitively as database search,21 but perhaps the greatest barrier to adoption is the lack of widely accepted methods for estimating false discovery rates (FDRs) from spectra interpreted de novo. Just as the target−decoy search strategy helped to standardize and control false positive identifications from standard database searches,6 a similar void must be filled in order for de novo sequencing to realize its potential. Ultimately, the means to unambiguously determine a peptide’s sequence, even in the absence of a reference sequence database, would be widely applicable across many biological domains, particularly those that intersect with unknown or

DOI: 10.1021/acs.jproteome.5b00861 J. Proteome Res. 2016, 15, 732−742


Journal of Proteome Research

NaCl, 25 mM Tris-HCl, pH 7.6) plus EDTA-free protease inhibitor cocktail (Roche). Two milligrams of the resulting lysate was prepared as described for the nine-protein mixture. Half of the preparation was purified by SPE after proteolytic digestion without labeling to allow comparison between labeled and unlabeled samples and evaluate existing de novo sequencing algorithms on the unlabeled sample.

Heat Shock Response in Yeast. An existing data set following protein expression in yeast after heat shock response was downloaded from ProteomeXchange26 using accession ID PXD000409.27 Briefly, yeast strain W303 MATα was grown at 24 °C to mid log phase in YPD and subsequently shifted to 37 °C via water bath incubation. Samples were collected at t = 0, 30, and 75 min, lysed, and mixed with a heavy [13C6/15N2]L-lysine-labeled spike-in standard.28 The resulting mixtures were digested with Lys-C using the filter-aided sample preparation (FASP) protocol, purified by SPE using C18 StageTips, and analyzed on a Thermo Q Exactive mass spectrometer.

Chemical Labeling. Peptides resulting from proteolytic digestion were sequentially guanidinylated and dimethylated (i.e., 2MEGA-labeled) as described previously (Figure S2).29 Samples were incubated with 2 M O-methylisourea (Sigma) at 37 °C for 2 h to guanidinylate lysines, leaving N-terminal amines unmodified. Equal amounts were then incubated with either 0.16% formaldehyde (Sigma) and 24 mM sodium cyanoborohydride (Sigma) or their deuterated counterparts (Isotec, Sigma) for 20 min at room temperature to label N-termini with light or heavy dimethyl groups. Reactions were quenched by acidifying with TFA to pH 2 and incubated for 1 h at room temperature. Samples labeled with heavy and light dimethylation were equally mixed prior to LC−MS analysis. Synthetic peptides (JPT Peptide Technologies) were guanidinylated as described above and N-terminally labeled with a light dimethyl group for comparison with de novo peptide interpretation.

poorly sequenced organisms. Raising antibody reagents to a newly discovered epitope,22 monitoring T cell responses to an experimental antigen,23 or linking a peptide toxin’s structure to its function24 all demand exact peptide sequence elucidation. Here, we define new metrics for evaluating the accuracy of de novo sequencing results based on complete sequence elucidation. We then apply these metrics to benchmark three leading algorithms. We scored each algorithm’s ability to produce full-length correct de novo interpretations on complex, large-scale proteomics data sets and to discriminate partially correct and incorrect sequences. We also describe and validate a straightforward approach, based on comparison with high-confidence database search results, for establishing scoring thresholds that return de novo sequences with a predefined FDR. Applying these techniques, we found that many existing de novo algorithms were poorly suited for distinguishing completely correct spectrum interpretations. To achieve de novo sequencing with greater accuracy and sensitivity, we developed label-assisted de novo sequencing (LADS), a de novo sequencing algorithm that takes advantage of MS/MS spectra experimentally disambiguated through the use of chemical or isotopic labels. By applying LADS to simple and complex peptide mixtures, we demonstrate that correct, fully elucidated peptide sequences can be reliably produced and discriminated by de novo peptide sequencing alone and to a greater extent than previous de novo strategies. Finally, we demonstrate the strengths of this approach in the context of two complex protein mixtures by directly validating FDR predictions on confidently identified de novo sequences and identifying peptide species not identifiable using database search-based methodologies.
Though we demonstrate these advances in the context of LADS, the benchmarking and statistical validation tools described here are compatible with any de novo sequencing algorithm and extend the applicability of proteomics to biological samples previously uncharacterizable due to the limitations of database search strategies.



Mass Spectrometry

Dried peptide mixtures for the nine-protein mixture and bovine brain samples were resuspended in 5% acetonitrile, 5% formic acid at approximately 1 μg/μL, and 1 μL was analyzed by microcapillary liquid chromatography electrospray ionization tandem mass spectrometry (LC−MS/MS). Samples were analyzed using an LTQ Orbitrap Velos mass spectrometer (Thermo Fisher Scientific, San Jose, CA) equipped with an in-house built nanospray source, an Agilent 1200 Series binary HPLC pump, and a MicroAS autosampler (Thermo Fisher Scientific). Peptides were separated on a 125 μm i.d. × 18 cm fused silica microcapillary column with an in-house pulled emitter tip with an i.d. of approximately 5 μm. The column was packed with ProntoSIL C18 AQ reversed-phase resin (3 μm particles, 200 Å pore size; MAC-MOD, Chadds Ford, PA). The nine-protein mixture and synthetic peptide samples were separated on a compressed two-step gradient of 7−25% buffer B (0.1% formic acid, 97.5% acetonitrile) over 30 min and 25−45% B over 7 min. Bovine brain samples were separated by applying a two-step gradient: 7−25% buffer B over 2 h and 25−45% B over 30 min. Buffer A was 0.1% formic acid, 2.5% acetonitrile. The mass spectrometer was operated in a data-dependent mode in which a full MS scan was acquired in an Orbitrap30 (AGC: 5 × 10^5; resolution: 6 × 10^4; m/z range: 360−1600; maximum ion injection time: 500 ms), followed by up to 10 HCD31 MS/MS spectra, collected from the most abundant ions from the full MS scan. MS/MS spectra were collected in the Orbitrap (AGC: 2 × 10^5; resolution: 7.5 × 10^3;

EXPERIMENTAL PROCEDURES

Sample Preparation

Nine-Protein Mixture (9P). A simple nine-protein mixture was prepared from purified commercial sources and mixed at approximately equal molar ratios: bovine carbonic anhydrase (Sigma, C2273), human alpha-acid glycoprotein (Sigma, G9885), yeast alcohol dehydrogenase (Sigma, A8656), rabbit glyceraldehyde 3-phosphate (Sigma, G5262), Escherichia coli beta-galactosidase (Sigma, G8511), bovine alpha-lactalbumin (Sigma, L6385), human catalase (Sigma, C3556), horse myoglobin (Protea BioSciences, PS-124-1), and bovine serum albumin (Sigma, A7906). All proteins were solubilized in 100 mM ammonium bicarbonate (AMBIC) at 2.8 μM, mixed, reduced (5 mM DTT, 56 °C, 30 min), alkylated with iodoacetamide (14 mM, 20 °C, 1 h), and TCA precipitated. The protein pellet was resuspended in 100 mM AMBIC and digested overnight at 37 °C with LysC at an enzyme/substrate ratio of 1:100. The digest was labeled as described below, and peptides were purified by solid-phase extraction (SPE) on a Sep-pak (Waters) column as previously described.25

Bovine Brain. A complex protein mixture was prepared from bovine brain tissue (BB, Schaub’s market, Palo Alto, CA): Roughly 10 g of snap-frozen tissue was thawed on ice, dounce homogenized, and lysed by tip sonication in RIPA lysis buffer (0.1% SDS, 1% NP-40, 1% sodium deoxycholate, 150 mM


minimum m/z: 100; maximum ion injection time: 1000 ms; isolation width: 2 Da; normalized collision energy: 40; default charge state: 2; activation time: 30 ms; dynamic exclusion time: 60 s; exclude singly charged ions and ions for which no charge state could be determined). The mass calibration of the Orbitrap analyzer was maintained to deliver mass accuracies of ±5 ppm without an external calibrant. The nine-protein mixture was analyzed once, whereas three technical replicates of both the labeled and unlabeled bovine brain samples were analyzed to assess variability and strengthen statistics.

either an exact string match allowing for I/L substitutions or a PRM-based metric where noted. FDRs for de novo search algorithms were estimated based on the rates of returning peptide identifications with 50, are colored green. Putative de novo sequence candidates are generated and then (v) rescored and reranked using an SVM model.

samples. For the 2MEGA SVMs, 31 281 high-confidence database search-derived PSMs were compiled from replicate analyses of a 2MEGA-labeled bovine brain sample on an LTQ Orbitrap Velos mass spectrometer. For the SILAC SVMs, 20 000 spectra were picked at random from the 374 493 high-confidence database search-derived PSMs from the yeast data set (Table S3). To avoid measurement skew effects due to possible overfitting, spectra used to train any SVM were not used in subsequent comparisons of the performance of LADS and PEAKS.
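As an illustration of how training examples for the rescoring SVM could be assembled from the procedure described in this section (top 10 de novo candidates per spectrum, labeled correct only when identical to the high-confidence database search sequence), the following Python sketch may help. It is not the LADS implementation: the function name, the dictionary layout, and the opaque `features` lists are assumptions made for the example, and the real feature vector is the one listed in Table S2.

```python
def label_candidates(candidates_by_spectrum, reference_by_spectrum, top_n=10):
    """Build (label, features) rows for SVM training.

    candidates_by_spectrum: {spectrum_id: [(sequence, features), ...]},
        already ranked best-first (hypothetical layout).
    reference_by_spectrum: {spectrum_id: database search sequence}.
    """
    rows = []
    for spectrum_id, candidates in candidates_by_spectrum.items():
        reference = reference_by_spectrum.get(spectrum_id)
        if reference is None:
            # No high-confidence database PSM: spectrum cannot be labeled.
            continue
        for sequence, features in candidates[:top_n]:
            label = 1 if sequence == reference else 0  # 1 = correct, 0 = incorrect
            rows.append((label, features))
    return rows


candidates = {
    "scan1": [("PEPTIDEK", [0.9, 12]), ("PEPTDIEK", [0.7, 10])],
    "scan2": [("AAAK", [0.4, 3])],  # no reference available, skipped
}
references = {"scan1": "PEPTIDEK"}
training_rows = label_candidates(candidates, references)
```

Spectra used to build such rows would then be withheld from any later performance comparison, as stated above.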

putative pairs were recorded as correct if the corresponding spectra were paired based on the high-confidence PSMs and incorrect otherwise. The training data set for the rescoring SVM was prepared by first running LADS, without rescoring, on the set of highconfidence PSMs derived from database search. For each spectrum, the top 10 de novo PSM candidates, as determined by the spectrum graph, were recorded. Each de novo PSM was compared to the corresponding database search PSM and assigned a class label of correct if the peptide sequences were identical and incorrect otherwise. For all de novo PSMs, the class label and corresponding SVM feature vector (Table S2) were recorded to create the training set. All SVM functionality in LADS was provided by the libSVM library.37 SVMs were trained with a radial basis function (rbf) kernel, optimizing both the cost of misclassification and standard deviation of the rbf using a grid search to maximize accuracy over cross-validated subsets of the training spectra. Pairing, clustering, and rescoring SVMs were trained on separate sets of spectra for the 2MEGA and SILAC-labeled

Software Information

System requirements for running LADS are Windows XP or higher, Mac OSX or higher, or any Linux system capable of running Python, ver. 2.6 or higher; minimum 500 MB of RAM and recommended processor speed 1.1 GHz or higher are also required. Software, sample data, and instructions for use can be found at http://sites.stanford.edu/LADS. LADS is licensed under the GPLv2 license.




RESULTS

improve the accuracy and sensitivity of de novo algorithms. Previous applications of such methodologies, using complementary fragmentation techniques and chemical or isotopic labels, have shown improved sequencing accuracy.41−48 However, the degree to which this disambiguation increases discriminability of correct results and whether this increase in confidence can compensate for the longer analysis time required for each unique identification are yet unknown.

New Metrics To Estimate and Control de Novo Sequencing Accuracy

Accurate, platform-independent methods for estimating the frequency of false positive identifications are essential for any MS/MS spectrum interpretation procedure,38,39 but they are particularly critical in the absence of a reference proteome. While the target−decoy search strategy6 and parametric modeling40 have satisfied this need for database-dependent searching, an analogous, universally applicable approach has not been described for peptides only found by de novo sequencing. Furthermore, de novo sequencing algorithms generally have not been evaluated against the rigorous standard of nonambiguous, completely correct peptide sequences. Thus, the need for benchmarking de novo algorithms on a proteome scale remains. To address the issue of de novo sequence validation, we designated two sets of confidently interpreted MS/MS spectra to serve as reference data sets: a bovine brain homogenate and a published time course of yeast responding to heat shock.27 Search results from two widely used database search algorithms, SEQUEST4 and Mascot,3 were combined to yield high-confidence sets of 20 504 and 368 789 PSMs, respectively, with estimated FDRs < 1% (Tables 1 and S3). These data sets were subsequently interpreted with three de novo sequencing algorithms, PEAKS,17 PepNovo+,18 and pNovo.16 Each de novo peptide interpretation that agreed with the corresponding high-confidence database interpretation was classified as correct and served as the basis to apply accuracy-based and discrimination-based criteria (Figure 1a,b). These measurements evaluate sequence interpretations at the level of each prefix residue mass (PRM, the mass of sequence spanning from the N-terminus to a given peptide bond)19 comprising the peptide as well as full sequence matches between the de novo peptide and its high-confidence database cognate. The algorithms examined here employ distinct approaches to de novo sequencing, resulting in measurable differences in performance (Figure 1c).
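To make the two levels of comparison concrete, the Python sketch below computes prefix residue masses and compares a de novo sequence with its database cognate both as a full-sequence match (treating the isobaric residues I and L as equivalent) and as the fraction of reference PRMs recovered. It is a minimal sketch, not the scoring used in the paper: the function names and the 0.01 Da tolerance are assumptions, and the mass table covers only unmodified residues, so carbamidomethylated cysteine and labeled lysine or N-termini would need their own entries.

```python
# Monoisotopic residue masses (Da) for unmodified amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_residue_masses(sequence):
    """Summed residue mass from the N-terminus to each peptide bond."""
    total, prms = 0.0, []
    for residue in sequence[:-1]:  # the last residue closes no peptide bond
        total += RESIDUE_MASS[residue]
        prms.append(total)
    return prms


def is_full_match(de_novo, reference):
    """Exact string match, allowing I/L substitutions."""
    return de_novo.replace("I", "L") == reference.replace("I", "L")


def prm_agreement(de_novo, reference, tol=0.01):
    """Fraction of reference PRMs also present in the de novo sequence."""
    ref = prefix_residue_masses(reference)
    found = prefix_residue_masses(de_novo)
    if not ref:
        return 0.0
    hits = sum(any(abs(r - f) <= tol for f in found) for r in ref)
    return hits / len(ref)
```

For example, swapping two adjacent residues (PEPTDIE versus PEPTIDE) fails the full-sequence test yet still recovers five of the six reference PRMs, which is why the two metrics are reported separately.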
PepNovo+ returns de novo sequences with gaps where it cannot confidently infer underlying amino acid subsequences and, as a result, produced fewer fully elucidated de novo sequence reconstructions than the other algorithms examined here. In contrast, PEAKS and pNovo attempt to create fully elucidated peptide sequence in all cases. However, pNovo restricts input MS/MS spectra to those with precursor ion mass less than 2000 Da due to computational constraints and thus did not return sequences for 28 and 20% of the high-confidence database-assigned PSMs in the bovine brain and yeast data sets, respectively. Nevertheless, we found that PEAKS and pNovo returned a similar number of peptide identifications that exactly matched high-confidence database-assigned PSMs on the bovine brain data set, but PEAKS’s performance was superior to both of the other algorithms on the yeast data set. Dramatic differences were observed among the three algorithms’ ability to discriminate between exact matches and mismatched de novo sequences based on their scoring functions (Figure 1d). PEAKS was, by far, the most discriminative algorithm of the three, recalling correct identifications with over 50% sensitivity in both data sets at a 10% predicted FDR. We next sought to determine the degree to which experimental disambiguation of fragment ion identity could

Label Assisted de Novo Sequencing (LADS)

To improve the depth of de novo sequencing and the ability to distinguish correct spectrum interpretations, we designed LADS, which maximizes the spectral information available for de novo sequencing. It takes as input a collection of MS/MS spectra derived from sets of peptides with chemically or isotopically distinct N- and C-termini.29 By deducing ion identity in MS/MS spectra from expected mass shifts conferred by introduced labels, LADS produces de novo peptide interpretations with increased accuracy and scores them in a manner that allows clear discrimination between correct and incorrect PSMs (Supporting Information Methods and Figure 2). First, MS/MS spectra are (i) deisotoped and (ii) clustered according to the likely peptide species of origin. Clustered spectra are combined to create consensus spectra. Consensus spectra are then (iii) paired if their precursor masses are consistent with light and heavy variants of the same peptide under the labeling strategy applied (e.g., 2MEGA,29 SILAC28). Spectra paired in this way are jointly considered in order to (iv) generate candidate de novo PSMs with a spectrum graph data structure.19 Finally, PSMs are (v) rescored and reranked by a support vector machine (SVM)37 model to facilitate discrimination between correct and incorrect PSMs.

LADS Effectively Characterizes Isolated Known Proteins

We first tested LADS using a defined mixture of nine known proteins purchased from commercial sources and labeled with the 2MEGA protocol (Figure S2). Peptides resulting from LysC digestion were labeled and analyzed by LC−MS/MS on an LTQ Orbitrap Velos mass spectrometer, collecting MS/MS spectra by HCD fragmentation. We defined a high-confidence set of PSMs attributable to these nine proteins using the database-dependent search algorithms SEQUEST and Mascot, searching against a composite sequence database composed of these proteins plus known contaminants and reversed decoy counterparts of all protein sequences (Table 1). Reconciling peptide sequences generated de novo by LADS with database search matches to known proteins served as a proof-of-concept validation of this approach. In particular, this analysis showed the feasibility of deducing filtering constraints from high-confidence database search results and applying them to strictly de novo search results with accurate FDR estimations. Furthermore, de novo discovery of peptides harboring amino acid substitutions, peptides derived from unanticipated contaminants, and peptides that were also identified by database search but were not scored sufficiently high to be included in the high-confidence data set demonstrates the potential advantage LADS has for overcoming limitations imposed by database searches (Supporting Information Methods and Figures S3 and S4a).

LADS Improves de Novo Sequencing Accuracy

The nine-protein experiment demonstrated that LADS can determine exact peptide sequences without relying on preexisting sequence databases and discriminate them from


Figure 3. (a) LADS vs PEAKS precision−recall. Precision−recall plots for LADS and PEAKS for de novo PSMs that are fully correct (identical) relative to high-confidence database search PSMs, determined using the method described in Figure 1b. (b) Comparison of high-confidence de novo PSMs on the yeast data set. Fully correct PSMs for PepNovo, pNovo, LADS, and PEAKS were retrieved at score cutoffs corresponding to 5 and 10% FDRs. (c) Comparison of unique peptides returned by LADS, PEAKS, and database search on yeast and labeled bovine brain samples. LADS and PEAKS returned complementary sets of unique peptide identifications, demonstrating the utility of searching data with multiple de novo sequencing algorithms. Of note, LADS returned more confident unique peptide identifications on the labeled bovine brain sample than PEAKS despite analyzing only a portion of the data, as only a subset of spectra in either data set could be paired (56% for labeled bovine brain and 67% for yeast).

incorrect spectrum interpretations. To test the extent to which this extends to complex peptide mixtures, we analyzed two whole-cell lysate data sets with LADS: the SILAC-labeled yeast data set used in the initial characterization of existing de novo sequencing algorithms (Figure 1) and a new 2MEGA-labeled bovine brain sample derived from the aforementioned unlabeled bovine brain homogenate (Table 1). These data sets were chosen to test the applicability and benefit of the LADS approach on conventional and widely generated SILAC-labeled samples (yeast) and whether the LADS methodology offers any advantage over prior de novo sequencing workflows (bovine brain sample). As before, MS/MS spectra derived from labeled bovine brain peptides were subjected to the database-dependent search algorithms and selection procedures described above to define a high-confidence set of identified peptides (Table 1). We further compared LADS’s performance to that of PEAKS, the strongest de novo algorithm emerging from the analysis in Figure 1. LADS’s spectrum pairing and clustering procedures performed with sensitivity and precision greater than 90%, as estimated from the high-confidence database-assigned peptides for both data sets (Figure S4b,c). As not all acquired spectra

could be paired, LADS returned PSMs for 56 and 67% of the spectra in the labeled bovine brain and yeast data sets, respectively. Of all database-assigned spectra for which both LADS and PEAKS returned peptide identifications, LADS returned a similar number of fully correct identifications as PEAKS from the yeast data set (135 108 vs 144 152) and produced 73% more fully correct identifications than PEAKS from the bovine brain sample (8369 vs 4849 spectra). We attribute this observation to the fact that LADS was able to maximize its use of available information by leveraging consensus between multiple, complementary MS/MS spectra. Although PEAKS performed near optimally on the yeast data set, it could not take full advantage of the information available in the bovine brain data set (Figures 1a and S5a,c). In the case of LADS, the pairing, clustering, and fragmentation models used to generate de novo sequences were trained on 2MEGA-labeled data for the bovine brain analysis and SILAC data for the yeast analysis. However, the model used for PEAKS was a generic vendor-supplied model trained on Orbitrap data. The drastic difference in the performance of PEAKS between the two data sets highlights how sensitive de novo algorithms can be to sample preparation methods. Furthermore, it


Figure 4. (a) Breakdown of peptides found by LADS but not database search at high confidence. Peptides returned by LADS but not database search were categorized according to whether or not they were exact matches to known yeast or bovine brain proteins, matches with SNPs or PTMs, or matches outside of the search space analyzed by database search. Peptides unable to be mapped to any sequence in NCBInr are indicated as unknown. (b) Error rates of high-confidence de novo-only peptides. Peptides returned by LADS but not database search were categorized as either correct or containing de novo sequencing errors based on their corresponding match in annotated sequence databases. These data were used to obtain the observed error rates from both the yeast and bovine brain samples. In both cases, error rates determined in this manner agreed with the FDR of 10% predicted from our scoring model.

this case, the 10% FDR threshold was the lowest we could reasonably apply in which a substantial number of PSMs was returned by both de novo algorithms. The increased discriminability of LADS on these data sets relative to PEAKS can be attributed partially to features in the SVM classifier derived from paired ion information (Figure S6 and Table S2). Remarkably, this increased discrimination power enabled LADS to return more unique peptide identifications than PEAKS (1078 vs 982) on the labeled bovine brain data set despite being able to sequence only 56% of the spectra available to PEAKS. On both the yeast and the bovine brain data sets, LADS and PEAKS highlighted distinct sets of confidently identified peptides (Figure 3c), demonstrating that paired ion de novo sequencing can complement and substantially extend proteome coverage relative to that with unpaired de novo analysis (Figure S7). Due to the increased discriminatory power afforded by paired spectrum sequencing, we expect that the development of optimized mass spectrometry workflows in which spectra corresponding to isotopic pairs are preferentially acquired would produce superior de novo analyses relative to existing workflows.
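The precursor-mass criterion behind spectrum pairing can be illustrated with a short Python sketch. It covers only the mass test: LADS additionally trains an SVM to accept or reject putative pairs, which is not modeled here. The function name, the 10 ppm tolerance, and the single-label default are assumptions for the example; the default shift is the mass difference between [13C6/15N2]lysine and unlabeled lysine, and the shift for another labeling chemistry such as 2MEGA would have to be supplied by the user.

```python
from bisect import bisect_left, bisect_right

LYS8_SHIFT = 8.014199  # Da, [13C6/15N2]lysine minus unlabeled lysine


def pair_precursors(neutral_masses, shift=LYS8_SHIFT, n_labels=1, tol_ppm=10.0):
    """Return (light_index, heavy_index) pairs of precursors whose neutral
    masses differ by n_labels * shift within tol_ppm."""
    order = sorted(range(len(neutral_masses)), key=lambda i: neutral_masses[i])
    masses = [neutral_masses[i] for i in order]
    pairs = []
    for pos, light in enumerate(masses):
        target = light + n_labels * shift
        tol = target * tol_ppm * 1e-6
        # Binary search for all candidate heavy partners in the window.
        lo = bisect_left(masses, target - tol)
        hi = bisect_right(masses, target + tol)
        for j in range(lo, hi):
            pairs.append((order[pos], order[j]))
    return pairs


observed = [1000.5000, 1008.5142, 1200.0000, 1008.6000]
paired = pair_precursors(observed)
```

In this toy input only the first two masses form a light/heavy pair; the precursor at 1008.6000 lies outside the 10 ppm window and is left unpaired, mirroring the situation in which only a subset of acquired spectra can be sequenced by a paired approach.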

emphasizes the necessity of training scoring models on training data subject to similar conditions as the experimental conditions for optimal performance. We further note that LADS demonstrated greater accuracy than PEAKS on both the yeast and bovine brain data sets in a length-dependent fashion, with the greatest increases observed for peptides >20 amino acids in length (Figure S5b). Long peptides are particularly challenging for de novo sequencing algorithms, since they must consider exponentially more possible solutions to input spectra. Consequently, previously described algorithms have tended to perform best on relatively short (5000; Figure S8b) We found that 80% of mapped peptides matched with full sequence identity to reference proteins found in the sequence database. However, 20% of mapped peptides in the yeast and bovine data sets were found with nonisobaric, single amino acid substitutions relative to the reference sequence. Such substitutions would preclude detection by database-dependent search algorithms using the commonly applied database search criteria used here (Figure 4a). To further evaluate the accuracy of predicted sequences, several de novo peptides were further validated by comparing MS/MS spectra derived from synthetic peptides to experimental ones (Supporting Information). Considering BLAST alignments and synthetic peptide confirmations, these data demonstrate that correct, de novo-sequenced peptides can be readily identified and distinguished from incorrect ones with a sufficiently calibrated scoring model. Four percent of confidently identified de novo-only peptides could not be mapped to the reference databases for both data sets described above. To further ascertain the source of these peptides, we used BLAST to map them to the larger NCBInr32 database representing 71 224 800 sequences across the tree of life.
Alignment with NCBInr revealed that half of these previously unmapped peptides were derived from known yeast and bovine protein sequences not present in the SGD and UniProt databases (Table 2). Of interest, 11 yeast data set peptides matched identically to the capsid and major coat proteins of S. cerevisiae virus L-A. The L-A RNA virus confers the “killer yeast” phenotype to its host by encoding a lethal secreted protein toxin as well as immunity to this toxin in its genome.50 While infection by L-A has been characterized in yeast, it was not included in the sequence space searched by the original paper’s authors, understandably, because it was not expected to be present. Re-searching the yeast-derived spectra against a database that included known L-A viral sequences revealed an additional 403 viral PSMs, supporting the infection status. Thus, through LADS-based de novo sequencing, we discovered a fundamental aspect of the yeast specimen that would otherwise have been unlikely to obtain using traditional database search methodologies. Furthermore, LADS identified three peptides in the bovine brain data set and 25 peptides in the yeast data set that could not be confidently localized to any

Table 2. Proteins Identified by LADS and Not Database Search Space in Yeast and Labeled Bovine Brain Samples

bovine brain
identifier; name; unique mapped peptides
gb|ABE68619.1|; immunoglobulin gamma 1 heavy chain constant region, partial [Bos taurus]; 2
ref|XP_605694.3|; PREDICTED: histone H1.5 [Bos taurus]; 1
ref|NP_001179691.1|; myosin-9 [Bos taurus]; 1
ref|XP_005205580.1|; PREDICTED: heterogeneous nuclear ribonucleoproteins A2/B1 isoform X4 [Bos taurus]; 1
ref|XP_005209487.1|; PREDICTED: matrin-3 isoform X1 [Bos taurus]; 1
ref|XP_005210390.1|; PREDICTED: heterogeneous nuclear ribonucleoprotein K isoform X5 [Bos taurus]; 1
ref|NP_001229513.1|; immunoglobulin-binding protein 1 [Bos taurus]; 1
ref|NP_001096765.1|; lamin-B1 [Bos taurus]; 1
ref|NP_001137339.1|; probable ATP-dependent RNA helicase DDX6 [Bos taurus]; 1

yeast
identifier; name; unique mapped peptides
ref|NP_042580.1|; capsid [Saccharomyces cerevisiae virus L-BC (La)]; 7
ref|NP_620494.1|; major coat protein [Saccharomyces cerevisiae virus L-A]; 4
emb|CAA96747.1|; unnamed protein product [Saccharomyces cerevisiae]; 1
ref|NP_010228.1|; mannose-1-phosphate guanylyltransferase [Saccharomyces cerevisiae S288c]; 1
ref|NP_040491.1|; Rep 2 protein [Saccharomyces cerevisiae A364A]; 1
ref|NP_009679.2|; glycine--tRNA ligase [Saccharomyces cerevisiae S288c]; 1
gb|AAA35182.1|; TUP1 protein [Saccharomyces cerevisiae]; 1
emb|CAA79744.1|; DMRL synthase [Saccharomyces cerevisiae]; 1
gb|EGA63143.1|; YCR087C-A-like protein [Saccharomyces cerevisiae FostersO]; 1
gb|EGA59287.1|; Mss4p [Saccharomyces cerevisiae FostersB]; 1

protein in NCBInr. These peptides could originate from completely novel proteins not present in existing sequence databases. Of note, the score cutoffs used to achieve a 10% FDR for both PEAKS and LADS were virtually unchanged over all complex data sets analyzed (Table S3), suggesting that, once calibrated to confident database search results on a well-characterized sample, cutoffs on de novo scoring functions can be generally applied to other proteomics samples of similar complexity even in the absence of companion high-confidence database search results.
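The cutoff calibration described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each de novo result on the calibration sample carries a score and a flag saying whether it agrees with a confident database search result, and it estimates the FDR at a cutoff as the fraction of accepted results flagged as incorrect. The function name `calibrate_cutoff` and its input layout are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def calibrate_cutoff(
    scored: Sequence[Tuple[float, bool]],
    target_fdr: float = 0.10,
) -> Optional[float]:
    """Return the lowest score cutoff whose accepted set meets target_fdr.

    scored: (score, is_correct) pairs, where is_correct is an assumed flag
        meaning "the de novo sequence agrees with a confident database
        search result for the same spectrum".
    Returns None when no cutoff satisfies the target.
    """
    # Walk from the best score downward, growing the accepted set.
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    best_cutoff = None
    n_incorrect = 0
    for n_accepted, (score, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            n_incorrect += 1
        # Results with tied scores are accepted or rejected together,
        # so only evaluate the FDR at the end of each tie group.
        if n_accepted < len(ranked) and ranked[n_accepted][0] == score:
            continue
        if n_incorrect / n_accepted <= target_fdr:
            best_cutoff = score
    return best_cutoff


# Toy calibration set (made-up scores, for illustration only).
toy = [(0.9, True), (0.8, True), (0.7, True),
       (0.6, False), (0.5, True), (0.4, False)]
cutoff_10 = calibrate_cutoff(toy, target_fdr=0.10)
cutoff_25 = calibrate_cutoff(toy, target_fdr=0.25)
```

Once such a cutoff has been fixed on a well-characterized sample, the same threshold would simply be applied to de novo scores from a new sample of similar complexity, which is the scenario the text suggests is reasonable.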



DISCUSSION

De novo sequencing has largely remained outside mainstream proteome analysis workflows in part because of a prevailing perception that it cannot deliver reliable full-length peptide sequence identifications or criteria with which such correct results can be distinguished. Here, we formalized criteria by which any de novo MS/MS spectrum interpreting algorithm can be evaluated, as an essential prerequisite for comparing and improving computational approaches. Through this process, we demonstrate a way to accurately identify sets of fully correct peptides from any de novo sequencing algorithm at a user-specified FDR. Holding de novo search engines to this more stringent standard demanded improvements in both the quality of individual PSMs and the tools that score them as being clearly distinct from a background of partial and incorrect sequence matches. Toward both ends, we developed LADS, a computational workflow that leverages experimentally introduced terminus-specific mass shifts in MS/MS spectrum pairs to produce full-length PSMs that can be readily discriminated from incorrect PSMs using an SVM-based scoring model. We demonstrate that LADS improves both discriminability and sequencing accuracy, particularly for longer peptides, relative to that of other de novo sequencers. However, a careful comparison of LADS (executed on an isotopically labeled peptide digest) with PEAKS (executed on a corresponding unlabeled peptide digest) demonstrated that this increase in de novo sequencing confidence comes at a cost: due to the throughput losses associated with the isotopic labeling experimental procedure and mass spectrometer duty cycles, standard label-free analysis maximizes the number of total peptides confidently identified de novo. Nevertheless, the aforementioned advantages of LADS warrant the further development and optimization of mass spectrometry workflows that maximize the acquisition of data from chemical or isotopic peptide pairs.51 We demonstrated that measurably accurate de novo sequencing is a viable hypothesis-generating strategy, even for well-studied systems such as the yeast heat shock response. 
With minimal effort, LADS-based de novo searching showed that the yeast strain used in that experiment carried a virus, the influence of which on yeast heat shock response and other stresses is not characterized. De novo peptide analyses are indispensable for characterizing other poorly defined or highly complex peptide mixtures such as soil52 and marine53 ecosystems and health-influencing proteins that are physically separated from their genomic sources, such as air-borne immunogens and environmental toxins.54,55 The techniques described here precisely identify correct full-length de novo peptide interpretations, providing a powerful approach for characterizing amino acid polymers not assayable with conventional high-throughput protein or nucleic acid sequencing technologies.
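The idea of using terminus-specific mass shifts in spectrum pairs to disambiguate fragment ions can be illustrated with a short sketch. This is not the LADS algorithm, which relies on trained SVM and probabilistic scoring models. The sketch assumes two lists of singly charged fragment m/z values from a "light" and a "heavy" spectrum of the same peptide, placeholder label mass differences `n_shift` and `c_shift` for the N- and C-terminal labels, and a fixed m/z tolerance. All names and numeric values are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List


def has_peak(sorted_mz: List[float], target: float, tol: float) -> bool:
    """True if any value in sorted_mz lies within tol of target."""
    i = bisect_left(sorted_mz, target - tol)
    return i < len(sorted_mz) and sorted_mz[i] <= target + tol


def classify_peaks(
    light: List[float],
    heavy: List[float],
    n_shift: float,
    c_shift: float,
    tol: float = 0.01,
) -> Dict[float, str]:
    """Label each light-spectrum peak by the shift of its heavy partner.

    A partner at +n_shift suggests a fragment containing the N-terminus
    (b-type); a partner at +c_shift suggests one containing the
    C-terminus (y-type). Peaks matching both or neither are left
    unresolved rather than guessed.
    """
    heavy_sorted = sorted(heavy)
    labels: Dict[float, str] = {}
    for mz in light:
        n_hit = has_peak(heavy_sorted, mz + n_shift, tol)
        c_hit = has_peak(heavy_sorted, mz + c_shift, tol)
        if n_hit and not c_hit:
            labels[mz] = "N-terminal"
        elif c_hit and not n_hit:
            labels[mz] = "C-terminal"
        elif n_hit and c_hit:
            labels[mz] = "ambiguous"
        else:
            labels[mz] = "unpaired"
    return labels


# Placeholder shifts and peaks; real values depend on the labeling chemistry.
example = classify_peaks(
    light=[200.0, 300.0, 400.0],
    heavy=[206.0, 304.0],
    n_shift=6.0,
    c_shift=4.0,
)
```

Knowing which terminus each fragment contains removes much of the ambiguity a de novo sequencer otherwise faces when assembling ion ladders, which is one plausible reason paired spectra help most on longer peptides.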





ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.5b00861.

Additional methods; LADS probabilistic scoring network (Figure S1); 2MEGA-labeling protocol (Figure S2); LADS identifies peptides correctly from a known protein mixture (Figure S3); pairing and clustering SVMs sensitivity and precision (Figure S4); LADS sequences spectra more accurately than PEAKS (Figure S5); relative feature importance in rescoring and reranking SVM (Figure S6); LADS and PEAKS identify complementary sets of high-confidence results (Figure S7); SVM training and cross-validation success increase with number of spectra in training and testing sets (Figure S8); LADS pairing and clustering SVM features (Table S1); LADS rescoring and reranking SVM features (Table S2); and summary of database and de novo results (Table S3) (PDF)
Synthetic vs experimental spectra comparison for LADS de novo peptide predictions (PDF)
All peptide−spectrum matches for all database search and de novo sequencing algorithms (Table S4) (XLSX)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Author Contributions

A.D. designed and implemented all algorithms, designed and carried out all LADS labeling experiments, and wrote the manuscript. J.E.E. designed algorithms and wrote the manuscript. Both authors discussed and made comments toward writing the manuscript.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

J.E.E. is a Damon Runyon-Rachleff Innovation Awardee supported in part by the Damon Runyon Cancer Research Foundation (DRR-13-11). This work was also supported by the W.M. Keck Foundation Medical Research Program (J.E.E.) and The Stanford Graduate Fund (A.D.). We wish to acknowledge members of the Elias lab and Karlene Cimprich, Daniel Jarosz, David Dill, Parag Mallick, and Tobias Meyer for helpful discussions during the preparation of this manuscript.

REFERENCES

(1) Zhang, Y.; Fonslow, B. R.; Shan, B.; Baek, M.-C.; Yates, J. R. Protein analysis by shotgun/bottom-up proteomics. Chem. Rev. 2013, 113, 2343−94.
(2) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of Multidimensional Chromatography Coupled with Tandem Mass Spectrometry (LC/LC−MS/MS) for Large-Scale Protein Analysis: The Yeast Proteome. J. Proteome Res. 2003, 2, 43−50.
(3) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551−67.
(4) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976−989.
(5) Zhang, J.; et al. PEAKS DB: De Novo sequencing assisted database search for sensitive and accurate peptide identification. Mol. Cell. Proteomics 2012, 11, M111.010587.
(6) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207−14.
(7) Calvete, J. J. Challenges and prospects of proteomics of nonmodel organisms. J. Proteomics 2014, 105, 1−4.
(8) Starck, S. R.; Shastri, N. Non-conventional sources of peptides presented by MHC class I. Cell. Mol. Life Sci. 2011, 68, 1471−9.
(9) Saska, I.; Craik, D. J. Protease-catalysed protein splicing: a new post-translational modification? Trends Biochem. Sci. 2008, 33, 363−8.
(10) Edwards, N. J. Novel peptide identification from tandem mass spectra using ESTs and sequence database compression. Mol. Syst. Biol. 2007, 3, 102.


(11) Woo, S.; et al. Proteogenomic Database Construction Driven from Large Scale RNA-seq Data. J. Proteome Res. 2014, 13, 21−8.
(12) Shilov, I. V.; et al. The Paragon Algorithm, a next generation search engine that uses sequence temperature values and feature probabilities to identify peptides from tandem mass spectra. Mol. Cell. Proteomics 2007, 6, 1638−55.
(13) Bern, M.; Kil, Y. J.; Becker, C. Byonic: advanced peptide and protein identification software. Curr. Protoc. Bioinformatics 2012, DOI: 10.1002/0471250953.bi1320s40.
(14) Cho, S.; et al. Current challenges in bacterial transcriptomics. Genomics Inform. 2013, 11, 76−82.
(15) Feldmesser, E.; Rosenwasser, S.; Vardi, A.; Ben-Dor, S. Improving transcriptome construction in non-model organisms: integrating manual and automated gene definition in Emiliania huxleyi. BMC Genomics 2014, 15, 148.
(16) Chi, H.; et al. pNovo: de novo peptide sequencing and identification using HCD spectra. J. Proteome Res. 2010, 9, 2713−24.
(17) Ma, B.; et al. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003, 17, 2337−42.
(18) Frank, A.; Pevzner, P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 2005, 77, 964−73.
(19) Dancík, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 1999, 6, 327−42.
(20) Allmer, J. Algorithms for the de novo sequencing of peptides from tandem mass spectra. Expert Rev. Proteomics 2011, 8, 645−57.
(21) Seidler, J.; Zinn, N.; Boehm, M. E.; Lehmann, W. D. De novo sequencing of peptides by MS/MS. Proteomics 2010, 10, 634−49.
(22) Trier, N. H.; Hansen, P. R.; Houen, G. Production and characterization of peptide antibodies. Methods 2012, 56, 136−44.
(23) Yadav, M.; et al. Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing. Nature 2014, 515, 572−576.
(24) Ueberheide, B. M.; Fenyö, D.; Alewood, P. F.; Chait, B. T. Rapid sensitive analysis of cysteine rich peptide venom components. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 6910−5.
(25) Villén, J.; Gygi, S. P. The SCX/IMAC enrichment approach for global phosphorylation analysis by mass spectrometry. Nat. Protoc. 2008, 3, 1630−8.
(26) Vizcaíno, J. A.; et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 2014, 32, 223−6.
(27) Nagaraj, N.; et al. System-wide perturbation analysis with nearly complete coverage of the yeast proteome by single-shot ultra HPLC runs on a bench top Orbitrap. Mol. Cell. Proteomics 2012, 11, M111.013722.
(28) Ong, S.-E.; et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 2002, 1, 376−86.
(29) Ji, C.; Guo, N.; Li, L. Differential dimethyl labeling of N-termini of peptides after guanidination for proteome analysis. J. Proteome Res. 2005, 4, 2099−108.
(30) Hu, Q.; et al. The Orbitrap: a new mass spectrometer. J. Mass Spectrom. 2005, 40, 430−43.
(31) Olsen, J. V.; et al. Higher-energy C-trap dissociation for peptide modification analysis. Nat. Methods 2007, 4, 709−12.
(32) Sayers, E. W.; et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2009, 37, D5−15.
(33) The UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014, 42, D191−8.
(34) Cherry, J. M.; et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 2012, 40, D700−5.
(35) Vizcaíno, J. A.; et al. A guide to the Proteomics Identifications Database proteomics data repository. Proteomics 2009, 9, 4276−83.
(36) Yuan, Z.; Shi, J.; Lin, W.; Chen, B.; Wu, F.-X. Features-based deisotoping method for tandem mass spectra. Adv. Bioinf. 2011, 2011, 210805.
(37) Chang, C.-C.; Lin, C.-J. LIBSVM. ACM Trans. Intell. Syst. Technol. 2011, 2, 1−27.
(38) Eng, J. K.; Searle, B. C.; Clauser, K. R.; Tabb, D. L. A face in the crowd: recognizing peptides through database search. Mol. Cell. Proteomics 2011, 10, R111.009522.
(39) Gevaert, K.; Vandekerckhove, J. Protein identification methods in proteomics. Electrophoresis 2000, 21, 1145−54.
(40) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383−92.
(41) Chi, H.; et al. pNovo+: de novo peptide sequencing using complementary HCD and ETD tandem mass spectra. J. Proteome Res. 2013, 12, 615−25.
(42) Cagney, G.; Emili, A. De novo peptide sequencing and quantitative profiling of complex protein mixtures using mass-coded abundance tagging. Nat. Biotechnol. 2002, 20, 163−70.
(43) Taouatas, N.; Drugan, M. M.; Heck, A. J. R.; Mohammed, S. Straightforward ladder sequencing of peptides using a Lys-N metalloendopeptidase. Nat. Methods 2008, 5, 405−7.
(44) Richards, A. L.; et al. Neutron-encoded signatures enable product ion annotation from tandem mass spectra. Mol. Cell. Proteomics 2013, 12, 3812−23.
(45) Savitski, M. M.; Nielsen, M. L.; Kjeldsen, F.; Zubarev, R. A. Proteomics-grade de novo sequencing approach. J. Proteome Res. 2005, 4, 2348−54.
(46) Bertsch, A.; et al. De novo peptide sequencing by tandem MS using complementary CID and electron transfer dissociation. Electrophoresis 2009, 30, 3736−47.
(47) He, L.; Ma, B. ADEPTS: advanced peptide de novo sequencing with a pair of tandem mass spectra. J. Bioinf. Comput. Biol. 2010, 8, 981−94.
(48) Datta, R.; Bern, M. Spectrum fusion: using multiple mass spectra for de novo Peptide sequencing. J. Comput. Biol. 2009, 16, 1169−82.
(49) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403−10.
(50) Breinig, F.; Sendzik, T.; Eisfeld, K.; Schmitt, M. J. Dissecting toxin immunity in virus-infected killer yeast uncovers an intrinsic strategy of self-protection. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 3810−5.
(51) Frese, C. K.; et al. Toward full peptide sequence coverage by dual fragmentation combining electron-transfer and higher-energy collision dissociation tandem mass spectrometry. Anal. Chem. 2012, 84, 9668−73.
(52) Bastida, F.; Hernández, T.; García, C. Metaproteomics of soils from semiarid environment: functional and phylogenetic information obtained with different protein extraction methods. J. Proteomics 2014, 101, 31−42.
(53) Poulson-Ellestad, K. L.; et al. Metabolomics and proteomics reveal impacts of chemically mediated competition on marine plankton. Proc. Natl. Acad. Sci. U. S. A. 2014, 111, 9009−14.
(54) Liu, J.; et al. Evaluation of inflammatory effects of airborne endotoxin emitted from composting sources. Environ. Toxicol. Chem. 2011, 30, 602−6.
(55) Gangamma, S.; Patil, R. S.; Mukherji, S. Characterization and proinflammatory response of airborne biological particles from wastewater treatment plants. Environ. Sci. Technol. 2011, 45, 3282−7.

DOI: 10.1021/acs.jproteome.5b00861 J. Proteome Res. 2016, 15, 732−742