Article pubs.acs.org/jpr

Combining De Novo Peptide Sequencing Algorithms, A Synergistic Approach to Boost Both Identifications and Confidence in Bottom-up Proteomics

Bernhard Blank-Landeshammer,† Laxmikanth Kollipara,† Karsten Biß,† Markus Pfenninger,‡,§ Sebastian Malchow,† Konstantin Shuvaev,† René P. Zahedi,† and Albert Sickmann*,†,∥,⊥

† Leibniz-Institut für Analytische Wissenschaften − ISAS − e.V., 44139 Dortmund, Germany
‡ Biodiversity and Climate Research Centre, Senckenberg Gesellschaft für Naturforschung, 60325 Frankfurt am Main, Germany
§ Faculty of Biological Science, Institute for Ecology, Evolution and Diversity, Department of Molecular Ecology, Goethe University, Max-von-Laue-Straße 9, 60438 Frankfurt am Main, Germany
∥ Medizinische Fakultät, Medizinische Proteom-Center (MPC), Ruhr-Universität Bochum, 44801 Bochum, Germany
⊥ Department of Chemistry, College of Physical Sciences, University of Aberdeen, Aberdeen AB24 3FX, Scotland, United Kingdom


ABSTRACT: Complex mass spectrometry based proteomics data sets are mostly analyzed by protein database searches. While this approach performs considerably well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e., de novo peptide sequencing, is often the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which, in conjunction with minor tweaks in data acquisition, yields a lower empirical FDR than the analysis using single algorithms. Results were validated using state-of-the-art database search algorithms as well as specifically synthesized reference peptides. Thus, we could increase the number of PSMs meeting a stringent FDR of 5% more than 3-fold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11 120 PSMs (combined) instead of 3476 PSMs (alone) in triplicate 2 h LC−MS runs of tryptic HeLa digestion.

KEYWORDS: de novo peptide sequencing, bottom-up proteomics, LC−MS/MS, false discovery rate



© 2017 American Chemical Society
Received: April 6, 2017 Published: July 25, 2017
DOI: 10.1021/acs.jproteome.7b00198 J. Proteome Res. 2017, 16, 3209−3218

INTRODUCTION

Peptide and protein identification by liquid chromatography coupled tandem mass spectrometry (LC−MS/MS) has become a widely accepted tool in biology and medicine alike.1 In commonly conducted “bottom-up” or shotgun proteomics experiments, the interpretation of spectra is usually done either by searching against (i) a protein database or (ii) a spectral library.2 Both approaches exploit an effective reduction of the search space to (i) peptide sequences which are biologically plausible or (ii) spectra which have been experimentally observed before. Search engines such as MASCOT,3 SEQUEST4 or X!Tandem5 employ database-dependent approaches, while, e.g., SpectraST6 or X!Hunter7 are commonly used for spectral library searching. In both cases, certain a priori knowledge is needed, which restricts the applicability of these methods to well-described and thoroughly studied/sequenced organisms. Notably, due to steadily decreasing costs and vast efforts of the scientific community, the genomes of more and more non-model organisms have been sequenced and hence become accessible via database search-based methods.8 However, the vast majority of species has not been sequenced yet, or if so, the often short, raw sequencing reads generated by next generation sequencing (NGS) have not been reconstructed to full, chromosome-level genomes. RNA sequencing can provide better suited pseudoprotein databases with less effort than whole genome sequencing. These potentially include alternative splicing events and single nucleotide polymorphisms (SNPs), but ideally would require a parallel RNA-Seq analysis of the corresponding samples, associated with additional effort, time and costs.9 On the other hand, alternative splicing events and SNPs often are not accounted for in standard protein databases. In these cases, de novo peptide sequencing, i.e., directly inferring the peptide sequence from an acquired tandem mass spectrum without prior restrictions by databases, is often a valuable alternative. Furthermore, in cases where no reliable protein sequence can be inferred, e.g., the sequencing of antibodies, de novo sequencing can add reliable information.10

A vast variety of de novo peptide sequencing algorithms is currently available. All-in-one solutions such as PEAKS,11 Byonic12 or ProteinLynx13 provide de novo sequencing algorithms as part of complete proteomic pipeline solutions, together with database search algorithms and further features. However, the majority of algorithms, such as pepNovo,14 pNovo+,15 Novor16 or UniNovo17, are freely available to the community and are operated via a command-line interface. Efforts have been made toward more user-friendly environments, e.g., denovoGUI allows controlling three de novo algorithms via a Java-based user interface.18 Despite this vast selection of algorithms and the efforts of many researchers and developers, de novo peptide sequencing is still largely neglected, especially when it comes to high-throughput experiments and complex data sets. This might be attributed to several inherent limitations of de novo peptide sequencing: The number of possible permutations of amino acids for a given peptide mass is intrinsically high and increases with decreasing mass accuracy. Therefore, often several candidate sequences with equal score can match a given MS/MS spectrum. This imposes considerable issues for downstream data analysis, as per possibly true candidate sequence, several false positives are equally likely and cannot be readily distinguished. This also hampers the implementation of a realistic FDR-estimation for de novo sequencing, comparable to the target-decoy approach established for database searches.

Some intrinsic ambiguities can be resolved by acquisition of high resolution MS and particularly MS/MS data, which facilitates the differentiation of amino acids and amino acid combinations, e.g., K and Q or DP and LV differ by 36 mDa each, and F and methionine sulfoxide differ by 22 mDa.19 Besides sufficient mass accuracy, the presence of fragment ions ideally covering the entire peptide sequence is crucial for accurate de novo sequence annotation. Combining complementary fragmentation techniques11,15,17,20,21 or differential labeling of peptides to facilitate b- and y-ion ladder identification based on specifically induced mass shifts22,23 have been introduced to improve de novo sequencing, however, at the expense of considerably reduced acquisition rates. Since every peptide has to be analyzed multiple times, either with different fragmentation modes or modification states, the identification rates, particularly for complex samples, are low.

Devabhaktuni et al. recently showed that it is possible to estimate FDRs of de novo annotations in complex samples by comparing these to database search results.22 In contrast, we describe here that, besides optimized data acquisition settings, the combination of multiple de novo sequencing algorithms increases confidence without the need for labeling or multiple fragmentation steps. We evaluated this approach by analyzing three different data sets of well-described organisms, i.e., human (HeLa cell line), mouse (C2C12 cell line) and yeast (strain W303), and comparing the combined de novo results to state-of-the-art database search engines. Furthermore, we chose an organism with a still unsequenced genome, i.e., Radix auricularia. Here, we selected peptide sequences at different levels of confidence, obtained from our combined de novo annotation workflow, and synthesized these in a forward and a “shuffled” version, i.e., exchanging the first two amino acid positions. Finally, we compared high resolution MS/MS spectra of those reference peptides to the endogenous de novo sequenced MS/MS spectra from the Radix auricularia sample. Subsequently, an N-terminal FPR (nFPR) for this reference set was calculated based on the matching of experimental spectra to the reference spectra of the N-terminally shuffled synthetic peptides. This comparison indicated an FPR comparable to that of the database-validated sample sets.



MATERIALS AND METHODS

Materials and Reagents

Ammonium hydrogen carbonate (NH4HCO3), anhydrous magnesium chloride (MgCl2), guanidine hydrochloride (GuHCl), iodoacetamide (IAA), and urea were obtained from Sigma-Aldrich, Steinheim, Germany. Tris base was acquired from Applichem Biochemica, Darmstadt, Germany. Sodium dodecyl sulfate (SDS) was purchased from Carl Roth, Karlsruhe, Germany. Dithiothreitol (DTT) and EDTA-free protease inhibitor (Complete Mini) tablets were bought from Roche Diagnostics, Mannheim, Germany. Sodium chloride (NaCl) and calcium chloride (CaCl2) were purchased from Merck, Darmstadt. Sequencing grade modified trypsin was bought from Promega, Madison, WI, USA. Benzonase endonuclease was purchased from Novagen. Bicinchoninic acid assay (BCA) kit was acquired from Thermo Fisher Scientific, Dreieich, Germany. Formic acid (FA), trifluoroacetic acid (TFA) and acetonitrile (ACN) were obtained from Biosolve, Valkenswaard, Netherlands.

Sample Lysis

To validate the performance of the different de novo peptide sequencing algorithms, a set of complex test samples was prepared. To rule out possible influences from distinct samples or databases, different organisms, namely human (HeLa), a mouse muscle cell line (C2C12) and S. cerevisiae (strain W303), were chosen. Furthermore, the pond snail R. auricularia, of which no comprehensive genome data has been published yet, was included to act as a proof of concept. Unless otherwise stated, the following procedures, i.e., cell lysis, tissue dissection and ultrasonication, were carried out on ice. Lysis and dissection procedures were performed under the laminar flow hood.

HeLa (Homo sapiens), C2C12 (Mus musculus), and Yeast-W303 (Saccharomyces cerevisiae). Approximately 1 mg of cells of each organism was suspended in 300 μL of lysis buffer (LB) composed of 50 mM Tris-HCl (pH 7.8), 150 mM NaCl, 1% SDS, and Complete Mini. Subsequently, 6 μL of benzonase (25 U/μL) and 2 mM MgCl2 were added to the lysates, which were incubated at 37 °C for 30 min. Samples were clarified by centrifugation at 4 °C and 18 000g for 30 min.

Radix auricularia. A snail was freshly harvested from an in-house maintained aquarium and was immersed in ice-cold ethanol (100%) for 10 min. Next, the foot was dissected using a sterile scalpel, thoroughly cleaned and cut into small pieces. To these, 300 μL of LB were added, and the mixture was subjected to mechanical grinding. Next, lysates were further homogenized by ultrasonication for 30 s (amplitude: 30; pulse 1 s/1 s). Benzonase treatment and centrifugation were performed as described for HeLa.

Estimation of Protein Concentration and Carbamidomethylation

A colorimetric bicinchoninic acid assay (Pierce BCA Protein Assay Kit) was performed to estimate protein concentration. Next, reduction of disulfide bonds was carried out by addition of 10 mM DTT at 56 °C for 30 min, and subsequent alkylation was done using 30 mM IAA for 30 min at room temperature (RT) in the dark.

LC−MS/MS Analysis

One μg of each sample was analyzed using an Ultimate 3000 nano RSLC system coupled to a Q Exactive HF mass spectrometer (both Thermo Fisher Scientific). To minimize systematic errors, the samples were measured in a random order. Preconcentration of peptides on a 100 μm × 2 cm C18 precolumn for 10 min using 0.1% TFA with a flow rate of 20 μL/min was followed by separation on a 75 μm × 50 cm C18 analytical column (both PepMap RSLC, Thermo Scientific). A 120 min LC gradient ranging from 3 to 42% of buffer B (84% ACN, 0.1% FA) at a flow rate of 250 nL/min was used. Synthetic peptides were separated similarly, with a 35 min LC gradient ranging from 3 to 50% of buffer B. The Q Exactive HF was operated in data-dependent acquisition mode using the following parameters. The full MS scans were acquired within 300−1500 m/z at 60 000 resolution using the polysiloxane ion at m/z 371.101236 as lock mass.30 The top 15 most intense ions were isolated with a window of 0.4 m/z and were fragmented using higher-energy collisional dissociation (HCD) with a normalized collision energy of 27%, and the dynamic exclusion duration for precursor masses was limited to 12 s. The MS/MS spectra were acquired at 15 000 resolution, and automatic gain control target values were set to 3 × 10^6 for MS and 5 × 10^4 for MS/MS. Maximum injection times were 120 and 250 ms for MS and MS/MS, respectively, whereas precursor ions with charge states of +1, >+5 or unassigned were excluded from MS/MS analysis. The underfill ratio, which specifies the minimum percentage of the target value likely to be reached at maximum fill time, was defined as 5%, which corresponds to a minimum precursor intensity of 2.5 × 10^3 to trigger an MS/MS scan.
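The relation between the MS/MS AGC target, the underfill ratio and the stated trigger threshold can be checked with a few lines of Python. This is only a sketch of the arithmetic given in the text (threshold = AGC target × underfill ratio); the function name is hypothetical and nothing here reflects instrument control software.

```python
import math


def min_precursor_signal(agc_target: float, underfill_percent: float) -> float:
    """Minimum signal needed to trigger an MS/MS scan.

    Assumption taken from the text: the threshold is the MS/MS AGC target
    multiplied by the underfill ratio (given in percent).
    """
    if not 0.0 <= underfill_percent <= 100.0:
        raise ValueError("underfill_percent must lie between 0 and 100")
    return agc_target * underfill_percent / 100.0


# Settings reported above: MS/MS AGC target 5 x 10^4, underfill ratio 5%.
threshold = min_precursor_signal(5e4, 5.0)
print(f"minimum precursor intensity: {threshold:.1e}")
```

With the reported settings the result agrees with the 2.5 × 10^3 quoted in the text.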

Sample Preparation for LC−MS Analysis

The filter-aided sample preparation (FASP)24,25 protocol was employed for sample cleanup and proteolytic digestion, albeit with slight changes. Each sample lysate corresponding to 50 μg of protein was diluted 10-fold with newly prepared 8.0 M urea in 100 mM Tris-HCl (pH 8.5) buffer26 and subsequently transferred onto a centrifugal device (PALL Nanosep, 30 kDa molecular weight cutoff). The device was centrifuged at 13 500g at RT for 30 min, and these conditions were identical for all following centrifugation steps. For removal of residual SDS, the device was treated three times with 100 μL of 8.0 M urea buffer. Next, to eliminate urea, the device was washed three times with 100 μL of 50 mM NH4HCO3 (pH 7.8). For proteolysis, 100 μL of buffer comprising trypsin (Promega) (1:20 w/w, protease to substrate), 0.2 M GuHCl and 2 mM CaCl2 in 50 mM NH4HCO3 (pH 7.8) were added to each device containing concentrated proteins and incubated at 37 °C for 14 h. The tryptic peptides thus generated were collected by centrifugation followed by consecutive wash steps with 50 μL of 50 mM NH4HCO3 and 50 μL of ultrapure water, respectively. Finally, peptides were acidified using 10% TFA, and quality control of the digests was carried out using a monolithic HPLC as described previously.27

Synthetic Peptides

74 peptide sequences were selected based on the annotations made by three de novo peptide sequencing algorithms on a data set comprising three 2 h LC−MS/MS runs of a Radix auricularia tryptic digest. First, from the annotated spectra where all de novo sequencing algorithms agree, 11 sequences were selected from a subset of annotations with high scores (i.e., PEAKS ALC > 95), 13 sequences from a subset with medium scores (PEAKS ALC > 80 and 50 was compared to the corresponding Mascot annotation. Only fully annotated, completely correct peptide sequences were considered as true hits, whereas L and I were treated as equal. By narrowing down the isolation window, the total number of false positives could be reduced significantly, while the number of true positives remained constant throughout the runs (data not shown). Therefore, the narrowest isolation window (i.e., 0.4 m/z) was chosen for subsequent analyses.

Evaluation of De Novo Algorithms
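The validation rule used for evaluation (only completely correct sequences count as true hits, with L and I treated as equal) can be sketched in Python as follows. The function names and the example sequences are illustrative assumptions, and the empirical FDR is taken here simply as the fraction of evaluated de novo annotations that disagree with the database reference; the exact bookkeeping used by the authors is not spelled out in this section.

```python
from typing import Iterable, Tuple


def normalize(sequence: str) -> str:
    """Map isoleucine to leucine, since the isobaric pair is treated as equal."""
    return sequence.upper().replace("I", "L")


def is_true_hit(de_novo: str, reference: str) -> bool:
    """A de novo annotation counts only if the whole sequence matches the reference."""
    return normalize(de_novo) == normalize(reference)


def empirical_fdr(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (de novo, reference) pairs in which the de novo sequence is wrong."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    false_hits = sum(not is_true_hit(de_novo, ref) for de_novo, ref in pairs)
    return false_hits / len(pairs)


# Hypothetical annotations paired with their database reference sequences.
example = [
    ("PEPTLDEK", "PEPTIDEK"),                 # L/I difference only: true hit
    ("TDDAFDVLGFTAEEK", "TDDAFDVLGFTAEEK"),  # identical: true hit
    ("DTDAFDVLGFTAEEK", "TDDAFDVLGFTAEEK"),  # first two residues swapped: false hit
    ("LVSK", "LVSK"),                         # identical: true hit
]
print(empirical_fdr(example))
```

Sorting annotations by score and recomputing this fraction while lowering the cutoff is one way to find a score threshold for a target FDR such as 5%.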

As pointed out by Devabhaktuni and Elias,22 performing high-throughput de novo sequencing of complex samples requires that one is able to determine completely correct de novo sequence annotations at defined FDR-rates. To validate this, data sets of three different organisms were generated, applying the aforementioned optimized MS/MS settings. Each de novo

Combined Approach

As can be derived from the ROC-curves depicted above (Figure 1), the recovery of true peptide sequences drastically drops when reducing the FDR to a considerable level, e.g., 10% or 5% FDR. Therefore, as an additional discriminatory measure, we here propose to use the agreement of different de novo sequencing algorithms on distinct MS/MS spectra. When combining the results of two algorithms (workflow in Figure 3(A)), additional score-cutoffs have to be applied in order for the results to reach a confidence level below 5% FDR. Combination of three algorithms (Figure 3(B)) renders score-cutoffs unnecessary, as the “agreeing” subset as a whole shows an FDR of ≤5% in all our measured data sets. As mentioned above, a predominant share of de novo sequencing errors in general occurs due to permutation of the two most N-terminal amino acids (i.e., approximately 70% of errors in the validated HeLa data set). A subset of these sequences can be “rescued” by removing those ambiguous amino acids. Agreement of the algorithms on the remainder of the respective sequences results in a subset of so-called “truncated” peptide hits again meeting a 5% FDR level. This way, for every experiment two de novo sequence sets are generated: The main set comprises PSMs where the algorithms agree on the entire sequence, while in the truncated set all “rescued” PSMs are included, i.e., here the sequence agreement comprises the entire sequence except for the first two amino acids, which are reversed. When combining two algorithms (Figure 3A), cutoff scores were determined empirically. For the main set of peptide sequences, the cutoff had to be set at either a PEAKS ALC of 95 or a Novor Score of 82. For the truncated sequence set, minimum scores were 70 for the PEAKS ALC and 65 for the Novor Score. These values suffice to maintain an FDR of ≤5% in all three analyzed data sets (i.e., human, mouse and yeast). Figure 4 depicts the improved recovery of PSMs at a fixed FDR of 5% for the combined workflow in contrast to individual de novo search algorithms with score-cutoff to reach the same FDR-level. The corresponding figures for C2C12 and yeast are given in the Supporting Information.

Figure 2. Error position determination of de novo annotations by PEAKS compared to reference sequences from database searches (Mascot, Sequest, MS Amanda). X-axis shows the total number of sequencing errors, while the first error position starting from the N-terminus of the peptide is represented by the color scheme, e.g., two shuffled amino acids at the N-terminus are displayed as “error at pos. 1”. Accordingly, errors where the first amino acid is assigned correctly, but positions 2 and 3 are shuffled are displayed as “error at pos. 2”. For de novo annotations with differing sequence length to the reference, exact error positions were not determined.

Figure 3. Flowchart of the combined de novo workflows with (A) two algorithms and (B) three algorithms.

Figure 4. Number of spectra annotated by individual de novo algorithms and combination of two and three, respectively (5% FDR); three HeLa replicates. Blue bars represent the main de novo data set, where the full peptide sequence is correctly annotated, while red bars represent the truncated data set, where the first two amino acid positions were removed to meet the 5% FDR criteria. Database searches with Mascot, Sequest and MS Amanda at 1% FDR on PSM level as reference.

Table 1. Overview of All Utilized Datasets and the Respective Database Search and De Novo Annotations

Data set  Rep.  MS/MS scans  DB algorithm PSMs (1% FDR)  PEAKS (5% FDR)  Novor (5% FDR)  PEAKS | Novor (5% FDR)  PEAKS | Novor | pNovo (5% FDR)
HeLa      1     36584        24943                       3589            2770            9862                    11332
HeLa      2     37227        24946                       3308            2595            9673                    11184
HeLa      3     38321        25757                       3533            2601            9301                    10845
C2C12     1     36410        25150                       4016            2806            9631                    10885
C2C12     2     37282        23506                       3771            2488            8772                    10034
C2C12     3     36238        24276                       3569            2411            8531                    9857
Yeast     1     35372        17169                       3336            2280            7743                    8771
Yeast     2     37000        17795                       3174            2141            7517                    8658
Yeast     3     36557        17358                       3034            2049            6994                    8153
Radix     1     35786        −                           3235            1800            9931                    11381
Radix     2     34769        −                           3174            1779            9630                    10958
Radix     3     36634        −                           3322            2002            10843                   12314

Figure 5. Evaluation of de novo annotations by synthesis of reference peptides and comparison.

In the HeLa data set, the number of identified spectra at 5% FDR increased from an

average of 3476 (PEAKS alone) to 11 120 spectra for the combined approach, corresponding to a more than 3-fold increase in confident identifications. Likewise, for C2C12 and yeast, identifications could be increased almost 3-fold when compared to PEAKS, the best-performing individual de novo algorithm. Additionally, the truncated sequence set (allowing deletion of the first two N-terminal amino acids) accounts for another 3890 identifications. The total recoveries across all three data sets in comparison to conventional database searches range from 37% to 47% on the PSM level and from 43% to 56% on the peptide level, in both cases considering only complete de novo annotations, i.e., ignoring truncated peptides. Table 1 shows a complete overview of all data sets. For the main sets, in addition to the validated hits, approximately 10% (HeLa data set), 11% (C2C12 data set) and 15% (Yeast data set) of de novo annotations meeting all criteria were assigned to MS/MS spectra for which no database search algorithm matched a hit passing the 1% FDR cutoff. For the HeLa data set, these “de novo only” (DNO) peptide sequences were analyzed in detail: (i) Searching against a common contaminant database (247 entries) revealed that approximately 7% of the DNO-hits were assigned to trypsin, keratin and the like. This is in good accordance with the findings of Griss et al.,37 who analyzed unidentified spectra in hundreds of PRIDE data sets and found that a considerable share originated from trypsin, keratin, albumin and hemoglobin, thereby illustrating the importance of including contaminants in databases used in proteomics experiments. (ii a) 50% of the DNO sequences were actually present in the initially used UniProt human database. Upon relaxing the FDR-cutoff of the database search

also to 5%, approximately 60% of those hits were also identified by the search algorithms (i.e., Mascot, Sequest and MS Amanda), indicating that the corresponding matches indeed have failed to pass the more stringent FDR-cutoffs. Recalculation of the de novo FDR for these hits revealed an error-rate close to 8%, most likely due to error propagation at the relaxed criteria. (ii b) One third of these DNO-hits present in the database represent nontryptic cleavages at peptide N-termini (semitryptic peptides) and hence were not matched by the database search algorithms. These peptides most likely stem from enzyme miscleavages,27 in vivo proteolysis, or degradation during sample preparation and are inaccessible for conventional fully tryptic database searches. A change in the settings of the database search algorithms to semitrypsin specificity led to identification of 30% of those peptides, but this comes at the cost of a minor loss of sensitivity and a massive increase of search time (data not shown). (iii) One quarter of the DNO-hits differ by a single amino acid from sequences in the database. 60% of those show a putative amino acid substitution from N to D (mass shift of +0.9840 Da), corresponding to a deamidation. Notably, the second most prominent substitution with 11% occurrence is S to E (mass shift of +42.0106 Da), all but a few occurring at the N-terminus. These are most likely the product of in vivo acetylation events. Remaining sequences might indicate single nucleotide polymorphisms not accounted for by routine proteomics approaches. Notably, those sequences could potentially be identified as well by open modification or homology-driven search engines.38,39 (iv) About 7% of the peptide sequences only found by the combined de novo approach differ in three or more amino acids from the reference and contaminant database sequences. Of those 103 cases in the three replicates, 40 have a direct match in the more extensive GENCODE protein-coding transcript sequences (release 25),40 indicating an incompleteness of the employed UniProt protein database. Peculiarly, 18 peptide sequences have been found in all three replicates but remain unexplained and do not match any previously published proteins (blastp search against nonredundant protein sequences). Validity of the annotation was manually confirmed. Overall, 90% of the DNO-hits showed complete sequence similarity or a single amino acid difference to the database, while only 3% of the sequences showed a putative de novo sequencing error, owing to differing isobaric sequences such as ST instead of TS or AG instead of Q.

Comparison of Synthetic Peptides Reference Spectra

In order to validate the annotations of the R. auricularia data set, synthetic peptides were created in “forward” and “shuffle” conformation (i.e., with the first two amino acids inverted) for a set of PSMs from the main and truncated de novo sequence sets, as depicted in Figure 5. Synthetic peptides were analyzed with the same settings as the complex data set, and the best scoring PSMs (Mascot ion score) were selected as reference spectra and compared against the original de novo sequenced MS/MS spectra (see Figure 5). Dot products were calculated by comparing the reference spectra of the “forward” and “shuffle” peptide to the original MS/MS spectrum derived from the complex sample. If the forward version’s score surpassed that of the shuffled peptide, the de novo PSM was considered a true positive hit (see Figure 6). On the basis of this presumption, among the 37 selected spectra, 3 scored higher for the shuffled version, leading to an FP-rate of 8.1%. As this FP-estimation ultimately only accounts for the first two N-terminal amino acids, we propose the term N-terminal FPR (nFPR). Therefore, direct assumptions about the correctness of the complete sequence cannot be made; however, the commonly high dot products indicate a high confidence in the correct identification of the whole peptide sequence. Accordingly, the nFPR of the truncated set of de novo annotations could be calculated as 22.5%, in case these two amino acids are taken into account. Figure 7 shows the graphical representation of the score differences. Notably, the selection of peptides from different score regimes does not represent the distribution of peptides in the final data sets. Higher scoring peptides (e.g., ALC > 90) are much more prominent and amount to approximately 75% of all PSMs. Therefore, the set of validated peptides should be perceived as an overview and not be mistaken as an error determination of the whole data set, as FP-rates cannot be directly inferred.
However, all high scoring de novo PSMs that have been validated in principle identified the correct sequence, whereas a certain share mis-annotated the first two N-terminal amino acids, considering that these sequences are correct in the database.
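Because the residual errors concentrate in the first two N-terminal residues, the split into a main and a truncated set that underlies this comparison can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the per-algorithm score cutoffs required for the two-algorithm case (PEAKS ALC and Novor Score) are deliberately left out, and any disagreement confined to the first two positions is accepted for the truncated set.

```python
from typing import Optional, Sequence, Tuple


def normalize(sequence: str) -> str:
    """Treat the isobaric residues L and I as equal."""
    return sequence.upper().replace("I", "L")


def classify_psm(
    annotations: Sequence[str], n_trim: int = 2
) -> Optional[Tuple[str, str]]:
    """Classify one spectrum from the sequences proposed by several algorithms.

    Returns ("main", sequence) when all algorithms agree on the full sequence,
    ("truncated", remainder) when they agree only after dropping the first
    n_trim residues, and None when there is no usable agreement.
    """
    seqs = [normalize(s) for s in annotations]
    if len(seqs) < 2:
        return None  # agreement needs at least two algorithms
    if all(s == seqs[0] for s in seqs):
        return ("main", seqs[0])
    same_length = all(len(s) == len(seqs[0]) for s in seqs)
    tails = [s[n_trim:] for s in seqs]
    if same_length and tails[0] and all(t == tails[0] for t in tails):
        return ("truncated", tails[0])
    return None


# Hypothetical outputs of two or three algorithms for single spectra.
print(classify_psm(["TDDAFDVLGFTAEEK", "TDDAFDVIGFTAEEK", "TDDAFDVLGFTAEEK"]))
print(classify_psm(["TDDAFDVLGFTAEEK", "DTDAFDVLGFTAEEK"]))
print(classify_psm(["PEPTLDEK", "PEPSLDEK"]))
```

Spectra for which the function returns None would simply be left out of both sets.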

Figure 6. Direct comparison of (i) a spectrum taken from the R. auricularia data set (blue) which was annotated as TDDAFDVLGFTAEEK by the de novo algorithms and (ii) two synthetic peptide-derived spectra representing the sequences TDDAFDVLGFTAEEK (top; dot product: 0.9899) and DTDAFDVLGFTAEEK (bottom; dot product: 0.9282), thus indicating the correctness of the “forward” sequence version of MS/MS spectra.
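The comparison shown in Figure 6 rests on dot products between the endogenous spectrum and the two synthetic reference spectra. How the dot products were computed is not specified here, so the following Python sketch uses a generic binned, normalized dot product (cosine similarity); the bin width of 0.02 m/z, the peak lists and all function names are assumptions made for illustration only.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def binned(peaks: Iterable[Peak], bin_width: float) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    vector: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        vector[round(mz / bin_width)] += intensity
    return vector


def dot_product(spec_a: Iterable[Peak], spec_b: Iterable[Peak],
                bin_width: float = 0.02) -> float:
    """Normalized dot product of two peak lists; 1.0 means identical pattern."""
    a, b = binned(spec_a, bin_width), binned(spec_b, bin_width)
    shared = a.keys() & b.keys()
    numerator = sum(a[k] * b[k] for k in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return numerator / norm if norm else 0.0


def forward_supported(query, forward_ref, shuffled_ref) -> bool:
    """True positive if the forward reference scores higher than the shuffled one."""
    return dot_product(query, forward_ref) > dot_product(query, shuffled_ref)


def n_terminal_fpr(shuffled_wins: int, total: int) -> float:
    """nFPR: share of spectra that score higher against the shuffled reference."""
    return shuffled_wins / total


# Toy peak lists (hypothetical values, not taken from the data set).
query = [(100.0, 10.0), (200.0, 5.0)]
forward = [(100.0, 10.0), (200.0, 5.0)]
shuffled = [(100.0, 10.0), (250.0, 5.0)]
print(forward_supported(query, forward, shuffled))
print(f"{100 * n_terminal_fpr(3, 37):.1f}%")  # 3 of 37 spectra, as reported for the main set
```

Applied to the counts reported in the text (3 of 37 spectra favoring the shuffled version), the last line reproduces the stated 8.1%.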

sample preparation and analysis times or consumption of large sample amounts, which can be cumbersome and detrimental, especially when dealing with minute amounts of starting material. Here, we demonstrate that even without specific labeling or combinatorial fragmentation techniques the performance of de novo peptide sequencing can be considerably improved. Applying minor optimizations in MS/ MS acquisition settings and combining results from different de novo sequencing algorithms, the confidence and recovery of de novo annotations in complex proteomic samples can be significantly increased. In a reference data set, PEAKS as the best-performing single de novo sequencing algorithm on average annotated 3476 spectra at 5% FDR, corresponding to only 14% of the database search algorithm PSMs at 1% FDR. Notably, using our combination approach, these numbers were multiplied to 11 120 and 45% at a 5% FDR-level. As discussed recently by Gorshkov et al.,41 coisolation and consequently produced chimeric spectra are major limitations in shotgun proteomics and hamper the performance of both database search and de novo peptide sequencing algorithms. Deconvoluting such chimeric spectra a posteriori seems to improve confident spectrum annotation. However, modern MS instrumentation provides sufficient sensitivity to reduce



DISCUSSION Reliable de novo peptide sequencing is still a challenge, especially when it comes to complex biological samples. A multitude of approaches have been developed over the past decades in order to improve recovery and sensitivity of de novo annotations. This includes the usage of multiple enzymes, special labeling techniques or mixed fragmentation modes. Most of these techniques come along with severely prolonged 3215

DOI: 10.1021/acs.jproteome.7b00198 J. Proteome Res. 2017, 16, 3209−3218

Article

Journal of Proteome Research

data, as the necessary homology search step introduces another layer of complexity and ambiguity. Therefore, conventional database searches against closely related species are often performed additionally. Most of the above-mentioned de novo sequencing approaches generate partially redundant but complementary spectral information to improve de novo sequencing on the expense of sensitivity and depth. Our combined-algorithm approach can be readily used for combined search procedures.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.7b00198. Figures S1−S5 (PDF) Additional data (PDF) Additional data (PDF)

Figure 7. Dot product deltas of spectra selected from complex samples and two different shuffled versions of synthetic peptides, acquired with same settings. Positive differences indicate a higher dot product for the forward version, corresponding to a fully correct annotation. In the main data set, a paramount number of spectra score higher to their “forward” sequence reference (i.e., the one corresponding to the de novo annotation). This does not hold true for the truncated data set, indicating de novo sequencing ambiguities that could be resolved by removal of the first two amino acids.



AUTHOR INFORMATION

Corresponding Author

*Tel: +49-231-1392-100. Fax: +49-231-1392-200. E-mail: [email protected]. ORCID

Albert Sickmann: 0000-0002-2388-5265 Author Contributions

coisolation already during acquisition by using narrow isolation windows, having a greater impact, as it renders deconvolution of chimeric spectra rather unnecessary and avoids ambiguities introduced by deconvolution algorithms. As shown in this work, a considerable share of de novo sequencing errors occurs at the two most N-terminal amino acids, due to the absence of b1-fragment ions and low intensity of N-terminal y-ions. Notably, the usage of ion trap-based collision induced dissociation (CID) fragmentation in conjunction with the Orbitrap MS/MS acquisition could not compensate for this (data not shown). This is an inherent shortcoming of CID fragmentation which cannot be mended a posteriori by improved bioinformatics pipelines, but rather has to be addressed by different sample preparation approaches or alternative and combinatorial fragmentation techniques such as ETD and EThCD.42,43 NTerminal labeling of peptides to prevent cyclization of the b2 ions in order to improve the detection of b1 ions, as employed by Devabhaktuni and Elias22 improves the traceability of those crucial fragment ions, but requires an additional sample preparation step. Likewise, the usage of proteases cleaving Nterminal of specific residues should facilitate de novo sequence annotation, as b1 ions are not required whereas the b2 ion suffices to identify the most N-terminal sequence tag. Van Breukelen et al. have shown this for LysN;44 however, the peptide fragments generated by LysN are considerably longer than, e.g., tryptic peptides, which tends to hamper confident de novo sequence annotation as more fragment ions have to be detected for a complete sequence coverage. Moreover, the specificity of the protease, which for instance is rather high for well-defined trypsin and GluC digestions, may be an issue that can further complicate peptide identification. The recently described enzyme LysargiNase45 may overcome this issue by generating peptides mirroring tryptic digests. 
Notably, protein inferenceone of the major goals in any proteomics experimentis more challenging from de novo

The manuscript was written by B.B.L. All authors contributed to writing and discussion. B.B.L., R.P.Z., and A.S. conceived of the study. All authors have given approval to the final version of the manuscript.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work was funded by the Leibniz-Competition Fund (SAW-2014-ISAS-2-D). The financial support by the Ministerium für Innovation, Wissenschaft und Forschung des Landes Nordrhein-Westfalen, the Regierende Bürgermeister von Berlin - inkl. Wissenschaft und Forschung, and the Bundesministerium für Bildung und Forschung is gratefully acknowledged.



ABBREVIATIONS

ACN, acetonitrile; BCA, bicinchoninic acid; C2C12, mouse myoblast cell line; FDR, false discovery rate; FPR, false positive rate; HeLa, human cervical cancer cell line; LC−MS/MS, liquid chromatography−tandem mass spectrometry; PSM, peptide spectrum match; TFA, trifluoroacetic acid



REFERENCES

(1) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422, 198−207.
(2) MacCoss, M. J. Computational analysis of shotgun proteomics data. Curr. Opin. Chem. Biol. 2005, 9, 88−94.
(3) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551−3567.
(4) Yates, J. R.; Eng, J. K.; Clauser, K. R.; Burlingame, A. L. Search of Sequence Databases with Uninterpreted High-Energy Collision-Induced Dissociation Spectra of Peptides. J. Am. Soc. Mass Spectrom. 1996, 7, 1089−1098.


DOI: 10.1021/acs.jproteome.7b00198 J. Proteome Res. 2017, 16, 3209−3218


(5) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466−1467.
(6) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7, 655−667.
(7) Craig, R.; Cortens, J. C.; Fenyo, D.; Beavis, R. C. Using Annotated Peptide Mass Spectrum Libraries for Protein Identification. J. Proteome Res. 2006, 5, 1843−1849.
(8) Armengaud, J.; Trapp, J.; Pible, O.; Geffard, O.; Chaumot, A.; Hartmann, E. M. Non-model organisms, a species endangered by proteogenomics. J. Proteomics 2014, 105, 5−18.
(9) Abreu, R. d. S.; Penalva, L. O.; Marcotte, E. M.; Vogel, C. Global signatures of protein and mRNA expression levels. Mol. BioSyst. 2009, 5, 1512−1526.
(10) Bandeira, N.; Pham, V.; Pevzner, P.; Arnott, D.; Lill, J. R. Beyond Edman Degradation: Automated De novo Protein Sequencing of Monoclonal Antibodies. Nat. Biotechnol. 2008, 26, 1336−1338.
(11) Zhang, J.; Xin, L.; Shan, B.; Chen, W.; Xie, M.; Yuen, D.; Zhang, W.; Zhang, Z.; Lajoie, G. A.; Ma, B. PEAKS DB: De Novo Sequencing Assisted Database Search for Sensitive and Accurate Peptide Identification. Mol. Cell. Proteomics 2012, 11, M111.010587.
(12) Bern, M.; Kil, Y. J.; Becker, C. Byonic: Advanced Peptide and Protein Identification Software. Curr. Protoc. Bioinf. 2012, DOI: 10.1002/0471250953.bi1320s40.
(13) O’Malley, R. Life’s (more than) a BLAST. Biochemist 2002, 24, 21−23.
(14) Frank, A.; Pevzner, P. PepNovo: De Novo Peptide Sequencing via Probabilistic Network Modeling. Anal. Chem. 2005, 77, 964−973.
(15) Chi, H.; Chen, H.; He, K.; Wu, L.; Yang, B.; Sun, R.-X.; Liu, J.; Zeng, W.-F.; Song, C.-Q.; He, S.-M.; Dong, M.-Q. pNovo+: De Novo Peptide Sequencing Using Complementary HCD and ETD Tandem Mass Spectra. J. Proteome Res. 2013, 12, 615−625.
(16) Ma, B. Novor: real-time peptide de novo sequencing software. J. Am. Soc. Mass Spectrom. 2015, 26, 1885−1894.
(17) Jeong, K.; Kim, S.; Pevzner, P. A. UniNovo: a universal tool for de novo peptide sequencing. Bioinformatics 2013, 29, 1953−1962.
(18) Muth, T.; Weilnböck, L.; Rapp, E.; Huber, C. G.; Martens, L.; Vaudel, M.; Barsnes, H. DeNovoGUI: An Open Source Graphical User Interface for de Novo Sequencing of Tandem Mass Spectra. J. Proteome Res. 2014, 13, 1143−1146.
(19) Frank, A. M.; Savitski, M. M.; Nielsen, M. N.; Zubarev, R. A.; Pevzner, P. A. De Novo Peptide Sequencing and Identification with Precision Mass Spectrometry. J. Proteome Res. 2007, 6, 114−123.
(20) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. New Data Base-independent, Sequence Tag-based Scoring of Peptide MS/MS Data Validates Mowse Scores, Recovers Below Threshold Data, Singles Out Modified Peptides, and Assesses the Quality of MS/MS Techniques. Mol. Cell. Proteomics 2005, 4, 1180−1188.
(21) Guthals, A.; Clauser, K. R.; Frank, A. M.; Bandeira, N. Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides. J. Proteome Res. 2013, 12, 2846−2857.
(22) Devabhaktuni, A.; Elias, J. E. Application of de Novo Sequencing to Large-Scale Complex Proteomics Data Sets. J. Proteome Res. 2016, 15, 732−742.
(23) Bandeira, N.; Tsur, D.; Frank, A.; Pevzner, P. A. A new approach to protein identification. Lecture Notes in Computer Science 2006, 3909, 363−378.
(24) Wisniewski, J. R.; Zougman, A.; Nagaraj, N.; Mann, M. Universal sample preparation method for proteome analysis. Nat. Methods 2009, 6, 359−362.
(25) Manza, L. L.; Stamer, S. L.; Ham, A. J.; Codreanu, S. G.; Liebler, D. C. Sample preparation and digestion for proteomic analyses using spin filters. Proteomics 2005, 5, 1742−1745.
(26) Laxmikanth, K.; Zahedi, R. P. Protein carbamylation: In vivo modification or in vitro artefact? Proteomics 2013, 13, 941−944.
(27) Burkhart, J. M.; Schumbrutzki, C.; Wortelkamp, S.; Sickmann, A.; Zahedi, R. P. Systematic and quantitative comparison of digest efficiency and specificity reveals the impact of trypsin quality on MS-based proteomics. J. Proteomics 2012, 75, 1454−1462.
(28) Dickhut, C.; Feldmann, I.; Lambert, J.; Zahedi, R. P. Impact of Digestion Conditions on Phosphoproteomics. J. Proteome Res. 2014, 13, 2761−2770.
(29) Burkhart, J. M.; Premsler, T.; Sickmann, A. Quality control of nano-LC−MS systems using stable isotope-coded peptides. Proteomics 2011, 11, 1049−1057.
(30) Olsen, J. V.; Godoy, L. M. F. d.; Li, G.; Macek, B.; Mortensen, P.; Pesch, R.; Makarov, A.; Lange, O.; Horning, S.; Mann, M. Parts per Million Mass Accuracy on an Orbitrap Mass Spectrometer via Lock Mass Injection into a C-trap. Mol. Cell. Proteomics 2005, 4, 2010−2021.
(31) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24, 2534−2536.
(32) R Development Core Team. R: A Language and Environment for Statistical Computing, 3.3.1 ed.; R Foundation for Statistical Computing: Vienna, Austria, 2016.
(33) Gatto, L.; Lilley, K. S. MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 2012, 28, 288−289.
(34) Houel, S.; Abernathy, R.; Renganathan, K.; Meyer-Arendt, K.; Ahn, N. G.; Old, W. M. Quantifying the Impact of Chimera MS/MS Spectra on Peptide Identification in Large-Scale Proteomics Studies. J. Proteome Res. 2010, 9, 4152−4160.
(35) Paizs, B.; Suhai, S. Fragmentation pathways of protonated peptides. Mass Spectrom. Rev. 2005, 24, 508−548.
(36) Michalski, A.; Neuhauser, N.; Cox, J.; Mann, M. A Systematic Investigation into the Nature of Tryptic HCD Spectra. J. Proteome Res. 2012, 11, 5479−5491.
(37) Griss, J.; Perez-Riverol, Y.; Lewis, S.; Tabb, D. L.; Dianes, J. A.; del-Toro, N.; Rurik, M.; Walzer, M.; Kohlbacher, O.; Hermjakob, H.; Wang, R.; Vizcaino, J. A. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods 2016, 13, 651−656.
(38) Na, S.; Bandeira, N.; Paek, E. Fast Multi-blind Modification Search through Tandem Mass Spectrometry. Mol. Cell. Proteomics 2012, 11, M111.010199.
(39) Kong, A. T.; Leprevost, F. V.; Avtonomov, D. M.; Mellacheruvu, D.; Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 2017, 14, 513−520.
(40) Harrow, J.; Frankish, A.; Gonzalez, J. M.; Tapanari, E.; Diekhans, M.; Kokocinski, F.; Aken, B. L.; Barrell, D.; Zadissa, A.; Searle, S.; Barnes, I.; Bignell, A.; Boychenko, V.; Hunt, T.; Kay, M.; Mukherjee, G.; Rajan, J.; Despacio-Reyes, G.; Saunders, G.; Steward, C.; Harte, R.; Lin, M.; Howald, C.; Tanzer, A.; Derrien, T.; Chrast, J.; Walters, N.; Balasubramanian, S.; Pei, B.; Tress, M.; Rodriguez, J. M.; Ezkurdia, I.; van Baren, J.; Brent, M.; Haussler, D.; Kellis, M.; Valencia, A.; Reymond, A.; Gerstein, M.; Guigó, R.; Hubbard, T. J. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res. 2012, 22, 1760−1774.
(41) Gorshkov, V.; Hotta, S. Y. K.; Verano-Braga, T.; Kjeldsen, F. Peptide de novo sequencing of mixture tandem mass spectra. Proteomics 2016, 16, 2470−2479.
(42) Syka, J. E.; Coon, J. J.; Schroeder, M. J.; Shabanowitz, J.; Hunt, D. F. Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. Proc. Natl. Acad. Sci. U. S. A. 2004, 101, 9528−9533.
(43) Frese, C. K.; Altelaar, A. F.; van den Toorn, H.; Nolting, D.; Griep-Raming, J.; Heck, A. J.; Mohammed, S. Toward full peptide sequence coverage by dual fragmentation combining electron-transfer and higher-energy collision dissociation tandem mass spectrometry. Anal. Chem. 2012, 84, 9668−9673.
(44) van Breukelen, B.; Georgiou, A.; Drugan, M. M.; Taouatas, N.; Mohammed, S.; Heck, A. J. R. LysNDeNovo: An algorithm enabling de novo sequencing of Lys-N generated peptides fragmented by electron transfer dissociation. Proteomics 2010, 10, 1196−1201.


(45) Huesgen, P. F.; Lange, P. F.; Rogers, L. D.; Solis, N.; Eckhard, U.; Kleifeld, O.; Goulas, T.; Gomis-Ruth, F. X.; Overall, C. M. LysargiNase mirrors trypsin for protein C-terminal and methylation-site identification. Nat. Methods 2015, 12, 55−58.
