Combining De Novo Peptide Sequencing Algorithms, A Synergistic


Article

Combining de novo peptide sequencing algorithms, a synergistic approach to boost both identifications and confidence in bottom-up proteomics

Bernhard Blank-Landeshammer, Laxmikanth Kollipara, Karsten Biß, Markus Pfenninger, Sebastian Malchow, Konstantin Shuvaev, René P. Zahedi, and Albert Sickmann

J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 25 Jul 2017
Downloaded from http://pubs.acs.org on July 25, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Combining de novo peptide sequencing algorithms, a synergistic approach to boost both identifications and confidence in bottom-up proteomics

Bernhard Blank-Landeshammer¹, Laxmikanth Kollipara¹, Karsten Biß¹, Markus Pfenninger²,³, Sebastian Malchow¹, Konstantin Shuvaev¹, René P. Zahedi¹, Albert Sickmann¹,⁴,⁵

¹ Leibniz-Institut für Analytische Wissenschaften - ISAS - e.V., Dortmund, Germany
² Biodiversity and Climate Research Centre, Senckenberg Gesellschaft für Naturforschung, Frankfurt, Germany
³ Faculty of Biological Science, Institute for Ecology, Evolution and Diversity, Department of Molecular Ecology, Goethe University, Max-von-Laue-Straße 9, Frankfurt am Main, 60438, Germany
⁴ Medizinische Fakultät, Medizinische Proteom-Center (MPC), Ruhr-Universität Bochum, Bochum, Germany
⁵ Department of Chemistry, College of Physical Sciences, University of Aberdeen, Aberdeen, Scotland, United Kingdom

Abstract

Complex mass spectrometry-based proteomics datasets are mostly analyzed by protein database searches. While this approach performs well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e. de novo peptide sequencing, is often the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as a lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which, in conjunction with minor tweaks in data acquisition, yields a lower empirical false discovery rate (FDR) than analysis with any single algorithm. Results were validated using state-of-the-art database search algorithms as well as specifically synthesized reference peptides. Thus, we could increase the number of peptide-spectrum matches (PSMs) meeting a stringent FDR of 5% more than threefold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11,120 PSMs (combined) instead of 3,476 PSMs (alone) in triplicate 2 h LC-MS runs of a tryptic HeLa digest.

KEYWORDS De novo peptide sequencing, bottom-up proteomics, LC-MS/MS, false discovery rate

Introduction

Peptide and protein identification by liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) has become a widely accepted tool in biology and medicine alike (1). In commonly conducted “bottom-up” or shotgun proteomics experiments, spectra are usually interpreted by searching against either (i) a protein database or (ii) a spectral library (2). Both approaches exploit an effective reduction of the search space to (i) peptide sequences that are biologically plausible or (ii) spectra that have been experimentally observed before. Search engines such as MASCOT (3), SEQUEST (4) or X!Tandem (5) employ database-dependent approaches, while e.g. SpectraST (6) or X!Hunter (7) are commonly used for spectral library searching. In both cases, a priori knowledge is needed, which restricts the applicability of these methods to well-described and thoroughly studied or sequenced organisms. Notably, due to steadily decreasing costs and vast efforts of the scientific community, the genomes of more and more non-model organisms have been sequenced and have hence become accessible via database search-based methods (8).

However, the vast majority of species has not been sequenced yet, or, if so, the often short raw reads generated by next-generation sequencing (NGS) have not been assembled into full, chromosome-level genomes. RNA sequencing can provide better suited pseudo-protein databases with less effort than whole-genome sequencing. These potentially include alternative splicing events and single nucleotide polymorphisms (SNPs), but ideally would require a parallel RNA-Seq analysis of the corresponding samples, associated with additional effort, time and costs (9). On the other hand, alternative splicing events and SNPs are often not accounted for in standard protein databases. In these cases, de novo peptide sequencing, i.e. directly inferring the peptide sequence from an acquired tandem mass spectrum
without prior restrictions by databases, is often a valuable alternative. Furthermore, in cases where no reliable protein sequence can be inferred, e.g. the sequencing of antibodies, de novo sequencing can add reliable information (10).

A wide variety of de novo peptide sequencing algorithms is currently available. All-in-one solutions such as PEAKS (11), Byonic (12) or ProteinLynx (13) provide de novo sequencing algorithms as part of complete proteomic pipeline solutions, together with database search algorithms and further features. However, the majority of algorithms, such as PepNovo (14), pNovo+ (15), Novor (16) or UniNovo (17), are freely available to the community and are operated via a command-line interface. Efforts have been made towards more user-friendly environments, e.g. DeNovoGUI allows controlling three de novo algorithms via a Java-based user interface (18).

Despite this vast selection of algorithms and the efforts of many researchers and developers, de novo peptide sequencing is still largely neglected, especially when it comes to high-throughput experiments and complex datasets. This might be attributed to several inherent limitations of de novo peptide sequencing. The number of possible amino acid permutations for a given peptide mass is intrinsically high and increases with decreasing mass accuracy. Therefore, several candidate sequences with equal score can often match a given MS/MS spectrum. This poses considerable problems for downstream data analysis, since for each possibly true candidate sequence, several false positives are equally likely and cannot be readily distinguished. It also hampers the implementation of a realistic FDR estimation for de novo sequencing, comparable to the target-decoy approach established for database searches. Some intrinsic ambiguities can be resolved by acquiring high-resolution MS and particularly MS/MS data, which facilitates the differentiation of amino acids and amino acid combinations, e.g. K and Q or DP and LV differ by 36 mDa each, and F and methionine sulfoxide differ by 22 mDa (19). Besides sufficient mass accuracy, the presence of fragment ions ideally covering the
entire peptide sequence is crucial for accurate de novo sequence annotation. Complementary fragmentation techniques (11, 15, 17, 20, 21) and differential labeling of peptides, which facilitates b- and y-ion ladder identification based on specifically induced mass shifts (22, 23), have been introduced to improve de novo sequencing, albeit at the expense of considerably reduced acquisition rates. Since every peptide has to be analyzed multiple times, either with different fragmentation modes or modification states, the identification rates are low, particularly for complex samples. Devabhaktuni et al. recently showed that it is possible to estimate FDRs of de novo annotations in complex samples by comparing these to database search results (22).

In contrast, we describe here that, besides optimized data acquisition settings, the combination of multiple de novo sequencing algorithms increases confidence without the need for labeling or multiple fragmentation steps. We evaluated this approach by analyzing three different datasets of well-described organisms, i.e. human (HeLa cell line), mouse (C2C12 cell line) and yeast (strain W303), and comparing the combined de novo results to state-of-the-art database search engines. Furthermore, we chose an organism with a still unsequenced genome, i.e. Radix auricularia. Here, we selected peptide sequences at different levels of confidence obtained from our combined de novo annotation workflow and synthesized these in a forward and a “shuffled” version, i.e. with the first two amino acid positions exchanged. Finally, we compared high-resolution MS/MS spectra of those reference peptides to the endogenous de novo sequenced MS/MS spectra from the Radix auricularia sample. Subsequently, an N-terminal FPR (nFPR) for this reference set was calculated based on the matching of experimental spectra to the reference spectra of the N-terminally shuffled synthetic peptides. This comparison indicated an FPR comparable to that of the database-validated sample sets.
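
The central idea, accepting an annotation only when several de novo algorithms report the same sequence for the same MS/MS scan, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the workflow used in this study: the function names, the dictionary layout and the example sequences are hypothetical, and the isobaric residues I and L are simply treated as identical.

```python
def normalize(sequence: str) -> str:
    """Treat the isobaric residues I and L as identical."""
    return sequence.upper().replace("I", "L")


def consensus(annotations: dict[str, dict[int, str]]) -> dict[int, str]:
    """Return the scans for which every algorithm reports the same sequence.

    `annotations` maps an algorithm name to {scan number: top-ranked sequence}.
    """
    per_algorithm = list(annotations.values())
    # Only scans annotated by all algorithms can be compared.
    shared_scans = set(per_algorithm[0])
    for result in per_algorithm[1:]:
        shared_scans &= set(result)

    agreed = {}
    for scan in sorted(shared_scans):
        sequences = {normalize(result[scan]) for result in per_algorithm}
        if len(sequences) == 1:
            agreed[scan] = sequences.pop()
    return agreed


# Hypothetical top-ranked annotations (illustrative sequences only).
example = {
    "PEAKS": {1: "LGEHNIDVLEGNEQFINAAK", 2: "SAMPLEK", 3: "TESTPEPTIDER"},
    "Novor": {1: "IGEHNLDVLEGNEQFLNAAK", 2: "ASMPLEK", 3: "TESTPEPTIDER"},
    "pNovo+": {1: "LGEHNIDVLEGNEQFINAAK", 2: "SAMPLEK"},
}
print(consensus(example))
```

In this toy input only scan 1 survives: scan 2 differs in the order of the first two residues, and scan 3 was not annotated by all three algorithms.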

Materials & Methods

Materials and reagents

Ammonium hydrogen carbonate (NH4HCO3), anhydrous magnesium chloride (MgCl2), guanidine hydrochloride (GuHCl), iodoacetamide (IAA), and urea were obtained from Sigma-Aldrich, Steinheim, Germany. Tris base was acquired from Applichem Biochemica, Darmstadt, Germany. Sodium dodecyl sulfate (SDS) was purchased from Carl Roth, Karlsruhe, Germany. Dithiothreitol (DTT) and EDTA-free protease inhibitor (Complete Mini) tablets were bought from Roche Diagnostics, Mannheim, Germany. Sodium chloride (NaCl) and calcium chloride (CaCl2) were purchased from Merck, Darmstadt, Germany. Sequencing grade modified trypsin was bought from Promega, Madison, WI, USA. Benzonase® endonuclease was purchased from Novagen. The bicinchoninic acid assay (BCA) kit was acquired from Thermo Fisher Scientific, Dreieich, Germany. Formic acid (FA), trifluoroacetic acid (TFA) and acetonitrile (ACN) were obtained from Biosolve, Valkenswaard, Netherlands.

Methods

Sample lysis

To validate the performance of the different de novo peptide sequencing algorithms, a set of complex test samples was prepared. To rule out possible influences from distinct samples or databases, different organisms, namely human (HeLa), a mouse muscle cell line (C2C12) and S. cerevisiae (strain W303), were chosen. Furthermore, the pond snail R. auricularia, for which no comprehensive genome data have been published yet, was included as a proof of concept.

Unless otherwise stated, the following procedures, i.e. cell lysis, tissue dissection and ultrasonication, were carried out on ice. Lysis and dissection procedures were performed under a laminar flow hood.

1. HeLa (Homo sapiens), C2C12 (Mus musculus), and Yeast-W303 (Saccharomyces cerevisiae)

Approximately 1 mg of cells of each organism was suspended in 300 µL of lysis buffer (LB) comprising 50 mM Tris-HCl (pH 7.8), 150 mM NaCl, 1% SDS, and Complete Mini. Subsequently, 6 µL of benzonase (25 U/µL) and 2 mM MgCl2 were added to the lysates, which were incubated at 37°C for 30 min. Samples were clarified by centrifugation at 4°C and 18,000 g for 30 min.

2. Radix auricularia

A snail was freshly harvested from an aquarium maintained in house and immersed in ice-cold ethanol (100%) for 10 min. Next, the foot was dissected using a sterile scalpel, thoroughly cleaned and cut into small pieces. To these, 300 µL of LB were added, and the tissue was subjected to mechanical grinding. Next, lysates were further homogenized by ultrasonication for 30 seconds (amplitude: 30; pulse 1 s/1 s). Benzonase treatment and centrifugation were performed as described for HeLa.

Estimation of protein concentration and carbamidomethylation

A colorimetric bicinchoninic acid assay (Pierce BCA Protein Assay Kit) was performed to estimate protein concentration. Next, disulfide bonds were reduced by addition of 10 mM DTT at 56°C for 30 min, followed by alkylation using 30 mM IAA for 30 min at room temperature (RT) in the dark.

Sample preparation for LC-MS analysis

The filter-aided sample preparation (FASP) protocol (24, 25) was employed for sample cleanup and proteolytic digestion, with slight changes. Each sample lysate corresponding to 50 µg of protein was diluted 10-fold with freshly prepared 8.0 M urea in 100 mM Tris-HCl (pH 8.5) buffer (26) and subsequently transferred onto a centrifugal device (PALL Nanosep, 30 kDa molecular weight cutoff). The device was centrifuged at 13,500 g at RT for 30 min, and these conditions were identical for all following centrifugation steps. To remove residual SDS, the device was washed three times with 100 µL of 8.0 M urea buffer. Next, to eliminate urea, the device was washed three times with 100 µL of 50 mM NH4HCO3 (pH 7.8). For proteolysis, 100 µL of buffer comprising trypsin (Promega) (1:20 w/w, protease to substrate), 0.2 M GuHCl and 2 mM CaCl2 in 50 mM NH4HCO3 (pH 7.8) were added to each device containing concentrated proteins and incubated at 37°C for 14 h. The resulting tryptic peptides were collected by centrifugation followed by consecutive wash steps with 50 µL of 50 mM NH4HCO3 and 50 µL of ultra-pure water, respectively. Finally, peptides were acidified using 10% TFA, and quality control of the digests was carried out using monolithic HPLC as described previously (27).

Synthetic peptides

A set of 74 peptide sequences was selected based on the annotations made by three de novo peptide sequencing algorithms on a dataset comprising three 2 h LC-MS/MS runs of a Radix auricularia tryptic digest. Firstly, from the annotated spectra where all de novo sequencing algorithms agreed, 11 sequences were selected from a subset of annotations with high scores (i.e. PEAKS ALC > 95), 13 sequences from a subset with medium scores (PEAKS ALC > 80 and < 90) and 13 sequences from a subset with low scores (PEAKS ALC < 75). Secondly, from the annotated spectra where the de novo algorithms disagreed on the sequence or identity of the first two amino
acids, 14, 13 and 12 peptide sequences were chosen from the respective score ranges mentioned above. Finally, for every selected sequence, a second version, denoted “shuffled”, was created, in which the order of the first two amino acids was reversed, as these often represent a major challenge for de novo sequencing owing to the frequent absence of the required y-ions and the absence of the b1 ion. Additionally, due to the isobaric properties of the amino acid combinations VT and SL, one sequence allowed for the creation of three shuffled versions. In total, 154 different peptide sequences were synthesized in house using a semi-automated solid-phase peptide synthesizer (Syro I, Multisyntech, Witten, Germany) with Fmoc chemistry and commercially available TentaGel resins preloaded with lysine or arginine. The peptides were purified as described previously (28), and cysteine-containing peptides were carbamidomethylated as described above. Peptides were aliquoted into 4 sets (50 fmol per peptide) to ensure that shuffled versions were not analyzed in the same LC-MS/MS run set as their corresponding forward version and could be unambiguously distinguished. Wash blanks between samples were used to exclude memory effects that could interfere with the analysis (29).
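
The construction of a shuffled version as described above amounts to exchanging the first two residues. The following minimal sketch illustrates this; the function name and the example sequences are hypothetical and are not peptides from this study.

```python
def shuffled_version(sequence: str) -> str:
    """Return the peptide with its first two residues exchanged."""
    if len(sequence) < 2:
        raise ValueError("sequence needs at least two residues")
    return sequence[1] + sequence[0] + sequence[2:]


# Hypothetical forward sequences, for illustration only.
for forward in ["VTAPEK", "GALLDR"]:
    print(forward, "->", shuffled_version(forward))
```

Because the amino acid composition is unchanged, forward and shuffled peptides have the same precursor mass and differ only in the fragment ions that separate the first residue from the second.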

LC-MS/MS analysis

1 µg of each sample was analyzed using an Ultimate 3000 nano RSLC system coupled to a Q Exactive HF mass spectrometer (both Thermo Fisher Scientific). To minimize systematic errors, the samples were measured in random order. Preconcentration of peptides on a 100 µm x 2 cm C18 pre-column for 10 min using 0.1% TFA at a flow rate of 20 µL/min was followed by separation on a 75 µm x 50 cm C18 analytical column (both PepMap RSLC, Thermo Scientific). A 120 min LC gradient ranging from 3 to 42% of buffer B (84% ACN, 0.1% FA) at a
flow rate of 250 nL/min was used. Synthetic peptides were separated similarly, with a 35 min LC gradient ranging from 3 to 50% of buffer B. The Q Exactive HF was operated in data-dependent acquisition mode using the following parameters. Full MS scans were acquired from 300 to 1,500 m/z at a resolution of 60,000, using the polysiloxane ion at m/z 371.101236 as lock mass (30). The 15 most intense ions were isolated with a window of 0.4 m/z and fragmented using higher-energy collisional dissociation (HCD) with a normalized collision energy of 27%, and the dynamic exclusion duration for precursor masses was limited to 12 s. MS/MS spectra were acquired at a resolution of 15,000, and automatic gain control target values were set to 3 x 10^6 for MS and 5 x 10^4 for MS/MS. Maximum injection times were 120 ms and 250 ms for MS and MS/MS, respectively, and precursor ions with charge states of +1, > +5 or unassigned were excluded from MS/MS analysis. The underfill ratio, which specifies the minimum percentage of the target value likely to be reached at maximum fill time, was set to 5%, which corresponds to a minimum precursor intensity of 2.5 x 10^3 to trigger an MS/MS scan.
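
The minimum precursor intensity quoted above follows directly from the MS/MS automatic gain control target and the underfill ratio. The short calculation below reproduces that arithmetic; the variable names are chosen for illustration only.

```python
agc_target_msms = 50_000   # MS/MS AGC target, 5 x 10^4 (from the text)
underfill_percent = 5      # underfill ratio in percent (from the text)

# Minimum precursor intensity required to trigger an MS/MS scan.
min_precursor_intensity = agc_target_msms * underfill_percent / 100
print(min_precursor_intensity)  # 2500.0
```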

Data analysis

1. Database search

In order to generate reference annotations to be used as a template for the FDR calculation of the de novo sequencing algorithms, database searches were performed for all samples using Proteome Discoverer 1.4 (Thermo Scientific). Three different search algorithms were used (Mascot 2.4 (Matrix Science), SEQUEST and MS Amanda), and the searches were conducted in a target/decoy manner against the respective protein sequence database in FASTA format (Homo sapiens, 20,207 target sequences, downloaded from UniProt in July 2015; Mus musculus, 16,746 target sequences, downloaded from UniProt in July 2015; S. cerevisiae, 6,622 target sequences,
downloaded from Saccharomyces Genome Database SGD in July 2015). Enzyme specificity was set to trypsin, allowing for two missed cleavages. Cysteine carbamidomethylation and methionine oxidation were set as fixed and variable modifications, respectively. Precursor and fragment ion mass tolerances were 10 ppm and 0.02 Da, respectively. Percolator was used to adjust the false discovery rate (FDR) to 1% at the peptide-spectrum match (PSM) level, and only PSMs with search engine rank 1 were considered further for evaluation.
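
The precursor tolerance is a relative value (ppm), whereas the fragment tolerance is absolute (Da). As a minimal sketch of what the 10 ppm setting means in absolute terms, the hypothetical helper functions below convert a ppm tolerance to Da and test whether an observed m/z falls inside the window; they do not reproduce the internals of any search engine.

```python
def ppm_to_da(mz: float, ppm: float) -> float:
    """Convert a relative tolerance in ppm to an absolute tolerance in Da."""
    return mz * ppm / 1_000_000


def within_ppm(observed: float, theoretical: float, ppm: float = 10.0) -> bool:
    """Check whether an observed m/z lies within the ppm window."""
    return abs(observed - theoretical) <= ppm_to_da(theoretical, ppm)


# At m/z 500, a tolerance of 10 ppm corresponds to +/- 0.005.
print(ppm_to_da(500.0, 10.0))
print(within_ppm(500.004, 500.0), within_ppm(500.006, 500.0))
```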

2. De novo peptide sequencing

In order to generate de novo annotations with Novor and pNovo+, raw files had to be converted to mgf format using the ProteoWizard package v 3.0.7398 (31).

a. PEAKS

PEAKS Studio 7.5 (http://www.bioinfor.com) was used to generate de novo peptide annotations. Thermo raw files were loaded directly, and precursor mass correction was selected as the only sample preprocessing option. Enzyme specificity was set to trypsin, and error tolerances were set to 10 ppm for the precursor mass and 0.02 Da for the fragment ions. Carbamidomethylation of cysteines was set as a fixed modification and oxidation of methionine as a variable modification, with a maximum of 3 variable modifications allowed per peptide. Five candidates per spectrum were reported, and the highest-ranking candidate sequence was used further.

b. Novor and PepNovo

Novor and PepNovo were both operated via the DeNovoGUI interface in version 1.9.6 (18). Precursor mass tolerance was set to 10 ppm and fragment mass tolerance to 0.02 Da, trypsin was selected for enzyme specificity, carbamidomethylation of cysteines was set as a fixed modification and oxidation of methionine as a variable modification. Up to 10 matches per
spectrum were allowed, of which the top-ranking one was exported to a csv file via the DeNovoGUI export option.

c. pNovo+

The command-line interface was used to control pNovo+. In the parameter file, the same conditions as mentioned above for the other de novo sequencing algorithms were defined, and the top-ranking peptide sequence for every annotated spectrum was used.

3. Direct spectra comparison

Spectra derived from the R. auricularia dataset as well as spectra of synthetic peptides were converted to mgf file format as described above and loaded into R (32) using the MSnbase package (33) of the Bioconductor environment. Spectra comparison plots and dot products for comparing spectra were generated using the MSnbase package.
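
To illustrate what a dot product between two spectra measures, the following sketch bins the peaks of each spectrum and computes a normalized dot product (cosine similarity). It is an independent illustration and does not reproduce the binning or normalization of the MSnbase package used in this study; the bin width of 0.02 and all names are assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.02):
    """Sum peak intensities into m/z bins of the given width."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def normalized_dot_product(peaks_a, peaks_b, bin_width=0.02):
    """Cosine similarity of two binned spectra (1.0 for identical spectra)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical peak lists as (m/z, intensity) pairs.
endogenous = [(175.119, 40.0), (262.151, 100.0), (375.235, 55.0)]
synthetic = [(175.119, 35.0), (262.151, 100.0), (375.235, 60.0)]
print(round(normalized_dot_product(endogenous, synthetic), 3))
```

Fixed-width binning is a simplification: two peaks that lie close to a bin boundary can fall into neighboring bins and then no longer contribute to the dot product.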

Results

MS/MS method optimization

For all method optimization steps mentioned below, aliquots of the tryptic HeLa digest were used, while PEAKS Studio 7.5 and Mascot (version 2.4) were chosen as representatives of the de novo sequencing and database search algorithms, respectively. De novo peptide sequencing benefits considerably from acquiring fragment ion spectra at high resolution (19). Therefore, a Q Exactive HF was chosen for this study. The resolution was set to 15,000 for MS/MS acquisition, since this provides sufficient resolving power to distinguish commonly mis-assigned amino acid pairs without unnecessarily increasing the transient time in complex samples.
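
As a rough illustration of why accurate fragment masses help, the snippet below compares the mass difference between the near-isobaric residues K and Q with the 0.02 Da fragment tolerance used for the searches in this study. The residue masses are standard monoisotopic values; the comparison is a simple plausibility check and not a model of instrument resolving power.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {"K": 128.094963, "Q": 128.058578}
FRAGMENT_TOLERANCE_DA = 0.02  # fragment ion tolerance stated in the methods

delta = RESIDUE_MASS["K"] - RESIDUE_MASS["Q"]
print(f"K - Q = {delta * 1000:.1f} mDa")
print("exceeds fragment tolerance:", delta > FRAGMENT_TOLERANCE_DA)
```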

Co-isolation of precursors and the subsequent acquisition of chimeric spectra pose considerable difficulties in proteomics experiments in general and particularly impair de novo peptide sequencing (34). To investigate this, HeLa digests were measured in triplicate with the quadrupole isolation window set to 2.0, 1.0 and 0.4 m/z, respectively. For every MS/MS scan, the top-ranking PEAKS hit with an ALC > 50 was compared to the corresponding Mascot annotation. Only fully annotated, completely correct peptide sequences were considered true hits, with L and I treated as equal. Narrowing the isolation window significantly reduced the total number of false positives, while the number of true positives remained constant throughout the runs (data not shown). Therefore, the narrowest isolation window (i.e. 0.4 m/z) was chosen for subsequent analyses.

Evaluation of de novo algorithms

As pointed out by Devabhaktuni and Elias (22), performing high-throughput de novo sequencing of complex samples requires that one is able to determine completely correct de novo sequence annotations at defined FDRs. To validate this, datasets of three different organisms were generated, applying the aforementioned optimized MS/MS settings. Each de novo sequencing algorithm, i.e. PEAKS, Novor and pNovo+, was compared against the combined annotations of the reference database search algorithms Mascot, SEQUEST and MS Amanda, stringently filtered to a PSM-level FDR of 1%, thus creating a pseudo-ground-truth dataset. Notably, MS/MS scans assigned only by the de novo algorithms but not by the database search engines were not considered in this comparison. Figure 1 shows the performance of the algorithms on triplicate measurements of the HeLa dataset as ROC (receiver operating characteristic) curves. As mentioned above, only PSMs that agreed with the database search engines at 1% FDR (I and L considered equal) were treated as true positives.
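
The comparison described above can be sketched as an empirical FDR calculation at a given score cutoff, using the database search result as reference. The sketch assumes that the FDR is taken as the fraction of retained de novo annotations that disagree with the reference; the function names and the example data are hypothetical.

```python
def matches(de_novo: str, reference: str) -> bool:
    """Full-length sequence match with I and L treated as equal."""
    return de_novo.replace("I", "L") == reference.replace("I", "L")


def empirical_fdr(psms, cutoff):
    """Return (number of retained PSMs, fraction of them that are false).

    `psms` is a list of (score, de novo sequence, reference sequence) tuples
    for scans that carry a database search annotation.
    """
    kept = [psm for psm in psms if psm[0] >= cutoff]
    if not kept:
        return 0, 0.0
    false_hits = sum(1 for _, de_novo, ref in kept if not matches(de_novo, ref))
    return len(kept), false_hits / len(kept)


# Hypothetical scored annotations, for illustration only.
psms = [
    (95, "PEPTIDEK", "PEPTLDEK"),  # agrees (I/L equal)
    (90, "AGLK", "QLK"),           # isobaric misassignment
    (80, "SAMPLER", "SAMPLER"),    # agrees
    (60, "TESTK", "TSETK"),        # permutation error
]
for cutoff in (50, 80, 95):
    print(cutoff, empirical_fdr(psms, cutoff))
```

Evaluating such a function over a range of cutoffs yields the pairs of retained identifications and error rates from which a ROC-type curve can be drawn.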

For pNovo+, the scoring scheme does not seem to discriminate efficiently between true and false positive hits, and therefore it was not possible to create a meaningful ROC curve for this algorithm. PepNovo, on the other hand, primarily annotates ambiguous spectra with mass tags rather than providing full peptide sequences. While this approach certainly has its benefits, it is not compatible with the general concept presented in this work. Therefore, results from PepNovo were not considered in the further analyses. Taking a closer look at the putative false positive hits, i.e. where de novo sequencing and database search algorithms disagree, it becomes obvious that a majority of these disparities stem from permutations of a pair of amino acids or the wrong interpretation of isobaric amino acid combinations, e.g. Q instead of GA or AG. As depicted in figure 2, errors occur predominantly at the N-terminus of the peptide sequence and are most likely the product of incomplete fragment ion ladders. Specifically, the absence of the crucial b1-ion, which is usually not detected due to oxazolone ring formation of the b2-ion (35, 36), contributes largely to these observed sequencing ambiguities, particularly for longer sequences where y-ion series in HCD tend to decrease in intensity at higher m/z.
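
The Q versus GA/AG ambiguity mentioned above is a case in which mass accuracy alone cannot help, because the residue pair and the single residue have the same elemental composition. The following sketch checks this with standard monoisotopic residue masses; the helper names and the tolerance are chosen for illustration.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {"G": 57.021464, "A": 71.037114, "Q": 128.058578}


def residue_mass(sequence: str) -> float:
    """Sum of the residue masses of a short sequence stretch."""
    return sum(RESIDUE_MASS[aa] for aa in sequence)


def isobaric(a: str, b: str, tolerance_da: float = 1e-4) -> bool:
    """True if two stretches cannot be told apart by their summed mass."""
    return abs(residue_mass(a) - residue_mass(b)) <= tolerance_da


print(isobaric("GA", "Q"), isobaric("AG", "Q"), isobaric("G", "A"))
```

Such stretches can only be resolved by a fragment ion that cleaves between the two residues, which is why missing N-terminal fragments make these errors hard to avoid.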

Combined approach

As can be derived from the ROC curves depicted above (Fig. 1), the recovery of true peptide sequences drops drastically when the FDR is reduced to a stringent level, e.g. 10% or 5% FDR. Therefore, as an additional discriminatory measure, we here propose to use the agreement of different de novo sequencing algorithms on distinct MS/MS spectra. When combining the results of two algorithms (workflow in figure 3 (A)), additional score cutoffs have to be applied in order for the results to reach a confidence level below 5% FDR. Combination of three algorithms
(figure 3 (B)) renders the use of score cutoffs obsolete, as the ‘agreeing’ subset as a whole shows an FDR of