Improving the Identification Rate of Endogenous Peptides Using


Article

Improving the identification rate of endogenous peptide using electron transfer dissociation and collision-induced dissociation Eisuke Hayakawa, Gerben Menschaert, Pieter-Jan De Bock, Walter Luyten, Kris Gevaert, Geert Baggerman, and Liliane Schoofs J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/pr400446z • Publication Date (Web): 16 Sep 2013 Downloaded from http://pubs.acs.org on September 22, 2013


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Improving the identification rate of endogenous peptide using electron transfer dissociation and collision-induced dissociation

Eisuke Hayakawa a,*, Gerben Menschaert b, Pieter-Jan De Bock c,d, Walter Luyten e, Kris Gevaert c, Geert Baggerman f,g, Liliane Schoofs a

a Research Group of Functional Genomics and Proteomics, KU Leuven, Leuven, Belgium

b Faculty of Bioscience Engineering, Laboratory for Bioinformatics and Computational Genomics, Ghent University, Ghent, Belgium

c Department of Medical Protein Research, VIB, Ghent, Belgium

d Department of Biochemistry, Ghent University, Ghent, Belgium

e Biomedical Sciences Group, Department of Pharmaceutical and Pharmacological Sciences, KU Leuven, Leuven, Belgium

f VITO, Mol, Belgium

g CFP/Ceproma, University Antwerpen, Antwerpen, Belgium


KEYWORDS Endogenous peptide, Neuropeptide, Peptidomics, Neuropeptidomics, Peptidogenomics, Peptide identification, Bioinformatics, Tandem mass spectrometry, Electron Transfer Dissociation

Abstract

Tandem mass spectrometry (MS/MS) combined with bioinformatics tools has enabled fast and systematic protein identification based on peptide-to-spectrum matches. However, it remains challenging to obtain accurate identification of endogenous peptides, such as neuropeptides, peptide hormones, peptide pheromones, venom peptides and antimicrobial peptides. Since these peptides are processed at sites that are difficult to predict reliably, the search of their MS/MS spectra in sequence databases needs to be done without any protease setting. In addition, many endogenous peptides carry various post-translational modifications, making it essential to take these into account in the database search. These characteristics of endogenous peptides result in a huge search space, frequently leading to poor confidence of the peptide characterizations. We have developed a new MS/MS spectrum search tool for highly accurate and confident identification of endogenous peptides that combines two independent fragmentation methods, collision-induced dissociation and electron transfer dissociation. Peptide-to-spectrum matching is carried out separately for each method, and the final score is built as a combination of the two separate scores. We demonstrate that this approach is very effective in discriminating correct peptide identifications from false hits. We applied this approach to a spectral data set of neuropeptides extracted from mouse pituitary tumor cells. Compared to conventional MS-based identification, i.e., using a single fragmentation method, our approach significantly increased the peptide identification rate. It also proved highly effective for scanning spectra against a very large search space, enabling more accurate genome-wide searches and searches including multiple potential post-translational modifications.

Introduction

In proteomics and peptidomics, mass spectrometry is the main tool for the identification of proteins and peptides. Since tandem mass spectrometry (MS/MS) was introduced, MS/MS-based peptide database searching has become one of the most commonly applied techniques in proteomics. The basic idea behind assignment of an MS/MS spectrum to a peptide sequence is matching an experimental fragmentation spectrum to a theoretical spectrum generated from a peptide contained in a protein sequence database. Nowadays, numerous MS/MS database search programs based on this principle are available, including SEQUEST, Mascot, X!Tandem and OMSSA.1–4 These search engines have enabled high-throughput protein identification, thereby largely contributing to our understanding of the proteome maps of various organisms. Although these search engines were developed for protein identification by assigning MS/MS spectra to peptides resulting from enzymatic digestion (e.g., tryptic peptides), they have also been widely used for the identification of naturally occurring, endogenous peptides, such as neuropeptides, peptide hormones, venom peptides and antimicrobial peptides.5–8 However, obtaining reliable identifications for endogenous peptides using these search engines is often difficult. Such peptides are derived from precursor proteins that are cleaved by various types of proteases.9–11 Unlike in enzymatic digests generated in vitro, cleavage sites in a peptide precursor protein cannot always be readily and accurately predicted. Therefore, MS/MS-based peptide identification needs to be performed without protease specification, which leads to a huge search space. Another complexity arises from post-translational modifications (PTMs),12,13 which frequently occur in endogenous peptides. These PTMs are often essential for the peptide’s biological activity and molecular stability. The most common PTMs are C-terminal amidation14 and pyroglutamic acid at the N-terminus. These PTMs prevent enzymatic degradation of peptides by exopeptidases. In addition, endogenous peptides may display various other PTMs, such as acetylation, oxidation, formylation, hydroxylation, phosphorylation, sulfation and glycosylation. Therefore, it is essential to take into account multiple variable PTMs in a database search. As a consequence, the search space for MS/MS spectra grows exponentially, thereby decreasing the sensitivity of the search. Moreover, unlike protein identification based on enzymatic digests, endogenous peptide identification has to be inferred based on a single MS/MS spectrum. Therefore, high accuracy and reliability of the peptide-to-spectrum matching is important for endogenous peptide identification. Relatively recently, electron capture dissociation (ECD) and electron transfer dissociation (ETD) have been introduced in proteomics research.15,16 In general, ETD and ECD induce fragmentation of the peptide backbone at the N-Cα bond.
In contrast to collision-induced dissociation (CID), which gives rise to b- and y- fragment ions, ECD and ETD generate mainly c-, y- and z- fragment ions together with other minor species,17 and thereby provide structural information that is complementary to CID. ECD and ETD data can be analysed with existing database search engines. Although most of these engines were developed and adapted for CID data, many search engines are capable of assigning c- and z- ion types. Lately, a number of approaches were proposed for utilizing a combination of CID and ETD fragmentation to improve the sensitivity of MS/MS-based peptide identification in proteomics.18–24

Nielsen et al. proposed a pre-search processing technique using a combination of CID and ECD spectra.18 In their approach, complementary pairs of CID and ETD fragment ions are identified based on the mass differences of b-/c- and y-/z- pairs after performing charge and isotope reduction. Such complementary fragments are exported and used for MS/MS searches. This approach increased the average Mascot protein score by 64%. Another approach uses MS/MS search results of CID and ETD in a post-search process.25 Wright et al. used CID and ETD spectra in their post-search statistical rescoring tool Percolator, and obtained an improvement of peptide identifications.25

Kim et al. implemented a function to use the complementarity of CID and ETD in their MS-GF algorithm,19 which converts a fragmentation spectrum into a scored version of that spectrum, possessing a score at every mass. The MS-GFDB engine merges the scored spectra of CID and ETD fragmentation and generates a summed spectrum. Using their MS-GFDB framework, they improved the number of tryptic peptide identifications by 13% and 27%, respectively, over separate CID or ETD searches. Recently, ETD has also been applied to the identification of endogenous peptides, and several studies have shown a significant improvement of the identification of endogenous peptides by combining the results of CID and ETD peptide searches.26–28 Sasaki et al. performed large-scale CID and ETD analysis for endogenous secretory peptides, and obtained 795 and 569 peptides, respectively, with an overlap of 397.28 Here, we present an alternative method that is based on the combination of CID and ETD fragmentation spectra and that is specifically developed for the identification of endogenous peptides. The method presented here is very effective for searching endogenous peptides against large search spaces.

Materials and Methods

Overview of the new identification approach. Figure 1 illustrates the overview of our approach for identifying endogenous peptides using a combination of CID and ETD spectra. In brief, MS/MS-based endogenous peptide identification is carried out through the following steps:

(1) The target protein database is scanned for possible peptide sequences by changing the location of the first residue and the length of a potential peptide in a protein. All peptides that match the observed mass within a given mass tolerance are used for the following step.

(2) For each peptide candidate found in the database, theoretical fragmentation spectra for CID (b- and y-ions) and ETD (c- and z-ions) are generated in silico.

(3) Experimental CID and ETD spectra are compared to the corresponding theoretical fragmentation patterns and a score is given to every candidate peptide, based on the number of matched fragment ions and their intensities.

(4) The score of each candidate peptide is generated by combining the CID and ETD matching scores.

(5) A stochastic score distribution is modeled to estimate an expectation value (e-value).
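Step (1) can be illustrated with a short Python sketch. It is not the MPCombi implementation (which is written in C++); it only shows one way to enumerate every start position and length in a protein and keep the substrings whose mass falls within a ppm tolerance. The function names and the `max_len` limit are hypothetical, and only unmodified peptides with standard monoisotopic residue masses are considered.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER = 18.010565  # added once per peptide for the free termini


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return WATER + sum(RESIDUE_MASS[aa] for aa in sequence)


def find_candidates(protein, observed_mass, tol_ppm=5.0, max_len=60):
    """Return (start, peptide) for every substring matching observed_mass.

    Every residue is treated as a possible N- and C-terminal cleavage site.
    max_len is a hypothetical safety limit, not a parameter from the paper.
    """
    tol = observed_mass * tol_ppm * 1e-6
    hits = []
    for start in range(len(protein)):
        mass = WATER
        for end in range(start, min(len(protein), start + max_len)):
            residue = RESIDUE_MASS.get(protein[end])
            if residue is None:   # ambiguous residue (e.g. X): stop extending
                break
            mass += residue
            if mass > observed_mass + tol:
                break             # longer peptides can only be heavier
            if mass >= observed_mass - tol:
                hits.append((start, protein[start:end + 1]))
    return hits


# Toy protein containing the peptide discussed in the Results section.
protein = "AAAGGFMTSEKSQTPKK"
target = peptide_mass("GGFMTSEKSQTP")
print(find_candidates(protein, target))
```

Because the inner loop stops as soon as the running mass exceeds the tolerance window, the work per start position stays bounded even though no protease rule restricts the candidates.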

Spectra preprocessing. The algorithm takes MS/MS peak list files as input (currently Mascot generic format). First, the peaks corresponding to the precursor ion, within a given mass tolerance, are removed from these peak lists. For ETD spectra, charge-reduced precursor ion peaks are also removed. Second, obvious non-monoisotopic peaks are removed by examining peaks within a 1.5 m/z window and eliminating the less intense peaks in this region. Third, the most intense peaks are selected for peptide-to-spectrum matching. The number of peaks retained depends on the number of amino acids (l) estimated from the molecular weight of the peptide ion (mw), which is calculated from the observed m/z value and the charge state. Since the average molecular weight of an amino acid is 110 Da, l can be calculated as

$$ l = \frac{mw}{110} $$

In the present study, the 5l most intense peaks are selected and used for the following spectrum matching processes. Finally, the intensities of the filtered spectrum are normalized to have a maximum intensity of 50. These values for the spectrum preprocessing are user-adjustable. More details of the implementation can be found in supplemental text.
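The four preprocessing steps can be sketched in Python as follows. This is an illustrative reading of the description above, not the authors' code: the function name, the default precursor tolerance of 2.0 m/z, the rounding of l to the nearest integer, and the de-isotoping rule (drop a peak when a more intense peak lies within the window) are all assumptions.

```python
PROTON = 1.007276  # mass of a proton in Da


def preprocess(peaks, precursor_mz, charge, charge_reduced_mzs=(),
               precursor_tol=2.0, isotope_window=1.5,
               peaks_per_residue=5, max_intensity=50.0):
    """Filter and normalize one MS/MS peak list of (m/z, intensity) pairs."""
    # Step 1: drop precursor peaks (and, for ETD, charge-reduced precursors).
    blocked = [precursor_mz, *charge_reduced_mzs]
    kept = [(mz, inten) for mz, inten in peaks
            if all(abs(mz - b) > precursor_tol for b in blocked)]

    # Step 2: crude de-isotoping (assumed rule). A peak is dropped when a
    # more intense peak lies within isotope_window of it.
    kept = [(mz, inten) for mz, inten in kept
            if not any(abs(other_mz - mz) <= isotope_window
                       and other_inten > inten
                       for other_mz, other_inten in kept)]

    # Step 3: keep the 5*l most intense peaks, with l = mw / 110.
    mw = (precursor_mz - PROTON) * charge
    n_residues = max(1, round(mw / 110.0))
    kept = sorted(kept, key=lambda p: p[1], reverse=True)
    kept = kept[:peaks_per_residue * n_residues]

    # Step 4: rescale so that the most intense peak equals max_intensity.
    if not kept:
        return []
    scale = max_intensity / kept[0][1]
    return sorted((mz, inten * scale) for mz, inten in kept)


raw = [(100.0, 10.0), (100.5, 4.0), (200.0, 20.0), (300.0, 5.0), (500.0, 100.0)]
print(preprocess(raw, precursor_mz=500.0, charge=2))
```

In this toy spectrum the precursor peak at m/z 500 and the weaker peak at m/z 100.5 are removed, and the remaining three peaks are rescaled so that the base peak has intensity 50.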

Peptide search and spectrum matching. First, the algorithm searches for all possible peptide sequences in a database falling within a given mass tolerance, considering every amino acid as a potential cleavage site. Subsequently, theoretical MS/MS spectra of the retained candidate peptides are generated, both with CID and ETD fragmentation schemes. Matching between theoretical and observed fragment ions is carried out for both fragmentation methods separately. For CID, b-type (b-fragment ions and b-fragment ions that lost H2O) and y-type ion species (y-fragment ions and y-fragment ions that lost NH3 or H2O) are matched. For ETD spectra, c-type (c’- and c˙-fragment ions) and z-type (z’- and z˙-fragment ions) fragment ions are matched. For CID, if the charge state of the precursor is 2+, then 1+ product ions are used for spectrum matching. Additionally, 2+ product ions are considered for 3+ or higher-charged precursor ions. For ETD, 1+ product ions are considered for 2+ precursor ions, and 1+ and 2+ product ions for 3+ or higher-charged precursor ions.

If variable modifications are selected, the database search is performed based on the peptide mass, taking into account the mass shifts caused by the possible modifications. For spectrum-to-peptide matching, multiple theoretical spectra are generated considering the mass and the potential location of specified modifications, and they are compared to the experimental spectra. To suppress the combinatorial expansion of the number of possible forms of modified peptides when multiple variable modifications are selected, the upper limit of the number of combinations of modifications is set in the same way as in OMSSA.4 In brief, the algorithm generates all possible combinations of modifications, increasing the number of modifications on a peptide, until the number of combinations reaches the upper limit. More details of the implementation can be found in the supporting text.
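To make the theoretical spectra concrete, the following Python sketch computes m/z values for the main b-, y-, c- and z-type ions of an unmodified peptide. It is a simplified illustration rather than the authors' code: neutral-loss ions (H2O, NH3) and the c˙/z’ variants mentioned above are omitted, the z-type value corresponds to the radical z˙ ion, and the function name is hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
PROTON = 1.007276
WATER = 18.010565
AMMONIA = 17.026549
HYDROGEN = 1.007825


def fragment_mz(peptide, charge=1):
    """Return {"b": [...], "y": [...], "c": [...], "z": [...]} of m/z values.

    Entry k-1 of each list belongs to the cleavage after residue k, so the
    b/c lists grow from the N-terminus while the y/z lists shrink from the
    full-length C-terminal fragment.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    ions = {"b": [], "y": [], "c": [], "z": []}
    prefix = 0.0
    for k in range(1, len(peptide)):
        prefix += masses[k - 1]
        suffix = total - prefix
        neutral = {
            "b": prefix,                                # N-terminal acylium
            "y": suffix + WATER,                        # C-terminal fragment
            "c": prefix + AMMONIA,                      # b + NH3
            "z": suffix + WATER - AMMONIA + HYDROGEN,   # z-dot radical
        }
        for name, mass in neutral.items():
            ions[name].append((mass + charge * PROTON) / charge)
    return ions


ions = fragment_mz("GGFMTSEKSQTP")
print(len(ions["b"]), round(ions["b"][0], 4), round(ions["y"][-1], 4))
```

For a 3+ or higher-charged precursor the function would be called a second time with `charge=2`, following the charge-state rules described above.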

Scoring function. We designed a score for peptide-to-spectrum matching based on a spectral pair of CID and ETD, termed the COMBI score (S_COMBI):

$$ S_{COMBI} = \log\left( \left( \sum_{i=1}^{m} I_i P_i \right) N_b! \, N_y! \right) + \log\left( \left( \sum_{i=1}^{n} J_i P_i \right) N_c! \, N_z! \right) $$

The COMBI scoring function comprises two subscoring functions for CID and ETD spectra (S_CID, S_ETD), which are based on the hyperscore method proposed by Fenyö and Beavis.3,29,30

$$ S_{CID} = \log\left( \left( \sum_{i=1}^{m} I_i P_i \right) N_b! \, N_y! \right) $$

$$ S_{ETD} = \log\left( \left( \sum_{i=1}^{n} J_i P_i \right) N_c! \, N_z! \right) $$
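The scoring functions can be sketched in Python as shown below. Several points are assumptions made for illustration: the COMBI score is taken to be the sum of the two log-hyperscore subscores, the logarithm is taken as the natural logarithm, a spectrum without any matched peak receives a subscore of minus infinity, and the fragment tolerance defaults to the 0.4 Da used in this study. The function names are hypothetical.

```python
import math


def _is_matched(mz, reference, tol):
    """True if any reference m/z lies within tol of mz."""
    return any(abs(mz - r) <= tol for r in reference)


def log_hyperscore(observed, series_1, series_2, tol=0.4):
    """log( sum(I_i * P_i) * N_1! * N_2! ) for one spectrum.

    observed: list of (m/z, normalized intensity) peaks.
    series_1, series_2: theoretical m/z values (b and y, or c and z).
    """
    theoretical = list(series_1) + list(series_2)
    matched_intensity = sum(inten for mz, inten in observed
                            if _is_matched(mz, theoretical, tol))
    if matched_intensity <= 0.0:
        return float("-inf")   # assumption: no matched peak, no score
    observed_mz = [mz for mz, _ in observed]
    n1 = sum(1 for t in series_1 if _is_matched(t, observed_mz, tol))
    n2 = sum(1 for t in series_2 if _is_matched(t, observed_mz, tol))
    # lgamma(n + 1) equals log(n!) and avoids huge intermediate factorials.
    return (math.log(matched_intensity)
            + math.lgamma(n1 + 1) + math.lgamma(n2 + 1))


def combi_score(cid_peaks, etd_peaks, b_ions, y_ions, c_ions, z_ions, tol=0.4):
    """Assumed combination: S_COMBI = S_CID + S_ETD."""
    return (log_hyperscore(cid_peaks, b_ions, y_ions, tol)
            + log_hyperscore(etd_peaks, c_ions, z_ions, tol))


cid = [(100.0, 10.0), (200.0, 20.0), (300.0, 5.0)]
etd = [(150.0, 50.0), (160.0, 10.0)]
score = combi_score(cid, etd, b_ions=[100.1], y_ions=[200.2, 250.0],
                    c_ions=[150.0, 160.3], z_ions=[])
print(round(score, 3))
```

Working in log space keeps the factorial terms numerically manageable for long peptides with many matched ions.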

The m-point vector I and the n-point vector J represent the observed fragment ions of CID and ETD, respectively. The values of I_i and J_i correspond to the normalized intensities of the observed fragment ions. Matched peaks are given P_i = 1, whereas unmatched peaks are given P_i = 0. N_b!, N_y!, N_c! and N_z! represent the factorials of the number of matched b-, y-, c- or z-type ions, respectively. The confidence of a peptide-to-spectrum match (PSM) is assessed by estimating the e-value based on the score distribution. In this study, the generalized extreme value distribution (EVD) theory31 was applied to model the score distribution and estimate the e-value. The distribution function of the EVD is

$$ F(x) = \exp\left( -\exp\left( -\frac{x - \mu}{\sigma} \right) \right) $$

To model the distribution of PSM scores, the empirical cumulative distribution is generated from the acquired scores. To improve the estimation in the tail of the distribution, a linear regression line is obtained by a smoothed linear least-squares fit on the higher scoring portion (30%-90% upper values of the stochastic score distribution) of the log(-log)-transformed empirical cumulative distribution function.32,33 The estimators of location (μ) and scale (σ) of the distribution function are obtained from the regression line. The e-value is the product of the probability estimated by the modeled EVD and the number of candidate peptides in the database.
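The e-value estimation can be illustrated with the following Python sketch. It is not the R code used by the authors. It assumes that the "30%-90%" portion refers to empirical cumulative probabilities between 0.30 and 0.90, uses an ordinary (unsmoothed) least-squares fit, and takes the tail probability as 1 - F(x); all function names are hypothetical.

```python
import math
import random


def fit_evd(scores, lo=0.30, hi=0.90):
    """Estimate location (mu) and scale (sigma) from a list of scores.

    For F(x) = exp(-exp(-(x - mu) / sigma)), the transform
    y = -log(-log(F(x))) is linear in x with slope 1/sigma and
    intercept -mu/sigma, so both parameters follow from a line fit.
    """
    xs = sorted(scores)
    n = len(xs)
    points = []
    for rank, x in enumerate(xs, start=1):
        f = rank / (n + 1.0)              # empirical cumulative probability
        if lo <= f <= hi:
            points.append((x, -math.log(-math.log(f))))
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -intercept / slope, 1.0 / slope   # (mu, sigma)


def e_value(score, mu, sigma, n_candidates):
    """Tail probability 1 - F(score) times the number of candidates."""
    tail = -math.expm1(-math.exp(-(score - mu) / sigma))
    return tail * n_candidates


# Synthetic check: draw scores from a known EVD and recover its parameters.
random.seed(0)
simulated = [5.0 - 2.0 * math.log(-math.log(random.random()))
             for _ in range(5000)]
mu_hat, sigma_hat = fit_evd(simulated)
print(round(mu_hat, 2), round(sigma_hat, 2), e_value(40.0, mu_hat, sigma_hat, 5816))
```

Using `math.expm1` keeps the tail probability accurate when it becomes very small, which is the regime of interest for confident identifications.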


Implementation. The algorithm is written in C++ (Visual Studio 2008 Express, Microsoft) using the Standard Template Library and the Boost Library (boost.org). R (Rproject.com) is used to calculate the e-value based on the score distribution. The program can be installed on a Microsoft Windows operating system (XP, Vista, 7, 8). The software package MPCombi is available online (https://bitbucket.org/ehayakawa/mpcombi/wiki/Home and http://sourceforge.net/projects/mpcombi/) under GPL license, and the source code can also be requested from the authors.

Sample preparation. Mouse pituitary tumor (AtT-20, ATCC No. CCL-89) cells were grown in Dulbecco’s minimum essential medium containing 10% fetal bovine serum at 37 °C in a humidified atmosphere containing 5% CO2. Immediately prior to harvesting, the cells were washed with phosphate-buffered saline (PBS) and harvested using a trypsin/EDTA solution. The cells (approximately 2 × 10^6 cells) were centrifuged at 150 × g for 4 min at 4 °C, and the pellets were washed with PBS, snap-frozen in liquid nitrogen and stored at -80 °C until peptide extraction. The pellets were dissolved in 200 µL of a solution containing 90% (v/v) methanol, 9% Milli-Q water and 1% formic acid. After sonication in a Bransonic 2510 apparatus (Branson Ultrasonics, CT, USA) for 5 min, samples were centrifuged (10,000 × g for 20 min at 4 °C) to remove the protein debris, and the supernatant was transferred to a new tube, freeze-dried and stored at -80 °C until analysis.

Mass spectrometry. Dried peptide extract was re-dissolved in 30 µL of 2% acetonitrile and 0.1% TFA. The peptides were analyzed on an Ultimate 3000 HPLC system (Dionex, Amsterdam, The Netherlands) connected in-line to an LTQ Orbitrap XL mass spectrometer (Thermo Electron, Bremen, Germany). A 60 min gradient from 2% to 50% acetonitrile, both containing 0.1% formic acid, followed by a washing and re-equilibration step on an in-house packed, 15 cm long column (75 µm inner diameter, Reprosil-Pur Basic C18-HD 3 µm, Dr. Maisch, Germany) was used to elute the peptides from the trapping column (100 µm inner diameter, Reprosil-Pur Basic C18-HD 5 µm, Dr. Maisch, Germany). Per LC-MS/MS analysis, 2.5 µL of the peptide mixture was consumed. The LTQ-Orbitrap mass spectrometer was operated in data-dependent mode, automatically switching between MS and MS/MS acquisition for the three most abundant peaks in a given MS spectrum. Full-scan MS spectra were acquired in the Orbitrap at a target value of 1E6 with a resolution of 60,000. The three most intense ions were then isolated for fragmentation in the linear ion trap. A chosen precursor ion was first fragmented by CID after filling the ion trap with a maximum ion time of 300 ms and a maximum of 1E4 ion counts. Then, the same precursor ion was fragmented by ETD. The precursor ion was activated for 110 ms with 3E5 reagent ions in the ion trap, with supplemental activation set on. This method allowed us to obtain both a CID and an ETD spectrum from the same precursor. The mass spectrometry raw data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD000186.34 The raw data files were processed with Proteome Discoverer 1.3. The following spectrum filters were applied: minimum precursor mass 350 Da, maximum precursor mass 5000 Da, minimum peak count 1, minimum S/N threshold 1.5. The CID and ETD spectra were then written to Mascot generic files.

Test application. All experimental fragmentation spectra were searched against the UniProtKB/SwissProt mouse protein sequence database (release 2011_09), containing 16,570 reviewed mouse sequences. For the genome-wide search, the mouse genomic sequence (NCBI build 37.61) was directly translated in its 6 reading frames and used for spectral searching. No enzymatic cleavage was assumed during the database searches. Search parameters for mass tolerance were as follows: the parent ion mass accuracy was set to ± 5 ppm and the fragment ion mass accuracy to ± 0.4 Da. For comparison, OMSSA4 (version 2.1.9) was used with the same mass tolerance settings. For fragment ion species, b- and y- ions were selected for CID, and c-, y- and z- ions for ETD. Precursor charge state detection was set to read from input file data. For ETD spectra, elimination of charge-reduced precursors in spectra was allowed. In the search for modified endogenous peptides, a series of PTMs, including amidation (C-termini of peptides), pyroglutamylation from glutamine and glutamic acid (N-terminal Q and E of peptides), oxidation (M), sulfation (Y), hydroxylation (P), bromination (W), dioxidation (W), acetylation (N-termini of peptides), Cation:Na (C-terminal D and E of peptides), deamidation (N, Q), formylation (N-termini of peptides) and phosphorylation (S, T and Y), were taken into account as variable modifications. To limit the combinatorial expansion of modified peptide sequences, the maximum number of combinations of variable modifications per candidate peptide sequence was set to 32 for both OMSSA and our algorithm. A PSM is statistically evaluated as positive or negative based on the false discovery rate (FDR) estimated from the complete search. Estimation of the FDR was conducted by searching against a concatenated database comprising the original and a decoy (reversed amino acid sequences) dataset. In the present study, the FDR is defined as twice the number of PSMs found in the decoy dataset (FP) divided by the sum of the numbers of PSMs found in the target (TP) and decoy datasets that pass the e-value threshold, as follows:


$$ \mathrm{FDR} = \frac{2 \times FP}{TP + FP} $$

The results of the performed MS/MS searches are provided as supplemental material.
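The FDR definition given in the Materials and Methods translates directly into code. The Python sketch below is illustrative only: PSMs are represented as hypothetical (e-value, is_decoy) pairs, a PSM is assumed to pass when its e-value is at or below the threshold, and the cutoff search simply scans the observed e-values for the largest threshold that keeps the FDR below the requested level.

```python
def target_decoy_fdr(psms, threshold):
    """FDR = 2 * FP / (TP + FP) among PSMs with e-value <= threshold.

    psms: list of (e_value, is_decoy) pairs from the concatenated search.
    """
    fp = sum(1 for e, is_decoy in psms if e <= threshold and is_decoy)
    tp = sum(1 for e, is_decoy in psms if e <= threshold and not is_decoy)
    return 2.0 * fp / (tp + fp) if (tp + fp) else 0.0


def evalue_cutoff(psms, max_fdr=0.01):
    """Largest observed e-value threshold with FDR < max_fdr, or None."""
    best = None
    for threshold in sorted({e for e, _ in psms}):
        if target_decoy_fdr(psms, threshold) < max_fdr:
            best = threshold
    return best


# Synthetic PSM list: 199 confident targets, 1 confident decoy, 20 poor hits.
psms = ([(1e-5, False)] * 199 + [(1e-4, True)]
        + [(0.5, False)] * 10 + [(0.5, True)] * 10)
print(target_decoy_fdr(psms, 1e-3), evalue_cutoff(psms))
```

The decoy sequences themselves can be produced by reversing each protein string (`sequence[::-1]` in Python) before appending it to the target database.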

Results

Workflow. Figure 1 illustrates our approach for the identification of fragmentation spectra of endogenous peptides. The basic idea is similar to other MS/MS peptide search tools. However, the COMBI scoring method, which is based on a combination of two fragmentation methods, improves the distinction between correct and false peptide hits. This approach is particularly effective for endogenous peptide searches in a large search space. In general, CID and ETD spectral datasets are utilized to search the same database separately (double search), deriving statistical values, such as e-values, for each type of search (Fig. 2A). The resulting statistical values can be further used to determine if the PSMs are positive or negative by imposing an FDR threshold. In a double search approach, a union-based method can be applied by imposing the same FDR threshold on the CID and ETD results in order to combine the results of both searches. However, this leads to an increase of the actual FDR, since true discoveries (TP) of a CID and ETD spectral pair indicate the same true peptide, whereas false discoveries (FP) mostly indicate different peptides in the decoy dataset, increasing the number of FP relative to the acquired TP in a union search result.35 To solve this problem, one can impose a 1/k % FDR threshold for k different fragmentation methods (for CID and ETD, k = 2) on the separate search results, which keeps the union FDR below 1%. In contrast to a double search method, our approach performs a single search for CID and ETD spectral pairs, and the COMBI scores are given to each candidate peptide once (Fig. 2B). In addition to improving the sensitivity of the MS/MS search, this approach also allows a straightforward FDR calculation, thus avoiding the difficulties with merging the union results from a double search strategy.
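The FDR inflation of a union-based double search can be made tangible with a small numeric sketch in Python. The identifier sets below are entirely synthetic and chosen only for illustration: true hits from CID and ETD overlap because they point to the same peptides, whereas decoy hits do not.

```python
def fdr(n_target, n_decoy):
    """FDR = 2 * FP / (TP + FP), as defined in the Materials and Methods."""
    total = n_target + n_decoy
    return 2.0 * n_decoy / total if total else 0.0


def union_fdr(cid_targets, cid_decoys, etd_targets, etd_decoys):
    """FDR of the merged result list of two separate searches."""
    targets = set(cid_targets) | set(etd_targets)
    decoys = set(cid_decoys) | set(etd_decoys)
    return fdr(len(targets), len(decoys))


# Synthetic example: each search accepts 400 target peptides and 1 decoy.
cid_targets = {f"T{i}" for i in range(0, 400)}
etd_targets = {f"T{i}" for i in range(200, 600)}   # 200 peptides shared
cid_decoys = {"D1"}
etd_decoys = {"D2"}                                # decoy hits rarely coincide

single = fdr(len(cid_targets), len(cid_decoys))
merged = union_fdr(cid_targets, cid_decoys, etd_targets, etd_decoys)
print(f"single search FDR: {single:.4f}, union FDR: {merged:.4f}")
```

With these numbers the merged list contains 600 targets but 2 decoys, so its FDR exceeds that of either search alone, which is the effect that a stricter per-search threshold or a single combined search is meant to avoid.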

Effect of combining CID and ETD on the identification of endogenous peptides. We applied the algorithm to a series of CID and ETD MS/MS spectra of endogenous peptides, extracted from mouse pituitary tumor cell cultures. The detected peptides were found to be derived from classical endogenous peptide precursors such as the pro-opiomelanocortin, chromogranin and secretogranin family, and they are known to be expressed in the pituitary. These endogenous peptides have previously been observed in a peptidomics study of the mouse pituitary.36 To illustrate our approach, we discuss one PSM example (one of the analogs of β-endorphin, “GGFMTSEKSQTP”). Without any cleavage site specification, a total of 5,816 peptide sequences were found to fit the molecular mass with a precursor mass tolerance of ± 5 ppm in the Mus musculus Swiss-Prot protein dataset. Matching and scoring of the theoretical peptide fragmentation spectra against the observed CID and ETD spectral pair was performed for all these 5,816 peptide sequences. Of these candidate peptides, the theoretical fragmentation pattern of the correct peptide sequence “GGFMTSEKSQTP” showed 15 CID and 11 ETD fragment ion matches to the experimental spectrum (Fig. 3). To display the effect of the COMBI scoring function on the improvement of the e-value, we compare the score distribution of the COMBI score with those of the sub-scoring functions based on only CID or ETD (SCID and SETD). Figure 4 shows the score distributions of the separate searches with CID or ETD MS/MS spectra and the distribution of the COMBI score. The SCID of the correct peptide is the highest of all peptides extracted from the search database. However, the score is not significantly higher than the high-scoring portion of the stochastic population (the resulting e-value is 0.002, Fig. 4A). The SETD distribution also does not yield a significant e-value for the correct peptide (e-value = 0.007, Fig. 4B). On the other hand, in the distribution of the COMBI score (Fig. 4C), the score of the correct peptide sequence is significantly higher and clearly separated from the stochastic distribution, resulting in an improved e-value (1.9E-13).

To further assess the effect of using the COMBI score, we simulated score distributions by changing the number of matched fragment ions based on the same spectral pair acquired from the peptide (Figure 5). Note that the values of the COMBI score are higher than those of the CID and ETD subscores, but the pattern of the score distribution of the stochastic population is largely unaffected (Figs. 5A, B and C). When only 2 or 6 fragment ions are matched for both CID and ETD spectra, neither the CID and ETD subscores nor the COMBI score show significant separation between the top score and the stochastic population. With 8 matched fragment ions for both CID and ETD spectra, the COMBI score already shows a clear separation between the top score and the stochastic population (Fig. 5I), whereas the score distributions of the CID and ETD subscores fail to show significant separation (Figs. 5G and H). This example clearly demonstrates the positive impact of using the combination of two fragmentation methods.

Figure 6 shows the distribution of e-values of all PSMs from the LC-MS/MS data, which contain 1,706 CID/ETD spectral pairs. The MS/MS search was performed against the mouse Swiss-Prot dataset without taking into account variable PTMs. The e-values for the SCID or SETD subscoring functions are distributed above 1.0E-12. On the other hand, the e-values of the COMBI scoring are distributed in a much lower region, down to 1.0E-44. This shows the effect of the combination of CID and ETD on the score and e-value of the PSMs.

Comparison with a conventional MS/MS search method. To compare the performance of our method with conventional MS/MS search software, the results were further assessed by means of FDR calculations using a target-decoy approach (Fig. 7). It should be noted that our algorithm does not employ a multi-stage search approach, in which the result of a first search stage affects the portion of the search database used in the next stage in order to improve the overall search sensitivity. Our method performs a single search for each CID and ETD spectral pair against a single database. As explained in the Materials and Methods section, our scoring function is based solely on the peptide sequence and the experimental fragmentation spectra, meaning that our approach is compliant with the target-decoy approach as defined by Gupta et al.37 We compared our COMBI scoring approach with the OMSSA MS/MS search software, using the same CID and ETD spectral datasets. We specifically chose OMSSA because it meets the conditions required for this comparison. First, it can perform a search that considers every amino acid residue as a potential cleavage site, and it is able to process both CID and ETD spectral data. Second, it can search for modified peptides, taking into account multiple PTMs while limiting the number of combinations. Third, it can perform a search against a large sequence database containing the six-reading-frame translation of the mouse genome, as shown later. At the < 1% FDR level, the numbers of OMSSA CID and ETD positive hits are 183 and 100, respectively. In contrast, the COMBI scoring method yields 395 positive PSMs, of which 205 are attributable only to the COMBI scoring result (Fig. 7). In addition, we performed an OMSSA search using merged CID and ETD spectra. The number of positive PSMs is then 163, which is no improvement over the search using CID spectra only.
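
Counting positive PSMs at a given FDR level can be sketched in Python as follows. The sketch assumes a concatenated target-decoy search in which the FDR at an e-value threshold is estimated as the number of decoy hits divided by the number of target hits; this estimator, the function name and the input layout are assumptions made for illustration and are not taken from the software used in this study.

```python
def count_positives_at_fdr(psms, max_fdr=0.01):
    """Count target PSMs accepted at the loosest e-value cut-off whose
    estimated FDR (decoys / targets) stays below `max_fdr`.

    `psms` is an iterable of (e_value, is_decoy) pairs, one per spectrum.
    """
    ranked = sorted(psms, key=lambda psm: psm[0])  # best e-values first
    targets = decoys = best = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the largest target count seen while the FDR is acceptable.
        if targets and decoys / targets < max_fdr:
            best = targets
    return best
```

Sweeping `max_fdr` over a range of values gives the kind of curve of positive PSMs versus FDR that is used to compare scoring methods.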

Application to MS/MS searches in a large search space. In a routine MS/MS spectral database search for the identification of endogenous peptides, only the most common PTMs (e.g., C-terminal amidation and N-terminal pyroglutamic acid) are typically considered. However, many endogenous peptides are potentially modified in various other ways. Here, we performed our MS/MS peptide search while allowing a large number of variable PTMs, including C-terminal amidation, pyroglutamylation from N-terminal glutamine and glutamic acid, oxidation (M), sulfation (Y), hydroxylation (P), bromination (W), dioxidation (W), acetylation (N-termini of peptides), cation:Na (C-terminal D and E of peptides), deamidation (N and Q), phosphorylation (S, T and Y) and formylation (N-termini of peptides). Considering multiple PTMs leads to a combinatorial expansion of the modified peptide forms, thereby significantly increasing the number of candidate peptides (Fig. 8). Figure 9 shows the performance of the COMBI scoring method and the OMSSA CID and ETD searches under these conditions. The number of positive PSMs with OMSSA CID or ETD scoring is severely affected when multiple variable PTMs are taken into account, resulting in only 143 and 45 positive PSMs, respectively, at the < 1% FDR level. An OMSSA search using the merged CID and ETD spectra yielded 141 positive PSMs. In contrast, the COMBI score generates 346 positive PSMs. We detected modified versions of peptides derived from the same classical endogenous peptide precursors (pro-opiomelanocortin, secretogranin). These include classical endogenous peptide modifications, such as C-terminal amidation, pyroglutamylation from glutamine and N-terminal acetylation, previously reported in other studies.36,38,39 Mass shifts corresponding to relatively uncommon modifications were also detected, such as cation:Na, which may arise during sample processing in the presence of sodium.36
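
The combinatorial expansion caused by variable PTMs can be made concrete with a small Python sketch that counts the modified forms of a single peptide. It covers only a residue-specific subset of the modifications listed above, ignores terminal modifications, and assumes that each residue carries at most one modification; the function name and the `max_mods` limit are hypothetical.

```python
# Residue-specific subset of the variable PTMs (terminal PTMs are left out).
VARIABLE_PTMS = {
    "oxidation": "M",
    "phosphorylation": "STY",
    "sulfation": "Y",
    "hydroxylation": "P",
    "deamidation": "NQ",
}


def count_modified_forms(peptide, max_mods=None, ptms=VARIABLE_PTMS):
    """Number of distinct modification states of `peptide`, including the
    unmodified form, with at most `max_mods` modified residues."""
    if max_mods is None:
        max_mods = len(peptide)
    # forms[j] = number of ways to place exactly j modifications so far.
    forms = [1] + [0] * max_mods
    for residue in peptide:
        options = sum(residue in sites for sites in ptms.values())
        if options == 0:
            continue
        # Walk downwards so each residue is used at most once.
        for j in range(max_mods, 0, -1):
            forms[j] += forms[j - 1] * options
    return sum(forms)
```

For the peptide GGFMTSEKSQTP, seven residues are modifiable under this subset, giving 29 forms with at most two modifications and 128 forms without a limit, for every unmodified candidate of matching mass in the database.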

Similarly, fragmentation spectra of peptides were searched against the six-frame translation of the mouse genome sequence, without taking into account PTMs, in order to identify neuropeptides directly from the genome sequence. Such a large genomic sequence database leads to a very large search space, especially when no protease is specified (as is the case in peptidomics for the identification of endogenous peptides). As a result, the sensitivity of the MS/MS peptide search decreases: in this large search space, the OMSSA CID, ETD and merged-spectra searches resulted in 129, 22 and 81 positive PSMs, respectively, at the < 1% FDR level. The COMBI scoring, however, is more sensitive and yielded 275 positive PSMs (Fig. 10).
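
As an illustration of why this search space is so large, the following Python sketch performs a six-frame translation with the standard genetic code and counts the candidate peptides that a search without protease specificity has to consider. The function names and the length limits are illustrative assumptions; the sketch is not the database construction used in this study.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(s[offset:]) for s in (dna, reverse) for offset in range(3)]


def count_unspecific_peptides(protein, min_len=5, max_len=40):
    """Count substrings of allowed length that do not span a stop codon."""
    total = 0
    for segment in protein.split("*"):
        for length in range(min_len, max_len + 1):
            total += max(0, len(segment) - length + 1)
    return total
```

Because every start and end position is allowed, the number of candidates grows roughly with the sequence length multiplied by the number of permitted peptide lengths, and this applies to each of the six frames.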

COMBI scoring and OMSSA union approach. We also compared our COMBI score with the union of the OMSSA CID and ETD double search results (Fig. 11). For the OMSSA search, the CID and ETD spectra of each pair were searched separately, and a PSM was called positive when either the CID or the ETD spectrum from a precursor ion was found to be positive at a given FDR. Note that, as explained earlier (Fig. 2), a threshold of 1/k of the FDR (k = 2 for OMSSA CID and ETD) was imposed on each separate search before taking the union of the separate search results. We compared the performance of OMSSA and the COMBI scoring for searches in three databases: the protein database, the protein database with possible PTMs and the six-frame translated genome database. Under all three conditions, the COMBI scoring method performed much better than the union of the OMSSA double searches (Fig. 11).
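
The union rule with the 1/k threshold can be written down in a few lines of Python. In this sketch each separate search is summarized, purely as an assumption for illustration, by a dictionary that maps a precursor identifier to the lowest FDR at which its PSM is called positive; the function name is hypothetical.

```python
def union_of_searches(searches, fdr=0.01):
    """Union of positive precursors over k separate searches, with each
    search held to the stricter threshold fdr / k."""
    k = len(searches)
    per_search_fdr = fdr / k
    accepted = set()
    for result in searches:
        accepted |= {pid for pid, q in result.items() if q < per_search_fdr}
    return accepted
```

For example, with a CID and an ETD search (k = 2) and an overall 1% FDR, a precursor enters the union only if it passes a 0.5% threshold in at least one of the two searches.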

Discussion

We have developed a new MS/MS database search method for the mass spectrometric identification of naturally occurring peptides in biological samples. Identification of such endogenous peptides has always suffered from a high failure rate. When an MS/MS spectrum of an endogenous peptide has to be matched with its corresponding peptide sequence, two important difficulties are encountered. First, unlike the enzymatically generated peptides (e.g., tryptic peptides) that are common in many proteome studies, endogenous peptides are generated in vivo through proteolytic processing by various dedicated, but largely uncharacterized, proteases. Therefore, MS/MS peptide searches need to be conducted without any protease specification. This leads to a huge search space, often resulting in poor confidence levels for peptide identification. Second, the majority of endogenous peptides are post-translationally modified. When multiple different PTMs are taken into account, the search space further increases exponentially, which severely affects the sensitivity of the MS/MS search. Due to these difficulties, many MS/MS spectra of endogenous peptides cannot be identified even when they are of high quality. Recent high-mass-accuracy mass spectrometers, such as Fourier transform Orbitrap and ion cyclotron resonance instruments, have enabled highly accurate measurement of the molecular masses of peptides. This contributes to narrowing the search space, thereby improving the sensitivity of MS/MS database searches. In addition, informatics solutions are required to further improve endogenous peptide identification. Several approaches have been proposed9 to address the large search space for identifying endogenous peptides. One solution is to downsize the database to a search space that only contains known protein precursors of endogenous peptides as the target dataset, thus narrowing the search space. Obviously, such an approach only allows the identification of peptides that are derived from known precursors, and cannot be used for the identification of unknown endogenous peptides. Another solution is to preprocess the protein dataset in silico based on known bioactive peptide processing patterns, thus generating potential peptide sequences as a target for the MS/MS search.40–42 However, knowledge of the processing patterns is still very incomplete, and many studies have reported the generation of endogenous peptides by a variety of unconventional proteases.43–45 The approach presented here improves the sensitivity of a peptide search by taking advantage of two types of fragmentation methods, without narrowing the set of target sequences. It thus retains the opportunity to detect novel forms of endogenous peptides, which makes it very suitable for discovery-oriented comprehensive peptidomics. Another potential approach to deal with a large search space is a combination of de novo sequencing and searching based on short sequence tags. Several tools are available to search for peptides based on short peptide subsequence tags acquired de novo.46,47 However, to acquire such de novo peptide sequences, high-quality and noise-free fragmentation spectra are needed, even for short subsequences, and these are often difficult to obtain from a complex biological sample. Unlike de novo approaches, our approach is applicable to noisy spectra, such as those resulting from complex biological samples.

Using two complementary fragmentation methods. Recently, a number of proteomics studies have been reported that combine the separate search results of CID and ETD fragmentation.23,24,48,49 These approaches appear beneficial for peptide identification in bottom-up proteomics. Rather than just combining the separate search results, the CID and ETD fragment peak lists may be merged into one spectrum. However, this results in low identification sensitivity. Once the CID and ETD spectra are merged, all ions have to be matched against the b-, y-, c- and z-ion series. This increases false peak assignment. Consequently, the overall score of false hits increases, whereas that of the correct identification is not necessarily improved.50 Therefore, simply merging two types of fragmentation spectra does not help to discriminate the correct peptide sequence from the stochastic population. Currently, most MS/MS search engines have been adapted to allow searching with ETD spectra. Most of them simply match fragment ions to c- and z-ions, but because of the different nature of the fragmentation process (e.g., charge-reduced precursor ions are generated with high intensity using ETD), search engines developed for CID do not always give adequate results. MS-GFDB19 employs a clever approach to combine CID and ETD information. The MS-GF algorithm converts a fragmentation spectrum into a prefix-residue mass spectrum (PRM), which has a score at every mass. It then sums the CID and ETD PRM spectra by combining the two scores at every mass. We could not compare our results with MS-GFDB for large database searches (e.g., a genome-wide search), because we ran into memory problems with the current version, even when allowing up to 10 Gbyte of memory during the process. Our approach is specifically designed for large search spaces: it dynamically stores all information (e.g., sequences, scores) on local disk space, without consuming large amounts of RAM, so that the program can run on a normal workstation.

New challenges for peptidomics. Peptidomics shares many experimental techniques and bioinformatic tools with proteomics, including MS/MS-based peptide identification. So far, various peptidomics studies have used conventional MS/MS search engines to identify endogenous peptides.5,44,45,51–60 However, the differences between the conventional identification of enzymatic digests of proteins, as in bottom-up proteomics, and the identification of endogenous peptides call for search protocols tailored to endogenous peptide identification. Most peptidomic studies take into account several of the most commonly encountered PTMs during database searching. However, a number of novel biologically active peptides carrying uncommon PTMs have been identified.61,62 Moreover, it has been shown that previously identified endogenous peptides can carry uncommon PTMs and can be present in multiple forms.36 Those PTMs are ignored in most peptidomics analyses, because including a large number of variable PTMs increases the search space and lowers the sensitivity of peptide identification. This can lead to missed identifications of such forms of bioactive peptides. Several studies have used unrestrictive PTM searching in a blind mode, i.e., searching for peptides without specifying the PTMs potentially present in a sample.63,64 Another approach is to cluster spectra by grouping those that display similar peak patterns. This allows the modifications present in a peptidome to be assessed as mass shifts.36 Performing such a PTM survey prior to a database search may increase the overall sensitivity and enable the identification of more peptides.65 Another important challenge of MS/MS-based endogenous peptide identification is the large database that needs to be searched, for example when performing a genome-wide search. In the present study, we have validated our approach by conducting a search against the mouse genome translated in its six reading frames. Genomic sequences of a growing number of species, including non-model organisms, are becoming available. Typically, genomic sequences are scanned for protein-coding regions, and these can be used for MS/MS peptide searching. However, the coding sequences of the precursors of endogenous peptides are often significantly shorter than those of other proteins, and the conserved structures or domains are often restricted to a single mature peptide and do not extend over the whole precursor protein. As a consequence, many endogenous peptide sequences are potentially missed during genome annotation.66 Recent studies have shown the presence of a number of endogenous peptides encoded in unconventional coding regions, such as short open reading frames.67–69 To identify such peptides, MS/MS peptide searches have to be performed beyond the conventional protein dataset, and larger search spaces need to be searched. The approach presented in this study uses a combination of CID and ETD and scans very large search spaces efficiently, which makes it highly suitable for search applications in peptidomics.

Figure 1. The MS/MS peptide search strategy employed in this study. Matching of experimental and theoretical fragmentation spectra is carried out for CID and ETD independently. The PSM score is generated by combining the CID and ETD subscores into a COMBI score.
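
The independent matching step of this strategy can be sketched in Python. The sketch generates singly charged b/y ions for the CID spectrum and c/z ions for the ETD spectrum from monoisotopic residue masses and counts the matched theoretical ions within a tolerance. The z ion is taken as the radical z-dot species, the default tolerance of 0.5 is arbitrary, and all function names are hypothetical. How the subscores are computed from the matches and combined into the COMBI score is not reproduced here, so the sketch stops at the per-spectrum match counts.

```python
PROTON, WATER, AMMONIA, HYDROGEN = 1.007276, 18.010565, 17.026549, 1.007825

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Offsets added to the summed residue masses for singly charged ions.
N_TERMINAL = {"b": PROTON, "c": PROTON + AMMONIA}
C_TERMINAL = {"y": WATER + PROTON, "z": WATER + PROTON - AMMONIA + HYDROGEN}


def theoretical_ions(peptide, series):
    """m/z values of the requested ion series, e.g. series="by" or "cz"."""
    masses = [RESIDUE_MASS[r] for r in peptide]
    ions = []
    for cut in range(1, len(masses)):
        prefix, suffix = sum(masses[:cut]), sum(masses[cut:])
        ions += [prefix + N_TERMINAL[s] for s in series if s in N_TERMINAL]
        ions += [suffix + C_TERMINAL[s] for s in series if s in C_TERMINAL]
    return ions


def count_matches(peaks, ions, tolerance=0.5):
    """Number of theoretical ions with an observed peak within tolerance."""
    return sum(any(abs(p - ion) <= tolerance for p in peaks) for ion in ions)


def match_spectral_pair(peptide, cid_peaks, etd_peaks, tolerance=0.5):
    """Match the CID peaks to b/y ions and the ETD peaks to c/z ions."""
    cid = count_matches(cid_peaks, theoretical_ions(peptide, "by"), tolerance)
    etd = count_matches(etd_peaks, theoretical_ions(peptide, "cz"), tolerance)
    return cid, etd
```

Keeping the two peak lists apart means that CID peaks are never tested against c/z ions and ETD peaks are never tested against b/y ions, in contrast to a merged spectrum, where every peak is tested against all four series.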

Figure 2. Comparison of the union strategy using a double search (A) and the COMBI scoring method (B). TP and FP indicate true positives and false positives.

Figure 3. CID (A) and ETD (B) fragmentation spectra of the peptide “GGFMTSEKSQTP”. The arrows indicate precursor ions. Ion labels and their meanings are as follows: *, loss of NH3; °, loss of H2O.

Figure 4. Distribution of the observed scores of hits for the different scoring methods. Black dots indicate the log(-log) transformed empirical cumulative distribution function (ecdf) of the observed stochastic scores. Panels A and B show the score distribution based on the subscoring functions using only CID and ETD spectra, respectively. C shows the score distribution of the COMBI score. The arrows indicate the score of the correct peptide assignment (peptide “GGFMTSEKSQTP”).

Figure 5. Simulation of score distribution for CID and ETD subscores and COMBI score. Black dots indicate the log(-log) transformed empirical cumulative distribution function (ecdf) of the observed stochastic scores. The number of matched fragment ions in CID and ETD spectra is shown on the right side of the panels. The e-values of the top scoring hits are indicated in the panels.

Figure 6. Overall distribution of the observed e-values of PSMs. The search was performed against the mouse Swiss-Prot protein dataset (concatenated with the reversed sequences as decoy sequences), without taking into account any PTMs. Panels A, B and C show the results of the CID and ETD subscoring functions and the COMBI scoring, respectively. Bars indicate the count of e-values of the highest-scoring peptides found in target sequences (black bars) or decoy sequences (gray bars). The frequencies are displayed up to 200. The arrows indicate the lowest e-values of searches against the target database.

Figure 7. (A) The number of positive PSMs when no PTMs are considered. The number of positive PSMs is plotted as a function of the FDR as the e-value significance threshold is varied. (B) Venn diagram of the positive PSMs among the three scoring methods when a < 1% FDR level is imposed on each result.

Figure 8. Size of the search spaces in the three search conditions. The searches were performed against the protein database (white), protein database with PTMs (gray) and genome 6-frame translation (black). This histogram shows the distributions of the observed peptide ions in the LC-MS/MS data (1,706 in total) over the search space size (the number of the candidate peptides found in each search condition).

Figure 9. (A) The number of positive PSMs when multiple variable PTMs are taken into account. The number of positive PSMs is plotted as a function of the FDR as the e-value significance threshold is varied. (B) Venn diagram of the positive PSMs for the three scoring methods when a < 1% FDR level is imposed on each result.

Figure 10. (A) The number of positive PSMs when searches are performed against the six-frame translation of the mouse genome database. The number of positive PSMs is plotted as a function of the FDR as the e-value significance threshold is varied. (B) Venn diagram of the positive PSMs for the three scoring methods when a < 1% FDR level is imposed on each result.

Figure 11. Comparisons of the performance of COMBI scoring with the union of the results of an OMSSA CID and ETD double search. Searches were performed against the protein database (A), the protein database allowing PTMs (B) and the six-frame translated mouse genome database (C).

Supporting information Supplementary texts and the list of positive PSMs. This material is available free of charge via the Internet at http://pubs.acs.org

*Corresponding author: Research group of Functional Genomics and Proteomics, Naamsestraat 59, B-3000 Leuven, Belgium; Tel: +32 16 32 42 60; Fax: +32 16 32 39 02; E-mail address: [email protected]

Acknowledgments We acknowledge the excellent technical assistance of Zjef Nackaerts. This work was supported by the agency for Innovation by Science and Technology (IWT Flanders), the Industrial Research Fund of the KU Leuven and the national Interuniversity Attraction Pole grant (IPROS). G.M. is a Postdoctoral Fellow of the Research Foundation - Flanders (FWO-Vlaanderen).

ABBREVIATIONS CID, collision-induced dissociation; e-value, expectation value; ECD, electron capture dissociation; ETD, electron transfer dissociation; EVD, extreme value distribution; FDR, false discovery rate; MS/MS, tandem mass spectrometry; PSM, peptide-to-spectrum match; PTM, post-translational modification.

REFERENCES

(1)

Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–89.

(2)

Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–67.

(3)

Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466–7.

(4)

Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3, 958–64.

(5)

Baggerman, G.; Cerstiaens, A.; De Loof, A.; Schoofs, L. Peptidomics of the larval Drosophila melanogaster central nervous system. J. Biol. Chem. 2002, 277, 40368–74.

(6)

Sasaki, K.; Satomi, Y.; Takao, T.; Minamino, N. Snapshot peptidomics of the regulated secretory pathway. Mol. Cell. Proteomics 2009, 8, 1638–47.

(7)

Munawar, A.; Trusch, M.; Georgieva, D.; Spencer, P.; Frochaux, V.; Harder, S.; Arni, R. K.; Duhalov, D.; Genov, N.; Schlüter, H.; Betzel, C. Venom peptide analysis of Vipera ammodytes meridionalis (Viperinae) and Bothrops jararacussu (Crotalinae) demonstrates subfamily-specificity of the peptidome in the family Viperidae. Mol. Biosyst. 2011, 7, 3298–307.

(8)

Li, G.-H.; Mine, Y.; Hincke, M. T.; Nys, Y. Isolation and characterization of antimicrobial proteins and peptide from chicken liver. J. Pept. Sci. 2007, 13, 368–78.

(9)

Fälth, M.; Sköld, K.; Svensson, M.; Nilsson, A.; Fenyö, D.; Andren, P. E. Neuropeptidomics strategies for specific and sensitive identification of endogenous peptides. Mol. Cell. Proteomics 2007, 6, 1188–97.

(10)

Steiner, D. F. The proprotein convertases. Curr. Opin. Chem. Biol. 1998, 2, 31–9.

(11)

Fricker, L. D. Neuropeptide-processing enzymes: applications for drug discovery. AAPS J. 2005, 7, E449–55.

(12)

Fricker, L. D.; Lim, J.; Pan, H.; Che, F. Y. Peptidomics: identification and quantification of endogenous peptides in neuroendocrine tissues. Mass Spectrom. Rev. 2006, 25, 327–44.

(13)

Hökfelt, T.; Broberger, C.; Xu, Z. Q.; Sergeyev, V.; Ubink, R.; Diez, M. Neuropeptides-an overview. Neuropharmacology 2000, 39, 1337–56.

(14)

Eipper, B. A.; Milgram, S. L.; Husten, E. J.; Yun, H. Y.; Mains, R. E. Peptidylglycine alpha-amidating monooxygenase: a multifunctional protein with catalytic, processing, and routing domains. Protein Sci. 1993, 2, 489–97.

(15)

Syka, J. E. P.; Coon, J. J.; Schroeder, M. J.; Shabanowitz, J.; Hunt, D. F. Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 9528–33.

(16)

McLafferty, F. W.; Horn, D. M.; Breuker, K.; Ge, Y.; Lewis, M. A.; Cerda, B.; Zubarev, R. A.; Carpenter, B. K. Electron capture dissociation of gaseous multiply charged ions by Fourier-transform ion cyclotron resonance. J. Am. Soc. Mass Spectrom. 2001, 12, 245–9.

(17)

Li, W.; Song, C.; Bailey, D. J.; Tseng, G. C.; Coon, J. J.; Wysocki, V. H. Statistical analysis of electron transfer dissociation pairwise fragmentation patterns. Anal. Chem. 2011, 83, 9540–5.

(18)

Nielsen, M. L.; Savitski, M. M.; Zubarev, R. A. Improving protein identification using complementary fragmentation techniques in Fourier transform mass spectrometry. Mol. Cell. Proteomics 2005, 4, 835–45.

(19)

Kim, S.; Mischerikow, N.; Bandeira, N.; Navarro, J. D.; Wich, L.; Mohammed, S.; Heck, A. J. R.; Pevzner, P. A. The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search. Mol. Cell. Proteomics 2010, 9, 2840–52.

(20)

Wright, J. C.; Collins, M. O.; Yu, L.; Käll, L.; Brosch, M.; Choudhary, J. S. Enhanced peptide identification by electron transfer dissociation using an improved Mascot Percolator. Mol. Cell. Proteomics 2012, 11, 478–91.

(21)

Molina, H.; Matthiesen, R.; Kandasamy, K.; Pandey, A. Comprehensive comparison of collision induced dissociation and electron transfer dissociation. Anal. Chem. 2008, 80, 4825–35.

(22)

Guthals, A.; Bandeira, N. Peptide identification by tandem mass spectrometry with alternate fragmentation modes. Mol. Cell. Proteomics 2012, 11, 550–7.

(23)

Swaney, D. L.; McAlister, G. C.; Coon, J. J. Decision tree-driven tandem mass spectrometry for shotgun proteomics. Nat. Methods 2008, 5, 959–64.

(24)

Frese, C. K.; Altelaar, A. F. M.; Hennrich, M. L.; Nolting, D.; Zeller, M.; Griep-Raming, J.; Heck, A. J. R.; Mohammed, S. Improved peptide identification by targeted fragmentation using CID, HCD and ETD on an LTQ-Orbitrap Velos. J. Proteome Res. 2011, 10, 2377–88.

(25)

Wright, J. C.; Collins, M. O.; Yu, L.; Käll, L.; Brosch, M.; Choudhary, J. S. Enhanced peptide identification by electron transfer dissociation using an improved Mascot Percolator. Mol. Cell. Proteomics 2012, 11, 478–91.

(26)

Shen, Y.; Tolić, N.; Xie, F.; Zhao, R.; Purvine, S. O.; Schepmoes, A. A.; Moore, R. J.; Anderson, G. A.; Smith, R. D. Effectiveness of CID, HCD, and ETD with FT MS/MS for degradomic-peptidomic analysis: comparison of peptide identification methods. J. Proteome Res. 2011, 10, 3929–43.

(27)

Shen, Y.; Tolić, N.; Purvine, S. O.; Smith, R. D. Improving collision induced dissociation (CID), high energy collision dissociation (HCD), and electron transfer dissociation (ETD) Fourier transform MS/MS degradome-peptidome identifications using high accuracy mass information. J. Proteome Res. 2012, 11, 668–77.

(28)

Sasaki, K.; Osaki, T.; Minamino, N. Large-scale identification of endogenous secretory peptides using electron transfer dissociation mass spectrometry. Mol. Cell. Proteomics 2013, 12, 700–9.

(29)

Fenyö, D.; Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 2003, 75, 768–74.

(30)

Alves, G.; Ogurtsov, A. Y.; Yu, Y. K. RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics. PLoS One 2010, 5, e15438.

(31)

Gumbel, E. J. Statistics of extremes.; Columbia University Press: New York, 1958.

(32)

Waterman, M. S.; Vingron, M. Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl. Acad. Sci. U.S.A. 1994, 91, 4625–8.

(33)

Rehmsmeier, M.; Steffen, P.; Hochsmann, M.; Giegerich, R. Fast and effective prediction of microRNA/target duplexes. RNA 2004, 10, 1507–17.

(34)

Vizcaíno, J. A.; Côté, R. G.; Csordas, A.; Dianes, J. A.; Fabregat, A.; Foster, J. M.; Griss, J.; Alpi, E.; Birim, M.; Contell, J.; O’Kelly, G.; Schoenegger, A.; Ovelleiro, D.; Pérez-Riverol, Y.; Reisinger, F.; Ríos, D.; Wang, R.; Hermjakob, H. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res. 2013, 41, D1063–9.

(35)

Guthals, A.; Bandeira, N. Peptide identification by tandem mass spectrometry with alternate fragmentation modes. Mol. Cell. Proteomics 2012, 11, 550–7.

(36)

Menschaert, G.; Vandekerckhove, T. T. M.; Landuyt, B.; Hayakawa, E.; Schoofs, L.; Luyten, W.; Van Criekinge, W. Spectral clustering in peptidomics studies helps to unravel modification profile of biologically active peptides and enhances peptide identification rate. Proteomics 2009, 9, 4381–8.

(37)

Gupta, N.; Bandeira, N.; Keich, U.; Pevzner, P. A. Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 2011, 22, 1111–20.

(38)

Eipper, B. A.; Mains, R. E.; Glembotski, C. C. Identification in pituitary tissue of a peptide alpha-amidation activity that acts on glycine-extended peptides and requires molecular oxygen, copper, and ascorbic acid. Proc. Natl. Acad. Sci. U.S.A. 1983, 80, 5144–8.

(39)

An, Z.; Chen, Y.; Koomen, J. M.; Merkler, D. J. A mass spectrometry-based method to screen for α-amidated peptides. Proteomics 2012, 12, 173–82.

(40)

Fälth, M.; Sköld, K.; Norrman, M.; Svensson, M.; Fenyö, D.; Andren, P. E. SwePep, a database designed for endogenous peptides and mass spectrometry. Mol. Cell. Proteomics 2006, 5, 998–1005.

(41)

Southey, B. R.; Amare, A.; Zimmerman, T. A.; Rodriguez-Zas, S. L.; Sweedler, J. V. NeuroPred: a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides. Nucleic Acids Res. 2006, 34, W267–72.

(42)

Southey, B. R.; Sweedler, J. V.; Rodriguez-Zas, S. L. A Python analytical pipeline to identify prohormone precursors and predict prohormone cleavage sites. Front. Neuroinformatics 2008, 2, 7.

(43)

Zheng, X.; Baker, H.; Hancock, W. S. Analysis of the low molecular weight serum peptidome using ultrafiltration and a hybrid ion trap-Fourier transform mass spectrometer. J. Chromatogr. A 2006, 1120, 173–84.

(44)

Fricker, L. D. Analysis of mouse brain peptides using mass spectrometry-based peptidomics: implications for novel functions ranging from non-classical neuropeptides to microproteins. Mol. BioSyst. 2010, 6, 1355–65.

(45)

Bora, A.; Annangudi, S. P.; Millet, L. J.; Rubakhin, S. S.; Forbes, A. J.; Kelleher, N. L.; Gillette, M. U.; Sweedler, J. V. Neuropeptidomics of the supraoptic rat nucleus. J. Proteome Res. 2008, 7, 4992–5003.

(46)

Menschaert, G.; Vandekerckhove, T. T. M.; Baggerman, G.; Landuyt, B.; Sweedler, J. V.; Schoofs, L.; Luyten, W.; Van Criekinge, W. A hybrid, de novo based, genome-wide database search approach applied to the sea urchin neuropeptidome. J. Proteome Res. 2010, 9, 990–6.

(47)

Jeong, K.; Kim, S.; Bandeira, N.; Pevzner, P. A. Gapped spectral dictionaries and their applications for database searches of tandem mass spectra. Mol. Cell. Proteomics 2011, 10, M110.002220.

(48)

Hogan, J. M.; Pitteri, S. J.; Chrisman, P. A.; McLuckey, S. A. Complementary structural information from a tryptic N-linked glycopeptide via electron transfer ion/ion reactions and collision-induced dissociation. J. Proteome Res. 2005, 4, 628–32.

(49)

Zubarev, R. A.; Zubarev, A. R.; Savitski, M. M. Electron capture/transfer versus collisionally activated/induced dissociations: solo or duet? J. Am. Soc. Mass Spectrom. 2008, 19, 753–61.

(50)

Kim, M.-S.; Zhong, J.; Kandasamy, K.; Delanghe, B.; Pandey, A. Systematic evaluation of alternating CID and ETD fragmentation for phosphorylated peptides. Proteomics 2011, 11, 2568–72.

(51)

Clynen, E.; Baggerman, G.; Veelaert, D.; Cerstiaens, A.; Van der Horst, D.; Harthoorn, L.; Derua, R.; Waelkens, E.; De Loof, A.; Schoofs, L. Peptidomics of the pars intercerebralis-corpus cardiacum complex of the migratory locust, Locusta migratoria. Eur. J. Biochem. 2001, 268, 1929–39.

(52)

Bora, A.; Annangudi, S. P.; Millet, L. J.; Rubakhin, S. S.; Forbes, A. J.; Kelleher, N. L.; Gillette, M. U.; Sweedler, J. V. Neuropeptidomics of the supraoptic rat nucleus. J. Proteome Res. 2008, 7, 4992–5003.

(53)

Dircksen, H.; Neupert, S.; Predel, R.; Verleyen, P.; Huybrechts, J.; Strauss, J.; Hauser, F.; Stafflinger, E.; Schneider, M.; Pauwels, K.; Schoofs, L.; Grimmelikhuijzen, C. J. P. Genomics, transcriptomics, and peptidomics of Daphnia pulex neuropeptides and protein hormones. J. Proteome Res. 2011, 10, 4478–504.

(54)

Yin, P.; Knolhoff, A. M.; Rosenberg, H. J.; Millet, L. J.; Gillette, M. U.; Sweedler, J. V. Peptidomic analyses of mouse astrocytic cell lines and rat primary cultured astrocytes. J. Proteome Res. 2012, 11, 3965–73.

(55)

Hummon, A. B.; Richmond, T. A.; Verleyen, P.; Baggerman, G.; Huybrechts, J.; Ewing, M. A.; Vierstraete, E.; Rodriguez-Zas, S. L.; Schoofs, L.; Robinson, G. E.; Sweedler, J. V. From the genome to the proteome: uncovering peptides in the Apis brain. Science 2006, 314, 647–9.

(56)

Zhang, X.; Pan, H.; Peng, B.; Steiner, D. F.; Pintar, J. E.; Fricker, L. D. Neuropeptidomic analysis establishes a major role for prohormone convertase-2 in neuropeptide biosynthesis. J. Neurochem. 2010, 112, 1168–79.

(57)

Petruzziello, F.; Fouillen, L.; Wadensten, H.; Kretz, R.; Andren, P. E.; Rainer, G.; Zhang, X. Extensive characterization of Tupaia belangeri neuropeptidome using an integrated mass spectrometric approach. J. Proteome Res. 2012, 11, 886–96.

(58)

Zhang, X.; Petruzziello, F.; Zani, F.; Fouillen, L.; Andren, P. E.; Solinas, G.; Rainer, G. High identification rates of endogenous neuropeptides from mouse brain. J. Proteome Res. 2012, 11, 2819–27.

(59)

Osaki, T.; Sasaki, K.; Minamino, N. Peptidomics-based discovery of an antimicrobial peptide derived from insulin-like growth factor-binding protein 5. J. Proteome Res. 2011, 10, 1870–80.

(60)

Sasaki, K.; Takahashi, N.; Satoh, M.; Yamasaki, M.; Minamino, N. A peptidomics strategy for discovering endogenous bioactive peptides. J. Proteome Res. 2010, 9, 5047–52.

(61)

Kojima, M.; Hosoda, H.; Date, Y.; Nakazato, M.; Matsuo, H.; Kangawa, K. Ghrelin is a growth-hormone-releasing acylated peptide from stomach. Nature 1999, 402, 656–60.

(62)

Fujii, R.; Yoshida, H.; Fukusumi, S.; Habata, Y.; Hosoya, M.; Kawamata, Y.; Yano, T.; Hinuma, S.; Kitada, C.; Asami, T.; Mori, M.; Fujisawa, Y.; Fujino, M. Identification of a neuropeptide modified with bromine as an endogenous ligand for GPR7. J. Biol. Chem. 2002, 277, 34010–6.

(63)

Na, S.; Bandeira, N.; Paek, E. Fast multi-blind modification search through tandem mass spectrometry. Mol. Cell. Proteomics 2012, 11, M111.010199.

(64)

Tsur, D.; Tanner, S.; Zandi, E.; Bafna, V.; Pevzner, P. A. Identification of posttranslational modifications by blind search of mass spectra. Nat. Biotechnol. 2005, 23, 1562–7.

(65)

Menschaert, G.; Vandekerckhove, T. T. M.; Baggerman, G.; Schoofs, L.; Luyten, W.; Van Criekinge, W. Peptidomics coming of age: a review of contributions from a bioinformatics angle. J. Proteome Res. 2010, 9, 2051–61.

(66)

Basrai, M. A.; Hieter, P.; Boeke, J. D. Small open reading frames: beautiful needles in the haystack. Genome Res. 1997, 7, 768–71.

(67)

Kondo, T.; Hashimoto, Y.; Kato, K.; Inagaki, S.; Hayashi, S.; Kageyama, Y. Small peptide regulators of actin-based cell morphogenesis encoded by a polycistronic mRNA. Nat. Cell Biol. 2007, 9, 660–5.

(68)

Ingolia, N. T.; Lareau, L. F.; Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 2011, 147, 789–802.

(69)

Slavoff, S. A.; Mitchell, A. J.; Schwaid, A. G.; Cabili, M. N.; Ma, J.; Levin, J. Z.; Karger, A. D.; Budnik, B. A.; Rinn, J. L.; Saghatelian, A. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat. Chem. Biol. 2013, 9, 59–64.

Abstract Graphic 52x33mm (300 x 300 DPI)