Spectral Clustering in Peptidomics Studies Allows Homology

Mar 12, 2012 - Laurens Minerva , An Ceulemans , Geert Baggerman , Lutgarde Arckens. PROTEOMICS - Clinical Applications 2012 6 (10.1002/prca.v6.11-12) ...
0 downloads 0 Views 3MB Size
Article pubs.acs.org/jpr

Spectral Clustering in Peptidomics Studies Allows Homology Searching and Modification Profiling: HomClus, a Versatile Tool Gerben Menschaert,*,†,‡ Eisuke Hayakawa,‡,§ Liliane Schoofs,§ Wim Van Criekinge,† and Geert Baggerman∥,⊥ †

Faculty of Bioscience Engineering, Laboratory for Bioinformatics and Computational Genomics, Ghent University, Ghent, Belgium Prometa, Interfaculty Center for Proteomics and Metabolomics, K.U. Leuven, Leuven, Belgium § Research Group of Functional Genomics and Proteomics, K.U. Leuven, 3000 Leuven, Belgium ∥ VITO Nv, 2400 Mol, Belgium ⊥ CFP, Center for Proteomics, 2020 Antwerpen, Belgium ‡

S Supporting Information *

ABSTRACT: Many genomes of nonmodel organisms are yet to be annotated. Peptidomics research on those organisms therefore cannot adopt the commonly used database-driven identification strategy, leaving the more difficult de novo sequencing approach as the only alternative. The reported tool uses the growing resources of publicly or in-house available fragmentation spectra and sequences of (model) organisms to elucidate the identity of peptides of experimental spectra of nonannotated species. Clustering algorithms are implemented to infer the identity of unknown peak lists based on their publicly or in-house available counterparts. The reported tool, which we call the HomClus-tool, can cope with posttranslational modifications and amino acid substitutions. We applied this tool on two locusts (Schistocerca gregaria and Locusta migratoria) LC-MALDI-TOF/TOF datasets. Compared to a Mascot database search (using the available UniProt-KB proteins of these species), we were able to double the amount of peptide identifications for both spectral sets. Known bioactive peptides from Drosophila melanogaster (i.e., fragmentations spectra generated in silico thereof) were used as a starting point for clustering, trying to reveal their experimental homologues’ counterparts. KEYWORDS: peptidomics, bioactive peptides, Locusta migratoria, Schistocerca gregaria, spectral clustering, PTMs, post-translational modification, homology searching



INTRODUCTION Mass spectrometry-based proteomics and peptidomics are wellestablished techniques in protein and peptide research. Inherently, data repositories holding those results are exponentially expanding, providing the research community with annotated fragmentation spectra and peptide information:1 Proteomics Identification database (PRIDE, +260 million spectra2), PeptideAtlas (several millions of highly confident peptide spectrum matches3,4), The Global Proteome Machine database (GPMDB5), Tranche,6 and more specific endogenous peptide databases (Swepep,7 Neuropedia8). Moreover, many laboratories can utilize in-house data sets with annotated fragmentation spectra.9−13 Query interfaces on aforementioned public data repositories are also continuously improving, enabling extraction of annotated peak lists from mass spectrometry types of interest (selection can be based on MS fragmentation, ionization, or analyzer characteristics). PRIDE-BioMart and PRIDE-Inspector are interfaces built on the PRIDE database; PeptideAtlas results can be downloaded based on specific research queries. Next to the ability to query for specific data sets, public resources also strive toward publication of high quality data. The quality © 2012 American Chemical Society

assessment in terms of probability of correct identification and false discovery rates at the project and/or database level are commonly implemented. PeptideAtlas implements the statistical analysis from the Trans-Proteome-Pipeline,14 and Pride-Inspector has its own quality control features; furthermore, PRIDE-Q (forthcoming release of the PRIDE database) will only hold fragmentation spectra passing stringent quality control, thus assuring confident tandem mass spectrometry data. Apart from the described spectral resources, protein databases are also expanding,15−17 providing the community with an evermore-detailed protein map, especially for widely studied model organisms. Spectral libraries can be created from these proteins sequences in silico, subsequently serving as a kind of spectral resource. Most database search engines rely on this technique (Mascot,18 X!Tandem19). In this present report, we describe a software solution, which we call the HomClus-tool, enabling identification of endogenous (neuro)peptides from poorly annotated (nonmodel) species. Received: November 8, 2011 Published: March 12, 2012 2774

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

Journal of Proteome Research

Article

supernatant was filtered through spin down filter (0.22 mm, Ultrafree-MC, Millipore). Finally, the peptide extracts were dried in a vacuum centrifuge and stored at −20 °C before the chromatographic analysis.

Until now, the protein and peptide profile for many species remains unstudied: many vertebrate, insect, or bacteria proteomes are lacking, but also more specific peptide profiles (for example of toxins from spiders, scorpions, or cone snails) remain unexplored. Due to the lack of available amino acid sequences, the peptides of such samples can only be identified applying more labor-intensive de novo techniques. De novo strategies inevitably introduce their known difficulties and shortcomings,20 as for example the need for very good quality spectra (without missing parts of fragment ion series) and the need for accurate data. A handful of hybrid techniques are devised to first try and circumvent these difficulties and second allow introduction of amino acid mutations: MS-Blast21 or SPIDER.22 The latter methodologies use inferred sequence tags (short stretches of identified amino acids) as a starting point for further analysis based on the BLAST algorithm23 in the case of MS-Blast and a gapped alignment algorithm in the case of SPIDER. The Mascot database search engine similarly enables an error-tolerant database search allowing mutations.18 In this work, we present a methodology based on spectral clustering to deduce the identity of experimental fragmentation spectra of bioactive peptides. The novelty of the presented solution lies in the fact that it can use clustering algorithms25,26 that cope with post-translational modifications. Regular spectral alignment tools as SpectraST40 can only match to identical sequences, not to alternatively post-translationally modified forms. Furthermore the described solution is able to identify peak lists based on annotated peak lists of homologous peptides. The amount of similarity is measured at the fragmentation spectrum level, not at the amino acid level (e.g., MS-Blast that is solely sequence based21), thus also taking into account valuable spectral information as the intensities of the ion fragments. This technique is promising, especially since bioactive peptides mostly show a high degree of conservation and regularly carry post-translational modifications. Annotated peak lists of homologous peptides, obtained or generated from public or in-house resources (either spectral or sequence databases), can be administered. As a proof of concept, we chose to apply our method on fragmentation spectra obtained from LC−MALDITOF/TOF experiments on corpora cardiaca samples of two different locusts: Schistocerca gregaria and Locusta migratoria. Here, spectra generated in silico of known bioactive fruit fly peptides were used as a starting point in the identification process.



HPLC

The samples were dissolved in Milli-Q water containing 5% acetonitrile (ACN) and 0.1% FA and separated on an HPLC system, equipped with a C18 precolumn (PepMap 100, 5 μm − 100 A, 0.3 × 5 mm, Dionex) to concentrate and desalt the sample. After loading the sample, the following gradient was applied for the mobile phase: from 5% ACN to 10% ACN in 5 min, to 25% ACN in 37 min, to 45% ACN in 13 min, to finally 95% ACN in 4 min, at a flow rate of 200 nL/min over the analytic column (PepMap 100, 3 μm − 100 A, 75 μm × 15 cm, Dionex). Every 15 s, a fraction was automatically spotted on a Maldi ground plate using the Proteineer FC (LC−Maldi Fraction Collector, Bruker Daltonics) after mixing with 1.5 μL of a saturated solution of Alpha-Cyano-4-hydroxycinnamic acid (CHCA) in 60% ACN/0.1% FA. Mass Spectrometry

After evaporation of the solvent, the MALDI target was introduced into the mass spectrometer ion source. MS and MS/MS analysis were performed on an Ultraflex II instrument (Bruker Daltonics) in positive ion, reflectron mode. The instrument was calibrated externally with a commercial peptide mixture (peptide calibration standard, Bruker Daltonics). All spectra were obtained using Flex Control software (3.4 Bruker Daltonics). The plate was initially examined in MS mode and spectra were recorded within a mass range from m/z 500 to 4000. Subsequently, the most intense peaks with S/N higher than 10 were selected and used for the optimized LIFT MS/MS analysis from the same target. Peaks were selected for MS/MS from lowest to highest abundancy. All MS and MS/MS spectra were processed by means of the FlexAnalysis software (Bruker Daltonics), and m/z values and intensities of each peak were recorded in peak list files. Overview of the Presented Prediction Tool

Figure 1 gives a schematic overview of the presented solution identifying fragmentation spectra of peptides for poorly annotated organisms of interest. Known peptide peak lists (Figure 1, green box) retrieved from public repositories, comprehensive peptidomics studies on model organisms, synthetized catalogs, and/or in silico generated from protein databases (see Supplemental Data 1 for details on the supported input formats, Supporting Information), are used in aiding the annotation of unknown experimental data obtained from the examined species (Figure 1, red box). The experimental data and publicly available peak list files need to be entered in the HomClus-tool as MGF-formatted files (the L. migratoria and S. gregaria files used for the described analysis are available as Supplemental Data 4a and 4b, Supporting Information). Furthermore, values for both the peptide and fragment ion matching tolerance can be specified as input parameters during the HomClus analysis, to be able to take the accuracy of input data into account. The annotation process (Figure 1, dark blue boxes) contains of three subunits: a clustering module, an identification module, and a scoring module. The clustering module is currently based on the Bonanza clustering algorithm,25,26 but as the process is setup modularly, other algorithms27 can be plugged in at a later stage. A notable advantage of the Bonanza algorithm is that it takes into account mass shifts corresponding either to amino acid mutations and/or post-translational modifications (PTMs),

MATERIALS AND METHODS

Sample Preparation

Schistocerca gregaria and Locusta migratoria were reared under laboratory conditions,24 under a 13 h light, 11 h dark photoperiod at a room temperature of 32 °C and relative humidity between 40 and 60%. Locusts were kept under gregarizing conditions with at least 100 animals in cages of 38 cm × 38 cm × 38 cm and were daily fed ad libitum with cabbage (S. gregaria) or fresh grass and oatmeal (L. migratoria). One corpora cardiaca of an adult locust was carefully dissected. It was rinsed in a Ringer solution (8.77 g/L NaCl, 0.19 g/L CaCl2, 0.75 g/L KCl, 0.41 g/L MgCl2, 0.34 g/L NaHCO3, 30.81 g/L sucrose, 1.89 g/L trehalose, pH 7.2) and transferred to a 0.5 mL Eppendorf tube on ice, containing 50 μL of chilled extraction solvent (90% v/v methanol, 9% v/v Milli-Q water, 1% v/v formic acid (FA)). The sample was sonicated three times for 1 min and the remaining solid fraction was centrifuged down (20 min at 20000× g). Next, the 2775

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

Journal of Proteome Research

Article

Figure 1. Simplified schematic overview of the identification pipeline: gathering its publicly available, well-annotated and unknown peak lists to examine as input (IN: green and red boxes, left), how its identification strategy is build of three subunits: clustering, identification, and scoring module (PROCESS: dark blue boxes, center), and how it interacts with the information contained in the output (OUT: light blue box, right).

Plotting the score density versus the score values follows an extreme value distribution (type 1). To improve the estimation in the tail of the distribution, a fitting based on the 20−99% upper values of the log(−log) transformed empirical cumulative density function is commonly applied.30 A detailed example of the workflow, scoring schemes and statistical significance assessment is provided as Supplemental Data 2 (Supporting Information). The pipeline is built in Perl v10.5, E-value calculation is implemented in R (version 2.12.0), and a back-end portable SQLite (version 3.6.13) database is used for storage of all possible amino acid mutations up to 3 amino acids. The source-code and documentation can be requested from the authors.

the latter being abundantly present in bioactive peptides. In the clustering module spectral information is used to try and group (cluster) spectra. The aim is to cluster experimental, unknown spectra together with known spectra that have a sequence annotation. Afterward, the identification module generates a list of putative identifications by introducing amino acid shifts and/or post-translational modifications to the known sequence of the annotated fragmentation spectrum that clustered together with the unknown experimental fragmentation spectrum. The introduced amino acid shifts and/or modifications should match the difference between the parent ion masses of the annotated and unannotated, experimental spectra. Thus, both modified or homologous peptide forms can be identified. The generated candidates are subsequently ranked in the scoring module using information on fragmentation rules of the technique applied (e.g., CID, PSD, ETD/ECD, IRMPD). At present several scoring schemes are made available: correlation score, hyperscore1, hyperscore2, and the SALSA score28 (see Supplemental Data 2, Supporting Information). Finally a list is generated presenting the identified experimental peptides and their relationship to known peptides (sequence alignment) ranked on their calculated identity score. An expectation value (E-value) is also generated based on the scoring distribution assessing the significance of the obtained score. This E-value calculation in this report is based on the HyperScore2 distribution, but is replaceable by another scoring scheme. Extreme value distribution statistics29 are applied to generate the statistical significance of the peptide identifications.

Protein and Peptide Data

Two protein databases were compiled for database searching (Mascot search engine18) holding all Schistocerca gregaria and Locusta migratoria UniProt-KB protein entries, using the following queries: organism: “Locusta migratoria [7004]” (342 sequences) and organism: “Schistocerca gregaria [7010]” (104 sequences). These databases were used in the first identification round, prior to using the HomClus-tool. Furthermore all fruit fly peptide precursors were also downloaded from the UniProtKB resource (query: taxonomy: “Drosophila melanogaster (Fruit f ly) [7227]” AND annotation: (type:peptide)), to deduce all bioactive peptide sequences and modifications. The protein data downloaded from UniProt-KB (SwissProt formatted) of the two locust species and fruit fly were uploaded 2776

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

Journal of Proteome Research

Article

Figure 2. Schematic overview of the identification strategy, consisting of three identification rounds. The first is based on a regular database search against available UniProt-KB protein of the according locust. A second round is based on the reported HomClus-tool using fragmentation spectra generated in silico based on available fruit fly bioactive peptides in the UniProt-KB database, emphasizing homology searching (amino acid substitutions are allowed). The third round also adopts the HomClus-tool using the already annotated experimental fragmentation spectra resulting from the previous identification rounds for clustering. Here the emphasis is on identification of the full modification profile (no amino acid substitutions, yet all post-translational modifications are allowed).

melanogaster peak list counterparts (generated in silico from the sequences), thresholds for matching peaks (MP) count, total ion current (TIC), and expectation value (e-value) were set to 10, 0.15, and 10−1 respectively in the clustering module of the program. To allow calculation of an e-value, 1000 randomly generated peptides with the same experimental mass are also generated and scored next to the list of candidates resulting from either amino acid substitutions or modifications. As their e-value is generally higher than 1, setting an e-value cutoff of 10−1 for the true candidates in the clustering module assures that the false discovery score threshold is never reached. The candidate and identification modules make use of the following cut-offs: mass tolerance is set to 0.2 Da for peptides and 0.4 Da for fragments. These thresholds can be set when launching the HomClus tool. In this round, the most common bioactive peptide modifications (C-terminal amidation, N-terminal Gln→Pyro-Glu, and methionine oxidation) are taken into account. The PSI-MOD ontology (included in the Ontology Lookup Service,31 http://www.ebi.ac.uk/ontology-lookup/) is used describing post-translational modifications throughout the program. Furthermore, the top 10 candidates (ranked on Hyperscore2) are listed in the output; see Supplemental data 3 for generated output (Supporting Information). In the final identification step (see Figure 2), the emphasis is to identify extra modifications. All MS/MS peak lists, corresponding to the identified peptides from the first (Mascot DB-engine) and second identification round are gathered into a “known in-house peptidomic data set”. This “known” pool of

in an in-house BioSQL database (http://www.biosql.org), upon which the bioactive peptide sequences and annotated posttranslational modifications were queried and pasted in an CSV formatted file (see Supplemental Data 1 for CSV input data fomat, Supporting Information). Subsequently, fragmentation spectra were generated in silico (by the presented HomClus-tool) for comparison with the S. gregaria and L. migratoria corpora cardiaca peak lists. Identification Strategy of Peptides and PTMs Fingerprint

In a first identification round (see Figure 2), a database engine strategy (Mascot) was applied, performing searches against their according sequence databases (S. gregaria and L. migratoria), using the most abundant modifications present (extracted from the PTMs fingerprinting,25 see Figure 3: Amidated (C-term), Cation:Na (C-term), Cation:Na (DE), Gln→pyro-Glu (N-term Q), Methyl (C-term), Oxidation (M)). Throughout all searches, the following input parameters were chosen: mass tolerance was set to 0.2 Da for peptides and 0.4 Da for fragments and no cleavage enzyme for protein digestion was chosen. These mass thresholds are justified since a Bruker Ultraflex II Maldi-Tof-Tof instrument was used for the analysis. Of course, when working with a more accurate machine, these thresholds should be adjusted accordingly. A second identification round (see Figure 2) is based on the presented HomClus-tool, whereby the emphasis is to identify homologous peptide sequences, allowing amino acids substitutions. To cluster the unknown experimental fragmentation spectra (of both locust species) with their annotated Drosophila 2777

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

Journal of Proteome Research

Article

Figure 3. Bar plots show the absolute values of the mass shifts present between MS/MS peaklists of both the Schistocerca gregaria and Locusta migratoria samples after clustering using 15 for matching peaks (MP) cutoff and 10−3 for expectation value cutoff (using the Web tool (http:// peppus.ugent.be/Peptidomics-Bonanza). Only clusters that score above these thresholds are considered. The bar plot is based on the mass differences between the two peak lists within the withheld clusters. The 0 Da shift is omitted and the results are truncated to maximum 40 Da. The minimum bar size is set to 5 for both samples.

in-house peptidome database” so to speak). Not only is the functionality of the presented tool demonstrated thus, but it also shows that such a neuropeptide spectral library further improves identification efficiency, sensitivity, and reliability (in comparison to peak lists generated in silico by means of peptide sequences). It considers all spectral features, including actual fragment intensities, neutral losses from fragments, and various uncommon or even unknown fragments to determine the best matches. Preliminary knowledge of the modification profile of specific peptidome samples greatly improves identification rates.20 Consequently, prior to running the first and last identification round, a PTMs fingerprint is constructed based on the experimental data (http://peppus.ugent.be/PeptidomicsBonanza25). Bar plots depicting the present mass shifts in the two locust samples are presented in Figure 3. For subsequent analysis (Mascot database searching and the presented HomClus-tool) the most abundant mass shifts corresponding to known post-translational modifications were taken into account: Amidation (−1 Da), Gln→ pyro-Glu (−17 Da), Oxidation (+16 Da), Methylation (+14 Da), and Cation:Na adduct (+22 Da). An extra modification resulting in a 7 Da shift was also introduced. In a future release of the presented program this prior PTMs fingerprinting will be included automatically and resulting modifications taken into account in the matching algorithm, when chosen. At present, modifications have to be manually provided running the program.

fragmentation spectra is clustered to the initial experimental MS/MS peak lists. In contrast to the former run, no amino acid substitutions are allowed during identification, and the complete set of post-translational modifications obtained from a prior clustering analysis25 of the experimental data was taken into account. The same thresholds for matching peaks (MP) count, total ion current (TIC), and expectation value (e-value) were set as in the second identification step. Only the top 5 candidates were chosen as output (ranked on HyperScore2, see Supplemental data 3, Supporting Information).



RESULTS For the S. gregaria sample, 15 peptides (cleaved from 7 different precursors) were identified in the first identification round. Six new peptides (cleaved from 5 different precursors) were identified with the reported HomClus-tool in the second identification round, with the emphasis on identifying homologous peptides using amino acid substitutions. In the third identification round 13 extra peptides were identified (emphasize on revealing new modifications). For the L. migratoria sample the number of identifications is higher; 30 peptides (15 precursors), 9 peptides (7 precursors), and 17 peptides in respectively first, second, and third identification round. Overall the number of peptide identifications is doubled for both biological samples. All identifications are listed in Table 1. The full output files of the HomClus-tool are provided as Supplementary data 3 (3a and 3b represent the output for identification round 2 for S. gregaria and L. migratoria respectively; 3c and 3d are identical representing output for the third identification round, Supporting Information). Note that in the last identification round the fragmentation spectra corresponding to the annotated peptides were used as a starting point (“an



DISCUSSION Several extra bioactive peptides were identified by means of the HomClus-tool using fruit fly endogenous peptides as starting point in comparison to the first identification round (e.g., corazonin, drostatin, two allostatin peptides, several short 2778

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

2779

Schistostatin

tr|Q94742|Q94742_SCHGR

Drostatin-3 (=Schiststatin)

sp|Q9VC44|ALLS_DROME

Locustamyotropin-2

Schistostatin

tr|Q94742|Q94742_SCHGR

Ion transport peptide

Schistostatin

tr|Q94742|Q94742_SCHGR

sp|P22396|LMT2_LOCMI

Schistostatin

tr|Q94742|Q94742_SCHGR

sp|Q26491|ITP_SCHGR

Insulin-related peptide transcript variant T1

tr|B1GV78|B1GV78_SCHGR

Ion transport peptide

Insulin-related peptide transcript variant T1

tr|B1GV78|B1GV78_SCHGR

sp|Q26491|ITP_SCHGR

Insulin-related peptide transcript variant T1

tr|B1GV78|B1GV78_SCHGR

Corazonin

Ion transport peptide

sp|Q26491|ITP_SCHGR

sp|Q26377|CORZ_DROME

Short neuropeptide F

sp|P86445|SNPF_SCHGR

Locustamyotropin-4

SchistoFLRFamide

sp|P84307|FARP_SCHGR

sp|P41490|LMT4_LOCMI

SchistoFLRFamide

sp|P84307|FARP_SCHGR

Locustamyotropin-4

Adipokinetic prohormone type 2

sp|P35808|AKH2_SCHGR

sp|P41490|LMT4_LOCMI

Adipokinetic prohormone type 2

sp|P35808|AKH2_SCHGR

Locustamyotropin-2

Adipokinetic prohormone type 1

sp|P18829|AKH1_SCHGR

Locustapyrokinin-2

Adipokinetic prohormone type 1

sp|P18829|AKH1_SCHGR

sp|P41488|LPK2_LOCMI

Adipokinetic prohormone type 1

sp|P18829|AKH1_SCHGR

sp|P22396|LMT2_LOCMI

description

Precursor ID

GPRTYSFG

EGDFTPR

LDPHHLA

SPLDPHHL

QTFQYSHGWTN

RLQQYGMPFSPRL

QYGMPFSPRL

QSMPTFTPRL

EGDFTPRL

ARPYSFGL

GRLYSFGL

GPRTYSFGL

AGPAPSRLYSFGL

QSDLFLLSPK

QAQSDLFLLSPK

DLFLLSPK

SPLDPHHLA

SPSLRLRF

PDVDHVFLRF

PDVDHVFLR

QLNFSTGWGRR

QLNFSTGWGR

QLNFTPNWGTGK

QLNFTPNWGT

PNWGTGK

sequence

Mr delta mass

homology score threshold

910.406

995.438

1333.586

1129.496

1328.537

931.446

985.379

973.494

1242.54

1096.453

1303.51

1147.432

1344.505

1158.41

35.2 50.36

24.68

−0.12 −0.092

−0.161

28.01

28.27

−0.088

29.18

50.51

−0.111

−0.097

28.65

−0.114

−0.081

31.48

−0.132

38.83

51.31

−0.109

−0.128

48.09

−0.141

43.56

28.65

−0.136

−0.106

45.43

−0.087

1348.427

1589.675

1176.465

1157.448

931.372

907.399

103.95 71.91 78.31 102.39

163.23

−0.179 −0.114 −0.175

−0.157

Methylation(6)

897.334

820.257

801.303

914.34

HomClus score

−0.104

1.50 × 10−1

1.20 × 10−1

1.60 × 10−2

5.00 × 10−3

4.20 × 10−1

8.90 × 10−4

3.00 × 10−2

1.50 × 10−1

1.00 × 10−3

1.50 × 10−1

8.80 × 10−2

8.20 × 10−4

2.00 × 10−3

1.50 × 10−1

2.60 × 10−3

Mascot e-val

68.06

32

32

33

33

33

32

32

32

33

33

33

33

34

33

32

identity score threshold

−0.093

22

24

26

21

26

21

22

21

17

24

26

23

49.43 59.26 49.21 27.38

−0.126 −0.115 −0.119 −0.105

Peptide identifictions using HomClus tool (using ptms fingerprint)

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

Amidated (C-term)

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

Amidated (C-term)

Amidated (C-term)

Peptide identifictions using HomClus tool (≤3 AA subst)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Gln→pyro-Glu (N-term Q)

Gln→pyro-Glu (N-term Q)

Amidated (C-term)

Amidated (C-term)

Gln→pyro-Glu (N-term Q)

Gln→pyro-Glu (N-term Q)

Gln→pyro-Glu (N-term Q)

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

758.285

Schistocerca gregaria Peptide identifictions using Mascot DB search engine (first round)

Ptm

Mascot peptide score

Table 1. Identification Results for Schistocerca gregaria and Locusta migratoria Bio-active Peptidesa

5.62 × 10

1.06 × 10−1

9.27 × 10−3

3.52 × 10−2

7.62 × 10−2

1.58 × 10−5

4.57 × 10−7

2.10 × 10

−4

4.70 × 10−6

−4

5.77 × 10−3

HomClus e-val

R_H_QTFQYSRGWTN

HN_YQ_RLHQNGMPFSPRL

N_Y_QNGMPFSPRL

V_M_QSVPTFTPRL

EQUAL

S_A_SRPYSFGL

rule

Journal of Proteome Research Article

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

2780

LIRP

LIRP

LIRP

LIRP

LIRP

LIRP

Locustamyotropin-1

Locustamyotropin-1

Locustamyotropin-1

Locustamyotropin-2

Diuretic hormone

Locustatachykinin-4

Locustapyrokinin-2

sp|P15131|LIRP_LOCMI

sp|P15131|LIRP_LOCMI

sp|P15131|LIRP_LOCMI

sp|P15131|LIRP_LOCMI

sp|P15131|LIRP_LOCMI

sp|P15131|LIRP_LOCMI

sp|P22395|LMT1_LOCMI

sp|P22395|LMT1_LOCMI

sp|P22395|LMT1_LOCMI

sp|P22396|LMT2_LOCMI

sp|P23465|DIUH_LOCMI

sp|P30250|TKL4_LOCMI

sp|P41488|LPK2_LOCMI

LIRP

sp|P15131|LIRP_LOCMI

Insulin-related peptide transcript variant T1

tr|B1GV78|B1GV78_SCHGR

Apolipophorin-3b

SchistoFLRFamide

sp|P84307|FARP_SCHGR

sp|P10762|APL3_LOCMI

SchistoFLRFamide

sp|P84307|FARP_SCHGR

Apolipophorin-3b

Adipokinetic prohormone type 2

sp|P35808|AKH2_SCHGR

sp|P10762|APL3_LOCMI

Locustamyotropin-4

sp|P41490|LMT4_LOCMI

Adipokinetic prohormone type 2

Schistostatin

tr|Q94742|Q94742_SCHGR

sp|P08379|AKH2_LOCMI

Corazonin

sp|Q26377|CORZ_DROME

Adipokinetic prohormone type 2

Corazonin

sp|Q26377|CORZ_DROME

sp|P08379|AKH2_LOCMI

Corazonin

description

sp|Q26377|CORZ_DROME

Precursor ID

Table 1. continued

QSVPTFTPRL

APSLGFHGVR

DAEEQIKANKDFLQQI

EGDFTPRL

VPAAQFSPRL

GAVPAAQFSPRL

GAVPAAQFSPR

TQAQSDLFLLSPK

TATQAQSDLFLLSPK

QSDLFLLSPK

QAQSDLFLLSPK

DLFLLSPK

ARPSAGGLLTGAVF

AQSDLFLLSPK

RPDAAGHVNIAEA

QNSIQSAVQKPAN

QLNFSAGWGRR

QLNFSAGWGR

QSDLFLLSPK

PDVDHVFLR

PDVDHVFLRF

QLNFSTGW

LQQYGMPFSPRL

GRLYSFG

QTFQYSHGWTN

QTFQYSHGWTN

QTFQYSHGWTN

sequence

Mr delta mass

homology score threshold

1136.399

1110.481

1264.512

955.342

1434.593

812.33

1371.452

1364.401

1356.422

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Gln→pyro-Glu (N-term Q)

Gln→pyro-Glu (N-term Q)

Gln→pyro-Glu (N-term Q)

Gln→pyro-Glu (N-term Q)

1126.516

1038.491

1887.766

932.395

1083.545

1211.576

1099.475

1446.631

1618.698

1129.518

1328.589

931.467

1315.631

1217.571

1319.524

1383.539

1273.538

1117.445

31.58 185.87 30.2

58.99 55.15 75.39

−0.093 −0.156 −0.079

−0.126 −0.106 −0.208

47.28 30.65 50.72 28.17 48.86 38.05 27.97 44.9 31.52 41.88

−0.084

−0.141 −0.103 −0.102 −0.074 −0.076 −0.204 −0.082 −0.098

64.37 −0.159

47.89

38.96

−0.095

−0.109

37.14

−0.135

−0.071

45.98

−0.172

38.71

30.14

−0.094

−0.094

36.63

−0.086

17

21

24

26

27

22

25

16

30

35

23

21

28

19

17

18

37

36

38

36

36

37

36

37

37

36

37

36

37

36

37

37

1.60 × 10−2

1.50 × 10−1

1.10 × 10−2

3.50 × 10−1

3.40 × 10−2

3.40 × 10−3

3.60 × 10−1

2.50 × 10−3

2.60 × 10−1

4.20 × 10−3

9.90 × 10−5

3.40 × 10−3

3.50 × 10−2

3.00 × 10−2

5.00 × 10−2

6.50 × 10−3

2.60 × 10−1

67.4

−0.114

37

92.04

HomClus score

−0.183

5.00 × 10−2

Mascot e-val 81.68

36

identity score threshold

−0.162

Locusta migratoria Peptide identifictions using Mascot DB search engine (first round)

Gln→pyro-Glu (N-term Q); 7 DaShift(6)

Methylation(9)

Amidated (C-term); 22 DaShift(4)

Amidated (C-term); Gln→ pyro-Glu (N-term Q);22 DaShift(5)

Amidated (C-term)

Methylation(5)

Amidated (C-term); Gln→ pyro-Glu (N-term Q); 22 DaShift(4)

7 DaShift(C-term); Gln→pyroGlu (N-term Q); 7daShift(11)

Amidated (C-term); Gln→ pyro-Glu (N-term Q); 7daShift(11)

Peptide identifictions using HomClus tool (using ptms fingerprint)

Ptm

Mascot peptide score

8.21 × 10−2

2.17 × 10−3

1.86 × 10−1

5.56 × 10−2

1.95 × 10−6

4.30 × 10−1

3.02 × 10−2

1.28 × 10−3

1.58 × 10−3

HomClus e-val rule

Journal of Proteome Research Article

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

2781

Locustamyotropin-1

Locustamyotropin-3

sp|P41489|LMT3_LOCMI

LIRP

sp|P15131|LIRP_LOCMI

Diuretic hormone

Drostatins

sp|Q9VVF7|MIP_DROME

sp|P23465|DIUH_LOCMI

Neuropeptide Like

sp|Q9VV28|NPLP3_DROME

sp|P22395|LMT1_LOCMI

Short neuropeptide F

sp|Q9VIQ0|SNPF_DROME

Allostatins

sp|Q9VC44|ALLS_DROME

Short neuropeptide F

Cardio acceleratory peptide 2b

sp|Q9NIP6|CP2B_DROME

sp|Q9VIQ0|SNPF_DROME

Corazonin

sp|Q26377|CORZ_DROME

Short neuropeptide F

Locustatachykinin-1

sp|P16223|TKL1_LOCMI

sp|Q9VIQ0|SNPF_DROME

FMRFamide

sp|P10552|FMRF_DROME

SchistoFLRFamide

sp|P84306|FARP_LOCMI

Prepro ion transport peptide

Hypertrehalosaemic hormone

sp|P81626|HTF_LOCMI

tr|Q9XYF9|Q9XYF9_LOCMI

Sulfakinin

sp|P47733|LOSK_LOCMI

Short neuropeptide F

Locustamyotropin-4

sp|P41490|LMT4_LOCMI

sp|P86444|SNPF_LOCMI

Locustamyotropin-4

sp|P41490|LMT4_LOCMI

SchistoFLRFamide

Locustamyotropin-4

sp|P41490|LMT4_LOCMI

Short neuropeptide F

Locustamyotropin-3

sp|P41489|LMT3_LOCMI

sp|P86444|SNPF_LOCMI

Locustamyotropin-3

sp|P41489|LMT3_LOCMI

sp|P84306|FARP_LOCMI

description

Precursor ID

Table 1. continued

RQQPFVPR

DAEEQIKANKDFLQQI

VPAAQFSPRL

ARPSAGGLLTGA

AWQELQSAW

VVSVVED

SPSLRLFF

SLRLRF

PSRSPSLRLRF

GRLYSFGL

GANMGVVVFPR

QTFQYSHGWTN

GPSGFYGVR

HVFLxRF

SPLDAHHLA

SNRSPSLRLRF

SPSLRLRF

VDHVFLRF

PDVDHVFLRF

QVTFSRDWSP

QLASDDYGHMRF

RLHQNGMPFSPRL

LHQNGMPFSPRL

LHQNGMPFSPRL

RQQPFVPRL

QQPFVPRL

sequence

Mr delta mass

homology score threshold

959.383

1330.629

973.498

1030.484

1242.554

1203.475

1420.495

1550.672

1410.585

1394.613

1138.579

965.461 39.6 58.38 29.42 50.29 27.65 45.94 62.65 24.38 25.59 24.64 36.2

−0.094 −0.111 −0.134 −0.153 −0.125 −0.092 −0.097 −0.087 −0.085 −0.129 −0.099

27.74

−0.084

1116.454

744.275

963.39

788.434

1312.595

909.421

1144.486

1348.45

936.39

815.397

Amidated(C-term)

Methylation(8)

Amidated (C-term); 22 DaShift(13)

1040.5

1909.753

1083.532

1069.505

5.10 × 10−2

8.60 × 10−1

5.80 × 10−1

7.40 × 10−1

1.40 × 10−4

6.60 × 10−3

4.50 × 10−1

3.00 × 10−3

3.50 × 10−1

4.10 × 10−4

2.80 × 10−2

3.90 × 10−1

Mascot e-val

HomClus score

54.65 81.85 77.66 50.78

−0.164 −0.131 −0.086

−0.086

−0.068

104.24

−0.137

100.76

30.33

−0.134

−0.178

35.96 150.83

−0.092

68.41

36

36

36

36

37

37

37

38

37

37

37

36

identity score threshold

−0.083

28

17

18

22

21

22

16

25

25

23

29

20

26.95 80.41 67.06

−0.092 −0.202 −0.093

52.3

−0.088

Peptide identifictions using HomClus tool (using ptms fingerprint)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

Amidated (C-term)

Amidated (C-term)

Peptide identifictions using HomClus tool (≤3 AA subst)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

Amidated (C-term)

Amidated (C-term); Oxidation (M)

Amidated (C-term)

Amidated (C-term)

Amidated (C-term); Gln→ pyro-Glu (N-term Q)

Peptide identifictions using Mascot DB search engine (first round)

Ptm

Mascot peptide score

2.42 × 10−2

3.51 × 10−1

5.92 × 10−3

9.10 × 10−3

7.69 × 10−2

1.39 × 10−1

4.28 × 10

−2

1.86 × 10−3

3.80 × 10−4

1.97 × 10−3

1.85 × 10−1

1.20 × 10−5

1.02 × 10−4

1.87 × 10−3

HomClus e-val

SS_AE_AWQSLQSSW

PG_ED_VVSVVPG

R_F_SPSLRLRF

TRUNCATED

AQ_SP_AQRSPSLRLRF

SP_GL_SRPYSFGL

LYA_VVV_GANMGLYAFPR

R_H_QTFQYSRGWTN

EQUAL

SN_HV_SNFIRF

rule

Journal of Proteome Research Article

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

Locustamyotropin-3

Locustamyotropin-3

Locustamyotropin-4

Locustamyotropin-4

SchistoFLRFamide

SchistoFLRFamide

Corazonin

Corazonin

Short neuropeptide F

Short neuropeptide F

Short neuropeptide F

Prepro ion transport peptide

Prepro ion transport peptide

sp|P41489|LMT3_LOCMI

sp|P41489|LMT3_LOCMI

sp|P41490|LMT4_LOCMI

sp|P41490|LMT4_LOCMI

sp|P84306|FARP_LOCMI

sp|P84306|FARP_LOCMI

sp|Q26377|CORZ_DROME

sp|Q26377|CORZ_DROME

sp|Q9VIQ0|SNPF_DROME

sp|Q9VIQ0|SNPF_DROME

sp|Q9VIQ0|SNPF_DROME

tr|Q9XYF9|Q9XYF9_LOCMI

tr|Q9XYF9|Q9XYF9_LOCMI

SPLDAHHL

SPLDAHHLA

SPSLRLR

SNRSPSLRLR

PSLRLRF

QTFQYSHGWTN

QTFQYSHGWTN

PDVDHVFLRF

PDVDHVFLR

RLHQNGMPFSPRL

RLHQNGMPFSPR

QQPFVPR

RQQPFVPR

sequence

Mr delta mass

homology score threshold

22 DaShift(6)

Methylation(7)

Amidated(C-term)

Amidated (C-term); Gln→pyroGlu (N-term Q); 7 DaShift(11)

Amidated (C-term); Gln→ pyro-Glu (N-term Q); 22 DaShift(4)

Methylation(4); Amidated(Cterm)

Methylation(9)

Oxidation(7); Amidated(C-term)

Gln→pyro-Glu (N-term Q)

888.351

981.37

841.438

1184.548

886.463

1356.472

1371.443

1256.604

1110.482

1566.659

1438.594

853.369

1026.47

74.96 71.45 36.89

88.29 56.86 82.44 35 83.09 58.57

−0.123

−0.112 −0.092 −0.13 −0.08 −0.099 −0.099

49.37

−0.166

−0.068

40.43

−0.136

−0.105

32.3

HomClus score 88.09

Mascot e-val

−0.081

identity score threshold

−0.107

Peptide identifictions using HomClus tool (using ptms fingerprint)

Ptm

Mascot peptide score

5.89 × 10−2

7.57 × 10−4

1.06 × 10−1

1.06 × 10−2

4.04 × 10−4

1.07 × 10−1

4.26 × 10−2

1.31 × 10−2

2.73 × 10−3

8.94 × 10−1

5.40 × 10−1

8.74 × 10−1

3.32 × 10−2

HomClus e-val rule

All peptide sequences obtained throughout the three identification rounds, listing the predicted peptide sequence, the PTMs present, and the precursor the peptide is cleaved from or based on, for respectively the DB-search (1st round) and the HomClus-tool (2nd and 3rd round). Furthermore, the score and e-value are also provided for both tools. Identifications with an expectation value higher than 10−1 are marked in italic. The last column shows the rule applied to substitute the “known” peptide sequence into the experimental one.

a

description

Precursor ID

Table 1. continued

Journal of Proteome Research Article

2782

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

Journal of Proteome Research

Article

Figure 4. Corazonin peptide alignment block. After aligning all corazonin precursors from the Swiss-Prot reviewed protein database, using Clustal Omega,35 the block overlapping the corazonin peptides is sliced and presented. Based on the alignment it is clear that the locust corazonin also has the His7 substitution similar to the Apis mellifera corazonin, as compared to the other species. Nonetheless, the locust corazonin also has a Glu at position 4, similar to the other species (in contrast to the Thr4 substition in the A. mellifera peptide).

Next to the ability to annotate new homologous peptides, the tool also shows to be more sensitive than a commonly applied database search engine. Some peptides scoring below-threshold in the DB-search are significantly identified applying the HomClustool (e.g., identifications in round 2: schistostatin peptide based on drostatin counterpart and FLRFamide peptides based on the FMRFamide Drosophila counterpart; identifications in round 3: ion transport peptides, LIRP, myotropin and short neuropeptide F peptides). Lower quality spectral data (i.e., lacking partial fragmentation information) can still provide enough information for clustering purpose. Subsequently the -in the identification modulegenerated peptide candidates are rescored in the scoring module, eventually leading to significant results. Another advantage is the ability to identify new forms of bioactive peptides carrying post-translational modifications (see Table 1). A great deal of failed or missed identifications are attributable to the wealth of modifications on peptides, some of which may originate from in vivo post-translational processes to activate the molecule, whereas others could be introduced during the tissue preparation procedures. Next to known modifications such as methylation (+14 Da), oxidation (+16 Da), Cation:Na adduct (+22 Da), another as of yet unannotated modification was also discovered (after consulting the UniMod database, www.unimod.org36). The post-translational modification causes a 7 Da shift on the C-terminal amino acid of the corazonin peptide, next to the already present N-terminal pyroglutamination and C-terminal amidation (pQTFQYSHGWTNa). This modification has been witnessed both on Schistocerca and Locusta corazonin fragmentation spectra. Moreover, another form was encountered (only in the Schistocerca data set) carrying two of the mentioned modifications (7 Da shift), both located at the C terminal part of the peptide (one C-terminally, the other placed on the C-terminal amino acid). Even after thoroughly scrutinizing literature for this type of modification and mining the available databases as UniMod (www.unimod.org) and RESID (www.ebi. ac.uk/RESID/), the nature of this 7 Da shift modification could not be revealed. This 7 Da shift has been witnessed in both the Schistocerca and Locusta sample (see Figure 3). Also, a 6 Da shift is clearly present in the Locusta sample. This can be explained by

neuropeptide F peptide isoforms, an additional FMRFamide peptide, and less significantly also a cardio acceleratory peptide 2b, see Table 1). Furthermore, extra Schistocerca peptides could also be identified based on annotated Locusta peptide sequences (extra pyrokinins and myotropins peptides, see Table 1, bottom). These aforementioned extra identifications prove that the HomClus-tool is clearly capable to identify unknown, experimental fragmentation spectra using its built-in strategy. First the unknown spectrum is clustered together with a known, annotated fragmentation spectrum (e.g., CORZ_DROME, pQTFQYSRGWTNam) based on spectral similarity. Afterward a list of putative identifications for the experimental spectrum is generated by introducing amino acid substitutions and/or post-translational modifications to the known sequence (CORZ_DROME, pQTFQYSRGWTNam) of the annotated fragmentation spectrum clustering together with the experimental one. Hereby the mass difference between the parent ion masses of the annotated and unannotated, experimental spectra should match to these introduced mass shifts. In the case of the corazonin peptide, a substitution of an arginine (R) to a histidine (H) at position 7 corresponds to the parent ion mass difference between the known, annotated and unannotated, experimental spectrum. After ranking all solutions the aforementioned solution (R to H substitution: pQTFQYSHGWTNam) comes out as the best scoring one. To further validate this specific HomClus-tool prediction we constructed a multiple alignment of all corazonin peptide precursors found in UniProt-KB. This alignment clearly shows that an amino acid substitution from R to H at position 7 is presumably possible. This substitution was also noticed in the honeybee corazonin peptide, next to a second substitution from glutamine (Q) to tyrosine (T) at position 4 (see corazonin alignment in Figure 4). Most of the peptides identified in this study have been detected before,32−34 using de novo sequencing (implying very good spectral quality) and/or Edman degradation studies (implying fair amounts of peptide material). The presented tool clearly demonstrates its use in enabling the identification based on Maldi-TOF-TOF data using homologous sequence information as only input. 2783

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

Journal of Proteome Research

Article

Next to endogenous neuropeptides, identification of other types of peptides, with an emphasis toward revealing PTMs, as for example antimicrobial peptides, toxins, or venom peptides could also be attempted applying the reported tool. Finally, the tool can also be useful in regular proteomic analyses of organisms from which little or no sequence information is available.

combining this unannotated 7 Da modification and an amidation (−1 Da shift, also abundantly present in the Locusta sample), resulting in a net shift of 6 Da. Next to running this HomClus-tool, a commonly applied DB-search is still advisible. Not all peptide identifications obtained with the Mascot search (first identification round) could be confirmed with the second identification round applying the presented tool using the set of annotated bioactive peptides retrieved from UniProt-KB as point of comparison in the clustering module. For a database search it is possible to scan the complete precursor, without applying an enzymatic cleavage (mostly trypsin). The HomClus-tool on the other hand only tries to match to the provided bioactive peptide sequences or their corresponding MS/MS peak lists. As such, the spectrum identification rate can be higher with an initial Mascot search, since it will also pick up longer peptide variants (typically a set of variants, rather than just a single peptide is purified from a tissue or organ9,37,38). Nonetheless, this problem can be (partially) overcome using a “known” peptide pool (either sequence or spectral based) as comprehensive as possible. Several options are readily available to increase this search space if it is sequence based: (1) prior preprocessing of the peptide precursor sequences with the Neuropred tool,39 instead of only using the annotated endogenous peptides, (2) usage of all bioactive peptides in all insects instead of limiting the search to one specific species and (3) inclusion of peptides lacking a peptide annotation. For example, several peptide precursors have not been reviewed yet and remain in the trEMBL section of UniProt-KB (e.g., Drosophila IFamide precursor32). Of course, in the case of using a spectral resource as a starting point (e.g., public resources PRIDE1 or NeuroPedia8 or in-house peptidome spectral libraries, this same rule applies, that is, not limiting the clustering analysis to one specific species. Also, the HomClus-tool allows for querying a mixed set of both fragmentation spectra and spectral data generated in silico, making the search space even more comprehensive. Further functionality is planned for the future version of the identification tool, deletions and extensions are not yet implemented in the identification module. Current version only substitutions are allowed. Such implementation could create an even more complete list of candidates aiding the peptide identification. On the other hand the method described in this report, using spectral data as input, could outperform existing tools because it takes into account fragmentation intensities, unlike MS-Blast that is solely sequence based.21 Moreover it can cope with mass shifts in the fragmentation spectrum, in contrast to typical spectral alignment tools as SpectraST40 that can only find previously identified peptides based on very good spectral similarity. Finally the ability to cope with a wide number of post-translational modifications is also favorable with regard to revealing the sequence of bioactive peptides. As already mentioned the presented tool is modularly built, making it feasible to plug-in or replace modules. For spectral clustering other algorithms are published27 and could be applied. Within this study the Bonanza clustering was opted and implemented. Previous research also proves that the clustering technique can be applied on fragmentation spectra of multiply charged parent ions, for example, generated by an electrospray tandem mass spectrometer.25 The scoring module can be customized accordingly. Similarly, other scoring functions can also easily be implemented by changing the scoring module.



CONCLUSION In conclusion, this report describes the use of spectral clustering to group fragmentation spectra of annotated homologous peptides to a set of experimental unannotated ones. To reveal the peptide sequence of the latter set, an identification and candidate-scoring pipeline was constructed, next to the clustering section. It shows unambiguously that the approach yields promising results. Not only does it successfully reveal the identity of many examined peptides, the spectral clustering also enables one to detect the complete modification profile of the peptides. Bearing in mind the lack of protein annotation for nonmodel organisms and the importance of PTMs as activators of biological peptides, this reported tool can be seen as a valuable addition to the already available identification software.



ASSOCIATED CONTENT

* Supporting Information S

Supplemental tables and figures. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*Gerben Menschaert, Laboratory for Bioinformatics and Computational Genomics, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium. E-mail: [email protected]. Notes

The authors declare no competing financial interest.



REFERENCES

(1) Vizcaino, J. A.; Foster, J. M.; Martens, L. Proteomics data repositories: providing a safe haven for your data and acting as a springboard for further research. J. Proteomics 2010, 73 (11), 2136−46. (2) Vizcaino, J. A.; Cote, R.; Reisinger, F.; Foster, J. M.; Mueller, M.; Rameseder, J.; Hermjakob, H.; Martens, L. A guide to the Proteomics Identifications Database proteomics data repository. Proteomics 2009, 9 (18), 4276−83. (3) Desiere, F.; Deutsch, E. W.; Nesvizhskii, A. I.; Mallick, P.; King, N. L.; Eng, J. K.; Aderem, A.; Boyle, R.; Brunner, E.; Donohoe, S.; Fausto, N.; Hafen, E.; Hood, L.; Katze, M. G.; Kennedy, K. A.; Kregenow, F.; Lee, H.; Lin, B.; Martin, D.; Ranish, J. A.; Rawlings, D. J.; Samelson, L. E.; Shiio, Y.; Watts, J. D.; Wollscheid, B.; Wright, M. E.; Yan, W.; Yang, L.; Yi, E. C.; Zhang, H.; Aebersold, R. Integration with the human genome of peptide sequences obtained by highthroughput mass spectrometry. Genome Biol. 2005, 6 (1), R9. (4) Deutsch, E. W.; Lam, H.; Aebersold, R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 2008, 9 (5), 429−34. (5) Craig, R.; Cortens, J. P.; Beavis, R. C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004, 3 (6), 1234−42. (6) Falkner, J. A.; Hill, J. A.; Andrews, P. C. Proteomics FASTA archive and reference resource. Proteomics 2008, 8 (9), 1756−7. (7) Falth, M.; Skold, K.; Norrman, M.; Svensson, M.; Fenyo, D.; Andren, P. E. SwePep, a database designed for endogenous peptides and mass spectrometry. Mol. Cell. Proteomics 2006, 5 (6), 998−1005.

2784

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785

Journal of Proteome Research

Article

(8) Kim, Y.; Bark, S.; Hook, V.; Bandeira, N. NeuroPedia: neuropeptide database and spectral library. Bioinformatics 2011, 27 (19), 2772−3. (9) Baggerman, G.; Boonen, K.; Verleyen, P.; De Loof, A.; Schoofs, L. Peptidomic analysis of the larval Drosophila melanogaster central nervous system by two-dimensional capillary liquid chromatography quadrupole time-of-flight mass spectrometry. J. Mass Spectrom. 2005, 40 (2), 250−60. (10) Hummon, A. B.; Richmond, T. A.; Verleyen, P.; Baggerman, G.; Huybrechts, J.; Ewing, M. A.; Vierstraete, E.; Rodriguez-Zas, S. L.; Schoofs, L.; Robinson, G. E.; Sweedler, J. V. From the genome to the proteome: uncovering peptides in the Apis brain. Science 2006, 314 (5799), 647−9. (11) Husson, S. J.; Clynen, E.; Baggerman, G.; De Loof, A.; Schoofs, L. Peptidomics of Caenorhabditis elegans: in search of neuropeptides. Commun. Agric. Appl. Biol. Sci. 2005, 70 (2), 153−6. (12) Boonen, K.; Baggerman, G.; D’Hertog, W.; Husson, S. J.; Overbergh, L.; Mathieu, C.; Schoofs, L. Neuropeptides of the islets of Langerhans: a peptidomics study. Gen. Comp. Endrocrinol. 2007, 152 (2−3), 231−41. (13) Li, B.; Predel, R.; Neupert, S.; Hauser, F.; Tanaka, Y.; Cazzamali, G.; Williamson, M.; Arakane, Y.; Verleyen, P.; Schoofs, L.; Schachtner, J.; Grimmelikhuijzen, C. J.; Park, Y. Genomics, transcriptomics, and peptidomics of neuropeptides and protein hormones in the red flour beetle Tribolium castaneum. Genome Res. 2008, 18 (1), 113−22. (14) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75 (17), 4646−58. (15) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 4 (7), 1985−8. (16) Magrane, M.; Consortium, U. UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011, 2011, bar009. (17) Pruitt, K. D.; Tatusova, T.; Klimke, W.; Maglott, D. R. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009, 37 (Database issue), D32−6. (18) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551−67. (19) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466−7. (20) Menschaert, G.; Vandekerckhove, T. T.; Baggerman, G.; Schoofs, L.; Luyten, W.; Van Criekinge, W. Peptidomics coming of age: a review of contributions from a bioinformatics angle. J. Proteome Res. 2010, 9 (5), 2051−61. (21) Shevchenko, A.; Sunyaev, S.; Loboda, A.; Bork, P.; Ens, W.; Standing, K. G. Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. Anal. Chem. 2001, 73 (9), 1917−26. (22) Han, Y.; Ma, B.; Zhang, K. SPIDER: software for protein identification from sequence tags with de novo sequencing error. J. Bioinform. Comput. Biol. 2005, 3 (3), 697−716. (23) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215 (3), 403−10. (24) Ashby, G. J. Locusts. In UFAW Handbook on the care and management of laboratory animals; Hume, C. W., Ed.; Churchill Livingstone: Edinburgh, 1972; pp 582−587. (25) Menschaert, G.; Vandekerckhove, T. T.; Landuyt, B.; Hayakawa, E.; Schoofs, L.; Luyten, W.; Van Criekinge, W. Spectral clustering in peptidomics studies helps to unravel modification profile of biologically active peptides and enhances peptide identification rate. Proteomics 2009, 9 (18), 4381−8. (26) Falkner, J. A.; Falkner, J. W.; Yocum, A. K.; Andrews, P. C. A spectral clustering approach to MS/MS identification of posttranslational modifications. J. Proteome Res. 2008, 7 (11), 4614−22. (27) Flikka, K.; Meukens, J.; Helsens, K.; Vandekerckhove, J.; Eidhammer, I.; Gevaert, K.; Martens, L. Implementation and

application of a versatile clustering tool for tandem mass spectrometry data. Proteomics 2007, 7 (18), 3245−58. (28) Fenyo, D.; Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 2003, 75 (4), 768−74. (29) Gumbel, E. J. Statistics of extremes; Columbia University Press: New York, 1958. (30) Rehmsmeier, M.; Steffen, P.; Hochsmann, M.; Giegerich, R. Fast and effective prediction of microRNA/target duplexes. RNA 2004, 10 (10), 1507−17. (31) Cote, R.; Reisinger, F.; Martens, L.; Barsnes, H.; Vizcaino, J. A.; Hermjakob, H. The Ontology Lookup Service: bigger and better. Nucleic Acids Res. 2010, 38 (Web Serverissue), W155−60. (32) Clynen, E.; Schoofs, L. Peptidomic survey of the locust neuroendocrine system. Insect Biochem. Mol. Biol. 2009, 39 (8), 491− 507. (33) Tawfik, A. I.; Tanaka, S.; De Loof, A.; Schoofs, L.; Baggerman, G.; Waelkens, E.; Derua, R.; Milner, Y.; Yerushalmi, Y.; Pener, M. P. Identification of the gregarization-associated dark-pigmentotropin in locusts through an albino mutant. Proc. Natl. Acad. Sci. U.S.A. 1999, 96 (12), 7083−7. (34) Predel, R.; Gade, G. Identification of the abundant neuropeptide from abdominal perisympathetic organs of locusts. Peptides 2002, 23 (4), 621−7. (35) Sievers, F.; Wilm, A.; Dineen, D.; Gibson, T. J.; Karplus, K.; Li, W.; Lopez, R.; McWilliam, H.; Remmert, M.; Soding, J.; Thompson, J. D.; Higgins, D. G. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011, 7, 539. (36) Creasy, D. M.; Cottrell, J. S. Unimod: Protein modifications for mass spectrometry. Proteomics 2004, 4 (6), 1534−6. (37) Fricker, L. D.; Lim, J.; Pan, H.; Che, F. Y. Peptidomics: identification and quantification of endogenous peptides in neuroendocrine tissues. Mass Spectrom. Rev. 2006, 25 (2), 327−44. (38) Falth, M.; Skold, K.; Svensson, M.; Nilsson, A.; Fenyo, D.; Andren, P. E. Neuropeptidomics strategies for specific and sensitive identification of endogenous peptides. Mol. Cell. Proteomics 2007, 6 (7), 1188−97. (39) Southey, B. R.; Amare, A.; Zimmerman, T. A.; Rodriguez-Zas, S. L.; Sweedler, J. V. NeuroPred: a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides. Nucleic Acids Res. 2006, 34 (Web Server issue), W267−72. (40) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7 (5), 655−67.

2785

dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785