Article pubs.acs.org/jpr
Spectral Clustering in Peptidomics Studies Allows Homology Searching and Modification Profiling: HomClus, a Versatile Tool
Gerben Menschaert,*,†,‡ Eisuke Hayakawa,‡,§ Liliane Schoofs,§ Wim Van Criekinge,† and Geert Baggerman∥,⊥
† Faculty of Bioscience Engineering, Laboratory for Bioinformatics and Computational Genomics, Ghent University, Ghent, Belgium
‡ Prometa, Interfaculty Center for Proteomics and Metabolomics, K.U. Leuven, Leuven, Belgium
§ Research Group of Functional Genomics and Proteomics, K.U. Leuven, 3000 Leuven, Belgium
∥ VITO Nv, 2400 Mol, Belgium
⊥ CFP, Center for Proteomics, 2020 Antwerpen, Belgium
Supporting Information
ABSTRACT: Many genomes of nonmodel organisms are yet to be annotated. Peptidomics research on those organisms therefore cannot adopt the commonly used database-driven identification strategy, leaving the more difficult de novo sequencing approach as the only alternative. The reported tool uses the growing resources of publicly or in-house available fragmentation spectra and sequences of (model) organisms to elucidate the identity of peptides from experimental spectra of nonannotated species. Clustering algorithms are implemented to infer the identity of unknown peak lists based on their publicly or in-house available counterparts. The reported tool, which we call the HomClus-tool, can cope with post-translational modifications and amino acid substitutions. We applied this tool to LC-MALDI-TOF/TOF data sets from two locust species (Schistocerca gregaria and Locusta migratoria). Compared to a Mascot database search (using the available UniProt-KB proteins of these species), we were able to double the number of peptide identifications for both spectral sets. Known bioactive peptides from Drosophila melanogaster (i.e., fragmentation spectra generated in silico from them) were used as a starting point for clustering, with the aim of revealing their experimental homologues.
KEYWORDS: peptidomics, bioactive peptides, Locusta migratoria, Schistocerca gregaria, spectral clustering, PTMs, post-translational modification, homology searching
■
INTRODUCTION
Mass spectrometry-based proteomics and peptidomics are well-established techniques in protein and peptide research. Consequently, data repositories holding those results are expanding exponentially, providing the research community with annotated fragmentation spectra and peptide information:1 Proteomics Identification database (PRIDE, more than 260 million spectra2), PeptideAtlas (several millions of highly confident peptide spectrum matches3,4), The Global Proteome Machine database (GPMDB5), Tranche,6 and more specific endogenous peptide databases (Swepep,7 Neuropedia8). Moreover, many laboratories can utilize in-house data sets with annotated fragmentation spectra.9−13 Query interfaces on the aforementioned public data repositories are also continuously improving, enabling extraction of annotated peak lists from mass spectrometry types of interest (selection can be based on MS fragmentation, ionization, or analyzer characteristics). PRIDE-BioMart and PRIDE-Inspector are interfaces built on the PRIDE database; PeptideAtlas results can be downloaded based on specific research queries.
Next to the ability to query for specific data sets, public resources also strive toward publication of high quality data. Quality assessment in terms of probability of correct identification and false discovery rates at the project and/or database level is commonly implemented. PeptideAtlas implements the statistical analysis from the Trans-Proteome-Pipeline,14 and PRIDE-Inspector has its own quality control features; furthermore, PRIDE-Q (forthcoming release of the PRIDE database) will only hold fragmentation spectra passing stringent quality control, thus assuring confident tandem mass spectrometry data.
Apart from the described spectral resources, protein databases are also expanding,15−17 providing the community with an ever more detailed protein map, especially for widely studied model organisms. Spectral libraries can be created from these protein sequences in silico, subsequently serving as a kind of spectral resource. Most database search engines rely on this technique (Mascot,18 X!Tandem19).
In the present report, we describe a software solution, which we call the HomClus-tool, enabling identification of endogenous (neuro)peptides from poorly annotated (nonmodel) species.
Received: November 8, 2011
Published: March 12, 2012
© 2012 American Chemical Society
dx.doi.org/10.1021/pr201114m | J. Proteome Res. 2012, 11, 2774−2785
Journal of Proteome Research
Until now, the protein and peptide profile for many species remains unstudied: many vertebrate, insect, or bacterial proteomes are lacking, and more specific peptide profiles (for example of toxins from spiders, scorpions, or cone snails) also remain unexplored. Due to the lack of available amino acid sequences, the peptides in such samples can only be identified by applying more labor-intensive de novo techniques. De novo strategies inevitably introduce their known difficulties and shortcomings,20 for example the need for very good quality spectra (without missing parts of fragment ion series) and the need for accurate data. A handful of hybrid techniques have been devised, first, to circumvent these difficulties and, second, to allow for amino acid mutations: MS-Blast21 and SPIDER.22 These methodologies use inferred sequence tags (short stretches of identified amino acids) as a starting point for further analysis, based on the BLAST algorithm23 in the case of MS-Blast and a gapped alignment algorithm in the case of SPIDER. The Mascot database search engine similarly enables an error-tolerant database search allowing mutations.18
In this work, we present a methodology based on spectral clustering to deduce the identity of experimental fragmentation spectra of bioactive peptides. The novelty of the presented solution lies in the fact that it can use clustering algorithms25,26 that cope with post-translational modifications. Regular spectral alignment tools such as SpectraST40 can only match to identical sequences, not to alternatively post-translationally modified forms. Furthermore, the described solution is able to identify peak lists based on annotated peak lists of homologous peptides. Similarity is measured at the fragmentation spectrum level, not at the amino acid level (as in, e.g., MS-Blast, which is solely sequence based21), thus also taking into account valuable spectral information such as the intensities of the fragment ions.
This technique is promising, especially since bioactive peptides mostly show a high degree of conservation and regularly carry post-translational modifications. Annotated peak lists of homologous peptides, obtained or generated from public or in-house resources (either spectral or sequence databases), can be supplied as input. As a proof of concept, we chose to apply our method to fragmentation spectra obtained from LC−MALDI-TOF/TOF experiments on corpora cardiaca samples of two different locusts: Schistocerca gregaria and Locusta migratoria. Here, spectra generated in silico from known bioactive fruit fly peptides were used as a starting point in the identification process.
■
MATERIALS AND METHODS
Sample Preparation
Schistocerca gregaria and Locusta migratoria were reared under laboratory conditions,24 under a 13 h light, 11 h dark photoperiod at a room temperature of 32 °C and relative humidity between 40 and 60%. Locusts were kept under gregarizing conditions with at least 100 animals in cages of 38 cm × 38 cm × 38 cm and were fed daily ad libitum with cabbage (S. gregaria) or fresh grass and oatmeal (L. migratoria). One corpora cardiaca of an adult locust was carefully dissected. It was rinsed in a Ringer solution (8.77 g/L NaCl, 0.19 g/L CaCl2, 0.75 g/L KCl, 0.41 g/L MgCl2, 0.34 g/L NaHCO3, 30.81 g/L sucrose, 1.89 g/L trehalose, pH 7.2) and transferred to a 0.5 mL Eppendorf tube on ice, containing 50 μL of chilled extraction solvent (90% v/v methanol, 9% v/v Milli-Q water, 1% v/v formic acid (FA)). The sample was sonicated three times for 1 min and the remaining solid fraction was centrifuged down (20 min at 20000 × g). Next, the supernatant was filtered through a spin-down filter (0.22 μm, Ultrafree-MC, Millipore). Finally, the peptide extracts were dried in a vacuum centrifuge and stored at −20 °C before the chromatographic analysis.
HPLC
The samples were dissolved in Milli-Q water containing 5% acetonitrile (ACN) and 0.1% FA and separated on an HPLC system equipped with a C18 precolumn (PepMap 100, 5 μm, 100 Å, 0.3 × 5 mm, Dionex) to concentrate and desalt the sample. After loading the sample, the following gradient was applied for the mobile phase: from 5% ACN to 10% ACN in 5 min, to 25% ACN in 37 min, to 45% ACN in 13 min, and finally to 95% ACN in 4 min, at a flow rate of 200 nL/min over the analytical column (PepMap 100, 3 μm, 100 Å, 75 μm × 15 cm, Dionex). Every 15 s, a fraction was automatically spotted on a MALDI ground plate using the Proteineer FC (LC−MALDI Fraction Collector, Bruker Daltonics) after mixing with 1.5 μL of a saturated solution of α-cyano-4-hydroxycinnamic acid (CHCA) in 60% ACN/0.1% FA.
Mass Spectrometry
After evaporation of the solvent, the MALDI target was introduced into the mass spectrometer ion source. MS and MS/MS analyses were performed on an Ultraflex II instrument (Bruker Daltonics) in positive ion, reflectron mode. The instrument was calibrated externally with a commercial peptide mixture (peptide calibration standard, Bruker Daltonics). All spectra were obtained using Flex Control software (3.4, Bruker Daltonics). The plate was initially examined in MS mode, and spectra were recorded within a mass range from m/z 500 to 4000. Subsequently, the most intense peaks with S/N higher than 10 were selected and used for the optimized LIFT MS/MS analysis from the same target. Peaks were selected for MS/MS from lowest to highest abundance. All MS and MS/MS spectra were processed by means of the FlexAnalysis software (Bruker Daltonics), and m/z values and intensities of each peak were recorded in peak list files.
Overview of the Presented Prediction Tool
Figure 1 gives a schematic overview of the presented solution for identifying fragmentation spectra of peptides from poorly annotated organisms of interest. Known peptide peak lists (Figure 1, green box), retrieved from public repositories, comprehensive peptidomics studies on model organisms, synthesized catalogs, and/or generated in silico from protein databases (see Supplemental Data 1 for details on the supported input formats, Supporting Information), are used to aid the annotation of unknown experimental data obtained from the examined species (Figure 1, red box). The experimental data and publicly available peak list files need to be entered in the HomClus-tool as MGF-formatted files (the L. migratoria and S. gregaria files used for the described analysis are available as Supplemental Data 4a and 4b, Supporting Information). Furthermore, values for both the peptide and fragment ion matching tolerance can be specified as input parameters during the HomClus analysis, so that the accuracy of the input data can be taken into account.
Figure 1. Simplified schematic overview of the identification pipeline: gathering its publicly available, well-annotated and unknown peak lists to examine as input (IN: green and red boxes, left), how its identification strategy is built from three subunits: clustering, identification, and scoring module (PROCESS: dark blue boxes, center), and how it interacts with the information contained in the output (OUT: light blue box, right).
The annotation process (Figure 1, dark blue boxes) consists of three subunits: a clustering module, an identification module, and a scoring module. The clustering module is currently based on the Bonanza clustering algorithm,25,26 but as the process is set up modularly, other algorithms27 can be plugged in at a later stage. A notable advantage of the Bonanza algorithm is that it takes into account mass shifts corresponding to amino acid mutations and/or post-translational modifications (PTMs), the latter being abundantly present in bioactive peptides. In the clustering module, spectral information is used to group (cluster) spectra. The aim is to cluster experimental, unknown spectra together with known spectra that have a sequence annotation. Afterward, the identification module generates a list of putative identifications by introducing amino acid shifts and/or post-translational modifications to the known sequence of the annotated fragmentation spectrum that clustered together with the unknown experimental fragmentation spectrum. The introduced amino acid shifts and/or modifications should match the difference between the parent ion masses of the annotated and the unannotated, experimental spectra. Thus, both modified and homologous peptide forms can be identified. The generated candidates are subsequently ranked in the scoring module using information on the fragmentation rules of the technique applied (e.g., CID, PSD, ETD/ECD, IRMPD). At present, several scoring schemes are available: correlation score, HyperScore1, HyperScore2, and the SALSA score28 (see Supplemental Data 2, Supporting Information). Finally, a list is generated presenting the identified experimental peptides and their relationship to known peptides (sequence alignment), ranked on their calculated identity score. An expectation value (E-value) is also generated based on the scoring distribution, assessing the significance of the obtained score. The E-value calculation in this report is based on the HyperScore2 distribution, but another scoring scheme can be substituted. Extreme value distribution statistics29 are applied to generate the statistical significance of the peptide identifications. A plot of the score density versus the score values follows a type 1 extreme value distribution. To improve the estimation in the tail of the distribution, a fitting based on the 20−99% upper values of the log(−log) transformed empirical cumulative density function is commonly applied.30 A detailed example of the workflow, scoring schemes, and statistical significance assessment is provided as Supplemental Data 2 (Supporting Information).
The pipeline is built in Perl v10.5, E-value calculation is implemented in R (version 2.12.0), and a back-end portable SQLite (version 3.6.13) database is used for storage of all possible amino acid mutations up to 3 amino acids. The source code and documentation can be requested from the authors.
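The extreme value fit mentioned above can be illustrated with a minimal sketch. It is written in Python rather than the R used by the authors, and it is not their implementation. It assumes that the null scores come from randomly generated peptides and that the "20−99%" range refers to the empirical cumulative fraction; the function names `fit_gumbel_tail` and `e_value` are hypothetical.

```python
import math

def fit_gumbel_tail(null_scores, lower=0.20, upper=0.99):
    """Fit a type 1 extreme value (Gumbel) distribution by linear regression.

    For the Gumbel CDF F(x) = exp(-exp(-(x - mu) / beta)), the transform
    log(-log(F(x))) is linear in x with slope -1/beta and intercept mu/beta.
    Only ECDF points with lower <= F <= upper enter the fit (an assumed
    reading of the 20-99% range quoted in the text).
    """
    xs = sorted(null_scores)
    n = len(xs)
    pts = []
    for i, x in enumerate(xs, start=1):
        f = i / (n + 1)  # plotting position, keeps F strictly below 1
        if lower <= f <= upper:
            pts.append((x, math.log(-math.log(f))))
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    beta = -1.0 / slope      # scale parameter
    mu = intercept * beta    # location parameter
    return mu, beta

def e_value(score, mu, beta, n_candidates):
    """Expected number of null candidates scoring at least `score`."""
    survival = -math.expm1(-math.exp(-(score - mu) / beta))  # 1 - F(score)
    return n_candidates * survival
```

A candidate whose score lies far in the right tail of the fitted null distribution receives a small E-value; the choice of least-squares regression on the transformed ECDF is only one of several ways to estimate the two Gumbel parameters.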
Protein and Peptide Data
Two protein databases were compiled for database searching (Mascot search engine18), holding all Schistocerca gregaria and Locusta migratoria UniProt-KB protein entries, using the following queries: organism: “Locusta migratoria [7004]” (342 sequences) and organism: “Schistocerca gregaria [7010]” (104 sequences). These databases were used in the first identification round, prior to using the HomClus-tool. Furthermore, all fruit fly peptide precursors were downloaded from the UniProt-KB resource (query: taxonomy: “Drosophila melanogaster (Fruit fly) [7227]” AND annotation: (type:peptide)) to deduce all bioactive peptide sequences and modifications. The protein data downloaded from UniProt-KB (SwissProt formatted) for the two locust species and the fruit fly were uploaded into an in-house BioSQL database (http://www.biosql.org), from which the bioactive peptide sequences and annotated post-translational modifications were queried and pasted into a CSV-formatted file (see Supplemental Data 1 for the CSV input data format, Supporting Information). Subsequently, fragmentation spectra were generated in silico (by the presented HomClus-tool) for comparison with the S. gregaria and L. migratoria corpora cardiaca peak lists.
Identification Strategy of Peptides and PTMs Fingerprint
In a first identification round (see Figure 2), a database engine strategy (Mascot) was applied, performing searches against the corresponding sequence databases (S. gregaria and L. migratoria), using the most abundant modifications present (extracted from the PTMs fingerprinting,25 see Figure 3: Amidated (C-term), Cation:Na (C-term), Cation:Na (DE), Gln→pyro-Glu (N-term Q), Methyl (C-term), Oxidation (M)). Throughout all searches, the following input parameters were chosen: mass tolerance was set to 0.2 Da for peptides and 0.4 Da for fragments, and no cleavage enzyme for protein digestion was chosen. These mass thresholds are justified since a Bruker Ultraflex II MALDI-TOF/TOF instrument was used for the analysis. When working with a more accurate instrument, these thresholds should be adjusted accordingly.
Figure 2. Schematic overview of the identification strategy, consisting of three identification rounds. The first is based on a regular database search against the available UniProt-KB proteins of the corresponding locust. A second round is based on the reported HomClus-tool using fragmentation spectra generated in silico from the available fruit fly bioactive peptides in the UniProt-KB database, emphasizing homology searching (amino acid substitutions are allowed). The third round also adopts the HomClus-tool, using the already annotated experimental fragmentation spectra resulting from the previous identification rounds for clustering. Here the emphasis is on identification of the full modification profile (no amino acid substitutions, yet all post-translational modifications are allowed).
A second identification round (see Figure 2) is based on the presented HomClus-tool, whereby the emphasis is on identifying homologous peptide sequences, allowing amino acid substitutions. To cluster the unknown experimental fragmentation spectra (of both locust species) with their annotated Drosophila melanogaster peak list counterparts (generated in silico from the sequences), thresholds for matching peaks (MP) count, total ion current (TIC), and expectation value (e-value) were set to 10, 0.15, and 10⁻¹, respectively, in the clustering module of the program. To allow calculation of an e-value, 1000 random peptides with the same experimental mass are also generated and scored next to the list of candidates resulting from either amino acid substitutions or modifications. As their e-value is generally higher than 1, setting an e-value cutoff of 10⁻¹ for the true candidates in the clustering module assures that the false discovery score threshold is never reached. The candidate and identification modules make use of the following cutoffs: mass tolerance is set to 0.2 Da for peptides and 0.4 Da for fragments. These thresholds can be set when launching the HomClus-tool. In this round, the most common bioactive peptide modifications (C-terminal amidation, N-terminal Gln→pyro-Glu, and methionine oxidation) are taken into account. The PSI-MOD ontology (included in the Ontology Lookup Service,31 http://www.ebi.ac.uk/ontology-lookup/) is used to describe post-translational modifications throughout the program. Furthermore, the top 10 candidates (ranked on HyperScore2) are listed in the output; see Supplemental Data 3 for the generated output (Supporting Information).
In the final identification step (see Figure 2), the emphasis is on identifying extra modifications. All MS/MS peak lists corresponding to the identified peptides from the first (Mascot DB-engine) and second identification rounds are gathered into a “known in-house peptidomic data set”. This “known” pool of fragmentation spectra is clustered with the initial experimental MS/MS peak lists. In contrast to the former run, no amino acid substitutions are allowed during identification, and the complete set of post-translational modifications obtained from a prior clustering analysis25 of the experimental data was taken into account. The same thresholds for matching peaks (MP) count, total ion current (TIC), and expectation value (e-value) were set as in the second identification step. Only the top 5 candidates were chosen as output (ranked on HyperScore2, see Supplemental Data 3, Supporting Information).
Figure 3. Bar plots show the absolute values of the mass shifts present between MS/MS peak lists of both the Schistocerca gregaria and Locusta migratoria samples after clustering, using 15 as the matching peaks (MP) cutoff and 10⁻³ as the expectation value cutoff (using the Web tool at http://peppus.ugent.be/Peptidomics-Bonanza). Only clusters that score above these thresholds are considered. The bar plot is based on the mass differences between the two peak lists within the withheld clusters. The 0 Da shift is omitted and the results are truncated to a maximum of 40 Da. The minimum bar size is set to 5 for both samples.
■
RESULTS
For the S. gregaria sample, 15 peptides (cleaved from 7 different precursors) were identified in the first identification round. Six new peptides (cleaved from 5 different precursors) were identified with the reported HomClus-tool in the second identification round, with the emphasis on identifying homologous peptides using amino acid substitutions. In the third identification round, 13 extra peptides were identified (emphasis on revealing new modifications). For the L. migratoria sample the number of identifications is higher: 30 peptides (15 precursors), 9 peptides (7 precursors), and 17 peptides in the first, second, and third identification rounds, respectively. Overall, the number of peptide identifications is doubled for both biological samples. All identifications are listed in Table 1. The full output files of the HomClus-tool are provided as Supplemental Data 3 (3a and 3b represent the output of identification round 2 for S. gregaria and L. migratoria, respectively; 3c and 3d are the analogous output files for the third identification round, Supporting Information). Note that in the last identification round the fragmentation spectra corresponding to the annotated peptides were used as a starting point (“an in-house peptidome database”, so to speak). This not only demonstrates the functionality of the presented tool, but also shows that such a neuropeptide spectral library further improves identification efficiency, sensitivity, and reliability (in comparison to peak lists generated in silico from peptide sequences). Such a library considers all spectral features, including actual fragment intensities, neutral losses from fragments, and various uncommon or even unknown fragments, to determine the best matches.
Preliminary knowledge of the modification profile of specific peptidome samples greatly improves identification rates.20 Consequently, prior to running the first and last identification rounds, a PTMs fingerprint is constructed based on the experimental data (http://peppus.ugent.be/Peptidomics-Bonanza).25 Bar plots depicting the mass shifts present in the two locust samples are presented in Figure 3. For subsequent analysis (Mascot database searching and the presented HomClus-tool), the most abundant mass shifts corresponding to known post-translational modifications were taken into account: Amidation (−1 Da), Gln→pyro-Glu (−17 Da), Oxidation (+16 Da), Methylation (+14 Da), and Cation:Na adduct (+22 Da). An extra modification resulting in a 7 Da shift was also introduced. In a future release of the presented program, this prior PTMs fingerprinting will be included automatically and the resulting modifications taken into account in the matching algorithm, when chosen. At present, modifications have to be provided manually when running the program.
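The PTMs fingerprint described in this section can be pictured as a histogram of rounded precursor mass differences between clustered peak lists. The following Python sketch illustrates that idea only; it is not the Bonanza web tool. The input layout (one pair of precursor masses per clustered pair), the function name `ptm_fingerprint`, and rounding to the nearest whole dalton are assumptions. The filter defaults mirror the Figure 3 settings (0 Da omitted, truncation at 40 Da, minimum bar size of 5).

```python
from collections import Counter

# Nominal absolute mass shifts (Da) named in the text. The 7 Da shift is
# reported there without an assignment, so it is left unannotated here.
KNOWN_SHIFTS = {
    1: "Amidation",
    14: "Methylation",
    16: "Oxidation",
    17: "Gln->pyro-Glu",
    22: "Cation:Na",
}

def ptm_fingerprint(cluster_pairs, max_shift=40, min_count=5):
    """Count rounded absolute precursor mass differences within clustered pairs.

    cluster_pairs: iterable of (mass_a, mass_b) tuples, one per pair of peak
    lists that clustered together (hypothetical input layout).
    Returns {shift_in_Da: (count, annotation)} for shifts passing the filters.
    """
    counts = Counter()
    for mass_a, mass_b in cluster_pairs:
        shift = round(abs(mass_a - mass_b))
        if 0 < shift <= max_shift:  # omit the 0 Da shift, truncate at max_shift
            counts[shift] += 1
    return {
        shift: (n, KNOWN_SHIFTS.get(shift, "unassigned"))
        for shift, n in sorted(counts.items())
        if n >= min_count
    }
```

Shifts that pass the count threshold but are absent from the lookup table, such as the 7 Da shift reported above, are returned as "unassigned" so that they can be reviewed and supplied manually as extra modifications.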
■
DISCUSSION
Several extra bioactive peptides were identified by means of the HomClus-tool, using fruit fly endogenous peptides as a starting point, in comparison to the first identification round (e.g., corazonin, drostatin, two allostatin peptides, several short
Schistostatin
tr|Q94742|Q94742_SCHGR
Drostatin-3 (=Schistostatin)
sp|Q9VC44|ALLS_DROME
Locustamyotropin-2
Schistostatin
tr|Q94742|Q94742_SCHGR
Ion transport peptide
Schistostatin
tr|Q94742|Q94742_SCHGR
sp|P22396|LMT2_LOCMI
Schistostatin
tr|Q94742|Q94742_SCHGR
sp|Q26491|ITP_SCHGR
Insulin-related peptide transcript variant T1
tr|B1GV78|B1GV78_SCHGR
Ion transport peptide
Insulin-related peptide transcript variant T1
tr|B1GV78|B1GV78_SCHGR
sp|Q26491|ITP_SCHGR
Insulin-related peptide transcript variant T1
tr|B1GV78|B1GV78_SCHGR
Corazonin
Ion transport peptide
sp|Q26491|ITP_SCHGR
sp|Q26377|CORZ_DROME
Short neuropeptide F
sp|P86445|SNPF_SCHGR
Locustamyotropin-4
SchistoFLRFamide
sp|P84307|FARP_SCHGR
sp|P41490|LMT4_LOCMI
SchistoFLRFamide
sp|P84307|FARP_SCHGR
Locustamyotropin-4
Adipokinetic prohormone type 2
sp|P35808|AKH2_SCHGR
sp|P41490|LMT4_LOCMI
Adipokinetic prohormone type 2
sp|P35808|AKH2_SCHGR
Locustamyotropin-2
Adipokinetic prohormone type 1
sp|P18829|AKH1_SCHGR
Locustapyrokinin-2
Adipokinetic prohormone type 1
sp|P18829|AKH1_SCHGR
sp|P41488|LPK2_LOCMI
Adipokinetic prohormone type 1
sp|P18829|AKH1_SCHGR
sp|P22396|LMT2_LOCMI
description
Precursor ID
GPRTYSFG
EGDFTPR
LDPHHLA
SPLDPHHL
QTFQYSHGWTN
RLQQYGMPFSPRL
QYGMPFSPRL
QSMPTFTPRL
EGDFTPRL
ARPYSFGL
GRLYSFGL
GPRTYSFGL
AGPAPSRLYSFGL
QSDLFLLSPK
QAQSDLFLLSPK
DLFLLSPK
SPLDPHHLA
SPSLRLRF
PDVDHVFLRF
PDVDHVFLR
QLNFSTGWGRR
QLNFSTGWGR
QLNFTPNWGTGK
QLNFTPNWGT
PNWGTGK
sequence
Mr delta mass
homology score threshold
910.406
995.438
1333.586
1129.496
1328.537
931.446
985.379
973.494
1242.54
1096.453
1303.51
1147.432
1344.505
1158.41
35.2 50.36
24.68
−0.12 −0.092
−0.161
28.01
28.27
−0.088
29.18
50.51
−0.111
−0.097
28.65
−0.114
−0.081
31.48
−0.132
38.83
51.31
−0.109
−0.128
48.09
−0.141
43.56
28.65
−0.136
−0.106
45.43
−0.087
1348.427
1589.675
1176.465
1157.448
931.372
907.399
103.95 71.91 78.31 102.39
163.23
−0.179 −0.114 −0.175
−0.157
Methylation(6)
897.334
820.257
801.303
914.34
HomClus score
−0.104
1.50 × 10−1
1.20 × 10−1
1.60 × 10−2
5.00 × 10−3
4.20 × 10−1
8.90 × 10−4
3.00 × 10−2
1.50 × 10−1
1.00 × 10−3
1.50 × 10−1
8.80 × 10−2
8.20 × 10−4
2.00 × 10−3
1.50 × 10−1
2.60 × 10−3
Mascot e-val
68.06
32
32
33
33
33
32
32
32
33
33
33
33
34
33
32
identity score threshold
−0.093
22
24
26
21
26
21
22
21
17
24
26
23
49.43 59.26 49.21 27.38
−0.126 −0.115 −0.119 −0.105
Peptide identifications using HomClus tool (using PTMs fingerprint)
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
Amidated (C-term)
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
Amidated (C-term)
Amidated (C-term)
Peptide identifications using HomClus tool (≤3 AA subst)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Gln→pyro-Glu (N-term Q)
Gln→pyro-Glu (N-term Q)
Amidated (C-term)
Amidated (C-term)
Gln→pyro-Glu (N-term Q)
Gln→pyro-Glu (N-term Q)
Gln→pyro-Glu (N-term Q)
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
758.285
Schistocerca gregaria Peptide identifications using Mascot DB search engine (first round)
Ptm
Mascot peptide score
Table 1. Identification Results for Schistocerca gregaria and Locusta migratoria Bio-active Peptidesa
5.62 × 10−4
1.06 × 10−1
9.27 × 10−3
3.52 × 10−2
7.62 × 10−2
1.58 × 10−5
4.57 × 10−7
2.10 × 10−4
4.70 × 10−6
5.77 × 10−3
HomClus e-val
R_H_QTFQYSRGWTN
HN_YQ_RLHQNGMPFSPRL
N_Y_QNGMPFSPRL
V_M_QSVPTFTPRL
EQUAL
S_A_SRPYSFGL
rule
LIRP
LIRP
LIRP
LIRP
LIRP
LIRP
Locustamyotropin-1
Locustamyotropin-1
Locustamyotropin-1
Locustamyotropin-2
Diuretic hormone
Locustatachykinin-4
Locustapyrokinin-2
sp|P15131|LIRP_LOCMI
sp|P15131|LIRP_LOCMI
sp|P15131|LIRP_LOCMI
sp|P15131|LIRP_LOCMI
sp|P15131|LIRP_LOCMI
sp|P15131|LIRP_LOCMI
sp|P22395|LMT1_LOCMI
sp|P22395|LMT1_LOCMI
sp|P22395|LMT1_LOCMI
sp|P22396|LMT2_LOCMI
sp|P23465|DIUH_LOCMI
sp|P30250|TKL4_LOCMI
sp|P41488|LPK2_LOCMI
LIRP
sp|P15131|LIRP_LOCMI
Insulin-related peptide transcript variant T1
tr|B1GV78|B1GV78_SCHGR
Apolipophorin-3b
SchistoFLRFamide
sp|P84307|FARP_SCHGR
sp|P10762|APL3_LOCMI
SchistoFLRFamide
sp|P84307|FARP_SCHGR
Apolipophorin-3b
Adipokinetic prohormone type 2
sp|P35808|AKH2_SCHGR
sp|P10762|APL3_LOCMI
Locustamyotropin-4
sp|P41490|LMT4_LOCMI
Adipokinetic prohormone type 2
Schistostatin
tr|Q94742|Q94742_SCHGR
sp|P08379|AKH2_LOCMI
Corazonin
sp|Q26377|CORZ_DROME
Adipokinetic prohormone type 2
Corazonin
sp|Q26377|CORZ_DROME
sp|P08379|AKH2_LOCMI
Corazonin
description
sp|Q26377|CORZ_DROME
Precursor ID
Table 1. continued
QSVPTFTPRL
APSLGFHGVR
DAEEQIKANKDFLQQI
EGDFTPRL
VPAAQFSPRL
GAVPAAQFSPRL
GAVPAAQFSPR
TQAQSDLFLLSPK
TATQAQSDLFLLSPK
QSDLFLLSPK
QAQSDLFLLSPK
DLFLLSPK
ARPSAGGLLTGAVF
AQSDLFLLSPK
RPDAAGHVNIAEA
QNSIQSAVQKPAN
QLNFSAGWGRR
QLNFSAGWGR
QSDLFLLSPK
PDVDHVFLR
PDVDHVFLRF
QLNFSTGW
LQQYGMPFSPRL
GRLYSFG
QTFQYSHGWTN
QTFQYSHGWTN
QTFQYSHGWTN
sequence
Mr delta mass
homology score threshold
1136.399
1110.481
1264.512
955.342
1434.593
812.33
1371.452
1364.401
1356.422
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Gln→pyro-Glu (N-term Q)
Gln→pyro-Glu (N-term Q)
Gln→pyro-Glu (N-term Q)
Gln→pyro-Glu (N-term Q)
1126.516
1038.491
1887.766
932.395
1083.545
1211.576
1099.475
1446.631
1618.698
1129.518
1328.589
931.467
1315.631
1217.571
1319.524
1383.539
1273.538
1117.445
31.58 185.87 30.2
58.99 55.15 75.39
−0.093 −0.156 −0.079
−0.126 −0.106 −0.208
47.28 30.65 50.72 28.17 48.86 38.05 27.97 44.9 31.52 41.88
−0.084
−0.141 −0.103 −0.102 −0.074 −0.076 −0.204 −0.082 −0.098
64.37 −0.159
47.89
38.96
−0.095
−0.109
37.14
−0.135
−0.071
45.98
−0.172
38.71
30.14
−0.094
−0.094
36.63
−0.086
17
21
24
26
27
22
25
16
30
35
23
21
28
19
17
18
37
36
38
36
36
37
36
37
37
36
37
36
37
36
37
37
1.60 × 10−2
1.50 × 10−1
1.10 × 10−2
3.50 × 10−1
3.40 × 10−2
3.40 × 10−3
3.60 × 10−1
2.50 × 10−3
2.60 × 10−1
4.20 × 10−3
9.90 × 10−5
3.40 × 10−3
3.50 × 10−2
3.00 × 10−2
5.00 × 10−2
6.50 × 10−3
2.60 × 10−1
67.4
−0.114
37
92.04
HomClus score
−0.183
5.00 × 10−2
Mascot e-val 81.68
36
identity score threshold
−0.162
Locusta migratoria Peptide identifications using Mascot DB search engine (first round)
Gln→pyro-Glu (N-term Q); 7 DaShift(6)
Methylation(9)
Amidated (C-term); 22 DaShift(4)
Amidated (C-term); Gln→pyro-Glu (N-term Q); 22 DaShift(5)
Amidated (C-term)
Methylation(5)
Amidated (C-term); Gln→ pyro-Glu (N-term Q); 22 DaShift(4)
7 DaShift(C-term); Gln→pyro-Glu (N-term Q); 7 DaShift(11)
Amidated (C-term); Gln→pyro-Glu (N-term Q); 7 DaShift(11)
Peptide identifications using HomClus tool (using PTMs fingerprint)
Ptm
Mascot peptide score
8.21 × 10−2
2.17 × 10−3
1.86 × 10−1
5.56 × 10−2
1.95 × 10−6
4.30 × 10−1
3.02 × 10−2
1.28 × 10−3
1.58 × 10−3
HomClus e-val rule
Locustamyotropin-1
Locustamyotropin-3
sp|P41489|LMT3_LOCMI
LIRP
sp|P15131|LIRP_LOCMI
Diuretic hormone
Drostatins
sp|Q9VVF7|MIP_DROME
sp|P23465|DIUH_LOCMI
Neuropeptide Like
sp|Q9VV28|NPLP3_DROME
sp|P22395|LMT1_LOCMI
Short neuropeptide F
sp|Q9VIQ0|SNPF_DROME
Allostatins
sp|Q9VC44|ALLS_DROME
Short neuropeptide F
Cardio acceleratory peptide 2b
sp|Q9NIP6|CP2B_DROME
sp|Q9VIQ0|SNPF_DROME
Corazonin
sp|Q26377|CORZ_DROME
Short neuropeptide F
Locustatachykinin-1
sp|P16223|TKL1_LOCMI
sp|Q9VIQ0|SNPF_DROME
FMRFamide
sp|P10552|FMRF_DROME
SchistoFLRFamide
sp|P84306|FARP_LOCMI
Prepro ion transport peptide
Hypertrehalosaemic hormone
sp|P81626|HTF_LOCMI
tr|Q9XYF9|Q9XYF9_LOCMI
Sulfakinin
sp|P47733|LOSK_LOCMI
Short neuropeptide F
Locustamyotropin-4
sp|P41490|LMT4_LOCMI
sp|P86444|SNPF_LOCMI
Locustamyotropin-4
sp|P41490|LMT4_LOCMI
SchistoFLRFamide
Locustamyotropin-4
sp|P41490|LMT4_LOCMI
Short neuropeptide F
Locustamyotropin-3
sp|P41489|LMT3_LOCMI
sp|P86444|SNPF_LOCMI
Locustamyotropin-3
sp|P41489|LMT3_LOCMI
sp|P84306|FARP_LOCMI
description
Precursor ID
Table 1. continued
RQQPFVPR
DAEEQIKANKDFLQQI
VPAAQFSPRL
ARPSAGGLLTGA
AWQELQSAW
VVSVVED
SPSLRLFF
SLRLRF
PSRSPSLRLRF
GRLYSFGL
GANMGVVVFPR
QTFQYSHGWTN
GPSGFYGVR
HVFLxRF
SPLDAHHLA
SNRSPSLRLRF
SPSLRLRF
VDHVFLRF
PDVDHVFLRF
QVTFSRDWSP
QLASDDYGHMRF
RLHQNGMPFSPRL
LHQNGMPFSPRL
LHQNGMPFSPRL
RQQPFVPRL
QQPFVPRL
sequence
Mr delta mass
homology score threshold
959.383
1330.629
973.498
1030.484
1242.554
1203.475
1420.495
1550.672
1410.585
1394.613
1138.579
965.461 39.6 58.38 29.42 50.29 27.65 45.94 62.65 24.38 25.59 24.64 36.2
−0.094 −0.111 −0.134 −0.153 −0.125 −0.092 −0.097 −0.087 −0.085 −0.129 −0.099
27.74
−0.084
1116.454
744.275
963.39
788.434
1312.595
909.421
1144.486
1348.45
936.39
815.397
Amidated(C-term)
Methylation(8)
Amidated (C-term); 22 DaShift(13)
1040.5
1909.753
1083.532
1069.505
5.10 × 10−2
8.60 × 10−1
5.80 × 10−1
7.40 × 10−1
1.40 × 10−4
6.60 × 10−3
4.50 × 10−1
3.00 × 10−3
3.50 × 10−1
4.10 × 10−4
2.80 × 10−2
3.90 × 10−1
Mascot e-val
HomClus score
54.65 81.85 77.66 50.78
−0.164 −0.131 −0.086
−0.086
−0.068
104.24
−0.137
100.76
30.33
−0.134
−0.178
35.96 150.83
−0.092
68.41
36
36
36
36
37
37
37
38
37
37
37
36
identity score threshold
−0.083
28
17
18
22
21
22
16
25
25
23
29
20
26.95 80.41 67.06
−0.092 −0.202 −0.093
52.3
−0.088
Peptide identifications using HomClus tool (using PTMs fingerprint)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
Amidated (C-term)
Amidated (C-term)
Peptide identifications using HomClus tool (≤3 AA subst)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
Amidated (C-term)
Amidated (C-term); Oxidation (M)
Amidated (C-term)
Amidated (C-term)
Amidated (C-term); Gln→ pyro-Glu (N-term Q)
Peptide identifications using Mascot DB search engine (first round)
Ptm
Mascot peptide score
2.42 × 10−2
3.51 × 10−1
5.92 × 10−3
9.10 × 10−3
7.69 × 10−2
1.39 × 10−1
4.28 × 10−2
1.86 × 10−3
3.80 × 10−4
1.97 × 10−3
1.85 × 10−1
1.20 × 10−5
1.02 × 10−4
1.87 × 10−3
HomClus e-val
SS_AE_AWQSLQSSW
PG_ED_VVSVVPG
R_F_SPSLRLRF
TRUNCATED
AQ_SP_AQRSPSLRLRF
SP_GL_SRPYSFGL
LYA_VVV_GANMGLYAFPR
R_H_QTFQYSRGWTN
EQUAL
SN_HV_SNFIRF
rule
Locustamyotropin-3
Locustamyotropin-3
Locustamyotropin-4
Locustamyotropin-4
SchistoFLRFamide
SchistoFLRFamide
Corazonin
Corazonin
Short neuropeptide F
Short neuropeptide F
Short neuropeptide F
Prepro ion transport peptide
Prepro ion transport peptide
sp|P41489|LMT3_LOCMI
sp|P41489|LMT3_LOCMI
sp|P41490|LMT4_LOCMI
sp|P41490|LMT4_LOCMI
sp|P84306|FARP_LOCMI
sp|P84306|FARP_LOCMI
sp|Q26377|CORZ_DROME
sp|Q26377|CORZ_DROME
sp|Q9VIQ0|SNPF_DROME
sp|Q9VIQ0|SNPF_DROME
sp|Q9VIQ0|SNPF_DROME
tr|Q9XYF9|Q9XYF9_LOCMI
tr|Q9XYF9|Q9XYF9_LOCMI
SPLDAHHL
SPLDAHHLA
SPSLRLR
SNRSPSLRLR
PSLRLRF
QTFQYSHGWTN
QTFQYSHGWTN
PDVDHVFLRF
PDVDHVFLR
RLHQNGMPFSPRL
RLHQNGMPFSPR
QQPFVPR
RQQPFVPR
sequence
Mr delta mass
homology score threshold
22 DaShift(6)
Methylation(7)
Amidated(C-term)
Amidated (C-term); Gln→pyroGlu (N-term Q); 7 DaShift(11)
Amidated (C-term); Gln→ pyro-Glu (N-term Q); 22 DaShift(4)
Methylation(4); Amidated(Cterm)
Methylation(9)
Oxidation(7); Amidated(C-term)
Gln→pyro-Glu (N-term Q)
888.351
981.37
841.438
1184.548
886.463
1356.472
1371.443
1256.604
1110.482
1566.659
1438.594
853.369
1026.47
74.96 71.45 36.89
88.29 56.86 82.44 35 83.09 58.57
−0.123
−0.112 −0.092 −0.13 −0.08 −0.099 −0.099
49.37
−0.166
−0.068
40.43
−0.136
−0.105
32.3
HomClus score 88.09
Mascot e-val
−0.081
identity score threshold
−0.107
Peptide identifications using HomClus tool (using PTMs fingerprint)
Ptm
Mascot peptide score
5.89 × 10−2
7.57 × 10−4
1.06 × 10−1
1.06 × 10−2
4.04 × 10−4
1.07 × 10−1
4.26 × 10−2
1.31 × 10−2
2.73 × 10−3
8.94 × 10−1
5.40 × 10−1
8.74 × 10−1
3.32 × 10−2
HomClus e-val rule
a All peptide sequences obtained throughout the three identification rounds, listing the predicted peptide sequence, the PTMs present, and the precursor the peptide is cleaved from or based on, for the DB-search (1st round) and the HomClus-tool (2nd and 3rd round), respectively. The score and e-value are also provided for both tools. Identifications with an expectation value higher than 10−1 are marked in italic. The last column shows the rule applied to substitute the “known” peptide sequence into the experimental one.
description
Precursor ID
Table 1. continued
Figure 4. Corazonin peptide alignment block. All corazonin precursors from the Swiss-Prot reviewed protein database were aligned using Clustal Omega,35 and the block overlapping the corazonin peptides is shown. The alignment makes clear that the locust corazonin carries the His7 substitution, as does the Apis mellifera corazonin, in contrast to the other species. The locust corazonin nonetheless retains a Gln at position 4, like the other species (in contrast to the Thr4 substitution in the A. mellifera peptide).
Besides its ability to annotate new homologous peptides, the tool also proves more sensitive than a commonly applied database search engine. Some peptides scoring below threshold in the DB search are significantly identified with the HomClus-tool (e.g., identifications in round 2: schistostatin peptide based on drostatin counterpart and FLRFamide peptides based on the FMRFamide Drosophila counterpart; identifications in round 3: ion transport peptides, LIRP, myotropin and short neuropeptide F peptides). Lower quality spectral data (i.e., lacking part of the fragmentation information) can still provide enough information for clustering. The peptide candidates generated in the identification module are subsequently rescored in the scoring module, eventually leading to significant results. Another advantage is the ability to identify new forms of bioactive peptides carrying post-translational modifications (see Table 1). Many failed or missed identifications are attributable to the wealth of modifications on peptides, some of which may originate from in vivo post-translational processes that activate the molecule, whereas others could be introduced during tissue preparation. Besides known modifications such as methylation (+14 Da), oxidation (+16 Da), and the Cation:Na adduct (+22 Da), an as yet unannotated modification was also discovered (after consulting the UniMod database, www.unimod.org36). This modification causes a 7 Da shift on the C-terminal amino acid of the corazonin peptide, in addition to the N-terminal pyroglutamination and C-terminal amidation already present (pQTFQYSHGWTNa). It was observed in both Schistocerca and Locusta corazonin fragmentation spectra.
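The lookup of an observed mass shift against known modifications can be sketched as follows. This is a minimal illustration, not the HomClus-tool implementation: the function name `explain_shift`, the five-entry table, and the 0.1 Da tolerance are assumptions made for the example, whereas the monoisotopic shifts are the standard values for these modifications.

```python
from itertools import combinations

# Monoisotopic mass shifts (Da) for a small, illustrative subset of modifications.
# A real search would consult the full UniMod table instead.
KNOWN_SHIFTS = {
    "Methylation": 14.01565,
    "Oxidation": 15.99491,
    "Cation:Na": 21.98194,
    "Amidated": -0.98402,
    "Gln->pyro-Glu": -17.02655,
}

def explain_shift(observed, max_mods=2, tol=0.1):
    """Return every combination of up to max_mods distinct modifications
    whose summed mass shift lies within tol (Da) of the observed shift."""
    hits = []
    for n in range(1, max_mods + 1):
        for combo in combinations(KNOWN_SHIFTS, n):
            total = sum(KNOWN_SHIFTS[name] for name in combo)
            if abs(total - observed) <= tol:
                hits.append(combo)
    return hits

print(explain_shift(14.0))  # matches a single methylation
print(explain_shift(21.0))  # matches a sodium adduct combined with amidation
print(explain_shift(7.0))   # no match within this subset
```

Within this small subset, a 7 Da shift has no explanation as a single modification or as a pair, which is consistent with it being reported here as unannotated.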
Moreover, another form was encountered (only in the Schistocerca data set) carrying two of these 7 Da shift modifications, both located at the C-terminal part of the peptide (one C-terminally, the other on the C-terminal amino acid). Even after thoroughly searching the literature for this type of modification and mining available databases such as UniMod (www.unimod.org) and RESID (www.ebi.ac.uk/RESID/), the nature of this 7 Da shift could not be revealed. The 7 Da shift was observed in both the Schistocerca and Locusta samples (see Figure 3). A 6 Da shift is also clearly present in the Locusta sample. This can be explained by
neuropeptide F peptide isoforms, an additional FMRFamide peptide, and less significantly also a cardio acceleratory peptide 2b, see Table 1). Furthermore, extra Schistocerca peptides could be identified based on annotated Locusta peptide sequences (extra pyrokinin and myotropin peptides, see Table 1, bottom). These extra identifications show that the HomClus-tool is capable of identifying unknown, experimental fragmentation spectra using its built-in strategy. First, the unknown spectrum is clustered with a known, annotated fragmentation spectrum (e.g., CORZ_DROME, pQTFQYSRGWTNam) based on spectral similarity. Next, a list of putative identifications for the experimental spectrum is generated by introducing amino acid substitutions and/or post-translational modifications into the known sequence of the annotated spectrum that clusters with the experimental one. The mass shifts introduced in this way must match the mass difference between the parent ions of the annotated and the experimental spectrum. In the case of the corazonin peptide, a substitution of arginine (R) by histidine (H) at position 7 corresponds to this parent ion mass difference. After ranking all solutions, this solution (R to H substitution: pQTFQYSHGWTNam) comes out as the best scoring one. To further validate this specific HomClus-tool prediction, we constructed a multiple alignment of all corazonin peptide precursors found in UniProt-KB. This alignment shows that an R to H substitution at position 7 is plausible: it also occurs in the honeybee corazonin peptide, alongside a second substitution from glutamine (Q) to threonine (T) at position 4 (see corazonin alignment in Figure 4).
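The candidate-generation step for the corazonin example can be illustrated with a short sketch. It is not the HomClus-tool code: the function name `single_substitutions` and the tolerance values are assumptions, only single substitutions are enumerated, and the terminal modifications (pyroglutamate and amidation) are left out because they are shared by both peptides and cancel in the mass difference. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def single_substitutions(known_seq, delta_mass, tol=0.02):
    """List (position, old, new, candidate) for every single substitution
    whose mass shift matches delta_mass (experimental minus known parent
    mass, in Da) within tol. Positions are 1-based."""
    hits = []
    for i, old in enumerate(known_seq):
        for new, mass in RESIDUE_MASS.items():
            if new == old:
                continue
            if abs((mass - RESIDUE_MASS[old]) - delta_mass) <= tol:
                hits.append((i + 1, old, new, known_seq[:i] + new + known_seq[i + 1:]))
    return hits

# Known Drosophila corazonin core sequence; the parent mass difference is
# taken here as the R -> H residue mass difference (about -19.04 Da).
known = "QTFQYSRGWTN"
delta = RESIDUE_MASS["H"] - RESIDUE_MASS["R"]

for pos, old, new, candidate in single_substitutions(known, delta):
    print(f"{old}{pos}{new}: {candidate}")
```

With the tight 0.02 Da tolerance only the R7H candidate survives. Widening the tolerance to 0.05 Da additionally admits an F3Q candidate (shift of about −19.01 Da), which illustrates why the candidates must subsequently be ranked against the fragmentation spectrum rather than accepted on parent mass alone.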
Most of the peptides identified in this study have been detected before,32−34 using de novo sequencing (implying very good spectral quality) and/or Edman degradation studies (implying fair amounts of peptide material). The presented tool demonstrates that these identifications can also be obtained from MALDI-TOF/TOF data using homologous sequence information as the only input.
Besides endogenous neuropeptides, other types of peptides, for example antimicrobial peptides, toxins, or venom peptides, could also be identified with the reported tool, with an emphasis on revealing their PTMs. Finally, the tool can also be useful in regular proteomic analyses of organisms for which little or no sequence information is available.
combining this unannotated 7 Da modification and an amidation (−1 Da shift, also abundantly present in the Locusta sample), resulting in a net shift of 6 Da. Alongside the HomClus-tool, running a commonly applied DB search is still advisable. Not all peptide identifications obtained with the Mascot search (first identification round) could be confirmed in the second identification round, in which the presented tool used the set of annotated bioactive peptides retrieved from UniProt-KB as point of comparison in the clustering module. A database search can scan the complete precursor without applying an enzymatic cleavage rule (mostly trypsin). The HomClus-tool, on the other hand, only tries to match the provided bioactive peptide sequences or their corresponding MS/MS peak lists. As such, the spectrum identification rate can be higher with an initial Mascot search, since it will also pick up longer peptide variants (typically a set of variants, rather than just a single peptide, is purified from a tissue or organ9,37,38). Nonetheless, this problem can be (partially) overcome by using a “known” peptide pool (either sequence or spectral based) that is as comprehensive as possible. Several options are readily available to increase this search space if it is sequence based: (1) preprocessing of the peptide precursor sequences with the NeuroPred tool,39 instead of only using the annotated endogenous peptides, (2) usage of all bioactive peptides in all insects instead of limiting the search to one specific species, and (3) inclusion of peptides lacking a peptide annotation. For example, several peptide precursors have not been reviewed yet and remain in the TrEMBL section of UniProt-KB (e.g., Drosophila IFamide precursor32).
Of course, when a spectral resource is used as a starting point (e.g., public resources PRIDE1 or NeuroPedia8 or in-house peptidome spectral libraries), this same rule applies, that is, the clustering analysis should not be limited to one specific species. Also, the HomClus-tool allows for querying a mixed set of both fragmentation spectra and spectral data generated in silico, making the search space even more comprehensive. Further functionality is planned for a future version of the identification tool: deletions and extensions are not yet implemented in the identification module, and the current version only allows substitutions. Implementing them could create an even more complete list of candidates, aiding peptide identification. On the other hand, the method described in this report, using spectral data as input, could outperform existing tools because it takes fragmentation intensities into account, unlike MS-Blast, which is solely sequence based.21 Moreover, it can cope with mass shifts in the fragmentation spectrum, in contrast to typical spectral alignment tools such as SpectraST40 that can only find previously identified peptides based on very good spectral similarity. Finally, the ability to cope with a wide range of post-translational modifications is also favorable for revealing the sequence of bioactive peptides. As already mentioned, the presented tool is modularly built, making it feasible to plug in or replace modules. Other spectral clustering algorithms have been published27 and could be applied; within this study, Bonanza clustering was chosen and implemented. Previous research also shows that the clustering technique can be applied to fragmentation spectra of multiply charged parent ions, for example, generated by an electrospray tandem mass spectrometer.25 The scoring module can be customized accordingly. Similarly, other scoring functions can easily be implemented by changing the scoring module.
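To make concrete what an intensity-aware spectral comparison involves, the sketch below computes a normalized dot product between two binned peak lists. This is a generic similarity measure shown purely for illustration; it is not the Bonanza algorithm used in this study, it does not handle the mass shifts discussed above, and the function names, the 1 Da bin width, and the toy peak lists are all assumptions.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned

def normalized_dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity of two binned spectra: 1 for identical intensity
    patterns, 0 when no bins are shared."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy peak lists (m/z, intensity), not taken from the data sets.
spectrum_1 = [(175.1, 40.0), (304.2, 100.0), (433.2, 65.0)]
spectrum_2 = [(175.1, 35.0), (304.2, 90.0), (520.3, 20.0)]
print(round(normalized_dot_product(spectrum_1, spectrum_2), 3))
```

A pairwise similarity of this kind could serve as input to a clustering module; because the tool is modular, such a measure would be exchangeable without touching the identification or scoring modules.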
■
CONCLUSION
In conclusion, this report describes the use of spectral clustering to group fragmentation spectra of annotated homologous peptides to a set of experimental unannotated ones. To reveal the peptide sequence of the latter set, an identification and candidate-scoring pipeline was constructed, next to the clustering section. It shows unambiguously that the approach yields promising results. Not only does it successfully reveal the identity of many examined peptides, the spectral clustering also enables one to detect the complete modification profile of the peptides. Bearing in mind the lack of protein annotation for nonmodel organisms and the importance of PTMs as activators of biological peptides, this reported tool can be seen as a valuable addition to the already available identification software.
■
ASSOCIATED CONTENT
Supporting Information
Supplemental tables and figures. This material is available free of charge via the Internet at http://pubs.acs.org.
■
AUTHOR INFORMATION
Corresponding Author
*Gerben Menschaert, Laboratory for Bioinformatics and Computational Genomics, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium. E-mail: [email protected].
Notes
The authors declare no competing financial interest.
■
REFERENCES
(1) Vizcaino, J. A.; Foster, J. M.; Martens, L. Proteomics data repositories: providing a safe haven for your data and acting as a springboard for further research. J. Proteomics 2010, 73 (11), 2136−46.
(2) Vizcaino, J. A.; Cote, R.; Reisinger, F.; Foster, J. M.; Mueller, M.; Rameseder, J.; Hermjakob, H.; Martens, L. A guide to the Proteomics Identifications Database proteomics data repository. Proteomics 2009, 9 (18), 4276−83.
(3) Desiere, F.; Deutsch, E. W.; Nesvizhskii, A. I.; Mallick, P.; King, N. L.; Eng, J. K.; Aderem, A.; Boyle, R.; Brunner, E.; Donohoe, S.; Fausto, N.; Hafen, E.; Hood, L.; Katze, M. G.; Kennedy, K. A.; Kregenow, F.; Lee, H.; Lin, B.; Martin, D.; Ranish, J. A.; Rawlings, D. J.; Samelson, L. E.; Shiio, Y.; Watts, J. D.; Wollscheid, B.; Wright, M. E.; Yan, W.; Yang, L.; Yi, E. C.; Zhang, H.; Aebersold, R. Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 2005, 6 (1), R9.
(4) Deutsch, E. W.; Lam, H.; Aebersold, R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 2008, 9 (5), 429−34.
(5) Craig, R.; Cortens, J. P.; Beavis, R. C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004, 3 (6), 1234−42.
(6) Falkner, J. A.; Hill, J. A.; Andrews, P. C. Proteomics FASTA archive and reference resource. Proteomics 2008, 8 (9), 1756−7.
(7) Falth, M.; Skold, K.; Norrman, M.; Svensson, M.; Fenyo, D.; Andren, P. E. SwePep, a database designed for endogenous peptides and mass spectrometry. Mol. Cell. Proteomics 2006, 5 (6), 998−1005.
(8) Kim, Y.; Bark, S.; Hook, V.; Bandeira, N. NeuroPedia: neuropeptide database and spectral library. Bioinformatics 2011, 27 (19), 2772−3.
(9) Baggerman, G.; Boonen, K.; Verleyen, P.; De Loof, A.; Schoofs, L. Peptidomic analysis of the larval Drosophila melanogaster central nervous system by two-dimensional capillary liquid chromatography quadrupole time-of-flight mass spectrometry. J. Mass Spectrom. 2005, 40 (2), 250−60.
(10) Hummon, A. B.; Richmond, T. A.; Verleyen, P.; Baggerman, G.; Huybrechts, J.; Ewing, M. A.; Vierstraete, E.; Rodriguez-Zas, S. L.; Schoofs, L.; Robinson, G. E.; Sweedler, J. V. From the genome to the proteome: uncovering peptides in the Apis brain. Science 2006, 314 (5799), 647−9.
(11) Husson, S. J.; Clynen, E.; Baggerman, G.; De Loof, A.; Schoofs, L. Peptidomics of Caenorhabditis elegans: in search of neuropeptides. Commun. Agric. Appl. Biol. Sci. 2005, 70 (2), 153−6.
(12) Boonen, K.; Baggerman, G.; D’Hertog, W.; Husson, S. J.; Overbergh, L.; Mathieu, C.; Schoofs, L. Neuropeptides of the islets of Langerhans: a peptidomics study. Gen. Comp. Endocrinol. 2007, 152 (2−3), 231−41.
(13) Li, B.; Predel, R.; Neupert, S.; Hauser, F.; Tanaka, Y.; Cazzamali, G.; Williamson, M.; Arakane, Y.; Verleyen, P.; Schoofs, L.; Schachtner, J.; Grimmelikhuijzen, C. J.; Park, Y. Genomics, transcriptomics, and peptidomics of neuropeptides and protein hormones in the red flour beetle Tribolium castaneum. Genome Res. 2008, 18 (1), 113−22.
(14) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75 (17), 4646−58.
(15) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 4 (7), 1985−8.
(16) Magrane, M.; Consortium, U. UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011, 2011, bar009.
(17) Pruitt, K. D.; Tatusova, T.; Klimke, W.; Maglott, D. R. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009, 37 (Database issue), D32−6.
(18) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551−67.
(19) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466−7.
(20) Menschaert, G.; Vandekerckhove, T. T.; Baggerman, G.; Schoofs, L.; Luyten, W.; Van Criekinge, W. Peptidomics coming of age: a review of contributions from a bioinformatics angle. J. Proteome Res. 2010, 9 (5), 2051−61.
(21) Shevchenko, A.; Sunyaev, S.; Loboda, A.; Bork, P.; Ens, W.; Standing, K. G. Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. Anal. Chem. 2001, 73 (9), 1917−26.
(22) Han, Y.; Ma, B.; Zhang, K. SPIDER: software for protein identification from sequence tags with de novo sequencing error. J. Bioinform. Comput. Biol. 2005, 3 (3), 697−716.
(23) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215 (3), 403−10.
(24) Ashby, G. J. Locusts. In UFAW Handbook on the care and management of laboratory animals; Hume, C. W., Ed.; Churchill Livingstone: Edinburgh, 1972; pp 582−587.
(25) Menschaert, G.; Vandekerckhove, T. T.; Landuyt, B.; Hayakawa, E.; Schoofs, L.; Luyten, W.; Van Criekinge, W. Spectral clustering in peptidomics studies helps to unravel modification profile of biologically active peptides and enhances peptide identification rate. Proteomics 2009, 9 (18), 4381−8.
(26) Falkner, J. A.; Falkner, J. W.; Yocum, A. K.; Andrews, P. C. A spectral clustering approach to MS/MS identification of posttranslational modifications. J. Proteome Res. 2008, 7 (11), 4614−22.
(27) Flikka, K.; Meukens, J.; Helsens, K.; Vandekerckhove, J.; Eidhammer, I.; Gevaert, K.; Martens, L. Implementation and application of a versatile clustering tool for tandem mass spectrometry data. Proteomics 2007, 7 (18), 3245−58.
(28) Fenyo, D.; Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 2003, 75 (4), 768−74.
(29) Gumbel, E. J. Statistics of extremes; Columbia University Press: New York, 1958.
(30) Rehmsmeier, M.; Steffen, P.; Hochsmann, M.; Giegerich, R. Fast and effective prediction of microRNA/target duplexes. RNA 2004, 10 (10), 1507−17.
(31) Cote, R.; Reisinger, F.; Martens, L.; Barsnes, H.; Vizcaino, J. A.; Hermjakob, H. The Ontology Lookup Service: bigger and better. Nucleic Acids Res. 2010, 38 (Web Server issue), W155−60.
(32) Clynen, E.; Schoofs, L. Peptidomic survey of the locust neuroendocrine system. Insect Biochem. Mol. Biol. 2009, 39 (8), 491−507.
(33) Tawfik, A. I.; Tanaka, S.; De Loof, A.; Schoofs, L.; Baggerman, G.; Waelkens, E.; Derua, R.; Milner, Y.; Yerushalmi, Y.; Pener, M. P. Identification of the gregarization-associated dark-pigmentotropin in locusts through an albino mutant. Proc. Natl. Acad. Sci. U.S.A. 1999, 96 (12), 7083−7.
(34) Predel, R.; Gade, G. Identification of the abundant neuropeptide from abdominal perisympathetic organs of locusts. Peptides 2002, 23 (4), 621−7.
(35) Sievers, F.; Wilm, A.; Dineen, D.; Gibson, T. J.; Karplus, K.; Li, W.; Lopez, R.; McWilliam, H.; Remmert, M.; Soding, J.; Thompson, J. D.; Higgins, D. G. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011, 7, 539.
(36) Creasy, D. M.; Cottrell, J. S. Unimod: Protein modifications for mass spectrometry. Proteomics 2004, 4 (6), 1534−6.
(37) Fricker, L. D.; Lim, J.; Pan, H.; Che, F. Y. Peptidomics: identification and quantification of endogenous peptides in neuroendocrine tissues. Mass Spectrom. Rev. 2006, 25 (2), 327−44.
(38) Falth, M.; Skold, K.; Svensson, M.; Nilsson, A.; Fenyo, D.; Andren, P. E. Neuropeptidomics strategies for specific and sensitive identification of endogenous peptides. Mol. Cell. Proteomics 2007, 6 (7), 1188−97.
(39) Southey, B. R.; Amare, A.; Zimmerman, T. A.; Rodriguez-Zas, S. L.; Sweedler, J. V. NeuroPred: a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides. Nucleic Acids Res. 2006, 34 (Web Server issue), W267−72.
(40) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7 (5), 655−67.