
Technical Note

PPIP: An automated software for identification of bioactive endogenous peptides
Mingqiang Rong, Baojin Zhou, Ruo Zhou, Qiong Liao, Yong Zeng, Shaohang Xu, and Zhonghua Liu
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00718 • Publication Date (Web): 12 Dec 2018


Journal of Proteome Research

PPIP: An automated software for identification of bioactive endogenous peptides

Mingqiang Rong1#, Baojin Zhou2#, Ruo Zhou2, Qiong Liao1, Yong Zeng1*, Shaohang Xu2*, Zhonghua Liu1*

1 The National & Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China.
2 Deepxomics Co., Ltd., Shenzhen 518000, China

# These authors contributed equally to this work.

* To whom correspondence should be addressed:
Yong Zeng, the National & Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China. E-Mail: [email protected]
Zhonghua Liu, the National & Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China. E-Mail: [email protected]
Shaohang Xu, Deepxomics Co., Ltd., Shenzhen 518000, China. E-Mail: [email protected]

ACS Paragon Plus Environment


Abbreviations

MM, Mesobuthus martensii; ORF, open reading frame; SP, scorpion (Mesobuthus martensii); FG, Carausius morosus frontal ganglion; BR, Carausius morosus brain; GNG, Carausius morosus gnathal ganglion; NC, Carausius morosus ventral nerve cord


Abstract

Endogenous peptides play an important role in multiple biological processes in many species. Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is an important technique for detecting these peptides at a large scale. Herein, we present PPIP, a dedicated peptidogenomics software for identifying endogenous peptides based on peptidomics and RNA-Seq data. The software automates de novo transcript assembly from RNA-Seq data, construction of a protein reference database from the de novo assembled transcripts, peptide identification, function analysis and HTML-based report generation. The different functional components are integrated using Docker technology. The Docker image of PPIP is available at https://hub.docker.com/r/shawndp/ppip, and the source code, under the GPL-3 license, is available at https://github.com/Shawn-Xu/PPIP. A user manual of PPIP is available at https://shawn-xu.github.io/PPIP.

Keywords Bioinformatics, Proteogenomics, Proteomics, Peptidomics, Peptidogenomics


Introduction

Endogenous peptides from tissues or bodily fluids have been shown in previous studies to have multiple functions and applications. For instance, neuropeptides, such as peptide hormones from brain tissues, affect diverse behavioral functions including weight homeostasis, pain and psychiatric disorders1. Bioactive peptides from different species, such as scorpion2,3, centipede4,5, Conus6, snake7, frog8 and ant9, possess specific biological properties that make them potential ingredients for drug discovery. The emergence of liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has made large-scale identification of endogenous peptides extracted from tissue or body fluid samples possible. This strategy is referred to as peptidomics1. However, the analysis of peptidomics data remains challenging because of both experimental and computational issues, including endogenous peptide enrichment, unspecific protease digestion, the lack of complete protein reference databases and difficulties in biological interpretation. Therefore, it is essential to develop data analysis strategies specific to peptidomics. Peptidogenomics, which integrates high-throughput MS/MS-based peptidomics with RNA-Seq, has recently emerged as a promising strategy for deep analysis of the bioactive endogenous peptides in complex samples. However, highly efficient data analysis using this strategy remains challenging. Some software and pipelines have been developed to analyze peptidomics data10,11. Nevertheless, these tools focus only on specific aspects of the data analysis. No end-to-end software, covering every step from peptide identification to biological function interpretation, has been implemented to efficiently integrate peptidomics and RNA-Seq data for bioactive peptide discovery12.

Although bioinformatics tools have been developed to combine proteomics and RNA-Seq data for peptide identification12–16, none of them can easily be applied directly to peptidogenomics data because of the differences between peptidomics and proteomics data. Furthermore, these public proteogenomics tools focus primarily on peptide identification rather than biological function interpretation. Herein, we developed a Docker-based peptidogenomics software called PPIP, which is capable of end-to-end peptidomics data analysis, from customized database construction based on RNA-Seq data to peptide identification and biological function interpretation. To the best of our knowledge, this is the first comprehensive pipeline for peptidomics data analysis. Because PPIP is distributed as a Docker image, it can be installed and used without extensive knowledge of programming or bioinformatics. PPIP also generates an HTML-based report that includes all results.

Materials and Methods

Workflow and implementation

As illustrated in Figure 1, the PPIP workflow, which takes MS/MS-based peptidomics data and RNA-Seq data as input, is implemented as a Docker image and is broadly divided into the following steps.

(I) Transcript assembly based on RNA-Seq data: First, AfterQC17 was used for read filtering, trimming, error removal and quality control of single-end or paired-end Illumina RNA-Seq data in FASTQ format. Then, the remaining high-quality reads were assembled into transcripts by Trinity18 using the default parameters.

(II) Protein reference database construction: The assembled transcripts were translated to protein sequences by three-frame or six-frame translation, or based on the longest open reading frame (ORF) in all reading frames, using PGA16. These translated protein sequences were then taken as the protein reference database for MS/MS database searching. PPIP also accepts a user-provided protein sequence file in FASTA format as the reference database in the event that RNA-Seq data are lacking.

(III) Peptide identification based on MS/MS data: The MS/MS data in mzML or MGF format were searched against the reference protein database from step (II) using MS-GF+19 or Comet20,21 with a no-enzyme search. The parameters for MS-GF+ and Comet are configurable by users; by default, the identification result was filtered with a 1% false discovery rate (FDR) at the peptide level using the target-decoy strategy22. Only confident peptides were retained for downstream analysis. The identification result can be visualized using PDV23.

(IV) Function annotation: The confidently identified peptides and their precursor protein sequences were used for function annotation. First, the identified protein sequences were searched against the NCBI NR (non-redundant) database using Blast24, and the result was filtered with E-value ≤ 10^-5 by default. Second, a multiple sequence alignment (MSA) analysis was performed with a unified R interface25 to popular MSA algorithms, such as ClustalW, ClustalOmega or MUSCLE. The results of the multiple sequence alignment were visualized using an interactive JavaScript component26. Third, identified peptides were annotated against the VenomKB (v2.0) database27. Fourth, signal peptide prediction was performed for all identified precursor protein sequences using SignalP28. Fifth, GO and KEGG analyses were performed with KOBAS29,30 or Sma3s31.

(V) HTML-based report generation: An HTML-based interactive report was generated using the rmarkdown package32. This report contains all analysis results from PPIP (Figure S1). An example report can be viewed at https://shawn-xu.github.io/PPIP/report/Scorpion.html.

Datasets

Two public datasets with paired RNA-Seq and MS/MS data were used. The first dataset, D1 (also named SP), is from scorpion Mesobuthus martensii (MM) samples2; its RNA-Seq data (71,472,944 reads) were generated on an Illumina HiSeq 2000 and its MS/MS data (129,335 MS/MS spectra) on a Q-Exactive. The second dataset, D2, is from Carausius morosus samples33; its RNA-Seq data (49,784,152 reads) were generated on an Illumina HiSeq 2000 and its MS/MS data (111,759 MS/MS spectra) on a Q-Exactive Plus. Dataset D2 includes four different tissue samples: BR (brain of Carausius morosus), FG (frontal ganglion of Carausius morosus), GNG (gnathal ganglion of Carausius morosus) and NC (ventral nerve cord of Carausius morosus). The RNA-Seq data of D1 and D2 were processed as described in Steps (I) and (II) (Workflow and implementation). The two customized protein databases generated with six-frame translation can be accessed at https://github.com/Shawn-Xu/PPIP_meterials. Raw MS/MS files were converted to MGF or mzML files using the msconvert module of ProteoWizard34. The MS/MS peak list files were searched against the corresponding customized protein databases using four engines (MaxQuant35, MS-GF+, X!Tandem36 and Comet) with the following parameters: no enzyme digestion; variable modifications of oxidation (M) and Gln->pyro-Glu (N-term Q); fixed modification of carbamidomethyl (C); peptide mass tolerance of 10 ppm; and fragment mass tolerance of 0.05 Da. A target-decoy strategy22 was adopted for false discovery rate (FDR) calculation, with a peptide-level FDR threshold of 1%.
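To make the six-frame translation behind these customized databases concrete, the following minimal Python sketch translates a single transcript in all six reading frames. It is an illustration only: PPIP performs this step with the R package PGA, and the function names, the use of the standard genetic code, and the rendering of unknown codons as X are assumptions made here rather than details taken from PPIP.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard code is appropriate for the organism being analyzed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(transcript):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = transcript.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translate("ATGGCCTAA").items():
        print(frame, protein)
```

In a database built this way, each frame (or each stop-free segment of a frame) would typically become one FASTA entry, which is why a six-frame database is much larger than one based on the longest ORF alone.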

Results and Discussion

To evaluate the utility of PPIP, two public datasets with paired RNA-Seq and MS/MS data were used (D1 and D2, described in the Datasets section). As reported in a previous study37, the genome of scorpion MM has been sequenced, and 32,016 protein-coding genes were predicted for the MM genome. However, the genome of Carausius morosus has not yet been sequenced. Therefore, for the MS/MS data from scorpion MM, a protein reference database derived from either genome annotation or RNA-Seq data could be used for peptide identification, whereas for the MS/MS data from Carausius morosus, only a protein database constructed from RNA-Seq data is available.

First, we compared the peptide identification efficiency for dataset D1 using the protein database derived from genome annotation or from RNA-Seq data. We used Augustus (version 3.3)38 for gene prediction based on the scorpion MM genome sequences; 69,392 protein-coding genes were predicted and defined as database DB1. The RNA-Seq data were analyzed using Trinity through PPIP, and 92,018 transcripts were assembled. A customized protein database, DB2, was then generated from the assembled transcripts. The MS/MS data were searched against DB1 and DB2 separately using MS-GF+. As shown in Figure 2A, 504 and 607 peptides were identified using DB1 and DB2, respectively, with 1% FDR at the peptide level (Table S1, Supporting Information). Using DB2, derived from RNA-Seq data, 103 (20.44%) more peptides were identified than with DB1. Furthermore, 71.63% of the peptides identified with DB1 were also identified with DB2. A similar conclusion was obtained using Comet (Figure 2B) or X!Tandem (Figure 2C). This result indicates that, for peptide identification in dataset D1, the database derived from RNA-Seq data was comparable to or better than that derived from de novo gene prediction based on genome sequences.
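The 1% peptide-level FDR applied to these searches relies on the target-decoy idea: decoy hits estimate how many target hits above a score cutoff are false. The sketch below is a minimal, hedged illustration of that idea in Python. The `PSM` record, the assumption that higher scores are better, and the estimator decoys/targets are choices made here for illustration; the source does not describe how PPIP, MS-GF+ or Comet implement this step internally.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """Hypothetical peptide-spectrum match record (not a PPIP data structure)."""
    peptide: str
    score: float      # assumption: higher score means a better match
    is_decoy: bool    # True if the peptide comes from the decoy database


def filter_peptides_by_fdr(psms, fdr_threshold=0.01):
    """Return target peptides accepted at the given peptide-level FDR."""
    # Peptide level: keep only the best-scoring PSM of each peptide.
    best = {}
    for psm in psms:
        current = best.get(psm.peptide)
        if current is None or psm.score > current.score:
            best[psm.peptide] = psm

    ranked = sorted(best.values(), key=lambda p: p.score, reverse=True)

    # Walk down the ranked list and remember the longest prefix whose
    # estimated FDR (decoys / targets) stays within the threshold.
    targets = decoys = cutoff = 0
    for index, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr_threshold:
            cutoff = index

    return [p.peptide for p in ranked[:cutoff] if not p.is_decoy]


if __name__ == "__main__":
    demo = [
        PSM("PEPTIDEA", 10.0, False),
        PSM("PEPTIDEB", 9.0, False),
        PSM("PEPTIDEC", 8.0, False),
        PSM("DECOYPEP", 7.0, True),
        PSM("PEPTIDEE", 6.0, False),
    ]
    print(filter_peptides_by_fdr(demo, fdr_threshold=0.01))
```

With real data the ranked list contains thousands of peptides, so a 1% threshold admits roughly one decoy per hundred accepted targets; in the toy input the first decoy already ends the accepted list.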

We then evaluated the performance of different search engines for endogenous (non-tryptic) peptide identification. Several studies have evaluated different search engines for tryptic peptide identification22,39–41. However, the identification of endogenous non-tryptic peptides is more challenging than the identification of tryptic peptides because of the greatly expanded search space. In this study, MS-GF+, Comet, X!Tandem36,42 and MaxQuant35 were evaluated on datasets D1 and D2. The protein database used in this evaluation was derived from RNA-Seq data analyzed by PPIP. All search results were filtered with 1% FDR at the peptide level. As shown in Table 1 and Figure 3, Comet performed the best among the four search engines in terms of the number of identified peptides and running time. More specifically, Comet identified 38%, 50% and 319% more peptides than MS-GF+, X!Tandem and MaxQuant, respectively. MS-GF+ identified 24% and 262% more peptides than X!Tandem and MaxQuant, respectively. For sample GNG, X!Tandem identified more peptides than MS-GF+ and MaxQuant. Detailed peptide identification information for each dataset and search engine can be found in Tables S2-S5. Furthermore, we compared the four search engines' results with the neuropeptides/neuropeptide-like peptides identified in the Liessem et al.33 study. We first combined all identified peptides from the D2 dataset for each search engine. We then compared these peptides with the neuropeptides/neuropeptide-like peptides (243 peptides) from the Liessem et al. study. Peptides that exactly matched or overlapped with the previously reported peptides were considered neuropeptides/neuropeptide-like peptides. A final overview of the peptides identified by the four engines that are related to the Liessem et al. neuropeptides is illustrated in Figure 4, and the corresponding peptide lists are provided in Table S6. This result indicates that Comet and MS-GF+ share more peptides with the original paper than the other two search engines and identified more neuropeptides/neuropeptide-like peptides than X!Tandem and MaxQuant. In terms of speed, on the same computer with one physical CPU (8 cores) and 32 GB of memory, Comet was much faster than MS-GF+ (up to 10 times) and MaxQuant (up to 150 times), while the speeds of Comet and X!Tandem were comparable. Notably, MaxQuant was substantially slower than the other search engines. To confirm the search results of Comet and MS-GF+, we generated all annotated spectra for D1 and D2 using PDV; these annotated spectra are available at https://github.com/Shawn-Xu/PPIP_meterials/blob/master/Annotated_Spectra_of_Comet_and_MSGFPlus.zip. Considering the high performance of Comet and MS-GF+, they were integrated into PPIP as two optional search engines.
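The speed comparison can be checked directly against the elapsed times reported in Table 1. The following Python sketch parses the table's duration strings (for example `4day19h48min`) and computes how many times longer each engine ran than Comet on the same sample. The elapsed times are copied from Table 1; the helper names `to_minutes` and `slowdown_vs_comet` are illustrative and not part of PPIP.

```python
import re

# Durations in Table 1 look like "44min", "9h16min" or "4day19h48min".
_DURATION = re.compile(r"(?:(\d+)day)?(?:(\d+)h)?(?:(\d+)min)?")

# Elapsed times copied from Table 1 (sample -> engine -> duration string).
ELAPSED = {
    "SP": {"MS-GF+": "9h16min", "Comet": "1h12min",
           "X!Tandem": "1h15min", "MaxQuant": "1day12h30min"},
    "FG": {"MS-GF+": "7h37min", "Comet": "44min",
           "X!Tandem": "58min", "MaxQuant": "4day19h48min"},
    "BR": {"MS-GF+": "20h48min", "Comet": "3h35min",
           "X!Tandem": "3h11min", "MaxQuant": "5day11h35min"},
    "GNG": {"MS-GF+": "7h17min", "Comet": "55min",
            "X!Tandem": "1h10min", "MaxQuant": "4day12h56min"},
    "NC": {"MS-GF+": "7h44min", "Comet": "1h10min",
           "X!Tandem": "1h25min", "MaxQuant": "5day3h41min"},
}


def to_minutes(text):
    """Convert a Table 1 duration string to minutes."""
    match = _DURATION.fullmatch(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return days * 24 * 60 + hours * 60 + minutes


def slowdown_vs_comet(sample, engine):
    """Return how many times longer `engine` ran than Comet on `sample`."""
    times = ELAPSED[sample]
    return to_minutes(times[engine]) / to_minutes(times["Comet"])


if __name__ == "__main__":
    for sample in ELAPSED:
        ratios = {e: round(slowdown_vs_comet(sample, e), 1)
                  for e in ("MS-GF+", "X!Tandem", "MaxQuant")}
        print(sample, ratios)
```

The largest ratios occur for sample FG, where MS-GF+ and MaxQuant took about 10.4 and 157.9 times as long as Comet, in line with the "up to 10 times" and "up to 150 times" figures quoted above.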

One important feature of PPIP is its function annotation module. Five components were implemented: (1) a sequence similarity search, which finds sequences in public databases that are similar to the input peptide sequences and may therefore share their functions; (2) multiple sequence alignment, which can be used to study families of peptide/protein sequences; (3) peptide annotation based on a public bioactive peptide database; (4) signal peptide prediction; and (5) GO/KEGG analysis, which provides biological function annotation of the identified proteins. An example of such an analysis is shown in Figure 5. These results can be viewed through highly interactive JavaScript-based interfaces. A comprehensive sample report of PPIP is available at https://shawn-xu.github.io/PPIP/report/Scorpion.html.
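The similarity-based components rest on simple thresholds: an E-value of at most 10^-5 for the Blast search and, for the venom database annotation described in the Figure 5 legend, additionally at least 80% identical matches. As a hedged sketch, the Python function below applies both thresholds to BLAST+ tabular output. It assumes the default 12-column tabular layout (`-outfmt 6`); whether PPIP uses this layout internally is not stated in the text, and the function name and example rows are invented for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def significant_hits(lines, min_identity=80.0, max_evalue=1e-5):
    """Keep hits with identity >= min_identity (%) and E-value <= max_evalue."""
    # Skip blank lines and comment lines (present in -outfmt 7 style output).
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(data_lines, fieldnames=BLAST_COLUMNS, delimiter="\t")
    hits = []
    for row in reader:
        identity = float(row["pident"])
        evalue = float(row["evalue"])
        if identity >= min_identity and evalue <= max_evalue:
            hits.append((row["qseqid"], row["sseqid"], identity, evalue))
    return hits


if __name__ == "__main__":
    # Invented example rows: only the first one passes both thresholds.
    example = io.StringIO(
        "pep1\ttoxA\t95.0\t40\t2\t0\t1\t40\t5\t44\t1e-20\t85.1\n"
        "pep2\ttoxB\t62.5\t40\t15\t0\t1\t40\t3\t42\t1e-08\t50.2\n"
        "pep3\ttoxC\t88.0\t25\t3\t0\t1\t25\t1\t25\t0.002\t30.0\n"
    )
    for hit in significant_hits(example):
        print(hit)
```

Setting `min_identity=0.0` reproduces the E-value-only filter used for the NR search in step (IV).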

Conclusions

In conclusion, we developed a Docker-based software called PPIP, an integrated peptidogenomics tool for identifying and annotating endogenous peptides. PPIP can be used to analyze peptidogenomics data by users who lack extensive knowledge of programming and bioinformatics. Taken together, the above results demonstrate the utility of PPIP in identifying endogenous peptides based on both MS/MS-based peptidomics and RNA-Seq data. We anticipate that researchers in the peptidomics community will benefit from PPIP.

Supplementary Information

The article contains Supplementary Figure S1 and Supplementary Tables S1, S2, S3, S4, S5 and S6.
Figure S1. An overview of the PPIP HTML-based report.
Table S1. Scorpion peptide identification list with databases DB1 and DB2 by MS-GF+.
Table S2. MS-GF+ identification list with datasets D1 and D2.
Table S3. Comet identification list with datasets D1 and D2.
Table S4. MaxQuant identification list with datasets D1 and D2.
Table S5. X!Tandem identification list with datasets D1 and D2.
Table S6. Carausius neuropeptide-related peptides identified by four search engines.

Acknowledgment

This work was supported by the NSFC (81573320, 31670783) and Yunnan Province (P0120150033).

Conflict of Interest Statement

The authors have declared no conflicts of interest.


References

(1) Secher, A.; Kelstrup, C. D.; Conde-Frieboes, K. W.; Pyke, C.; Raun, K.; Wulff, B. S.; Olsen, J. V. Analytic framework for peptidomics applied to large-scale neuropeptide identification. Nature communications 2016, 7, 11436.
(2) Luan, N.; Shen, W.; Liu, J.; Wen, B.; Lin, Z.; Yang, S.; Lai, R.; Liu, S.; Rong, M. A Combinational Strategy upon RNA Sequencing and Peptidomics Unravels a Set of Novel Toxin Peptides in Scorpion Mesobuthus martensii. Toxins 2016, 8, 286.
(3) Yang, S.; Yang, F.; Zhang, B.; Lee, B. H.; Li, B.; Luo, L.; Zheng, J.; Lai, R. A bimodal activation mechanism underlies scorpion toxin-induced pain. Science advances 2017, 3, e1700810.
(4) Rong, M.; Yang, S.; Wen, B.; Mo, G.; Di Kang; Liu, J.; Lin, Z.; Jiang, W.; Li, B.; Du, C. et al. Peptidomics combined with cDNA library unravel the diversity of centipede venom. Journal of proteomics 2015, 114, 28–37.
(5) Yang, S.; Yang, F.; Wei, N.; Hong, J.; Li, B.; Luo, L.; Rong, M.; Yarov-Yarovoy, V.; Zheng, J.; Wang, K. et al. A pain-inducing centipede toxin targets the heat activation machinery of nociceptor TRPV1. Nature communications 2015, 6, 8297.
(6) Olivera, B. M. Conus peptides: biodiversity-based discovery and exogenomics. The Journal of biological chemistry 2006, 281, 31173–31177.
(7) Tashima, A. K.; Zelanis, A.; Kitano, E. S.; Ianzer, D.; Melo, R. L.; Rioli, V.; Sant'anna, S. S.; Schenberg, A. C. G.; Camargo, A. C. M.; Serrano, S. M. T. Peptidomics of three Bothrops snake venoms: insights into the molecular diversification of proteomes and peptidomes. Molecular & cellular proteomics : MCP 2012, 11, 1245–1262.
(8) Ma, Y.; Liu, C.; Liu, X.; Wu, J.; Yang, H.; Wang, Y.; Li, J.; Yu, H.; Lai, R. Peptidomics and genomics analysis of novel antimicrobial peptides from the frog, Rana nigrovittata. Genomics 2010, 95, 66–71.
(9) Touchard, A.; Téné, N.; Song, P. C. T.; Lefranc, B.; Leprince, J.; Treilhou, M.; Bonnafé, E. Deciphering the Molecular Diversity of an Ant Venom Peptidome through a Venomics Approach. Journal of proteome research 2018, 17, 3503–3516.
(10) Wu, C.; Monroe, M. E.; Xu, Z.; Slysz, G. W.; Payne, S. H.; Rodland, K. D.; Liu, T.; Smith, R. D. An Optimized Informatics Pipeline for Mass Spectrometry-Based Peptidomics. Journal of the American Society for Mass Spectrometry 2015, 26, 2002–2008.
(11) Manguy, J.; Jehl, P.; Dillon, E. T.; Davey, N. E.; Shields, D. C.; Holton, T. A. Peptigram: A Web-Based Application for Peptidomics Data Visualization. Journal of proteome research 2017, 16, 712–719.
(12) Sheynkman, G. M.; Johnson, J. E.; Jagtap, P. D.; Shortreed, M. R.; Onsongo, G.; Frey, B. L.; Griffin, T. J.; Smith, L. M. Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC genomics 2014, 15, 703.
(13) Wang, X.; Zhang, B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics (Oxford, England) 2013, 29, 3235–3237.
(14) Wen, B.; Xu, S.; Sheynkman, G. M.; Feng, Q.; Lin, L.; Wang, Q.; Xu, X.; Wang, J.; Liu, S. sapFinder: an R/Bioconductor package for detection of variant peptides in shotgun proteomics experiments. Bioinformatics (Oxford, England) 2014, 30, 3136–3138.
(15) Ruggles, K. V.; Krug, K.; Wang, X.; Clauser, K. R.; Wang, J.; Payne, S. H.; Fenyö, D.; Zhang, B.; Mani, D. R. Methods, Tools and Current Perspectives in Proteogenomics. Molecular & cellular proteomics : MCP 2017, 16, 959–981.
(16) Wen, B.; Xu, S.; Zhou, R.; Zhang, B.; Wang, X.; Liu, X.; Xu, X.; Liu, S. PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq. BMC bioinformatics 2016, 17, 244.
(17) Chen, S.; Huang, T.; Zhou, Y.; Han, Y.; Xu, M.; Gu, J. AfterQC: automatic filtering, trimming, error removing and quality control for fastq data. BMC bioinformatics 2017, 18, 80.
(18) Grabherr, M. G.; Haas, B. J.; Yassour, M.; Levin, J. Z.; Thompson, D. A.; Amit, I.; Adiconis, X.; Fan, L.; Raychowdhury, R.; Zeng, Q. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology 2011, 29, 644–652.
(19) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nature communications 2014, 5, 5277.
(20) Eng, J. K.; Hoopmann, M. R.; Jahan, T. A.; Egertson, J. D.; Noble, W. S.; MacCoss, M. J. A deeper look into Comet-implementation and features. Journal of the American Society for Mass Spectrometry 2015, 26, 1865–1874.
(21) Eng, J. K.; Jahan, T. A.; Hoopmann, M. R. Comet: an open-source MS/MS sequence database search tool. Proteomics 2013, 13, 22–24.
(22) Elias, J. E.; Haas, W.; Faherty, B. K.; Gygi, S. P. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nature methods 2005, 2, 667–675.
(23) Li, K.; Vaudel, M.; Zhang, B.; Ren, Y.; Wen, B. PDV: an integrative proteomics data viewer. Bioinformatics (Oxford, England) 2018, DOI: 10.1093/bioinformatics/bty770.
(24) Camacho, C.; Coulouris, G.; Avagyan, V.; Ma, N.; Papadopoulos, J.; Bealer, K.; Madden, T. L. BLAST+: architecture and applications. BMC bioinformatics 2009, 10, 421.
(25) Bodenhofer, U.; Bonatesta, E.; Horejš-Kainrath, C.; Hochreiter, S. msa: an R package for multiple sequence alignment. Bioinformatics (Oxford, England) 2015, 31, 3997–3999.
(26) Yachdav, G.; Wilzbach, S.; Rauscher, B.; Sheridan, R.; Sillitoe, I.; Procter, J.; Lewis, S. E.; Rost, B.; Goldberg, T. MSAViewer: interactive JavaScript visualization of multiple sequence alignments. Bioinformatics (Oxford, England) 2016, 32, 3501–3503.
(27) Romano, J.; Nwankwo, V.; Tatonetti, N. VenomKB v2.0: A knowledge repository for computational toxinology. 2018, DOI: 10.1101/295204.
(28) Petersen, T. N.; Brunak, S.; Heijne, G. von; Nielsen, H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature methods 2011, 8, 785–786.
(29) Wu, J.; Mao, X.; Cai, T.; Luo, J.; Wei, L. KOBAS server: a web-based platform for automated annotation and pathway identification. Nucl Acids Res 2006, 34, W720-4.
(30) Xie, C.; Mao, X.; Huang, J.; Ding, Y.; Wu, J.; Dong, S.; Kong, L.; Gao, G.; Li, C.-Y.; Wei, L. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucl Acids Res 2011, 39, W316-22.
(31) Casimiro-Soriguer, C. S.; Muñoz-Mérida, A.; Pérez-Pulido, A. J. Sma3s: A universal tool for easy functional annotation of proteomes and transcriptomes. Proteomics 2017, 17, DOI: 10.1002/pmic.201700071.
(32) JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang. rmarkdown: Dynamic Documents for R: https://CRAN.R-project.org/package=rmarkdown.
(33) Liessem, S.; Ragionieri, L.; Neupert, S.; Büschges, A.; Predel, R. Transcriptomic and Neuropeptidomic Analysis of the Stick Insect, Carausius morosus. Journal of proteome research 2018, 17, 2192–2204.
(34) Chambers, M. C.; Maclean, B.; Burke, R.; Amodei, D.; Ruderman, D. L.; Neumann, S.; Gatto, L.; Fischer, B.; Pratt, B.; Egertson, J. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nature biotechnology 2012, 30, 918–920.
(35) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature biotechnology 2008, 26, 1367–1372.
(36) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics (Oxford, England) 2004, 20, 1466–1467.
(37) Cao, Z.; Yu, Y.; Wu, Y.; Hao, P.; Di, Z.; He, Y.; Chen, Z.; Yang, W.; Shen, Z.; He, X. et al. The genome of Mesobuthus martensii reveals a unique adaptation model of arthropods. Nature communications 2013, 4, 2602.
(38) Stanke, M.; Keller, O.; Gunduz, I.; Hayes, A.; Waack, S.; Morgenstern, B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucl Acids Res 2006, 34, W435-9.
(39) Wen, B.; Du, C.; Li, G.; Ghali, F.; Jones, A. R.; Käll, L.; Xu, S.; Zhou, R.; Ren, Z.; Feng, Q. et al. IPeak: An open source tool to combine results from multiple MS/MS search engines. Proteomics 2015, 15, 2916–2920.
(40) Yuan, Z.-F.; Lin, S.; Molden, R. C.; Garcia, B. A. Evaluation of proteomic search engines for the analysis of histone modifications. Journal of proteome research 2014, 13, 4470–4478.
(41) Chamrad, D. C.; Körting, G.; Stühler, K.; Meyer, H. E.; Klose, J.; Blüggel, M. Evaluation of algorithms for protein identification from sequence databases using mass spectrometry data. Proteomics 2004, 4, 619–628.
(42) Craig, R.; Beavis, R. C. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid communications in mass spectrometry : RCM 2003, 17, 2310–2316.


Table 1. Summary of the peptide identification results and running time for all datasets using different search engines.

Search engine versions: MS-GF+ (v2017.01.13), Comet (2018.01 rev. 1), X!Tandem (2017.2.1.4), MaxQuant (v1.6.0.16).

| Dataset | Sample | Total spectra | MS-GF+ identified peptides | MS-GF+ elapsed time | Comet identified peptides | Comet elapsed time | X!Tandem identified peptides | X!Tandem elapsed time | MaxQuant identified peptides | MaxQuant elapsed time |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | SP | 129335 | 607 | 9h16min | 610 | 1h12min | 539 | 1h15min | 223 | 1day12h30min |
| D2 | FG | 12322 | 192 | 7h37min | 238 | 44min | 181 | 58min | 53 | 4day19h48min |
| D2 | BR | 69875 | 1775 | 20h48min | 2447 | 3h35min | 1633 | 3h11min | 766 | 5day11h35min |
| D2 | GNG | 13533 | 247 | 7h17min | 302 | 55min | 255 | 1h10min | 87 | 4day12h56min |
| D2 | NC | 16029 | 2808 | 7h44min | 3430 | 1h10min | 2268 | 1h25min | 1909 | 5day3h41min |

a) SP: Scorpion, Mesobuthus martensii
b) FG: Carausius morosus frontal ganglion
c) BR: Carausius morosus brain
d) GNG: Carausius morosus gnathal ganglion
e) NC: Carausius morosus ventral nerve cord


Figure Legends

Figure 1. Flowchart describing the core procedures of PPIP. Note that the workflow begins with a project workspace preparation and can then be run completely with a one-step script.

Figure 2. Comparison of the performance of genome guided (DB1) or RNA-Seq guided (DB2) database construction methods with [A] MS-GF+, [B] Comet and [C] X!Tandem.

Figure 3. Overlap of identified peptides using five datasets from: [A] scorpion MM, [B] frontal ganglion of C. morosus (FG), [C] brain of C. morosus (BR), [D] gnathal ganglion of C. morosus (GNG) and [E] ventral nerve cord of C. morosus (NC).

Figure 4. Overlap of identified neuropeptides/neuropeptide-like peptides by different search engines in the D2 dataset.

Figure 5. Overview of endogenous peptide annotation using PPIP. (A) PPIP used a wrapper function to access the SignalP 4.1 Server for signal peptide/non-signal peptide prediction. The “isSignal” column indicates whether it is a signal peptide. (B) PPIP utilized an MSA package to analyze endogenous peptides for similarities and produced a motif for each pattern. Users could manipulate the graphic elements by scrolling, selecting, or highlighting to capture some significant conserved sequences. (C) PPIP annotated the endogenous peptides with an open-source and publicly accessible venom resource. Significant hits must meet the following two conditions: 1) percentage of identical match ≥ 80%; and 2) Blast e-value ≤ 10^-5. (D) PPIP selected two optional tools for gene ontology or pathway annotation. Users could use the KOBAS wrapper script to annotate endogenous peptides when the internet was available, or they could choose Sma3s to perform this analysis in a local/offline network.


[Figure 1 (flowchart; image not reproduced). Boxes, in order: Project initialization (workspace); inputs: raw reads and mass spectra; Step I: Transcript assembly → candidate transcripts; Step II: Database construction → customized database; Step III: Peptide identification → peptide list; Step IV: Function annotation → functional results; Step V: Report engine → HTML report. Steps I to V are run by a one-step script.]

[Figure 2 (panels A-C; image not reproduced).]

[Figure 3 (panels A-E; image not reproduced).]

[Figure 4 (image not reproduced).]

[Figure 5 (panels A-D; image not reproduced). Recoverable panel text: sequence EAGAGRFNPTSEAGAGRFDPASAAGAGRFDPTL; "Identified endogenous precursor protein ID"; "Function annotation of NTX41956".]