Article
A pipeline for differential proteomics in unsequenced species
Şule Yılmaz, Bjorn Victor, Niels Hulstaert, Elien Vandermarliere, Harald Barsnes, Sven Degroeve, Surya Gupta, Adriaan Sticker, Sarah Gabriël, Pierre Dorny, Magnus Palmblad, and Lennart Martens
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00140 • Publication Date (Web): 18 Apr 2016
Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Journal of Proteome Research
Research Article
A pipeline for differential proteomics in unsequenced species
Şule Yılmaz1,2,3, Bjorn Victor4, Niels Hulstaert1,2,3, Elien Vandermarliere1,2,3, Harald Barsnes5, Sven Degroeve1,2,3, Surya Gupta1,2,3, Adriaan Sticker1,2,3,6, Sarah Gabriël4, Pierre Dorny4, Magnus Palmblad7, Lennart Martens1,2,3*
1 Medical Biotechnology Center, VIB, 9000 Ghent, Belgium
2 Department of Biochemistry, Ghent University, 9000 Ghent, Belgium
3 Bioinformatics Institute Ghent, Ghent University, 9000 Ghent, Belgium
4 Veterinary Helminthology Unit, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
5 Proteomics Unit (PROBE), Department of Biomedicine, University of Bergen, Jonas Liesvei 91, Bergen 5009, Norway
6 Department of Applied Mathematics, Computer Science, and Statistics, Ghent University, 9000 Ghent, Belgium
7 Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden, The Netherlands
* Corresponding author: Prof. Dr. Lennart Martens, A. Baertsoenkaai 3, B-9000 Ghent, Belgium. Email: [email protected]; Tel: +3292649358
Abbreviations: PSM: peptide-to-spectrum match; 6RF tr: six reading frame translated; DB: database; PTM: post-translational modification; HQ: high quality; LQ: low quality; FDR: false discovery rate; MS/MS: tandem mass spectrometry; ESP: excretion/secretion protein
Keywords: mass spectrometry, differential proteomics, unique spectrum, unsequenced species
ABSTRACT
Shotgun proteomics experiments often take the form of a differential analysis, in which two or more samples are compared against each other. The objective is to identify proteins that are either unique to a specific sample or set of samples (qualitative differential proteomics), or that are significantly differentially expressed in one or more samples (quantitative differential proteomics). However, the success of such an analysis depends on the availability of a reliable protein sequence database for each sample. To enable such an analysis in the absence of a database, we here propose a novel, generic pipeline that compares samples at the spectrum level to detect unique spectra, using an adapted spectral similarity score derived from database search algorithms. We applied our pipeline to compare two parasitic tapeworms, Taenia solium and Taenia hydatigena, of which only the former poses a threat to humans. Furthermore, because the genome of Taenia solium recently became available, we were able to demonstrate the effectiveness and reliability of our pipeline a posteriori.
MAIN TEXT
INTRODUCTION
Shotgun mass spectrometry (MS) is the method of choice for proteomics analyses1,2. A typical proteomics experiment starts with the proteolytic digestion of a protein sample, usually with a protease such as trypsin. The resulting peptide mixture is then separated chromatographically and introduced into the mass spectrometer, where selected peptides are fragmented to produce fragmentation (MS/MS) spectra3.
Differential proteomics enables the detection of differences between two or more proteomes4,5. Such analyses can help to understand differences between cell lines6, the pathogenicity of virulent strains7,8, and diseases9, among others. Differential proteomics analyses can be either quantitative or qualitative. Quantitative differential proteomics aims to find differences in the amounts of expressed proteins using one of two strategies: label-free quantification10,11, or chemical or stable isotope labeling12. Qualitative differential proteomics, on the other hand, focuses on the identification of species-specific peptides or proteins among samples.
Regardless of the type of proteomics experiment performed, a key step in the data analysis is the assignment of peptide sequences to experimental MS/MS spectra13. This task can be carried out via two distinct computational approaches. Database searching is preferred if a reliable and reasonably complete protein sequence database is available, while de novo sequencing is employed in the absence of such a database14. Database search algorithms assign peptide sequences by generating theoretical MS/MS spectra from the peptides obtained in an in silico digest of the proteins in the given database, and then matching these theoretical spectra to the experimental spectra13. Examples of such database search algorithms include OMSSA15, X!Tandem16, Andromeda17 and MS Amanda18.
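The database search principle described above can be illustrated with a minimal Python sketch. It is not the implementation of any of the named search engines: the function names (`tryptic_digest`, `fragment_ions`, `count_matched_peaks`) are hypothetical, the digest allows no missed cleavages, only singly charged b- and y-ions are generated, and the shared-peak count is a deliberately simple stand-in for the more elaborate scoring functions that real engines use. The 0.5 m/z tolerance is likewise an assumed value.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein):
    """In silico trypsin digest: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


def count_matched_peaks(theoretical_mz, experimental_mz, tol=0.5):
    """Count theoretical peaks with an experimental peak within +/- tol."""
    return sum(
        any(abs(t - e) <= tol for e in experimental_mz)
        for t in theoretical_mz
    )
```

In such a scheme, every peptide from `tryptic_digest` whose mass fits the measured precursor would be turned into a theoretical spectrum with `fragment_ions` and ranked by its agreement with the experimental peak list.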
If no usable protein sequence database exists, a genome sequence can take the role of the search database19. In this case, MS/MS spectra are matched to peptides that are generated from the translated genome20. Finally, if no sequence is available at all, de novo sequencing can be used instead. In this approach, a putative sequence
is read directly from the MS/MS spectrum14. De novo sequencing, however, requires high-quality MS/MS spectra21, and even high-quality spectra can be assigned ambiguously or to incorrect amino acid sequences. As a result, de novo sequencing is currently not widely used in high-throughput studies22. The availability of a sequence database for the species of interest is therefore of great importance for successfully completing differential studies. Yet, despite the immense sequencing efforts of the last decade, the number of complete genome projects in the Genomes OnLine Database (GOLD; http://www.genomesonline.org) in the year 2014 was only 2,334, complemented by 21,471 permanent drafts. This number is quite low when compared to the estimated 2-8 million extant species23. The vast majority of species thus remains unsequenced, which makes it very challenging to carry out a differential proteomics analysis for these organisms.
To circumvent this lack of sequence information, we here propose a novel and generic experimental spectrum comparison pipeline that detects species-specific, peptide-derived spectra by comparing two sets of spectra without the need for a sequence database. Our pipeline consists of three consecutive steps: (i) spectrum preprocessing, (ii) removal of low-quality spectra, and (iii) pairwise spectrum comparison using an adapted spectrum similarity scoring function from database search algorithms such as Andromeda17 and MS Amanda18.
We tested our pipeline by performing a differential proteomics analysis between two tapeworm species: Taenia solium and Taenia hydatigena. Pigs are the intermediate host of T. solium, while humans normally act as the final host but can also become accidental intermediate hosts. T. solium affects millions of people worldwide, especially in developing countries24,25, and this has led the World Health Organization (WHO) to list T. solium parasitic infection among the 17 neglected tropical diseases24. T. hydatigena, on the other hand, which also has pigs as an intermediate host, does not infect humans. Current serological tests that detect circulating antigens in pigs fail to distinguish between these infections. A species-specific detection is crucial
for the advancement of T. solium research and for the implementation and evaluation of control programs. While T. hydatigena remains unsequenced, the high disease burden of T. solium prompted an ongoing project in Mexico that has provided the first T. solium draft genome sequence26. As a result, we were able to use this draft T. solium genome to test the power of our pipeline to identify unique peptide spectra, and could demonstrate that our pipeline combines good specificity with sensitivity.
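The three pipeline steps outlined in this introduction (preprocessing, quality filtering, pairwise comparison) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline itself: all names (`Spectrum`, `preprocess`, `passes_quality`, `similarity`, `unique_spectra`) and all parameter values are hypothetical, the quality filter is a simple peak-count placeholder for a trained spectrum quality classifier, and a binned cosine similarity is swapped in for the adapted Andromeda and MS Amanda-derived scoring function, which is not reproduced here.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Spectrum:
    precursor_mz: float
    peaks: list  # list of (mz, intensity) tuples


def preprocess(spec, top_n=50):
    """Step (i): keep the top_n most intense peaks, sorted by m/z."""
    kept = sorted(spec.peaks, key=lambda p: p[1], reverse=True)[:top_n]
    return Spectrum(spec.precursor_mz, sorted(kept))


def passes_quality(spec, min_peaks=5):
    """Step (ii): placeholder filter; assumed stand-in for a trained classifier."""
    return len(spec.peaks) >= min_peaks


def _binned(spec, bin_width):
    """Sum peak intensities into fixed-width m/z bins."""
    vec = {}
    for mz, intensity in spec.peaks:
        key = int(mz / bin_width)
        vec[key] = vec.get(key, 0.0) + intensity
    return vec


def similarity(a, b, bin_width=0.5):
    """Step (iii): binned cosine similarity (assumed stand-in score)."""
    va, vb = _binned(a, bin_width), _binned(b, bin_width)
    dot = sum(v * vb.get(k, 0.0) for k, v in va.items())
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def unique_spectra(set_a, set_b, precursor_tol=1.0, threshold=0.6):
    """Return spectra of set_a without a sufficiently similar partner in set_b."""
    a = [s for s in map(preprocess, set_a) if passes_quality(s)]
    b = [s for s in map(preprocess, set_b) if passes_quality(s)]
    unique = []
    for s in a:
        # Only compare spectra whose precursor m/z values are close.
        candidates = [
            t for t in b if abs(t.precursor_mz - s.precursor_mz) <= precursor_tol
        ]
        best = max((similarity(s, t) for t in candidates), default=0.0)
        if best < threshold:
            unique.append(s)
    return unique
```

Restricting comparisons to spectra with similar precursor m/z keeps the number of pairwise comparisons manageable; the tolerance and the similarity threshold would need to be tuned to the instrument and data at hand.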
MATERIALS & METHODS
Data sets
Our bioinformatics pipeline for differential proteomics made use of three data sets: two Taenia data sets to be compared with each other, and a training data set from Escherichia coli to train the spectrum quality tool (Table 1). T. solium metacestodes were collected from a naturally infected pig, and 24 h in vitro excretion/secretion protein (ESP) production was carried out, in Zambia as described by Victor and coworkers27. T. hydatigena metacestodes of various sizes (n = 14) were collected from the omentum and liver of naturally infected Zambian goats and carefully separated from the host tissue. We collected T. hydatigena metacestodes from goats because most goats are infected with this parasite, which facilitates sample collection. ESPs were prepared as described for T. solium27, except that the number of cysts per culture dish was adjusted according to the size of the cysts. The protein concentration of the ESPs was determined with a Pierce BCA Protein Assay Kit (Thermo Fisher Scientific Inc., Rockford, IL). Species confirmation of the Taenia metacestodes was done with a polymerase chain reaction assay followed by enzymatic digestion (PCR-RFLP) as previously described28.
T. solium and T. hydatigena ESP aliquots (40 μg) were precipitated in four volumes of ice-cold acetone, and the pellets were re-suspended in NuPAGE Sample Reducing Agent (10x) and NuPAGE LDS Sample Buffer (4x). Proteins were separated on a 4-12% NuPAGE Novex Bis-Tris gel. Gels were stained overnight with the Colloidal Blue Staining Kit (all reagents from Invitrogen). In-gel tryptic digestion, liquid chromatography and tandem mass spectrometry were performed as described previously in detail27. Briefly, each gel lane was cut into 48 identical slices. All slices were transferred to a 96-well plate (one slice/well), where the cysteines were reduced with DTT. The reduced cysteines were then alkylated with iodoacetamide. Proteins were digested overnight at 37°C with sequencing grade porcine trypsin. Peptides were collected from each well and analyzed on an amaZon ETD ion trap mass spectrometer (Bruker Daltonics, Bremen, Germany) equipped with an Apollo II ESI source and coupled on-line to a NanoLC-Ultra 2D plus LC (Eksigent, Dublin, CA, USA). Singly charged species
were excluded for MS/MS. Raw LC-MS/MS data were converted to mzXML using the Bruker compassXport tool (version 3.0.4) and then to MGF using msconvert (ProteoWizard29, version 3.0.4388), under the assumption that all spectra are either doubly or triply charged. The E. coli data set described by Mostovenko and coworkers30 was used to train the spectrum quality tool (SpecQual31). This data set was generated on the same instrument and with the same approach as the Taenia data sets, and it comprises high-quality (HQ) and low-quality (LQ) MS/MS spectra. The HQ training data set consists of 2,273 previously identified E. coli MS/MS spectra with p ≥ 0.9995, whereas the LQ training data set consists of 6,663 non-identified E. coli MS/MS spectra with p