
Article

X!TandemPipeline: a tool to manage sequence redundancy for protein inference and phosphosite identification Olivier Langella, Benoît Valot, Thierry Balliau, Mélisande Blein-Nicolas, Ludovic Bonhomme, and Michel Zivy J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 08 Dec 2016 Downloaded from http://pubs.acs.org on December 8, 2016


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

X!TandemPipeline: a tool to manage sequence redundancy for protein inference and phosphosite identification

Olivier Langella1, Benoît Valot2, Thierry Balliau1, Mélisande Blein-Nicolas1, Ludovic Bonhomme3 and Michel Zivy1*

1 PAPPSO, GQE - Le Moulon, INRA, Univ. Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, 91190, Gif-sur-Yvette, France

2 UMR 6249 Chrono-Environnement, CNRS, Université de Bourgogne Franche-Comté, 25030 Besançon

3 INRA/UBP, UMR 1095, Genetics, Diversity and Ecophysiology of Cereals, F63100 Clermont-Ferrand, France.

* Corresponding author: Michel Zivy; e-mail: [email protected]; Tel: 33 1 69 33 23 65

Olivier Langella and Benoît Valot contributed equally to this work


ACS Paragon Plus Environment


ABSTRACT

X!TandemPipeline is a software tool designed to perform protein inference and to manage redundancy in the results of phosphosite identification by database search. It provides the minimal list of proteins or phosphosites present in a set of samples, using grouping algorithms based on the principle of parsimony. For proteins, a two-level classification is performed: groups gather proteins sharing at least one peptide, and subgroups gather proteins that cannot be distinguished on the basis of the identified peptides. For phosphosites, an innovative approach based on the concept of the phosphoisland is used to gather overlapping phosphopeptides. The graphical interface of X!TandemPipeline allows users to launch X!Tandem identification, to inspect spectra and manually validate their assignment to peptides, to launch the grouping program, and to visualize elementary data as well as grouping and redundancy information. Identification results obtained from other search engines can also be processed. X!TandemPipeline results can be exported as ready-to-use tabulated files or as XML files that can be directly used by the PROTICdb database or by the MassChroQ quantification software. X!TandemPipeline runs fast, is easy to use and can process hundreds of samples simultaneously. It is freely available under the GNU General Public Licence v3.0 at http://pappso.inra.fr/bioinfo/xtandempipeline/.

Key words : protein inference, phosphopeptide, database search, bioinformatics, software, mass spectrometry


Introduction

The identification of proteins based on peptide fragmentation is the central process of most proteomics studies. It is generally performed in two steps by a search engine. First, the experimental fragmentation spectra are compared with those theoretically expected from the sequences stored in databases. Second, identified peptides are assigned to the proteins that contain their sequences. Peptides are therefore identified by mass spectrometry (MS), while proteins are identified by assembling the MS-identified peptides. Several search engines have been developed to perform these two operations (e.g. Mascot1, Sequest2, X!Tandem3, OMSSA4, Andromeda5, COMET6, Morpheus7, MS-GF+8). They generally produce an exhaustive list of all the proteins that can be obtained from the MS-identified peptides. However, due to sequence redundancy (i.e. the fact that the same peptide can be included in several proteins: shared peptides, also termed degenerate peptides), this raw list may contain proteins that are irrelevant with respect to the actual composition of the biological sample. Sequence redundancy may be either of biological origin (for instance, sequences shared by the members of protein families, splicing variants, allelic variants) or artifactual (same sequence with different names, presence of truncated forms, e.g. in EST databases). In order to consider only the proteins that are the most likely to be present in a sample, it is therefore necessary to refine the raw list of identified proteins by taking redundancy into account.

The process of establishing a biologically relevant list of proteins from MS-identified peptides has long been known as protein inference9. It is not straightforward and several types of algorithms have been proposed (reviewed in refs 10 and 11). Nesvizhskii et al.12 proposed to apply the principle of parsimony, invoking “Occam's razor”: “plurality should not be posited without necessity”. The principle of parsimony is to build the minimal list of proteins that can explain the presence of all the MS-identified peptides. Based on this, several tools for protein inference have been developed (e.g. DBParser13, IDPicker14, MassSieve15, IsoformResolver16), which differ in how results are classified, in the information provided on shared peptides and in the management of “subsumable” proteins (i.e. proteins that share peptides with more than one other protein).

Other methods of protein inference have been developed to address the specific case where different proteins are identified by the same set of peptides. This case is not well handled by the parsimonious approach, which fails to make a choice between the different possible proteins. These methods are based on the apportionment of peptide assignment probabilities between several proteins12,17,18. To this end, they use probabilistic models in which the solutions are approached iteratively. Some of these methods also take into account peptides that are present in the database but that were not identified by the search engine19,20. The advantage of these methods is that they allow the identification of proteins that would not be identified by using a strictly parsimonious approach. However, they may be difficult to implement, particularly when they depend on parameters that are difficult to estimate, such as peptide detectability. In addition, the results differ from one method to another10 and can depend on small variations of the initial assignment probabilities11.

In quantitative proteomics, the objective is not to discover new proteins but to compare protein abundances between different conditions and/or genotypes. For these comparisons to be meaningful and reliable, proteins have to be identified with high confidence, which makes protein inference critical. A similar issue exists in quantitative phosphoproteomics, where the objective is to study the variations of phosphopeptide abundances. In this case, not only can phosphopeptides be shared by different proteins, but several overlapping phosphopeptides can also share the same phosphosite(s).

Here, we present X!TandemPipeline, a software tool for protein inference and phosphopeptide grouping based on a strictly parsimonious approach, which provides minimal lists of proteins or phosphopeptides classified according to simple criteria. X!TandemPipeline is an integrated tool with a graphical user interface (GUI) that allows (i) the parametrization and launching of the X!Tandem search engine, (ii) the parametrization and launching of protein inference and phosphopeptide grouping, (iii) the visualization of spectra and manual validation of peptide-spectrum matches (PSMs), and (iv) the visualization of the final identification and grouping results. These results can finally be exported in a tabular format for human interpretation or in an XML format for subsequent use by other proteomics tools.

Materials and methods

Grouping algorithms

The algorithms implemented in X!TandemPipeline for inferring proteins and grouping phosphopeptides are based on the principle of “Occam's razor” described above. They include two grouping steps that successively define subgroups and groups. Proteins or phosphopeptides that are not necessary according to the parsimony criterion are successively discarded at these two steps.

With respect to proteins, the grouping algorithm proceeds as follows. In a first step, the proteins identified by exactly the same set of peptides are gathered in subgroups. They are considered as having the same probability of being actually present and all of them are selected (Figure 1 A1 to A3). In contrast, proteins identified by a subset of the peptides defining a subgroup are discarded, since they are unnecessary to explain the presence of these peptides. In a second step, the subgroups sharing at least one peptide are gathered to create groups, and the subgroups identified only by peptides shared with other subgroups are discarded. This removes redundancies that were not visible at the first step, because different peptides of a subgroup can be shared with different subgroups (Figure 1 A4). Occasionally, a tie can occur when several subgroups of a given group have no specific peptide but only different combinations of unspecific peptides. In this case, the subgroup that includes the smallest number of identified peptides is discarded, and the grouping process is reiterated until the tie disappears.

In the case of phosphoproteomics, the analysis of phosphosites is hampered not only by shared peptides but also by the possible co-occurrence of different phosphosites on the same peptide (or on overlapping peptides in case of miscleavage). To solve this problem, we developed the concept of the phosphoisland. A phosphoisland is a protein region that includes at least one phosphosite covered by one or several overlapping phosphopeptides. The position and number of phosphosites are not necessarily the same in all the overlapping phosphopeptides. This can reflect a biological reality (the different sites can be alternatively phosphorylated) or the uncertainty on the exact position of the phosphosite. The first step of the algorithm is the same as for protein inference, except that phosphopeptides replace peptides and phosphoislands replace proteins: the phosphoislands identified in different proteins with exactly the same set of phosphopeptides are gathered to create subgroups, and the phosphoislands defined by a subset of these phosphopeptides are discarded, as there is no evidence for their existence in the samples (Figure 1 B1 to B3). In the second step, the subgroups containing phosphoislands borne by the same protein are gathered to create groups. In this way, the group gathers all the phosphosites detected for a given protein, as well as all the phosphosites of the proteins that share a phosphosite with this protein (Figure 1 B4).
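The protein grouping steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the X!TandemPipeline implementation (which is written in Java): the function name `group_proteins`, the input format (a mapping from protein name to its set of identified peptides) and the toy data are assumptions made for the example. Two simplifications are also assumed: groups are formed before redundant subgroups are discarded, and redundant subgroups are removed one at a time, fewest peptides first, with ties between subgroups of equal size broken alphabetically.

```python
from collections import defaultdict


def group_proteins(protein_peptides):
    """Sketch of two-level parsimonious grouping (illustrative, hypothetical API).

    protein_peptides maps a protein name to the set of peptides identifying it.
    Returns a sorted list of groups; each group is a sorted list of subgroups;
    each subgroup is a sorted list of indistinguishable proteins.
    """
    # Step 1a: proteins identified by exactly the same peptides share a subgroup.
    by_peptides = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        by_peptides[frozenset(peptides)].append(protein)

    # Step 1b: discard subgroups whose peptides are a strict subset of another's.
    subgroups = [s for s in by_peptides
                 if not any(s < other for other in by_peptides)]

    # Step 2a: gather subgroups linked by at least one shared peptide.
    groups = []
    for s in sorted(subgroups, key=sorted):
        untouched, merged = [], [s]
        for g in groups:
            if any(s & t for t in g):
                merged.extend(g)
            else:
                untouched.append(g)
        groups = untouched + [merged]

    # Step 2b: within each group, discard subgroups that have no specific
    # peptide. Assumption: one removal at a time, fewest peptides first.
    result = []
    for g in groups:
        kept = list(g)
        while True:
            redundant = [s for s in kept
                         if all(any(p in t for t in kept if t is not s)
                                for p in s)]
            if not redundant:
                break
            kept.remove(min(redundant, key=lambda s: (len(s), sorted(s))))
        result.append(sorted(sorted(by_peptides[s]) for s in kept))
    return sorted(result)


# Toy data (hypothetical): P3 is a subset of P1/P2; P4 has only shared peptides.
example = {
    "P1": {"a", "b", "c", "f"},
    "P2": {"a", "b", "c", "f"},
    "P3": {"a", "b"},
    "P4": {"c", "d"},
    "P5": {"d", "e", "f"},
    "P6": {"x", "y"},
}
print(group_proteins(example))
```

With this toy input, P1 and P2 end up in one subgroup, P3 is discarded at the first step, P4 is discarded at the second step because both of its peptides are explained by other subgroups, and P6 forms a group of its own.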

Informatics

X!TandemPipeline is written in Java for interoperability and uses SWT as its graphical framework. It is a standalone software application with a GUI that allows the launching of the X!Tandem search engine and the post-processing of identification results without the need for a web server. The program is licensed under the GNU General Public License v3.0. The source code, binaries and documentation are available at http://pappso.inra.fr/bioinfo/xtandempipeline. A zip archive containing executables for Windows and Linux can be downloaded. The program can also be deployed or updated automatically by Java Web Start for Windows, Mac and Linux. Alternatively, a Debian package is available in a Debian repository (http://pappso.inra.fr/bioinfo/install_debian_jessie.php). Using this repository allows the automatic installation or update of all the required dependencies, including the Java Virtual Machine and the X!Tandem software.

X!TandemPipeline is naturally linked to the X!Tandem search engine. The GUI provides access to all X!Tandem parameters, with the possibility to save the settings in preset files, which facilitates searches in successive batches. However, the post-processing is not specific to X!Tandem and can also use the results of other search engines such as Sequest, Mascot, OMSSA, COMET or Morpheus. This makes it possible to compare or to combine the results obtained with different search engines. Support for the mzIdentML format21 is in progress.

Benchmarking

Three publicly available data sets were used for benchmarking: (i) a set of four LC-MS/MS injections of yeast proteins published in ref 22 (http://www.marcottelab.org/MSdata/Data_02/DATA/070119-zl-mudpit07-1.raw.zip.gz) and previously used in refs 23 and 24 to compare inference algorithms, (ii) the 12 lung cancer samples of a human data set from the PRIDE repository (PXD000603; LC1-LC12), also used in ref 24, and (iii) a maize data set from the PROTICdb database25, composed of 19 LC-MS/MS injections of leaf proteins (http://moulon.inra.fr/protic/extraction). The latest version of X!Tandem (X!Tandem Vengeance 2015.12.15.2) was used for database searching. For the yeast and lung cancer data sets (LTQ-Orbitrap, same databases searched as in ref 24), the X!Tandem parameters were the following: 25 ppm for precursor tolerance, 0.5 Da for fragment tolerance, trypsin digestion, no semi-tryptic peptides, one miscleavage allowed, cysteine carboxymethylation as a permanent modification and oxidation of methionine as a potential modification. For the maize data set (Q-Exactive mass spectrometer), the maize genome database (version 5a; http://www.maizesequence.org/) was searched with the same parameters, except that precursor and fragment tolerances were set to 10 ppm and 0.02 Da, respectively. The false discovery rate (FDR) was estimated by using reverse databases for the yeast and maize experiments and the human decoy database provided by ref 24 (ProteomeXchange ID: PXD003957) for the lung cancer experiment.
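The precursor tolerances quoted above are relative (ppm), whereas the fragment tolerances are absolute (Da). The short sketch below shows the arithmetic that converts a ppm tolerance into an acceptance window around a given value. The function name `ppm_window` and the value of 1000.0 are illustrative assumptions; how X!Tandem applies the tolerance internally is not described here.

```python
def ppm_window(value, ppm):
    """Return (low, high) bounds around value for a tolerance in parts per million."""
    delta = value * ppm / 1e6
    return value - delta, value + delta


# Hypothetical precursor value of 1000.0 with the two tolerances used above.
low_25, high_25 = ppm_window(1000.0, 25)   # half-width of 0.025
low_10, high_10 = ppm_window(1000.0, 10)   # half-width of 0.010
print(low_25, high_25)
print(low_10, high_10)
```

Because the window scales with the value, a 25 ppm tolerance is tighter in absolute terms for small precursors than for large ones, unlike a fixed tolerance in Da.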

Results

X!TandemPipeline features and displays

X!TandemPipeline includes two independent modules, one for peptide identification and the other for the post-processing of identification results, i.e. for protein inference or phosphopeptide grouping (a short user guideline is given in Supporting Information Text S-1). Peptide identification is performed by the X!Tandem search engine through a GUI that provides user-friendly access to all the search parameters (Supporting Information Figure S-1). Parameter settings can be saved and applied in batch to several samples. Input files should be compatible with the formats accepted by X!Tandem (e.g. mzXML, mzML, mzData, MGF).

Post-processing can be performed on the identification results obtained from several search engines, including X!Tandem, Mascot (.dat files) and the tools producing results in the pepXML format developed by the TPP group26. As in the first module, the files submitted simultaneously are processed in batch with the same settings. In the three available options (Combined, Individual and Phosphopeptide options, see Supporting Information Text S-1 for details), X!TandemPipeline allows the user to tune the thresholds for peptide and protein E-values and for the minimum number of identified peptides per protein. Data imported in the pepXML format can also be filtered according to the PeptideProphet and InterProphet probabilities. The user can choose whether the protein filters should be applied to all the samples together or individually and whether contaminant proteins should be removed. If desired, all the filters can be adjusted once the processing is completed.

Results are displayed in tabular form as a list of identified proteins (or phosphoislands) ranked by groups and subgroups, where each line introduces an identified protein (or phosphoisland) (Figure 2A, Supporting Information Figure S-2). Statistics relative to fragment mass precision and to FDR are easily accessible in the main window (Figure 2B). PSMs are listed in a second window with details relative to peptide identification (Figure 2C). Protein details can be inspected in a dedicated window (Figure 2D), where the complete sequence of the selected protein is displayed, with identified peptides highlighted, and the protein coverage, the E-value and the theoretical mass are specified. For each PSM, the annotated spectrum can be produced, zoomed and tuned according to m/z precision or minimum peak intensity (Figure 2E). Any PSM that does not meet the desired quality criteria can be easily removed. As the elimination of PSMs can affect the identification and grouping results, the automatic post-processing can be re-launched at any time. In the same way, undesired proteins or phosphoislands (e.g. contaminants) can be easily removed, and groups and subgroups are automatically re-computed.

X!TandemPipeline results can be exported as tabulated text files (.txt) or as a spreadsheet file (.ods) (see details in Supporting Information Text S-1). The sequences of identified proteins can be exported in FASTA format. E-values of peptides and proteins computed on the real database and on a decoy database can also be obtained to compute and choose adequate FDR thresholds. Finally, results can also be exported in formats suitable for quantification by MassChroQ27 (XIC integration), for making data publicly available in a PROTICdb database25, or for de novo analysis of spectra that were not identified by X!Tandem (http://pappso.inra.fr/bioinfo/denovopipeline). Like X!TandemPipeline, both MassChroQ and PROTICdb are adapted to the management of large numbers of samples.
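The exported target and decoy E-values can be used to choose an E-value threshold for a desired FDR. The sketch below is one possible way to do this and rests on assumptions: the FDR is estimated as the ratio of decoy hits to target hits at a given threshold (a common convention for separate target and decoy searches, not necessarily the estimator used by X!TandemPipeline), and the function names and E-values are hypothetical.

```python
def fdr_at_threshold(target_evalues, decoy_evalues, threshold):
    """Estimate the FDR as decoy hits / target hits at or below the threshold."""
    targets = sum(e <= threshold for e in target_evalues)
    decoys = sum(e <= threshold for e in decoy_evalues)
    return decoys / targets if targets else 0.0


def loosest_threshold(target_evalues, decoy_evalues, max_fdr):
    """Largest observed E-value whose estimated FDR stays within max_fdr."""
    best = None
    for t in sorted(set(target_evalues) | set(decoy_evalues)):
        if fdr_at_threshold(target_evalues, decoy_evalues, t) <= max_fdr:
            best = t
    return best


# Hypothetical E-values, for illustration only.
targets = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 5e-2]
decoys = [2e-3, 4e-2]
print(loosest_threshold(targets, decoys, max_fdr=0.25))
```

Every observed E-value is tried as a candidate threshold because the estimated FDR is not guaranteed to increase monotonically with the threshold.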

Computing performances

X!TandemPipeline was developed to process large experiments, and special attention has been paid to speed of execution and memory management. With a common desktop PC (Intel Core i7, 3.4 GHz, 8 GB of RAM), the 19 shotgun analyses of the maize leaf data set were processed with less than 1 GB of memory usage in 2 minutes and 12 seconds, of which 2 minutes and 6 seconds were used for loading the X!Tandem identification results and 6 seconds for the grouping process itself. For experiments with more than 100 complex samples, the amount of memory allocated to the Java Virtual Machine must be specified. This can be done by using a command line to execute X!TandemPipeline or by using the “bigmem” launcher available in the zip archive. With 6 GB of allocated memory, processing 229 MS shotgun analyses (10.5 GB of XML X!Tandem result files) took a total of 15 minutes and 45 seconds, of which 3 minutes and 7 seconds were used for the grouping process itself.
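The timings above can be put on a common scale with a little arithmetic. The sketch below only restates the reported figures as seconds per analysis and as the fraction of the run spent on grouping; the helper name `summarize` is an assumption made for the example.

```python
def summarize(analyses, total_s, grouping_s):
    """Return seconds per analysis and the fraction of time spent on grouping."""
    return {
        "seconds_per_analysis": total_s / analyses,
        "grouping_share": grouping_s / total_s,
    }


# Figures taken from the text: 19 analyses in 2 min 12 s (6 s of grouping),
# and 229 analyses in 15 min 45 s (3 min 7 s of grouping).
maize = summarize(19, 2 * 60 + 12, 6)
large = summarize(229, 15 * 60 + 45, 3 * 60 + 7)
print(maize)
print(large)
```

On these figures, the time per analysis is about 7 seconds for the 19-sample run and about 4 seconds for the 229-sample run, while grouping accounts for roughly 5% and 20% of the total time, respectively.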

Example of use

As an example of the grouping system on proteins, we present the results obtained when processing one of the 19 samples of the maize leaf dataset (Figure 3). Overall, a raw list of 4618 unique peptide sequences was identified by X!Tandem with an E-value [...]

[Figure 4, panel A: phosphoisland sequences, with protein accession and residue range]
>AT4G25500.2 (138-208) PERRRDRSPERRRRSPSPYKRERGSPDYGRGASPVAAYRKERTSPDYGRRRSPSP.YKKSRRGSPEYGRDRR
>AT4G25500.3 (146-216) PERRRDRSPERRRRSPSPYKRERGSPDYGRGASPVAAYRKERTSPDYGRRRSPSP.YKKSRRGSPEYGRDRR
>AT5G52040.1 (178-248) PERRRDRSPDRRRRSPSPYRRERGSPDYGRGASPVAHKR.ERTSPDYGRGRRSPSPYKRARLSPDYKRDDRR
>AT5G52040.2 (178-248) PERRRDRSPDRRRRSPSPYRRERGSPDYGRGASPVAHKR.ERTSPDYGRGRRSPSPYKRARLSPDYKRDDRR
Subgroup a7a1, containing phosphoislands 7.1.1 to 7.1.5
Subgroup a7a2, containing phosphoislands 7.2.1 to 7.2.5

[Figure 4, panel B: Group 3]
>AT3G47070.1 (12-88) ATVRVYATSTKGGSGGPKEEKNPIDFVLGFMTKQDQFYETNPLLKKVDEKEGTTTGGRGTVRGGKNSAPTPVPKKSE
Pepa3a1a1 (2 scans): EGTTTGGRGTVR
Pepa3a1a2 (2 scans): VDEKEGTTTGGR
Phosphoisland a3.a1.a1

Figure 4: Phosphoisland creation and grouping. A: Phosphopeptide Pepa7a2a1, which was identified in 2 scans, allowed the identification of 5 identical phosphoislands on 5 different Arabidopsis accessions. They were automatically grouped in subgroup a7.a2. Similarly, Pepa7a1a2 allowed the definition of subgroup a7a1 on the same accessions. As the subgroups were identified in the same proteins, they were gathered in a single group: group 7. Note that the group includes the products of two different genes. B: Phosphoisland a3.a1.a1 was detected on protein AT3G47070.1 thanks to two overlapping phosphopeptides. Each of the peptides was identified in 2 scans. The sequence was found only in protein AT3G47070.1, thus there is only one protein in this group. Note that this phosphoisland contains 2 phosphosites (red).
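Panel B can be reproduced with a small interval-merging sketch: each phosphopeptide is located on the protein sequence, and overlapping matches are merged into one island. This is an illustration under stated assumptions, not the X!TandemPipeline code: the function name `build_islands` is hypothetical, only the first occurrence of each peptide is considered, peptides must strictly overlap to be merged, every input peptide is assumed to be a phosphopeptide, and phosphosite positions are ignored. Coordinates are 0-based and relative to the sequence fragment shown in the figure.

```python
def build_islands(protein_seq, peptides):
    """Merge overlapping peptide matches into (start, end, sequence) islands.

    Coordinates are 0-based, half-open, and relative to protein_seq.
    """
    # Locate each peptide on the sequence (first occurrence only).
    spans = []
    for pep in peptides:
        start = protein_seq.find(pep)
        if start >= 0:
            spans.append((start, start + len(pep)))

    # Merge spans that overlap the previously merged span.
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e, protein_seq[s:e]) for s, e in merged]


# Fragment (12-88) of AT3G47070.1 and the two peptides of Figure 4B.
FRAGMENT = ("ATVRVYATSTKGGSGGPKEEKNPIDFVLGFMTKQDQFYETNPLLKK"
            "VDEKEGTTTGGRGTVRGGKNSAPTPVPKKSE")
islands = build_islands(FRAGMENT, ["EGTTTGGRGTVR", "VDEKEGTTTGGR"])
print(islands)
```

The two peptides overlap on the stretch EGTTTGGR, so they are merged into a single island with sequence VDEKEGTTTGGRGTVR, in line with the single phosphoisland a3.a1.a1 described in the caption.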