hEIDI: An Intuitive Application Tool To Organize and Treat Large-Scale

Aug 25, 2016
Technical Note pubs.acs.org/jpr

hEIDI: An Intuitive Application Tool To Organize and Treat Large-Scale Proteomics Data

Anne-Marie Hesse,†,‡,§,⊥ Véronique Dupierris,†,‡,§,⊥ Claire Adam,†,‡,§,# Magali Court,†,‡,§ Damien Barthe,†,‡,§ Anouk Emadali,†,‡,§,∥ Christophe Masselon,†,‡,§ Myriam Ferro,†,‡,§ and Christophe Bruley*,†,‡,§

† Univ. Grenoble Alpes, BIG-BGE, F-38000 Grenoble, France

‡ CEA, BIG-BGE, F-38000 Grenoble, France

§ Inserm U1038, BGE, F-38000 Grenoble, France

Supporting Information

ABSTRACT: Advances in high-throughput proteomics have led to a rapid increase in the number, size, and complexity of the associated data sets. Managing and extracting reliable information from such large series of data sets require the use of dedicated software organized in a consistent pipeline to reduce, validate, exploit, and ultimately export data. The compilation of multiple mass-spectrometry-based identification and quantification results obtained in the context of a large-scale project represents a real challenge for developers of bioinformatics solutions. In response to this challenge, we developed a dedicated software suite called hEIDI to manage and combine both identifications and semiquantitative data related to multiple LC−MS/MS analyses. This paper describes how, through a user-friendly interface, hEIDI can be used to compile analyses and retrieve lists of nonredundant protein groups. Moreover, hEIDI allows direct comparison of series of analyses, on the basis of protein groups, while ensuring consistent protein inference and also computing spectral counts. hEIDI ensures that validated results are compliant with MIAPE guidelines as all information related to samples and results is stored in appropriate databases. Thanks to the database structure, validated results generated within hEIDI can be easily exported in the PRIDE XML format for subsequent publication. hEIDI can be downloaded from http://biodev.extra.cea.fr/docs/heidi.

KEYWORDS: data management, quantitative proteomics, protein grouping, relational databases, identification results



Journal of Proteome Research

Received: October 30, 2015

DOI: 10.1021/acs.jproteome.5b00853 J. Proteome Res. XXXX, XXX, XXX−XXX

© XXXX American Chemical Society

INTRODUCTION

Proteomics aims to characterize the protein components in biological systems. To achieve this characterization, a pipeline has arisen where proteins are extracted, sometimes fractionated, and conventionally digested into peptides using trypsin. The resulting peptide mixtures are submitted to a separation step prior to analysis by liquid chromatography coupled to mass spectrometry (MS). The combination of a separation method and MS yields two types of information: (i) the peptides present in the analyzed mixture are identified, along with the proteins they possibly originate from; this information may require validation based on user expertise or statistical considerations due to the possibility of errors in matching mass spectra to amino acid sequences;1 (ii) values related to the amount of each peptide in the analyzed mixture are determined. A proteomics data set can thus be seen as a repertoire of proteins present in a given fraction of a biological system of interest under a particular set of conditions.

To gain valuable knowledge about the system under investigation, it is often interesting to combine and compare multiple data sets. These data sets can correspond to various fractions of the biological system of interest or to the same fraction of the system in a range of environmental or physiological conditions.

Because of rapid increases in the number of samples analyzed as well as the size and complexity of the proteomics data generated, managing and extracting reliable information from large series of data sets now require the use of dedicated software platforms for data storage, processing, validation, and ultimately exploitation and export. LIMS solutions for proteomics exist to manage data through storage and traceability, such as CPAS, Mascot Integra, and mslims (for a review, see ref 2), among many others. These systems are compatible with all types of proteomics data (raw data, peak lists, processed data, database searching data, etc.). If we consider analysis to be part of a proteomics pipeline, it would be relevant to integrate it into LIMS or provide it as part of independent modules. Thus, analysis software, such as OpenMS,3 Trans-Proteomic Pipeline (TPP),4 and Compomics5 (for a review, see ref 6), is often designed as a suite of tools, with each tool addressing a particular step of the identification or quantification process. These individual tools are compatible with data exchange due to standard format input and output files (such as mzML or mzIdentML). However, data from different analysis levels (from individual fractions to the higher biological system level) can be difficult to extract at the end of the process.

Although dedicated tools have addressed the principles underlying protein inference, the capacity to handle hundreds or thousands of analyses in a single environment has become critical, especially when the experimental design involves quantitative data analysis. In addition, identification results are often presented in an oversimplified way, with the peptide−protein graph generally reduced to a list of proteins identified by a set of matching peptides. While this simplification is sometimes needed to present identification results, it can induce errors when data sets are compared on the basis of lists of protein accession numbers.

We describe an identification and quantification data management tool called hEIDI. hEIDI organizes MS-based identification data in a fine-grained relational model that allows a hierarchical representation and handling of identification data sets. The algorithms implemented ensure protein inference consistency and provide a fair method for comparison of different data sets. Data sets can be navigated and visualized in a user-friendly graphical user interface displaying MS evidence of peptides and proteins. Our software also includes additional useful features such as data set export in standard formats and options to calculate multiple spectral counts. Here we will introduce the benefits of hEIDI for spectral count-based quantification when considering the protein inference problem and the experiment design. The context-centric approach used was shown to ensure consistency with regard to protein grouping and further quantitative computations.

(E), stroma (S), and thylakoids (T). About 500 LC−MS/MS analyses were performed using an LTQ-FT instrument. In the context of the present demonstration, a subset of the AT_Chloro data set was selected corresponding to samples of the three compartments fractionated on gel and analyzed twice by LC−MS/MS. The experimental design is represented in Figure 3 and includes ∼150 database search results. hEIDI can be used to manage this experimental design through a hierarchical tree structure, where each “node”, called a “context”, corresponds to a particular experimental stage. Each context gathers together several identification results. All results were exported in the PRoteomics IDEntifications database (PRIDE XML) format and data were deposited in ProteomeXchange under identifier PXD000136.



Peptide and Protein Grouping

For each node of the experimental design, the protein grouping algorithm can be launched. First, peptides from different identification results attached to this node are grouped. To be grouped, peptides must have the same sequence and the same calculated mass. At the end of grouping, new peptides are attached to the parent context along with their child peptides. The peptide sequences are linked to proteins in a graph. Proteins that contain unique (or specific) peptides (not assigned to other proteins) are considered identified, and those that have no unique (specific) peptide are collated into groups. This process defines protein groups bearing the name of a representative member. The algorithm also calculates a protein score and coverage value. This process can be repeated from the context with the lowest level to the context with the highest level. Specific rules based on regular expressions, through the dedicated “Change typical protein” algorithm, can be applied to ensure consistency in the naming of protein groups between different contexts (e.g., preference for a given taxonomy or a particular database).
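The peptide-level step described above can be illustrated with a minimal Python sketch. It is not hEIDI's Java implementation: the `LeafPeptide` record, the `group_peptides` function, the rounding of the calculated mass to build a grouping key, and the peptide sequences and masses are all hypothetical, chosen only to show how peptides with the same sequence and the same calculated mass end up under one parent peptide.

```python
from collections import defaultdict
from typing import NamedTuple


class LeafPeptide(NamedTuple):
    """One validated peptide match (hypothetical record, not the MSIdb schema)."""
    sequence: str
    calc_mass: float      # calculated mass, including modifications
    score: float
    identification: str   # search result the match comes from


def group_peptides(leaves, mass_decimals=4):
    """Group leaf peptides that share the same sequence and calculated mass.

    Rounding the mass to build the key is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for leaf in leaves:
        key = (leaf.sequence, round(leaf.calc_mass, mass_decimals))
        groups[key].append(leaf)
    return dict(groups)


# Illustrative values only.
leaves = [
    LeafPeptide("AVLDEK", 673.3646, 41.0, "replicate_T1"),
    LeafPeptide("AVLDEK", 673.3646, 55.2, "replicate_T2"),
    LeafPeptide("AVLDEK", 689.3595, 30.1, "replicate_T2"),  # same sequence, different mass
    LeafPeptide("GFTLLR", 705.4173, 47.9, "replicate_T1"),
]
parent_peptides = group_peptides(leaves)
group_sizes = sorted(len(children) for children in parent_peptides.values())
```

In this toy input, the two `AVLDEK` matches with identical mass fall under one parent peptide, whereas the match with a different calculated mass forms its own group.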

EXPERIMENTAL PROCEDURES

Database Schema

Mass Spectrometry Identification databases (MSIdbs) based on PostgreSQL, an open-source Relational Database Management System, were created to contain validated peptide identification data. A database schema was set up with a core gathering information related to the peptides and proteins identified (Supplementary Figure 1, peptide/protein grouping panel). All of the information related to samples, database searching parameters, and quantitation results is collected in an array of tables (see relevant panels in Supplementary Figure 1).

Protein Group Filtering

Additional filtering parameters can be applied to proteins, including the number of peptides per protein group, peptide scores, the number of distinct amino acid sequences, or peptides appearing in a minimal number of child contexts. Using a regular expression, proteins can also be filtered for their accession number or description based on specified criteria. These criteria can be used, for example, to eliminate known contaminants from the list or to remove reverse proteins identified during target-decoy searches (“Protein filters” algorithm).
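As an illustration of the filtering described above, the following Python sketch removes protein groups whose accession or description matches a regular expression and applies a minimum peptide count. The dictionary layout, the `filter_protein_groups` function, the `###REV###` decoy prefix, and the `CONT_` accession are hypothetical; hEIDI's own “Protein filters” options may be organized differently.

```python
import re


def filter_protein_groups(groups, exclude_pattern, min_peptides=1):
    """Keep groups that do not match the pattern and have enough peptides."""
    rx = re.compile(exclude_pattern)
    kept = []
    for group in groups:
        # Regex filter on accession number or description.
        if rx.search(group["accession"]) or rx.search(group["description"]):
            continue
        # Filter on the number of peptides per protein group.
        if len(group["peptides"]) < min_peptides:
            continue
        kept.append(group)
    return kept


# Hypothetical protein groups; descriptions are placeholders.
groups = [
    {"accession": "AT1G07920.1", "description": "example protein A",
     "peptides": ["p1", "p2", "p3"]},
    {"accession": "###REV###AT2G01140.1", "description": "reversed sequence",
     "peptides": ["p4", "p5"]},
    {"accession": "CONT_TRYPSIN", "description": "known contaminant",
     "peptides": ["p6", "p7"]},
    {"accession": "AT4G38970.1", "description": "example protein B",
     "peptides": ["p8"]},
]
kept = filter_protein_groups(groups, r"^###REV###|contaminant", min_peptides=2)
kept_accessions = [g["accession"] for g in kept]
```

Here the decoy and the contaminant are removed by the regular expression, and the last group is removed because it has a single peptide.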

Architecture of hEIDI

hEIDI is a desktop application built on top of the NetBeans Platform, a generic Java framework providing reliable and flexible application architecture. The core algorithms used within hEIDI were developed in pure Java. Database connections to MSIdbs from hEIDI are established using the JDBC API (Java Database Connectivity Application Programming Interface) through the Hibernate Object Relational Mapping framework. hEIDI has been tested on a Windows platform. The software is open-source and distributed under the CECILL license. The source code is available upon request.

Comparing Contexts

To compare two or more contexts, a union reference is generated (i.e., a parent context containing all the contexts to be compared) and each individual context is compared to this reference, thus avoiding a comparison process where contexts are compared to each other. For each child context to be compared, the algorithm compares each protein group from the union reference with each protein group from the child context and computes a similarity index for each pair of protein groups. This similarity is based on the Dice coefficient9 using the formula (2 |A ∩ B|)/(|A| + |B|), where A and B represent either peptide sets (for a peptide similarity calculation) or protein sets (for a protein similarity calculation) of the protein groups to be compared. Two types of protein similarities are computed: one using same-set proteins only and the other using same-set and subset proteins. The protein group with the highest peptide and protein similarities, ranging between 0 and 1, is designated as the most “similar” protein group.
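The similarity index can be sketched in a few lines of Python. The `dice` function follows the formula given above; the `most_similar` helper, the dictionary layout, and the choice to rank candidates by peptide similarity first and protein similarity second are assumptions of this sketch, since the text does not specify how the two similarities are combined.

```python
def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def most_similar(reference_group, child_groups):
    """Return (name, (peptide_similarity, protein_similarity)) of the best match.

    Ranking by the (peptide, protein) similarity pair is an assumption.
    """
    best = None
    for name, group in child_groups.items():
        sims = (
            dice(reference_group["peptides"], group["peptides"]),
            dice(reference_group["proteins"], group["proteins"]),
        )
        if best is None or sims > best[1]:
            best = (name, sims)
    return best


# Hypothetical protein group from the union reference and two child groups.
reference = {"peptides": {"p1", "p2", "p3", "p4"}, "proteins": {"A", "B"}}
children = {
    "g1": {"peptides": {"p1", "p2", "p3"}, "proteins": {"A", "B"}},
    "g2": {"peptides": {"p4", "p5"}, "proteins": {"C"}},
}
best = most_similar(reference, children)
```

With these sets, `g1` shares three of the four reference peptides (peptide similarity 6/7) and both proteins (protein similarity 1), so it is selected over `g2`.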

Data Set

Data used to present the examples of how hEIDI works were taken from the data set used to generate the AT_Chloro database.7,8 To generate these samples, chloroplast extracts were fractionated to generate enriched fractions corresponding to the three main subchloroplastic compartments: envelope

protein similarities, ranging between 0 and 1, is designated as the most “similar” protein group.

Spectral Count Calculation

hEIDI calculates various metrics to determine spectral count: total number of MS/MS spectra corresponding to the identified peptides for a protein group, weighted or not by the context spectra sum (total number of spectra in the context) or by protein mass. Users can also opt to exclusively use spectra from unique peptides (specific spectral count). An additional metric, known as the adjusted spectral count, is also determined. This specific algorithm takes both unique peptides and shared peptides into account. The shared peptides are distributed across the different protein groups in nonequivalent proportions that will be discussed in the Results section.

Export

Each context can be exported in Excel or PRIDE XML format. Prior to the PRIDE XML export, the MSIdb must be populated with spectrum details (theoretical fragmentation of peptides with matched fragments) using Paloma, an external Java library developed by EDyP lab, available at http://biodev.extra.cea.fr/ docs/heidi. The PRIDE XML generated includes a header (metadata), as required to meet the PRIDE submission guidelines. The PRIDE XML Converter tool10 can be used to generate the header.

Figure 1. Overview of the main functionalities in hEIDI.

Concepts Behind hEIDI

As terminology surrounding the definition of a database schema can be misleading, the different objects managed in the MSIdbs (identification, context, peptide) must be defined. An “identification” is a protein sequence database search result that is included in the MSIdb. The experimental design can be represented as a hierarchical tree structure where each “node”, or “context”, corresponds to a particular experimental stage, including the corresponding identifications. In other words, a context contains either identification(s) or child context(s). Within an “identification”, an experimentally validated peptide spectrum match (PSM) with its sequence, charge state, and possible modification(s) is referred to as a “leaf peptide”. Leaf peptides are stored in the “peptide” table of the database and have multiple characteristics. For processing purposes, hEIDI considers that leaf peptides are uniquely characterized by their sequence and calculated monoisotopic mass (including any post-translational modifications). Thus, hEIDI can handle “identifications”, “contexts”, and “peptides” that are stored in the MSIdb to (i) perform consistent protein grouping according to the experimental design, (ii) carry out quantitative analysis using the various metrics of spectral counting, and (iii) export the data to public repositories. The following sections present the features and functionalities of hEIDI in greater detail through a case study.
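The relationship between contexts, identifications, and leaf peptides can be pictured with the following Python sketch. The class and field names are hypothetical and do not reproduce the MSIdb tables; the sketch only shows a tree in which each context holds identifications or child contexts, and in which distinct peptides are keyed by sequence and calculated monoisotopic mass. Peptide sequences and masses are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LeafPeptide:
    sequence: str
    charge: int
    mono_mass: float          # calculated monoisotopic mass, modifications included
    modifications: str = ""


@dataclass
class Identification:
    """One database search result stored in the MSIdb."""
    name: str
    leaf_peptides: List[LeafPeptide] = field(default_factory=list)


@dataclass
class Context:
    """One node of the experimental design."""
    name: str
    identifications: List[Identification] = field(default_factory=list)
    children: List["Context"] = field(default_factory=list)

    def all_leaf_peptides(self):
        """Yield every leaf peptide attached to this context or its descendants."""
        for ident in self.identifications:
            yield from ident.leaf_peptides
        for child in self.children:
            yield from child.all_leaf_peptides()

    def distinct_peptides(self):
        """Leaf peptides are characterized by sequence and calculated mass."""
        return {(p.sequence, p.mono_mass) for p in self.all_leaf_peptides()}


t1 = Context("Replicate T1", identifications=[
    Identification("T1_band01", [LeafPeptide("AVLDEK", 2, 673.3646),
                                 LeafPeptide("GFTLLR", 2, 705.4173)])])
t2 = Context("Replicate T2", identifications=[
    Identification("T2_band01", [LeafPeptide("AVLDEK", 3, 673.3646)])])
thylakoids = Context("Thylakoids", children=[t1, t2])
chloroplast = Context("Chloroplast", children=[thylakoids])

n_leaf = sum(1 for _ in chloroplast.all_leaf_peptides())
n_distinct = len(chloroplast.distinct_peptides())
```

In this toy tree, three leaf peptides are reachable from the top-level context, but only two distinct peptides remain once charge state is ignored.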



RESULTS

Data structuring is essential for handling multiple identification and quantification data sets. The present paper describes the design of a relational database (MSIdb) and a dedicated graphical user interface, hEIDI, whose main purposes are to store validated database searching results so as to manipulate them (e.g., protein grouping) and to compute spectral count-based quantitative results. So that data sets can be readily shared with the scientific community or further explored, a range of export modules have been implemented (e.g., in the PRIDE XML format).

hEIDI has been designed to be modular and can be easily integrated into a complete proteomics workflow. hEIDI handles validated results exported to an MSIdb by the IRMa software.11 In brief, IRMa can parse Mascot .dat files, applying automatic filtering parameters and calculating false discovery rates. The IRMa-generated results can then be exported to Excel or to MSIdbs compatible with hEIDI (Figure 1). IRMa can easily be adapted to consider database searching results from X! Tandem, Sequest, or OMSSA (ongoing work). Once results have passed through IRMa, data management within hEIDI can begin, based on a given experimental design that is translated into user-defined “contexts”. In this stage, results can be further filtered (e.g., with respect to the number of amino acids/peptide).

The context-centric approach allows two main modules for protein grouping and spectral count-based quantification to be implemented. First, a grouping algorithm is implemented based on the Occam’s razor principle,12 as in MASCOT.13 Within hEIDI, this algorithm has been optimized to handle multiple database search results. After grouping, a comparison algorithm, using information from protein inference, is used to measure similarity between protein lists corresponding to different contexts. Spectral counts can also be computed to provide quantitative data from these LC−MS/MS analyses. In addition, hEIDI is able to export data in the PRIDE XML format,14 in line with current proteomics guidelines.15

Integrating Identification Results and Experimental Design

A database schema was developed to store and handle MS-based proteomics data. Mass Spectrometry Identification databases (MSIdbs) were created, based on PostgreSQL, an open-source Relational Database Management System. The MSIdb’s core gathers information related to identified peptides and proteins (Supplementary Figure 1, peptide/protein grouping panel). All information associated with samples, database searching parameters, and quantitation results is collected in tables (see respective panels in Supplementary Figure 1), ensuring compliance with MIAPE guidelines (e.g., description of contextual metadata16). The database schema provides a model representing the concepts of identification, context, protein groups, and peptide, as defined above.

As indicated above, an identification is a database search result that is integrated in the MSIdb. In our current pipeline, Mascot search results are validated using IRMa11 and exported to MSIdbs, where they are represented as identifications (Figures 1 and 2E). Once the MSIdb is populated, the experimental design can be set up as illustrated in Figure 2A. The experimental design is represented as a hierarchical tree structure, where each “node”, or “context”, corresponds to a particular experimental stage, including the corresponding identifications.

As a running example for this paper, we have used a subset of the AT_Chloro database.7,8 In the experiment design shown in Figures 2A and 3, the whole experiment is represented by the “Chloroplast” context. The experimental design corresponds to the analysis of three chloroplast subfractions. Each of the enriched subfractions is defined as a new context: envelope (E), stroma (S), and thylakoids (T). As these three contexts have a parent context (Chloroplast), they are called child contexts. Each subfraction was separated on SDS-PAGE and excised as gel slices that were analyzed twice by LC−MS/MS. These two replicates are identified in the child contexts as Replicate i1 and Replicate i2 (i = S, E, or T) and have been merged into identifications. In the model protein illustrated, potential identifications are represented in a similar manner within a context and within identifications. The main difference between the two is that peptides in contexts are an aggregation of experimental peptide evidence and that this relationship is modeled by a hierarchical relationship between peptides.

Figure 2. Overview of the hEIDI display. Panel A: experimental design; Panel B: the protein panel; Panel C: the proteins list; Panel D: comparison result; Panel E: identifications available in the chosen MSIdb.

Figure 3. Experimental design of a proteomics experiment based on chloroplast fractionation.7 On the right-hand side is the experiment as represented in hEIDI with its hierarchical view.

Context-Related Protein Grouping

A key issue related to large data sets hinges on compiling results to give the most comprehensive and consistent view of a set of individual analyses. Protein grouping is a key step in the validation process: validation criteria applied to accept or reject PSM need to be consistent because the minimal list of proteins generated will be used to compare identification results. To achieve this, the protein grouping process is closely related to protein inference.17,18 Protein inference consists of reporting the most plausible list of proteins present in a sample on the basis of the experimentally identified peptides. The review published by Huang et al.19 summarizes and classifies, based on the input information, the different methods used to achieve this goal. While some of these methods consider additional information such as gene model20,21 or protein interaction network when inferring the list of proteins, in general, the relationship between observed peptides and proteins is represented as a bipartite graph. On the basis of this information alone, some programs, such as ProteinProphet22 or Scaffold,23,24 infer proteins using a probability-based model, while others refer to the Occam’s razor25,26 approach and apply the parsimony principle initially used by Mascot.13 Recently, a tool was developed to implement different protein inference methods.27 hEIDI provides a solution based on a combination of the bipartite graph and the parsimony principle (see Experimental Procedures section). This process allows protein groups to be defined with a representative member, generally taken as the name of the protein group. When a group is defined, it means that at least one protein from the group is present in the biological sample, but it is unclear which protein (or proteins) of the group is actually present.

While this grouping process is usually performed at the level of an individual search result, it can also be done with multiple identifications, provided a critical step to merge peptides and proteins into a nonredundant and consistent set of proteins is performed at the context level (Figure 2C). In hEIDI, merging is achieved by the protein grouping algorithm, which works in two steps, starting from the lowest-level context. First, peptide-level grouping is performed. At the lowest context level (for example Replicate T2 in Figure 3), all of the leaf peptides are collected from the context identifications. Leaf peptides sharing the same sequence and calculated monoisotopic mass (incorporating any post-translational modifications) are gathered into a single representative peptide created within an overall context. This representative peptide is linked to the parent context with new characteristics corresponding to the best leaf peptide (experimental peptide), that is, the occurrence of this peptide for which the best identification score was awarded. Peptide features (charge, score, modifications, experimental mass, retention time, and protein matches) are also stored at the parent-context level.

Once peptide grouping has been completed, the newly created representative peptides can be assembled into protein groups, as above. A protein group can contain two types of proteins. Same-set proteins are identified by all of the peptides in the group, whereas subset proteins are identified by a strict subset of the peptides. Among proteins sharing the same set of peptides, a representative, usually referred to as the typical or master protein, is selected as the name for the whole protein group. The database model also deals with the ambiguities that can arise from the protein inference solution, where shared peptides can be represented by shared entities in the database. The protein grouping algorithm produces a distinctive set of representative peptides and protein groups attached to the context from which the grouping operation was launched.

For instance, in Figure 2B, within the “Chloroplast” context, protein group AT1G07920.1 (named for its typical protein) is defined by 9 representative peptides grouped from 48 identification results corresponding to two replicate experiments that yielded 30 leaf peptides. This protein group has six same-set proteins and two subset proteins.
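The distinction between same-set and subset proteins can be made concrete with a short Python sketch. It is a deliberately simplified illustration and not the hEIDI algorithm: it only clusters proteins with identical peptide sets and attaches proteins whose peptide set is a strict subset of a cluster's set, without the further parsimony steps of a full Occam's razor implementation. The function name, the accessions `P_A` to `P_D`, and the rule of taking the alphabetically first same-set protein as the typical protein are hypothetical.

```python
def build_protein_groups(protein_to_peptides):
    """Simplified grouping into same-set and subset proteins."""
    # 1. Proteins matched by exactly the same peptides form one cluster.
    clusters = {}
    for protein, peptides in protein_to_peptides.items():
        clusters.setdefault(frozenset(peptides), []).append(protein)

    groups = []
    for peptides, proteins in clusters.items():
        # 2. A cluster strictly contained in another one does not head a group.
        if any(peptides < other for other in clusters):
            continue
        # 3. Proteins matched by a strict subset of the peptides are subset proteins.
        subset = sorted(
            p for other, members in clusters.items() if other < peptides
            for p in members
        )
        same_set = sorted(proteins)
        groups.append({
            "typical": same_set[0],  # assumption: first accession alphabetically
            "same_set": same_set,
            "subset": subset,
            "peptides": sorted(peptides),
        })
    return sorted(groups, key=lambda g: g["typical"])


protein_to_peptides = {
    "P_A": {"pep1", "pep2", "pep3"},
    "P_B": {"pep1", "pep2", "pep3"},
    "P_C": {"pep1", "pep2"},
    "P_D": {"pep4"},
}
protein_groups = build_protein_groups(protein_to_peptides)
```

In this example, `P_A` and `P_B` are same-set proteins of one group, `P_C` is a subset protein of that group, and `P_D` forms a group of its own.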

Comparison of Protein Groups between Contexts

The comparison of identified proteins, corresponding to samples or pools of samples (contexts), is not trivial. Indeed, the proteins to be compared are, in fact, representatives of protein groups, and the peptides that allowed their identification might not be the same in the different contexts to be compared. As a result, a given protein can be a typical protein in one context but only a subset protein in another context. In addition, a given peptide can be shared between two protein groups in one context and not in another. Thus, it is unwise to use direct side-by-side comparison of protein accession lists to compare identification results.

To overcome this difficulty, we developed an algorithm to ensure the consistency of the protein group lists to be compared. When two or more contexts must be compared, a union reference is generated (i.e., a parent context containing all of them), and each individual context is compared with this reference, avoiding a comparison process where contexts are compared with each other. For instance, the Envelope, Stroma, and Thylakoids contexts will be compared against the Chloroplast context, which is the union reference. Thus, for each child context to be compared, the algorithm compares each protein group of the union reference with each protein group of the child context and computes a similarity index for each protein group pair (see Experimental Procedures section). For a given protein group, this index is useful to determine similarity in terms of (i) the set of proteins and (ii) the sets of peptides that identify them. The protein group with the highest peptide and protein similarities, ranging from 0 to 1, is designated as the most “similar” protein group. This approach circumvents the issues associated with comparing identifications by simply matching accession numbers for typical proteins, especially when grouping differs due to differences in the sets of identified peptides between the samples to be compared.

The comparison result is displayed as a table with as many rows as there are protein groups (represented by their typical protein) in the union reference (parent context) (Figure 2D). Similar protein groups between each compared context can thus be observed at a glance. The table also indicates whether the reference typical protein was identified as a same-set protein or as a subset one, and metrics such as spectral count can be displayed if they were previously computed.

Extraction of Semiquantitative Information from Identification Data

For a given context, the spectral count (SC) algorithm can be launched to calculate this property for each protein group. Among the different metrics calculated, we will focus on the use of adjusted spectral count. Peptides shared between different proteins are very common in proteomics identifications, and it is generally recommended to use only unique peptides rather than all peptides to get relevant quantification results. However, it has been shown that distributing spectral counts of shared peptides, based on the presence of unique peptides in a given protein group, generated the best results.28 Thus, to accurately quantify peptides shared between different protein groups, we developed a specific algorithm called adjusted spectral count. This algorithm calculates adjusted spectral counts, as suggested in Abacus,29 to combine and weigh shared peptides depending on the associated protein groups. In brief, for each shared peptide, we define what proportion of spectra should be allocated to the different protein groups. These weights are based on the proteotypic spectral counts of the different protein groups sharing the same peptide(s), as shown in Figure 4A.

Figure 4. Calculation of an adjusted spectral count. P1 and P2 represent two protein groups sharing a peptide, p, defined by its sequence and its MW and identified by scp spectra. ωp,i: weight of peptide p in the context of the protein group i. SCi,proteo: proteotypic spectral count for protein group i. ∑SCj,proteo: sum of proteotypic spectral counts for all protein groups sharing peptide p. SCi,weighted: contributions of shared peptides adjusted by their weight, ωp,i, and added to SCi,proteo (A). Illustration of adjusted spectral count from three protein groups sharing 12 peptides (B).

For example, in our data, we identified AT2G01140.1, AT2G21330.1, and AT4G38970.1, which share 12 peptides (Figure 4B). If all peptides had been taken into account to identify the subchloroplastic compartment these proteins belong to, we would have concluded that the three proteins were present in stroma, thylakoids, and envelope (Supplemental Table 1A), but if we consider only unique peptides, it rapidly emerges that only two of them are part of all three compartments, whereas AT2G01140.1 is only present in the stroma. Use of unique peptides makes it possible to avoid hiding specific information from one isoform behind data from other proteins that behave differently. However, considering only unique peptides causes a large part of the data set to be rejected. In our data, overall, shared peptides represent 14% of the spectra identified, but for some proteins, this rate can be much higher. For example, we identified a group of 6 proteins sharing 21 peptides, each identified by between one and six unique peptides. Whereas these proteins are very abundant in thylakoids, unique peptides only represent 12% of the spectra identified for this group. In this case, it is nearly impossible to draw conclusions about the specific location of these proteins in thylakoids based on their specific spectral count, which is lower than 4 (except for AT2G34430.1 and AT5G54270.1) (Supplemental Table 1B).

When comparing protein lists using adjusted SC, the context level at which protein grouping is carried out influences the results. Indeed, the weight calculated will depend on the computation of the number of shared peptides. This number is determined with respect to a given context level, considered for protein grouping. Consequently, the contexts to be compared must be at the same level. The correction factor (weight) can be calculated either at the context level considered or at the parent level, depending on the accuracy of the adjustment. The correction factor, calculated for each protein group, can then be used to calculate the adjusted spectral count in the contexts of interest but also in their child contexts. In our data, because the goal of the study was to compare the different subchloroplastic fractions, it was obviously irrelevant to calculate weights at level 1 (whole data set), as this would mix specific information from different compartments. As we know that fragmentation of peptides can be linked to the sampling rate of the mass spectrometer, we chose to calculate weights at level 2 instead of level 3. This reduction in level helps to smooth slight differences between analytical replicates. As shown in Supplemental Table 1C, this approach gives more coherent values between replicates T1 and T2, in particular, for AT1G29930.1. This example shows that the adjusted spectral count can be used to conclude without ambiguity that all proteins from this group are specifically located in thylakoids, as the adjusted values were very high and better represented the protein abundance than specific spectral count.
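The adjusted spectral count defined in Figure 4 can be sketched in Python as follows. This is one reading of the caption, in which the weight of a shared peptide p for protein group i is ωp,i = SCi,proteo / ∑SCj,proteo and the adjusted count is SCi,proteo plus the sum of ωp,i × scp over the shared peptides. The function name, the input layout, and the decision to leave shared spectra unassigned when no group has unique evidence are assumptions of this sketch, not details confirmed by the text.

```python
def adjusted_spectral_counts(specific_sc, shared_peptides):
    """Distribute shared-peptide spectra according to specific spectral counts.

    specific_sc: {protein group: spectral count from unique peptides}
    shared_peptides: list of (sc_p, [protein groups sharing peptide p])
    """
    adjusted = {group: float(sc) for group, sc in specific_sc.items()}
    for sc_p, groups in shared_peptides:
        total = sum(specific_sc[g] for g in groups)
        if total == 0:
            # Assumption: without unique evidence the spectra stay unassigned.
            continue
        for g in groups:
            weight = specific_sc[g] / total   # weight of peptide p for group g
            adjusted[g] += weight * sc_p
    return adjusted


# Two hypothetical protein groups sharing one peptide identified by 8 spectra.
specific_sc = {"P1": 6, "P2": 2}
shared_peptides = [(8, ["P1", "P2"])]
adjusted = adjusted_spectral_counts(specific_sc, shared_peptides)
```

With these numbers, P1 receives three quarters of the 8 shared spectra and P2 one quarter, giving adjusted counts of 12 and 4. Because the weights for a shared peptide sum to 1, the total number of spectra is preserved.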

∥ A.E.: Université Joseph Fourier, 38000 Grenoble, France; INSERM, U823, Institut Albert Bonniot, 38042 Grenoble, France; Pôle Recherche, CHU-Grenoble, 38043 Grenoble, France.

Author Contributions

⊥ A.-M.H. and V.D. contributed equally. The manuscript was written through contributions from all authors. All authors have approved the final version of the manuscript. The authors declare no conflict of interest.

Notes

The authors declare no competing financial interest. All results were exported in the PRoteomics IDEntifications database (PRIDE XML) format and data were deposited in ProteomeXchange under identifier PXD000136.

PRIDE (PRoteomics IDEntifications database) XML Export



As part of a collaboration or publication, scientists need to share information with the scientific community. The PRoteomics IDEntifications database (PRIDE)14,30 is one of the most prominent public data repositories for proteomics data. Thus, the ability to export in the PRIDE XML file format is an essential feature. In hEIDI, identified peptides and proteins, with their corresponding mass spectra, can be exported to PRIDE XML from any context. Making this feature available from a context guarantees that the preliminary data validation (IRMa) and the multianalysis grouping/filtering (hEIDI) are both reflected in the resulting XML file.

ACKNOWLEDGMENTS

We thank the members of the EDyP laboratory who participated in testing the hEIDI software and David Bouyssié for fruitful discussions about the MSIdb schema. We also thank Dr. Maighread Gallagher-Gambarelli for editorial assistance. This work was supported in part by ANR grant Chloro-types (ANR-GENOM-BTV-002-02), the Seventh Framework Programme of the European Union (Contract Nos. 201333-DECanBio and 262067-PRIME-XS), and ANR funding for the ProFI project (ANR-10-INBS-08 “Infrastructures Nationales en Biologie et Santé”; “Investissements d’Avenir” call), as well as institutional funding.



CONCLUSIONS

hEIDI allows easy handling of multiple Mascot-validated results while maintaining consistent protein grouping. It can also be used to compute different types of spectral count and to compare multiple analyses, or pools of analyses, based on this metric. For instance, hEIDI was able to compare pools of analyses in projects for which up to 1500 search results had been combined. Forthcoming developments will provide additional functionalities, such as the handling of label-free quantification results. hEIDI allows data export in the PRIDE XML format for sharing and exchange of validated results. The software is open-source and distributed under the CeCILL license. The software, a sample test database, and the documentation are available at http://biodev.extra.cea.fr/docs/heidi.





ABBREVIATIONS

MSIdb, Mass Spectrometry Identification database; hEIDI, handy Exploration and Integration of Data and Identifications; IRMa, Identification Results from Mascot; SC, spectral count



(1) Claassen, M. Inference and validation of protein identifications. Mol. Cell. Proteomics 2012, 11, 1097−1104.
(2) Stephan, C.; Kohl, M.; Turewicz, M.; Podwojski, K.; Meyer, H. E.; Eisenacher, M. Using Laboratory Information Management Systems as central part of a proteomics data workflow. Proteomics 2010, 10 (6), 1230−1249.
(3) Bertsch, A.; Gröpl, C.; Reinert, K.; Kohlbacher, O. OpenMS and TOPP: open source software for LC−MS data analysis. Methods Mol. Biol. 2011, 696, 353−367.
(4) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Farrah, T.; Lam, H.; Tasman, N.; Sun, Z.; Nilsson, E.; Pratt, B.; Prazen, B.; et al. A guided tour of the Trans-Proteomic Pipeline. Proteomics 2010, 10 (6), 1150−1159.
(5) Barsnes, H.; Vaudel, M.; Colaert, N.; Helsens, K.; Sickmann, A.; Berven, F. S.; Martens, L. compomics-utilities: an open-source Java library for computational proteomics. BMC Bioinf. 2011, 12, 70.
(6) Deutsch, E. W.; Lam, H.; Aebersold, R. Data analysis and bioinformatics tools for tandem mass spectrometry in proteomics. Physiol. Genomics 2008, 33 (1), 18−25.
(7) Ferro, M.; Brugière, S.; Salvi, D.; Seigneurin-Berny, D.; Court, M.; Moyet, L.; Ramus, C.; Miras, S.; Mellal, M.; Le Gall, S.; et al. AT_CHLORO, a comprehensive chloroplast proteome database with subplastidial localization and curated information on envelope proteins. Mol. Cell. Proteomics 2010, 9 (6), 1063−1084.
(8) Bruley, C.; Dupierris, V.; Salvi, D.; Rolland, N.; Ferro, M. AT_CHLORO: A Chloroplast Protein Database Dedicated to Sub-Plastidial Localization. Front. Plant Sci. 2012, 3, 205.
(9) Dice, L. R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297−302.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.5b00853.
Supplementary Table 1. Examples of the different ways spectral count can be computed. Calculating weighting at different levels based on the experimental design influences the adjusted spectral count values. (XLSX)
Supplementary Figure 1. Overview of the MSIdb schema. (PDF)



REFERENCES

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Present Addresses

#C.A.: Genostar, 60 Rue Lavoisier, 38330 Montbonnot, St. Martin, France.


(10) Côté, R. G.; Griss, J.; Dianes, J. A.; Wang, R.; Wright, J. C.; van den Toorn, H. W. P.; van Breukelen, B.; Heck, A. J. R.; Hulstaert, N.; Martens, L.; et al. The PRoteomics IDEntification (PRIDE) Converter 2 framework: an improved suite of tools to facilitate data submission to the PRIDE database and the ProteomeXchange consortium. Mol. Cell. Proteomics 2012, 11 (12), 1682−1689.
(11) Dupierris, V.; Masselon, C.; Court, M.; Kieffer-Jaquinod, S.; Bruley, C. A toolbox for validation of mass spectrometry peptides identification and generation of database: IRMa. Bioinformatics 2009, 25 (15), 1980−1981.
(12) Wildner, M. In memory of William of Occam. Lancet 1999, 354 (9196), 2172.
(13) Choudhary, J. S.; Blackstock, W. P.; Creasy, D. M.; Cottrell, J. S. Interrogating the human genome using uninterpreted mass spectrometry data. Proteomics 2001, 1 (5), 651−667.
(14) Martens, L.; Hermjakob, H.; Jones, P.; Adamski, M.; Taylor, C.; States, D.; Gevaert, K.; Vandekerckhove, J.; Apweiler, R. PRIDE: the proteomics identifications database. Proteomics 2005, 5 (13), 3537−3545.
(15) Martens, L.; Chambers, M.; Sturm, M.; Kessner, D.; Levander, F.; Shofstahl, J.; Tang, W. H.; Römpp, A.; Neumann, S.; Pizarro, A. D.; et al. mzML–a community standard for mass spectrometry data. Mol. Cell. Proteomics 2011, 10 (1), R110.000133.
(16) Taylor, C. F.; Binz, P.-A.; Aebersold, R.; Affolter, M.; Barkovich, R.; Deutsch, E. W.; Horn, D. M.; Hühmer, A.; Kussmann, M.; Lilley, K.; et al. Guidelines for reporting the use of mass spectrometry in proteomics. Nat. Biotechnol. 2008, 26 (8), 860−861.
(17) Li, Y. F.; Arnold, R. J.; Li, Y.; Radivojac, P.; Sheng, Q.; Tang, H. A Bayesian approach to protein inference problem in shotgun proteomics. J. Comput. Biol. 2009, 16 (8), 1183−1193.
(18) Nesvizhskii, A. I.; Aebersold, R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 2005, 4 (10), 1419−1440.
(19) Huang, T.; Wang, J.; Yu, W.; He, Z. Protein inference: a review. Briefings Bioinf. 2012, 13 (5), 586−614.
(20) Meyer-Arendt, K.; Old, W. M.; Houel, S.; Renganathan, K.; Eichelberger, B.; Resing, K. A.; Ahn, N. G. IsoformResolver: A peptide-centric algorithm for protein inference. J. Proteome Res. 2011, 10 (7), 3060−3075.
(21) Resing, K. A.; Meyer-Arendt, K.; Mendoza, A. M.; Aveline-Wolf, L. D.; Jonscher, K. R.; Pierce, K. G.; Old, W. M.; Cheung, H. T.; Russell, S.; Wattawa, J. L.; et al. Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal. Chem. 2004, 76 (13), 3556−3568.
(22) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75 (17), 4646−4658.
(23) Searle, B. C. Scaffold: a bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics 2010, 10 (6), 1265−1269.
(24) Searle, B. C.; Turner, M.; Nesvizhskii, A. I. Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies. J. Proteome Res. 2008, 7 (1), 245−253.
(25) Zhang, B.; Chambers, M. C.; Tabb, D. L. Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. J. Proteome Res. 2007, 6 (9), 3549−3557.
(26) Ma, Z.-Q.; Dasari, S.; Chambers, M. C.; Litton, M. D.; Sobecki, S. M.; Zimmerman, L. J.; Halvey, P. J.; Schilling, B.; Drake, P. M.; Gibson, B. W.; et al. IDPicker 2.0: Improved protein assembly with high discrimination peptide identification filtering. J. Proteome Res. 2009, 8 (8), 3872−3881.
(27) Uszkoreit, J.; Maerkens, A.; Perez-Riverol, Y.; Meyer, H. E.; Marcus, K.; Stephan, C.; Kohlbacher, O.; Eisenacher, M. PIA: An Intuitive Protein Inference Engine with a Web-Based User Interface. J. Proteome Res. 2015, 14 (7), 2988−2997.
(28) Zhang, Y.; Wen, Z.; Washburn, M. P.; Florens, L. Refinements to label free proteome quantitation: how to deal with peptides shared by multiple proteins. Anal. Chem. 2010, 82 (6), 2272−2281.
(29) Fermin, D.; Basrur, V.; Yocum, A. K.; Nesvizhskii, A. I. Abacus: a computational tool for extracting and pre-processing spectral count data for label-free quantitative proteomic analysis. Proteomics 2011, 11 (7), 1340−1345.
(30) Jones, P.; Côté, R. G.; Martens, L.; Quinn, A. F.; Taylor, C. F.; Derache, W.; Hermjakob, H.; Apweiler, R. PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 2006, 34 (Database issue), D659−D663.
