The MetaProteomeAnalyzer: A Powerful Open-Source Software

Article pubs.acs.org/jpr

The MetaProteomeAnalyzer: A Powerful Open-Source Software Suite for Metaproteomics Data Analysis and Interpretation

Thilo Muth,†,¶ Alexander Behne,‡,¶ Robert Heyer,‡ Fabian Kohrs,‡ Dirk Benndorf,‡ Marcus Hoffmann,† Miro Lehtevä,§,⊥ Udo Reichl,†,‡ Lennart Martens,*,§,⊥ and Erdmann Rapp*,†

† Max Planck Institute for Dynamics of Complex Technical Systems, 39106 Magdeburg, Germany
‡ Chair of Bioprocess Engineering, Otto von Guericke University Magdeburg, 39106 Magdeburg, Germany
§ Department of Medical Protein Research, VIB, B-9000 Ghent, Belgium
⊥ Department of Biochemistry, Ghent University, B-9000 Ghent, Belgium

*S Supporting Information

ABSTRACT: The enormous challenges of mass spectrometry-based metaproteomics are primarily related to the analysis and interpretation of the acquired data. This includes reliable identification of mass spectra and the meaningful integration of taxonomic and functional meta-information from samples containing hundreds of unknown species. To ease these difficulties, we developed a dedicated software suite, the MetaProteomeAnalyzer, an intuitive open-source tool for metaproteomics data analysis and interpretation, which includes multiple search engines and the feature to decrease data redundancy by grouping protein hits to so-called meta-proteins. We also designed a graph database back-end for the MetaProteomeAnalyzer to allow seamless analysis of results. The functionality of the MetaProteomeAnalyzer is demonstrated using a sample of a microbial community taken from a biogas plant.

KEYWORDS: bioinformatics, environmental proteomics, mass spectrometry, metaproteomics, microbial communities, software



Received: December 2, 2014
Published: February 9, 2015
DOI: 10.1021/pr501246w
J. Proteome Res. 2015, 14, 1557−1565
© 2015 American Chemical Society

INTRODUCTION

Mass spectrometry (MS)-based analysis of pure culture proteomes or simple mixed cultures has advanced rapidly in the past decade. There is now a growing interest in studying complex multispecies samples such as entire microbial communities. Microbial consortia are key players in geochemical cycles,1,2 biochemical networks, and biotechnological and medical applications.3−7 For instance, metaproteomics8 or whole community proteomics9 investigating enzymatic capabilities of microbes on the protein level is applied for wastewater treatment10 and biogas plants11 or in metaproteomic studies with clinical background by analyzing microbial proteins within the human oral cavity12 and intestinal tract.13,14 Over the past few years, the development of metaproteomics has been driven mainly by the ability to sequence microbial genomes and metagenomes via high-throughput sequencing15 as well as by the simultaneous improvements in MS instrumentation that allow rapid high-resolution analysis of microbial community samples, predominantly by using shotgun proteomic approaches.16

Despite these improvements, however, the field still suffers from the exceptional complexity and heterogeneity of those samples, which hamper data evaluation.17 State-of-the-art protein identification algorithms,18,19 for instance, are designed to handle single-species samples and are severely challenged by size and redundancy of multispecies protein sequence databases.20 In addition, protein hits are typically returned from several hundreds of different species, which further exacerbates the already confounding protein inference problem.21 Taxonomic binning of the identified peptides therefore must be addressed in a sophisticated manner as well.22 Another challenging issue is encountered in the functional annotation of proteins, as metaproteomics research is not only interested in (single-)protein identifications, but also focuses strongly on specific functions performed by microorganisms in an ecosystem.23 Unfortunately, no stand-alone software tool currently exists to aid metaproteomics research in reliably addressing the central question of the field: “who is doing what?”.

Here we describe the MetaProteomeAnalyzer (MPA), a free, open-source, end user oriented Java software suite for the comprehensive analysis and visualization of metaproteomics data sets (http://meta-proteome-analyzer.googlecode.com; a guided, hands-on tutorial can be found in Note 1, Supporting Information). For exhaustive peptide and protein identification, the MPA features four different freely available database search algorithms and furthermore allows for the integration of results derived from the commercial MASCOT search engine (version 2.4).18 The combination of these complementary search engines leads to an increase in protein and peptide identifications as well as in identification reliability. The MPA provides an intuitive workflow for the automated functional and taxonomic characterization of proteins of interest. The software allows for grouping of redundant proteins in the result set according to a set of provided rules, and it also offers an innovative way of querying the results of a metaproteomics analysis by providing a graph database-based back end. The MPA is designed as a client−server application, with the identification workload handled by a high-performance server. In addition, the local client provides a user-friendly graphical user interface to analyze and interpret the results. An example data set derived from a sample of a biogas plant (Note 2, Supporting Information) is included in the MPA viewer application, a stand-alone client version with limited processing capabilities.

Figure 1. MetaProteomeAnalyzer workflow. (1) Project and experiment management is handled on workflow start. (2) MPA server performs database searches and querying of meta-information. (3) Proteins are annotated and grouped to meta-proteins. (4) MPA client user interface provides a results overview with heatmap and charts. (5) Detailed top-down protein result views display meta-proteins and label-free quantification measures. (6) Results linked to meta-information on pathways, ontologies, enzymes, and taxonomies are displayed in additional views. (7) Results can be queried with user-defined functions on the graph database.

MATERIAL AND METHODS

General Workflow

The MPA represents a software pipeline for the analysis and interpretation of metaproteomics data sets. An overview of the general workflow employed in the MPA is outlined in Figure 1 and is described in detail here. At the start of a new project (1), experimental data are provided, and the corresponding tandem mass spectra are loaded and sent to the processing server. The server then executes up to four different database search algorithms (X!Tandem,24 OMSSA,25 Crux,26 and InsPect27) for peptide and protein identification and can also retrieve identifications from the widely used commercial Mascot software using mascotdatfile.18,28 Search results from the different search engines are merged after their individual scores are converted to uniform significance measures, so-called q-values29 reflecting the minimum FDR for the identifications. The obtained identifications are complemented subsequently with additional information such as protein taxonomy and function (2). The MPA server also takes care of storing spectra, results, and annotations in a relational database and acts as a simple laboratory information management system (LIMS) that stores all data for later reanalysis or meta-analysis (see Note 3, Supporting Information for the database schema).

The user can then retrieve the processed results from the server for performing analyses via the client application (3). In this step, proteins are grouped into so-called meta-proteins (see Note 4, Supporting Information for grouping rules) and are annotated with additional information derived from external resources: general protein-level information (e.g., ontology keywords) from UniProt,30 taxonomic information from NCBI,31 enzyme information from Enzyme Commission (E.C.) classification scheme,32 and metabolic pathway information from KEGG.33 It is also possible to add customized protein databases to the MPA workflow, e.g., for including protein sequences derived from metagenomic sequencing. The formatting and indexing of such user-defined FASTA databases are explained in more detail on the MPA wiki pages.

The client overview provides a heat map view (Figure 1, Supporting Information) to rapidly screen for relations between identified proteins, peptides, taxonomic groups, and functions (4). The distribution of different taxonomic groups (from superkingdom to species) and functions are visualized as pie charts and bar charts (Figure 2, Supporting Information). Importantly, the database search results panel (Figure 2) displays the identified proteins in a complete overview, which allows detailed inspection of the supporting peptides for each protein, the various peptide-to-spectrum matches for each peptide across all search engines, and the fully interactive, annotated spectrum view for each peptide-to-spectrum match34,35 (5). Moreover, several label-free quantitative measures are provided for each protein. This includes spectral count, normalized spectral abundance factor (NSAF),36 and exponentially modified protein abundance index (emPAI).37 The enzyme and pathway views display proteins aggregated by E.C. numbers32 and KEGG pathways,33 which enables a direct inspection of microbial functions (6; Figure 3, Supporting Information). Further detailed views provide protein groupings based on meta-proteins, taxonomies, and ontologies. To focus on the possible role of a defined microbial group in the sample data set, flexible filtering methods are available in all views.

Perhaps the most innovative feature of the MPA is the Neo4j-based graph database (7) that holds all the protein meta-information, which allows the user to handle complex queries efficiently using a query dialogue (Figure 4, Supporting Information). For the example data set from a multispecies sample of a microbial community of a biogas plant, this feature of the MPA allows searching of specifically those species that are involved in the production of methane using the key enzyme Methyl coenzyme M Reductase. The power of this innovative approach is illustrated in more detail in Note 5 of the Supporting Information. This method allows distinct information to be retrieved from the sample data set by asking user-defined questions. Finally, the MPA allows the export of all results either as CSV files, for subsequent data analysis in third party software, or as MPA project files for dissemination and display in the viewer application. For more detailed information on the software, we refer to the manual on the Web site (http://meta-proteome-analyzer.googlecode.com).

Figure 2. Search result panel of MetaProteomeAnalyzer. The identified proteins are displayed in the top panel, the identified peptides for the currently selected protein in the middle-left panel, and the peptide-to-spectrum matches for the selected peptide across all search engines in the lower-left panel. The right panel shows the annotated fragment ion series of the currently selected peptide-to-spectrum match.

Meta-Protein Generation

In metaproteomics, identified peptides frequently belong to homologous proteins expressed by organisms belonging to different species, which causes redundant protein identifications to be reported.38 Previous studies propose several strategies to handle redundant protein hits.39−41 The MPA software incorporates several such approaches within its result processing workflow, the most fundamental being the grouping of proteins according to peptide similarity (Peptide Rule) and protein similarity (Protein Cluster Rule). To minimize the shortcomings of established methods, these rules can be further extended by taking the protein taxonomy into account (Taxonomy Rule). As a precondition for the taxonomy rule, information about the taxonomic lineage of proteins must be inferred. Therefore, the taxonomy definition process, which also affects peptides in relation to proteins and proteins in relation to meta-proteins, is embedded into the meta-protein generation workflow. The common ancestor rule determines the taxonomic lowest common ancestor by all shared peptides for the proteins. Conversely, the most specific taxonomy rule preserves peptide-level specificity for proteins, that is, usually the taxonomy on the species or subspecies level is conserved. As a starting point, a preliminary meta-protein is generated for every protein hit in the raw results. Subsequently, meta-proteins are fused by applying these rules. More detailed information on the meta-protein generation rules can be found in Note 4 of the Supporting Information. Each rule can be applied individually or in combination with other rules, which will yield different results. To visualize the impact of meta-protein generation on the taxonomy, we used the Krona display.42
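The Peptide Rule (at least one shared peptide) and the common ancestor rule can be illustrated with a small sketch. This is not the MPA implementation, which is written in Java and specified in Note 4 of the Supporting Information; the function names, accessions, peptide sequences, and lineages below are hypothetical, and the sketch assumes exact string matching with isoleucine and leucine treated as identical, as the Results section notes for shared peptides.

```python
from collections import defaultdict


def group_by_shared_peptide(protein_peptides):
    """Fuse proteins that share at least one peptide into one group.

    protein_peptides maps a protein accession to its set of peptide sequences.
    Isoleucine (I) and leucine (L) are treated as identical.
    """
    # Every protein starts as its own preliminary meta-protein.
    parent = {acc: acc for acc in protein_peptides}

    def find(acc):
        while parent[acc] != acc:
            parent[acc] = parent[parent[acc]]  # path halving
            acc = parent[acc]
        return acc

    # Index proteins by normalised peptide sequence.
    owners = defaultdict(list)
    for acc, peptides in protein_peptides.items():
        for peptide in peptides:
            owners[peptide.replace("I", "L")].append(acc)

    # Fuse all proteins that own the same peptide.
    for accessions in owners.values():
        root = find(accessions[0])
        for other in accessions[1:]:
            parent[find(other)] = root

    groups = defaultdict(list)
    for acc in protein_peptides:
        groups[find(acc)].append(acc)
    return sorted(sorted(members) for members in groups.values())


def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    common = None
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        common = taxa_at_rank[0]
    return common


# Hypothetical toy input, for illustration only.
example_proteins = {
    "PROT_A": {"AAILK", "GGTR"},
    "PROT_B": {"AALLK", "MMPK"},  # AALLK equals AAILK once I is read as L
    "PROT_C": {"WWQR"},
}
example_lineages = [
    ["Archaea", "Euryarchaeota", "Methanomicrobia", "Methanosarcinales"],
    ["Archaea", "Euryarchaeota", "Methanomicrobia", "Methanomicrobiales"],
]

if __name__ == "__main__":
    print(group_by_shared_peptide(example_proteins))
    print(lowest_common_ancestor(example_lineages))
```

A union-find structure is used here because fusing is transitive: if protein A shares a peptide with B, and B with C, all three end up in one meta-protein even when A and C share nothing directly.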

Graph Database System

In general, users have challenging questions during data investigation, and software solutions cannot predict all such use cases. To provide maximal flexibility and power in interrogating the metaproteomics results, the open source graph database Neo4j (www.neo4j.com) has been integrated into the MPA, which enables user-defined querying of the results based on the Cypher query language (http://docs.neo4j.org/refcard/1.9/). Instead of taking the classical approach of a relational database system with tables and indices, this database structure is modeled as a graph consisting of nodes (vertices) and relationships (edges). Both of these entities are named, and relationships are directed referring to a start and end node. Additionally, the graph database uses properties, which are basically key-value pairs that represent certain attributes for nodes and relationships. The graph database is a fully transactional database management system with Create, Read, Update, and Delete (CRUD) methods that are common to relational databases. Although the core of Neo4j has been developed in Java, it provides access to various application programming interfaces (APIs). In contrast to common databases, Neo4j offers two database modes as it runs in either embedded or server mode. The embedded version is used in this case to incorporate the database directly in the MPA client application. Furthermore, the embedded mode comes with the advantages of low latency and full control of the database life cycle. In the case of proteomics data, the graph database structure consists of representative node variants (Table 1) and relationships (Table 2).

Experimental Data

The example data set represents the metaproteome of a complex microbial community derived from an agricultural biogas plant located in Magdeburg/Ebendorf (Saxony-Anhalt, Germany) and was obtained by liquid chromatography tandem mass spectrometry (LC−MS/MS). Main process parameters as well as substrate feed composition are summarized in Table 1 of the Supporting Information. More details on sample preparation, LC−MS/MS measurement, and data processing can be found in Note 2 of the Supporting Information.

Database Searching

We performed the database searching using the search algorithms MASCOT (version 2.3),18 X!Tandem (version 2013.02.01),24 and OMSSA (version 2.1.8).43 MS/MS spectra were searched against UniProt/SwissProt database (version 2013/02/20). Trypsin was used as default enzyme cleavage parameter, and the maximum allowed number of missed cleavages was set to one. Carbamidomethylation of cysteine (Cys+57 Da) was chosen as fixed modification and oxidation of methionine (Met+16 Da) as variable modification. The precursor ion tolerance was set to 10 ppm, and the fragment ion tolerance was set to 0.5 Da. Target-decoy searching was performed, and the decoy database was constructed by reversing the protein sequences from the target database.
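To make the target-decoy step and the q-values mentioned under General Workflow concrete, the following sketch reverses a sequence to build a decoy and converts ranked scores into q-values. It is an illustration under stated assumptions, not the MPA code: the text does not say which FDR estimator is used, so the simple ratio of decoy hits to target hits is assumed here, scores are assumed to be "higher is better", and ties are not treated specially. All names and numbers are hypothetical.

```python
def reverse_decoy(sequence):
    """Build a decoy sequence by reversing a target protein sequence."""
    return sequence[::-1]


def q_values(psms):
    """Compute q-values for a list of (score, is_decoy) tuples.

    Assumes a higher score means a better match. The FDR at each score
    threshold is estimated as (#decoys / #targets) among all matches at or
    above that threshold; the q-value of a match is the minimum FDR over
    all thresholds that would still accept it.
    """
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)

    fdr_at_rank = []
    targets = decoys = 0
    for index in order:
        if psms[index][1]:
            decoys += 1
        else:
            targets += 1
        fdr_at_rank.append(decoys / max(targets, 1))

    # Walk from the worst score upwards, keeping the running minimum FDR.
    result = [0.0] * len(psms)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr_at_rank[rank])
        result[order[rank]] = running_min
    return result


# Hypothetical scores, for illustration only.
example_psms = [(9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, True)]

if __name__ == "__main__":
    print(reverse_decoy("MKTAY"))
    print(q_values(example_psms))
```

Because q-values are defined per search engine on that engine's own score scale, they give a common significance measure on which results from several engines can be merged, which is the role the text assigns to them.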



RESULTS AND DISCUSSION

To evaluate the MPA and its processing steps, we conducted an experiment with a real data set taken from a biogas plant sample. First, we show the impact of using multiple search engines on the number of identifications. Then, we demonstrate the grouping of redundant proteins to meta-proteins on the exemplary data set. Finally, we illustrate the possibility of asking user-defined questions regarding specific aspects of the given data.

Table 1. Node Types and Descriptions for the Graph Database

node type: description
Proteins: Identified proteins; properties are protein accession, description, sequence coverage, species, and spectral count.
Peptides: Identified peptides; properties are peptide sequence and spectral count.
PSMs: Peptide−spectrum matches; properties are spectrum identifier and search engine score.
Taxonomies: Taxonomies; properties are taxonomy name, NCBI taxonomy ID, and rank.
Ontologies: UniProt ontologies; properties are ontology name and category (e.g., biological process).
Pathways: KEGG pathways; properties are KO number and KEGG description.
Enzymes: E.C.-based enzymes; properties are E.C. number and description.

Table 2. Relationship Types and Descriptions for the Graph Database. Nout and Nin Are Defined as Nodes with Outgoing and Incoming Relationship Direction

relationship type: description
HAS_PEPTIDE: Nout, Proteins; Nin, Peptides; relationship for proteins that share the peptides.
IS_MATCH_IN: Nout, PSMs; Nin, Peptides; relationship for PSMs that match for peptides.
BELONGS_TO: Nout, Proteins; Nin, Taxonomies; relationship for proteins that belong to certain taxonomies.
BELONGS_TO_ENZYME: Nout, Proteins; Nin, Enzymes; relationship for proteins that fulfill an enzymatic function.
BELONGS_TO_PATHWAY: Nout, Proteins; Nin, Pathways; relationship for proteins that are part of certain pathways.
INVOLVED_IN_BIOPROCESS: Nout, Proteins; Nin, Ontologies (Biological Process); relationship for proteins that are involved in biological processes.
HAS_MOLECULAR_FUNCTION: Nout, Proteins; Nin, Ontologies (Molecular Function); relationship for proteins that have molecular functions.
BELONGS_TO_CELL_COMP: Nout, Proteins; Nin, Ontologies (Cellular Component); relationship for proteins that belong to cellular components.
IS_SUPERGROUP_OF: Nout, Enzymes; Nin, Enzymes; relationship to reflect the enzyme (E.C.) hierarchy.
IS_ANCESTOR_OF: Nout, Taxonomies; Nin, Taxonomies; relationship for the taxonomic hierarchy (from superkingdom to species).
IS_METAPROTEIN_OF: Nout, Proteins; Nin, Proteins; relationship between a meta-protein and a protein (see Note 5, Supporting Information).

Search Engine Comparison

To test the impact of using multiple search engines, the biogas plant data set was searched with three database search algorithms: X!Tandem, OMSSA, and MASCOT. In the following, we limited the comparison between the different search engines to the number of identified spectra and distinct peptides, as the number of reported proteins varied significantly between the search algorithms. By this restriction, we could guarantee a fair comparison between the different search engines. In the first step, we compared the number of identified spectra at 5% FDR (Figure 5a, Supporting Information). In the Euler diagram, it can be found that each search engine yielded a significant number of unique spectrum identifications. X!Tandem provided the highest number of unique identifications on the spectrum level, that is, 799 (24.2%) out of 3295 identified spectra. The next comparison involved the number of distinct peptides identified from each of the algorithms (Figure 5b, Supporting Information). The Euler diagram showed that X!Tandem again provided the highest number of peptide hits: 281 (28.3%) out of 992 peptides were identified with this search engine exclusively. In total, 309 (31.1%) peptides were found by all three search engines. Since each of the search engines provided a high number of unique peptide identifications, the use of multiple search engines is clearly justified. Furthermore, this approach may also be useful for the validation of questionable hits.

Meta-Protein Generation

In this section, we investigate the results of the meta-protein generation. For this purpose, we used a group of proteins denoted as F420-dependent methylenetetrahydromethanopterin dehydrogenase (henceforth F420-MTHMO). The example data set features six such protein identifications. The initial state of ungrouped, redundant protein results is shown in Figure 6 of the Supporting Information. Each of the protein identifications was assigned to a single meta-protein at this stage. One way to reduce the redundancy of the results is to group the proteins according to common, so-called shared peptides. Because each of the proteins holds at least one peptide that is also associated with another protein, all proteins are grouped under a single meta-protein according to this grouping rule (Figure 7, Supporting Information). Note that the isomeric amino acids leucine (L) and isoleucine (I) are considered identical. However, as homologous proteins in nature often differ in their amino acid sequence, this exact string matching approach may lead to an incorrect assignment of possible candidates for the protein grouping. Consequently, we also incorporated a distance measure for peptides by allowing a defined maximum number of point mutations. A less stringent peptide similarity definition of one amino acid substitution results in more common peptide sets (Figure 8, Supporting Information). In contrast to the previous rule, the presence of a peptide unique to one specific protein (e.g., A3CSZ5) precludes this protein from also being grouped under the same meta-protein. This can actually be a desired outcome as unique peptides may be an indicator for the presence of distinct proteins.

Another rule groups proteins according to UniRef database cluster assignments.44 When applying the UniRef50 rule to our example, the majority of all proteins fall into the same similarity cluster due to whole protein sequence similarities, apart from the protein O29544 that is derived from a distant species (Figure 9, Supporting Information). The last rule, which we applied for protein grouping, uses the minimum of one shared peptide. In addition, however, we refined the results by using a taxonomic cutoff. This prevents grouping of proteins whose lineages converge above a specified taxonomic rank threshold. In this example, on the basis of the inferred protein taxonomy rule, the majority of all proteins is grouped under a single meta-protein (Figure 10, Supporting Information). In both cases, the protein O29544 is assigned to a discrete meta-protein belonging to the taxonomic class of Archaeoglobi. The F420-MTHMO is part of the incomplete methanogenesis of the thermophilic sulfate-reducing Archaeoglobus fulgidus.45 In contrast to methanogens, the pathway is meant to be used in reverse direction here: thus, oxidizing methyl groups derived from the acetyl-CoA decarboxylase/synthase to carbon dioxide. The rules for meta-protein assignment thus correctly separate proteins with similar functional annotation according to their phylogenetic distance as well as to their actual activity.

As illustrated in this section, the choice of meta-protein generation rules has a profound effect on the number of the resulting meta-proteins. On the whole example data set, a reduction between 44 and 50% of the unprocessed data set could be achieved depending on the rules chosen (Table 2, Supporting Information). The impact of applying different meta-protein rules on the calculated composition of the microbial community was visualized by using the Krona display presenting attributed spectral counts onto the different taxonomic levels. In comparison to unprocessed data (Figure 11a, Supporting Information), data from meta-proteins sharing at least one peptide, which had been subsequently assigned to the common ancestor (Figure 11b, Supporting Information), showed a different taxonomic composition: a slightly higher proportion of Methanosarcinales (19% instead of 18%) and a lower proportion of the order Methanomicrobiales (1% instead of 8%) were found due to more assignments to the common ancestor phylum of Euryarchaeota (10% instead of 1%). The applied meta-protein rule (sharing at least one peptide, common ancestor) grouped proteins from wider phylogenetic ranges into meta-proteins and partially prevented the phylogenetic assignment on the order level. Less stringent meta-protein rules, for example, UniRef90 clustering (Figure 11c, Supporting Information), will allow more detailed assignment but also result in a smaller decrease of sample complexity. More differences in the microbial composition can be found when applying other protein grouping rules, such as UniRef50 (Figure 11d, Supporting Information) clustering and shared peptide sets (Figure 11e, Supporting Information).

The comparison of phylogenetic assignment of meta-proteins in the Krona displays is based on spectral counts. Unfortunately, the calculation of label-free quantification methods, such as NSAF36 or emPAI,37 requires either a defined sequence length for the meta-protein or its complete amino acid sequence. The calculation of both measures is therefore biased by partial sequences of proteins in the databases. When working with metaproteome data, this problem becomes worse as the measures are based on the protein hit derived from the database: this protein sequence may differ from the actual protein in the sample. Therefore, we used the metric of spectral counts and added so-called aggregate functions (see Note 1, Supporting Information), for example, to calculate the average of the protein NSAF values for each meta-protein within a sample. This method can be used as a straightforward strategy to enable the label-free quantification for meta-proteins. Another possibility would be the application of an algorithm that is able to handle the high amount of shared peptides between homologous proteins from different organisms in a reasonable manner: for example, a tool called Pipasic (peptide intensity-weighted proteome abundance similarity correction) that corrects strain-level identification and quantification results based on spectral counts.46
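The label-free quantification discussed in this section can be sketched briefly. NSAF is computed here with its standard definition, $\mathrm{NSAF}_i = (\mathrm{SpC}_i / L_i) / \sum_j (\mathrm{SpC}_j / L_j)$, where SpC is the spectral count and L the sequence length; the averaging step mirrors the aggregate function described in the text. The function names and the numbers are hypothetical, and the way the MPA exposes its aggregate functions is documented only in the Supporting Information.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor for every protein in a sample.

    spectral_counts and lengths are dicts keyed by protein accession.
    """
    saf = {acc: spectral_counts[acc] / lengths[acc] for acc in spectral_counts}
    total = sum(saf.values())
    return {acc: value / total for acc, value in saf.items()}


def mean_nsaf_per_metaprotein(nsaf_values, metaproteins):
    """Average the protein-level NSAF values of each meta-protein's members."""
    return {
        name: sum(nsaf_values[acc] for acc in members) / len(members)
        for name, members in metaproteins.items()
    }


# Hypothetical counts, lengths, and groupings, for illustration only.
example_counts = {"A": 10, "B": 5, "C": 5}
example_lengths = {"A": 100, "B": 100, "C": 50}
example_metaproteins = {"MP1": ["A", "B"], "MP2": ["C"]}

if __name__ == "__main__":
    values = nsaf(example_counts, example_lengths)
    print(values)
    print(mean_nsaf_per_metaprotein(values, example_metaproteins))
```

The sketch also shows where the bias described in the text enters: each NSAF value depends on the database sequence length of the matched protein, so a partial or non-matching database entry distorts the value before any averaging takes place.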

Integration of Meta-Information from External Resources

Besides the provision of taxonomic information, the identified proteins also point to biological functions. To provide access to such external meta-information, all protein hits are linked to external sequence databases (UniProt, NCBI), domain/family databases (Pfam, Interpro), services for sequence analysis (Protein BLAST), and functional annotation databases (KEGG, QuickGO). By using the pathway view in the MPA, all proteins identified in the data set can easily be plotted onto the respective KEGG pathway maps, displayed in Figure 3 as an example for the major pathway of carbon metabolism. By restricting the taxonomic range to the superkingdom, the major pathways can be assigned to Archaea (Figure 12, Supporting Information) and Bacteria (Figure 13, Supporting Information). The pathway of methanogenesis is, for example, exclusively present in Archaea and glycolysis/gluconeogenesis mainly in Bacteria.

Another example is provided by the analysis of the amino acid synthesis pathway, which revealed that particular bacterial taxa show a preference for the synthesis of certain amino acids: Proteobacteria prefer arginine and lysine (Figure 14, Supporting Information), while Firmicutes favor tryptophan and histidine (Figure 15, Supporting Information). However, enzymes for amino acid synthesis are generally underrepresented in Archaea (Figure 16, Supporting Information). The high abundance of enzymes involved in arginine, lysine, tryptophan, and histidine synthesis in the metaproteome might indicate the de novo synthesis of these amino acids. By surveying the amino acid composition of maize bulk protein,47 these amino acids are underrepresented in relation to the abundance in microbial biomass: this supports the hypothesis that de novo synthesis is required for microbial growth in biogas plants. Both examples show that the MPA facilitates the integration and interpretation of such taxonomic and functional data.

Figure 3. KEGG pathway display. The major pathway of the carbon metabolism is shown. KEGG pathway map (1200) of the general carbon metabolism is shown as a net of metabolites and its intermediates. Edges connecting single metabolites by arrows represent respective enzymes and the potential direction of enzymatic conversion. Proteins identified in the data set are highlighted in red after submission from the MPA to the KEGG Web site, thereby representing the coverage of this pathway.

On the front-end side of the application, various visualization options facilitate addressing specific questions regarding proteins in relation to their functional and taxonomic information. The convenient usability of these features allows for unbiased data exploration. An additional degree of flexibility in categorizing and visualizing data is provided by the graph-based data handling and user-definable query system for dealing with complex questions. The MPA pipeline has been developed with extensibility in mind regarding the addition of further sources of meta-information, for example, further ontology or pathway databases. Finally, while the MPA is geared toward processing metaproteomics data, the developed workflow also constitutes a highly useful tool for conventional proteomics data analysis.

Graph Database Driven Query System

In this section, we illustrate the use of querying the data via the flexible graph database system: it allows the user to address specific questions that are not straightforward to answer by the classical result views. To demonstrate the benefits of this approach, we performed four different queries of increasing complexity on the exemplary data set. The syntax and visual representation of these queries are detailed in Note 5 of the Supporting Information, and the query results are shown in Table 3 of the Supporting Information. For each of the following results, the single shared Peptide Rule was taken for the protein redundancy reduction.

The first query aims to retrieve all meta-proteins with their related proteins and peptides. This query starts at the protein nodes with the relationship IS_METAPROTEIN_OF and traverses the graph via proteins to peptides. The query resulted in 712 meta-proteins linked to 1351 constituent proteins identified by 1140 peptides.

The second query extends the first query by a WHERE condition and a regular expression to exclude the term keratin. Filtering out contaminant proteins such as keratin or searching for a specific protein is an important step in any proteomics analysis. This restricted query resulted in 702 meta-proteins linked to 1236 proteins and 947 peptides.

To find out which taxa and functions can be associated with the data set, the identified meta-proteins are grouped by their taxonomies and ontologies (in this case the biological processes) in the next query: this resulted in 229 taxonomies (all levels), 150 ontologies (biological processes), 428 meta-proteins, and 820 proteins.

The last query contains two MATCH clauses and the WHERE condition for the pathway identifier Methyl CoM Reductase (K00399), the final enzyme of methane production. In this case, the query returns the single protein P07962 (Methyl CoM Reductase Subunit Alpha) from the organism Methanosarcina barkeri.
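The traversal logic behind the first two queries can be mimicked on a tiny in-memory graph. The actual queries are written in Cypher and listed in Note 5 of the Supporting Information; the sketch below does not reproduce them. It only assumes the relationship names and directions from Table 2 (meta-protein to protein via IS_METAPROTEIN_OF, protein to peptide via HAS_PEPTIDE). Apart from the accession P07962 quoted in the text, every identifier, description, and peptide is hypothetical.

```python
import re

# Directed relationships as (start node, relationship type, end node).
example_edges = [
    ("MP1", "IS_METAPROTEIN_OF", "P07962"),
    ("MP2", "IS_METAPROTEIN_OF", "KRT_EXAMPLE"),
    ("P07962", "HAS_PEPTIDE", "PEPTIDEA"),
    ("P07962", "HAS_PEPTIDE", "PEPTIDEK"),
    ("KRT_EXAMPLE", "HAS_PEPTIDE", "PEPTIDEC"),
]

# Node properties: here only a description per protein node.
example_descriptions = {
    "P07962": "Methyl CoM Reductase Subunit Alpha",
    "KRT_EXAMPLE": "Keratin, hypothetical contaminant entry",
}


def targets(edges, start, relationship):
    """Return the end nodes of all outgoing relationships of one type."""
    return [end for source, rel, end in edges if source == start and rel == relationship]


def metaprotein_paths(edges, descriptions, exclude_pattern=None):
    """List (meta-protein, protein, peptide) paths through the graph.

    If exclude_pattern is given, proteins whose description matches the
    regular expression (case-insensitive) are skipped, similar in spirit to
    a WHERE condition with a regular expression.
    """
    rows = []
    metaproteins = sorted({source for source, rel, _ in edges if rel == "IS_METAPROTEIN_OF"})
    for metaprotein in metaproteins:
        for protein in targets(edges, metaprotein, "IS_METAPROTEIN_OF"):
            description = descriptions.get(protein, "")
            if exclude_pattern and re.search(exclude_pattern, description, re.IGNORECASE):
                continue
            for peptide in targets(edges, protein, "HAS_PEPTIDE"):
                rows.append((metaprotein, protein, peptide))
    return rows


if __name__ == "__main__":
    print(metaprotein_paths(example_edges, example_descriptions))
    print(metaprotein_paths(example_edges, example_descriptions, exclude_pattern="keratin"))
```

In a graph database the same pattern is declared rather than looped over, which is what makes user-defined questions cheap to formulate; the nested loops here only spell out what such a path pattern matches.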



ASSOCIATED CONTENT

Supporting Information

Application tutorial; data set description of microbial sample; SQL schema; meta-protein grouping rules; user-defined querying of data. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected]. Fax: 49-3916110535.
*E-mail: [email protected]. Fax: 32-92649484.

Author Contributions

¶T.M. and A.B. contributed equally to this work.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

R.H. was supported by the German Environmental Foundation (DBU), Grant No. 20011/136. F.K. was supported by a grant of the Federal Ministry of Food, Agriculture, and Consumer Protection (BMELV) communicated by the Agency for Renewable Resources (FNR), Grant No. 22028811 (Biogas Biocoenosis). L.M. acknowledges the support of Ghent University (Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks”), the IWT SBO grant “INSPECTOR” (120025), and the EU FP7 PRIME-XS project, Grant agreement No. 262067. The sample was kindly provided by Dipl.-Geograph Matthias Neuss of ABO Wind AG.



CONCLUSION

For peptide and protein identification, the MetaProteomeAnalyzer software allows searching with different algorithms (X!Tandem, OMSSA, Crux, InsPecT), each of which has its specific strengths. Combining their results increases the reliability of protein identification. However, the identification lists returned by database search engines are hardly sufficient for in-depth proteome analysis of microbial communities and merely serve as a starting point for further investigation. Indeed, a vast pool of knowledge about proteins is readily available in public online databases, but connecting protein search results with these resources typically remains a time-consuming and tedious manual task. The MPA therefore automatically combines meta-information retrieved from a variety of sources to annotate protein identifications and thus categorize the data in a meaningful, integrative fashion.

An important benefit of the MPA is the flexible grouping of redundant protein identifications into meta-proteins. Instead of providing only one solution, the MPA lets the user choose between different protein grouping methods or combine multiple strategies, depending on whether the emphasis lies on reducing protein diversity or on preserving taxonomic resolution.
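The grouping of redundant protein identifications can be sketched as follows. This is a minimal illustration, not the MPA implementation: it assumes that a shared-peptide grouping rule merges all proteins that are connected through at least one common identified peptide, and it uses a union-find structure to build the groups. The function name `group_by_shared_peptide` and the example accessions and peptides are hypothetical.

```python
from collections import defaultdict


def group_by_shared_peptide(protein_peptides):
    """Group proteins that are linked by at least one shared peptide.

    protein_peptides maps a protein accession to the set of peptides
    identified for it. Returns a sorted list of sorted protein groups
    (each group plays the role of one meta-protein in this sketch).
    """
    parent = {protein: protein for protein in protein_peptides}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first protein seen for each peptide; any later protein
    # carrying the same peptide is merged into that protein's group.
    first_owner = {}
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            if peptide in first_owner:
                union(first_owner[peptide], protein)
            else:
                first_owner[peptide] = protein

    groups = defaultdict(set)
    for protein in protein_peptides:
        groups[find(protein)].add(protein)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical example: P1 and P2 share peptide "b", P2 and P3 share "c".
example = {
    "P1": {"a", "b"},
    "P2": {"b", "c"},
    "P3": {"c"},
    "P4": {"d"},
}
meta_proteins = group_by_shared_peptide(example)
```

In this example P1 and P3 share no peptide directly but end up in the same group through P2, which shows why such a rule strongly reduces redundancy at the possible cost of taxonomic resolution; a stricter rule would keep more, smaller groups.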



ABBREVIATIONS

E.C., Enzyme Commission; emPAI, exponentially modified protein abundance index; LIMS, laboratory information management system; MPA, MetaProteomeAnalyzer; MS, mass spectrometry; LC−MS/MS, liquid chromatography tandem mass spectrometry; SQL, structured query language; NSAF, normalized spectral abundance factor; FDR, false discovery rate; API, application programming interface



DOI: 10.1021/pr501246w J. Proteome Res. 2015, 14, 1557−1565