Unipept 4.0: Functional Analysis of Metaproteome Data


Article

Unipept 4.0: functional analysis of metaproteome data Robbert Gurdeep Singh, Alessandro Tanca, Antonio Palomba, Felix Van der Jeugt, Pieter Verschaffelt, Sergio Uzzau, Lennart Martens, Peter Dawyndt, and Bart Mesuere J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00716 • Publication Date (Web): 22 Nov 2018 Downloaded from http://pubs.acs.org on November 22, 2018



Journal of Proteome Research

Unipept 4.0: functional analysis of metaproteome data

Robbert Gurdeep Singh a, Alessandro Tanca b, Antonio Palomba b, Felix Van der Jeugt a, Pieter Verschaffelt a, Sergio Uzzau b, Lennart Martens c,d, Peter Dawyndt a, Bart Mesuere* a,c,d

a Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
b Porto Conte Ricerche, Science and Technology Park of Sardinia, Tramariglio, Alghero, Italy
c VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
d Department of Biochemistry, Ghent University, Ghent, Belgium

* corresponding author: [email protected]

Abstract

Unipept (https://unipept.ugent.be) is a web application for metaproteome data analysis, with an initial focus on tryptic peptide based biodiversity analysis of MS/MS samples. As the true potential of metaproteomics lies in gaining insights into the expressed functions of complex environmental samples, the 4.0 release of Unipept introduces complementary functional analysis based on GO terms and EC numbers. Integration of this new functional analysis with the existing biodiversity analysis is an important asset of the extended pipeline. As a proof of concept, a human faecal metaproteome dataset from 15 healthy subjects was reanalysed with Unipept 4.0, yielding a fast, detailed and straightforward characterization of taxon-specific catalytic functions that is shown to be consistent with earlier results from a BLAST-based functional analysis of the same data.

Keywords Unipept; metaproteomics; functional analysis; biodiversity analysis; GO terms; EC numbers; data analysis; data visualisation


Introduction

Shotgun metaproteomics has evolved over the past decade from a promising new technology to an established strategy for gaining insights into complex ecological systems and microbial communities. Its application domain is very diverse, including soil 1, wastewater 2, food safety 3 and the human gut 4–6. In addition, metaproteomics has proven invaluable in deepening our understanding of host-microbiome interactions in the human gut related to diabetes 7, inflammatory bowel disease 8, cystic fibrosis 9 and even depression 10. The field has matured over the years and its focus has gradually shifted from studying the biodiversity in environmental samples to the analysis of expressed functions 11. Whereas metagenomics is an established and relatively cheap technique to unravel the biodiversity in environmental samples, it can only be used to gauge the functional potential of the community. This is where metaproteomics has the potential to shine, but dedicated tools to perform adequate data analysis are generally lacking. A typical analysis strategy adopted today starts with mapping identified peptides to functions using BLAST-like 12 alignment against reference protein databases with functional annotations such as InterPro 13 or KEGG 14. Tools such as MEGAN 15, MetaProteomeAnalyzer (MPA) 16 and MetaGOmics 17 help to automate this process.

MEGAN is a well-established desktop application for interactive analysis of microbiome data. Next to biodiversity analysis based on the NCBI Taxonomy 18, it also provides functional analysis based on InterPro2GO 19, SEED 20, eggNOG 21 and KEGG. Although MEGAN was initially developed for metagenome analysis, it can also be used for analysing metaproteome data by importing BLAST hits resulting from peptide searches and inferring biodiversity and functional annotations from the matched proteins. While this is an acceptable workaround, it is not a perfect solution for metaproteome analysis because of the many laborious steps. Newer tools like MPA and MetaGOmics are specifically designed for metaproteome analysis. MPA offers a complete metaproteomics pipeline from database search to downstream data analysis. Biodiversity and functional analysis in MPA is based on protein identifications and uses the concept of metaproteins to group multiple proteins. Complementary to biodiversity analysis based on the NCBI Taxonomy, MPA also supports functional analysis based on Enzyme Commission (EC) numbers and KEGG-based pathway views. MetaGOmics is a web application that maps tryptic peptides to NCBI Taxonomy entries and Gene Ontology (GO) annotations based on BLAST searches in large and well-annotated protein databases such as UniProtKB 22.

Unipept 23,24 is a web application dedicated to metaproteome data analysis. Its initial focus was on fast and accurate biodiversity analysis 25 of individual tryptic peptides as well as large metaproteomes, building on NCBI Taxonomy cross-references from matched UniProtKB proteins. Taxa are assigned to peptides using a lowest common ancestor (LCA) approach. LCAs of all UniProtKB-derived peptides are precomputed and stored in the Unipept database for performance reasons, so that thousands of peptides can be processed in just a few seconds. Analysis results are provided as interactive visualisations for complex community exploration. The Unipept metaproteome analysis pipeline was recently extended with an API 26 and a command line interface 27 to support large-scale data analysis and integration in external tools and pipelines such as the Galaxy-P framework 28. Here, we present the new 4.0 release of Unipept that introduces complementary functional analysis based on UniProtKB cross-references to GO terms and EC numbers. Prior to this release, functional analysis support in Unipept was limited to listing cross-references to GO terms and EC numbers of UniProtKB proteins that match a single tryptic peptide. Extended functional analysis support required the design of a novel aggregation strategy and integration of its implementation with the existing analysis pipeline to support combined biodiversity/functional analysis, paving the way for further advancements in metaproteomics.

Unipept 4.0

Unipept 4.0 is a free web application (https://unipept.ugent.be) for metaproteome data analysis that runs on all computers with a modern web browser. The application provides fast and accurate biodiversity/functional analysis for individual tryptic peptides and large metaproteome datasets in the form of peptide lists.


Database construction

The Unipept metaproteome analysis pipeline is underpinned by a fast index for mapping tryptic peptides onto proteins in the UniProt Knowledgebase 22 (UniProtKB), the primary protein data source of Unipept. This allows inferring peptide-related functions from functional UniProtKB cross-references of the matched proteins. Functional peptide annotations in Unipept 4.0 build on two UniProtKB cross-referencing schemes that describe gene products: Gene Ontology (GO) 29,30 terms and Enzyme Commission (EC) 31 numbers.

GO defines concepts that describe gene functions, with names (so-called “GO terms”) assigned to concepts and a subdivision of concepts in three domains: molecular function, biological process and cellular component. The ontology is organized as a hierarchical network of relationships between GO terms. However, where NCBI Taxonomy relationships form a tree topology, GO terms may have multiple parent terms. As a result, the GO relationships form a directed acyclic graph instead of a tree.

EC numbers classify enzymes on the chemical reactions they catalyse. Each enzyme is assigned a short description (the accepted name) and a four-part numerical identifier (e.g. EC:1.7.3.2), which explicitly organizes the ontology as a tree. For example, the parent element of EC:1.7.3.2 (acetylindoxyl oxidase) is EC:1.7.3.- (oxidoreductases acting on other nitrogenous compounds as donors with oxygen as acceptor), and the parent of EC:1.7.3.- is EC:1.7.-.- (oxidoreductases acting on other nitrogenous compounds as donors). This results in a clean 4-level tree structure, with the unfortunate downside that a term has to be renumbered if its classification gets modified.

For performance reasons, peptide annotations are precomputed for all UniProtKB-derived peptides and cached in the Unipept database, resulting in a fast index of functional peptide annotations that is, to the best of our knowledge, unique in its kind. This is done by performing an in silico trypsin digest on each UniProtKB protein (Figure 1A). A selection of protein cross-references (NCBI Taxonomy cross-references for biodiversity analysis and EC/GO cross-references for functional analysis) is then inherited by its derived tryptic peptides (Figure 1B).
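The EC hierarchy described above can be illustrated with a minimal Python sketch. The helper name `ec_ancestors` is hypothetical and is not taken from the Unipept code base; it only derives the chain of parents of an EC number by replacing trailing levels with dashes, assuming that unspecified levels always appear at the end.

```python
def ec_ancestors(ec_number):
    """Return the parents of an EC number, from most to least specific.

    Hypothetical helper for illustration only; not part of Unipept.
    """
    code = ec_number[3:] if ec_number.startswith("EC:") else ec_number
    levels = code.split(".")
    if len(levels) != 4:
        raise ValueError("an EC number has exactly four levels: " + ec_number)
    # Number of levels that are actually specified (assumes dashes are trailing).
    depth = sum(1 for level in levels if level != "-")
    ancestors = []
    for keep in range(depth - 1, 0, -1):
        # Keep the first `keep` levels and replace the rest with dashes.
        ancestors.append("EC:" + ".".join(levels[:keep] + ["-"] * (4 - keep)))
    return ancestors


print(ec_ancestors("EC:1.7.3.2"))
# ['EC:1.7.3.-', 'EC:1.7.-.-', 'EC:1.-.-.-']
```

Because every EC number has exactly one parent, this chain is unambiguous, which is not the case for GO terms with multiple parents.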


Figure 1. Schematic overview of preprocessing and caching functional peptide annotations in the Unipept database. a) Three UniProtKB proteins, each with 2 or 3 functional annotations. b) Tryptic digest on each protein, whose functional annotations are inherited by the resulting tryptic peptides. c) The bottom-right tryptic peptide has exact substring matches with the three UniProtKB proteins and its inherited functional annotations are cached in the Unipept database.

Because a single tryptic peptide can have multiple exact protein matches, it may have multiple sets of associated functional annotations. These sets are aggregated before storing the peptide and its functional annotations into the Unipept database. Aggregation is done by counting the number of occurrences of each functional annotation in the matched proteins. Figure 1C illustrates how GO aggregation works for the tryptic peptide highlighted in Figure 1B: there is only one occurrence of GO:1, three of GO:2, two of GO:3 and one of both GO:5 and GO:6. EC annotations are handled in a similar fashion. For each tryptic peptide derived from UniProtKB, the aggregation is precomputed and cached in the Unipept database in binary format (JSON BLOB). Since all downstream functional analysis is based on aggregations of individual tryptic peptides, retrieving precomputed data from the database is much faster than computing it on demand, at the cost of extra storage. However, the JSON format allows for efficient data compression resulting in a 15% space reduction for the database table containing the tryptic peptides.

In total, 68% of all proteins in UniProtKB (June 2018 version) have at least one GO annotation, whereas only 14% have at least one EC annotation (Table S1). If we look at the annotations after aggregation, we see that 45% of all tryptic peptides in Unipept have at least one GO annotation and 9% have at least one EC annotation. This might seem counter-intuitive as tryptic peptides potentially inherit annotations from multiple proteins, but there is a logical explanation. Well-studied organisms are over-represented in UniProtKB. For this reason, UniProtKB contains multiple instances of homologous proteins of the over-represented organisms. Due to the common practice of homology-based inference of protein function annotations 32, proteins from well-studied organisms are more likely to be functionally annotated but are less likely to add new functional annotations to unique tryptic peptides. In practice, the same reasoning explains why real-world datasets have annotation coverages between 80 and 90% (Table S2).

Together with the biodiversity aggregations (LCAs) inferred from cross-references to the NCBI Taxonomy 33 that were already stored in the Unipept database 24, the new functional aggregations have been incorporated in an extensive catalogue of 1.3 billion tryptic peptides. A fast index structure on this database supports high-performance queries for all UniProtKB proteins that contain an exact match of a given tryptic peptide and for precomputed aggregations on the biodiversity and functional annotations of these proteins without the need for any further time-consuming computations.
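The counting-based aggregation of Figure 1C can be sketched in a few lines of Python. The function name and the assignment of GO terms to the three proteins are assumptions made for illustration (chosen only so that the totals agree with the counts quoted for Figure 1C), and the JSON layout shown here is not necessarily the one used in the Unipept database.

```python
import json
from collections import Counter


def aggregate_annotations(protein_annotations):
    """Count in how many matched proteins each functional annotation occurs."""
    counts = Counter()
    for annotations in protein_annotations:
        # Use a set so that an annotation is counted once per protein.
        counts.update(set(annotations))
    return dict(counts)


# Hypothetical annotation sets of the three proteins matched by one peptide.
matched_proteins = [
    ["GO:1", "GO:2", "GO:3"],
    ["GO:2", "GO:3", "GO:5"],
    ["GO:2", "GO:6"],
]

aggregated = aggregate_annotations(matched_proteins)
blob = json.dumps(aggregated, sort_keys=True)  # serialised form to be cached
print(blob)
# {"GO:1": 1, "GO:2": 3, "GO:3": 2, "GO:5": 1, "GO:6": 1}
```

Storing the counts rather than a single consensus value keeps all information available for the downstream ranking and filtering steps.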
Because Unipept is primarily developed to process mass spectrometry data, tryptic peptides that are shorter than 5 amino acids or longer than 50 amino acids are ignored when building the tryptic peptide database. In addition, users can indicate that no distinction should be made between isoleucine (I) and leucine (L) when matching tryptic peptides to proteins, because these two amino acids have the same mass. To avoid a performance overhead when using this feature, an alternative version of the tryptic peptide database is built, where biodiversity and functional aggregations are precomputed by ignoring the difference between isoleucine and leucine.
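The digestion and filtering steps can be sketched as follows. The cleavage rule used here (cut after K or R unless the next residue is P) is a common convention for in silico trypsin digestion and is an assumption, since the text does not state the exact rule; the function name and the example protein sequence are likewise hypothetical. The length bounds of 5 and 50 residues and the optional equating of I and L follow the description above.

```python
def tryptic_digest(protein, min_len=5, max_len=50, equate_il=False):
    """Digest a protein in silico and keep peptides of min_len..max_len residues.

    Assumed rule: cleave after K or R, except when followed by P.
    """
    if equate_il:
        # Treat isoleucine and leucine as the same residue.
        protein = protein.replace("I", "L")
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return [p for p in peptides if min_len <= len(p) <= max_len]


protein = "MKALQQIQTKPEPTIDERAAAAAGK"  # made-up sequence
print(tryptic_digest(protein))
# ['ALQQIQTKPEPTIDER', 'AAAAAGK']
print(tryptic_digest(protein, equate_il=True))
# ['ALQQLQTKPEPTLDER', 'AAAAAGK']
```

The peptide "MK" is dropped because it is shorter than 5 residues, and the K before P does not trigger a cleavage under the assumed rule.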


Single peptide analysis

When searching Unipept for a single tryptic peptide, all UniProtKB proteins are retrieved that contain an exact match of the peptide (with or without making distinction between isoleucine and leucine as specified by the user), along with the biodiversity and functional annotations of the matched proteins and an aggregation of these annotations for the peptide. Unipept computes the lowest common ancestor (LCA) of all biodiversity protein annotations (cross-references to the NCBI Taxonomy), resulting in a single biodiversity annotation for the tryptic peptide. The LCA is the most specific node in the tree of life that encompasses all taxa associated with the matched proteins. However, for functional protein annotations we currently have not found a working aggregation strategy that results in a single consensus value. Such aggregation might not even be appropriate in this case. After all, a significant difference between biodiversity and functional annotations is that each protein is derived from a single organism, whereas it may have multiple functions. Additionally, for GO terms the ontology does not follow the tree structure required by the LCA algorithm. Instead of looking for a single consensus value, we took the more pragmatic approach and report all functional annotations linked to any of the associated proteins, ranked by number of occurrences.

After searching for a single tryptic peptide, Unipept displays a page whose top section contains a summary of the biodiversity and functional annotations. This includes the number of proteins that contain an exact match of the peptide, the LCA that serves as the aggregated biodiversity annotation of the peptide and the number of matched proteins that have at least one functional GO/EC annotation (Figure 2A).
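The definition of the LCA given above can be made concrete with a minimal sketch. It assumes that each matched protein comes with a root-to-taxon lineage represented as a list of names; the function name and the example lineages are hypothetical, and any refinements that the actual Unipept implementation may apply are not covered here.

```python
def lowest_common_ancestor(lineages):
    """Return the most specific taxon shared by all root-to-taxon lineages."""
    lca = None
    # zip(*lineages) walks all lineages in parallel, one rank at a time,
    # and stops at the end of the shortest lineage.
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            lca = taxa_at_rank[0]
        else:
            break
    return lca


lineages = [
    ["root", "Bacteria", "Firmicutes", "Clostridia"],
    ["root", "Bacteria", "Firmicutes", "Bacilli"],
    ["root", "Bacteria", "Bacteroidetes"],
]
print(lowest_common_ancestor(lineages))
# Bacteria
```

This procedure relies on every taxon having a single parent, which is why it cannot be transferred directly to GO terms.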
Results of the aggregation process can be explored in more detail underneath the summary, with tabs that contain a list of all matched proteins with their individual biodiversity and functional annotations (Figure 2B), the taxa associated with the matched proteins grouped into a tree and a table, the ranked GO annotations of the matched proteins (Figure 2C) and the ranked EC annotations of the matched proteins (Figure 2D). Unipept 4.0 has amended the “Matched proteins” tab with functional protein annotations and has added the “GO terms” and “EC numbers” tabs to provide insight into the functional annotations of the individual proteins that match the peptide.

The “GO terms” tab ranks all GO terms that are cross-referenced in any matched protein (Figure 2C). It contains separate tables for each of the Gene Ontology domains: Biological Process, Cellular Component and Molecular Function. The GO terms are ranked by decreasing number of protein cross-references, with the most frequently referenced GO term on top. As a visual aid, all GO domain-specific tables have a bar chart background with bar sizes in the first column proportional to the number of matched proteins annotated with the row’s corresponding GO term. A QuickGO hierarchy chart accompanies each GO domain-specific table, locating the five most frequent GO annotations using the QuickGO API 34. Raw data can be exported in CSV format for downstream analysis using the “Save table as CSV” button. The resulting machine-readable file can then be imported in applications such as Microsoft Excel.

The “EC numbers” tab ranks all EC numbers that are cross-referenced in any matched protein (Figure 2D), where ranking is again by decreasing number of protein cross-references. The same data are also represented using an interactive visualisation below the tabular representation. Since EC numbers themselves have a tree-like structure, the relative number of protein cross-references can also be visualised as a tree, emphasising which EC numbers are referenced more often. The radius of each tree node is proportional to the number of proteins referencing any EC number equal to or below the EC number corresponding to the node. Raw data can be exported in CSV format in the same way as with the GO terms. The tree can be exported both as a PNG image and as an editable SVG vector graphic.
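The node sizes of the EC tree can be derived as sketched below: for every node, count the proteins that reference an EC number equal to or below it. This is an illustrative sketch under the assumption that each protein is given as a set of EC numbers without the "EC:" prefix; the function names and the example EC sets are hypothetical and the scaling from count to radius is left out.

```python
from collections import Counter


def ec_nodes(ec_number):
    """Return the EC number itself plus all of its ancestors in the EC tree."""
    levels = ec_number.split(".")
    depth = sum(1 for level in levels if level != "-")
    return {
        ".".join(levels[:keep] + ["-"] * (4 - keep))
        for keep in range(1, depth + 1)
    }


def ec_tree_counts(proteins):
    """Per tree node, count the proteins referencing that node or a descendant."""
    counts = Counter()
    for ec_numbers in proteins:
        nodes = set()
        for ec_number in ec_numbers:
            nodes |= ec_nodes(ec_number)
        counts.update(nodes)  # each protein contributes at most once per node
    return counts


# Hypothetical EC cross-references of three matched proteins.
proteins = [{"2.7.7.6"}, {"2.7.7.6", "2.7.7.7"}, {"3.6.4.12"}]
counts = ec_tree_counts(proteins)
print(counts["2.7.7.-"], counts["2.7.7.7"], counts["3.-.-.-"])
# 2 1 1
```

Collecting the nodes per protein in a set before counting prevents a protein with two EC numbers in the same subtree from being counted twice at their shared ancestors.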


Figure 2: New and updated summary and tabs in Unipept 4.0 with functional analysis results for the tryptic peptide ALQQLQTK. Exact protein matching has been done by ignoring the difference between isoleucine (I) and leucine (L). a) Summary of biodiversity and functional aggregation of matched protein annotations. b) List of all proteins containing an exact match to peptide ALQQLQTK, with UniProtKB protein names and cross-references to NCBI Taxonomy, GO terms and EC numbers. c) GO terms from the Molecular Function domain cross-referenced in any matched protein ranked by decreasing number of protein cross-references, accompanied by a QuickGO hierarchy chart locating the five most frequent GO annotations. d) EC numbers cross-referenced in any matched protein ranked by decreasing number of protein cross-references, with bar chart in background of first column of the table and interactive tree visualisation below the table.

Metaproteome analysis

In addition to functional analysis of individual tryptic peptides, Unipept also supports fast and accurate functional analysis of metaproteome datasets that typically contain thousands of tryptic peptides. A dataset of tryptic peptides can be loaded either by directly pasting a list of peptides into Unipept or by importing it from PRIDE 35 by entering the unique identifier of an assay. Unipept also provides several demo datasets that can be loaded with a single click, allowing quick testing and evaluation of the Unipept metaproteome analysis. Loaded peptides are sent to the server for metaproteome analysis after clicking the search button. A results page is displayed as soon as the analysis has been completed (usually in less than a second).

Only biodiversity analysis results were reported before the release of Unipept 4.0, using a range of visualisations that provide insights into the biodiversity of the sample. Biodiversity analysis is based on fast retrieval of precomputed LCAs from the Unipept database for the individual tryptic peptides in the metaproteome dataset. The tree-like organization of living organisms is preserved both in the aggregation and the visualisations. For each of the matched peptides in the dataset, the results page provides quick access to the Single Peptide Analysis page as described in the previous section.

Complementary to biodiversity analysis, functional analysis of metaproteome datasets was introduced in release 4.0 of Unipept. A new heuristic was designed and implemented for aggregating functional annotations of the individual peptides in the dataset. After all, as with the functional analysis of an individual peptide, the LCA approach used for biodiversity analysis is not suitable for functional analysis due to the non-tree structure of GO terms and the multitude of distinct functional annotations per protein/peptide. Instead, all functional annotations cross-referenced by at least one of the matched proteins are initially considered as functional annotations of the peptide. This strategy requires more database storage and more downstream computations than a single or filtered aggregation result, but it avoids discarding useful information a priori and allows more exploratory and interactive downstream analysis. As this conservative approach is highly sensitive to spurious functional annotations of matched proteins, be it erroneous functional protein annotations or erroneous protein matches, the functional analysis report allows users to automatically discard functional peptide annotations with low provenance. This is done by setting a threshold: a lower bound on the relative number of matched proteins that cross-reference a functional GO/EC annotation. Cross-referenced annotations below this threshold (default value: 5%) are not considered as functional peptide annotations (Figure 3). The functional analysis of the metaproteome dataset is automatically recomputed and the reports are automatically updated if the threshold value is modified.

Figure 3: Functional analysis of a metaproteome dataset. All functional annotations cross-referenced by at least one of the matched proteins are initially considered as functional annotations of a peptide. Cross-referenced annotations below a user-specified threshold (a lower bound on the relative number of matched proteins that cross-reference a functional annotation, here set to 5%) are not considered as functional peptide annotations. In this example, only 4% of the matched proteins of the blue peptide cross-reference GO:7. Because the relative number of cross-references falls below the threshold it is not considered as a functional annotation of the blue peptide, resulting in two peptides in the metaproteome dataset annotated with GO:7 instead of three if no annotations were filtered.
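The threshold-based filtering illustrated in Figure 3 can be sketched as follows. Only the 5% threshold and the 4% share of GO:7 for the blue peptide come from the text; the function names, the remaining counts and the other GO identifiers are hypothetical.

```python
from collections import Counter


def filter_annotations(annotation_counts, matched_proteins, threshold=0.05):
    """Keep annotations cross-referenced by at least `threshold` of the proteins."""
    if matched_proteins == 0:
        return set()
    return {
        annotation
        for annotation, count in annotation_counts.items()
        if count / matched_proteins >= threshold
    }


def peptide_counts(dataset, threshold=0.05):
    """Count, per annotation, the peptides that retain it after filtering."""
    counts = Counter()
    for matched_proteins, annotation_counts in dataset.values():
        counts.update(filter_annotations(annotation_counts, matched_proteins, threshold))
    return counts


# peptide -> (number of matched proteins, proteins per annotation)
dataset = {
    "red": (50, {"GO:7": 25, "GO:8": 50}),
    "green": (20, {"GO:7": 20}),
    "blue": (100, {"GO:7": 4, "GO:9": 60}),  # GO:7 in only 4% of the proteins
}

print(peptide_counts(dataset)["GO:7"])                 # 2
print(peptide_counts(dataset, threshold=0.0)["GO:7"])  # 3
```

Because the filter operates on the cached per-peptide counts, changing the threshold only requires recomputing these sums and not a new database query, which is consistent with the report being updated automatically when the threshold is modified.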


The functional analysis results of a metaproteome dataset are reported in two new tabs: “GO terms” and “EC numbers” (Figure S1). The “GO terms” tab ranks all GO terms that are considered as an annotation of at least one peptide of the dataset, again split into separate tables for each of the three GO domains. The GO terms are sorted by decreasing number of peptide annotations, with a bar chart in the background of the table representing the number of annotations. Only the top five GO terms are displayed by default, with a button to expand the list of displayed GO terms. A QuickGO hierarchy chart locating the five most frequent GO terms is displayed next to each table. Clicking the thumbnail renders a full-screen view of the chart.

The “EC numbers” tab ranks all EC numbers that are considered as an annotation of at least one peptide of the dataset. The EC numbers are sorted by decreasing number of peptide annotations and the table is collapsed in the same way as the GO tables. Because of the tree structure of EC numbers, the same data is also represented using an interactive tree visualisation below the tabular representation (Figure 5A). Each tree node corresponds to an EC number, with a radius proportional to the number of peptides in the dataset that are annotated with the EC number.

All functional analysis reports can be exported in CSV format for downstream analysis. Three types of CSV exports are available:

a) Peptide report (click “Download results” on top of the page): Reports for each peptide in the metaproteome dataset the LCA, the taxonomic lineage from root to LCA and the three most abundant functional GO/EC annotations among the matched proteins (separately for the three GO domains).

b) Functional analysis report (click “Download table as CSV”): Contains the data from a GO table or an EC table in machine-readable format.

c) Functional annotation report (click the download icon on a GO/EC table row): Lists all peptides in the metaproteome dataset that are annotated with the selected GO term or EC number (with filtering based on the user-specified threshold). Reports for each peptide the spectral count, the LCA, the taxonomic lineage from root to LCA, the number of matched proteins and the absolute/relative number of matched proteins that cross-reference the selected GO/EC annotation.


Because Unipept 4.0 now analyses both the biodiversity and function of each individual tryptic peptide in a metaproteome dataset, the peptides can be used to link both analyses and gain insight into which organisms in the sample perform which functions. The metaproteome data analysis report therefore contains two features that can help answer the “who is doing what” question. Selecting a taxon in any of the biodiversity analysis visualisations restricts the functional analysis to the peptides whose LCA is an ancestor of the taxon in the tree of life (Supplementary Figure 1). Hovering over a functional GO/EC annotation table row then also displays a popup window that reports the abundance of the function compared to the full set of peptides. This allows exploring which functions are executed by the selected taxon. The reverse relationship between function and biodiversity can be explored by clicking a functional GO/EC annotation table row. This expands the row with the tree view from the biodiversity analysis of the dataset, in which all taxa are highlighted that are associated with at least one peptide from the dataset that is annotated with the selected GO term or EC number (Figure 4). This allows exploring which taxa are responsible for executing a selected function.

Figure 4. Selecting the GO:0060243 term (negative regulation of cell growth involved in contact inhibition) in the metaproteome analysis of the sample 7 (human gut) demo dataset by Verberkmoes et al. 4 shows that this biological process is only active in human cells (taxon Homininae) within the sample.
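The lookup from a function to the taxa that carry it, as shown in Figure 4, amounts to collecting the LCAs of all peptides annotated with the selected term. The sketch below assumes that each analysed peptide is available as a record with its LCA and its filtered annotations; the function name, the record layout, the placeholder sequences and the example data are hypothetical.

```python
def taxa_for_function(peptides, annotation):
    """Return the LCAs of all peptides annotated with the given GO/EC term."""
    return {
        peptide["lca"]
        for peptide in peptides
        if annotation in peptide["annotations"]
    }


# Made-up analysis results for three peptides.
peptides = [
    {"sequence": "PEPTIDEA", "lca": "Homininae", "annotations": {"GO:0060243"}},
    {"sequence": "PEPTIDEB", "lca": "Bacteroides", "annotations": {"GO:0005524"}},
    {"sequence": "PEPTIDEC", "lca": "Homininae", "annotations": {"GO:0060243", "GO:0005524"}},
]

print(taxa_for_function(peptides, "GO:0060243"))
# {'Homininae'}
```

In this made-up dataset the selected term is only found in peptides assigned to Homininae, mirroring the situation described in the caption of Figure 4.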

Hardware and performance

The backend of Unipept is developed using the Ruby on Rails framework, with a frontend that is optimized for dispatching most computations underpinning the interactive visualisations and dynamic reporting to client-side JavaScript using WebWorker technology. Privacy of input peptides and their spectral counts is ensured by using secure HTTPS communication channels between clients and the server. As a consequence of careful database design and the use of JavaScript Local Storage, input datasets are kept private to the client computer. The usability and responsiveness are improved by breaking up input datasets before sending them to the server. The Unipept web application can be used with any modern web browser.

The time needed to perform a metaproteome analysis depends on the number of peptides in the input dataset and the time needed to process a single peptide. Benchmarks show that processing a single peptide takes between 0.43 ms and 10 ms, depending on the state of the database cache in which recently analysed peptides (across all analyses) yield faster response times. This corresponds to analysing between 107 and 2317 peptides per second.

The source code of the Unipept web application (frontend and backend) and the data processing pipeline for constructing a Unipept database are freely available under the permissive MIT License on GitHub (https://github.com/unipept/unipept). This allows power users to run the software on-premise, backed by a database constructed from UniProtKB or any alternative protein repository. As a reference, database construction for UniProtKB/Swiss-Prot took 30 minutes on a modern desktop computer (Intel Core i7-7820HQ CPU @ 2.90GHz processor and 15 GB of RAM). Database construction for all UniProtKB entries took 72 hours on a high-memory machine (512 GB of RAM), including the time for parsing UniProtKB and precomputing the biodiversity and functional aggregations for each tryptic peptide found in the protein database. The latter resulted in a 499 GB database (including indexes) for the June 2018 version of UniProtKB. The Unipept web server runs on a virtual Debian machine with a dedicated 2.6 GHz Intel Xeon Hexa Core processor and 128 GB of RAM, but less memory should also suffice.

Case study

To assess the potential of providing insightful functional annotations of metaproteomes, we used Unipept 4.0 to reanalyse a human faecal metaproteome dataset from 15 healthy subjects that was recently characterised based on both biodiversity and functional annotations (36). The list of identified tryptic peptides (26,186 non-redundant sequences in total) was loaded in the Metaproteome Analysis module, with options "equate I and L" and "advanced missed cleavage handling" selected. The number of non-redundant peptide sequences was reduced to 25,086 after equating isoleucine (I) and
leucine (L). Unipept matched 21,369 peptides (85% of the input dataset) to at least one protein, of which 16,390 peptides (65% of the input dataset, 77% of the peptides with matched proteins) were annotated with at least one GO term and 8,747 peptides (35% of the input dataset, 41% of the peptides with matched proteins) were annotated with at least one EC number (default threshold value of 5%). More specifically, 6,960 peptides (28% of the input dataset, 33% of the peptides with matched proteins) were annotated with both a taxon (phylum level or lower) and an EC number.
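The percentages quoted above follow directly from the reported peptide counts. The short Python sketch below recomputes them, using the 25,086 peptides that remain after equating I and L as the size of the input dataset; the variable and function names are illustrative only.

```python
def percentage(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts as reported in the text.
INPUT_PEPTIDES = 25_086     # non-redundant peptides after equating I and L
MATCHED = 21_369            # matched to at least one protein
WITH_GO = 16_390            # at least one GO term
WITH_EC = 8_747             # at least one EC number
WITH_TAXON_AND_EC = 6_960   # taxon (phylum level or lower) and EC number

print("matched:", percentage(MATCHED, INPUT_PEPTIDES), "% of input")
for label, count in [("GO", WITH_GO), ("EC", WITH_EC),
                     ("taxon+EC", WITH_TAXON_AND_EC)]:
    print(label + ":",
          percentage(count, INPUT_PEPTIDES), "% of input,",
          percentage(count, MATCHED), "% of matched")
```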

Figure 5. Functional characterisation of the case study dataset (36) with EC numbers. a) Tree view of the identified enzyme distribution within the six major enzyme classes. b) Box plot of the 20 most abundant enzymes in the gut microbiota of the 15 human subjects of the case study. The same colour codes are used for the six major enzyme classes in the tree view and the box plot. c) Venn
diagram showing the intersection and differences between the 100 most abundant enzymes in the case study dataset analysed with Unipept 4.0 and those identified in the original study (36).

Focusing on EC annotations (Figure 5A), we found oxidoreductases (class 1) to be the most abundant enzyme class in the faecal metaproteomes, followed by transferases (class 2) and hydrolases (class 3). Figure 5B shows the 20 most abundant EC annotations of peptides in the case study dataset. NADH peroxidase (EC 1.11.1.1) was the microbial catalytic function with the highest median abundance within the faecal microbiota of the human cohort (associated with other proteins involved in oxidative stress response, such as rubrerythrin and reverse rubrerythrin). The majority of highly expressed enzymes are responsible for carbohydrate metabolism, including glycolysis, pyruvate metabolism and butyrogenesis. Some enzymes linked to glutamate metabolism, transcription and translation were also found at high expression levels. Remarkably, both the second (EC 1.2.1.-) and third (EC 1.2.7.-) most abundant EC annotations lack a fourth digit, which represents the most specific classification level. This makes it hard to relate the EC number to a specific protein name or function, a link that would otherwise allow a straightforward comparison with other types of functional annotations such as InterPro families or KEGG orthology. However, in this case EC 1.2.1.- comprises the usually abundant glyceraldehyde-3-phosphate dehydrogenase and EC 1.2.7.- probably corresponds to pyruvate-flavodoxin oxidoreductase.

We then sought to perform a more stringent comparison with the functional characterization of the case study dataset as originally published (36), where functional annotation of the tryptic peptides was carried out by DIAMOND alignment of matched metagenomic sequences to the UniProtKB/Swiss-Prot Bacteria database (December 2015 version).
This was done by retrieving the original EC numbers for the 100 most abundant enzymes (based on median abundance) and calculating the intersection with the 100 most abundant enzymes in the Unipept 4.0 annotation (using UniProt 2018.06) of the case study dataset. The two functional annotation strategies are highly congruent, as they share 71% of their EC annotations (Figure 5C).

As a follow-up of the original study, we paid special attention to carbohydrate metabolism. Enzymes belonging to the glycolysis and butyrogenesis pathways were selected to elucidate the specific role of major gut microbiota members within those two important metabolic pathways. The integrated biodiversity/function analysis of Unipept 4.0 enabled us to assign most catalytic functions
down to the genus level (Figure 6). As expected, several genera across different phyla (with a well-known predominance of Bacteroidetes and Firmicutes) actively contributed to glycolytic reactions, whereas several butyrate producers within the Firmicutes (including Faecalibacterium, Roseburia, Clostridium, Butyricicoccus and Oscillibacter) were consistently found as key actors in butyrogenesis. Furthermore, all 13 enzymes (9 for glycolysis and 4 for butyrogenesis) detected in the original study (36) were also identified by Unipept, although slight differences in their biodiversity assignments could be observed.
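The top-100 comparison between the two annotation strategies described above can be sketched in a few lines of Python: rank EC numbers by median abundance across subjects, keep the top N from each annotation, and compute the fraction they share. This is a minimal sketch under stated assumptions, not the procedure actually used in the study. The data layout (a mapping from EC number to per-subject abundances), the function names and the toy values are all hypothetical.

```python
from statistics import median


def top_enzymes(abundance: dict[str, list[float]], n: int) -> set[str]:
    """Return the n EC numbers with the highest median abundance."""
    ranked = sorted(abundance, key=lambda ec: median(abundance[ec]),
                    reverse=True)
    return set(ranked[:n])


def shared_fraction(a: dict[str, list[float]],
                    b: dict[str, list[float]], n: int) -> float:
    """Fraction of the top-n enzymes shared between two annotations."""
    return len(top_enzymes(a, n) & top_enzymes(b, n)) / n


# Toy data with placeholder labels and made-up abundances (three subjects).
unipept = {"ec_a": [5, 6, 7], "ec_b": [3, 3, 4],
           "ec_c": [1, 1, 1], "ec_d": [0.5, 0.2, 0.1]}
original = {"ec_a": [4, 5, 6], "ec_c": [3, 2, 3],
            "ec_e": [2, 2, 2], "ec_b": [0.1, 0.1, 0.2]}

print(sorted(top_enzymes(unipept, 3) & top_enzymes(original, 3)))
print(shared_fraction(unipept, original, 3))
```

With N = 100, a shared fraction of 0.71 would correspond to the 71% overlap reported in Figure 5C.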

Figure 6. Biodiversity annotations (columns) of enzymes (rows) involved in glycolysis and butyrogenesis as found by Unipept 4.0 analysis of the case study dataset. The heatmap colour scale is based on logarithmized relative abundance (average over 15 subjects) of enzyme-taxon pairs. Enzymes detected in at least half of the subjects and genera expressing functions in at least two subjects are shown. Phylum columns (boldface text and black-bordered squares) account for the total abundance of all functions assigned to the phylum.
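The selection and scaling rules stated in the caption can be sketched as follows. The sketch assumes that each subject is represented by a mapping from (enzyme, genus) pairs to relative abundance, that "detected" means an abundance greater than zero, and that the colour value is the base-10 logarithm of the mean over all subjects. The caption does not specify the log base or the order of averaging and log-transformation, so these are assumptions, as are all names and toy values.

```python
import math
from collections import defaultdict

Pair = tuple[str, str]  # (enzyme, genus)


def heatmap_values(samples: list[dict[Pair, float]],
                   min_enzyme_fraction: float = 0.5,
                   min_genus_subjects: int = 2) -> dict[Pair, float]:
    """Filter enzyme-genus pairs and return log10 of their mean abundance."""
    n = len(samples)
    enzyme_subjects = defaultdict(set)   # enzyme -> subjects where detected
    genus_subjects = defaultdict(set)    # genus  -> subjects where detected
    totals = defaultdict(float)          # pair   -> summed abundance
    for i, sample in enumerate(samples):
        for (enzyme, genus), value in sample.items():
            if value > 0:
                enzyme_subjects[enzyme].add(i)
                genus_subjects[genus].add(i)
                totals[(enzyme, genus)] += value
    values = {}
    for (enzyme, genus), total in totals.items():
        if len(enzyme_subjects[enzyme]) < min_enzyme_fraction * n:
            continue  # enzyme detected in fewer than half of the subjects
        if len(genus_subjects[genus]) < min_genus_subjects:
            continue  # genus expresses functions in fewer than two subjects
        values[(enzyme, genus)] = math.log10(total / n)
    return values


# Toy data for four subjects; labels and abundances are made up.
samples = [
    {("enzyme_1", "genus_1"): 0.1, ("enzyme_2", "genus_2"): 0.01},
    {("enzyme_1", "genus_1"): 0.1},
    {("enzyme_1", "genus_3"): 0.1},
    {},
]
print(heatmap_values(samples))
```

In the toy data only the pair (enzyme_1, genus_1) passes both filters: enzyme_2 is detected in one of four subjects, and genus_2 and genus_3 each appear in a single subject.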

Conclusions

The true potential of metaproteomics lies in characterizing expressed proteins in complex environmental samples. Parallel to technical advances in instrumentation and peptide identification, the field urgently needs computational tools that are specifically tailored for analyzing metaproteomes to get the most out of the available data. With the 4.0 release of our user-friendly web application for metaproteome data analysis, Unipept has started to focus on functional signatures of environmental samples. New functional analysis pipelines were added to the existing biodiversity analysis. The pipelines are applicable to both individual peptides and complete metaproteome datasets, and are based on GO terms and EC numbers. The functional analysis pipelines have two-way integration with the biodiversity analysis pipeline. This combined approach makes it possible to investigate not only what each organism is doing, but also which organisms in the community perform a given function. Apart from providing a wide range of interactive data visualisations, all analysis results can be downloaded as CSV files for downstream data analysis. The source code of Unipept is also available on GitHub under the permissive MIT license.

A well-characterized human faecal metaproteome dataset of 15 healthy subjects was reanalysed to validate the results of Unipept 4.0. Results obtained with the Unipept analysis pipelines align well with earlier results obtained from a sequence alignment-based (DIAMOND) analysis. We observed that 71 of the 100 most abundant EC annotations are shared between the functional analysis of Unipept and the original study. Furthermore, Unipept 4.0 consistently mapped the expression of enzymes involved in carbohydrate metabolism to specific microorganisms known to exert those metabolic activities.
These analysis results are produced in just a few seconds and require little to no preprocessing of input peptides, as opposed to the minutes or hours of compute time required by alternative tools that often need intermediate manual operations. This allows for highly exploratory investigation of complex metaproteomes as supported by interactive data visualisations. Because all Unipept analysis pipelines start from a collection of tryptic peptides, they are independent of the search engine and the database used for upstream spectral identification. Raw metaproteome analysis results can be exported in a single output file that adds biodiversity and functional annotations to each individual peptide in the input dataset.
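To illustrate the two-way integration between biodiversity and function, the sketch below builds both lookups (functions per taxon and taxa per function) from per-peptide records such as those in the exported output file. The record layout, with hypothetical keys `peptide`, `taxon` and `ec_numbers`, is an assumption made for this example and does not describe the actual columns of the Unipept export; the toy records use placeholder values.

```python
from collections import defaultdict


def two_way_index(rows: list[dict]) -> tuple[dict, dict]:
    """Build taxon -> functions and function -> taxa lookups."""
    functions_by_taxon = defaultdict(set)
    taxa_by_function = defaultdict(set)
    for row in rows:
        taxon = row.get("taxon")
        if not taxon:
            continue  # peptide without a taxonomic assignment
        for ec in row.get("ec_numbers", []):
            functions_by_taxon[taxon].add(ec)
            taxa_by_function[ec].add(taxon)
    return dict(functions_by_taxon), dict(taxa_by_function)


# Toy per-peptide records with placeholder values.
rows = [
    {"peptide": "PEPTIDE_1", "taxon": "taxon_a", "ec_numbers": ["ec_1", "ec_2"]},
    {"peptide": "PEPTIDE_2", "taxon": "taxon_b", "ec_numbers": ["ec_1"]},
    {"peptide": "PEPTIDE_3", "taxon": None, "ec_numbers": ["ec_3"]},
]

by_taxon, by_function = two_way_index(rows)
print(sorted(by_taxon["taxon_a"]))    # what taxon_a is doing
print(sorted(by_function["ec_1"]))    # which taxa perform ec_1
```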

Each Unipept metaproteome analysis pipeline adopts a tailored aggregation strategy for combining annotations of all proteins that have an exact substring match with a given peptide. In theory, the LCA strategy used for biodiversity analysis is superior to the functional annotation strategy implemented in Unipept 4.0 because it results in a single outcome that represents a consensus of the underlying diversity. However, the current aggregation approach for functional annotation has proven to yield acceptable results in practice. Even for low threshold values, the adjustable threshold seems to filter most, if not all, spurious annotations, and the data multiplicity resulting from considering all protein matches ensures that the dominant functions emerge. However, correctness and completeness of primary annotations (in this case functional cross-references on UniProtKB proteins) remain a point of attention. Whereas most UniProtKB proteins are annotated with a reliable and specific organism of origin (often species level or lower), functional annotations are often missing altogether or rather generic. Due to the practice of homology-based inference of protein function annotations, incorrect annotations may also propagate unintentionally through knowledge bases.

Now that Unipept 4.0 has put in place the basic infrastructure for functional annotation, future extensions to the functional annotation pipelines are relatively straightforward. Given its complementarity to the current annotations and its aim to be a one-stop-shop for protein classification, an InterPro-based functional annotation pipeline seems a prime candidate for integration in the next Unipept release. Mapping results to KEGG pathways would also improve the richness of the analysis. We also plan to provide the new functional annotation pipelines through the Unipept API (26) and command line interface (27) to support large-scale data analysis and integration in external tools and pipelines such as the Galaxy-P framework (28).

FUNDING

Robbert Gurdeep Singh received funding from Elixir Belgium. Lennart Martens acknowledges funding from the Research Foundation Flanders (FWO) under grant number G042518N. Bart Mesuere received funding from the Research Foundation Flanders (FWO) under grant number 12I5217N.

SUPPORTING INFORMATION

The following supporting information is available free of charge at the ACS website (http://pubs.acs.org):

Figure S1. Screen capture of the results page of the Unipept Metaproteome Analysis of the case study dataset (36).

Table S1. Fraction of proteins and tryptic peptides in UniProtKB that have at least one GO or EC annotation.

Table S2. The percentage of sequences with at least one GO or EC number for various datasets.

References

(1) Starke, R.; Bastida, F.; Abadía, J.; García, C.; Nicolás, E.; Jehmlich, N. Ecological and Functional Adaptations to Water Management in a Semiarid Agroecosystem: A Soil Metaproteomics Approach. Sci. Rep. 2017, 7 (1), 10221.
(2) Wilmes, P.; Wexler, M.; Bond, P. L. Metaproteomics Provides Functional Insight into Activated Sludge Wastewater Treatment. PLoS One 2008, 3 (3), e1778.
(3) Soggiu, A.; Piras, C.; Mortera, S. L.; Alloggio, I.; Urbani, A.; Bonizzi, L.; Roncada, P. Unravelling the Effect of Clostridia Spores and Lysozyme on Microbiota Dynamics in Grana Padano Cheese: A Metaproteomics Approach. J. Proteomics 2016, 147, 21–27.
(4) Verberkmoes, N. C.; Russell, A. L.; Shah, M.; Godzik, A.; Rosenquist, M.; Halfvarson, J.; Lefsrud, M. G.; Apajalahti, J.; Tysk, C.; Hettich, R. L.; et al. Shotgun Metaproteomics of the Human Distal Gut Microbiota. ISME J. 2009, 3 (2), 179–189.
(5) Xiong, W.; Abraham, P. E.; Li, Z.; Pan, C.; Hettich, R. L. Microbial Metaproteomics for Characterizing the Range of Metabolic Functions and Activities of Human Gut Microbiota. Proteomics 2015, 15 (20), 3424–3438.
(6) Petriz, B. A.; Franco, O. L. Metaproteomics as a Complementary Approach to Gut Microbiota in Health and Disease. Front. Chem. 2017, 5, 4.
(7) Gavin, P. G.; Mullaney, J. A.; Loo, D.; Cao, K.-A. L.; Gottlieb, P. A.; Hill, M. M.; Zipris, D.; Hamilton-Williams, E. E. Intestinal Metaproteomics Reveals Host-Microbiota Interactions in Subjects at Risk for Type 1 Diabetes. Diabetes Care 2018, 41 (10), 2178–2186.
(8) Zhang, X.; Deeke, S. A.; Ning, Z.; Starr, A. E.; Butcher, J.; Li, J.; Mayne, J.; Cheng, K.; Liao, B.; Li, L.; et al. Metaproteomics Reveals Associations between Microbiome and Intestinal Extracellular Vesicle Proteins in Pediatric Inflammatory Bowel Disease. Nat. Commun. 2018, 9 (1), 2873.
(9) Debyser, G.; Mesuere, B.; Clement, L.; Van de Weygaert, J.; Van Hecke, P.; Duytschaever, G.; Aerts, M.; Dawyndt, P.; De Boeck, K.; Vandamme, P.; et al. Faecal Proteomics: A Tool to Investigate Dysbiosis and Inflammation in Patients with Cystic Fibrosis. J. Cyst. Fibros. 2016, 15 (2), 242–250.
(10) Chen, Z.; Li, J.; Gui, S.; Zhou, C.; Chen, J.; Yang, C.; Hu, Z.; Wang, H.; Zhong, X.; Zeng, L.; et al. Comparative Metaproteomics Analysis Shows Altered Fecal Microbiota Signatures in Patients with Major Depressive Disorder. Neuroreport 2018, 29 (5), 417–425.
(11) Blank, C.; Easterly, C.; Gruening, B.; Johnson, J.; Kolmeder, C. A.; Kumar, P.; May, D.; Mehta, S.; Mesuere, B.; Brown, Z.; et al. Disseminating Metaproteomic Informatics Capabilities and Knowledge Using the Galaxy-P Framework. Proteomes 2018, 6 (1), 7.
(12) Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res. 1997, 25 (17), 3389–3402.
(13) Mitchell, A.; Chang, H.-Y.; Daugherty, L.; Fraser, M.; Hunter, S.; Lopez, R.; McAnulla, C.; McMenamin, C.; Nuka, G.; Pesseat, S.; et al. The InterPro Protein Families Database: The Classification Resource after 15 Years. Nucleic Acids Res. 2015, 43 (Database issue), D213–D221.
(14) Kanehisa, M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28 (1), 27–30.
(15) Huson, D. H.; Weber, N. Microbial Community Analysis Using MEGAN. Methods Enzymol. 2013, 531, 465–485.
(16) Muth, T.; Kohrs, F.; Heyer, R.; Benndorf, D.; Rapp, E.; Reichl, U.; Martens, L.; Renard, B. Y. MPA Portable: A Stand-Alone Software Package for Analyzing Metaproteome Samples on the Go. Anal. Chem. 2018, 90 (1), 685–689.
(17) Riffle, M.; May, D. H.; Timmins-Schiffman, E.; Mikan, M. P.; Jaschob, D.; Noble, W. S.; Nunn, B. L. MetaGOmics: A Web-Based Tool for Peptide-Centric Functional and Taxonomic Analysis of Metaproteomics Data. Proteomes 2017, 6 (1), 2.
(18) Federhen, S. The NCBI Taxonomy Database. Nucleic Acids Res. 2011, 40 (D1), D136–D143.
(19) Burge, S.; Kelly, E.; Lonsdale, D.; Mutowo-Muellenet, P.; McAnulla, C.; Mitchell, A.; Sangrador-Vegas, A.; Yong, S.-Y.; Mulder, N.; Hunter, S. Manual GO Annotation of Predictive Protein Signatures: The InterPro Approach to GO Curation. Database 2012, (2012).
(20) Overbeek, R.; Olson, R.; Pusch, G. D.; Olsen, G. J.; Davis, J. J.; Disz, T.; Edwards, R. A.; Gerdes, S.; Parrello, B.; Shukla, M.; et al. The SEED and the Rapid Annotation of Microbial Genomes Using Subsystems Technology (RAST). Nucleic Acids Res. 2014, 42 (Database issue), D206–D214.
(21) Huerta-Cepas, J.; Szklarczyk, D.; Forslund, K.; Cook, H.; Heller, D.; Walter, M. C.; Rattei, T.; Mende, D. R.; Sunagawa, S.; Kuhn, M.; et al. eggNOG 4.5: A Hierarchical Orthology Framework with Improved Functional Annotations for Eukaryotic, Prokaryotic and Viral Sequences. Nucleic Acids Res. 2016, 44 (D1), D286–D293.
(22) Bateman, A.; Martin, M. J.; O’Donovan, C.; Magrane, M.; Alpi, E.; Antunes, R.; Bely, B.; Bingley, M.; Bonilla, C.; Britto, R.; et al. UniProt: The Universal Protein Knowledgebase. Nucleic Acids Res. 2017, 45 (D1), D158–D169.
(23) Mesuere, B.; Debyser, G.; Aerts, M.; Devreese, B.; Vandamme, P.; Dawyndt, P. The Unipept Metaproteomics Analysis Pipeline. Proteomics 2015, 15 (8), 1437–1442.
(24) Mesuere, B.; Devreese, B.; Debyser, G.; Aerts, M.; Vandamme, P.; Dawyndt, P. Unipept: Tryptic Peptide-Based Biodiversity Analysis of Metaproteome Samples. J. Proteome Res. 2012, 11 (12), 5773–5780.
(25) Tanca, A.; Palomba, A.; Deligios, M.; Cubeddu, T.; Fraumene, C.; Biosa, G.; Pagnozzi, D.; Addis, M. F.; Uzzau, S. Evaluating the Impact of Different Sequence Databases on Metaproteome Analysis: Insights from a Lab-Assembled Microbial Mixture. PLoS One 2013, 8 (12), e82981.
(26) Mesuere, B.; Willems, T.; Van der Jeugt, F.; Devreese, B.; Vandamme, P.; Dawyndt, P. Unipept Web Services for Metaproteomics Analysis. Bioinformatics 2016, 32 (11), 1746–1748.
(27) Mesuere, B.; Van der Jeugt, F.; Willems, T.; Naessens, T.; Devreese, B.; Martens, L.; Dawyndt, P. High-Throughput Metaproteomics Data Analysis with Unipept: A Tutorial. J. Proteomics 2018, 171, 11–22.
(28) Jagtap, P. D.; Blakely, A.; Murray, K.; Stewart, S.; Kooren, J.; Johnson, J. E.; Rhodus, N. L.; Rudney, J.; Griffin, T. J. Metaproteomic Analysis Using the Galaxy Framework. Proteomics 2015, 15 (20), 3553–3565.
(29) Ashburner, M.; Ball, C. A.; Blake, J. A.; Botstein, D.; Butler, H.; Cherry, J. M.; Davis, A. P.; Dolinski, K.; Dwight, S. S.; Eppig, J. T.; et al. Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium. Nat. Genet. 2000, 25 (1), 25–29.
(30) The Gene Ontology Consortium. Expansion of the Gene Ontology Knowledgebase and Resources. Nucleic Acids Res. 2017, 45 (D1), D331–D338.
(31) International Union of Biochemistry and Molecular Biology. Nomenclature Committee; Webb, E. C. Enzyme Nomenclature 1992: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes; Academic Press, 1992.
(32) Loewenstein, Y.; Raimondo, D.; Redfern, O. C.; Watson, J.; Frishman, D.; Linial, M.; Orengo, C.; Thornton, J.; Tramontano, A. Protein Function Annotation by Homology-Based Inference. Genome Biol. 2009, 10 (2), 207.
(33) Wheeler, D. L.; Church, D. M.; Edgar, R.; Federhen, S.; Helmberg, W.; Madden, T. L.; Pontius, J. U.; Schuler, G. D.; Schriml, L. M.; Sequeira, E.; et al. Database Resources of the National Center for Biotechnology Information: Update. Nucleic Acids Res. 2004, 32 (Database issue), D35–D40.
(34) Binns, D.; Dimmer, E.; Huntley, R.; Barrell, D.; O’Donovan, C.; Apweiler, R. QuickGO: A Web-Based Tool for Gene Ontology Searching. Bioinformatics 2009, 25 (22), 3045–3046.
(35) Vizcaíno, J. A.; Csordas, A.; del-Toro, N.; Dianes, J. A.; Griss, J.; Lavidas, I.; Mayer, G.; Perez-Riverol, Y.; Reisinger, F.; Ternent, T.; et al. 2016 Update of the PRIDE Database and Its Related Tools. Nucleic Acids Res. 2016, 44 (D1), D447–D456.
(36) Tanca, A.; Abbondio, M.; Palomba, A.; Fraumene, C.; Manghina, V.; Cucca, F.; Fiorillo, E.; Uzzau, S. Potential and Active Functions in the Gut Microbiota of a Healthy Human Cohort. Microbiome 2017, 5 (1), 207.
