Perspective and Guidelines for Metaproteomics in Microbiome


Perspective

Perspective and guidelines for metaproteomics in microbiome studies
Xu Zhang and Daniel Figeys
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.9b00054 • Publication Date (Web): 22 Apr 2019


Journal of Proteome Research

Perspective and guidelines for metaproteomics in microbiome studies

Xu Zhang1 and Daniel Figeys1*

1 Ottawa Institute of Systems Biology and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada

*Correspondence to DF, e-mail: [email protected]

Abstract

The microbiome is emerging as a prominent factor affecting human health, and its dysbiosis is associated with various diseases. Compositional profiling of the microbiome is increasingly being supplemented with functional characterization. Metaproteomics is intrinsically focused on functional changes and will therefore be an important tool in studies of the human microbiome. In the past decade, the development of new experimental and bioinformatic approaches for metaproteomics has enabled large-scale human metaproteomic studies. However, challenges still exist, and there remains a lack of standardization and guidelines for properly performing metaproteomic studies on the human microbiome. Herein, we provide a perspective on recent developments, the challenges faced, and the future directions of metaproteomics and its applications. In addition, we propose a set of guidelines/recommendations for performing metaproteomic experiments and reporting their results in the study of human microbiomes. We anticipate that these guidelines will be optimized further as more metaproteomic questions are raised and addressed and as more metaproteomic applications are published, so that they are eventually recognized and applied in the field.

Keywords: metaproteomics, microbiome, guidelines, standards, bioinformatics, post-translational modification, multi-omics, workflow

Introduction

The microbiome is the ensemble of microorganisms that form a community living in a specific niche. The development of culture-independent, next-generation sequencing (NGS) techniques has greatly facilitated studies of microbiomes, given that a high proportion of the microbial species in microbiomes are unknown and difficult to culture (1). Tens of thousands of papers on different microbiomes have been published over the past decade, primarily using DNA sequencing-based genomic techniques. A growing number of metagenome-wide association studies have correlated microbiome symbiosis with human health and microbial dysbiosis with diseases (2, 3). Although very informative, genomic approaches are generally limited in their ability to provide information on the functional activity of a microbiome under a given condition. RNA sequencing measures the actively transcribed genes, i.e., transcripts, in microbiome samples; however, the presence of transcripts does not necessarily mean that the genes are translated into proteins, and the profile of mRNA transcripts does not perfectly correlate with that of proteins (4, 5). In contrast, the ability to measure proteins and metabolites in the microbiome, using metaproteomics and metabolomics respectively, would provide more functional information on the microbiome.

The term “metaproteome” was first proposed in 2004 by Rodriguez-Valera in studies of environmental microbial communities (6). Then, Wilmes and Bond proposed the term “metaproteomics” for the study of the entire protein composition of microbiome samples (7). Shortly afterwards, Ram et al. performed a shotgun metaproteomics study to comprehensively survey the proteins in a complex environmental sample, namely an acid mine drainage (AMD) microbial biofilm, and identified >2000 proteins from the biofilm microbial community (8). Despite these early efforts, only recently has metaproteomics moved from methodological development and small-scale studies to an approach that can provide more extensive coverage of metaproteomes in larger studies (Figure 1). This rapid progress has been driven in part by the development of higher-resolution mass spectrometers (MS) and quantitative proteomics techniques. Importantly, the development of efficient bioinformatic tools adapted to the complexity of microbiomes has been key to the expansion of metaproteomics in recent years (9-14). As a result, increasing numbers of metaproteomic studies have been published covering various research fields, including human, environmental, food and plant microbiomes (8, 15-21). In particular, metaproteomic applications in human microbiome samples have been extended to large-scale studies, which enabled the identification of >50,000 protein groups and deep characterization of the expressed functions of the microbiomes (15, 22, 23).

Clearly, metaproteomics is entering a rapid expansion phase. Nevertheless, there are still many critical issues that need to be addressed in both the experimental workflow and the bioinformatic aspects of metaproteomics. Therefore, in this perspective, we discuss the challenges that metaproteomics is facing and the future directions of metaproteomic studies, including applications for studying microbiomes in the context of human health. Finally, we propose a set of guidelines for properly performing metaproteomic studies, which we hope can be a starting point for discussion in the field and eventually help standardize metaproteomic studies of human microbiomes.

Figure 1 Lay-of-the-land of metaproteomics and its applications in human microbiome studies (2003 - 2018).


Coverage of metaproteomics

Metaproteomics, in contrast to proteomics, not only has to properly identify and quantify the proteins present in a sample but also needs to link the proteins to the different microbes. Microbiomes, such as the human gut microbiome, are very complex. For example, the human fecal microbiome has been estimated to contain ~1000 prevalent bacterial species and ~9.9 million genes based on metagenomic sequencing (24-26); for each individual human gut microbiome, 500,000 genes and 200 prevalent bacterial species are estimated to be present (24, 25). The coverage of a metaproteomic experiment refers to the number of proteins as well as the number of microbes that can be identified/quantified from the microbiome gene and microbe catalog. Depending on the biological question being asked, the coverage of the metaproteome in terms of proteins and/or microbes needs to be optimized. Protein and peptide fractionation (22, 23) has improved the coverage of the microbiome, albeit at the cost of many hours of MS time per sample, which limits its application in large-scale studies. For the human gut microbiome, up to 20,000 unique protein groups can now be identified from one microbiome sample and a total of ~60,000 protein groups can be identified in one metaproteomic project (15, 22, 23).

To date, metagenomics predicts that the human gut microbiome contains ~9.9 million genes, with >500,000 genes predicted for each individual (24, 25). However, the proportion of predicted genes that are transcribed into mRNA is estimated at 39% (27), which would represent ~200,000 mRNAs per individual microbiome. Only a fraction of the transcribed mRNA would lead to protein products. Assuming one gene-one product, there might be 100,000-200,000 expressed proteins in each individual’s microbiome, and we can therefore estimate that current metaproteomics may cover 10-20% of the expressed proteins in an individual’s gut microbiome.

Typically, a few hundred microbial species are present in an individual’s gut microbiome (24). In one of our gut metaproteomic studies, with >220,000 peptides identified, we covered 748 species with at least one distinctive peptide; this is not far from the number of bacterial species estimated by metagenomic sequencing (~1000 prevalent gut bacterial species) (24-26). However, a high proportion (61%, corresponding to 509 species) of the species were identified with only one or two distinctive peptides, and fewer than 100 abundant species can constitute >90% of the total biomass (Figure 2) (15). This may result in lower accuracy of quantification for the majority of the low-abundance species. This issue is unlikely to be addressed solely by increasing the number of proteins/peptides identified by the mass spectrometers, because the results will remain biased toward the higher-abundance proteins and microbes (22). Therefore, we propose that it will be important to further develop techniques that increase the depth of coverage for low-abundance species. For example, new experimental approaches that deplete high-abundance proteins or enrich specific types of microbial cells or proteins will be important for detecting low-abundance microbial species. Activity-based probe enrichment has been applied in a metaproteomic study of the gut microbiome and was effective in identifying an additional set of microbial proteins and revealing protein changes that were not seen in unenriched samples (28). In addition, different intestinal segments of mammals harbor different compositions of microbiota (29, 30); therefore, the isolation of microbiomes from different regions of the gastrointestinal tract might provide another way to simplify the metaproteome by examining a specific set of microbes.
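The 10-20% coverage estimate given earlier in this section is simple arithmetic on the quoted figures, and it can help to see it spelled out. The following is a minimal sketch, not part of the original analysis; the variable names are illustrative, and the only inputs are the numbers stated in the text.

```python
# Back-of-envelope estimate of per-individual metaproteome coverage.
# Input figures are those quoted in the text; variable names are illustrative.

genes_per_individual = 500_000      # predicted genes per individual gut microbiome
transcribed_fraction = 0.39         # fraction of predicted genes detected as mRNA
identified_protein_groups = 20_000  # upper end of protein groups identified per sample

# Upper bound on expressed proteins if every transcribed gene yields one protein.
transcribed_genes = genes_per_individual * transcribed_fraction  # about 195,000

# The text brackets the number of expressed proteins between these two values.
expressed_low, expressed_high = 100_000, 200_000

coverage_high = identified_protein_groups / expressed_low   # best case
coverage_low = identified_protein_groups / expressed_high   # worst case

print(f"transcribed genes: {transcribed_genes:,.0f}")
print(f"estimated coverage: {coverage_low:.0%} to {coverage_high:.0%}")
```

The estimate is only as good as its assumptions: the one gene-one product simplification and the 39% transcription figure both come from the text, and changing either shifts the bracket proportionally.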


Figure 2 Species identification using metaproteomics. Accumulated biomass contribution of the top N abundant species is shown. The number of distinctive peptides for each species is shown on the right axis (orange dots in plot). Results were obtained using data from our previously published paper in Nature Communications (15).
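The two quantities summarized in Figure 2, the accumulated biomass contribution of the top N species and the number of species supported by very few distinctive peptides, can be computed as sketched below. This is an illustrative sketch only: the abundance values and peptide counts are hypothetical toy data, not the data from reference (15), and the function names are assumptions.

```python
from itertools import accumulate

def cumulative_biomass(abundances):
    """Return cumulative biomass fractions for species ranked by abundance."""
    ranked = sorted(abundances, reverse=True)
    total = sum(ranked)
    return [running / total for running in accumulate(ranked)]

def species_needed(abundances, threshold=0.9):
    """Smallest N such that the top N species reach the threshold fraction."""
    for n, fraction in enumerate(cumulative_biomass(abundances), start=1):
        if fraction >= threshold:
            return n
    return len(abundances)

# Hypothetical toy data: per-species summed intensities and the number of
# distinctive peptides identified for each species (same order).
toy_abundance = [500.0, 250.0, 120.0, 60.0, 30.0, 20.0, 10.0, 5.0, 3.0, 2.0]
toy_distinctive_peptides = [120, 64, 30, 11, 5, 2, 2, 1, 1, 1]

n90 = species_needed(toy_abundance, threshold=0.9)
sparse = sum(1 for count in toy_distinctive_peptides if count <= 2)

print(f"top {n90} of {len(toy_abundance)} species reach 90% of biomass")
print(f"{sparse} species have only one or two distinctive peptides")
```

Even in this toy example, a handful of species dominate the biomass while half of the species rest on one or two peptides, which is the pattern the text describes for real gut samples.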

Even though metaproteomics is currently limited to the top 10-20% of the gene products and can accurately quantify ~30% of the microbes, it provides useful functional information on the microbial community. In macroscopic ecology, the structure and functioning of a community such as a forest are controlled by several abundant species, called foundation species or core species; the other species are dependent on the foundation species under normal conditions (31). The examination of such foundation species can therefore be used to quickly evaluate the status/homeostasis of the ecosystem. In the human gut microbiome, Zhao et al. recently identified 15 high-fiber-promoted short-chain fatty acid (SCFA) producers out of the 180 species/strains identified using shotgun metagenomics, which form the foundation species group within the human gut for maintaining a healthy microbiota structure (32). These 15 SCFA producers, including species from Faecalibacterium, Bifidobacterium, Lactobacillus, Eubacterium, etc., are abundant in the gut and have been readily detected and quantified using current metaproteomic approaches. In addition, for some of the gut microbes with sufficient proteome coverage and high abundance (likely foundation species), such as Faecalibacterium prausnitzii (15, 22), strain-resolved information on their functions or metabolic pathways can also be obtained by metaproteomics.

Experimental workflow of metaproteomics

A typical metaproteomic experimental workflow consists of 1) protein extraction from the microbiome samples, 2) proteolytic digestion of the proteins, 3) peptide separation and analysis using HPLC-MS/MS, and 4) microbial protein identification and quantification by searching against a metagenome database (Figure 3). Finally, the protein/peptide information is used for downstream taxonomic and functional analysis (Figure 3; this part will be detailed in the next section). Given the high complexity of microbiome samples, each step in the workflow requires optimization for each study, depending on whether coverage of proteins, microbes, or both is needed. Nevertheless, consistency is needed in sample preparation workflows to facilitate comparison across different metaproteomic studies.


Figure 3 Generalized workflow for metaproteomic study of the human gut microbiome.

Differences in protein extraction methods result in differences in the recovery of proteins from different microbes and affect the coverage and depth of the metaproteomic study (33, 34). The choice of DNA extraction method is known to be an important cause of inter-lab variability in metagenomic sequencing (35). Although fewer metaproteomic studies have been reported, it is likely that the protein extraction step will also be an important contributor to inter-lab variability in metaproteomics. Protein extraction from microbiome samples is challenging because microbial species differ in their resistance to cell lysis, which is in part due to the different cell wall structures of Gram-positive and Gram-negative bacteria (34). A previous study has shown that the combination of chemical and mechanical methods, namely sodium dodecyl sulfate (SDS)-based lysis buffer and ultrasonication, was the most consistent and efficient approach to recover proteins from both Gram-positive and Gram-negative bacteria (34). In addition, the efficiency of protein digestion and microbial protein identification for microbiome samples can also be significantly influenced by the non-microbial components (e.g., food debris, inorganic salts, etc.) within the samples. In particular, fecal samples can contain various unknown or unexpected chemicals, which may affect the efficiency of enzymatic digestion during sample preparation. Therefore, enrichment of microbial cells using approaches such as differential centrifugation, and protein purification by precipitation, will benefit overall gut microbial protein identification, although these pre-processing steps may lead to the loss of some microbial species/proteins (36).

Metaproteomics is also challenged by the throughput of both sample preparation and MS measurement. Currently, in detergent-based sample preparation protocols, the detergent must be removed before proteolytic digestion, using approaches such as protein precipitation, which greatly decreases throughput. The filter-aided sample preparation (FASP) workflow has been commonly used in proteomics (37); however, FASP is time-consuming and its throughput is also low. Therefore, additional protein extraction methods using lysis buffers that are compatible with commonly used proteolytic enzymes (e.g., trypsin), such as RapiGest SF Surfactant (38), need to be tested and applied in metaproteomic studies in order to increase the throughput of sample preparation. The throughput of MS measurement is another bottleneck of metaproteomics. To date, most metaproteomic studies have been performed using a label-free approach, which allows only one sample to be analyzed at a time. More recently, Liu et al. reported the successful application of isobaric labeling-based quantitative proteomics, namely tandem mass tags (TMT) that enable 10-plex sample analysis, to microbiome samples (39). Up to 11-plex TMT reagents are now commercially available, which in principle can increase the throughput of MS measurement by about 10-fold. Other approaches for multiplexed quantitative proteomics, such as neutron encoding (NeuCode), which allows higher multiplexing (40), are also potential alternatives; however, their utility in metaproteomics needs further testing. Nevertheless, researchers have started to develop comprehensive workflows for metaproteomics in human microbiome studies (22, 33). Importantly, studies using such workflows have clearly demonstrated the ability of metaproteomics to examine the functional activity of the microbiome in the context of human health and disease (15, 16, 27, 41-43).
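The throughput gain from multiplexing mentioned above can be made concrete by counting the LC-MS/MS runs needed for a cohort. The sketch below is a simplified illustration under stated assumptions: the cohort size is hypothetical, fractionation and replicate injections are ignored, and reserving one channel per run for a pooled reference is an assumed design rather than something the text prescribes.

```python
import math

def ms_runs_needed(n_samples, plex=1, reference_channels=0):
    """Number of LC-MS/MS runs needed to measure n_samples.

    plex: number of channels multiplexed per run (1 for label-free).
    reference_channels: channels per run reserved for a pooled reference
    (an assumed design choice, used here only for illustration).
    """
    usable = plex - reference_channels
    if usable < 1:
        raise ValueError("at least one channel per run must carry a sample")
    return math.ceil(n_samples / usable)

n = 110  # hypothetical cohort size

label_free = ms_runs_needed(n)                                # one sample per run
tmt11 = ms_runs_needed(n, plex=11)                            # all channels carry samples
tmt11_ref = ms_runs_needed(n, plex=11, reference_channels=1)  # 10 samples + 1 reference

print(label_free, tmt11, tmt11_ref)
```

Under these assumptions, an 11-plex design with one reference channel per run carries 10 samples per run, which corresponds to the roughly 10-fold reduction in MS runs described in the text.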

Bioinformatics in metaproteomics

Bioinformatics is one of the most challenging aspects of metaproteomics. Different metaproteomic tools have recently been reviewed (44, 45). The bioinformatic challenges for metaproteomic analysis can be summarized as 1) handling of type II error rates during peptide identification, 2) the accuracy of gene prediction from metagenomic sequences, 3) peptide and protein taxonomic assignment, and 4) the quality of functional annotation of the reference databases.


The type II error rate during peptide identification is primarily driven by the much larger size of microbiome databases compared with conventional proteomic databases and by the approach used to calculate the false discovery rate (FDR). This issue has been greatly reduced using approaches such as iterative database search (46), database partitioning (14), the use of matched metagenome-derived gene catalog databases (12), and the application of a graph-centric approach (47). Moreover, dedicated bioinformatic software tools, such as MetaLab (9), MetaProteomeAnalyzer (11), Galaxy-P (13), ComPIL (48) and TCUP (49), enable easy and straightforward MS data processing. Although metaproteomics would still benefit from faster search approaches and better spectrum identification rates, these tools have greatly facilitated the application of metaproteomics in human microbiome studies.

In addition to peptide/protein identification, the downstream statistical, taxonomic and functional data analysis is even more challenging. This includes, but is not limited to, microbial taxonomic identification, quantification, functional annotation and interpretation, and taxon-specific functional or pathway analysis. For bottom-up metaproteomics, the identification of microbial species is usually performed using taxon-specific distinctive peptides (10). Generally, a high proportion of the identified unique peptides (see the Guidelines section below for details on the proposed definitions of “unique peptide” and “distinctive peptide” in metaproteomics) are shared by multiple species and therefore cannot be assigned to a specific species. For example, in human gut metaproteomic datasets, only ~20% of the identified peptides are species distinctive (9, 15). This means that 80% of the peptides remain at the phylum to genus level, while the remaining 20% are annotated at the species or even strain level, but with a bias toward the higher-abundance microbes. Therefore, new approaches are needed for experiments that require deeper metaproteomic analysis.

The capability of metaproteomics to reveal the functional activity of the microbiome depends on the functional annotation of proteins/genes within the reference databases and on bioinformatic tools that enable automated functional/pathway analysis. Unfortunately, metagenomic gene databases are usually poorly annotated. For the human gut microbiome, ~9.9 million genes are present in the gene catalog; however, 40-80% of these genes are not well annotated using currently available knowledge databases, including KEGG, eggNOG, MetaCyc, Pfam, UniProtKB, etc. (50). This makes the functional interpretation of meta-omics data, including metaproteomics and metagenomics, much more difficult and prone to errors. Therefore, more effort should be made to improve gene annotation and functional characterization for microbiomes, using approaches such as proteogenomics (51), single-cell genome sequencing (52, 53) and pure culture-based functional characterization (54). Moreover, the bioinformatic tools for functional analysis of metaproteomics data need further development. Currently, iMetaLab (55) and Unipept (version 4.0) (56) include some functional analysis modules, such as enrichment analysis. However, more tools that consider both protein expression levels and their differences between groups are still needed.

Metaproteomics provides a promising window for directly looking at taxon-specific, active functions or metabolic pathways, even for closely related species/strains of microorganisms (22, 42, 57). However, taxon-specific functional analysis in metaproteomics is still limited to species with high numbers of identified distinctive peptides/proteins or to the higher taxonomic levels. There is also a lack of bioinformatic tools that enable automatic and efficient taxon-specific functional analysis. Therefore, metaproteomics would benefit from novel taxon-function association analysis for low-abundance species, and from novel automatic software for efficiently profiling taxon-specific functions.
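The observation that most peptides can only be placed at the phylum to genus level can be illustrated with a lowest common ancestor (LCA) rule: a peptide is assigned to the deepest taxon shared by every organism whose proteins contain it. The sketch below is an assumption about how such an assignment could be implemented, not a description of any tool cited here; the peptide identifiers, the three-rank lineages and the function names are all hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxonomic prefix shared by all lineages.

    Each lineage is a tuple ordered from the highest rank to the lowest.
    """
    if not lineages:
        return ()
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)

# Hypothetical peptide-to-lineage matches (phylum, genus, species). Real
# lineages have more ranks and would come from a database lookup.
PEPTIDE_MATCHES = {
    "pep_1": [("Firmicutes", "Faecalibacterium", "F. prausnitzii")],
    "pep_2": [
        ("Firmicutes", "Faecalibacterium", "F. prausnitzii"),
        ("Firmicutes", "Eubacterium", "E. rectale"),
    ],
    "pep_3": [
        ("Firmicutes", "Eubacterium", "E. rectale"),
        ("Actinobacteria", "Bifidobacterium", "B. longum"),
    ],
}

RANKS = ("phylum", "genus", "species")

def assign_rank(peptide):
    """Name the rank at which a peptide is taxon-distinctive, or 'unassigned'."""
    lca = lowest_common_ancestor(PEPTIDE_MATCHES[peptide])
    return RANKS[len(lca) - 1] if lca else "unassigned"

for peptide in PEPTIDE_MATCHES:
    print(peptide, assign_rank(peptide))
```

In this toy example only the first peptide is species distinctive; the second resolves to the phylum level and the third cannot be placed at any of the three ranks, which mirrors why species-level quantification relies on a minority of the identified peptides.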

Post-translational modifications in microbiome

Post-translational modification (PTM) is an important strategy for regulating the activity of proteins and cellular function in both eukaryotes and prokaryotes. Global and deep characterizations of common protein PTMs, such as phosphorylation, acetylation, methylation and glycosylation, have been performed for human and animal models as well as several bacterial species, such as Escherichia coli and Salmonella enterica (58, 59). These studies were usually carried out using PTM-specific enrichment approaches followed by high-resolution MS analysis. In bacteria, it has been shown that PTMs play important roles in various physiological functions, including microbial virulence (60, 61), stress responses (59, 62), metabolism (58, 59), etc., which highlights the importance of studying PTMs in the microbiome. However, there is still no global characterization of PTMs in human microbiomes, nor deep profiling of any single PTM type in microbiomes.

Several studies have attempted to identify PTM events in microbiomes from unenriched samples using database searches with common PTMs as potential modifications. For example, Li et al. directly identified eight protein PTMs in AMD microbial communities using the Sipros software tool and demonstrated the prevalence of PTMs in microbiomes, with divergent patterns between closely related species (25). Zhang et al. also utilized a similar enrichment-independent approach to study the PTMs in a deep-sea microbiome, which found different patterns of PTMs in different microbial species and showed that PTMs were enriched in specific functions or taxa (63). However, enrichment-independent approaches suffer from high false discovery rates and lower depth of PTM identification, given that the modified proteins are usually low in abundance. Therefore, greater coverage of PTM events and deeper measurement of modified peptides are needed for the PTM study of microbiomes. For example, modification-specific enrichment approaches (64), such as high-specificity antibodies, metal-based affinity purification, and hydrophilic interaction chromatography, should be applied in order to reach deep and accurate characterization of PTMs in the microbiome.

Microbiomes can contain thousands of different microbes and hence potentially more types of protein modifications, possibly including novel ones. This may, in part, explain the low spectrum identification rate (5-40%) for metaproteomic data sets. As mentioned above, comprehensive PTM profiling studies on human microbiomes are still lacking, and it remains challenging to identify and quantify unknown modifications in microbiomes. Therefore, new database search strategies, such as open search algorithms (65, 66), are needed for metaproteomic studies to efficiently identify both known and unknown PTMs in microbiome samples. Proteogenomic approaches have been shown to greatly improve genome annotations, detect protein variants, and enable global discovery of PTMs in both mammalian and bacterial cells (51, 67, 68). The application of proteogenomic methods in microbiome studies, namely metaproteogenomics, which combines paired-sample, deep metagenomic/metaproteomic measurement with various database search strategies, such as unrestrictive PTM searches (67), would likely help global characterization of protein PTMs in microbiomes.
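The idea behind the open search strategies mentioned above can be illustrated by comparing an observed precursor mass with the theoretical mass of an unmodified candidate peptide and asking whether the difference matches a known modification. The following is a deliberately simplified sketch and not the algorithm of any cited tool: the function names, the example sequence and the absolute tolerance of 0.01 Da are assumptions (real searches typically use ppm tolerances and score fragment spectra), while the residue and modification masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

# Monoisotopic mass shifts (Da) of a few common modifications.
KNOWN_SHIFTS = {
    "methylation": 14.01565,
    "acetylation": 42.01057,
    "phosphorylation": 79.96633,
}

def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

def explain_delta(sequence, observed_mass, tolerance=0.01):
    """Label the mass difference between observation and unmodified peptide."""
    delta = observed_mass - peptide_mass(sequence)
    if abs(delta) <= tolerance:
        return "unmodified"
    for name, shift in KNOWN_SHIFTS.items():
        if abs(delta - shift) <= tolerance:
            return name
    return f"unexplained shift of {delta:+.4f} Da"

# Hypothetical example: the same peptide observed with three different masses.
base = peptide_mass("SAMPLER")
for observed in (base, base + 79.96633, base + 100.0):
    print(explain_delta("SAMPLER", observed))
```

The third case shows why open searches matter for microbiomes: a shift that matches no entry in the table is reported rather than discarded, which leaves room for modifications that are not yet catalogued.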

Integrating metaproteomics for multi-omics study

The integration of metagenomics and metaproteomics can not only improve the gene annotation/prediction of metagenomic sequencing but also aid in more functional and deeper mining of metaproteomics data. In fact, due to the high complexity of microbiome samples, integrative multi-omics approaches have been proposed as the most powerful strategy to provide a full picture of their compositions, functions and activities (69). Briefly, microbiome composition can be profiled using ribosomal RNA gene sequencing, such as 16S rRNA gene sequencing; the functional capability (genomic content) and species/strain-level composition of the microbiome can be examined using shotgun metagenomics; and the actively expressed genes can be studied using metatranscriptomics. Metaproteomics and metabolomics directly measure the enzymes (proteins) or end-products (metabolites) of various biological processes, such as biosynthesis or biodegradation, which are more relevant to function. Most current multi-omics studies have been performed by integrating genomics approaches, namely metagenomics and metatranscriptomics, given that they have similar types of data and methods for data analysis. Only recently have studies successfully integrated metaproteomics or metabolomics with genomics approaches (18, 27, 70-73), which provides additional information, such as changes in metabolites and host proteins in the microbiome samples. The latter provide an additional layer of information for understanding host-microbiome interactions.


To date, comprehensive metabolite profiling of the microbiome remains scarce. This is explained in part by poor gene product annotation for many microbes and by the unknown mechanisms and pathways involved in the production of many metabolites. In particular, in a microbial community, metabolites can be produced by different types of microbes or even through co-metabolism by multiple species (74), so more systematic approaches are required to reconstruct their metabolic pathways. Because proteins, and enzymes in particular, are the direct players in the metabolic pathways that produce metabolites, the integration of metaproteomics and metabolomics is a promising direction for a deeper understanding of microbiome functions and microbe-microbe interactions. In addition, integrating metaproteomics and metabolomics would enable the study of bidirectional drug-microbiome interactions, which have been reported for a high proportion of commonly used drugs (75-77). Briefly, the microbial proteins involved in drug metabolism or response can be examined using metaproteomics, and the chemical derivatives of drugs can be measured using metabolomics. This will also aid in identifying new drug targets in the microbiome. An example is microbial β-glucuronidase (GUS), which mediates the gastrointestinal toxicity of the anticancer drug irinotecan by converting SN-38 glucuronide into SN-38 (78). Recent studies have shown that selective inhibition of gut microbial GUS effectively protected the host from irinotecan-induced toxicity (79).

One of the challenges for integrative multi-omics studies is the lack of efficient approaches for multi-omics data integration. Most current multi-omics studies were performed using correlation analysis across different data sets. Software tools such as GOmixer (80), mixOmics (81), XCMS Online (82) and IMP (83) are currently available for integrative multi-omics analysis. However, more effort should be dedicated to developing new strategies and more user-friendly bioinformatic tools, in particular those that support metaproteomic and metabolomic integration.
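
The correlation analysis across data sets mentioned above can be illustrated with a minimal sketch. It assumes that protein and metabolite abundances have already been quantified in the same samples and in the same sample order, and it uses Spearman rank correlation as one reasonable choice of measure; the text does not name a specific one. All function names and the toy values are hypothetical and serve illustration only.

```python
from itertools import product
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cross_omics_correlations(proteins, metabolites):
    """Correlate every protein profile with every metabolite profile.

    Both arguments map a feature name to a list of abundances, one value
    per sample, with samples in the same order in both data sets.
    """
    return {
        (p_name, m_name): spearman(p_values, m_values)
        for (p_name, p_values), (m_name, m_values)
        in product(proteins.items(), metabolites.items())
    }


if __name__ == "__main__":
    # Toy abundances for five samples (made-up numbers, not real data).
    proteins = {"protein_A": [1.0, 2.0, 3.0, 4.0, 5.0]}
    metabolites = {
        "metabolite_X": [0.2, 0.4, 0.9, 1.1, 2.0],
        "metabolite_Y": [5.0, 4.0, 3.0, 2.0, 1.0],
    }
    for pair, rho in cross_omics_correlations(proteins, metabolites).items():
        print(pair, round(rho, 3))
```

A rank-based measure is used here because abundance data are often skewed. In a real study the resulting correlations would also need multiple-testing correction, which this sketch omits.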

Applications of metaproteomics in human microbiome study

Metaproteomics has been applied to various research areas, such as environmental microbiome studies and the study of microbiomes associated with food, plants, and humans (8, 15-21). While the application of metaproteomics in most of these areas is still limited and progressing slowly, its application in human microbiome studies has boomed in the past five years. One reason is that metaproteomics has unique advantages for studying host-microbiome interactions. Briefly, human and microbial proteins can be identified and quantified simultaneously using metaproteomics, and those human proteins are usually of great importance in mediating host-microbiome interactions. For example, over 800 human proteins can be present in the microbial pellets collected from human feces and may constitute >10% of the total biomass in the samples (22).

Recently, the value and utility of metaproteomics in human microbiome studies have been exemplified by several large-scale metaproteomic studies on clinical samples, which characterized the functions of the intestinal microbiome, studied host-microbiome interactions, and identified new potential biomarkers for human diseases (15, 16). Using shotgun metaproteomics with the MetaPro-IQ approach for database searching, Gavin et al. (16) reported the quantification of 470 human proteins and over 10,000 microbial proteins in fecal samples collected from 101 patients with new-onset type 1 diabetes (T1D) or at varying risk of developing T1D. They demonstrated that both T1D patients and individuals at high risk of developing T1D had impaired pancreatic protein secretion, and they identified T1D-associated metaproteomic signatures (including both human and microbial proteins) that might be used to monitor T1D disease progression.

In a longitudinal study, Maier et al. (23) studied the fecal microbiome using multi-omics approaches, including metaproteomics and metabolomics, in pre-diabetes patients before and after dietary interventions with resistant starch (RS). In addition to observations from genomic approaches that were mostly in agreement with previous studies, they identified >56,000 proteins in the fecal samples and found that carbohydrate metabolism, specifically butyrate synthesis affiliated with species in the Firmicutes phylum, was among the most significantly changed functions in RS-treated microbiomes. By combining these results with metabolomics and targeted butyrate measurements, they further demonstrated the mechanisms of RS-mediated alterations of the host-microbiome cross-talk on carbohydrate metabolism (RS fermentation) and lipid metabolism, which may contribute to the beneficial effects of RS.

In a recent study, we performed metaproteomic analysis on samples collected from patients with inflammatory bowel disease (IBD) and control subjects without IBD at the site of disease onset, namely the intestinal mucosal-luminal interface (MLI). We augmented the gut microbial gene catalog database, which enabled the identification of >50,000 proteins from various kingdoms of organisms, including bacteria, viruses, and fungi, as well as human proteins (15). Based on this, we characterized the microbiome compositions, functions and metabolic pathways that were altered in patients with IBD. The analysis of the host proteins associated with the microbiome identified host defense proteins, specifically the cargo proteins of extracellular vesicles, which were significantly increased in patients with IBD compared with control subjects without IBD. Altogether, these studies demonstrate the utility of metaproteomics in studying host-microbiome interactions, even given current limitations in the coverage of proteins and microbes.

Standardization and guidelines of metaproteomics

Along with the increasing applications of metaproteomics, the need to standardize both the experimental and the data analysis workflows is becoming apparent. Several workflows have been proposed in previous studies (22, 33, 84). However, as mentioned above, the experimental methods and bioinformatic approaches used in published metaproteomics papers currently lack consistency. This makes experimental results difficult to compare across studies and makes data integration challenging. Therefore, widely accepted guidelines and standards are needed for properly performing and reporting metaproteomic studies. Here we suggest a set of guidelines for metaproteomics, in particular for large-scale clinical metaproteomic studies. We hope that these proposed guidelines can provide a starting point for discussions, revisions and optimizations within the field.

(1) Sample preparation. Sample preparation is the first step and also one of the most important aspects of the metaproteomic workflow. As mentioned above, studies have clearly demonstrated that different sample preparation protocols lead to the identification of different proteins and microbes, which greatly hinders cross-study comparisons. Microbiome studies often require a large number of samples because of the high inter-individual variation in microbiota. In this case, both sample preparation and MS analysis need to be performed in different batches, and it is critically important to minimize any batch effects related to changes in technical or instrument performance. We propose the following guidelines for metaproteomics sample preparation:
(a) Samples need to be randomized prior to sample preparation. This is important to avoid batch effects;
(b) All studies should include technical replicates for a subset of samples in every batch of sample preparation. This would provide invaluable information on the consistency of sample preparation;
(c) Samples should be pre-processed to eliminate potential interference prior to protein extraction. For example, differential centrifugation can be performed and the harvested microbial cells washed with phosphate buffered saline (PBS) or normal saline. The pre-processing method should be described in the manuscript;
(d) Protein extraction should be performed using a strong detergent (i.e., SDS) and mechanical disruption (i.e., ultrasonication or bead beating), and SDS should be removed before proteolytic digestion. This combination has consistently provided the best coverage of metaproteomes;
(e) A universal reference sample consisting of a mixture of microbes or protein extracts should be developed. This could be included in every study to address inter-study or inter-lab variation in sample preparation. In practice, however, this could be difficult to implement, because shipment of microbiomes to labs located in different countries could be restricted.
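
Guidelines (a) and (b) can be sketched in code as follows. This is a minimal illustration under stated assumptions, not a prescribed procedure: the function name `plan_batches`, its parameters, and the sample identifiers are hypothetical, and simple (unstratified) randomization is assumed. A real study might instead stratify by clinical group so that each batch contains a balanced mix of conditions.

```python
import random


def plan_batches(sample_ids, batch_size, replicates_per_batch=1, seed=0):
    """Randomize samples into preparation batches with technical replicates.

    Samples are shuffled with a seeded generator (so the plan can be
    reproduced and reported), split into consecutive batches, and a random
    subset of each batch is marked for technical replication.
    """
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)

    batches = []
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]
        # The last batch may be smaller than the requested replicate count.
        n_replicates = min(replicates_per_batch, len(batch))
        batches.append({
            "samples": batch,
            "technical_replicates": rng.sample(batch, n_replicates),
        })
    return batches


if __name__ == "__main__":
    # Hypothetical sample identifiers for illustration only.
    sample_ids = [f"S{i:02d}" for i in range(1, 11)]
    for number, batch in enumerate(plan_batches(sample_ids, batch_size=4), 1):
        print(number, batch["samples"], batch["technical_replicates"])
```

Fixing the seed makes the randomization reproducible, so the same batch plan can be regenerated and described in the manuscript.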

(2) Recommendations for mass spectrometry measurement. Most current metaproteomic studies were performed using label-free quantification, which is known to be sensitive to experimental variation, including changes in MS performance. It is therefore also of great importance to minimize any batch effects related to changes in technical or instrument performance. We propose the following guidelines for metaproteomic MS analysis:
(a) Samples should be randomized prior to sample preparation and randomized again for MS analysis. This would avoid batch effects due to drift in technical and instrument performance;
(b) All studies should include proper quality control (QC) samples for MS analysis to ensure the quality, reproducibility and comparability of metaproteomic results across different batches. The QC sample can be project-specific, such as a mix of representative samples that is run repeatedly throughout the experiment;
(c) MS performance should be consistently monitored. The MS spectrum identification rate of a QC sample should be within a reasonable range (