

Article

A Metagenomic Taxonomy-guided Database Search Strategy for Improving Metaproteomic Analysis Jinqiu Xiao, Alessandro Tanca, Ben Jia, Runqing Yang, Bo Wang, Yu Zhang, and Jing Li J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00894 • Publication Date (Web): 13 Feb 2018 Downloaded from http://pubs.acs.org on February 20, 2018


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

A Metagenomic Taxonomy-guided Database Search Strategy for Improving Metaproteomic Analysis

Jinqiu Xiao1, Alessandro Tanca2, Ben Jia1, Runqing Yang3, Bo Wang1, Yu Zhang4, Jing Li1*

1 Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, People's Republic of China

2 Porto Conte Ricerche, Science and Technology Park of Sardinia, Tramariglio, Alghero, Italy

3 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, People's Republic of China

4 Institute of Oceanography, Shanghai Jiao Tong University, Shanghai 200240, People's Republic of China

KEYWORDS: metaproteomics, mass spectrometry, metagenomics, taxonomy, microbial communities

ABSTRACT

Metaproteomics provides a direct measure of the functional information by investigating all proteins expressed by a microbiota. However, due to the complexity and heterogeneity of microbial communities, it is very hard to construct a sequence database suitable for a metaproteomic study. Using a public database, researchers might not be able to identify proteins from poorly characterized microbial species, while a sequencing-based metagenomic database may not provide adequate coverage for all potentially expressed protein sequences. To address this challenge, we propose a Metagenomic Taxonomy-guided database search strategy (MT), in which a merged database is employed, consisting both of taxonomy-guided reference protein sequences from public databases and of proteins from metagenome assembly. By applying our MT strategy to a mock microbial mixture, about two times as many peptides were detected as with the metagenomic database only. According to the evaluation of the reliability of taxonomic attribution, the rate of misassignments was comparable to that obtained using an a priori matched database. We also evaluated the MT strategy with a human gut microbial sample, and we found 1.7 times as many peptides as using a standard metagenomic database. In conclusion, our MT strategy allows the construction of databases able to provide high sensitivity and precision in peptide identification in metaproteomic studies, enabling the detection of proteins from poorly characterized species within the microbiota.

INTRODUCTION

With the rapid development of metagenomics,1 which allows the holistic study of genetic material from microbial communities in a culture-independent way, researchers can investigate the taxonomic and functional profiles of complex microbial communities at high resolution.2-6 However, with metagenomic techniques alone researchers can only describe the functional potential encoded in the genetic assortment of a microbiota, with no information concerning the biological processes that are actually active.7 Metaproteomics,8 which investigates the proteome expressed by a microbial community, provides a direct measurement of active functions and their change in expression in response to various stimuli.9-12




The construction of a suitable protein database is a critical step for protein identification in shotgun proteomics. An ideal database should contain all proteins potentially expressed in the sample, whereas too many spurious sequences will interfere with the accuracy of protein identification.13 Due to the complexity and heterogeneity of microbial communities, using improper protein databases for metaproteomic studies leads to significantly lower peptide/peptide-spectrum match (PSM) detection rates, especially when the sample contains uncultured and unknown species.

Public sequence databases are often used for protein identification in proteomic studies because of their comprehensiveness, high quality and free access. However, searching directly against a protein database comprising sequences from a broad range of microorganisms is challenging for metaproteomic analysis because of the huge search space, which in turn increases the risk of false negatives and constrains the number of identified peptides when a stringent false discovery rate (FDR) threshold is used.14,15 Several approaches have been reported to solve this problem by searching against a reduced database. One is to retain specific sequences according to taxonomic classification, typically based on 16S rDNA sequencing.16,17 This method, however, is hampered by the fact that the reference genomes corresponding to the identified taxa might not be publicly available. Besides, extracting taxa by applying a fixed abundance threshold, which is useful to reduce misassignments, may not take into account the complexity and heterogeneity of microbial communities.

Another approach is the 'two-step' strategy proposed by Jagtap et al.18 In this method, a large, comprehensive database is first searched without FDR control; then, a second target-decoy search is performed with a stringent FDR threshold against a subset database consisting of all proteins identified in the first search.19 However, this iterative strategy has to be applied with very stringent FDR filtering, as the number of false-positive identifications tends to increase dramatically.20 With the development of close-to-complete human and mouse gut microbial gene catalog databases,21,22 Zhang et al. proposed the MetaPro-IQ strategy, which uses public human/mouse gut microbial gene catalogs as the initial database in an iterative database search.23 Although efficient for human and rodent gut metaproteome analysis, this strategy cannot be applied directly to less commonly studied microbial communities, such as microbiota from deep-sea and other extreme environments.

As a large fraction of the microorganisms in microbial community samples have no reference genomes available in public resources, and the composition of microbial communities may vary dramatically, proteins from poorly characterized species are likely to remain unidentified when searching against a public database. Constructing a matched metagenomic database (based on the metagenome sequencing of the same sample) makes it possible to explore the protein expression of poorly characterized species.24 However, it may not provide adequate coverage for many protein sequences expressed by the microbial community, due to the inherent limitations of DNA sequencing in terms of assembly and annotation quality, which in turn decreases sensitivity.25

In this study we present a Metagenomic Taxonomy-guided database search strategy (MT) for metaproteomic analysis, which combines the advantages of public reference and metagenomic databases. Specifically, metagenome data are exploited in two ways: first, the genetic potential of the microbial community revealed by metagenomic sequencing is used as a matched sequence database; second, the taxonomic profile is used as a filter to reduce the number of public protein sequences by retaining only taxonomy-specific entries. Using the data of a lab-assembled microbial mixture with a priori knowledge from the study of Tanca et al.,26 we show that our MT strategy can significantly improve sensitivity and coverage in peptide/PSM identification while keeping the quality of PSMs well controlled. We also applied our method to a human gut microbial sample and obtained about 1.7 times as many peptide identifications as with a simple metagenomic database.

MATERIALS AND METHODS

Metaproteomic and Metagenomic Datasets

The mass-spectrometry-based proteomic data, obtained from a mixture of nine microbial strains (9MM, namely, Brevibacillus laterosporus, Escherichia coli, Enterococcus faecalis, Lactobacillus acidophilus, Lactobacillus casei, Pasteurella multocida, Pediococcus pentosaceus, Rhodotorula glutinis, Saccharomyces cerevisiae), were retrieved from a study by Tanca et al.26 The raw data were downloaded from the PeptideAtlas repository at http://www.peptideatlas.org/PASS/PASS00194.26 The metagenomic reads of this nine-strain mixture were downloaded from the NCBI BioSample repository at http://www.ncbi.nlm.nih.gov/biosample, with the accession numbers 2352454 and 2352511. All datasets mentioned above were used in the performance evaluation of our MT strategy.

The metagenomic sequencing data and the mass-spectrometry-based proteomic data of a healthy Sardinian volunteer’s gut microbial sample (human_0) were reported by Tanca et al.24 The contig-based metagenomic database and the proteomic data can be retrieved from the PRIDE repository (dataset identifier PXD004039).



Protein Database Construction

Four protein databases reported in the study by Tanca et al.26 were used here for comparison. The first database (SGA-PA) is an assembly of experimentally obtained individual sequencing data of the nine strains (named ‘single genome assembly’, SGA). The second database (Meta-PA) was obtained by metagenomic sequencing of 9MM. Both the metagenome and the single genomes had been assembled de novo using Velvet27 and subjected to coding sequence prediction and annotation (PA) in that study; the two resulting databases were retrieved directly from http://www.peptideatlas.org/PASS/PASS00194.

The remaining two protein databases (8G and 9S) were assembled from protein sequences downloaded from the UniProt website (release 2017_04) using an in-house script. They correspond to the 8 genera (namely, Brevibacillus, Escherichia, Enterococcus, Lactobacillus, Pasteurella, Pediococcus, Rhodotorula, Saccharomyces) or the 9 species (namely, Brevibacillus laterosporus, Escherichia coli, Enterococcus faecalis, Lactobacillus acidophilus, Lactobacillus casei, Pasteurella multocida, Pediococcus pentosaceus, Rhodotorula glutinis, Saccharomyces cerevisiae) to which the strains in the 9MM sample belong. Identical sequences within the same taxon were clustered using UniRef100, while redundant sequences among taxa were removed.28,29 In addition, a large database containing all bacterial and fungal sequences in UniProt (named BF, release 2017_04) was included in this study for comparison.

In our MT strategy, we combined proteins from a public database, selected according to taxonomic classification, with the sequences assembled from the metagenomic sequencing of the sample. First, to obtain the taxonomic information, the metagenomic reads from the 9MM sample were classified against the NCBI nr database (release 2017_02, including Bacteria, Archaea, Viruses, Fungi and microbial eukaryotes) with Kaiju (1.5.0).30 Low-complexity regions were filtered out with the -x option, and the other parameters were left at their defaults. Summary report files at the genus and species level were generated with Kaiju’s kaijuReport program. The most abundant taxa, accounting in total for 95% of classified reads at a given taxonomic rank, were included in the database (Figure S1).

Then, a large pseudo-metagenomic database was constructed by retrieving protein sequences from UniProt, parsed at the genus or species level according to the taxonomic classification. Specifically, the protein sequences of 10 genera (Table S1, including all genera of the 9MM except Saccharomyces) were included at the genus level, while the protein sequences of 41 species (Table S2, including all species of the 9MM except Rhodotorula glutinis) were included at the species level. Saccharomyces was not detected in the taxonomic classification because too few reads in the 9MM sample matched this genus, and Rhodotorula glutinis was not included because it lacks genomic characterization and has too few sequences deposited in public databases.

Finally, a subset of the pseudo-metagenomic database, based on the output of an initial search without FDR filtering, was combined with the metagenomic assembly sequences; for identical sequences (identity = 100%), only one copy was kept, using an in-house script. We named this combined database MTG (parsed at the genus level) or MTS (parsed at the species level). To improve sensitivity, the two-step strategy was also applied to the other large protein databases (8G, 9S, BF).18

The contig-based metagenomic database (the sequencing depth was 6 Mbps) from Tanca et al.24 used for the human_0 sample analysis was directly retrieved from the PRIDE repository (dataset identifier PXD004039). The MTS_h0 database was constructed according to the MTS strategy described above.
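The 95% selection rule described above can be illustrated with a short sketch. It assumes that read counts per taxon have already been extracted from the classification summary into a dictionary; the function name `select_top_taxa`, the handling of the cut-off boundary, and the example counts are hypothetical and are not taken from the in-house scripts used in the study.

```python
from typing import Dict, List


def select_top_taxa(read_counts: Dict[str, int], coverage: float = 0.95) -> List[str]:
    """Return the most abundant taxa whose cumulative share of classified
    reads first reaches `coverage` (an assumed reading of the 95% rule)."""
    total = sum(read_counts.values())
    if total == 0:
        return []
    selected: List[str] = []
    cumulative = 0
    # Walk taxa from most to least abundant; ties are broken by name
    # so that the result is deterministic.
    for taxon, count in sorted(read_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if cumulative >= coverage * total:
            break
        selected.append(taxon)
        cumulative += count
    return selected


# Hypothetical genus-level read counts (illustrative values only).
counts = {
    "Escherichia": 500,
    "Lactobacillus": 300,
    "Enterococcus": 160,
    "Pasteurella": 30,
    "Saccharomyces": 10,
}
print(select_top_taxa(counts))
```

With these toy counts the three most abundant genera already cover 96% of the classified reads, so the low-abundance genera are dropped. A low-abundance organism falls below such a cut-off in the same way, which is consistent with Saccharomyces not being retained at the genus level in the 9MM sample.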



To sum up, seven protein databases were assessed for peptide identification of 9MM. The detailed features of these databases are described in Table 1.

Table 1. Overview of the Used Protein Databases

| Database | Description | UniProt release | Number of sequences for the final search |
|---|---|---|---|
| SGA-PA | Single predicted and annotated genomes assembly | - | 52,485 |
| Meta-PA | Predicted and annotated metagenome | - | 24,667 |
| 8G | 8 genera from 9MM | 2017_04 | 33,261 |
| 9S | 9 species from 9MM | 2017_04 | 13,466 |
| BF | UniProt-Bacteria and UniProt-Fungi | 2017_04 | 205,595 |
| MTG | Metagenomic assembly and identified 10 genera from 9MM | 2017_04 | 62,035 |
| MTS | Metagenomic assembly and identified 41 species from 9MM | 2017_04 | 51,673 |

Metaproteomic Bioinformatics

Two Thermo raw files from the 9MM sample were converted into MGF format using MSConvert from the ProteoWizard suite (v. 3.0.9974).31 For each large database, the first search was conducted with X! Tandem32 (version 2015.04.01; enzyme: trypsin; maximum missed cleavage sites: 2; precursor mass tolerance: 10 ppm; fragment mass tolerance: 0.2 Da) without any FDR filtering. All proteins identified in the first search were used as a smaller database in the second search. The raw files from the human gut microbial sample were analyzed in the same way, except that the fragment mass tolerance was set to 0.02 Da, in accordance with the reference study.24
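The reduction step between the first and second search can be sketched as follows. This is a minimal illustration and not the in-house script used in the study: it assumes the first search yields a set of protein accessions and that each accession equals the first whitespace-delimited token of the FASTA header. The helper names `parse_fasta` and `subset_database` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_database(lines: Iterable[str], identified: Set[str]) -> Dict[str, str]:
    """Keep only the entries whose accession was identified in the first search."""
    kept: Dict[str, str] = {}
    for header, sequence in parse_fasta(lines):
        accession = header.split()[0]  # assumed accession convention
        if accession in identified:
            kept[accession] = sequence
    return kept


# Toy database and toy first-search result (illustrative only).
fasta_lines = [">P1 protein one", "MKT", "AAL", ">P2 protein two", "MSS", ">P3", "MLL"]
reduced = subset_database(fasta_lines, {"P1", "P3"})
print(reduced)
```

The reduced dictionary can then be written back to FASTA and used as the database for the final target-decoy search.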




The final target-decoy search for each database was performed with MaxQuant33 (version 1.5.5.1). Carbamidomethylation of cysteine was set as a fixed modification and oxidation of methionine as a variable modification. The FDR threshold was set at 0.01. The SGA-PA and Meta-PA databases were searched with MaxQuant using the same parameters as the second search of the two-step strategy. Taxonomic annotation of the identified peptides was performed with Unipept’s Metaproteomics Analysis module (https://unipept.ugent.be/datasets).34
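For readers unfamiliar with target-decoy filtering, the idea behind an FDR threshold of 0.01 can be sketched in a few lines. This is a generic textbook-style illustration and does not reproduce MaxQuant's own procedure; the function name `accept_at_fdr`, the estimate FDR = decoys / targets, and the toy scores are assumptions made for the example.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (score, is_decoy); a higher score means a better match


def accept_at_fdr(psms: List[PSM], threshold: float = 0.01) -> List[PSM]:
    """Return the target PSMs above the lowest score cut-off at which the
    estimated FDR (decoy hits / target hits) is still within `threshold`."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff_len = 0  # length of the longest ranked prefix that passes
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= threshold:
            cutoff_len = i
    return [p for p in ranked[:cutoff_len] if not p[1]]


# Toy PSM list: 150 strong targets, one decoy, 40 weaker targets,
# then three decoys and one low-scoring target.
toy = (
    [(1000.0 - i, False) for i in range(150)]
    + [(500.0, True)]
    + [(400.0 - i, False) for i in range(40)]
    + [(100.0, True), (99.0, True), (98.0, True)]
    + [(50.0, False)]
)
accepted = accept_at_fdr(toy)
print(len(accepted), min(score for score, _ in accepted))
```

In this toy list the first 190 targets are accepted, because including any lower-scoring hit would push the decoy-to-target ratio above 0.01.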

RESULTS

Implementation of the MT strategy

Here we propose a Metagenomic Taxonomy-guided database search strategy (MT) for metaproteomic analysis. The workflow of the MT strategy is illustrated in Figure 1. On the one hand, the metagenomic reads were classified with Kaiju at the genus or species level, and a pseudo-metagenomic database was then constructed by drawing the corresponding sequences from public databases for a first search without FDR filtering (Figure 1, orange part); on the other hand, the sample-specific metagenomic reads were assembled and used to construct a contig-based metagenomic database (Figure 1, blue part). Next, the metagenomic database and all proteins identified in the first search were merged into a combined database for the final search, performed with an FDR threshold of 0.01 or less. To control the risk of false positives and the bias in FDR estimation introduced by a large combined database, a PSM quality assessment was conducted by comparing the distributions of PSM scores between the metagenomic database and the subset of the pseudo-metagenomic database, so as to evaluate whether a stricter FDR estimation is necessary (Figure 1, green part). If either distribution shows an apparent shift toward lower scores (higher risk of false positives), a refined FDR control or transferred subgroup FDR estimation is recommended as a more rigorous alternative.35,36

Figure 1. Overview of the MT strategy.
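The merging step of the workflow, in which identical sequences are kept only once, can be sketched as below. The study used an in-house script whose details are not given, so this is only an assumed implementation: databases are represented as accession-to-sequence dictionaries with unique accessions, and when a sequence occurs in both sources the metagenomic entry is the copy that is kept. The name `merge_databases` is hypothetical.

```python
from typing import Dict, Set


def merge_databases(metagenomic: Dict[str, str], public_subset: Dict[str, str]) -> Dict[str, str]:
    """Combine two protein databases, keeping one entry per identical sequence.

    Assumption: metagenomic entries take precedence when a sequence
    is present in both sources.
    """
    merged: Dict[str, str] = {}
    seen: Set[str] = set()
    for source in (metagenomic, public_subset):
        for accession, sequence in source.items():
            if sequence in seen:
                continue  # 100% identical sequence already stored
            seen.add(sequence)
            merged[accession] = sequence
    return merged


# Toy inputs (illustrative accessions and sequences).
meta_db = {"contig1_orf1": "MKTAAL", "contig2_orf1": "MSSQ"}
public_db = {"P1": "MKTAAL", "P2": "MLLV"}
combined = merge_databases(meta_db, public_db)
print(list(combined))
```

Here the public entry `P1` is dropped because its sequence is identical to `contig1_orf1`, while `P2` adds a sequence that the assembly did not cover.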

The MTS Strategy Increases Peptide/PSM Identification Yields

To assess the sensitivity in peptide/PSM identifications of the MT strategy at the genus (MTG) and species (MTS) level, we compared it with the five other databases listed in Table 1. An overview of peptide and PSM identifications against the 7 databases is given in Figure 2A. The MTS database yielded the highest number of peptide/PSM identifications, while the Meta-PA database led to the lowest. It should be noted that the performance of MTS was equal to or better than that of the a priori knowledge-based databases 8G, 9S and SGA-PA in terms of the sensitivity of peptide/PSM identifications.

Next, we checked the posterior error probability (PEP) score distribution of the PSMs identified by searching against the MTS, Meta-PA or BF databases. The distribution of PEP scores of PSMs identified using the MTS strategy showed a slight shift to the right, corresponding to a higher confidence value (Figure 2B). Because FDR is difficult to control with large databases, which raises the risk of false positives, we also carefully examined the quality of the PSMs uniquely identified by the public or the metagenomic part of the MTS database. No obvious difference was observed in the distributions of their PEP scores (Figure 2C). A chi-square test on the 10% of PSM identifications with the lowest -log10(PEP) likewise found no significant difference between the distributions (p-value = 0.32). Therefore, no stricter FDR estimation was judged to be necessary in this case. Conversely, a refined or transferred subgroup FDR estimation, according to the methods proposed by Li et al. and Fu et al., should be applied to reduce the risk of false positives when significant shifts in PSM quality distribution are found.35,36

Next, when considering the peptide identification yields obtained using taxonomy-based databases (Figure 2D), we found that 54.0% of peptide identifications were common to all four databases. The 8G and 9S databases shared 78.9% of the identified peptides, while the overlap between MTG and MTS was 73.8%. Concerning the databases constructed at the species level, 92.6% of the peptides identified with 9S were also detected with MTS. Searching against the 8G database yielded 12% more peptide identifications than the 9S database. A possible explanation is the lack of an adequate number of sequences belonging to Rhodotorula glutinis in the UniProt database. Although Rhodotorula glutinis was fully sequenced later and re-classified as Rhodotorula mucilaginosa,37 only 34 protein sequences of Rhodotorula mucilaginosa could be retrieved from the UniProt database at the time of the analysis. When comparing the two databases constructed with our MT strategy, MTS performed far better than MTG, enabling 2326 additional peptide identifications. Here Saccharomyces failed to be detected at the genus level due to its low genomic abundance. Thus, we decided to use the MTS strategy in the following experiment.

When considering the peptide identification yield obtained using the sequencing-based databases (Figure 2E), 46.0% of the identified peptides were common to all databases. 94.9% of the Meta-PA identifications were shared with MTS, while for SGA-PA the shared fraction was 92.7%. More specifically, MTS identified 3665 peptides that were missed when using Meta-PA. These sequences were also identified using SGA-PA, a database based on a priori knowledge of the composition of the microbial mixture. In summary, the MTS strategy outperformed searching against a huge public database or a de novo metagenome sequencing database.
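Overlap figures such as those reported above can be computed directly from sets of identified peptide sequences. The sketch below is illustrative only: the function names are hypothetical, and the choice of denominator (peptides of one database for the directional overlap, the union of all databases for the "common to all" share) is an assumption, since the text does not state how each percentage was normalised.

```python
from typing import Set


def percent_shared(a: Set[str], b: Set[str]) -> float:
    """Percentage of the peptides in `a` that were also identified in `b`."""
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)


def percent_common_to_all(*peptide_sets: Set[str]) -> float:
    """Percentage of all identified peptides (union) found with every database."""
    if not peptide_sets:
        return 0.0
    union = set().union(*peptide_sets)
    if not union:
        return 0.0
    common = set.intersection(*peptide_sets)
    return 100.0 * len(common) / len(union)


# Toy peptide sets (illustrative sequences only).
db_a = {"AAK", "LLR", "GGK", "TTR"}
db_b = {"LLR", "GGK", "TTR", "VVK"}
print(percent_shared(db_a, db_b))         # 75.0
print(percent_common_to_all(db_a, db_b))  # 60.0
```

Note that `percent_shared` is directional: the share of 9S peptides found with MTS is generally not the same number as the share of MTS peptides found with 9S.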



Figure 2. Peptide/PSM identifications among 7 databases. (A) The number of unique peptides and PSMs identified by each approach (FDR≤0.01). (B) PEP score distribution of PSMs identified with the MTS, Meta-PA or BF database. (C) PEP score distribution of PSMs identified only with the public or the metagenomic part of the MTS database. The dotted red line marks the score cut-off for the 10% of PSM identifications with the lowest -log10(PEP). (D) Venn diagram illustrating the overlap of peptide identifications among the taxonomy-based databases (8G, 9S, MTG, MTS). (E) Venn diagram illustrating the overlap of peptide identifications among the sequencing-based databases (SGA-PA, Meta-PA, and MTS).
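The PSM quality check behind panel C can be illustrated with a small, dependency-free sketch. The exact test design is not described in the text, so the set-up here is an assumption: PSMs from the two database parts are pooled, the cut-off is taken at the lowest 10% of -log10(PEP) values, and a Pearson chi-square test (1 degree of freedom, no continuity correction) is run on the resulting 2x2 table of source versus "below or above the cut-off". The function names and the toy scores are hypothetical.

```python
import math
from typing import List, Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def compare_low_tail(public: List[float], metagenomic: List[float],
                     percent: int = 10) -> Tuple[float, float]:
    """Test whether one source contributes more PSMs to the low-score tail."""
    if not public or not metagenomic:
        raise ValueError("both score lists must be non-empty")
    pooled = sorted(public + metagenomic)
    k = max(1, len(pooled) * percent // 100)
    cutoff = pooled[k - 1]  # score at the boundary of the lowest `percent`
    pub_low = sum(1 for v in public if v <= cutoff)
    meta_low = sum(1 for v in metagenomic if v <= cutoff)
    return chi_square_2x2(pub_low, len(public) - pub_low,
                          meta_low, len(metagenomic) - meta_low)


# Toy -log10(PEP) values with near-identical distributions.
public_scores = [float(i) for i in range(1, 101)]
metagenomic_scores = [i + 0.5 for i in range(1, 101)]
stat, p_value = compare_low_tail(public_scores, metagenomic_scores)
print(stat, p_value)
```

A large p-value, as in this toy case, gives no evidence that either part of the database contributes disproportionately to the low-confidence tail; a small one would be the cue to consider the stricter subgroup FDR methods cited in the text.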

Taxonomic Distribution of Identified Peptides

Having established that the MTS strategy led to higher sensitivity in peptide and PSM identification than the other databases without loss of PSM quality, we next tested whether this came at the expense of precision. To test the coverage and precision of each database search, we evaluated the reliability of the taxonomic distribution of the identified peptides against the known composition of the 9MM. The lowest common ancestor (LCA) strategy applied by Unipept’s Metaproteomics Analysis module was used to infer the taxonomic classification of the identified peptides.34 The taxonomic distribution of the peptides identified specifically by each database is illustrated at the family, genus and species level in Figure 3. Among these databases, the sequencing-based databases (Meta-PA, SGA-PA) had the lowest rate of misassignments, while BF, which included all UniProt sequences from bacteria and fungi, gave a significantly higher rate of misassignments than any other database at all taxonomic levels. Although the MTS database was constructed without a priori knowledge of the 9MM taxonomic composition, its rate of misassignments was significantly lower than that of BF and only slightly higher than that obtained with the sequencing-based databases.




Across databases, the species Enterococcus faecalis and Escherichia coli, as well as their corresponding higher-rank taxa, were underrepresented. Rhodotorula glutinis was undetected with all databases since it lacks a previous genomic characterization. Its genus, Rhodotorula, was detected with all databases but was underrepresented with Meta-PA and 9S. Searching against Meta-PA failed to detect peptides from Saccharomyces cerevisiae, and even from its higher-rank taxa, due to low DNA abundance. By combining the metagenomic assembly with the taxonomy-guided sequences from a public database, our MTS method showed higher coverage and sensitivity in peptide/PSM identification than the other databases constructed without a priori knowledge, while keeping the number of misassignments low.
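The lowest common ancestor idea used for the taxonomic annotation above can be illustrated with a minimal sketch. It is not Unipept's implementation, which works on the full NCBI taxonomy and applies additional rules; here each organism in which a peptide occurs is represented by a root-to-leaf lineage list, and the function name `lowest_common_ancestor` is hypothetical.

```python
from typing import List, Optional


def lowest_common_ancestor(lineages: List[List[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all lineages (root-to-leaf lists),
    or None if the lineages share nothing."""
    if not lineages:
        return None
    lca = None
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        lca = taxa_at_rank[0]
    return lca


acidophilus = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
               "Lactobacillaceae", "Lactobacillus", "Lactobacillus acidophilus"]
casei = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
         "Lactobacillaceae", "Lactobacillus", "Lactobacillus casei"]
faecalis = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
            "Enterococcaceae", "Enterococcus", "Enterococcus faecalis"]

print(lowest_common_ancestor([acidophilus, casei]))            # Lactobacillus
print(lowest_common_ancestor([acidophilus, casei, faecalis]))  # Lactobacillales
```

A peptide shared by two Lactobacillus species is therefore informative at the genus level but not at the species level, which is why the number of peptides attributable to a taxon shrinks as the rank becomes more specific.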



Figure 3. The taxonomic distribution of unique peptides identified by each approach as specific to family (top), genus (middle) and species (bottom) level.

Application of the MTS strategy to a human gut microbiota sample




To further assess the performance of MTS on “real-world” microbiota samples, we applied the strategy to the metaproteomic data from a healthy Sardinian volunteer’s fecal microbiota sample (human_0), retrieved from the study of Tanca et al.24 According to their report, searching against the traditional contig-based metagenomic database (the sequencing depth was 6 Mbps) with MaxQuant yielded 6373 peptide identifications at 1% FDR. Using the MTS strategy, we extracted 1730 species from the taxonomic classification, 1402 of which have protein sequences in UniProt (Table S3). We finally built a database consisting of 266,897 proteins (MTS_h0). Searching against this database, we identified 10932 peptides, about 1.7 times as many as with the metagenomic database; a similar trend was observed for PSM identifications. We examined the PEP scores of PSMs identified with MTS_h0 or the metagenomic database (Figure 4A), and the PEP scores of PSMs identified only by the public or the metagenomic part of MTS_h0 (Figure 4B). There was no obvious difference in PEP distribution between MTS_h0 and the metagenomic database, nor between PSMs identified only by the public part and those identified only by the metagenomic part of MTS_h0.

As illustrated in Figure 5A-B, the taxonomic distributions of the identified peptides in the human_0 sample changed considerably depending on the database used, at both the phylum and the genus level. With MTS_h0, we found about twice as many taxa as with the metagenomic database alone according to the taxonomic annotation (16 vs 9 at the phylum level; 85 vs 45 at the genus level). According to Tilg et al.,38 most mammalian gut microbial species belong to four bacterial phyla: Firmicutes, Bacteroidetes, Proteobacteria and Actinobacteria. These were the four most abundant phyla identified with the MTS_h0 database, whereas with the metagenomic database Actinobacteria was not among the top four.



About 93.4% of the peptides identified with the metagenomic database were also detected with the MTS_h0 database. Among the peptides exclusively identified with the MTS_h0 database (Figure S3), 86.77% were assigned to Firmicutes while only 6.12% were assigned to Bacteroidetes. As a consequence, the Firmicutes/Bacteroidetes ratio was 2.83 with MTS_h0 versus 1.14 with the metagenomic database, indicating that Firmicutes were underrepresented with the metagenomic database. The five most abundant KEGG orthology functional items of the peptides exclusively detected using the MTS strategy are illustrated in Figure 5C.

Figure 4. (A) PEP score distribution of PSMs identified with MTS_h0 or metagenomic database. (B) PEP score distribution of PSMs only identified with public or metagenomic part of the MTS_h0 database.




Figure 5. Taxonomy analysis of the peptides identified using the MTS_h0 or the metagenomic database (DB) at the phylum (A) and genus level (B). (C) Five most abundant KEGG orthology functional groups of the peptides uniquely identified using the MTS_h0 database. Spectral counts were used as a quantitative measure of relative abundance.



DISCUSSION

Several computational methods have been developed to identify and quantify the peptides and proteins present in a microbiome sample. In this study, we demonstrated that the MT strategy, which integrates metagenomic sequencing-based databases with up-to-date public reference sequence databases, achieves higher sensitivity in peptide identification and provides more comprehensive taxonomic information, with confidence, in both lab-assembled and complex “real-world” datasets.

The metagenomic sequences of the test datasets used here for proteomic database searching were assembled with Velvet, an algorithm designed for de novo short-read assembly using de Bruijn graphs.27 Previous work has found that many genes are missed by gene prediction or assembled incompletely because of the complexity of microbiome samples and the limitations of short-read sequencing.25 In the MT strategy, we address this issue by exploiting well-annotated public sequence databases, guided by the taxonomic information derived from metagenomic sequencing.

Alternatively, rather than assembling the reads into contigs and then predicting full-length genes, some researchers have inferred candidate peptides (short amino acid sequences) that may be present in the samples directly from metagenomic sequencing data, in order to construct a metapeptide database.39 May et al. trimmed and filtered short sequences from gene fragments predicted by MetaGeneAnnotator, or from six-frame translations of raw reads, to build a database of metapeptides.39 Tang et al. developed a graph-centric tool named Graph2Pro, which assembles the reads from metagenome or metatranscriptome raw data into peptides using the de Bruijn graph structure.40 We evaluated these two alignment-free methods using the 9MM and human_0 samples. Both identified fewer peptides and PSMs than the MTS strategy (Table S4).

It should be noted that metaproteomic results obtained using metagenomic sequencing databases depend strongly on the sequencing strategy and instrumentation and on the tools used for metagenome sequence processing. For example, sequencers that produce longer reads, or alternative tools for sequence assembly, might lead to different databases and thus to considerably different identification yields. According to the data presented here, the use of public reference databases guided by metagenome taxonomy, as in our MT strategy, can be a good complement to metagenome assembly. Moreover, the MT strategy is flexible and easy to implement, since any novel metagenome processing method can be integrated into our workflow in place of classic metagenome assembly.

MetaPro-IQ is an efficient metaproteomic approach for intestinal microbial protein identification that uses a close-to-complete (human or mouse) gut microbial gene catalog as the database together with an iterative database search strategy.23 We also compared MetaPro-IQ and the MTS strategy using the previously described gut microbial sample (human_0). The identification rates of peptides and PSMs were slightly higher with the MTS strategy. About 70% of the peptides identified by MetaPro-IQ and the MTS strategy overlapped, and the subsequent taxonomic classifications were also consistent (Figure S3).

In database searching of proteome data, a higher risk of false positives and a considerable bias in peptide identification have been observed when different sequence databases are combined, for example when regular protein sequences are combined with mutant protein sequences or with modifications.35,36 In those cases, more stringent FDR control methods have been reported and recommended in place of a global FDR estimation. In our MT strategy, the user should carefully examine the quality of the PSMs identified from the different database sources to ensure the confidence of the identified peptides.
If an apparent shift toward low scores is observed in the PSM score distribution, a stricter FDR estimation should be used instead, such as the refined separated FDR or the transferred subgroup FDR estimation method.35,36 In our tests, there was no significant difference in PSM quality between matches to the public database and matches to the metagenomic sequences.

In the workflow of our MT strategy, two measures have been adopted to balance the false positive rate and the false negative rate. The first is downsizing the public reference database, based on the metagenome-derived taxonomic profiles and on prior knowledge of the sample composition, to reduce false positives. The second is a two-step search strategy, which can be used with very large databases to improve identification sensitivity. According to the study by Muth et al.,20 a two-step search can increase the peptide identification yield, but it also carries a risk of increasing the number of false-positive identifications, which calls for caution.

In conclusion, the MT strategy is easy to use and effective for peptide identification in metaproteomics. We hope the new workflow will facilitate further functional and taxonomic analysis of microbial communities.
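The per-source quality check discussed above can be illustrated with a minimal sketch. It is not the refined separated FDR or the transferred subgroup FDR method of refs 35 and 36; it only applies a plain target-decoy FDR estimate independently to the PSMs from each database source, so that a weak subgroup cannot hide behind a strong one. The function name `separate_fdr` and the record fields `score`, `is_decoy`, and `source` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def separate_fdr(psms, fdr_threshold=0.01):
    """Accept target PSMs per database source with a per-source target-decoy FDR.

    psms: iterable of dicts with hypothetical keys
          'score' (higher is better), 'is_decoy' (bool),
          'source' (e.g. 'public' or 'metagenome').
    Returns a dict mapping each source to its accepted target PSMs.
    """
    by_source = defaultdict(list)
    for psm in psms:
        by_source[psm["source"]].append(psm)

    accepted = {}
    for source, group in by_source.items():
        # Rank the PSMs of this source from best to worst score.
        ranked = sorted(group, key=lambda p: p["score"], reverse=True)
        targets = decoys = 0
        cutoff = 0  # length of the longest ranked prefix within the threshold
        for i, psm in enumerate(ranked, start=1):
            if psm["is_decoy"]:
                decoys += 1
            else:
                targets += 1
            # Simple estimate: FDR ~ decoys / targets above the current score.
            if targets > 0 and decoys / targets <= fdr_threshold:
                cutoff = i
        accepted[source] = [p for p in ranked[:cutoff] if not p["is_decoy"]]
    return accepted
```

Comparing the number of PSMs accepted per source under this separate estimate with the number accepted under a single global estimate gives a quick indication of whether one source is contributing disproportionately many low-scoring matches.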

ASSOCIATED CONTENT

Supporting Information.

Tables S1-S3, lists of the taxa extracted for constructing the MTS (Table S1), MTG (Table S2), and MTS_h0 (Table S3) databases (XLSX)

Table S4, comparison of PSM and peptide identification between the MTS strategy and other approaches (XLSX)

Figure S1, a schematic illustration of how to extract the taxa for database construction according to the taxonomic classification (PDF)




Figure S2, taxonomic analysis of the peptides identified only with the MTS_h0 database at the phylum (left) and genus (right) levels (PDF)

Figure S3, comparison of peptide identification (A) and taxonomic classification (B) between MetaPro-IQ and the MTS strategy in the analysis of the human gut microbial sample (PDF)

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected]. Tel.: (086) (021) 34204348-108.

Funding Sources
This work was supported by the National Natural Science Foundation of China (31271416, 41676177) and the Natural Science Foundation of Shanghai (17ZR1413900).

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENT

We would like to thank Professor Sergio Uzzau for generously sharing data, Jialin Hou for his helpful suggestions on metagenomic analysis, Peng Cui and Yafei Chang for reviewing the manuscript, and the High Performance Computing Center (HPCC) at Shanghai Jiao Tong University for the computation.



REFERENCES

(1) Handelsman, J.; Rondon, M. R.; Brady, S. F.; Clardy, J.; Goodman, R. M. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 1998, 5 (10), R245-9.
(2) Venter, J. C.; Remington, K.; Heidelberg, J. F.; Halpern, A. L.; Rusch, D.; Eisen, J. A.; Wu, D.; Paulsen, I.; Nelson, K. E.; Nelson, W.; Fouts, D. E.; Levy, S.; Knap, A. H.; Lomas, M. W.; Nealson, K.; White, O.; Peterson, J.; Hoffman, J.; Parsons, R.; Baden-Tillson, H.; Pfannkoch, C.; Rogers, Y. H.; Smith, H. O. Environmental genome shotgun sequencing of the Sargasso Sea. Science 2004, 304 (5667), 66-74.
(3) Hugenholtz, P.; Tyson, G. W. Microbiology: metagenomics. Nature 2008, 455 (7212), 481-3.
(4) Qin, J.; Li, R.; Raes, J.; Arumugam, M.; Burgdorf, K. S.; Manichanh, C.; Nielsen, T.; Pons, N.; Levenez, F.; Yamada, T.; Mende, D. R.; Li, J.; Xu, J.; Li, S.; Li, D.; Cao, J.; Wang, B.; Liang, H.; Zheng, H.; Xie, Y.; Tap, J.; Lepage, P.; Bertalan, M.; Batto, J. M.; Hansen, T.; Le Paslier, D.; Linneberg, A.; Nielsen, H. B.; Pelletier, E.; Renault, P.; Sicheritz-Ponten, T.; Turner, K.; Zhu, H.; Yu, C.; Li, S.; Jian, M.; Zhou, Y.; Li, Y.; Zhang, X.; Li, S.; Qin, N.; Yang, H.; Wang, J.; Brunak, S.; Dore, J.; Guarner, F.; Kristiansen, K.; Pedersen, O.; Parkhill, J.; Weissenbach, J.; MetaHIT Consortium; Bork, P.; Ehrlich, S. D.; Wang, J. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010, 464 (7285), 59-65.
(5) Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 2012, 486 (7402), 207-14.
(6) Sunagawa, S.; Coelho, L. P.; Chaffron, S.; Kultima, J. R.; Labadie, K.; Salazar, G.; Djahanschiri, B.; Zeller, G.; Mende, D. R.; Alberti, A.; Cornejo-Castillo, F. M.; Costea, P. I.; Cruaud, C.; d'Ovidio, F.; Engelen, S.; Ferrera, I.; Gasol, J. M.; Guidi, L.; Hildebrand, F.; Kokoszka, F.; Lepoivre, C.; Lima-Mendez, G.; Poulain, J.; Poulos, B. T.; Royo-Llonch, M.; Sarmento, H.; Vieira-Silva, S.; Dimier, C.; Picheral, M.; Searson, S.; Kandels-Lewis, S.; Tara Oceans coordinators; Bowler, C.; de Vargas, C.; Gorsky, G.; Grimsley, N.; Hingamp, P.; Iudicone, D.; Jaillon, O.; Not, F.; Ogata, H.; Pesant, S.; Speich, S.; Stemmann, L.; Sullivan, M. B.; Weissenbach, J.; Wincker, P.; Karsenti, E.; Raes, J.; Acinas, S. G.; Bork, P. Ocean plankton. Structure and function of the global ocean microbiome. Science 2015, 348 (6237), 1261359.
(7) Ottman, N.; Smidt, H.; de Vos, W. M.; Belzer, C. The function of our microbiota: who is out there and what do they do? Front Cell Infect Microbiol 2012, 2, 104.
(8) Wilmes, P.; Bond, P. L. Metaproteomics: studying functional gene expression in microbial ecosystems. Trends Microbiol 2006, 14 (2), 92-7.
(9) Erickson, A. R.; Cantarel, B. L.; Lamendella, R.; Darzi, Y.; Mongodin, E. F.; Pan, C.; Shah, M.; Halfvarson, J.; Tysk, C.; Henrissat, B.; Raes, J.; Verberkmoes, N. C.; Fraser, C. M.; Hettich, R. L.; Jansson, J. K. Integrated metagenomics/metaproteomics reveals human host-microbiota signatures of Crohn's disease. PLoS One 2012, 7 (11), e49138.
(10) Mayers, M. D.; Moon, C.; Stupp, G. S.; Su, A. I.; Wolan, D. W. Quantitative Metaproteomics and Activity-Based Probe Enrichment Reveals Significant Alterations in Protein Expression from a Mouse Model of Inflammatory Bowel Disease. J Proteome Res 2017, 16 (2), 1014-1026.
(11) Daniel, H.; Gholami, A. M.; Berry, D.; Desmarchelier, C.; Hahne, H.; Loh, G.; Mondot, S.; Lepage, P.; Rothballer, M.; Walker, A.; Bohm, C.; Wenning, M.; Wagner, M.; Blaut, M.; Schmitt-Kopplin, P.; Kuster, B.; Haller, D.; Clavel, T. High-fat diet alters gut microbiota physiology in mice. ISME J 2014, 8 (2), 295-308.



(12) Mosier, A. C.; Li, Z.; Thomas, B. C.; Hettich, R. L.; Pan, C.; Banfield, J. F. Elevated temperature alters proteomic responses of individual organisms within a biofilm community. ISME J 2015, 9 (1), 180-94.
(13) Noble, W. S. Mass spectrometrists should search only for peptides they care about. Nat Methods 2015, 12 (7), 605-8.
(14) Cargile, B. J.; Bundy, J. L.; Stephenson, J. L., Jr. Potential for false positive identifications from large databases through tandem mass spectrometry. J Proteome Res 2004, 3 (5), 1082-5.
(15) Renuse, S.; Chaerkady, R.; Pandey, A. Proteogenomics. Proteomics 2011, 11 (4), 620-30.
(16) Callister, S. J.; Wilkins, M. J.; Nicora, C. D.; Williams, K. H.; Banfield, J. F.; VerBerkmoes, N. C.; Hettich, R. L.; N'Guessan, L.; Mouser, P. J.; Elifantz, H.; Smith, R. D.; Lovley, D. R.; Lipton, M. S.; Long, P. E. Analysis of biostimulated microbial communities from two field experiments reveals temporal and spatial differences in proteome profiles. Environ Sci Technol 2010, 44 (23), 8897-903.
(17) Morris, B. E.; Herbst, F. A.; Bastida, F.; Seifert, J.; von Bergen, M.; Richnow, H. H.; Suflita, J. M. Microbial interactions during residual oil and n-fatty acid metabolism by a methanogenic consortium. Environ Microbiol Rep 2012, 4 (3), 297-306.
(18) Jagtap, P.; Goslinga, J.; Kooren, J. A.; McGowan, T.; Wroblewski, M. S.; Seymour, S. L.; Griffin, T. J. A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies. Proteomics 2013, 13 (8), 1352-7.
(19) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007, 4 (3), 207-14.




(20) Muth, T.; Kolmeder, C. A.; Salojarvi, J.; Keskitalo, S.; Varjosalo, M.; Verdam, F. J.; Rensen, S. S.; Reichl, U.; de Vos, W. M.; Rapp, E.; Martens, L. Navigating through metaproteomics data: a logbook of database searching. Proteomics 2015, 15 (20), 3439-53.
(21) Li, J.; Jia, H.; Cai, X.; Zhong, H.; Feng, Q.; Sunagawa, S.; Arumugam, M.; Kultima, J. R.; Prifti, E.; Nielsen, T.; Juncker, A. S.; Manichanh, C.; Chen, B.; Zhang, W.; Levenez, F.; Wang, J.; Xu, X.; Xiao, L.; Liang, S.; Zhang, D.; Zhang, Z.; Chen, W.; Zhao, H.; Al-Aama, J. Y.; Edris, S.; Yang, H.; Wang, J.; Hansen, T.; Nielsen, H. B.; Brunak, S.; Kristiansen, K.; Guarner, F.; Pedersen, O.; Dore, J.; Ehrlich, S. D.; MetaHIT Consortium; Bork, P.; Wang, J.; MetaHIT Consortium. An integrated catalog of reference genes in the human gut microbiome. Nat Biotechnol 2014, 32 (8), 834-41.
(22) Xiao, L.; Feng, Q.; Liang, S.; Sonne, S. B.; Xia, Z.; Qiu, X.; Li, X.; Long, H.; Zhang, J.; Zhang, D.; Liu, C.; Fang, Z.; Chou, J.; Glanville, J.; Hao, Q.; Kotowska, D.; Colding, C.; Licht, T. R.; Wu, D.; Yu, J.; Sung, J. J.; Liang, Q.; Li, J.; Jia, H.; Lan, Z.; Tremaroli, V.; Dworzynski, P.; Nielsen, H. B.; Backhed, F.; Dore, J.; Le Chatelier, E.; Ehrlich, S. D.; Lin, J. C.; Arumugam, M.; Wang, J.; Madsen, L.; Kristiansen, K. A catalog of the mouse gut metagenome. Nat Biotechnol 2015, 33 (10), 1103-8.
(23) Zhang, X.; Ning, Z.; Mayne, J.; Moore, J. I.; Li, J.; Butcher, J.; Deeke, S. A.; Chen, R.; Chiang, C. K.; Wen, M.; Mack, D.; Stintzi, A.; Figeys, D. MetaPro-IQ: a universal metaproteomic approach to studying human and mouse gut microbiota. Microbiome 2016, 4 (1), 31.
(24) Tanca, A.; Palomba, A.; Fraumene, C.; Pagnozzi, D.; Manghina, V.; Deligios, M.; Muth, T.; Rapp, E.; Martens, L.; Addis, M. F.; Uzzau, S. The impact of sequence database choice on metaproteomic results in gut microbiota studies. Microbiome 2016, 4 (1), 51.



(25) Cantarel, B. L.; Erickson, A. R.; VerBerkmoes, N. C.; Erickson, B. K.; Carey, P. A.; Pan, C.; Shah, M.; Mongodin, E. F.; Jansson, J. K.; Fraser-Liggett, C. M.; Hettich, R. L. Strategies for metagenomic-guided whole-community proteomics of complex microbial environments. PLoS One 2011, 6 (11), e27173.
(26) Tanca, A.; Palomba, A.; Deligios, M.; Cubeddu, T.; Fraumene, C.; Biosa, G.; Pagnozzi, D.; Addis, M. F.; Uzzau, S. Evaluating the impact of different sequence databases on metaproteome analysis: insights from a lab-assembled microbial mixture. PLoS One 2013, 8 (12), e82981.
(27) Zerbino, D. R.; Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 2008, 18 (5), 821-829.
(28) Apweiler, R. Activities at the Universal Protein Resource (UniProt) (vol 42, pg D198, 2014). Nucleic Acids Research 2014, 42 (11), 7486-7486.
(29) Suzek, B. E.; Wang, Y.; Huang, H.; McGarvey, P. B.; Wu, C. H.; UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015, 31 (6), 926-32.
(30) Menzel, P.; Ng, K. L.; Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun 2016, 7, 11257.
(31) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24 (21), 2534-6.
(32) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466-7.
(33) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 2008, 26 (12), 1367-72.




(34) Mesuere, B.; Debyser, G.; Aerts, M.; Devreese, B.; Vandamme, P.; Dawyndt, P. The Unipept metaproteomics analysis pipeline. Proteomics 2015, 15 (8), 1437-42.
(35) Li, J.; Su, Z. L.; Ma, Z. Q.; Slebos, R. J. C.; Halvey, P.; Tabb, D. L.; Liebler, D. C.; Pao, W.; Zhang, B. A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics. Mol. Cell. Proteomics 2011, 10 (5), M110-006536.
(36) Fu, Y.; Qian, X. H. Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry. Mol. Cell. Proteomics 2014, 13 (5), 1359-1368.
(37) Deligios, M.; Fraumene, C.; Abbondio, M.; Mannazzu, I.; Tanca, A.; Addis, M. F.; Uzzau, S. Draft Genome Sequence of Rhodotorula mucilaginosa, an Emergent Opportunistic Pathogen. Genome Announc 2015, 3 (2), e00201-15.
(38) Tilg, H.; Kaser, A. Gut microbiome, obesity, and metabolic dysfunction. Journal of Clinical Investigation 2011, 121 (6), 2126-2132.
(39) May, D. H.; Timmins-Schiffman, E.; Mikan, M. P.; Harvey, H. R.; Borenstein, E.; Nunn, B. L.; Noble, W. S. An Alignment-Free "Metapeptide" Strategy for Metaproteomic Characterization of Microbiome Samples Using Shotgun Metagenomic Sequencing. J Proteome Res 2016, 15 (8), 2697-2705.
(40) Tang, H. X.; Li, S. J.; Ye, Y. Z. A Graph-Centric Approach for Metagenome-Guided Peptide and Protein Identification in Metaproteomics. PLoS Computational Biology 2016, 12 (12), e1005224.

