Critical Review pubs.acs.org/est
Experimental Design and Bioinformatics Analysis for the Application of Metagenomics in Environmental Sciences and Biotechnology
Feng Ju and Tong Zhang*
Environmental Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China
ABSTRACT: Recent advances in DNA sequencing technologies have prompted the widespread application of metagenomics for the investigation of novel bioresources (e.g., industrial enzymes and bioactive molecules) and unknown biohazards (e.g., pathogens and antibiotic resistance genes) in natural and engineered microbial systems across multiple disciplines. This review discusses rigorous experimental design and sample preparation in the context of applying metagenomics in environmental sciences and biotechnology. Moreover, this review summarizes the principles, methodologies, and state-of-the-art bioinformatics procedures, tools and database resources for metagenomics applications and discusses two popular strategies (analysis of unassembled reads versus assembled contigs/draft genomes) for quantitative or qualitative insights into microbial community structure and functions. Overall, this review aims to facilitate more extensive application of metagenomics in the investigation of uncultured microorganisms, novel enzymes, microbe-environment interactions, and biohazards in biotechnological applications where microbial communities are engineered for bioenergy production, wastewater treatment, and bioremediation.
■
INTRODUCTION
Over the past two decades, molecular techniques, such as PCR cloning and sequencing, fluorescence in situ hybridization (FISH), denaturing gradient gel electrophoresis (DGGE), and terminal restriction fragment length polymorphism (T-RFLP) have been applied to fingerprint microbial communities engineered for various environmental biotechnical applications, including bioenergy production, bioremediation, and biological wastewater treatment.1−3 These studies have greatly enriched our knowledge of morphology, taxonomy and diversity of the contributing microorganisms. For example, it is recognized that microorganisms rarely work alone; instead, they are usually arranged in biogranules, biofilms, or flocs to facilitate the establishment of tightly coupled interactions necessary for removal of inhibitory intermediates, biodegradation, or bioenergy production.4 However, while these useful molecular techniques have partially elucidated taxonomic identification of major players for relatively simple microbial systems, they are limited in addressing the function of microorganisms and the mechanisms underlying their interactions. Although the molecular techniques could be combined with cultivation to achieve this, few microorganisms ( 99.9% accuracy41) and concatenate the low-quality but exceptionally long reads (e.g., 20 kb) from so-called third-generation sequencing platforms,5 such as PacBio RS. The expected sequence depth is closely related to the measured biodiversity and complexity of microbial samples.
Generally, soil42 and sediments43 harbor more diverse microbial species than bioengineered systems.20,33,44−46 For biological wastewater treatment systems, higher biodiversity is generally detected in full-scale than lab-scale bioreactors, in biofilm than suspended sludge, and in activated sludge than anaerobic sludge.44−46 Recent attempts to assemble large complex soil metagenomes suggest that 80% of the sequencing data could not be assembled (because of low coverage) and even 300 Gbp of read data are still insufficient to cover even a localized soil sample deeply.26 In contrast, more than 45% of metagenomic reads from enriched microbial systems could be effectively used in assembly.20,21,47 Moreover, 29−57 Gbp of sequences are enough for reconstructing 31 bacterial genomes, including rare (95% similarity; MG-RAST removes all but a single representative of clusters of DNA sequences whose first 50 base pairs are identical;49 and PRINSEQ offers a complete list of options (five alternative modes) for users to remove exact (100% similarity) and reverse complement duplicates.51 Third, a PE sequencing strategy on NGS platforms makes it possible to overlap/merge PE sequences. Two major advantages of using overlapped sequences (i.e., “tags”) with extended length include higher resolution and accuracy of taxonomic and functional annotation and higher quality tags due to overlapping PE sequences with error correction.38,57 Popular free platforms or tools for overlapping PE sequences include QIIME,58 mothur,57 FLASH,59 PANDAseq,38 etc.
Assembly. The assembly (i.e., the computational process to connect short DNA/cDNA fragments) of metagenome or metatranscriptome can yield long contigs for predicting full-length protein-coding genes (PCGs, or open reading frames) or transcripts, recovering genomic sequences (via binning as discussed later), and identifying strain-specific genomic islands,60 thus allowing more accurate qualitative analysis of genetic contents (e.g., at strain or species levels), especially for
DOI: 10.1021/acs.est.5b03719 Environ. Sci. Technol. 2015, 49, 12628−12640
Environmental Science & Technology
based on similarity searches and taxonomic assignments. However, they are not reliable for assigning short reads and often require longer assembled contigs (e.g., ≥ 1 kb for the expert-trained PhyloPythiaS package) and manual efforts to ensure high assignment accuracies. Homology-based binning methods based on similarity search against currently available genomes may provide useful grouping information on short metagenomic reads for improving downstream genome binning of specific organisms, although these methods can be limited by the heavy reliance on the quality and representativeness of reference databases, poor taxonomic resolution of short reads, and the accuracy and/or sensitivity of alignment tools. Other software programs (e.g., PhymmBL90 and MetaCluster91) consider both the composition and homology of metagenomic sequence for taxonomic classification or clustering of reads from the same or similar genomes. The continuously decreasing sequencing cost has allowed researchers to access environmental metagenomes at increasing sequencing depths (e.g., > 50 Gbp), thereby offering sufficient resolution to retrieve partial or near-complete genomes of rare ( 50 Gbp). However, considering that unassembled metagenomic reads from NGS platforms are short (e.g., 100−150 bp for Illumina HiSeq), it is essential to self-evaluate annotation accuracy and sensitivity before customized pipelines (e.g., multistep similarity search and subdatabases) are applied to speed up annotation of NGS data. One general practice in suppressing false positive annotations of short reads is to exert rigorous filtering criteria or cutoffs (i.e., identity, alignment length, and e-value) during database search.49,103,116,118
Assembled Contigs. The contigs assembled from metagenomic reads can be used to recover genomes of uncultured organisms using the afore-discussed binning approaches.
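The cutoff-based filtering of short-read annotations mentioned above (identity, alignment length, and e-value) can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure of any cited pipeline: it assumes hits are available in the common 12-column tab-separated BLAST-style layout (query, subject, percent identity, alignment length, ..., e-value, bit score), and the threshold values and function names are hypothetical placeholders that should be calibrated by the self-evaluation of accuracy and sensitivity recommended in the text.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    """One alignment record from a 12-column tabular BLAST-style report."""
    query: str
    subject: str
    identity: float   # percent identity of the alignment
    aln_length: int   # alignment length (residues or bases)
    evalue: float
    bitscore: float


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse tab-separated hits; columns 1-4, 11 and 12 are the ones used."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        f = line.split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))


def filter_hits(hits: Iterable[Hit],
                min_identity: float = 90.0,   # hypothetical cutoff (%)
                min_length: int = 25,         # hypothetical cutoff
                max_evalue: float = 1e-5) -> Iterator[Hit]:
    """Keep only hits that pass all three cutoffs."""
    for h in hits:
        if (h.identity >= min_identity
                and h.aln_length >= min_length
                and h.evalue <= max_evalue):
            yield h


def best_hit_per_read(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Retain the highest-bit-score hit for each read."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


def _row(q, s, ident, length, evalue, bits):
    # Mismatch, gap and coordinate columns are dummy values in this toy example.
    return "\t".join([q, s, str(ident), str(length), "0", "0",
                      "1", str(length), "1", str(length), str(evalue), str(bits)])


# Toy input: only the two read1 hits pass every cutoff.
EXAMPLE_LINES = [
    _row("read1", "geneA", 98.0, 33, 1e-12, 70.1),
    _row("read1", "geneB", 95.0, 30, 1e-8, 55.0),
    _row("read2", "geneC", 60.0, 33, 1e-6, 40.0),   # fails identity
    _row("read3", "geneD", 99.0, 12, 1e-9, 30.0),   # fails alignment length
    _row("read4", "geneE", 97.0, 33, 2e-3, 35.0),   # fails e-value
]

if __name__ == "__main__":
    kept = list(filter_hits(parse_tabular(EXAMPLE_LINES)))
    for read, hit in best_hit_per_read(kept).items():
        print(read, hit.subject, hit.identity, hit.evalue)
```

Applying the cutoffs before choosing a best hit per read keeps weak alignments from being counted as annotations; stricter thresholds trade sensitivity for fewer false positives.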
The reconstructed genomes can be submitted to web-based servers, such as IMG/ER,119 RAST Server, WebMGA,121 and ggKbase (http://ggkbase.berkeley.edu/). These platforms provide a series of standardized and/or customized pipelines and annotation tools. Moreover, gene-calling tools commonly used for PCG prediction from contigs include Prodigal,122 MetaGeneAnnotator,123 MetaGeneMark,124 FragGeneScan,125 and Orphelia.126 Prodigal is a popular tool with improved gene structure predictions and lower false positive rates. The relative quantification of PCGs in a metagenome or metatranscriptome is typically based on mapping short genomic DNA or cDNA reads against each PCG. Widely used quantitative metrics include RPKM (Reads Per Kilobase of transcript per Million mapped reads),127 FPKM (Fragments Per Kilobase of transcript per Million fragments mapped),128 and relative abundance (number of reads mapped to a PCG divided by total number of reads). Considering an uneven richness of different PCGs in a metagenome, metatranscriptomic RPKM (M-RPKM) has been proposed to better reflect the absolute transcriptional activity of each PCG; M-RPKM is calculated as the transcriptional activity of each PCG (i.e., RPKM-RNA) divided by its RPKM in the coupled metagenome (i.e., RPKM-DNA).13 The two most popular free alignment tools for mapping DNA reads against PCGs or contigs are Bowtie/Bowtie2 (refs 129,130) and BWA/BWA-SW.131,132 A fast and sensitive spliced alignment program named “HISAT”133 was recently developed based on Bowtie2 to align transcriptome data; HISAT enables efficient mapping of transcriptome reads, particularly reads that span multiple exons. To detect proteins with related functions but without considerable homology to known PCGs, domain-based database search tools, such as InterProScan,134 HMMER135 and
abundance (coverage) profiles in one metagenome. The draft genomes obtained from this method can be further purified and refined using composition-based methods and other strategies, including paired-end tracking, reassembly, and manual curation. Recently, several automated tools, including GroopM,39 CONCOCT,95 MaxBin96 and MetaBAT,97 were developed to integrate the above “coverage-composition” strategies into efficient pipelines for genome reconstruction from metagenomes, primarily based on coverage/abundance profiles and tetranucleotide frequency (TNF) patterns of contigs across multiple related metagenomes (e.g., on temporal or spatial scales). The completeness and potential contamination in reconstructed genomes have been estimated by the presence/absence of essential marker genes, such as essential single copy marker genes conserved in 95% of bacteria,98 conserved phylogenetic marker genes,21 or clusters of orthologous groups (COGs).99 Currently, CheckM is the only automated tool that can assess the quality of a genome recovered from isolates, single cells, and metagenomes based on conserved marker genes.100
Annotation. Both unassembled reads and assembled contigs from metagenomic data are used to annotate the structure and functions of microbial communities. Generally, unassembled short reads retain all the original abundance information and enable quantitative comparisons of microbial taxa, functional genes, and global metabolic profiles (e.g., KEGG pathways101 and SEED subsystems102) within or across microbial habitats or ecosystems.47,103 However, short reads are large in data size and may lack resolution for taxonomic and functional annotations. In contrast, assembled contigs with much longer lengths and more compact size allow more robust and rapid analysis of both specific species and their functional genes, compared with unassembled short reads.
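Quantitative comparisons of functional genes of this kind rest on normalized read counts. The metrics defined in this review (RPKM, relative abundance, and M-RPKM as the ratio of RPKM-RNA to RPKM-DNA) can be illustrated with the minimal Python sketch below. The function names and the example counts are hypothetical and chosen only for illustration; returning `None` when a gene has no DNA coverage is an assumption of this sketch, not a rule stated in the source.

```python
from typing import Optional


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of gene per Million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length in kb) / (mapped reads in millions)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def relative_abundance(read_count: int, total_reads: int) -> float:
    """Reads mapped to a PCG divided by the total number of reads."""
    if total_reads <= 0:
        raise ValueError("total reads must be positive")
    return read_count / total_reads


def m_rpkm(rpkm_rna: float, rpkm_dna: float) -> Optional[float]:
    """M-RPKM: RPKM in the metatranscriptome over RPKM in the paired metagenome.

    Returns None when the gene has no DNA coverage (assumption of this sketch),
    because the ratio is undefined in that case.
    """
    if rpkm_dna <= 0:
        return None
    return rpkm_rna / rpkm_dna


if __name__ == "__main__":
    # Hypothetical gene of 1,000 bp in a paired DNA/RNA data set.
    dna = rpkm(read_count=500, gene_length_bp=1000, total_mapped_reads=10_000_000)
    rna = rpkm(read_count=2000, gene_length_bp=1000, total_mapped_reads=20_000_000)
    print("RPKM-DNA:", dna)
    print("RPKM-RNA:", rna)
    print("M-RPKM:", m_rpkm(rna, dna))
```

Dividing by RPKM-DNA normalizes expression by gene (and hence population) abundance, so a highly transcribed gene from a rare organism is not masked by a weakly transcribed gene from a dominant one.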
However, an assembly-based annotation strategy has the potential to introduce biases for quantitative analysis, because of the difficulty in assembling low-abundance species and closely related strains, and the exclusion of considerable amounts of unassembled data from downstream analysis.
Unassembled Reads. The first strategy for metagenome and metatranscriptome annotation is the direct use of unassembled clean reads for the quantitative analysis of microbial community composition and function. Moreover, paired metagenomic and metatranscriptomic data (from the same samples) are used to describe the “active” microorganisms and compare their gene expression patterns in both natural and engineered ecosystems.7,24 Additionally, short metagenomic DNA reads are used to characterize or predict environmental biohazards (i.e., bacterial pathogens, viruses, and antibiotic resistance genes (ARGs)) in drinking water disinfection systems104,105 and waste and wastewater treatment systems.103,106−108 Read-based taxonomic and functional annotation is mainly supervised and based on homology searches against a reference database using various alignment tools. Unassembled clean reads can be annotated either via standardized platforms or packages (e.g., MG-RAST for DNA fragment analysis and PRADA for RNA analysis). Several taxonomy profiling tools that can provide at least species-level resolution, such as Pathoscope,109 Sigma,110 and MetaPhlAn2,83 are available for performing biosurveillance and detecting biohazards in targeted infected tissue samples109 or healthy human microbiome111 with the assistance of reference genomes. Running local similarity search with BLASTX or PSI-BLAST is rather computationally expensive and time-consuming,
CD-search,136 are developed to search against domain or motif models (e.g., hidden Markov model (HMM)). Many sequence- or domain-based functional gene databases are publicly available for functional annotation of genomes, metagenomes and metatranscriptomes, including KEGG,101 UniRef90,137 Swiss-Prot,138 PFAM,139 TIGRFAM,140 eggNOG,141 COG,142 CDD (Conserved Domain Database),143 and SMART.144 These databases are structured with raw protein sequences (e.g., KEGG and UniRef90), or trained models (e.g., PFAM, TIGRFAM, and CDD) representing protein families, and/or domains or the combination of both (SMART). Different alignment tools (similarity search or model-based) are used for different types of databases or research initiatives. For example, HMMER is widely utilized to search against PFAM, TIGRFAM, and SMART databases for detecting novel genes in assembled metagenomes. To maximize functional annotation of genomes and/or metagenomes, recent versions of MG-RAST (for metagenomes only) and IMG/M have merged the interpretation of annotations from most of the above databases within a single framework.25 Notably, FOAM (Functional Ontology Assignments for Metagenomes), the first HMM database with an environmental focus, was recently released, providing an opportunity to link model-based functional gene searches to a large group of target KEGG Orthologs (KOs).145 Last, network modeling and visualization have been demonstrated as a powerful method for exploring species−species co-occurrence,146−148 gene regulation and coexpression,149 protein−protein interactions,150 and metabolic pathways151 from large omics data sets generated from a sufficient number of related samples. Popular network analysis platforms include Cytoscape,152 Gephi,153 and WGCNA (Weighted Gene Co-Expression Network Analysis).154 Among them, Cytoscape is a popular and open-source platform for network visualization and exploration.
Data Sharing and Storage.
The sharing of sample metadata, sequence data, and computational results is a traditional and efficient method of knowledge exchange in the field of genomic research. The major significance of exchanging sequence data and computational results lies in the beneficial outcomes of comparative studies and the elimination of unnecessarily repeated processing of the same data sets or sequencing of similar microbial systems. Several publicly available databases have been maintained to promote the sharing and storage of NGS data, such as NCBI-SRA, MG-RAST, and Genomes OnLine Database v.5 (GOLD).155 The submitted sequence data are required to be accompanied by a complete standard metadata file that is prepared following the Minimum Information about any (x) Sequence checklists (MIxS) of the Genomic Standards Consortium (http://gensc.org/projects/mixs-gsc-project/) to include essential information, such as the sampling location, time, habitat type, organism, and sequencing method.
Perspectives. Many researchers have successfully used metagenomics to identify novel microorganisms, mine novel genes for bioenergy and biodegradation,13,15,156 monitor microbial biohazards (e.g., ARGs, pathogens, and viruses),103,107,108,157 and elucidate microbial community structure, metabolism and functions.12,16,17,158,159 However, there are several aspects (mainly pitfalls and limitations) that researchers need to be aware of and that should be eliminated or addressed in future metagenomics-based studies. First, a metagenomic approach has a lower sensitivity for gene detection compared with qPCR despite its far higher
coverage of gene catalogues in the whole microbial community. Thus, these two techniques should be used in a complementary manner to investigate low abundance genes or transcripts in microbial communities. Second, NGS-based metagenome (DNA) or metatranscriptome (RNA) sequencing of complex microbial communities remains expensive, because the cost increases with increased sequencing depth. This economic challenge has left most metagenomic studies that perform quantitative analysis without technical/biological replicates, making it hard to determine whether observed differences (especially small ones) within or between experimental groups are statistically significant and robust (after considering replicate variance). While metagenomic DNA data are known to have good technical reproducibility (R² > 0.9, from DNA extraction to NGS) for taxa composition and/or functional annotation,12,13,22,44 technical replicates of metatranscriptomic RNA data show larger variation for environmental samples (e.g., R² = 0.7, from RNA extraction to NGS),13,160 although NGS-based transcriptome of a single organism is highly reproducible.161 Future attention should be paid to addressing reproducibility of metatranscriptomic results before discussing differentially expressed features. Notably, selective enrichment prior to NGS, either by purposefully using target substrates or under harsh conditions, can largely reduce community diversity and complexity,162 thereby lowering the minimum necessary sequencing depth (thus cost) in metagenomics-based studies.
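A reproducibility check of the kind quoted above can be sketched as follows. The source does not state how the reported R² values were computed; this sketch assumes R² is the squared Pearson correlation between the feature abundances (e.g., per-taxon or per-gene relative abundances) of two technical replicates. The function names and the toy abundance vectors are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def replicate_r2(rep_a: Sequence[float], rep_b: Sequence[float]) -> float:
    """Squared Pearson correlation between two technical replicates.

    Assumption of this sketch: both vectors list the same features in the
    same order (e.g., relative abundances of the same taxa or genes).
    """
    return pearson_r(rep_a, rep_b) ** 2


if __name__ == "__main__":
    # Toy relative abundances (%) of five taxa in two replicates.
    replicate_1 = [35.0, 25.0, 20.0, 15.0, 5.0]
    replicate_2 = [33.0, 27.0, 19.0, 16.0, 5.0]
    print("R^2 between replicates:", round(replicate_r2(replicate_1, replicate_2), 3))
```

Because abundance data are often dominated by a few very abundant features, a log transformation before computing the correlation may give a more balanced picture; that choice is left to the analyst.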
Third, compared with a general overview and plain description of overall microbial community structure, functional categories (e.g., subsystems) and metabolic pathways (e.g., KEGG) by using automatic pipelines (e.g., MG-RAST), metagenomic de novo assembly and “coverage-composition” binning are powerful and increasingly popular approaches to recover population genomes from time-series (or taxonomically similar) metagenomes,22,32,34 facilitating the discovery of diverse novel genetic resources, especially from uncultured microbes. Genome reconstruction and annotation provide physiological and biochemical properties to help isolate uncultured microbes, eventually making the in vitro validation of predicted enzymatic activities feasible. More efforts are needed to evaluate and reduce assembly errors and improve binning accuracy, especially in the presence of microdiversity in the samples.20 Fourth, taxonomic and functional annotation of sequence data is inevitably affected by the completeness of database resources used. The current practice of using all PCGs (rather than clade-specific marker genes) for whole community taxonomic profiling (e.g., as implemented in MG-RAST) can be problematic considering the incompleteness of protein sequence databases and a broad taxonomic distribution of PCGs. Therefore, analysis of 16S rRNA gene fragments in a metagenome is recommended for an overview of taxonomic composition of a microbial community. Moreover, functional metagenomics that couples high-throughput functional screening of metagenomic library clones growing on target substrates (e.g., antibiotics, chitin, cellulose, etc.) with NGS is recommended to speed up the discovery of new protein families and functions,48,163 which is promising for the development of novel genetic engineering materials and the expansion of publicly available protein databases.
Last and foremost, metagenomics and related state-of-the-art data mining approaches (e.g., network modeling, de novo assembly, genome binning) are demonstrated as effective
Banfield, J. F. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004, 428 (6978), 37−43. (11) Tringe, S. G.; Von Mering, C.; Kobayashi, A.; Salamov, A. A.; Chen, K.; Chang, H. W.; Podar, M.; Short, J. M.; Mathur, E. J.; Detter, J. C. Comparative metagenomics of microbial communities. Science 2005, 308 (5721), 554−557. (12) Ju, F.; Guo, F.; Ye, L.; Xia, Y.; Zhang, T. Metagenomic analysis on seasonal microbial variations of activated sludge from a full-scale wastewater treatment plant over 4 years. Environ. Microbiol. Rep. 2014, 6 (1), 80−89. (13) Xia, Y.; Wang, Y.; Fang, H. H.; Jin, T.; Zhong, H.; Zhang, T. Thermophilic microbial cellulose decomposition and methanogenesis pathways recharacterized by metatranscriptomic and metagenomic analysis. Sci. Rep. 2014, 4, 6708−6716. (14) Xia, Y.; Ju, F.; Fang, H. H.; Zhang, T. Mining of novel thermostable cellulolytic genes from a thermophilic cellulose-degrading consortium by metagenomics. PLoS One 2013, 8 (1), e53779. (15) Wexler, M.; Bond, P. L.; Richardson, D. J.; Johnston, A. W. A wide host-range metagenomic library from a waste water treatment plant yields a novel alcohol/aldehyde dehydrogenase. Environ. Microbiol. 2005, 7 (12), 1917−1926. (16) Martín, H. G.; Ivanova, N.; Kunin, V.; Warnecke, F.; Barry, K. W.; McHardy, A. C.; Yeates, C.; He, S.; Salamov, A. A.; Szeto, E. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat. Biotechnol. 2006, 24 (10), 1263−1269. (17) Krause, L.; Diaz, N. N.; Edwards, R. A.; Gartemann, K. H.; Krömeke, H.; Neuweger, H.; Pühler, A.; Runte, K. J.; Schlüter, A.; Stoye, J. Taxonomic composition and gene content of a methane-producing microbial community isolated from a biogas reactor. J. Biotechnol. 2008, 136 (1), 91−101. (18) Guermazi, S.; Daegelen, P.; Dauga, C.; Rivière, D.; Bouchez, T.; Godon, J. J.; Gyapay, G.; Sghir, A.; Pelletier, E.; Weissenbach, J.
Discovery and characterization of a new bacterial candidate division by an anaerobic sludge digester metagenomic approach. Environ. Microbiol. 2008, 10 (8), 2111−2123. (19) Rinke, C.; Schwientek, P.; Sczyrba, A.; Ivanova, N. N.; Anderson, I. J.; Cheng, J. F.; Darling, A.; Malfatti, S.; Swan, B. K.; Gies, E. A. Insights into the phylogeny and coding potential of microbial dark matter. Nature 2013, 499 (7459), 431−437. (20) Albertsen, M.; Hugenholtz, P.; Skarshewski, A.; Nielsen, K. L.; Tyson, G. W.; Nielsen, P. H. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 2013, 31 (6), 533−538. (21) Wrighton, K. C.; Thomas, B. C.; Sharon, I.; Miller, C. S.; Castelle, C. J.; VerBerkmoes, N. C.; Wilkins, M. J.; Hettich, R. L.; Lipton, M. S.; Williams, K. H. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 2012, 337 (6102), 1661−1665. (22) Mao, Y.; Yu, K.; Xia, Y.; Chao, Y.; Zhang, T. Genome Reconstruction and Gene Expression of “Candidatus Accumulibacter phosphatis” Clade IB Performing Biological Phosphorus Removal. Environ. Sci. Technol. 2014, 48 (17), 10363−10371. (23) Meyerdierks, A.; Kube, M.; Kostadinov, I.; Teeling, H.; Glöckner, F. O.; Reinhardt, R.; Amann, R. Metagenome and mRNA expression analyses of anaerobic methanotrophic archaea of the ANME-1 group. Environ. Microbiol. 2010, 12 (2), 422−439. (24) Yu, K.; Zhang, T. Metagenomic and metatranscriptomic analysis of microbial community structure and gene expression of activated sludge. PLoS One 2012, 7 (5), e38183. (25) Thomas, T.; Gilbert, J.; Meyer, F. Metagenomics-a guide from sampling to data analysis. Microb. Inf. Exp. 2012, 2 (3), 1−12. (26) Howe, A. C.; Jansson, J. K.; Malfatti, S. A.; Tringe, S. G.; Tiedje, J. M.; Brown, C. T. Tackling soil diversity with the assembly of large, complex metagenomes. Proc. Natl. Acad. Sci. U. S. A. 2014, 111 (13), 4904−4909.
methods to reveal the intricate microbial mechanisms that deteriorate, maintain or advance function and stability of microbial communities.147,164 These “open-ended” approaches have enabled the formulation of numerous novel hypotheses (e.g., on interspecies interactions,147 gene coexpression,149 putative carbohydrate-active genes,165 antibiotic/metal resistance genes,103,166 or metabolic potentials of uncultured microbial dark matter (e.g., Candidatus Accumulibacter Clade IB,22 ANME-1,23 SBR1093167)) which require future validation via design of “hypothesis-driven” studies. To fulfill such endeavors and obtain more thorough understanding of microbially mediated systems, it is promising to combine metagenomics with key complementary techniques, including isotope labeling, qPCR, FISH, flow cytometry, and advanced chemical analysis.
■
AUTHOR INFORMATION
Corresponding Author
*Phone: 852-28591968 (lab), 852-28578551 (office); fax: 852-25595337; e-mail: [email protected].
Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS We thank the reviewers for their review and constructive comments on this manuscript. The authors thank all the members of Environmental Biotechnology Laboratory at HKU for their helpful literature sharing and idea exchange on this topic. This work was supported by a grant from the Hong Kong General Research Fund (172099/14E), Hong Kong. J.F. thanks The University of Hong Kong for the postgraduate studentship.
■
REFERENCES
(1) Fang, H. H. P.; Zhang, T., Eds. Anaerobic Biotechnology: Environmental Protection and Resource Recovery; Imperial College Press: London, 2015. (2) Grady Jr, C. L.; Daigger, G. T.; Love, N. G.; Filipe, C. D.; Leslie Grady, C. Biological Wastewater Treatment; IWA Publishing: London, U.K. 2011. (3) Lovley, D. R. Cleaning up with genomics: applying molecular biology to bioremediation. Nat. Rev. Microbiol. 2003, 1 (1), 35−44. (4) Sieber, J. R.; McInerney, M. J.; Gunsalus, R. P. Genomic insights into syntrophy: the paradigm for anaerobic metabolic cooperation. Annu. Rev. Microbiol. 2012, 66 (1), 429−452. (5) van Dijk, E. L.; Auger, H.; Jaszczyszyn, Y.; Thermes, C. Ten years of next-generation sequencing technology. Trends Genet. 2014, 30 (9), 418−426. (6) Pernthaler, A.; Dekas, A. E.; Brown, C. T.; Goffredi, S. K.; Embaye, T.; Orphan, V. J. Diverse syntrophic partnerships from deep-sea methane vents revealed by direct cell capture and metagenomics. Proc. Natl. Acad. Sci. U. S. A. 2008, 105 (19), 7052−7057. (7) Frias-Lopez, J.; Shi, Y.; Tyson, G. W.; Coleman, M. L.; Schuster, S. C.; Chisholm, S. W.; DeLong, E. F. Microbial community gene expression in ocean surface waters. Proc. Natl. Acad. Sci. U. S. A. 2008, 105 (10), 3805−3810. (8) Mason, O. U.; Hazen, T. C.; Borglin, S.; Chain, P. S. G.; Dubinsky, E. A.; Fortney, J. L.; Han, J.; Holman, H. Y. N.; Hultman, J.; Lamendella, R. Metagenome, metatranscriptome and single-cell sequencing reveal microbial response to Deepwater Horizon oil spill. ISME J. 2012, 6 (9), 1715−1727. (9) Shi, Y.; Tyson, G. W.; Eppley, J. M.; DeLong, E. F. Integrated metatranscriptomic and metagenomic analyses of stratified microbial assemblages in the open ocean. ISME J. 2011, 5 (6), 999−1013. (10) Tyson, G. W.; Chapman, J.; Hugenholtz, P.; Allen, E. E.; Ram, R. J.; Richardson, P. M.; Solovyev, V. V.; Rubin, E. M.; Rokhsar, D. S.;
(27) Desai, N.; Antonopoulos, D.; Gilbert, J. A.; Glass, E. M.; Meyer, F. From genomics to metagenomics. Curr. Opin. Biotechnol. 2012, 23 (1), 72−76. (28) Knight, R.; Jansson, J.; Field, D.; Fierer, N.; Desai, N.; Fuhrman, J. A.; Hugenholtz, P.; van der Lelie, D.; Meyer, F.; Stevens, R. Unlocking the potential of metagenomics through replicated experimental design. Nat. Biotechnol. 2012, 30 (6), 513−520. (29) Goodrich, J. K.; Di Rienzi, S. C.; Poole, A. C.; Koren, O.; Walters, W. A.; Caporaso, J. G.; Knight, R.; Ley, R. E. Conducting a Microbiome Study. Cell 2014, 158 (2), 250−262. (30) Scholz, M. B.; Lo, C. C.; Chain, P. S. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr. Opin. Biotechnol. 2012, 23 (1), 9−15. (31) Zhou, J.; He, Z.; Yang, Y.; Deng, Y.; Tringe, S. G.; Alvarez-Cohen, L. High-throughput metagenomic technologies for complex microbial community analysis: open and closed formats. mBio 2015, 6 (1), e02288−14. (32) Kunin, V.; Copeland, A.; Lapidus, A.; Mavromatis, K.; Hugenholtz, P. A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 2008, 72 (4), 557−578. (33) Zhang, T.; Shao, M. F.; Ye, L. 454 Pyrosequencing reveals bacterial diversity of activated sludge from 14 sewage treatment plants. ISME J. 2012, 6 (6), 1137−1147. (34) Riesgo, A.; Pérez-Porro, A. R.; Carmona, S.; Leys, S. P.; Giribet, G. Optimization of preservation and storage time of sponge tissues to obtain quality mRNA for next-generation sequencing. Mol. Ecol. Resour. 2012, 12 (2), 312−322. (35) Kucuktas, H.; Liu, Z. J. Library construction for next generation sequencing. In Next Generation Sequencing and Whole Genome Selection in Aquaculture; Liu, Z. J., Ed; Blackwell Publishing Ltd.: Oxford, U.K., 2010; pp 57−67. (36) Head, S. R.; Komori, H. K.; LaMere, S. A.; Whisenant, T.; Van Nieuwerburgh, F.; Salomon, D. R.; Ordoukhanian, P.
Library construction for next-generation sequencing: overviews and challenges. BioTechniques 2013, 56 (2), 61−64. (37) Peng, Y.; Leung, H. C.; Yiu, S. M.; Chin, F. Y. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 2012, 28 (11), 1420−1428. (38) Masella, A. P.; Bartram, A. K.; Truszkowski, J. M.; Brown, D. G.; Neufeld, J. D. PANDAseq: paired-end assembler for illumina sequences. BMC Bioinf. 2012, 13 (1), 31. (39) Imelfort, M.; Parks, D.; Woodcroft, B. J.; Dennis, P.; Hugenholtz, P.; Tyson, G. W. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2014, 2, e603. (40) Prjibelski, A. D.; Vasilinetc, I.; Bankevich, A.; Gurevich, A.; Krivosheeva, T.; Nurk, S.; Pham, S.; Korobeynikov, A.; Lapidus, A.; Pevzner, P. A. ExSPAnder: a universal repeat resolver for DNA fragment assembly. Bioinformatics 2014, 30 (12), 293−301. (41) Koren, S.; Schatz, M. C.; Walenz, B. P.; Martin, J.; Howard, J. T.; Ganapathy, G.; Wang, Z.; Rasko, D. A.; McCombie, W. R.; Jarvis, E. D. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 2012, 30 (7), 693−700. (42) Roesch, L. F.; Fulthorpe, R. R.; Riva, A.; Casella, G.; Hadwin, A. K.; Kent, A. D.; Daroub, S. H.; Camargo, F. A.; Farmerie, W. G.; Triplett, E. W. Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J. 2007, 1 (4), 283−290. (43) Wang, Y.; Sheng, H. F.; He, Y.; Wu, J. Y.; Jiang, Y. X.; Tam, N. F. Y.; Zhou, H. W. Comparison of the levels of bacterial diversity in freshwater, intertidal wetland, and marine sediments by using millions of illumina tags. Appl. Environ. Microbiol. 2012, 78 (23), 8264−8271. (44) Ju, F.; Zhang, T. Novel microbial populations in ambient and mesophilic biogas-producing and phenol-degrading consortia unraveled by high-throughput sequencing. Microb. Ecol. 2014, 68 (2), 235− 246. (45) Ye, L.; Zhang, T. 
Bacterial communities in different sections of a municipal wastewater treatment plant revealed by 16S rDNA 454 pyrosequencing. Appl. Microbiol. Biotechnol. 2013, 97 (6), 2681−2690.
(46) Ju, F.; Zhang, T. 16S rRNA gene high-throughput sequencing data mining of microbial diversity and interactions. Appl. Microbiol. Biotechnol. 2015, 99 (10), 4119−4129. (47) Ye, L.; Zhang, T.; Wang, T.; Fang, Z. Microbial structures, functions, and metabolic pathways in wastewater treatment bioreactors revealed using high-throughput sequencing. Environ. Sci. Technol. 2012, 46 (24), 13244−13252. (48) Ufarté, L.; Potocki-Véronèse, G.; Laville, E. Discovery of new protein families and functions: new challenges in functional metagenomics for biotechnologies and microbial ecology. Front. Microbiol. 2015, 6, 563−572. (49) Meyer, F.; Paarmann, D.; D’Souza, M.; Olson, R.; Glass, E. M.; Kubal, M.; Paczian, T.; Rodriguez, A.; Stevens, R.; Wilke, A. The metagenomics RAST server-a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinf. 2008, 9 (1), 386−393. (50) Torres-García, W.; Zheng, S.; Sivachenko, A.; Vegesna, R.; Wang, Q.; Yao, R.; Berger, M. F.; Weinstein, J. N.; Getz, G.; Verhaak, R. G. PRADA: pipeline for RNA sequencing data analysis. Bioinformatics 2014, 30 (15), 2224−2226. (51) Schmieder, R.; Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27 (6), 863−864. (52) Patel, R. K.; Jain, M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012, 7 (2), e30619. (53) Schmieder, R.; Edwards, R. Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS One 2011, 6 (3), e17288. (54) Wang, L.; Wang, S.; Li, W. RSeQC: quality control of RNA-seq experiments. Bioinformatics 2012, 28 (16), 2184−2185. (55) DeLuca, D. S.; Levin, J. Z.; Sivachenko, A.; Fennell, T.; Nazaire, M. D.; Williams, C.; Reich, M.; Winckler, W.; Getz, G. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics 2012, 28 (11), 1530−1532. (56) Niu, B.; Fu, L.; Sun, S.; Li, W. 
Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinf. 2010, 11 (1), 187−197. (57) Schloss, P. D.; Westcott, S. L.; Ryabin, T.; Hall, J. R.; Hartmann, M.; Hollister, E. B.; Lesniewski, R. A.; Oakley, B. B.; Parks, D. H.; Robinson, C. J. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 2009, 75 (23), 7537−7541. (58) Caporaso, J. G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F. D.; Costello, E. K.; Fierer, N.; Peña, A. G.; Goodrich, J. K.; Gordon, J. I. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 2010, 7 (5), 335−336. (59) Magoč, T.; Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 2011, 27 (21), 2957−2963. (60) Langille, M. G.; Hsiao, W. W.; Brinkman, F. S. Detecting genomic islands using bioinformatics approaches. Nat. Rev. Microbiol. 2010, 8 (5), 373−382. (61) Zerbino, D. R.; Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5), 821−829. (62) Simpson, J. T.; Wong, K.; Jackman, S. D.; Schein, J. E.; Jones, S. J.; Birol, I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009, 19 (6), 1117−1123. (63) Luo, R.; Liu, B.; Xie, Y.; Li, Z.; Huang, W.; Yuan, J.; He, G.; Chen, Y.; Pan, Q.; Liu, Y. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 2012, 1 (1), 18. (64) Peng, Y.; Leung, H. C.; Yiu, S. M.; Chin, F. Y. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics 2011, 27 (13), 94−101. (65) Namiki, T.; Hachiya, T.; Tanaka, H.; Sakakibara, Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012, 40 (20), e155.
DOI: 10.1021/acs.est.5b03719 Environ. Sci. Technol. 2015, 49, 12628−12640
(66) Boisvert, S.; Raymond, F.; Godzaridis, É.; Laviolette, F.; Corbeil, J. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012, 13 (12), 122−134. (67) Haider, B.; Ahn, T. H.; Bushnell, B.; Chai, J.; Copeland, A.; Pan, C. Omega: an Overlap-graph de novo Assembler for Metagenomics. Bioinformatics 2014, 30 (19), 2717−2722. (68) Li, D.; Liu, C. M.; Luo, R.; Sadakane, K.; Lam, T. W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 2015, 31 (10), 1674−1676. (69) Martin, J. A.; Wang, Z. Next-generation transcriptome assembly. Nat. Rev. Genet. 2011, 12 (10), 671−682. (70) Grabherr, M. G.; Haas, B. J.; Yassour, M.; Levin, J. Z.; Thompson, D. A.; Amit, I.; Adiconis, X.; Fan, L.; Raychowdhury, R.; Zeng, Q. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 2011, 29 (7), 644−652. (71) Schulz, M. H.; Zerbino, D. R.; Vingron, M.; Birney, E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012, 28 (8), 1086−1092. (72) Birol, I.; Jackman, S. D.; Nielsen, C. B.; Qian, J. Q.; Varhol, R.; Stazyk, G.; Morin, R. D.; Zhao, Y.; Hirst, M.; Schein, J. E. De novo transcriptome assembly with ABySS. Bioinformatics 2009, 25 (21), 2872−2877. (73) Xie, Y.; Wu, G.; Tang, J.; Luo, R.; Patterson, J.; Liu, S.; Huang, W.; He, G.; Gu, S.; Li, S. SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads. Bioinformatics 2014, 30 (12), 1660−1666. (74) Pertea, M.; Pertea, G. M.; Antonescu, C. M.; Chang, T. C.; Mendell, J. T.; Salzberg, S. L. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 2015, 33 (3), 290−295. (75) Guttman, M.; Garber, M.; Levin, J. Z.; Donaghey, J.; Robinson, J.; Adiconis, X.; Fan, L.; Koziol, M. J.; Gnirke, A.; Nusbaum, C. 
Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 2010, 28 (5), 503−510. (76) Yang, Y.; Smith, S. A. Optimizing de novo assembly of short-read RNA-seq data for phylogenomics. BMC Genomics 2013, 14 (1), 328−338. (77) Li, B.; Fillmore, N.; Bai, Y.; Collins, M.; Thomson, J. A.; Stewart, R.; Dewey, C. Evaluation of de novo transcriptome assemblies from RNA-Seq data. Genome Biol. 2014, 15 (12), 553. (78) Surget-Groba, Y.; Montoya-Burgos, J. I. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res. 2010, 20 (10), 1432−1440. (79) Zhao, Q. Y.; Wang, Y.; Kong, Y. M.; Luo, D.; Li, X.; Hao, P. Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinf. 2011, 12 (14), 1−12. (80) Leung, H. C.; Yiu, S. M.; Parkinson, J.; Chin, F. Y. IDBA-MT: de novo assembler for metatranscriptomic data generated from next-generation sequencing technology. J. Comput. Biol. 2013, 20 (7), 540−550. (81) Leung, H. C.; Yiu, S. M.; Chin, F. Y. IDBA-MTP: A Hybrid Metatranscriptomic Assembler Based on Protein Information, Research in Computational Molecular Biology 2014; Springer, 2014; pp 160−172. (82) Sharon, I.; Morowitz, M. J.; Thomas, B. C.; Costello, E. K.; Relman, D. A.; Banfield, J. F. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 2013, 23 (1), 111−120. (83) Segata, N.; Waldron, L.; Ballarini, A.; Narasimhan, V.; Jousson, O.; Huttenhower, C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 2012, 9 (8), 811−814. (84) Liu, B.; Gibbons, T.; Ghodsi, M.; Pop, M. MetaPhyler: Taxonomic profiling for metagenomic sequences. In 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE: 2010; pp 95−100.
(85) Gerlach, W.; Stoye, J. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 2011, 39 (14), e91. (86) Teeling, H.; Waldmann, J.; Lombardot, T.; Bauer, M.; Glöckner, F. O. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinf. 2004, 5 (1), 163−169. (87) McHardy, A. C.; Martín, H. G.; Tsirigos, A.; Hugenholtz, P.; Rigoutsos, I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 2007, 4 (1), 63−72. (88) Chatterji, S.; Yamazaki, I.; Bai, Z.; Eisen, J. A. CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads, Research in Computational Molecular Biology, 2008; Springer: 2008; pp 17−28. (89) Diaz, N. N.; Krause, L.; Goesmann, A.; Niehaus, K.; Nattkemper, T. W. TACOA-Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinf. 2009, 10 (1), 56−71. (90) Brady, A.; Salzberg, S. L. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat. Methods 2009, 6 (9), 673−676. (91) Wang, Y.; Leung, H. C.; Yiu, S. M.; Chin, F. Y. MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species. J. Comput. Biol. 2012, 19 (2), 241−249. (92) Dick, G. J.; Andersson, A. F.; Baker, B. J.; Simmons, S. L.; Thomas, B. C.; Yelton, A. P.; Banfield, J. F. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 2009, 10 (8), R85−R100. (93) Sheik, C. S.; Jain, S.; Dick, G. J. Metabolic flexibility of enigmatic SAR324 revealed through metagenomics and metatranscriptomics. Environ. Microbiol. 2014, 16 (1), 304−317. (94) Karlsson, F. H.; Tremaroli, V.; Nookaew, I.; Bergström, G.; Behre, C. J.; Fagerberg, B.; Nielsen, J.; Bäckhed, F. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 2013, 498 (7452), 99−103. 
(95) Alneberg, J.; Bjarnason, B. S.; de Bruijn, I.; Schirmer, M.; Quick, J.; Ijaz, U. Z.; Lahti, L.; Loman, N. J.; Andersson, A. F.; Quince, C. Binning metagenomic contigs by coverage and composition. Nat. Methods 2014, 11 (11), 1144−1146. (96) Wu, Y. W.; Tang, Y. H.; Tringe, S. G.; Simmons, B. A.; Singer, S. W. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2014, 2 (1), 1−18. (97) Kang, D. D.; Froula, J.; Egan, R.; Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 2015, 3, e1165. (98) Dupont, C. L.; Rusch, D. B.; Yooseph, S.; Lombardo, M. J.; Richter, R. A.; Valas, R.; Novotny, M.; Yee-Greenbaum, J.; Selengut, J. D.; Haft, D. H. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. ISME J. 2012, 6 (6), 1186−1199. (99) Raes, J.; Korbel, J. O.; Lercher, M. J.; von Mering, C.; Bork, P. Prediction of effective genome size in metagenomic samples. Genome Biol. 2007, 8 (1), 10−20. (100) Parks, D. H.; Imelfort, M.; Skennerton, C. T.; Hugenholtz, P.; Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015, 25, 1043. (101) Kanehisa, M.; Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28 (1), 27−30. (102) Overbeek, R.; Begley, T.; Butler, R. M.; Choudhuri, J. V.; Chuang, H. Y.; Cohoon, M.; de Crécy-Lagard, V.; Diaz, N.; Disz, T.; Edwards, R. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005, 33 (17), 5691−5702. (103) Yang, Y.; Li, B.; Ju, F.; Zhang, T. Exploring variation of antibiotic resistance genes in activated sludge over a four-year period through a metagenomic approach. Environ. Sci. Technol. 2013, 47 (18), 10197−10205.
(104) Li, B.; Yang, Y.; Ma, L.; Ju, F.; Guo, F.; Tiedje, J. M.; Zhang, T. Metagenomic and network analysis reveal wide distribution and co-occurrence of environmental antibiotic resistance genes. ISME J. 2015, DOI: 10.1038/ismej.2015.59. (105) Gomez-Alvarez, V.; Revetta, R. P.; Santo Domingo, J. W. Metagenomic analyses of drinking water receiving different disinfection treatments. Appl. Environ. Microbiol. 2012, 78 (17), 6095−6102. (106) Li, B.; Ju, F.; Cai, L.; Zhang, T. Profile and fate of bacterial pathogens in sewage treatment plants revealed by high-throughput metagenomic approach. Environ. Sci. Technol. 2015, 49 (17), 10492−10502. (107) Bibby, K.; Peccia, J. Identification of viral pathogen diversity in sewage sludge by metagenome analysis. Environ. Sci. Technol. 2013, 47 (4), 1945−1951. (108) Tamaki, H.; Zhang, R.; Angly, F. E.; Nakamura, S.; Hong, P. Y.; Yasunaga, T.; Kamagata, Y.; Liu, W. T. Metagenomic analysis of DNA viruses in a wastewater treatment plant in tropical climate. Environ. Microbiol. 2012, 14 (2), 441−452. (109) Francis, O. E.; Bendall, M.; Manimaran, S.; Hong, C.; Clement, N. L.; Castro-Nallar, E.; Snell, Q.; Schaalje, G. B.; Clement, M. J.; Crandall, K. A. Pathoscope: Species identification and strain attribution with unassembled sequencing data. Genome Res. 2013, 23 (10), 1721−1729. (110) Ahn, T. H.; Chai, J.; Pan, C. Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance. Bioinformatics 2015, 31 (2), 170−177. (111) Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 2012, 486 (7402), 207−214. (112) Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010, 26 (19), 2460−2461. (113) Kent, W. J. BLAT: the BLAST-like alignment tool. Genome Res. 2002, 12 (4), 656−664. (114) Zhao, Y.; Tang, H.; Ye, Y. 
RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 2012, 28 (1), 125−126. (115) Buchfink, B.; Xie, C.; Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 2015, 12 (1), 59−60. (116) Yang, Y.; Jiang, X. T.; Zhang, T. Evaluation of a hybrid approach using ublast and blastx for metagenomic sequences annotation of specific functional genes. PLoS One 2014, 9 (10), e110947. (117) Yu, K.; Zhang, T. Construction of customized sub-databases from ncbi-nr database for rapid annotation of huge metagenomic datasets using a combined blast and megan approach. PLoS One 2013, 8 (4), e59831. (118) Qin, J.; Li, Y.; Cai, Z.; Li, S.; Zhu, J.; Zhang, F.; Liang, S.; Zhang, W.; Guan, Y.; Shen, D. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 2012, 490 (7418), 55−60. (119) Markowitz, V. M.; Mavromatis, K.; Ivanova, N. N.; Chen, I. M. A.; Chu, K.; Kyrpides, N. C. IMG ER: a system for microbial genome annotation expert review and curation. Bioinformatics 2009, 25 (17), 2271−2278. (120) Aziz, R. K.; Bartels, D.; Best, A. A.; DeJongh, M.; Disz, T.; Edwards, R. A.; Formsma, K.; Gerdes, S.; Glass, E. M.; Kubal, M. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008, 9 (1), 75−89. (121) Wu, S.; Zhu, Z.; Fu, L.; Niu, B.; Li, W. WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics 2011, 12 (1), 444−452. (122) Hyatt, D.; Chen, G. L.; LoCascio, P.; Land, M.; Larimer, F.; Hauser, L. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 2010, 11 (1), 119−129. (123) Noguchi, H.; Taniguchi, T.; Itoh, T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 2008, 15 (6), 387−396.
(124) Zhu, W.; Lomsadze, A.; Borodovsky, M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 2010, 38 (12), 132−146. (125) Rho, M.; Tang, H.; Ye, Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010, 38 (20), e191. (126) Hoff, K. J.; Lingner, T.; Meinicke, P.; Tech, M. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 2009, 37 (2), 101−105. (127) Mortazavi, A.; Williams, B. A.; McCue, K.; Schaeffer, L.; Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 2008, 5 (7), 621−628. (128) Trapnell, C.; Williams, B. A.; Pertea, G.; Mortazavi, A.; Kwan, G.; van Baren, M. J.; Salzberg, S. L.; Wold, B. J.; Pachter, L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010, 28 (5), 511−515. (129) Langmead, B.; Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 2012, 9 (4), 357−359. (130) Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3), R25−R34. (131) Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows−Wheeler transform. Bioinformatics 2009, 25 (14), 1754−1760. (132) Li, H.; Durbin, R. Fast and accurate long-read alignment with Burrows−Wheeler transform. Bioinformatics 2010, 26 (5), 589−595. (133) Kim, D.; Langmead, B.; Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 2015, 12 (4), 357−360. (134) Quevillon, E.; Silventoinen, V.; Pillai, S.; Harte, N.; Mulder, N.; Apweiler, R.; Lopez, R. InterProScan: protein domains identifier. Nucleic Acids Res. 2005, 33 (suppl 2), W116−W120. (135) Finn, R. D.; Clements, J.; Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011, 39, 29−37. (136) Marchler-Bauer, A.; Bryant, S. H. 
CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004, 32 (2), 327−331. (137) UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2008, 36 (1), 190−195. (138) Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M. C.; Estreicher, A.; Gasteiger, E.; Martin, M. J.; Michoud, K.; O’Donovan, C.; Phan, I. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31 (1), 365−370. (139) Punta, M.; Coggill, P. C.; Eberhardt, R. Y.; Mistry, J.; Tate, J.; Boursnell, C.; Pang, N.; Forslund, K.; Ceric, G.; Clements, J. The Pfam protein families database. Nucleic Acids Res. 2012, 40, 290−301. (140) Selengut, J. D.; Haft, D. H.; Davidsen, T.; Ganapathy, A.; Gwinn-Giglio, M.; Nelson, W. C.; Richter, A. R.; White, O. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res. 2007, 35 (1), 260−264. (141) Powell, S.; Szklarczyk, D.; Trachana, K.; Roth, A.; Kuhn, M.; Muller, J.; Arnold, R.; Rattei, T.; Letunic, I.; Doerks, T. eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 2012, 40 (1), 284−289. (142) Tatusov, R. L.; Fedorova, N. D.; Jackson, J. D.; Jacobs, A. R.; Kiryutin, B.; Koonin, E. V.; Krylov, D. M.; Mazumder, R.; Mekhedov, S. L.; Nikolskaya, A. N. The COG database: an updated version includes eukaryotes. BMC Bioinf. 2003, 4 (1), 41−54. (143) Marchler-Bauer, A.; Lu, S.; Anderson, J. B.; Chitsaz, F.; Derbyshire, M. K.; DeWeese-Scott, C.; Fong, J. H.; Geer, L. Y.; Geer, R. C.; Gonzales, N. R. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res. 2011, 39 (1), 225−229. (144) Letunic, I.; Doerks, T.; Bork, P. SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res. 2012, 40 (1), 302−305.
(145) Prestat, E.; David, M. M.; Hultman, J.; Taş, N.; Lamendella, R.; Dvornik, J.; Mackelprang, R.; Myrold, D. D.; Jumpponen, A.; Tringe, S. G. FOAM (Functional Ontology Assignments for Metagenomes): a Hidden Markov Model (HMM) database with environmental focus. Nucleic Acids Res. 2014, 42 (19), e145. (146) Ju, F.; Xia, Y.; Guo, F.; Wang, Z.; Zhang, T. Taxonomic relatedness shapes bacterial assembly in activated sludge of globally distributed wastewater treatment plants. Environ. Microbiol. 2014, 16 (8), 2421−2432. (147) Ju, F.; Zhang, T. Bacterial assembly and temporal dynamics in activated sludge of a full-scale municipal wastewater treatment plant. ISME J. 2015, 9 (3), 683−695. (148) Peng, X.; Guo, F.; Ju, F.; Zhang, T. Shifts in the Microbial Community, Nitrifiers and Denitrifiers in the Biofilm in a Full-scale Rotating Biological Contactor. Environ. Sci. Technol. 2014, 48 (14), 8044−8052. (149) Fu, X.; Sun, Y.; Wang, J.; Xing, Q.; Zou, J.; Li, R.; Wang, Z.; Wang, S.; Hu, X.; Zhang, L. Sequencing-based gene network analysis provides a core set of gene resource for understanding thermal adaptation in Zhikong scallop Chlamys farreri. Mol. Ecol. Resour. 2014, 14 (1), 184−198. (150) Vinayagam, A.; Zirin, J.; Roesel, C.; Hu, Y.; Yilmazel, B.; Samsonova, A. A.; Neumüller, R. A.; Mohr, S. E.; Perrimon, N. Integrating protein-protein interaction networks with phenotypes reveals signs of interactions. Nat. Methods 2013, 11 (1), 94−99. (151) Chao, Y.; Ma, L.; Yang, Y.; Ju, F.; Zhang, X. X.; Wu, W. M.; Zhang, T. Metagenomic analysis reveals significant changes of microbial compositions and protective functions during drinking water treatment. Sci. Rep. 2013, DOI: 10.1038/srep03550. (152) Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N. S.; Wang, J. T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13 (11), 2498−2504. 
(153) Bastian, M.; Heymann, S.; Jacomy, M. Gephi: An Open Source Software for Exploring and Manipulating Networks, International AAAI conference on weblogs and social media, 2009; AAAI Press: Menlo Park, CA, 2009. (154) Langfelder, P.; Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinf. 2008, 9 (1), 559−571. (155) Reddy, T.; Thomas, A. D.; Stamatis, D.; Bertsch, J.; Isbandi, M.; Jansson, J.; Mallajosyula, J.; Pagani, I.; Lobos, E. A.; Kyrpides, N. C. The Genomes OnLine Database (GOLD) v. 5: a metadata management system based on a four level (meta) genome project classification. Nucleic Acids Res. 2015, 43, 1099−1106. (156) Wang, Y.; Xia, Y.; Ju, F.; Zhang, T. Metagenome approaches revealed a biological prospect for improvement on mesophilic cellulose degradation. Appl. Microbiol. Biotechnol. 2015, DOI: 10.1007/s00253-015-6945-y. (157) Yang, Y.; Li, B.; Zou, S.; Fang, H. H.; Zhang, T. Fate of antibiotic resistance genes in sewage treatment plant revealed by metagenomic approach. Water Res. 2014, 62, 97−106. (158) Chao, Y.; Mao, Y.; Wang, Z.; Zhang, T. Diversity and functions of bacterial community in drinking water biofilms revealed by high-throughput sequencing. Sci. Rep. 2015, 5, 10044. (159) Guo, F.; Wang, Z. P.; Yu, K.; Zhang, T. Detailed investigation of the microbial community in foaming activated sludge reveals novel foam formers. Sci. Rep. 2015, 5, 7637. (160) Tsementzi, D.; Poretsky, R.; Rodriguez-R, L. M.; Luo, C.; Konstantinidis, K. T. Evaluation of metatranscriptomic protocols and application to the study of freshwater microbial communities. Environ. Microbiol. Rep. 2014, 6 (6), 640−655. (161) Nagalakshmi, U.; Wang, Z.; Waern, K.; Shou, C.; Raha, D.; Gerstein, M.; Snyder, M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 2008, 320 (5881), 1344−1349. (162) Delmont, T. O.; Eren, A. M.; Maccario, L.; Prestat, E.; Esen, Ö. C.; Pelletier, E.; Le Paslier, D.; Simonet, P.; Vogel, T. M.
Reconstructing rare soil microbial genomes using in situ enrichments and metagenomics. Front. Microbiol. 2015, 6, 358−373. (163) Benndorf, D.; Balcke, G. U.; Harms, H.; Von Bergen, M. Functional metaproteome analysis of protein extracts from contaminated soil and groundwater. ISME J. 2007, 1 (3), 224−234. (164) Rotaru, A.-E.; Shrestha, P. M.; Liu, F.; Shrestha, M.; Shrestha, D.; Embree, M.; Zengler, K.; Wardman, C.; Nevin, K. P.; Lovley, D. R. A new model for electron flow during anaerobic digestion: direct interspecies electron transfer to Methanosaeta for the reduction of carbon dioxide to methane. Energy Environ. Sci. 2014, 7 (1), 408−415. (165) Hess, M.; Sczyrba, A.; Egan, R.; Kim, T.-W.; Chokhawala, H.; Schroth, G.; Luo, S.; Clark, D. S.; Chen, F.; Zhang, T. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 2011, 331 (6016), 463−467. (166) Li, L. G.; Cai, L.; Zhang, X. X.; Zhang, T. Potentially novel copper resistance genes in copper-enriched activated sludge revealed by metagenomic analysis. Appl. Microbiol. Biotechnol. 2015, 99 (18), 7771−7779. (167) Wang, Z. P.; Guo, F.; Liu, L. L.; Zhang, T. Evidence of carbon fixation pathway in a bacterium from Candidate Phylum SBR1093 revealed with genomic analysis. PLoS One 2014, 9 (10), e109571.