Experimental Design and Bioinformatics Analysis for the Application of Metagenomics in Environmental Sciences and Biotechnology

Feng Ju and Tong Zhang*

Environmental Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR, China

ABSTRACT: Recent advances in DNA sequencing technologies have prompted the widespread application of metagenomics for the investigation of novel bioresources (e.g., industrial enzymes and bioactive molecules) and unknown biohazards (e.g., pathogens and antibiotic resistance genes) in natural and engineered microbial systems across multiple disciplines. This review discusses rigorous experimental design and sample preparation in the context of applying metagenomics in environmental sciences and biotechnology. Moreover, this review summarizes the principles, methodologies, and state-of-the-art bioinformatics procedures, tools, and database resources for metagenomics applications and discusses two popular strategies (analysis of unassembled reads versus assembled contigs/draft genomes) for quantitative or qualitative insights into microbial community structure and functions. Overall, this review aims to facilitate more extensive application of metagenomics in the investigation of uncultured microorganisms, novel enzymes, microbe-environment interactions, and biohazards in biotechnological applications where microbial communities are engineered for bioenergy production, wastewater treatment, and bioremediation.

INTRODUCTION
Over the past two decades, molecular techniques, such as PCR cloning and sequencing, fluorescence in situ hybridization (FISH), denaturing gradient gel electrophoresis (DGGE), and terminal restriction fragment length polymorphism (T-RFLP), have been applied to fingerprint microbial communities engineered for various environmental biotechnological applications, including bioenergy production, bioremediation, and biological wastewater treatment.1−3 These studies have greatly enriched our knowledge of the morphology, taxonomy, and diversity of the contributing microorganisms. For example, it is recognized that microorganisms rarely work alone; instead, they are usually arranged in biogranules, biofilms, or flocs to facilitate the establishment of tightly coupled interactions necessary for the removal of inhibitory intermediates, biodegradation, or bioenergy production.4 However, while these useful molecular techniques have partially elucidated the taxonomic identity of major players in relatively simple microbial systems, they are limited in addressing the function of microorganisms and the mechanisms underlying their interactions. Although the molecular techniques could be combined with cultivation to achieve this, few microorganisms ( 99.9% accuracy41) and concatenate the low-quality but exceptionally long reads (e.g., 20 kb) from so-called third-generation sequencing platforms,5 such as PacBio RS. The expected sequencing depth is closely related to the measured biodiversity and complexity of microbial samples. 
Generally, soil42 and sediments43 harbor more diverse microbial species than bioengineered systems.20,33,44−46 For biological wastewater treatment systems, higher biodiversity is generally detected in full-scale than in lab-scale bioreactors, in biofilm than in suspended sludge, and in activated sludge than in anaerobic sludge.44−46 Recent attempts to assemble large complex soil metagenomes suggest that 80% of the sequencing data could not be assembled (because of low coverage) and that even 300 Gbp of read data are insufficient to cover a localized soil sample deeply.26 In contrast, more than 45% of metagenomic reads from enriched microbial systems could be effectively used in assembly.20,21,47 Moreover, 29−57 Gbp of sequences are enough for reconstructing 31 bacterial genomes, including rare (95% similarity; MG-RAST removes all but a single representative of clusters of DNA sequences whose first 50 base pairs are identical;49 and PRINSEQ offers a complete list of options (five alternative modes) for users to remove exact (100% similarity) and reverse complement duplicates.51 Third, a PE sequencing strategy on NGS platforms makes it possible to overlap/merge PE sequences. Two major advantages of using overlapped sequences (i.e., “tags”) with extended length are higher resolution and accuracy of taxonomic and functional annotation and higher-quality tags, because overlapping PE sequences allows error correction.38,57 Popular free platforms or tools for overlapping PE sequences include QIIME,58 mothur,57 FLASH,59 and PANDAseq.38

Assembly. 
The assembly (i.e., the computational process to connect short DNA/cDNA fragments) of a metagenome or metatranscriptome can yield long contigs for predicting full-length protein-coding genes (PCGs, or open reading frames) or transcripts, recovering genomic sequences (via binning as discussed later), and identifying strain-specific genomic islands,60 thus allowing more accurate qualitative analysis of genetic contents (e.g., at strain or species levels), especially for

DOI: 10.1021/acs.est.5b03719 Environ. Sci. Technol. 2015, 49, 12628−12640

Critical Review

Environmental Science & Technology

based on similarity searches and taxonomic assignments. However, they are not reliable for assigning short reads and often require longer assembled contigs (e.g., ≥ 1 kb for the expert-trained PhyloPythiaS package) and manual efforts to ensure high assignment accuracies. Homology-based binning methods that rely on similarity searches against currently available genomes may provide useful grouping information on short metagenomic reads for improving downstream genome binning of specific organisms, although these methods can be limited by the heavy reliance on the quality and representativeness of reference databases, poor taxonomic resolution of short reads, and the accuracy and/or sensitivity of alignment tools. Other software programs (e.g., PhymmBL90 and MetaCluster91) consider both the composition and homology of metagenomic sequences for taxonomic classification or clustering of reads from the same or similar genomes. The continuously decreasing sequencing cost has allowed researchers to access environmental metagenomes at increasing sequencing depths (e.g., > 50 Gbp), thereby offering sufficient resolution to retrieve partial or near-complete genomes of rare ( 50 Gbp). However, considering that unassembled metagenomic reads from NGS platforms are short (e.g., 100−150 bp for Illumina HiSeq), it is essential to evaluate annotation accuracy and sensitivity before customized pipelines (e.g., multistep similarity search and subdatabases) are applied to speed up annotation of NGS data. One general practice in suppressing false-positive annotations of short reads is to apply rigorous filtering criteria or cutoffs (i.e., identity, alignment length, and e-value) during database search.49,103,116,118

Assembled Contigs. The contigs assembled from metagenomic reads can be used to recover genomes of uncultured organisms using the previously discussed binning approaches. 
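As an illustration of the marker-gene quality check that is commonly applied to genomes recovered by binning, the sketch below estimates completeness and contamination from the list of marker genes detected in one genome bin. It is a simplified count-based estimate under stated assumptions, not the procedure implemented in CheckM; the function name `estimate_bin_quality` and the four-gene marker set are hypothetical placeholders (real marker sets are much larger).

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def estimate_bin_quality(detected_markers: Iterable[str],
                         marker_set: Set[str]) -> Tuple[float, float]:
    """Return (completeness %, contamination %) for one genome bin.

    Assumption: every gene in `marker_set` is expected exactly once per
    genome, so a missing marker lowers completeness and each extra copy
    of a marker counts toward contamination.
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")
    # Count only hits that belong to the expected single-copy marker set.
    counts = Counter(m for m in detected_markers if m in marker_set)
    found = len(counts)                               # distinct markers present
    extra = sum(c - 1 for c in counts.values())       # surplus copies
    completeness = 100.0 * found / len(marker_set)
    contamination = 100.0 * extra / len(marker_set)
    return completeness, contamination


if __name__ == "__main__":
    # Hypothetical four-gene marker set and hits from a hypothetical bin.
    markers = {"rpoB", "recA", "gyrB", "rpsC"}
    hits = ["rpoB", "recA", "recA", "gyrB", "unrelated_gene"]
    comp, cont = estimate_bin_quality(hits, markers)
    print(f"completeness={comp:.1f}% contamination={cont:.1f}%")
```

A duplicated marker can also reflect a genuine gene duplication or closely related strains merged into one bin, so such an estimate is best read as a screening value rather than a verdict.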
The reconstructed genomes can be submitted to web-based servers, such as IMG/ER,119 RAST Server, WebMGA,121 and ggKbase (http://ggkbase.berkeley.edu/). These platforms provide a series of standardized and/or customized pipelines and annotation tools. Moreover, gene-calling tools commonly used for PCG prediction from contigs include Prodigal,122 MetaGeneAnnotator,123 MetaGeneMark,124 FragGeneScan,125 and Orphelia.126 Prodigal is a popular tool with improved gene structure predictions and lower false-positive rates. The relative quantification of PCGs in a metagenome or metatranscriptome is typically based on mapping short genomic DNA or cDNA reads against each PCG. Widely used quantitative metrics include RPKM (Reads Per Kilobase of transcript per Million mapped reads),127 FPKM (Fragments Per Kilobase of transcript per Million fragments mapped),128 and relative abundance (number of reads mapped to a PCG divided by total number of reads). Because different PCGs are unevenly abundant in a metagenome, metatranscriptomic RPKM (M-RPKM) has been proposed to better reflect the absolute transcriptional activity of each PCG; M-RPKM is calculated as the transcriptional activity of each PCG (i.e., RPKM-RNA) divided by its RPKM in the coupled metagenome (i.e., RPKM-DNA).13 The two most popular free alignment tools for mapping DNA reads against PCGs or contigs are Bowtie/Bowtie2 129,130 and BWA/BWA-SW.131,132 A fast and sensitive spliced alignment program named “HISAT”133 was recently developed based on Bowtie2 to align transcriptome data; HISAT enables efficient mapping of transcriptome reads, particularly reads that span multiple exons. To detect proteins with related functions but without considerable homology to known PCGs, domain-based database search tools, such as InterProScan,134 HMMER,135 and

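The RPKM and M-RPKM definitions given above can be written out as a short calculation. The sketch below assumes the standard RPKM formula (reads mapped to a gene, scaled by gene length in kilobases and by millions of mapped reads) and takes M-RPKM as RPKM-RNA divided by RPKM-DNA, as described in the text; the function names and the example read counts are hypothetical.

```python
from typing import Optional


def rpkm(mapped_reads: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads for one PCG."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length_bp / 1e3) / (total_reads / 1e6) == reads * 1e9 / (length * total)
    return mapped_reads * 1e9 / (gene_length_bp * total_mapped_reads)


def m_rpkm(rpkm_rna: float, rpkm_dna: float) -> Optional[float]:
    """Metatranscriptomic RPKM: expression normalized by gene abundance.

    Returns None when the gene has no coverage in the metagenome, because
    the ratio is undefined in that case.
    """
    if rpkm_dna <= 0:
        return None
    return rpkm_rna / rpkm_dna


if __name__ == "__main__":
    # Hypothetical counts for one 1,000 bp PCG in paired data sets.
    rna = rpkm(mapped_reads=500, gene_length_bp=1000, total_mapped_reads=10_000_000)
    dna = rpkm(mapped_reads=100, gene_length_bp=1000, total_mapped_reads=5_000_000)
    print(rna, dna, m_rpkm(rna, dna))
```

Returning `None` for genes absent from the metagenome is one possible design choice; adding a pseudocount or excluding such genes upstream would be equally defensible.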
abundance (coverage) profiles in one metagenome. The draft genomes obtained from this method can be further purified and refined using composition-based methods and other strategies, including paired-end tracking, reassembly, and manual curation. Recently, several automated tools, including GroopM,39 CONCOCT,95 MaxBin,96 and MetaBAT,97 were developed to integrate the above “coverage-composition” strategies into efficient pipelines for genome reconstruction from metagenomes, primarily based on coverage/abundance profiles and tetranucleotide frequency (TNF) patterns of contigs across multiple related metagenomes (e.g., on temporal or spatial scales). The completeness and potential contamination of reconstructed genomes have been estimated by the presence/absence of marker genes, such as essential single-copy marker genes conserved in 95% of bacteria,98 conserved phylogenetic marker genes,21 or clusters of orthologous groups (COGs).99 Currently, CheckM is the only automated tool that can assess the quality of a genome recovered from isolates, single cells, and metagenomes based on conserved marker genes.100

Annotation. Both unassembled reads and assembled contigs from metagenomic data are used to annotate the structure and functions of microbial communities. Generally, unassembled short reads retain all the original abundance information and enable quantitative comparisons of microbial taxa, functional genes, and global metabolic profiles (e.g., KEGG pathways101 and SEED subsystems102) within or across microbial habitats or ecosystems.47,103 However, short reads are large in data size and may lack resolution for taxonomic and functional annotations. In contrast, assembled contigs, being much longer and more compact, allow more robust and rapid analysis of both specific species and their functional genes. 
However, an assembly-based annotation strategy has the potential to introduce biases for quantitative analysis, because of the difficulty in assembling low-abundance species and closely related strains, and the exclusion of considerable amounts of unassembled data from downstream analysis.

Unassembled Reads. The first strategy for metagenome and metatranscriptome annotation is the direct use of unassembled clean reads for the quantitative analysis of microbial community composition and function. Moreover, paired metagenomic and metatranscriptomic data (from the same samples) are used to describe the “active” microorganisms and compare their gene expression patterns in both natural and engineered ecosystems.7,24 Additionally, short metagenomic DNA reads are used to characterize or predict environmental biohazards (i.e., bacterial pathogens, viruses, and antibiotic resistance genes (ARGs)) in drinking water disinfection systems104,105 and waste and wastewater treatment systems.103,106−108 Read-based taxonomic and functional annotation is mainly supervised and based on homology searches against a reference database using various alignment tools. Unassembled clean reads can be annotated either via standardized platforms or packages (e.g., MG-RAST for DNA fragment analysis and PRADA for RNA analysis). Several taxonomy profiling tools that can provide at least species-level resolution, such as Pathoscope,109 Sigma,110 and MetaPhlAn2,83 are available for performing biosurveillance and detecting biohazards in targeted infected tissue samples109 or healthy human microbiomes111 with the assistance of reference genomes. Running local similarity searches with BLASTX or PSI-BLAST is rather computationally expensive and time-consuming,

CD-search,136 have been developed to search against domain or motif models (e.g., hidden Markov models (HMMs)). Many sequence- or domain-based functional gene databases are publicly available for functional annotation of genomes, metagenomes, and metatranscriptomes, including KEGG,101 UniRef90,137 Swiss-Prot,138 PFAM,139 TIGRFAM,140 eggNOG,141 COG,142 CDD (Conserved Domain Database),143 and SMART.144 These databases consist of raw protein sequences (e.g., KEGG and UniRef90), trained models (e.g., PFAM, TIGRFAM, and CDD) representing protein families and/or domains, or a combination of both (SMART). Different alignment tools (similarity search or model-based) are used for different types of databases or research initiatives. For example, HMMER is widely utilized to search against PFAM, TIGRFAM, and SMART databases for detecting novel genes in assembled metagenomes. To maximize functional annotation of genomes and/or metagenomes, recent versions of MG-RAST (for metagenomes only) and IMG/M have merged the interpretation of annotations from most of the above databases within a single framework.25 Notably, FOAM (Functional Ontology Assignments for Metagenomes), the first HMM database with an environmental focus, was recently released, providing an opportunity to link model-based functional gene searches to a large group of target KEGG Orthologs (KOs).145 Last, network modeling and visualization have been demonstrated as a powerful method for exploring species−species co-occurrence,146−148 gene regulation and coexpression,149 protein−protein interactions,150 and metabolic pathways151 from large omics data sets generated from a sufficient number of related samples. Popular network analysis platforms include Cytoscape,152 Gephi,153 and WGCNA (Weighted Gene Co-Expression Network Analysis).154 Among them, Cytoscape is a popular open-source platform for network visualization and exploration.

Data Sharing and Storage. 
The sharing of sample metadata, sequence data, and computational results is a traditional and efficient method of knowledge exchange in the field of genomic research. The major significance of exchanging sequence data and computational results lies in the beneficial outcomes of comparative studies and the avoidance of unnecessary repeated processing of the same data sets or sequencing of similar microbial systems. Several publicly available databases have been maintained to promote the sharing and storage of NGS data, such as NCBI-SRA, MG-RAST, and Genomes OnLine Database v.5 (GOLD).155 The submitted sequence data must be accompanied by a complete standard metadata file that is prepared following the Minimum Information about any (x) Sequence checklists (MIxS) of the Genomic Standards Consortium (http://gensc.org/projects/mixs-gsc-project/) to include essential information, such as the sampling location, time, habitat type, organism, and sequencing method.

Perspectives. Many researchers have successfully used metagenomics to identify novel microorganisms, mine novel genes for bioenergy and biodegradation,13,15,156 monitor microbial biohazards (e.g., ARGs, pathogens, and viruses),103,107,108,157 and elucidate microbial community structure, metabolism, and functions.12,16,17,158,159 However, there are several aspects (mainly pitfalls and limitations) that researchers need to be aware of and that should be eliminated or addressed in future metagenomics-based studies. First, a metagenomic approach has a lower sensitivity for gene detection compared with qPCR despite its far higher

coverage of gene catalogues in the whole microbial community. Thus, these two techniques should be used in a complementary manner to investigate low-abundance genes or transcripts in microbial communities. Second, NGS-based metagenome (DNA) or metatranscriptome (RNA) sequencing of complex microbial communities remains expensive, because the cost increases with sequencing depth. Because of this economic constraint, most metagenomic studies perform quantitative analysis without technical or biological replicates, making it hard to determine whether observed differences (especially small ones) within or between experimental groups are statistically significant and robust (after considering replicate variance). While metagenomic DNA data are known to have good technical reproducibility (R² > 0.9, from DNA extraction to NGS) for taxa composition and/or functional annotation,12,13,22,44 technical replicates of metatranscriptomic RNA data show larger variation for environmental samples (e.g., R² = 0.7, from RNA extraction to NGS),13,160 although the NGS-based transcriptome of a single organism is highly reproducible.161 Future attention should be paid to addressing reproducibility of metatranscriptomic results before discussing differentially expressed features. Notably, selective enrichment prior to NGS, either by purposefully using target substrates or under harsh conditions, can greatly reduce community diversity and complexity,162 thereby lowering the minimum necessary sequencing depth (and thus cost) in metagenomics-based studies. 
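The reproducibility figures quoted above can be checked on one's own technical replicates with a short calculation. The sketch below assumes that R² is the squared Pearson correlation between two replicate profiles (e.g., relative abundances of the same taxa or genes in the same order); the text does not state how R² was computed in the cited studies, and the function name and example values are hypothetical.

```python
from typing import Sequence


def r_squared(profile_a: Sequence[float], profile_b: Sequence[float]) -> float:
    """Squared Pearson correlation between two replicate profiles."""
    n = len(profile_a)
    if n != len(profile_b) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    # Sums of cross-products and squared deviations around the means.
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    s_aa = sum((a - mean_a) ** 2 for a in profile_a)
    s_bb = sum((b - mean_b) ** 2 for b in profile_b)
    if s_aa == 0 or s_bb == 0:
        raise ValueError("a constant profile has no defined correlation")
    return (s_ab * s_ab) / (s_aa * s_bb)


if __name__ == "__main__":
    # Hypothetical relative abundances of four taxa in two technical replicates.
    replicate_1 = [0.40, 0.30, 0.20, 0.10]
    replicate_2 = [0.38, 0.33, 0.19, 0.10]
    print(f"R^2 = {r_squared(replicate_1, replicate_2):.3f}")
```

Because a few dominant features can inflate a correlation, it may be worth repeating the calculation on log-transformed or rank-transformed values before drawing conclusions.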
Third, compared with a general overview and plain description of overall microbial community structure, functional categories (e.g., subsystems), and metabolic pathways (e.g., KEGG) obtained with automatic pipelines (e.g., MG-RAST), metagenomic de novo assembly and “coverage-composition” binning are powerful and increasingly popular approaches to recover population genomes from time-series (or taxonomically similar) metagenomes,22,32,34 facilitating the discovery of diverse novel genetic resources, especially from uncultured microbes. Genome reconstruction and annotation reveal physiological and biochemical properties that can help isolate uncultured microbes, eventually making the in vitro validation of predicted enzymatic activities feasible. More efforts are needed to evaluate and reduce assembly errors and improve binning accuracy, especially in the presence of microdiversity in the samples.20 Fourth, taxonomic and functional annotation of sequence data is inevitably affected by the completeness of the database resources used. The current practice of using all PCGs (rather than clade-specific marker genes) for whole-community taxonomic profiling (e.g., as implemented in MG-RAST) can be problematic considering the incompleteness of protein sequence databases and the broad taxonomic distribution of PCGs. Therefore, analysis of 16S rRNA gene fragments in a metagenome is recommended for an overview of the taxonomic composition of a microbial community. Moreover, functional metagenomics, which couples high-throughput functional screening of metagenomic library clones growing on target substrates (e.g., antibiotics, chitin, and cellulose) with NGS, is recommended to speed up the discovery of new protein families and functions,48,163 which is promising for the development of novel genetic engineering materials and the expansion of publicly available protein databases. 
Last and most important, metagenomics and related state-of-the-art data mining approaches (e.g., network modeling, de novo assembly, and genome binning) have been demonstrated as effective

methods to reveal the intricate microbial mechanisms that deteriorate, maintain, or advance the function and stability of microbial communities.147,164 These “open-ended” approaches have enabled the formulation of numerous novel hypotheses (e.g., on interspecies interactions,147 gene coexpression,149 putative carbohydrate-active genes,165 antibiotic/metal resistance genes,103,166 or metabolic potentials of uncultured microbial dark matter (e.g., Candidatus Accumulibacter Clade IB,22 ANME-1,23 and SBR1093 167)), which require future validation via the design of “hypothesis-driven” studies. To fulfill such endeavors and obtain a more thorough understanding of microbially mediated systems, it is promising to combine metagenomics with key complementary techniques, including isotope labeling, qPCR, FISH, flow cytometry, and advanced chemical analysis.
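As a concrete illustration of how network modeling can generate hypotheses on interspecies interactions, the sketch below builds a simple co-occurrence edge list from taxon abundances across related samples. It assumes Spearman rank correlation with a fixed cutoff as the association measure, which is one common choice and not necessarily the method of the cited studies; the function names, the 0.8 cutoff, and the abundance values are hypothetical, and no significance testing or multiple-testing correction is included.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        return 0.0
    return s_xy / sqrt(s_xx * s_yy)


def cooccurrence_edges(abundance: Dict[str, Sequence[float]],
                       min_abs_rho: float = 0.8) -> List[Tuple[str, str, float]]:
    """Edges (taxon_a, taxon_b, rho) whose |Spearman rho| meets the cutoff."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        # Spearman correlation is the Pearson correlation of the ranks.
        rho = pearson(ranks(abundance[a]), ranks(abundance[b]))
        if abs(rho) >= min_abs_rho:
            edges.append((a, b, rho))
    return edges


if __name__ == "__main__":
    # Hypothetical abundances of four taxa across five related samples.
    table = {
        "taxon_A": [1, 2, 3, 4, 5],
        "taxon_B": [2, 4, 5, 8, 10],
        "taxon_C": [5, 4, 3, 2, 1],
        "taxon_D": [3, 1, 4, 1, 5],
    }
    for a, b, rho in cooccurrence_edges(table):
        print(a, b, round(rho, 2))
```

The resulting edge list can be written to a tab-separated file and loaded into a network platform such as Cytoscape or Gephi for visualization; any edge found this way is a hypothesis to be tested, not evidence of a direct interaction.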


