Quantitative Tracking of Combinatorially Engineered Populations with

Jan 19, 2017 - Advances in synthetic biology and genomics have enabled full-scale genome engineering efforts on laboratory time scales. However, the a...
0 downloads 4 Views 1MB Size
Letter pubs.acs.org/synthbio

Quantitative Tracking of Combinatorially Engineered Populations with Multiplexed Binary Assemblies Ramsey I. Zeitoun, Gur Pines, Willliam C. Grau, and Ryan T. Gill* Department of Chemical and Biomolecular Engineering, University of Colorado, 596 UCB Boulder, Colorado 80303, United States S Supporting Information *

ABSTRACT: Advances in synthetic biology and genomics have enabled full-scale genome engineering efforts on laboratory time scales. However, the absence of sufficient approaches for mapping engineered genomes at system-wide scales onto performance has limited the adoption of more sophisticated algorithms for engineering complex biological systems. Here we report on the development and application of a robust approach to quantitatively map combinatorially engineered populations at scales up to several dozen target sites. This approach works by assembling genome engineered sites with cell-specific barcodes into a format compatible with high-throughput sequencing technologies. This approach, called barcoded-TRACE (bTRACE) was applied to assess E. coli populations engineered by recursive multiplex recombineering across both 6-target sites and 31-target sites. The 31-target library was then tracked throughout growth selections in the presence and absence of isopentenol (a potential next-generation biofuel). We also use the resolution of bTRACE to compare the influence of technical and biological noise on genome engineering efforts. KEYWORDS: genome engineering, synthetic biology, high throughput sequencing, DNA assembly

T

resistance, protein evolution) but also to the engineering of microbial systems to exhibit complex behaviors.9 To address the need to track mutations at a population-level, we developed the barcoded-TRACE (bTRACE) approach. bTRACE uses a persistent barcode sequencing and multiplexed binary assembly to enable tracking of mutations and quantification of mutations on a population wide level. As an improvement to the TRACE approach, the number of targets that can be mapped per genotype is not limited by sequencing read length, and noise associated with crossover can be identified and eliminated in a direct manner. Quantifying genotype to phenotype relationships using the bTRACE approach is straightforward and can be performed independent of genotype mapping. The ease of using bTRACE along with the new capabilities it introduces should enable studies of combinatorial genetic interactions of a wide range of systems. The bTRACE process is diagramed in Figure 1. bTRACE uses a library containing multiplexed mutations and persistent DNA barcodes (Figure 1a). Throughout we use two E. coli libraries generated from a clonal genotype using the MAGE technique and a plasmid containing 12 nt random sequences to act as a barcode. Each cell is uniquely tagged with a barcode by transforming a barcoded plasmid library into an E. coli population following the introduction of targeted diversity (Figure 1b). To ensure high genotyping coverage and reduce the potential of a barcode to be represented in two genotypes,

he understanding and manipulation of biological systems requires methods for efficiently mapping genes to targeted traits.1,2 A fundamental limitation in the progression of this field is our inability to effectively manipulate complex phenotypes, for which the relevant combinatorial mutational space is often much larger than can be searched on laboratory time scales.3 Recent techniques for generating combinatorial mutants have proven effective; however, commensurate techniques to map such combinatorial mutants to targeted traits have not developed as rapidly.4,5 To address this limitation, we recently described a multiplexed single-step DNA assembly reaction (TRACE) used to track combinatorial mutants with single genotype resolution at a maximum of 4−8 sites using next-generation sequencing technologies.6 With TRACE, the number of sites that can be evaluated in a population is limited by sequencing read length and noise from crossover during the construct preparation process. While powerful, these limitations restrict this approach to applications involving the smallest biological complexes that involve a small number of genes (e.g., operons or multiple sites on a single gene). Many interesting biological systems, however, are often dependent on up to a few dozen distal genes, each of which contains multiple parts that might be the subject of engineering efforts (e.g., promoters, active sites, RBS, etc.).7,8 As a result, the ability to track combinatorial mutants at the scale of several dozen sites would enable new studies extending to complex traits. Such studies are critical not only to understanding the emergence of complicated phenotypes in nature (e.g., drug © 2017 American Chemical Society

Received: December 13, 2016 Published: January 19, 2017 619

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627

Letter

ACS Synthetic Biology

Figure 1. Schematic of bTRACE strategy of library quantification and identification. (a) A library is first transformed with a barcode library. (b) The population is diluted to ensure high coverage of mutants. (c) Multiplex binary assemblies are generated and assembled for site identification or barcodes are extracted from the library for quantification. (d) The population of binary assemblies is sequenced to link specific genotypes with specific barcodes and, separately, the population of plasmid-barcodes are sequenced to determine the distribution of the genotypes throughout the library population.

the population is set to a defined number cells (103−105 genotypes) by diluting a larger library. Barcodes are quantitatively tracked by high-throughput short-read length sequencing (e.g., Hiseq, Figure 1c).10 Using high-throughput short read sequencing, we can identify barcodes with high coverage, thereby reducing sampling error and improving confidence in frequency measurements. In a separate reaction, each cell-specific barcode is linked to a specific combinatorial genotype via a multiplexed binary assembly reaction performed in emulsion. Binary assemblies are sequenced in multiplex via paired-end high-throughput sequencing (e.g., Miseq, NextSeq Figure 1d). As such, the linkage of the same barcode with multiple targeted sites indicates that the sequences located in those sites are from the same cell.11 We define this barcode to targeted site linkage map throughout as a bTRACE genotype, since it links a specific barcode with the mutated sequences of each of the targeted sites. To perform this multiplexed assembly, primers and linkers are designed to be orthogonal to prevent primer−dimers while assembling into multiple binary complexes following many of the same principles developed in the TRACE method (Supplemental Figure 1). Briefly, primers are designed to have similar melting points. Linkers are designed to have 10 degree greater melting points than primers to ensure an irreversible hybridizaiton under normal operating temperatures. In addition, primers and linkers are designed to have unfavorable intraprimer and interprimer interactions. Primers are designed from sequences upstream and downstream of the barcode and sites of interest and linkers are designed from random 29 nt sequences representing a possible search space of 429 linkers. This design involved searching through potentially millions of primer−linker combinations, therefore the software package used to design sequences is available for download (see Methods). This approach requires that multiplexed binary assemblies can be generated in a single reaction and subsequently accurately tracked using high-throughput sequencing. To show this is possible, a combinatorial library was studied. This library was generated by recursive multiplexed recombineering (MAGE) in E. coli. We specifically engineered and targeted combinations of degenerate ribosome binding sequences in six target genes. First, the multiplexed binary assembly approach was tested in bulk (nonemulsion) conditions. All six sites were individually

assembled with cell specific barcodes in a single reaction. To perform this reaction, all six targets are assembled giving a mixture of binary assemblies, barcodes, and single sites (Supplemental Figure 2). Full-size assembled products were then isolated by gel extraction of the large products comprising individual binary assemblies. These assemblies were verified by polymerase chain reaction (PCR) individually amplifying each barcode-site assembly product using sequence specific primers. This six-site library was then assessed in a population via emulsion PCR assembly and high-throughput paired-end MiSeq sequencing. To ensure high sequencing coverage of the population (which is limited by sequencing throughput) the population was diluted to a few thousand cells per reaction from presumably over 106 genotypes. Our sequencing data contain linkages between unique plasmid barcodes with each of the six targeted sites (barcode-site reads), the combination of which for a single barcode defines a bTRACE genotype. High coverage of sites with barcodes was observed where 89% of the barcodes were linked to five or more sites (Figure 2a) and only ∼5% were linked with three or fewer sites. The result also shows correlation between the number of times a specific barcode is identified in the sequencing data versus the likelihood it is to be covered with a larger number of linkages. This suggests that deeper sequencing would produce an even greater percentage of barcodes linked with all six targeted sites. Next, the heterogenity of site coverage was assessed by quantifying the number of times that a specific site is not identified with a identified barcode (Figure 2b). It was found that the ompT ribosomal binding site (RBS) was poorly covered with respect to other sites in the library, presumably due to inefficiencies in assembly, site amplification and/or clustering (Figure 2b). The ompT barcode-site linkage was missing in approximately 10-fold more genotypes than the best covered ygiH linkage. To address this, ompT can be individually assembled and sequenced in a separate reaction. As the barcode-site linkage data are constant, targeted sequencing efforts can be used directly to supplement previously acquired data. To validate the claims that a measured bTRACE genotype was both correct and constant, population-wide barcode-site linkages were determined for a 6-site combinatorial library before and after three-days of growth in minimal media. First, approximately 88% of the unique bTRACE genotypes (about 2100) were present in both the original (t = 0) population and selected (t = 3) population (Figure 2c), indicating that the 620

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627

Letter

ACS Synthetic Biology

barcodes persist during growth selections. This is expected as the population was grown in the presence of a plasmid antibiotic selection and a high-copy plasmid is unlikely to be stochastically lost in a 3-day time period. Also, 12% of barcodes were not identified after the 3-day growth. This could be because of fitness heterogeneity in the population, causing a diverse barcoded population to lose diversity.12 We also assessed if the barcode-site linkages were constant between the original and selected populations. To do so, barcode-site linkages from each of the six targeted sites at both time points (∼12 500 linkages including ∼2500 mutations) were compared. To compare linkages between time points, three categories were used which include both linkages are wild-type, both linkages are identical mutations, or the linkages disagree. About 88% of our barcode-site linkages agreed between our two replicates demonstrating that bTRACE is capable of tracking mutations with the same barcodes throughout growth selections (Figure 2d). The roughly 300 linkages that do not agree between replicates demonstrate a rate of genotyping errors in a single measurement, and also demonstrate that replicates can be used to identify and mitigate errors. Since errors can be identified through replicates, these errors likely occur from random events. For instance, genotyping errors can result from loading multiple cells per droplet during assembly or crossover occurring between barcodes during amplification. Furthermore, high-throughput sequencing results were orthogonally validated by Sanger sequencing barcodes from 13 individual colonies obtained from plated samples of the library population. It is important to note that all 13 colony barcodes were also found in the high-throughput linkage map sequencing data. To validate that Sanger sequencing would produce the same bTRACE genotypes as were identified in the

Figure 2. Assessing bTRACE genotyping validity with replicate highthroughput sequencing. (a) Total counts per barcode versus the number of linkages per barcode. Values above boxes represent the proportion of barcodes with a corresponding amount of linkages. (b) Number of times that a specific barcode-site linkage is not found for a barcode with respect to a specific site. (c) Comparison of the barcodes identified for an initial population and three-day serial outgrowth. Also indicated is the proportion of barcodes shared between these two measurements. (d) Comparison of the barcode-site linkages between the initial population and the three-day outgrowth. Between these two time points, linkages can match as both wild-type, match as both identical mutants, or mismatch.

Figure 3. Generation and genotyping of a 31-site library. (a) Genes chosen for modification colored by function. (b) bTRACE genotype coverage; 62.6% of bTRACE genotypes contained sequencing data on 27 or more of the targeted sites. (c) Simulated coverage of genotypes (27 or more linkages identified per barcode) versus the number of genotypes (i.e., barcodes) in a population. (d) Modeling the probability of identifying a correct bTRACE genotype where unidentified linkages are assumed to be wild type; in addition, modeling the fraction of correct genotypes in the sequenced 31-site population in panel b based on a linkage threshold filter. 621

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627

Letter

ACS Synthetic Biology

Figure 4. Quantification of barcodes under varying selective pressures. (a) Theoretical analysis of the influence of CNV with respect to measured frequency in the population. (b) Barcode frequencies between technical replicates. These replicates were measured from overnight cultures (harvested from a single freezer stock) and grown in MOPS minimal media for 24 h. (c) Comparison of wild-type enrichment noise in the case of three selective pressures. Star indicates a significance of p < 10−4 (see Methods).

identified. The genotype distribution in the population and site coverage distribution from the barcode-site assembly were directly obtained from our sequencing data. Using these distributions, the genotype coverage (genotypes having 27 or more sites identified) was simulated and measured (Figure 3c). In the case of a MiSeq (i.e., millions of reads passing filter) at 103 genotypes in the population, about 97% of population were identified, which quickly drops off as you increase the number of cells in a population. In the case of 10 s of millions of reads passing filter we see a similar trend where around 97% of genotypes are covered from a population of 104. With higher numbers of reads (100 s of millions) about 105 genotypes can be covered. Next, we assessed barcode transformation efficiency as a potential limiting factor toward assessing larger population sizes. To test this we diluted a population to 104 cells and measured about 104 genotypes with HiSeq sequencing. Therefore, we found that larger numbers of barcodes could be transformed and identified. Finally, a significant emulsion bias can negatively influence coverage. Emulsion bias was assessed by tracking the distribution of barcodes from quantitative barcode sequencing and emulsion barcode-site sequencing. Distributions appear similar and there seems to be no strong bias toward a single genotype in either the emulsion or the population (p > 0.1, Wilcoxon Rank Sum test, Supplemental Figure 5). By eliminating bTRACE genotypes covering fewer than 27 sites, this bTRACE genotype library contains 21 646 linkages distributed among 718 unique combinatorial bTRACE genotypes. When linkages were missing in a bTRACE genotype (2.7% of the cases), we assigned the wild-type sequence for all subsequent analysis (note that wild-type is the predominant RBS sequence and the overall RBS mutation rate was 5.6%, Supplemental Figure 6). As the number of unidentified linkages increases, the probability for miscalling the correct genotype increases (see Methods, Genotype Error Calling, and Figure 3d). A cutoff of four or fewer unidentified linkages was used resulting in 97% of the genotypes correctly identified for all 31 sites. In addition, the presence of wild type, single mutants, double mutants, and higher order mutants is approximately equal (Supplemental Figure 7). To quantify members of the population, barcode frequencies were measured using high-throughput short read sequencing (HiSeq). It was first important to verify that barcode frequency

high-throughput sequencing data, we sequenced all six sites from five of the same colonies using Sanger sequencing with the original TRACE method.6 All of the resulting genotype data (30 sites) were identical when measured by colony sequencing or high-throughput sequencing (Supplemental Table 1). It appeared that template crossover (identified as a challenge with TRACE) was not significant in this data. This is because there is considerable redundancy for each barcode-site assembly read (10 s−100 s of redundant reads), whereas in TRACE it is not possible to assess redundancy because there is not a cell specific barcode. Moreover, the shorter binary assemblies of bTRACE are less prone to template crossover than higher-order assemblies.13 To assess the ability of bTRACE to track much larger combinatorial libraries we constructed a 31-site RBS library targeting genes expected to affect alcohol tolerance in E. coli.14−16 Alcohol tolerance is an important phenotype for organisms producing alcohols in high titers as potential nextgeneration biofuels. As such, tolerance has been studied using evolutionary approaches, single target libraries, and combinatorial studies. To address this problem with bTRACE, an overall library comprised four sublibraries each organized based on target gene function (e.g., membrane structure, transmembrane pumps, metabolism, or stress-response pathways; see Figure 3a, Supplemental Table 2). Again, MAGE was used to construct this library. ssDNA oligomers were designed to give a broad predicted gene expression (translation initiation rate) for each site using the RBS calculator and oligomer design tools (Supplemental Figure 3).17,18 This library contained 266 different ribosome binding sites (including wild type) for a total possible combinatorial space of 1028 genotypes. First, bTRACE genotypes were mapped for this library prior to any growth selection. To do so, we again diluted the population to approximately 103 cells to ensure sufficient redundancy given the large number of sites targeted and the number of high-quality reads expected. We identified 1 114 unique genotypes, ∼63% of which contained linkages on 27 or more sites (Figure 3b). Similar to Figure 2a, increasing counts per barcode resulted in an increasing number of barcode-site reads, suggesting that higher throughput sequencing could be used to map even larger populations (Supplemental Figure 4). To address scalability, a simulation was performed on our evaluated library (31 targeted sites per cell) to assess the effect of read quantity on the number of genotypes that can be 622

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627

Letter

ACS Synthetic Biology

Figure 5. Enrichment analysis of a population grown in a selective pressure. (a) Comparison of differential enrichment scores for MOPS and MOPS with isobutyl alcohol for identical genotypes. Point colors represent different genotyped barcodes having 27 or more reads identified. Data points marked with X are found to differ between enrichments (p < 0.025). (b) Histogram differentially enriched bTRACE genotypes. Genotypes are colored based on the number of mutated sites identified out of the 31 targeted sites.

population (Figure 4c, see Methods, Noise Calculations). The most significant observation here is that the total variation within wild-type genotypes was observed to be larger than technical noise in the MOPS (t test, p < 10−4) and isobutyl alcohol selections (t test, p < 10−6). To assess the effect of targeted RBS mutagenesis on isobutyl alcohol tolerance, the enrichment of genotypes were compared between minimal media and minimal media supplemented with isobutyl alcohol (Figure 5a). We use the differential enrichment of a genotype, defined as the difference between the enrichment in isobutyl alcohol and enrichment in MOPS, to assess its tolerance. bTRACE genotypes with differential enrichment scores that are greater than 0 with statistical significance based on three technical replicates are appropriately marked (p < 0.025, single tail paired t test). Although many combinatorial mutants having significant differential enrichments between both selections found, there is also a proportional amount of wild-type genotypes having the same profile. This seems to be sourced from the large amount of biological variation observed in wild-type genotypes. As a result, no trends or significant outliers are identified when plotting the histogram of differential enrichment between statistically different library members (Figure 5b). A strategy to quantitatively track and identify combinations of mutations was developed and validated with a proof of concept 6-site library and then expanded to a 31-site library. With the ability to track larger numbers of targeted sites, interactions among mutations at the biological subsystem scale (metabolic pathways, regulatory networks, protein complexes) can now be measured and used to inform more complex engineering efforts. Using this approach we were able to identify mutants with significant differential enrichment values (p < 0.025). One advantage of bTRACE is that genotype enrichments are evaluated on a per barcode basis. This enables evaluation of phenotypes for multiple instances of the same genotype instead of measuring an average phenotype. By performing enrichment measurements on barcodes rather than genotypes, we were able to observe noise occurring at the single mutant level, rather than a population level. Wild-type genotype enrichment variation (biological noise) was larger

measurements were accurate and reproducible. A major concern in this approach is that barcodes are contained on a high-copy plasmid (pUC57). This was done to avoid potential assembly bottlenecks when attempting to assemble a plurality of sites in the genome with a single plasmid based barcode. Since high-copy plasmids can be susceptible to copy number variation (CNV), this can undermine quantification results. A simulation was performed which predicts the noise of the frequency of a barcode observed based on CNV (see Methods, Copy Number Variation Noise Simulation section). This simulation showed that theoretically CNV is insignificant to quantitative noise (Figure 4a). Specifically, CNV affects barcode quantification precision by under 1% for a genotype occurring at a frequency of at least 10−5 (a genotype of 104 cells in a 1 mL OD600 population of approximately 109 cells). Higher frequency mutants are less susceptible to this source of noise. CNV noise would be equivalent to the counting error associated with 104 barcode counts (frequency of 10−3 in 107 reads) in a sequencing run (assuming Poisson statistics). This suggests that counting noise is more influential than CNV noise. Technical replicates were performed to assess the reproducibility of quantifying genotypes in a population. Barcode frequencies between replicates were measured from overnight cultures (harvested from a single freezer stock) then subsequently grown in MOPS minimal media for 24 h. Frequencies of barcodes between replicates are highly correlated (Figure 4b, Supplemental Figure 8, N = 1121 genotypes, r2 = 0.97, p ≪ 10−6). Barcode sequencing was used to track this 31-site library throughout growth selection in either minimal media, minimal media supplemented with isobutyl alcohol or minimal media supplemented with isopentenol. During selections, replicate growth rate in minimal media and isobutyl alcohol showed less technical variability than isopentenol (Supplemental Figure 9). Barcode enrichment (ε = log2(fs/f i)) was used as the measure of genotype fitness where f i and fs are the initial and selected barcode frequencies (individual barcode counts versus total counts), respectively. The technical noise and biological noise of enrichment were assessed in wild-type cells across the 623

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627

Letter

ACS Synthetic Biology

Library Construction. Libraries were constructed from a clonal base strain HME63 which has chromosomally integrated RED recombineering genes.27 This strain was recombineered via MAGE with ssDNA 90-nt oligomers that were designed using the MODEST tool for sequence structure, and a ribosome binding site calculator for designing a range in gene expressions (Supplemental Table 4).17,18 MAGE was then performed using about 50 μL of resuspended cell pellet with an equimolar primer mixture (100 pmol) for 12 cycles (6-site library) or 15 cycles (31-site library) according to previously described protocol. Cells were recovered in low salt (25%) Luria Broth. After library generation, cells were transformed with the barcode plasmids according to previous protocol. A plasmid dilution series was performed to ensure that only one plasmid transformed per cell was obtained. This could be performed by identifying the dilution in which the CFU having kanamycin resistance dropped proportionally with the plasmid dilution. To do this, a spot plate was performed on LB agar supplemented with 50 μg/mL kanamycin and 100 μg/mL carbinicillin. Finally, the transformation showing 1 plasmid per cell was diluted to a known cell concentration (103−105 cells) and allowed to grow to saturation to make a smaller library which can be easily covered with high-throughput sequencing technologies. Primer and Linker Design. Multiplexed primers and linker sequences were designed using a custom MATLAB script (Supplemental Table 5−8, Supplemental Figure 10). Primers were designed from a search space 100 nt upstream and downstream of each site of interest (RBS and barcode). Primers were designed to be orthogonal to other primers in the multiplexed PCR reaction (i.e., minimize homology between sequences), have no secondary structure (based on MATLAB’s oligoprop calculation) and no nucleotide repeats greater than 4. These principles are outlined in more detail in previous publications.6 Linkers were designed from random 29 nt sequences having similar thermodynamic properties as primers and be orthogonal to all sequences in the library. Because of the large linker search space available, linkers were designed after primers and designed having stringent thermodynamic properties (secondary structure, homology, etc.). The melting temperature of primers was designed to be around 60 °C, and that of linkers was designed to be 68 °C. Linker sequences are attached to the reverse barcode primer (Assembly-BCR-X) and the reverse complement forward site primer. If a linker is compatible (i.e., not showing secondary structure or creating incompatible sequences) with both the reverse barcode primer and several (three or more) forward site primers it is considered acceptable. Several reverse barcode primer−linker sequences are needed to cover an entire library. For ease of adding Illumina adapter sequences, a universal adapter is added to each reverse site sequence (Assembly-CR-X) following the same principles of linker design. It was found to be difficult to generate orthogonal linkers as the number of sites exceeded 20. Therefore, two sets of assembly primers were used to cover 31sites. In addition, a third set of assembly primers was generated to cover constructs having low coverage (inefficient assembly). A detailed description of using the software is included with the distributed package. Primer Design Software. The Matlab scripts and instructions for using these scripts are available for download at https://sites.google.com/site/thegillgroupcu/bTRACE_ Software.zip

than expected, spanning the range of observed isobutyl alcohol−MOPS differential enrichments. This is problematic when applying a Design-Build-Test-Learn (DBTL) during multiplexed genome engineering because the reduced precision in quantitative genotype−phenotype measurements affects the quantitative conclusions that can be generated, inherently limiting the Learn step used to inform future designs.19 Biological noise, in this context, is likely due to cryptic genetic variation occurring during recursive genome engineering in a mutS deficient strain.20,21 This limitation is independent of the sophistication of the “Test” step and arises from the “Build” step. To address this inherent noise, we can focus studies on building more influential mutations or studying stronger selective pressures. Another approach would be to build libraries with higher coverage containing more biological replicates; however, this undermines the advantages of a MAGE-type multiplexed genome engineering approach in which genotypes are stochastically generated from a massive sample space. Alternatively, we can introduce less noise by minimizing genetic drift and variation. This can be done by building libraries with less mutagenic strains or more efficient genome engineering approaches. To this end, our lab has leveraged CRISPR/Cas9 technology in a multiplexed genome engineering tool that allows for library generation on drastically shorter experimental time scales.22 With bTRACE, once barcode-site linkages are identified in a library, then numerous selections can be easily performed and tracked simply by sequencing the barcode region. While our libraries only covered about 103 cells, next-generation highthroughput sequencing strategies (i.e., NextSeq) should enable up to 400 million reads, giving almost 100-fold deeper library interrogation (100 000s of mutants or 100s of sites in parallel); thus allowing extension of these methods to enable the engineering of multiple subsystems in parallel.23 Finally, this approach is expandable to many systems and approaches. For example, barcodes can be used on low copy number plasmids or be genomically integrated. On the basis of our experience assembly should occur between multiple sites and a single barcode although less efficiently than with a high-copy number plasmid. In addition, plasmids, or other extra-chromosomal vectors, can be used with a wide-variety of species (including mammalian) or barcodes could be genomically integrated into various species using a method such as CRISPR/Cas9.24 Demonstrations of barcoding and identifying cancer cell lines have been used to study drug susceptibility in a highthroughput format, and as such bTRACE could be used in a downstream manner.25



METHODS Barcode Construction. Barcodes were constructed in the high copy plasmid pUC57 with a kanamycin resistance marker. Barcode sequences were generated as a double-stranded cassette from two PCR primers having homology to adjacent regions in the pUC57 plasmid (Supplemental Table 3). These adjacent regions were linearized using Phusion polymerase PCR (New England Biolabs) and two primers (LinF, LinR). The mixture was then treated with DpnI (New England BioLabs) to remove circular plasmid. Both barcodes casettes and linear plasmid were then gel extracted (Qiagen) and assembled using circular polymerase extension cloning (CPEC).26 The resulting mixture was then desalted by PCR cleanup (Qiagen) to prepare for transformation. 624

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627

Letter

ACS Synthetic Biology Library Genotyping. Libraries were genotyped via a combination of a multiplexed binary assembly and emulsionbased PCR. Primers were mixed in equimolar mixtures of site primers mixes and barcode primer mixes at a total concentration of 40 μM. Barcode primer mixes were made from 50% forward barcode primer and the remaining 50% was an equimolar mixture of the reverse barcode primers. Site primers were mixed in equimolar concentrations. Assembly was performed as previous described.6 Briefly, assembly mixture contained 20% 5× Phusion polymerase buffer, 5% 10 mM BSA, 5% 0.1 OD600 cell mixture, 3% 5 mM MgCl2, 3% 100 mM dNTPs, 2% primers, and the remaining volume ultrapure water. Mineral oil was mixed with ABIL EM90 surfactant and TritonX-100 surfactant. This mixture was then emulsified in a cryogenic tube with a stir bar by a stir plate for 10 min. Estimates for droplet size are 10 pL although there is heterogeneity which can be further controlled with microfluidics. In this case a 0.1 OD600 cell mixture should contain about 5 × 104 cells/μL or 1.3 × 105 cells per emulsion reaction, or about 0.1 cells per droplet. The resulting opaque solution was pipetted in 50 μL quantities to PCR tubes and assembled using a 5 min 95 °C hold step and 25 amplification steps (95− 60−72 °C) with longer hold times of (75, 75, 90 s). Longer hold times were performed to compensate for increased thermal resistance of the oil phase. After assembly, the emulsion was broken using 3:1 water-saturated isobutyl alcohol to sample, heat (55 °C) on a shaking heatblock (1 kRPM) followed by centrifugation (5 min). To further break the emulsion the centrifuged mixture was frozen in a −80 °C freezer for 30 min then further centrifuged resulting in a clean bottom aqueous phase. The aqueous phase was removed and gel purified (in a 10 μM guanosine gel to prevent UV damage). The gel slab removed was from 350 bp to 700 bp to only capture assembled barcode-site sequences. The extracted DNA was then prepared for high throughput sequencing in two Phusion PCR steps using homologous primers (pA-F, pA-R-115, and MiSeq-A-F, MiSeq-A-R-1-3). With appropriate adapters, the library was sequenced on an Illumina MiSeq using custom primers (R1, R2, IR) following Illumina’s suggested protocol. In addition a 300-cycle kit was used and was set to perform 100 forward basecalls and 200 reverse basecalls. Selection. Barcoded libraries were grown in MOPS minimal media either supplemented with 8 g/L isobutyl alcohol, 4 g/L isopentenol or 0 g/L isopentenol, 50 μg/mL kanamycin (for barcode retention) and 0.2% glucose. MOPS was freshly made (within 1 week) prior to experiments. Cells were first adapted in MOPS prior to selections. At t = 0, the adapted population was diluted to OD600 of 0.01 and grown until an OD600 of approximately 1. The ending OD and ending time varied per replicate and per condition. Quantification. Barcodes were PCR amplified from a plasmid extracted library via a Miniprep (Qiagen) following growth selections. Primers were added at 1% of the reaction and used at 40 μM total (HiSeq-pA-F, HiSeq-pA-R) with standard Phusion polymerase conditions. About 7−8 PCR cycles were used to properly amplify barcodes. Barcodes then had adapter sequences attached (HiSeq-A-F, HiSeq-A-R) and were sequenced on a single Illumina HiSeq lane using a 1 × 50 kit resulting in around 107 reads per condition. Genotype Data Analysis. Data was analyzed using custom MATLAB software. For qualitative data, software first filtered for the presence of barcodes based on forward read alignment

to the forward barcode primer. Sites were then identified and binned based on alignment to the expected reverse primer. Because of the high redundancy and coverage of reads, sequences having mean Phred scores less than 20 (1% error) were removed from analysis. A most likely RBS sequence was called for each site from the most frequent RBS mutation. Replicates were combined based on the frequency of each RBS data point. For more stringent assessment of RBS while maintaining a large number of reads (the 31-site library), all reads were kept, however basecalls with Phred scores less than 20 were eliminated and RBS were calculated as an unweighted consensus sequence. In addition, the number of counts per barcode was calculated as total number of reads passing filter assuming a read passing filter of 35 nt with a Phred Score greater than 20. Finally, to ensure reliable barcode-site identification a final filter of greater than 3 total reads/site/ barcode was used to consider a site identified. Noise Calculations. Noise was calculated as technical, total, or biological. To calculate technical noise the enrichment, ε, is used to calculate the standard deviation between M = 3 replicates for N barcodes. The mean of those calculated standard deviations is used as the noise (eq 1). ntechnical =

1 N

N

∑ i=1

1 M−1

M

∑ (εj − ε ̅ )2 (1)

j=1

To calculate total noise, the enrichment, ε, has its standard deviation first calculated versus the enrichment of a population of N wild-type strains for M = 3 sets of measurements (eq 2). ntotal =

1 M

M

∑ i=1

1 N−1

N

∑ (εj − ε ̅ )2 j=1

(2)

Biological noise is calculated from the total noise and technical noise as indicated in eq 3. 2 2 2 ntotal = nbiolog ical + n technical

(3)

The statistical difference was calculated using a two-tailed t test for unpaired samples. In Matlab the ttest2 function was used. Quantitative Data Analysis. For quantitative data, the frequency of barcodes was measured as the number of counts per barcode versus the total number of counts. Enrichment scores are calculated as the logarithm (base 2) of the ratio of frequencies between growth conditions (or between a growth condition and the initial population). Low-count barcodes are subject to counting error, and therefore barcodes were removed that appear fewer than 10 times at any time during the selection. This had the unintended effect of removing the unquantifiable least-fit (or stochastically lost) mutants from the population. Differential enrichment, ΔE, was calculated as the enrichment difference between isobutyl alcohol + MOPS and MOPS enrichments (eq 4). In this case, f i is the frequency of barcodes in the initial population, f IB is the frequency of barcodes in the isobutyl alcohol selected population and f MOPS is the frequency of barcodes in the MOPS selected population. ΔE = log 2

fIB fi

− log 2

fMOPS fi

(4)

To calculate statistical significance of the differential enrichment, a one-tailed paired t test was performed between the three replicates with a significance of 0.975. 625

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627

Letter

ACS Synthetic Biology Author Contributions

Genotype Error Calling Model. The chance of improperly classifying a genotype was modeled (Figure 2c). The chance of calling a mutated sequence as wild-type is designated by an error rate, ε = 0.056. Using this, the chance of correctly calling a genotype, p, for a designated number of sites missing, n, is described by eq 5. pn = 1 − (1 − ε)n

R.I.Z. and R.T.G. generated concept and wrote manuscript. R.I.Z. performed experiments. R.I.Z., R.T.G., G.P., and W.C.G. planned experiments and interpreted data. Notes

The authors declare no competing financial interest.

(5)

Using the results of eq 5, the cumulative number of correctly called genotypes, c, can be described by eq 6, where N is the cutoff number of sites and kn is the number of mutants in the population with n missing sites. ∑n = 0 knpn N

∑n = 0 kn

(6)

Copy Number Variation Noise Simulation. A random distribution of copy number variation was generated using Matlab’s randn command with a mean value of 103 and a standard deviation of 250. Plasmid copy numbers were generated for each cell from this distribution. The quantity of random copy numbers generated are based on the number of cells sampled, for example if 102 cells are sampled, then 102 normally distributed values are generated. These values are then averaged and result in the quantity of plasmids (barcodes) per cell. Coefficient of variation is calculated from the mean and standard deviation of 10 independently generated mean plasmids per cell with respect to the mean normal distribution value (Supplemental Figure 2a). Coverage versus Reads Simulation. Coverage was simulated using a custom Matlab script. This script randomly selects sites from a population. A read is defined by first selecting a genotype, based on the probability of selecting a genotype in the population, and then selecting a site, based on the probability of selecting a site from the population. The population distribution and site distribution (31-sites) were directly measured from barcode sequencing and barcode-site sequencing, respectively. Similar to our bTRACE data analysis, 27-sites were considered a covered genotype and over 3 reads per site was considered a covered site. For number of reads, we used 5 × 106 reads for the MiSeq and 5 × 107 reads for 10×MiSeq This was similar to what was obtained after filtering in a standard barcode-site high-throughput sequencing run.





REFERENCES

(1) Warner, J. R., Reeder, P. J., Karimpour-Fard, A., Woodruff, L. B., and Gill, R. T. (2010) Rapid profiling of a microbial genome using mixtures of barcoded oligonucleotides. Nat. Biotechnol. 28, 856−862. (2) Na, D., Yoo, S. M., Chung, H., Park, H., Park, J. H., and Lee, S. Y. (2013) Metabolic engineering of Escherichia coli using synthetic small regulatory RNAs. Nat. Biotechnol. 31, 170−174. (3) Sandoval, N. R., Kim, J. Y., Glebes, T. Y., Reeder, P. J., Aucoin, H. R., Warner, J. R., and Gill, R. T. (2012) Strategy for directing combinatorial genome engineering in Escherichia coli. Proc. Natl. Acad. Sci. U. S. A. 109, 10540−10545. (4) Wang, H. H., Isaacs, F. J., Carr, P. A., Sun, Z. Z., Xu, G., Forest, C. R., and Church, G. M. (2009) Programming cells by multiplex genome engineering and accelerated evolution. Nature 460, 894−898. (5) Wang, H. H., Kim, H., Cong, L., Jeong, J., Bang, D., and Church, G. M. (2012) Genome-scale promoter engineering by coselection MAGE. Nat. Methods 9, 591−593. (6) Zeitoun, R. I., Garst, A. D., Degen, G. D., Pines, G., Mansell, T. J., Glebes, T. Y., and Gill, R. T. (2015) Multiplexed tracking of combinatorial genomic mutations in engineered cell populations. Nat. Biotechnol. 33, 631−637. (7) Raman, S., Rogers, J. K., Taylor, N. D., and Church, G. M. (2014) Evolution-guided optimization of biosynthetic pathways. Proc. Natl. Acad. Sci. U. S. A. 111, 17803−17808. (8) Smanski, M. J., Bhatia, S., Zhao, D., Park, Y., Woodruff, L. B., Giannoukos, G., Gordon, D. B., et al. (2014) Functional optimization of gene clusters by combinatorial design and assembly. Nat. Biotechnol. 32, 1241−1249. (9) Hughes, D., and Andersson, D. I. (2015) Evolutionary consequences of drug resistance: shared principles across diverse targets and organisms. Nat. Rev. Genet. 16, 459−471. (10) Smith, A. M., Heisler, L. E., Mellor, J., Kaper, F., Thompson, M. J., Chee, M., Nislow, C., Roth, F. P., and Giaever, G. (2009) Quantitative phenotyping via deep barcode sequencing. Genome Res. 19, 1836−1842. (11) Spencer, S. J., Tamminen, M. V., Preheim, S. P., Guo, M. T., Briggs, A. W., Brito, I. L., Alm, E. J., A Weitz, D., Pitkanen, L. K., Vigneault, F., and Virta, M. P. (2016) Massively parallel sequencing of single cells by epicPCR links functional genes with phylogenetic markers. ISME J. 10, 427. (12) Levy, S. F., Blundell, J. R., Venkataram, S., Petrov, D. A., Fisher, D. S., and Sherlock, G. (2015) Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature 519, 181−186. (13) Judo, M. S., Wedel, A. B., and Wilson, C. (1998) Stimulation and suppression of PCR-mediated recombination. Nucleic Acids Res. 26, 1819−1825. (14) Minty, J. J., Lesnefsky, A. A., Lin, F., Chen, Y., Zaroff, T. A., Veloso, A. B., Rouillard, J. M., Xie, B., McConnell, C. A., Ward, R. J., et al. (2011) Evolution combined with genomic study elucidates

ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acssynbio.6b00376. RBS gene targets for 31-site library; degenerate MAGE oligomers used; assembly primers for the 6-site and 31site libraries; additional figures and tables as described in the text (PDF)



ACKNOWLEDGMENTS

The authors would like to acknowledge funding by DOE award Number DE-SC0008812. Special thanks to Dr. Sean Lynch, Professor Thomas Mansell, and Ryan Hockstad for research discussions and assistance. MiSeq sequencing services were provided by Dr. Jaime Prior and the University of Colorado Boulder Biofrontiers sequencing core facility, and HiSeq sequencing services were provided by the University of Colorado, Denver, sequencing core facility.

N

cN =



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Ramsey I. Zeitoun: 0000-0003-2306-9206 Gur Pines: 0000-0002-1757-6722 626

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627

Letter

ACS Synthetic Biology genetic bases of isobutanol tolerance in Escherichia coli. Microb. Cell Fact. 10, 1. (15) Atsumi, S., Wu, T. Y., Machado, I. M., Huang, W. C., Chen, P. Y., Pellegrini, M., and Liao, J. C. (2010) Evolution, genomic analysis, and reconstruction of isobutanol tolerance in Escherichia coli. Mol. Syst. Biol. 6, 449. (16) Woodruff, L. B., Pandhal, J., Ow, S. Y., Karimpour-Fard, A., Weiss, S. J., Wright, P. C., and Gill, R. T. (2013) Genome-scale identification and characterization of ethanol tolerance genes in Escherichia coli. Metab. Eng. 15, 124−133. (17) Farasat, I., Kushwaha, M., Collens, J., Easterbrook, M., Guido, M., and Salis, H. M. (2014) Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria. Mol. Syst. Biol. 10, 731. (18) Bonde, M. T., Klausen, M. S., Anderson, M. V., Wallin, A. I., Wang, H. H., and Sommer, M. O. (2014) MODEST: a web-based design tool for oligonucleotide-mediated genome engineering and recombineering. Nucleic Acids Res. 42, W408. (19) Venturelli, O. S., Egbert, R. G., and Arkin, A. P. (2016) Towards engineering biological systems in a broader context. J. Mol. Biol. 428, 928−944. (20) Paaby, A. B., and Rockman, M. V. (2014) Cryptic genetic variation: evolution’s hidden substrate. Nat. Rev. Genet. 15, 247−258. (21) Masel, J., and Trotter, M. V. (2010) Robustness and evolvability. Trends Genet. 26, 406−414. (22) Garst, A. D., Bassalo, M. C., Pines, G., Lynch, S. A., HalwegEdwards, A. L., Liu, R., Liang, L., Wang, Z., Zeitoun, R. I., Alexander, W. G., and Gill, R. T. (2016) Genome-wide mapping of mutations at single-nucleotide resolution for protein, metabolic and genome engineering. Nat. Biotechnol. 35, 48. (23) Reuter, J. A., Spacek, D. V., and Snyder, M. P. (2015) Highthroughput sequencing technologies. Mol. Cell 58, 586−597. (24) Esvelt, K. M., Mali, P., Braff, J. L., Moosburner, M., Yaung, S. J., and Church, G. M. (2013) Orthogonal Cas9 proteins for RNA-guided gene regulation and editing. Nat. Methods 10, 1116−1121. (25) Yu, C., Mannan, A. M., Yvone, G. M., Ross, K. N., Zhang, Y. L., Marton, M. A., Weir, B. A., Taylor, B. R., Crenshaw, A., Gould, J. Z., et al. (2016) High-throughput identification of genotype-specific cancer vulnerabilities in mixtures of barcoded tumor cell lines. Nat. Biotechnol. 4, 419−423.

627

DOI: 10.1021/acssynbio.6b00376 ACS Synth. Biol. 2017, 6, 619−627