
Technical Note

Facilitated large-scale sequence validation platform using Tn5-tagmented cell lysates
Byungjin Hwang, Sunghoon Heo, Namjin Cho, Hanna Seo, and Duhee Bang
ACS Synth. Biol., Just Accepted Manuscript • DOI: 10.1021/acssynbio.8b00482 • Publication Date (Web): 06 Feb 2019


ACS Synthetic Biology

Facilitated large-scale sequence validation platform using Tn5-tagmented cell lysates

Byungjin Hwang#,1, Sunghoon Heo#,2, Namjin Cho2, Hanna Seo2, Duhee Bang*,2

1 Institute for Human Genetics (IHG), Department of Epidemiology and Biostatistics, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California, USA
2 Department of Chemistry, Yonsei University, Seoul, Korea

*Corresponding author: Duhee Bang, PhD Department of Chemistry, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, Korea Tel: 82-2-2123-7625 E-mail: [email protected]

#These authors contributed equally to this work.

Keywords: Tn5 transposon, next-generation sequencing (NGS), de novo assembly, graphical user interface (GUI), cell lysates


ABSTRACT: A typical molecular cloning procedure requires Sanger sequencing for sequence validation, which is cost-prohibitive and labor-intensive for large-scale clone analysis in genotype–phenotype studies. Here we present the cost-effective clone analysis platform TnClone, which uses next-generation sequencing based on Tn5 tagmentation to rapidly analyze a large number of clones from cell lysates. This method bypasses the extensive plasmid purification step. We also developed a user-friendly graphical user interface and provided general guidelines for conducting validation experiments. We tested our program with 1023 plasmids (222 from cell lysates and 801 from purified clones) and achieved 92% and 99.3% sensitivity with cell lysates and purified DNA, respectively. Our platform provides rapid turnaround with minimal hands-on time for secondary evaluation, as next-generation sequencing technology continues to evolve.


The Sanger sequencing method (1) is still regarded as the gold standard for quality control in the analysis of DNA clones (e.g., bacterial plasmids). In general, DNA generated by PCR amplification or plasmid vectors requires a sequence validation step, which increases the cost associated with the analysis of clones. Sanger sequencing in both directions is generally used for this validation step. For large-insert clones, serial tiling of primers is typically required, which linearly increases the cost of sequencing. For a 2,000-bp fragment, at least four Sanger sequencing reads are required for accurate validation, costing at least $12. Overlapping chromatogram signals make it difficult to distinguish the true signal from noise due to clone contamination, template heterozygosity [insertions and deletions (indels)], or incomplete purification. Moreover, Sanger sequencing is not sufficiently sensitive to distinguish mixed peaks when the frequency of the less abundant template is 99% on average for the insert DNA sequence and 99% for the entire plasmid sequence, indicating high-quality contigs. When we further downsampled the data to 32 k read pairs (0.01%, HiSeq 4000), TnClone consistently showed high accuracy (99%).

Comparison with available assemblers. We compared TnClone to two available assembly programs, SOAPdenovo2 (4) and SPAdes (5), for sample validation. Execution time was comparable between SOAPdenovo2 and SPAdes, but the accuracy of SPAdes was better than that of SOAPdenovo2. However, TnClone outperformed the other assemblers in terms of execution time and accuracy for both short and long insert assemblies (Table 3). This result is likely due to TnClone’s use of a simplified graph structure for short insert assemblies rather than the complex graph reduction used by typical genome-scale assemblers.
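To illustrate why a simple graph can be enough for a single short insert, the following is a toy de Bruijn-style walk. It is a generic textbook sketch, not TnClone's implementation: the function names, the tiny `k`, the error-free reads, and the idea of starting from a known seed sequence are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, seed):
    """Walk from `seed`, extending while exactly one successor exists."""
    contig, node, visited = seed, seed, {seed}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical error-free reads tiled across a made-up 31-bp insert.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
reads = [reference[i:i + 12] for i in range(len(reference) - 12 + 1)]
graph = build_graph(reads, k=8)
contig = extend_contig(graph, seed=reference[:7])
```

Because every 7-mer in this made-up insert is unique, the walk never branches. Real data would need error filtering and branch resolution, which this sketch omits.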


Indel detection. We also observed greater accuracy in detecting indels compared with conventional indel detection tools, especially for large indels (>100 bp). Of 12 indel loci from 10 plasmid clones (10–268 bp), 67% were detected by HaplotypeCaller and PINDEL, 75% were detected by SOAPindel, and no indels were detected by UnifiedGenotyper (Supplementary Figure 7). Of the 3 large indels (>200 bp), 0% were detected by HaplotypeCaller, 33% by SOAPindel, and 66% by PINDEL. One indel (85 bp) from a lysate sample (intein) was miscalled by all programs. In contrast, TnClone detected all 13 validated indels. We reasoned that fine-grained local assembly of full-length contigs would have greater accuracy than local realignment and split-read based methods.

In summary, our TnClone platform facilitates clone analysis by directly tagmenting plasmids in cell lysates using Tn5 transposase, thereby bypassing the costly and labor-intensive DNA purification step. TnClone provides a user-friendly GUI as well as a command-line interface for general biologists. In particular, TnClone achieved consistently higher accuracy in detecting long indels. In the future, this assembly strategy could be applied to validating sequences from metagenome cloning and gene shuffling experiments for directed evolution. Also, de novo assembly enables validation of diverse scFv profiles that would not have been possible with an alignment-based approach. To our knowledge, this is the first platform for high-throughput NGS aimed at large-scale clone validation. By simply clicking and scrolling, users can analyze sequences from the raw data without the need to individually process intermediate files. The entire cost for analyzing a clone is >12-fold lower than the cost of conventional Sanger sequencing. As the scalability of biological experiments increases, we expect that an NGS-based clone analysis platform will be broadly applicable for a range of validation experiments.
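The reasoning about full-length contigs in the indel-detection comparison can be illustrated with a small sketch: once a contig spans the whole insert, indels fall out of a direct comparison with the reference. The example below uses Python's standard `difflib.SequenceMatcher` as a stand-in for a proper sequence aligner. The function name and the sequences are hypothetical, and this is not how TnClone itself reports indels.

```python
from difflib import SequenceMatcher


def call_indels(reference, contig):
    """Return (kind, reference_position, length) for each insertion or deletion."""
    # autojunk=False disables difflib's heuristic that ignores frequent
    # characters, which would otherwise misbehave on long 4-letter sequences.
    matcher = SequenceMatcher(None, reference, contig, autojunk=False)
    calls = []
    for tag, r0, r1, c0, c1 in matcher.get_opcodes():
        if tag == "delete":
            calls.append(("deletion", r0, r1 - r0))
        elif tag == "insert":
            calls.append(("insertion", r0, c1 - c0))
    return calls


# Made-up 36-bp reference and a contig carrying an 8-bp deletion.
reference = "ACGTTGCAAGCTTGGATCCATGCGTACCGGTTAACC"
contig = reference[:10] + reference[18:]
calls = call_indels(reference, contig)
```

Substitutions appear as `replace` opcodes and are deliberately ignored here; a production tool would use a gap-aware aligner and normalize indel positions in repetitive sequence.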

Methods

Tagmentation-based library construction. We constructed the clone library from plasmids or cell lysates containing synthetic genes encoding cas9 (average: 3734 bp, range: 2952–5064), single-chain fragment variable (scFv) antibody (average: 770 bp, range: 732–797), and intein (average: 506 bp, range: 479–554). Tn5 assembly with annealed ME double-stranded (MEDS) sequences was performed as previously described (3) with custom-designed oligonucleotide barcodes (Supplementary Table 1).

Analysis of mixed samples and PCR amplicons. To test the ability of TnClone to detect variants in mixed samples containing two cas9 gene variants, each with three variant loci (mean distance between loci was 509 bp), we tested three replicates each of the following mixture ratios: 1:1, 2:1, 3:1, and 9:1. For comparison to in silico data, we generated synthetic libraries using the ART sequencing read simulator (6). We also tested whether TnClone could assemble PCR amplicons. Sequences (759 bp, 1398 bp, and 2126 bp) were amplified directly from cloned vectors (pCMV-BE3, Addgene accession #73021). All tagmented samples were subjected to standard NGS library construction (Supporting Information).

Comparison with available assemblers. We compared TnClone to two popular de novo assembly programs, SOAPdenovo2 (4) (v 2.04-r241. 6) and SPAdes (5) (plasmidSPAdes for lysate assembly, v 3.9.0, 7), for sample validation. For this comparison, we used the same parameters (-k 63, -cov-cutoff 3.0) for all three assemblers and chose one of the best-matched assembled scaffolds. We applied strict accuracy criteria by calculating the fraction of bases that aligned perfectly to the reference sequence.

Indel detection. We evaluated TnClone’s ability to detect indels by comparison with the following four indel detection tools (using default options): UnifiedGenotyper (UG) and HaplotypeCaller (HC) (Genome Analysis Toolkit v 3.3.0 variant calling algorithms), PINDEL v 0.2.5b9, and SOAPindel v 2.1.7.17.
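As a quick aid for interpreting the mixed-sample experiment, the mixture ratios can be converted into the fraction of reads expected from the less abundant variant. This is plain arithmetic under the assumption that both variants are sampled in proportion to their input amounts; the function name is hypothetical.

```python
from fractions import Fraction


def minor_fraction(ratio):
    """Expected read fraction of the less abundant variant in a 'major:minor' mix."""
    major, minor = (int(x) for x in ratio.split(":"))
    return Fraction(min(major, minor), major + minor)


for ratio in ("1:1", "2:1", "3:1", "9:1"):
    frac = minor_fraction(ratio)
    print(f"{ratio}: minor variant expected in {float(frac):.1%} of reads")
```

Under this assumption, the 9:1 mixture is the hardest case, with the minor variant contributing one tenth of the reads at each variant locus.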


REFERENCES

(1) Sanger, F.; Nicklen, S.; Coulson, A. R. DNA Sequencing with Chain-Terminating Inhibitors. Proc. Natl. Acad. Sci. U. S. A. 1977, 74 (12), 5463–5467.

(2) Shapland, E. B.; Holmes, V.; Reeves, C. D.; Sorokin, E.; Durot, M.; Platt, D.; Allen, C.; Dean, J.; Serber, Z.; Newman, J.; et al. Low-Cost, High-Throughput Sequencing of DNA Assemblies Using a Highly Multiplexed Nextera Process. ACS Synth. Biol. 2015, 4 (7), 860–866.

(3) Picelli, S.; Björklund, A. K.; Reinius, B.; Sagasser, S.; Winberg, G.; Sandberg, R. Tn5 Transposase and Tagmentation Procedures for Massively Scaled Sequencing Projects. Genome Res. 2014, 24 (12), 2033–2040.

(4) Luo, R.; Liu, B.; Xie, Y.; Li, Z.; Huang, W.; Yuan, J.; He, G.; Chen, Y.; Pan, Q.; Liu, Y.; et al. SOAPdenovo2: An Empirically Improved Memory-Efficient Short-Read de Novo Assembler. Gigascience 2012, 1 (1), 18.

(5) Bankevich, A.; Nurk, S.; Antipov, D.; Gurevich, A. A.; Dvorkin, M.; Kulikov, A. S.; Lesin, V. M.; Nikolenko, S. I.; Pham, S.; Prjibelski, A. D.; et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol. 2012, 19 (5), 455–477.

(6) Huang, W.; Li, L.; Myers, J. R.; Marth, G. T. ART: A Next-Generation Sequencing Read Simulator. Bioinformatics 2012, 28 (4), 593–594.


ASSOCIATED CONTENT

The Supporting Information is available free of charge on the ACS Publications website (Supplementary Texts, Supplementary Figures 1-11, Supplementary Table 1).

ACKNOWLEDGEMENTS

We would like to thank members of the Bang laboratory for their helpful discussions on the manuscript and for testing the algorithm. We would also like to thank members of the Junho Chung laboratory for kindly donating cloned plasmid vectors carrying the single-chain fragment variable antibody. This work was supported by the Mid-career Researcher Program (2015R1A2A1A10055972 and 2018R1A2A1A05079172) and Bio & Medical Technology Development Programs (NRF-2016M3A9B6948494 and NRF-2018M3A9H3024850) through the National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning, and by the Korea Health Technology R&D Project through the Korea Health & Welfare, Republic of Korea (HI18C2282).

AUTHOR CONTRIBUTIONS

B.H., S.H., N.C., and H.S. performed the experiments and wrote the paper. B.H. and S.H. analyzed the data. D.B. supervised the project.

COMPETING FINANCIAL INTERESTS

The authors declare that they have no competing financial interests.




FIGURES AND FIGURE LEGENDS

Figure 1. TnClone user interface. TnClone has two windows: the main analysis window (above) and the option selection window (below). The main window shows the general inputs for TnClone. All fields are required except for the “Start sequence” and “End sequence” fields. In the options window, users can change the values of parameters required for downstream analysis. Default parameters (shown below) were based on optimal results from in-house testing.


Figure 2. Analysis of clones using TnClone software. (A) Pie charts of initially assembled candidate contigs (left) and of contigs after the final selection step (right). (B) Knee plot of coefficient of variation scores of candidate contigs for representative samples. Red arrows indicate true contigs validated by Sanger sequencing of plasmids obtained from pure bacterial cultures. If the “leap factor” between two consecutive points (i, i + 1) exceeds a certain value, the contigs ranked i or lower are considered true contigs. On the right, the ambiguous case is shown with no clear leap point. (C) Relationship between next-generation sequencing (NGS) library input and the coefficient of variation of scFv clones (n = 91). The lower the input DNA available for NGS, the greater the variance in the depth distribution of the assembled contigs, leading to a higher number of incorrectly assembled contigs. (D) TnClone analysis of PCR amplicons. We used a sliding window approach to evaluate assembly in two replicate experiments and found that a 30-bp flanking sequence is required for accurate assembly.
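The selection rule in panel (B) can be sketched in a few lines. The caption does not define the leap factor numerically, so this example assumes it is the ratio between consecutive coefficient-of-variation scores after sorting in ascending order; the threshold of 2.0 and the function name are likewise hypothetical.

```python
def select_true_contigs(cv_scores, leap_threshold=2.0):
    """Return indices of contigs ranked at or before the first large leap in CV.

    An empty list signals the ambiguous case in which no leap exceeds the threshold.
    """
    order = sorted(range(len(cv_scores)), key=lambda idx: cv_scores[idx])
    ranked = [cv_scores[idx] for idx in order]
    for i in range(len(ranked) - 1):
        # Assumed leap factor: ranked[i + 1] / ranked[i], written as a
        # multiplication so that a score of zero cannot cause a division error.
        if ranked[i + 1] > leap_threshold * ranked[i]:
            return order[: i + 1]
    return []


# Made-up scores: two low-CV candidates followed by a clear leap.
clear_case = select_true_contigs([0.9, 0.2, 0.25, 1.1])
ambiguous_case = select_true_contigs([0.5, 0.6, 0.7])
```

Returning an empty list for the ambiguous case leaves the decision to the user, which mirrors the "no clear leap point" situation shown on the right of the panel.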


Table 1. Cost analysis. The amount of sequencing data (0.001%) was calculated based on a 3,000-bp gene assembly with sufficient depth of coverage (300×) for conventional analysis of purified clones. Sanger sequencing costs $12 for accurate sequencing of a 2,000-bp gene (i.e., typically $3 per 1,000-bp single read per reaction; both forward and reverse sequencing are required). The top table outlines the processing costs required to sequence the individual plasmids listed in the table below. Shapland et al. (2) analyzed 4,000 plasmids at a cost per plasmid of $2.68. Most accessory items, such as pipette tips and PCR tubes, did not differ markedly between their study and our current work. Any discrepancies are likely due to cost differences between suppliers. Most of the observed price difference relates to Tn5 (Nextera reagent vs. in-house Tn5) and the selected sequencing platform. The previous method is optimized for miniaturization and relies on associated reagents for high-throughput plasmid validation using a robotic system (LabCyte Echo acoustic liquid dispensing system) that is cost-prohibitive for many laboratories. In addition, the previous study did not analyze direct cell lysates. Our platform excludes plasmid purification (highlighted in bold), a limiting step in many laboratories, and has an associated total cost of $0.97/clone, a >12-fold cost reduction compared to the Sanger method. For low-copy vectors, adding a colony PCR step before tagmentation is highly recommended to give more reliable results.
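The arithmetic behind the cost comparison can be checked with a short sketch. The per-read Sanger price, the per-clone TnClone cost, the 3,000-bp target, and the 300× depth come from the caption above; the function names and the 150-bp read length are assumptions made for illustration.

```python
import math

SANGER_COST_PER_READ = 3.00    # USD per ~1,000-bp single read (Table 1 caption)
TNCLONE_COST_PER_CLONE = 0.97  # USD per clone (Table 1 caption)


def sanger_cost(insert_bp, read_bp=1000):
    """Cost of covering an insert in both directions with tiled Sanger reads."""
    reads_per_direction = math.ceil(insert_bp / read_bp)
    return 2 * reads_per_direction * SANGER_COST_PER_READ


def reads_for_depth(target_bp, depth, read_bp):
    """Number of reads needed to reach a mean depth over a target of given length."""
    return math.ceil(target_bp * depth / read_bp)


cost = sanger_cost(2000)                     # four reads at $3 each
fold = cost / TNCLONE_COST_PER_CLONE         # cost ratio, Sanger vs. TnClone
n_reads = reads_for_depth(3000, 300, 150)    # 150-bp reads are an assumption
```

With these inputs the Sanger cost for a 2,000-bp gene is $12 and the ratio to $0.97 per clone is slightly above 12, consistent with the >12-fold reduction stated in the caption.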


Table 2. Comparison of real and simulated datasets for mixed clone analysis. We mixed two cas9 variants at different ratios. The contigs have two heterogeneous (G/A, T/C) sites that could generate up to four assembly chimeras (‘GT’ stands for genotype). The original genotype combination for each contig is GT and AC, respectively (‘|’ indicates the bases are phased).
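The four genotype combinations mentioned in the caption can be enumerated directly. The sketch below takes the two heterogeneous sites (G/A and T/C) and the original phased genotypes (GT and AC) from the caption; the variable names and the labels "parental" and "chimera" are illustrative choices.

```python
from itertools import product

site_alleles = [("G", "A"), ("T", "C")]  # alleles at the two heterogeneous sites
parental = {"GT", "AC"}                  # original phased genotypes of the two variants

# Every way of pairing one allele from the first site with one from the second.
haplotypes = ["".join(combo) for combo in product(*site_alleles)]

# Combinations absent from the input variants can only arise from mis-joined assemblies.
labels = {h: ("parental" if h in parental else "chimera") for h in haplotypes}
chimeras = sorted(h for h, label in labels.items() if label == "chimera")
```

Of the four possible combinations, two reproduce the input variants and the remaining two (GC and AT) would indicate that the assembler joined bases from different templates.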


Table 3. Performance comparison of results from different assemblers. Data were obtained with 50 k paired-end reads. TnClone was run with a single thread, and SOAPdenovo2 and SPAdes were run with 8 threads. The plasmidSPAdes was tested for lysate sample (intein) only for the comparison. Standard deviation is indicated in brackets.

Assembler       Sample   Average match of assembled contigs, % (SD)   Execution time, min   Peak memory usage, GB
TnClone         cas9     100 (0)                                      < 1.0                 1.8
TnClone         intein   100 (0)                                      < 1.0                 0.9
TnClone         scFv     100 (0)                                      < 1.0                 1.2
SOAPdenovo2     cas9     45.9 (42.8)                                  < 1.0                 0.25
SOAPdenovo2     intein   75.2 (10.2)                                  < 1.0                 0.1
SOAPdenovo2     scFv     59.9 (13.6)                                  < 1.0                 0.1
SPAdes          cas9     71.4 (36.5)                                  < 1.5                 14
SPAdes          intein   93.3 (8.1)                                   < 1.0                 9
SPAdes          scFv     76.6 (15.6)                                  < 1.2                 10
plasmidSPAdes   intein   48.8 (29.5)                                  < 1.0                 17
plasmidSPAdes   scFv     15.6 (19.7)                                  < 1.0                 8

*SD, standard deviation. Parameters: k=63, coverage cutoff=3
