Genotype Specification Language - ACS Synthetic Biology

Feb 17, 2016
Research Article pubs.acs.org/synthbio

Genotype Specification Language

Erin H. Wilson,† Shiori Sagawa,† James W. Weis,† Max G. Schubert,† Michael Bissell, Brian Hawthorne, Christopher D. Reeves, Jed Dean, and Darren Platt*

Amyris, Inc., 5885 Hollis Street, Suite 100, Emeryville, California 94608, United States

ABSTRACT: We describe here the Genotype Specification Language (GSL), a language that facilitates the rapid design of large and complex DNA constructs used to engineer genomes. The GSL compiler implements a high-level language based on traditional genetic notation, as well as a set of low-level DNA manipulation primitives. The language allows facile incorporation of parts from a library of cloned DNA constructs and from the “natural” library of parts in fully sequenced and annotated genomes. GSL was designed to engage genetic engineers in their native language while providing a framework for higher level abstract tooling. To this end we define four language levels, Level 0 (literal DNA sequence) through Level 3, with increasing abstraction of part selection and construction paths. GSL targets an intermediate language based on DNA slices that translates efficiently into a wide range of final output formats, such as FASTA and GenBank, and includes formats that specify instructions and materials such as oligonucleotide primers to allow the physical construction of the GSL designs by individual strain engineers or an automated DNA assembly core facility.

KEYWORDS: programming language, bio-design automation, DNA assembly

Software tools have long aided the design of DNA constructs. The earliest tools were inspired by desktop publishing and were DNA focused, with typical operations involving DNA cut and paste. This approach provided huge advances in engineering productivity and is included in many current generation tools such as Benchling, Teselagen and Vector NTI.1 More recently, synthetic biology has incorporated ideas from the electronics industry, resulting in tools that enable DNA design through the composition of modular components (genetic parts with predefined functions), usually with modern graphical interface principles such as drag and drop. Examples include Genome Compiler, Gene Designer,2 SBROME,3 TinkerCell,4 Clotho,5 iBioSim,6 GenoCAD,7 and Teselagen.

Written languages are a third approach to DNA design, inspired by the software and silicon chip design fields, in which a formal specification is written and compiled or translated into the end product: a DNA sequence. One example of such a tool is Eugene,8 which contains a vertically integrated build platform and allows composition of abstract parts (e.g., a NOT gate) while incorporating concepts from biology such as combinatorial library designs.

Most large-scale host engineering involves the use and reuse of existing native parts, with direct DNA synthesis only being used for the introduction of heterologous genes. This is motivated by the need to explore large design spaces in a cost-effective way, as DNA construction capabilities can efficiently exploit cheap construction strategies via PCR. Amyris has built hundreds of thousands of DNA constructs in the course of optimizing yeast strains to produce high levels of compounds such as artemisinic acid9 and farnesene.10 We have explored the design paradigms mentioned above, and we find that written languages are not only fast, but also the least error-prone means of large scale DNA design.

Although GSL was initially designed as a tool for the Amyris strain engineering environment, it should be useful in any similar high throughput DNA assembly operation. The inspiration for GSL syntax was the natural language used by geneticists to describe strain genotypes. Because of this, GSL syntax reflects the biologists’ prior notational conventions, which encourages adoption and allows strain engineers to quickly specify relatively simple designs. For example, the following notation from the supplementary material of Westfall et al.11 compactly conveys engineering constructs used in the artemisinin producing strain:

erg9Δ :: kanMX_PMET3‐ERG9
his3Δ1 :: hisMX_PGAL1‐ERG12_PGAL10‐ERG10

Briefly, the first design is intended to delete the native ERG9 gene while reinserting ERG9 driven by a MET3 promoter with a KAN marker. Similarly, the second Westfall et al. design will delete the native HIS3 locus, use a heterologous HIS marker, and insert genes ERG12 and ERG10 driven by the GAL1 and GAL10 promoters, respectively. GSL syntax is designed to reflect this notation style, which has helped to bring more biologists into the GSL user base.

Some other key engineering features allowed in GSL syntax include the creation of more complex designs that involve specific modifications of parts from both the native host and heterologous organisms, such as introducing precise mutations and adding or removing tags and domains. To achieve such precision, strain engineers often spend significant time defining their complex designs at the DNA sequence level. GSL still allows this direct editing of the primary sequence but facilitates the introduction of such alterations through higher-level, more abstract representations. Additionally, GSL extends the paradigm found in most current DNA languages (e.g., Eugene8) of allowing a user to manipulate an existing library of engineering parts by allowing new parts to be defined and derived from wild-type features in well annotated genomes.

Drawing on prior experience in the electronics and software fields, such as the transition from assembly language to programming languages like C, we recognize that moving users away from working directly with DNA sequences and toward more abstract representations will require transparency, time, and trust. Building a mixed language that allows the user to investigate designs across hierarchical levels will be important for motivating adoption. Finally, we recognize that a successful language must work seamlessly with construction methods that are cost-effective, and such a language must ideally allow the user to generate physical material and electronic records that are compatible with their existing workflows.

Special Issue: IWBDA 2015
Received: October 5, 2015
DOI: 10.1021/acssynbio.5b00194 ACS Synth. Biol. XXXX, XXX, XXX−XXX

RESULTS AND DISCUSSION

Overview. GSL was written to facilitate the design phase of the Amyris biofab, which is called Automated Strain Engineering, or ASE.12−14 Initially focused on the S. cerevisiae strain CEN.PK2, both ASE and GSL have expanded to include a variety of host organisms. GSL specifies DNA sequences by referring to regions relative to genomic landmarks, such as the start and stop codons of a gene in a genetically well-characterized organism, or relative to pre-existing DNA parts in the ASE database. This database is managed by internal software called Thumper, which was originally written to support Rapid Yeast Strain Engineering (RYSE).12 In RYSE, linkers (sequences of 24, 28, or 36 base pairs with high melting temperatures) were created to facilitate assembly using the CPEC method.15 Linkers enable inexpensive part reuse and facilitate the modular design mentioned above. If adjacent parts in an assembly share a linker sequence, they can be easily connected via many protocols including yeast homologous recombination, LCR,12 and Gibson Assembly.16 In practice, seamless junctions and linkers with biological functionality must also be supported.

There are three types of parts in the ASE database, related as shown in Figure 1. Rabits (RYSE-Associated Bits) typically represent a fundamental genetic element (promoter, gene, etc.) with linker sequences at each end. Stitches are DNA assemblies built by connecting selected Rabits by their linkers. Finally, Megastitches are specific pairs of Stitches with overlapping homology at the inner edges (often a selectable marker) and sequences for genomic targeting by homologous integration at the outer edges.

Figure 1. The Thumper database stores parts in a hierarchy. (A) Rabits are blocks of fundamental genetic elements. (B) Stitches are Rabits connected by their matching linkers. (C) Megastitches are a pair of overlapping Stitches that assemble in vivo and integrate at native loci.

During each cycle of DNA construction, ASE accepts hundreds of in silico designs submitted by strain engineers via Thumper, assembles the DNA according to the compiler’s design instructions, and delivers final DNA constructs to the requesting engineers.12−14 Within the context of ASE and Thumper, the GSL language is used as the source code for in silico genotype specification. This source code is then translated by the compiler to generate a list of parts, oligonucleotide primers, and other materials needed for the ASE construction process.

GSL encodes a wide range of DNA manipulation operations for precision genome editing. The language features are organized into four distinct levels, shown in Table 1, each of which can be translated into any lower level representation. For construction purposes, it is convenient to have an intermediate language (borrowing from compiler terminology) of DNA slices, where the final Level 0 sequence is composed of a series of these slices. Each slice has a distinct source and can be translated into a specific construction step, for example, a PCR operation.

Language Elements. The Level 1 language specifies DNA assemblies as semicolon-delimited lists of parts, organized left to right in the intended DNA layout. Part options include preexisting library components or any genetic element region (promoter, gene, and terminator) in a well-annotated reference genome. In addition, in-line DNA and protein sequences may be incorporated. Table 2 shows the full range of syntax operators for slicing and modifying DNA sequences. These operators allow users to define novel parts from existing genetic material. The prefix operators specify particular sequence regions relative to a locus from a default organism’s gene namespace, though alternative


Table 2. GSL Language Level 1 Operators and Example Usage

| operator | operator type | function | example |
|---|---|---|---|
| g | prefix | gene locus | gADH1 |
| p | prefix | promoter part | pERG10 |
| t | prefix | terminator part | tERG10 |
| u | prefix | upstream part | uHO |
| d | prefix | downstream part | dHO |
| o | prefix | open reading frame | oERG10 |
| f | prefix | fusible ORF, no stop codon | fERG10 |
| m | prefix | mRNA (ORF + terminator) | mERG10 |
| [] | postfix | specifies a sub slice of a gene locus | gADH1[1:~400] |
| S or E | postfix | specifies slice coordinate relative to the Start or the End of an open reading frame | gADH1[-20S:150E] |
| ! | prefix | invert sequence orientation | !pERG10 |

organism namespaces may be accessed with a namespace qualifier. The postfix operators can be used on both wild-type genes and existing parts in libraries to derive new parts. We borrowed and extended programming conventions for features not commonly included in genotypes, such as slices of DNA adapted from Python’s string slice syntax.17 Slice notation can be used to select an exact range of nucleotides or amino acids to go into a part where a predefined gene region is not appropriate. The ~ symbol in a slice range expresses an approximate preference that may vary slightly to optimize construction. Pairing an S or an E with a slice coordinate allows the use of nucleotide indices relative to the Start or End of an open reading frame. This enables users to access these gene regions without requiring prior knowledge of the exact gene length.

In addition to the operators that create standard parts in GSL, several other syntax features, described in Table 3, can reference abstract or external parts. The ### represents a selectable marker system that is defined separately from the genotype definition. When not used in conjunction with slice notation, the ~ represents a heterology block, a virtual part that modifies an adjacent protein coding region by rewriting a small number of codons to retain the protein sequence while maximizing dissimilarity of the DNA sequence. The @ symbol may be used to reference an external part that already exists in ASE’s database or to reference a previously defined variable in GSL code (see the artemisinin design example in Figure 7). Finally, a short inline sequence of custom bases or peptides can be inserted between parts.

To put some GSL syntax into context, Figure 2A shows a simple construct removing the native HO gene and placing a copy of the ERG10 gene under the control of the TDH3 promoter with a default marker. Figure 2B demonstrates a lower level manipulation of a gene using Python-like slice notation. In this example, the slice is selecting a range of nucleotides in the HMG1 gene from 1586 base pairs into the gene through approximately 200 bases after the end of the open reading frame. Four additional nucleotides are prepended to specify the start codon of the truncated HMGR protein used in Westfall et al.11
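The slice coordinates described above can be illustrated with a short Python sketch. It is not taken from the GSL compiler. It assumes that coordinates are 1-based with no position 0, that a bare number is relative to the ORF start, that -1E is the last base of the ORF, and that 1E is the first base after it. The names `Coord`, `parse_coord`, `parse_slice`, and `to_index` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:
    offset: int        # signed, 1-based offset; 0 is treated as invalid (assumption)
    anchor: str        # "S" = relative to ORF start, "E" = relative to ORF end
    approximate: bool  # True when the coordinate was written with a leading "~"


_COORD = re.compile(r"^(~?)(-?\d+)([SE]?)$")


def parse_coord(text: str) -> Coord:
    """Parse one coordinate such as '1586', '-20S', or '~200E'."""
    match = _COORD.match(text.strip())
    if match is None:
        raise ValueError(f"not a slice coordinate: {text!r}")
    approx, number, anchor = match.groups()
    offset = int(number)
    if offset == 0:
        raise ValueError("position 0 is not used in this 1-based sketch")
    # Assumption: a coordinate without S or E is relative to the ORF start.
    return Coord(offset, anchor or "S", approx == "~")


def parse_slice(text: str) -> tuple:
    """Parse the bracketed part of a slice, e.g. '[1586:~200E]'."""
    left, right = text.strip("[]").split(":")
    return parse_coord(left), parse_coord(right)


def to_index(coord: Coord, orf_length: int) -> int:
    """Convert to a 0-based index where 0 is the first base of the ORF."""
    base = 0 if coord.anchor == "S" else orf_length
    # Positive offsets skip the missing position 0; negative ones do not.
    return base + coord.offset - 1 if coord.offset > 0 else base + coord.offset
```

For a hypothetical ORF of 3000 bases, `to_index(parse_coord("1586"), 3000)` gives 1585 and `to_index(parse_coord("200E"), 3000)` gives 3199. The `approximate` flag is only recorded here; how far the compiler may move such an end point is not specified in the text.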

Table 1. GSL Encodes a Hierarchy of Design Languages, From Literal DNA Sequence to High-Level Abstract Design Specification. See Table 2 for Specific Syntax Definitions

Level 3
level definition: An abstract design that allows leeway for the compiler to choose parts. For example, a strong constitutive promoter may be specified rather than a particular promoter choice.
GSL example: gNeutral^ ; pStrong>@EC2.3.1.9
GSL example description: Using syntax like “Neutral”, “Strong”, or an EC# relieves users from needing to choose specific DNA parts and instead allows them to encode the desired function of their design. This gives the GSL compiler the flexibility to choose the specific part.

Level 2
level definition: Specifies concrete components to be used but may allow considerable freedom for the compiler to rearrange the parts during construction.
GSL example: gHO^ ; pTDH3 > gERG10
GSL example description: The HO locus, TDH3 promoter, and ERG10 gene are chosen as the specific DNA parts but the Level 2 syntax is still able to encode the functional intention to “delete HO” with “^” and “drive ERG10 with TDH3 promoter” with “>”.

Level 1
level definition: Can be translated unambiguously to a Level 0 sequence assuming some conventions.
GSL example: uHO ; pTDH3 ; mERG10 ; ### ; dHO
GSL example description: This Level 1 design uses specific upstream and downstream HO flanking parts, the TDH3 promoter, and ERG10 ORF + terminator “mRNA”. These parts are arranged in an explicit order and orientation.

Level 0
level definition: Literal DNA sequence; may contain ambiguous bases.
GSL example: ACCTTTTTTGTGCGTGTATT...
GSL example description: Raw literal DNA sequence.
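Table 1 pairs a Level 2 design with a Level 1 design that expresses the same intent. The Python sketch below shows how one such lowering step might look. The two rules are inferred only from that example pair and are not the compiler's actual rewriting rules. The function name `expand_level2` is hypothetical, and placing the default marker `###` directly before the downstream flank is an assumption copied from the Level 1 example.

```python
import re

# Hypothetical patterns for two Level 2 forms shown in Table 1.
KNOCKOUT = re.compile(r"^g(\w+)\^$")                     # e.g. gHO^
PROMOTER_DRIVE = re.compile(r"^p(\w+)\s*>\s*g(\w+)$")    # e.g. pTDH3 > gERG10


def expand_level2(design: str) -> str:
    """Rewrite 'gX^ ; pP > gG' into an explicit left-to-right Level 1 part list."""
    terms = [term.strip() for term in design.split(";") if term.strip()]
    locus = None
    cassette = []
    for term in terms:
        knockout = KNOCKOUT.match(term)
        drive = PROMOTER_DRIVE.match(term)
        if knockout:
            # The knocked-out locus supplies the upstream and downstream flanks.
            locus = knockout.group(1)
        elif drive:
            # Promoter part followed by the gene's mRNA part (ORF + terminator).
            cassette += [f"p{drive.group(1)}", f"m{drive.group(2)}"]
        else:
            raise ValueError(f"unsupported Level 2 term in this sketch: {term!r}")
    if locus is None:
        raise ValueError("this sketch needs a knockout term to choose the locus")
    return " ; ".join([f"u{locus}", *cassette, "###", f"d{locus}"])


print(expand_level2("gHO^ ; pTDH3 > gERG10"))
```

With only a knockout term, `expand_level2("gHO^")` yields `uHO ; ### ; dHO`, the same shape as the marker example listed in Table 3.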


Table 3. GSL Abstract Part and External Reference Syntax

| operator | operator type | function | example |
|---|---|---|---|
| ### | inline | marker sequence | uHO ; ### ; dHO |
| ~ | prefix, inline | approximate coordinate, heterology block | gERG10[-300E:~200E]; gYNG2[-22:678] ; ~ ; /TAC/;gYNG2[682:200E] |
| @ | prefix | reference external part or previously defined variable | @R12345, @BBa_J11053, @myStrongPromoter |
| / ... / | inline | insert custom bases | /ATGTACCGG/ |
| /$ ... / | inline | insert custom peptides | /$MYR/ |

Figure 2. Level 1 GSL design examples: (A) in this construct, the HO gene is being replaced by an ERG10 mRNA sequence driven by a TDH3 promoter attached to a default selectable marker. Native genetic elements are selected as parts using the upstream, downstream, promoter, and mRNA (open reading frame plus terminator) prefixes u, d, p, and m. The ### represents a default marker sequence. (B) Four bases of inline DNA are prepended to a slice of the HMG1 gene and serve as a start codon. This slice of HMG1 cuts out the first 1585 bases of the gene but includes the tail end of the open reading frame and about 200 bases beyond the HMG1 stop codon.
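As an illustration of how a Level 1 line such as the one in Figure 2A breaks down into parts, here is a minimal Python sketch. It covers only the single-letter prefixes of Table 2, the `!` inversion operator, and the `###` marker. It ignores slices, inline sequences, pragmas, and `@` references, and the function name `parse_level1` and the dictionary layout are hypothetical.

```python
# Prefix letters and their meanings, copied from Table 2.
PREFIXES = {
    "g": "gene locus",
    "p": "promoter",
    "t": "terminator",
    "u": "upstream",
    "d": "downstream",
    "o": "open reading frame",
    "f": "fusible ORF",
    "m": "mRNA",
}


def parse_level1(line: str) -> list:
    """Split a semicolon-delimited Level 1 design into part descriptions."""
    parts = []
    for token in line.split(";"):
        token = token.strip()
        if not token:
            continue
        if token == "###":
            # Default selectable marker, defined outside the genotype line.
            parts.append({"kind": "marker", "name": None, "inverted": False})
            continue
        inverted = token.startswith("!")
        if inverted:
            token = token[1:]
        kind = PREFIXES.get(token[:1])
        if kind is None or len(token) < 2:
            raise ValueError(f"unsupported part in this sketch: {token!r}")
        parts.append({"kind": kind, "name": token[1:], "inverted": inverted})
    return parts


for part in parse_level1("uHO ; pTDH3 ; mERG10 ; ### ; dHO"):
    print(part)
```

Keeping the parts in list order mirrors the left-to-right layout that Level 1 prescribes, so a later step could walk the list to emit slices in construct order.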

The Level 2 language elements provide simpler, higher level operations for common engineering steps, such as gene knockouts, promoter replacement, and introducing mutations. Level 2 designs give the compiler more flexibility for constructing the final DNA as they are not constrained to a strict left to right layout as in the Level 1 case. Level 2 operators are shown in Table 4.

Table 4. Level 2 Operators and Example Usage

| operator | operator type | function | example |
|---|---|---|---|
| $ | postfix | amino acid mutation | oADH1$A147E |
| * | postfix | nucleotide mutation | oADH1*G100C |
| ^ | postfix | gene knockout | gADH1^ |
| > | infix | promoter replacement or overexpression | pADH1>gERG10 |
| . | infix | namespace qualifier | Sc.pERG10 |

Roughage. We have embedded a simple language subset in GSL called Roughage, with a syntax that closely resembles the genotype notation of our users. It provides only three operations: deletion of genes, the replacement of promoters, and the insertion of promoter-gene-terminator cassettes. Despite its simplicity, Roughage is used to encode many genotype designs used at Amyris for metabolic engineering. This language was modeled very closely on traditional genetic nomenclature, with the delta symbol replaced by a caret (^) and the junction between a promoter and gene denoted by the > symbol. Basic Roughage examples are shown in Figure 3.

Compiler Directives. Finally, the GSL compiler allows for the use of global and inline pragmas. Rather than being direct sequence manipulations, pragmas serve as extra instructions for the compiler to use a certain construction strategy. Table 5 lists a selection of pragma examples, though currently there are over 20 in existence and more are often added in response to feature requests from innovative users.

Table 5. Example Pragma Directives to Alter Compiler Behavior

#stitch
function: construct design as a single piece Stitch instead of a Megastitch with a split marker
example: #stitch uHO ; pTDH3 ; mERG10 ; dHO

#linkers
function: specify which linkers to use at each Rabit junction
example: #linkers 0,2,A,3,9 | 0,9

#fuse
function: inline pragma to create a seamless junction between two Rabits (no linker)
example: pTDH3 {#fuse } ; mERG10

#primermax
function: maximum primer length available for construction
example: #primermax 60

#refgenome
function: use alternative reference genome
example: #refgenome Klactis

Rewriting Designs Enables Users to Innovate. The GSL compiler is implemented as a hierarchy of rewriting rules. High-level designs are iteratively translated into lower level GSL lines that are more detailed and specific. Compilation typically proceeds directly to the target output format, but the user has the option of inspecting the intermediate translations. In addition to providing transparency for end users, this option has proven to be a driver of innovation, as users often discover ways of hacking the intermediate compiler output to implement new ideas. Such hacking has revealed new design paradigms and provided guidance for implementing them at higher levels of GSL.

Target Platforms. Carefully separating design and implementation allows the compiler to take different implementation routes as it translates input into lower-level designs, according to the capabilities available. For example, selection for the desired genotypic change can use either drug markers or the repair of a double-strand break introduced by a nuclease.18 When submitting designs to Amyris’ ASE platform, GSL targets an XML dialect called RYCOD (RYSE COmponent Description language), which specifies the parts

Figure 3. Example designs using Roughage syntax. (A) Knock out of HO locus. (B) Replacing ERG9 promoter with MET3 promoter. (C) Replacing the HIS3 gene with a cassette of GAL1 promoter driving ERG12 gene.


Figure 4. Level 1 GSL for implementing promoter titration.

Figure 5. GSL compiler’s iterative expansion steps for allele swap designs. A) Allele swap syntax intended to change the 227th amino acid Cysteine to a Tyrosine in gene YNG2. B) First expansion step inserts an inline TAC between 2 flanking regions of YNG2 and a ∼ to form a heterology block immediately before the mutation. C) The full heterology block is expanded into the upstream flanking region. D) An alternative compiler output to construct the mutation using a nuclease.
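The first expansion step in Figure 5 (A to B) can be approximated in a few lines of Python. This sketch reproduces only the shape of the allele swap example given in Table 3, `gYNG2[-22:678] ; ~ ; /TAC/;gYNG2[682:200E]`. The flank coordinates (-22 and 200E) are taken from that example and exposed as parameters. The preferred-codon table is a hypothetical stand-in for real codon usage data, and marker placement, heterology block expansion, and checking of the original amino acid are all omitted.

```python
import re

# Hypothetical preferred codons; a real run would read the host's codon usage table.
PREFERRED_CODON = {"A": "GCT", "C": "TGT", "E": "GAA", "Y": "TAC"}

# Matches amino acid mutations written on a gene locus, e.g. gYNG2$C227Y.
MUTATION = re.compile(r"^g(\w+)\$([A-Z])(\d+)([A-Z])$")


def expand_allele_swap(design: str, upstream: int = 22, downstream: int = 200) -> str:
    """Rewrite 'gGENE$<old><pos><new>' as flank ; heterology block ; codon ; flank."""
    match = MUTATION.match(design.strip())
    if match is None:
        raise ValueError(f"not an amino acid mutation: {design!r}")
    gene, _old_aa, position, new_aa = match.groups()
    # Codon n occupies nucleotides 3(n-1)+1 through 3n of the ORF (1-based).
    first_nt = 3 * (int(position) - 1) + 1
    last_nt = first_nt + 2
    codon = PREFERRED_CODON[new_aa]
    left = f"g{gene}[-{upstream}:{first_nt - 1}]"
    right = f"g{gene}[{last_nt + 1}:{downstream}E]"
    return f"{left} ; ~ ; /{codon}/ ; {right}"


print(expand_allele_swap("gYNG2$C227Y"))
```

Codon 227 spans nucleotides 679 to 681, which is why the left flank ends at 678 and the right flank starts at 682.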

Figure 6. GSL for introducing a degenerate nucleotide at amino acid position 100 of Rabit 12345.
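To make the effect of a degenerate codon concrete, the following Python sketch enumerates the codons covered by NDT and the amino acids they encode. It uses the standard genetic code and the IUPAC meanings of N (any base) and D (A, G, or T). It is a standalone illustration and does not reflect how the GSL compiler handles degenerate bases. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Subset of the IUPAC nucleotide codes needed for NDT.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "D": "AGT"}


def expand_degenerate(codon: str) -> list:
    """List every concrete codon matched by a degenerate codon such as 'NDT'."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in codon))]


def encoded_amino_acids(codon: str) -> list:
    """Sorted one-letter amino acids encoded by the degenerate codon."""
    return sorted({CODON_TABLE[c] for c in expand_degenerate(codon)})


print(len(expand_degenerate("NDT")), "".join(encoded_amino_acids("NDT")))
```

NDT expands to 12 codons that encode 12 different amino acids with no stop codon, so every variant in the resulting pool carries a substitution from that reduced set.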

Figure 7. Example yeast design for terpene production.

and reagents ASE would need for construction (further details in Implementation section). Although scientists at Amyris primarily use GSL to create designs that will be built by the ASE platform, occasionally they want to specify constructs that are compatible with the GSL compiler but are not yet compatible with current ASE protocols. For example, ASE cannot currently make a Rabit composed of several seamlessly fused parts. In these cases, engineers may program their designs in GSL, but instead of compiling to the standard ASE output (RYCOD), they can ask the compiler to output instructions for generating the reagents needed to make the desired DNA sequence themselves, so that they can drop it into a downstream step of ASE’s workflow.

In addition to different construction platforms, our users routinely target different species as platforms and can often reuse substantially similar code to engineer a range of hosts.18 This will become an increasingly critical feature as we seek to produce hundreds of different molecules at scale using a range of hosts.

Example Applications. We outline here some simple applications of GSL.

Promoter Titration. Replacing the native promoter of a gene with an alternative promoter to reduce or increase expression is a routine operation in metabolic engineering. Figure 4 shows a Level 2 promoter titration placing the ADH1 promoter in front of the ERG10 gene, and two possible implementation outputs. Output 1 places a marker upstream of the inserted promoter and generates flanking sequences upstream of the locus and matching the first 500 bases of the gene. This will efficiently incorporate via homologous recombination to place the ADH1 promoter upstream. Output 2 shows a similar implementation with no marker. In this scenario, 100 base pairs of the native promoter are removed, and a nuclease matching that region will be generated to drive incorporation without a marker.

Allele Swaps. Designing DNA to efficiently incorporate mutations into a genome can be time-consuming. GSL solves this by providing a compact syntax for specifying mutations. Figure 5 shows an input mutation (A) and each iteration of


Figure 8. Example RYSE construction with three Rabits and associated primers. Each Rabit is built either by amplifying a region from a template (genomic DNA or a template in ASE storage) using two or four primers or via de novo synthesis. Solid boxes represent PCR products while pink junction regions and black linkers in line B are incorporated into the tails of the amplification primers for the PCR products. Line C shows the final assembly.

output from the compiler as it expands to a final construct. In this design, the compiler builds an initial construct with the mutation TAC introduced between two flanking regions in the gene. A marker is placed at the 3′ end of the locus and a heterology block is introduced to the left of the mutation. The heterology block is expanded in Figure 5C and merged with the mutation to rewrite a section of the gene using alternative codons. The use of the heterology block increases the efficiency with which homologous recombination includes the mutation and provides a convenient barcode for assaying the change. We note that introducing the control design gYNG2$C227C (no amino acid change) is important to measure any side effects from the engineering. Figure 5D shows an alternative compiler output if nucleases are available. In this design path, no marker is required, but the selection of the cut site and design of the heterology block are coordinated to ensure erasure of the cut site postmutation incorporation. This approach was used for the introduction of 5 simultaneous mutations in a single transformation.18

Enzyme Saturation Mutagenesis. GSL is an efficient language for performing site-directed mutagenesis of enzymes. The approaches used for host genome engineering apply equally well to part manipulation. Here in Figure 6 a user is taking a library part, Rabit 12345, and creating a degenerate PCR product to produce a range of amino acid substitutions at amino acid 100 in the protein using the NDT sequence between two sections of the enzyme.

Artemisinin Design. We conclude with a brief description in Figure 7 of the core pathway design used to synthetically produce the antimalarial drug artemisinin, as described in Westfall et al.11 This strain design is widely cited as a powerful demonstration of the potential of synthetic biology, but reproducing the design from the text in the paper is not straightforward, as with most methods sections. However, the GSL code in Figure 7 can be compiled directly into the DNA reagents needed to construct the base yeast strain.

Issues. In practice, the largest issue with engineering has been the effective management of large namespaces, as users edit and reintroduce variants of parts into the library. Some parts may contain the same sequence but with different linkers, or one version of the part may have a specific mutation introduced. These variants are all valid parts in ASE’s database and should be easily distinguishable and searchable. Additionally, there is a natural learning curve moving from more traditional DNA design software into a written design language, particularly if no programming language syntax is familiar to the user. However, this barrier is often overcome once users discover the speed with which they can create new designs.

Ongoing Work. Level 1 and Level 2 language features are routinely in use at Amyris for the vast majority of designs. Strain engineers are generally willing to trust the compiler with many construction details and rarely have to inspect the actual DNA in the parts to check results. We are entering a phase where the more abstract Level 3 features will become important. It will require more trust to allow the language to automatically select parts from abstract descriptions (e.g., pStrong to select a generic strong promoter, gNeutral to choose an integration site, @ECx.y.z to select an enzyme). We believe that this direction will enable greater productivity and ultimately more accurate designs. GSL has been used across a range of species, but we anticipate adding prokaryotic operators for features such as ribosome binding sites, or better intron support for species where this is a more prevalent phenomenon. We note also that GSL is a useful target language for other DNA languages such as Eugene that have powerful features for describing part order and layout and may wish to interface with GSL as a build language. Finally, we believe that users still think predominantly in terms of the final linear or circular DNA construct, and that a more pathway-centric workflow would lead to further increased productivity.



METHODS

Compiler Architecture. The GSL compiler gslc is implemented using a standard Look-Ahead Left Right (LALR) approach to generate a parse tree from GSL input lines. Each line represents a specific genotype, compiler directive, part alias, or function definition. The parse tree is examined for elements that can be expanded. At each iteration, one type of element is expanded (e.g., protein sequences are translated into codon optimized DNA), and the entire tree is re-emitted in GSL. When no further elements may be translated into lower level GSL, the remaining parts are converted into a DNA slice tree. A DNA slice may be an explicit DNA sequence, a contiguous section of DNA from an underlying genome, a linker sequence, etc. This tree of parts represents the intermediate language.

Reference Data. The core part library for GSL is the genome itself. The GSL compiler minimally requires the sequence in FASTA format and gene coordinates to derive locus parts. Promoters and terminators are functionally defined as DNA within a defined distance of the open reading frame, but they can also be supplied as auxiliary data when their size has been experimentally determined. In addition, a codon usage table, when provided, enables protein-to-DNA translation and heterology block creation. Additional genome specific parameters, such as optimal flanking sequence length for homologous recombination, may also be specified. The Amyris Rabit collection is available to GSL programs via a RYCOD data source. Additionally, we implemented Biobrick access and Synthetic Biology Open Language (SBOL)19 part consumption using the SBOL 1.0 data source http://convert.sbols.org/biobrick [currently unavailable]. Within GSL, users can access Biobrick parts with @B and the GSL compiler will retrieve the brick sequence as an SBOL 1.0 file via HTTP.

Output Targets. The compiler translates the intermediate language into various formats, including FASTA sequences, GenBank records, and Clone Manager cx5 files. We have been actively involved in developing the SBOL 2.0 standard, and we plan to support SBOL 2.0 output generation from the intermediate language in the near future as a replacement for RYCOD. Optionally, the compiler can process the constructs to insert linker sequences between certain slices and generate primer sequences for amplifying and stitching parts. The primer designer generates amplification primers that create the PCR products. The tails of the amplification primers optionally introduce linkers and extra sequences at the ends of PCR products. The short sandwich regions between the linkers and PCR amplicons are useful for introducing start or stop codons or short tags. Figure 8 shows a range of scenarios for linker-based assembly that is handled by this module. A surprisingly wide range of designs can be constructed with sets of cheap oligonucleotides less than 80 base pairs long, resulting in cost-effective construction. Using the same system, Ligase Cycling Reaction reagents were also generated.12 RYCOD formalizes the bill of materials for ASE construction with a three-part hierarchy of Megastitches, Stitches, and Rabits. Rabits, at the lowest level, contain two terminal primers and optional internal primers for two-piece Rabits, for example, Rabit 2 in Figure 8. The XML schema definition for RYCOD and example XML are included at https://github.com/Amyris/gsl_paper for reference.

Implementation. The compiler is implemented in the F# programming language20 using fsyacc and fslex for the parser [http://fsprojects.github.io/FsLexYacc/]. XML handling is streamlined with the XML type provider [http://fsharp.github.io/FSharp.Data/library/XmlProvider.html], enabling SBOL and RYCOD implementations using only a few lines of code. Efficient searches for suitable nuclease recognition sites are performed with a suffix tree of genomic sequence. The compiler can consider all 1 or 2 base pair variants of a candidate site efficiently using this data structure. Codon optimization is performed using a genetic algorithm. The objective fitness function penalizes the presence of a set of target restriction sites and 5′ transcript structure. Mutation is performed by stochastically sampling preferred codons. Oligonucleotide creation requires accurate estimation of melting temperature. An informal survey of publicly available oligo melting temperature calculators revealed a wide discrepancy in accuracy of implementations, and so we include the melting temperature code for reference at https://github.com/Amyris/gsl_paper. This GitHub repository contains the following additional material: oligo melting temperature calculation code in F# and a RYCOD XML example in text form. The source code for the GSL compiler itself is available at https://github.com/Amyris/gsl. We would be happy to collaborate to ensure the language and compiler are widely used.
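The Reference Data rule that a promoter is the DNA within a defined distance of the open reading frame can be sketched in Python as follows. This is an illustration only. The 500 bp default, the 0-based half-open coordinate convention, and the function names are assumptions and are not taken from the GSL compiler (which is written in F#).

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_region(chromosome: str, orf_start: int, orf_end: int,
                    forward: bool, length: int = 500) -> str:
    """Return up to `length` bases immediately 5' of an ORF, in gene orientation.

    orf_start and orf_end are 0-based, half-open coordinates on the forward
    strand (an assumption of this sketch). The result is clipped at the
    chromosome ends.
    """
    if forward:
        begin = max(0, orf_start - length)
        return chromosome[begin:orf_start]
    # For a reverse-strand gene the promoter lies past orf_end on the forward strand.
    end = min(len(chromosome), orf_end + length)
    return reverse_complement(chromosome[orf_end:end])


toy = "AAACCCATGGGGTAATTT"  # toy sequence; the ORF ATGGGGTAA occupies [6, 15)
print(promoter_region(toy, 6, 15, forward=True, length=4))
```

A terminator could be derived the same way from the opposite side of the ORF, and an experimentally determined size would simply replace the `length` argument.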



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Author Contributions

†E.H.W., S.S., J.W.W., and M.G.S. contributed equally to this work. D.P. and J.D. conceived the GSL and Roughage languages. D.P., E.H.W., S.S., J.W.W., and B.H. implemented the compiler. M.B. and B.H. designed the RYCOD standard and advised on architecture. C.R. and M.S. developed genetic engineering strategies implemented in the compiler. D.P., M.B., C.R., M.S., S.S., E.H.W., and J.W.W. wrote the paper.

Notes

The authors declare the following competing financial interest(s): The corresponding author and several current or former Amyris employees who are co-authors own Amyris stock.



ACKNOWLEDGMENTS

This work was funded entirely by Amyris. We would like to thank the Amyris community for nurturing and challenging the language as it developed, in particular Sunil Chandran, Victor Holmes, Andrew Horwitz, Hanxiao Jiang, Quinn Mitrovich, Jeff Ubersax, Elaine Shapland, Jared Wenger, Jessica Walter, and Gale Wichmann. Maxime Durot and Yang Zhang contributed to key library code used in the compiler. Chris Macklin, Amoolya Singh, and Jeff Ubersax provided valuable feedback on the manuscript.



REFERENCES

(1) Lu, G., and Moriyama, E. N. (2004) Vector NTI, a balanced all-in-one sequence analysis suite. Briefings Bioinf. 5, 378−388.
(2) Villalobos, A., Ness, J. E., Gustafsson, C., Minshull, J., and Govindarajan, S. (2006) Gene Designer: a synthetic biology tool for constructing artificial DNA segments. BMC Bioinf. 7, 285.
(3) Huynh, L., Tsoukalas, A., Koppe, M., and Tagkopoulos, I. (2013) SBROME: a scalable optimization and module matching framework for automated biosystems design. ACS Synth. Biol. 2, 263−273.
(4) Chandran, D., Bergmann, F. T., and Sauro, H. M. (2009) TinkerCell: modular CAD tool for synthetic biology. J. Biol. Eng. 3, 19.
(5) Xia, B., Bhatia, S., Bubenheim, B., Dadgar, M., Densmore, D., and Anderson, J. C. (2011) Developer’s and user’s guide to Clotho v2.0 A software platform for the creation of synthetic biological systems. Methods Enzymol. 498, 97−135.
(6) Myers, C. J., Barker, N., Jones, K., Kuwahara, H., Madsen, C., and Nguyen, N. P. (2009) iBioSim: a tool for the analysis and design of genetic circuits. Bioinformatics 25, 2848−2849.
(7) Czar, M. J., Cai, Y., and Peccoud, J. (2009) Writing DNA with GenoCAD. Nucleic Acids Res. 37, W40−47.
(8) Bilitchenko, L., Liu, A., Cheung, S., Weeding, E., Xia, B., Leguia, M., Anderson, J. C., and Densmore, D. (2011) Eugene–a domain specific language for specifying and constraining synthetic biological parts, devices, and systems. PLoS One 6, e18882.
(9) Paddon, C. J., Westfall, P. J., Pitera, D. J., Benjamin, K., Fisher, K., McPhee, D., Leavell, M. D., Tai, A., Main, A., Eng, D., Polichuk, D. R., Teoh, K. H., Reed, D. W., Treynor, T., Lenihan, J., Fleck, M., Bajad, S., Dang, G., Dengrove, D., Diola, D., Dorin, G., Ellens, K. W., Fickes, S., Galazzo, J., Gaucher, S. P., Geistlinger, T., Henry, R., Hepp, M., Horning, T., Iqbal, T., Jiang, H., Kizer, L., Lieu, B., Melis, D., Moss, N., Regentin, R., Secrest, S., Tsuruta, H., Vazquez, R., Westblade, L. F., Xu, L., Yu, M., Zhang, Y., Zhao, L., Lievense, J., Covello, P. S., Keasling, J. D., Reiling, K. K., Renninger, N. S., and Newman, J. D. (2013) High-level semi-synthetic production of the potent antimalarial artemisinin. Nature 496, 528−532.
(10) Sandoval, C. M., Ayson, M., Moss, N., Lieu, B., Jackson, P., Gaucher, S. P., Horning, T., Dahl, R. H., Denery, J. R., Abbott, D. A., and Meadows, A. L. (2014) Use of pantothenate as a metabolic switch increases the genetic stability of farnesene producing Saccharomyces cerevisiae. Metab. Eng. 25, 215−226.
(11) Westfall, P. J., Pitera, D. J., Lenihan, J. R., Eng, D., Woolard, F. X., Regentin, R., Horning, T., Tsuruta, H., Melis, D. J., Owens, A., Fickes, S., Diola, D., Benjamin, K. R., Keasling, J. D., Leavell, M. D., McPhee, D. J., Renninger, N. S., Newman, J. D., and Paddon, C. J. (2012) Production of amorphadiene in yeast, and its conversion to dihydroartemisinic acid, precursor to the antimalarial agent artemisinin. Proc. Natl. Acad. Sci. U. S. A. 109, E111−118.
(12) de Kok, S., Stanton, L. H., Slaby, T., Durot, M., Holmes, V. F., Patel, K. G., Platt, D., Shapland, E. B., Serber, Z., Dean, J., Newman, J. D., and Chandran, S. S. (2014) Rapid and reliable DNA assembly via ligase cycling reaction. ACS Synth. Biol. 3, 97−106.
(13) Dharmadi, Y., Patel, K., Shapland, E., Hollis, D., Slaby, T., Klinkner, N., Dean, J., and Chandran, S. S. (2014) High-throughput, cost-effective verification of structural DNA assembly. Nucleic Acids Res. 42, e22.
(14) Shapland, E. B., Holmes, V., Reeves, C. D., Sorokin, E., Durot, M., Platt, D., Allen, C., Dean, J., Serber, Z., Newman, J., and Chandran, S. (2015) Low-Cost, High-Throughput Sequencing of DNA Assemblies Using a Highly Multiplexed Nextera Process. ACS Synth. Biol. 4, 860−866.
(15) Quan, J., and Tian, J. (2009) Circular polymerase extension cloning of complex gene libraries and pathways. PLoS One 4, e6441.
(16) Gibson, D. G., Young, L., Chuang, R. Y., Venter, J. C., Hutchison, C. A., 3rd, and Smith, H. O. (2009) Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343−345.
(17) Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., and de Hoon, M. J. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422−1423.
(18) Horwitz, A. A., Walter, J. M., Schubert, M. G., Kung, S. H., Hawkins, K., Platt, D. M., Hernday, A. D., Mahatdejkul-Meadows, T., Szeto, W., Chandran, S. S., and Newman, J. D. (2015) Efficient Multiplexed Integration of Synergistic Alleles and Metabolic Pathways in Yeasts via CRISPR-Cas. Cell Systems 1, 88−96.
(19) Galdzicki, M., Clancy, K. P., Oberortner, E., Pocock, M., Quinn, J. Y., Rodriguez, C. A., Roehner, N., Wilson, M. L., Adam, L., Anderson, J. C., Bartley, B. A., Beal, J., Chandran, D., Chen, J., Densmore, D., Endy, D., Grunberg, R., Hallinan, J., Hillson, N. J., Johnson, J. D., Kuchinsky, A., Lux, M., Misirli, G., Peccoud, J., Plahar, H. A., Sirin, E., Stan, G. B., Villalobos, A., Wipat, A., Gennari, J. H., Myers, C. J., and Sauro, H. M. (2014) The Synthetic Biology Open Language (SBOL) provides a community standard for communicating designs in synthetic biology. Nat. Biotechnol. 32, 545−550.
(20) Syme, D., and Margetson, J. (2008) The F# programming language. Microsoft; http://research.microsoft.com/projects/fsharp.


DOI: 10.1021/acssynbio.5b00194 ACS Synth. Biol. XXXX, XXX, XXX−XXX