A Validator and Converter for the Synthetic Biology Open Language


Article

A Validator and Converter for the Synthetic Biology Open Language Zach Zundel, Meher Samineni, Zhen Zhang, and Chris J. Myers ACS Synth. Biol., Just Accepted Manuscript • DOI: 10.1021/acssynbio.6b00277 • Publication Date (Web): 29 Dec 2016 Downloaded from http://pubs.acs.org on December 30, 2016


ACS Synthetic Biology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Zach Zundel,∗,‡ Meher Samineni,§ Zhen Zhang,‖ and Chris J. Myers§

‡Department of Bioengineering, University of Utah, 36 S. Wasatch Drive, SMBB Room 3100, Salt Lake City, UT, 84112
§Department of Electrical and Computer Engineering, University of Utah, 50 S. Central Campus Drive, MEB Room 2110, Salt Lake City, UT, 84112
‖Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Avenue, ENB 118, Tampa, FL 33620

E-mail: [email protected]

Running header A Validator and Converter for SBOL


Abstract

This paper presents a new validation and conversion utility for the Synthetic Biology Open Language (SBOL). This utility can be accessed directly in software using the libSBOLj library, through a web interface, or using a web service via RESTful API calls. The validator checks all required and best practice rules set forth in the SBOL specification document, and it reports back to the user the location within the document of any errors found. The converter is capable of translating from/to SBOL 1, GenBank, and FASTA formats to/from SBOL 2. The SBOL Validator/Converter utility is released freely and open source under the Apache 2.0 license.

The online version of the validator/converter utility can be found here: http://www.async.ece.utah.edu/sbol-validator/

The source code for the validator/converter can be found here: http://github.com/SynBioDex/SBOL-Validator/

Keywords SBOL, GenBank, FASTA, standards, validation, conversion, genetic design automation

Introduction

The Synthetic Biology Open Language (SBOL) is an emerging standard for the expression of both structural and functional data for biological constructs used in genetic circuit designs (1–3). The data is encoded in an RDF/XML format that allows for hierarchical representation of these constructs. As with all data standards, files encoding SBOL data must be validated according to a set of validation rules that describe correct formatting and encoding, as well as a minimum acceptable set of information which represents a complete and correct construct. Furthermore, it is useful to be able to convert between data stored in the SBOL format and other commonly used formats. Our validator has functionality to convert between SBOL versions 1 and 2, GenBank, a commonly used format for annotated DNA sequences, and FASTA, a commonly used format for DNA sequences.

Our validator/converter utility is implemented within version 2.1 of libSBOLj, a native Java library that provides an interface for SBOL data objects (4). It can also be accessed through a web form and a web service using RESTful API calls that invoke this Java library to perform the requested operation. This validator/converter utility provides a single, shared implementation of validation and conversion that existing tools (a list of SBOL-compliant tools can be found at http://sbolstandard.org/software/tools/) can use, without requiring tool developers to create, test, and maintain their own validation and conversion routines.

SBOL is an emerging synthetic biology standard that takes a designer’s perspective. Figure 1 illustrates the added features provided by SBOL 2 over earlier data formats. In particular, previous data formats are limited to only a flat description of known DNA sequences. For example, FASTA is designed to express solely sequence data, with space only for a general description of the entire sequence. GenBank provides slightly more information, allowing users to encode regions with general functionality information, but it is still a flat representation capable of only describing complete, known DNA sequences. SBOL 1, in addition to moving to an RDF/XML format suitable for semantic web data representation, added support for hierarchical constructs along with the ability to express DNA components with unknown sequences. The ability to express hierarchical components whose implementation details are yet to be determined is a principle taken from other fields of engineering design; it allows for abstraction in genetic design, which facilitates the development of genetic design automation (GDA) software.

At about the same time, SBOL Visual 1 (SBOLv 1) was also introduced to provide a standard set of symbols to use in diagrams of genetic designs (5). SBOL 2 adds support for describing non-DNA components, such as RNAs, proteins, small molecules, and complexes, the interactions between these components, and the organization of them within modules (2, 3).


[Figure 1 image: four panels labeled FASTA, GenBank, SBOL 1, and SBOL 2. The FASTA and GenBank panels show one flat DNA sequence; the SBOL 1 and SBOL 2 panels show the sequence split into Promoter, RBS, CDS, and Terminator components.]

Figure 1: The evolution of genetic data standards, showing the increase in functional data that can be encoded alongside sequence, or structural, data. In particular, FASTA only describes raw DNA sequences. GenBank adds the ability to annotate the locations of specific genetic features. SBOL 1 adds the ability to express components hierarchically, as well as components without specified sequences. Finally, SBOL 2.0 adds the ability to describe non-DNA components, such as RNAs, proteins, small molecules, and complexes, as well as the interactions between them and their composition within modules. Note that the SBOL 1.1 image uses SBOLv 1.0 symbols, while the SBOL 2.0 image uses these, as well as symbols being proposed for SBOLv 2.0. This figure has been adapted and modified from a similar figure in doi:10.2390/biecoll-jib-2015-272.


Recently, ACS Synthetic Biology announced that they are recommending the use of SBOL by their authors to capture the data for expression and visualization of their genetic designs (6). Their proposed workflow for their authors is shown in Figure 2. In this workflow, authors are encouraged when submitting their article to also submit their genetic circuit design files produced using the GDA tools of their choice. Currently, these GDA tools produce genetic design information in a variety of formats, most commonly GenBank, FASTA, and SBOL. The authors can then process their design files using our validator/converter utility to produce a valid SBOL 2 document. This document should then be deposited in a repository that supports SBOL 2, such as JBEI’s ICE repository (7) or SBOL Stack (8). The repository may provide an SBOLv image that the authors are recommended to include in their article. When the article is published, a link to the SBOL data stored in the public repository is released along with the manuscript on the ACS Synthetic Biology website. Clearly, the proposed validator/converter plays a critical role in this process, since many tools either do not yet support SBOL natively, or their SBOL support has not been thoroughly validated for compliance with the SBOL validation rules (2).

The validator/converter serves as a hub in the SBOL workflow. First, it enables integration of existing tools that do not yet support the SBOL data standard by offering conversion of common genetic data formats to the SBOL standard. Second, it enables library and tool developers to check the efficacy of their software by validating the output against the SBOL standard validation rules, and it allows for the comparison of a file to another file, perhaps one generated by a tool with known validity. This comparison can be used to ensure not only that the document produced by the tool under test is valid, but also that it expresses the entirety of the data that the user intends. Third, it speeds the development of new libraries by allowing developers to skip implementation of native validation methods and simply utilize an API for the purposes of validation.

The SBOL validator was inspired by a similar utility developed for the Systems Biology Markup Language (SBML). The web-based validator for SBML made data validation much easier for the end-user than downloading a library and learning an API to validate documents. This lowered the adoption cost for SBML, and the objective of the SBOL validator/converter is to do the same for SBOL, while also providing useful conversion utilities. Furthermore, the idea for a converter for SBOL is based in part on the Systems Biology Format Converter (SBFC), an online utility that allows for conversion between SBML, BioPAX, GPML, and related formats (9).

[Figure 2 image: workflow diagram with the elements AUTHOR, GDA Tools, SBOL/FASTA/GenBank files, Validator/Converter, Valid SBOL 2, SBOL Repository, and Article.]

Figure 2: The prototypical workflow for including SBOL design information in ACS Synthetic Biology. When submitting their article, the author should also produce their genetic design files created by the GDA tool of their choice. These files may be in a variety of formats, including GenBank, FASTA, SBOL 1, or SBOL 2, and produced by a wide variety of GDA tools. Therefore, the next step is to use the validator/converter utility described in this paper to convert their files to SBOL 2, if necessary, and validate that they meet all validation rules. The resulting SBOL 2 file is then deposited in an SBOL repository for private access by the reviewers. The author is also provided an SBOLv image that they are encouraged to include in their article. Finally, upon publication, the link to the data stored in the repository is published alongside the article.

Results and Discussion

libSBOLj library: To date, several library implementations of the SBOL data model have been developed that provide an interface for developers to interact with SBOL data objects. These include a native Java library (libSBOLj) (4), a C/C++ library (libSBOL), a Python library (pySBOL), and a JavaScript library (sboljs). A unique feature of libSBOLj is that it also provides routines for validation, conversion, and comparison. Therefore, Java GDA software developers using libSBOLj have direct access to these routines, enabling them to validate the SBOL data that they create, to convert from/to alternative data formats, and to compare SBOL 2 files. In particular, the validation routines check that an SBOL file adheres to all the rules laid out in the SBOL 2 specification (2). The conversion routines enable translation from/to FASTA, GenBank, and SBOL 1 data formats to/from the SBOL 2 data format. Finally, the comparison routines enable the data content of two SBOL 2 files to be compared, with any differences reported back to the user.

It should be noted that when the library is used to create SBOL content, it immediately throws an exception, whenever possible, when the user attempts to create invalid content. One example of this type of rule is the validation rule sbol-10204, which states that the displayId field “MUST be composed of only alphanumeric or underscore characters and MUST NOT begin with a digit”; this is checked whenever this field is set. Some validation rules, though, can only be checked after the entire SBOL file is created, so the library also includes a validation routine that can check all the rules on a complete SBOL file. An example of this type of rule is rule sbol-10605, which essentially states that there can be no cycles in a design’s component hierarchy structure (i.e., a ComponentDefinition should not include another that occurs higher up in the hierarchy graph).

Online web-based tool: For those developing their SBOL content using other means, we also provide an online web-based validator with a simple interface for uploading files for validation, conversion, and comparison. It is a web wrapper around the command-line validation routines in libSBOLj, so it supports all the various options provided by libSBOLj. The interface, shown in Figure 3, has been designed to scale well for both desktop and mobile browsers. It also performs basic sanity checks on validation requests before submitting them to be run by the server. These checks spare users from diagnosing and solving common request errors, such as attempting to convert from GenBank to SBOL 2 without including a URI prefix.

RESTful API: The validator/converter also provides a RESTful API for computational access (10). This allows developers to integrate validation into their tools whether or not the libraries or implementation they use for SBOL include validation routines. It is built in the common RESTful web paradigm for ease of integration. The RESTful API allows the user to control all of the options that are available on the web interface using a web request, so there is no functionality lost by using the programmatic web interface rather than the graphical web interface. This is especially useful to developers of other SBOL libraries, such as libSBOL, pySBOL, and sboljs, that do not currently natively support validation or conversion.

Discussion: The SBOL Validator/Converter is a critical element within an ecosystem in which SBOL is the lingua franca of genetic data, as depicted in Figure 4.
Repositories that either store SBOL natively, like SBOL Stack (8), or in a format that is converted to SBOL, such as NCBI GenBank, can be utilized via the SBOL validator/converter to provide data to various GDA tools. While these data repositories and GDA tools may be SBOL-compliant, there are numerous legacy resources and software tools that can also be integrated by leveraging conversion to/from SBOL. The validator/converter utility can be utilized again to pass the new data produced by the GDA tools back to the data repositories. Therefore, at all stages, SBOL is used as the medium of communication, which can be validated to ensure that the data is correctly and usefully encoded.

Figure 3: The web interface for the SBOL Validator/Converter. The validator/converter accepts three main groups of input: a file for validation/conversion, an optional file for comparison with the first file, and a range of options. The files can be uploaded or pasted into the form, and their file types are determined automatically.
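The automatic file-type detection and the pre-submission sanity checks described for the web interface can be illustrated with a short sketch. This is not the validator's actual implementation: the function names, the `"SBOL2"` label, and the first-line heuristics are assumptions made for illustration. Only the rule that GenBank or FASTA input needs a URI prefix when converting to SBOL 2 comes from the text.

```python
def detect_format(text):
    """Guess the input format from its first non-blank line (heuristic, assumed)."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith((">", ";")):
            return "FASTA"      # FASTA descriptors start with '>' or ';'
        if line.startswith("LOCUS"):
            return "GenBank"    # GenBank records open with a LOCUS line
        if line.startswith("<"):
            return "SBOL"       # RDF/XML; SBOL 1 vs SBOL 2 is not distinguished here
        break
    return "unknown"


def check_request(main_text, output_format, uri_prefix=None):
    """Return a list of problems; an empty list means the request may be sent."""
    problems = []
    source = detect_format(main_text)
    if source == "unknown":
        problems.append("could not determine the input file type")
    # Rule taken from the text: FASTA/GenBank to SBOL 2 needs a URI prefix.
    if source in ("FASTA", "GenBank") and output_format == "SBOL2" and not uri_prefix:
        problems.append(f"converting {source} to SBOL 2 requires a URI prefix")
    return problems
```

Running such checks on the client side means that an incomplete request is rejected with a readable message before any server time is spent on it.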

[Figure 4 image: Data Repositories (jbei-ICE, iGEM, NCBI GenBank, SBOL Stack, any repository which stores convertible data) connected through the Validator + Converter to GDA Tools (iBioSim, SnapGene, EuGene, SBOLDesigner, ApE, Geneious, Benchling, any tool which supports convertible data).]

Figure 4: The SBOL-centric ecosystem enabled by a unified validator/converter utility. This ecosystem is composed of data repositories that store data in either the SBOL format or another format that can be converted to SBOL, and the GDA tools that process this data to create new biological design constructs. The validator/converter utility glues them together by providing a seamless connection that ensures that data is processed efficiently while maintaining data validity and integrity.

Software compliance testing: The SBOL Validator, along with its comparison tool, is a critical element of a framework for testing GDA software’s compliance with the SBOL standard. This tool is already being heavily utilized to ensure that the non-Java libraries produce SBOL that is both valid and identical to that produced using libSBOLj. These tests ensure that software developed using one of these libraries produces SBOL files that can be readily exchanged with software developed using another library. Furthermore, as more tools are developed, it is possible that tool developers may choose not to use a community-developed SBOL library. They may do this for various reasons, including a desire to create their own library in a language not currently supported by the SBOL community, or for consistency with some internal data model that does not match the SBOL data model. Once again, the SBOL Validator can be leveraged to check the compliance of these tools as well.

Support for conversion between more formats: SBOL is being developed to address the issue of several competing outdated standards, such as GenBank and FASTA, that are not well suited to conveying engineering-focused synthetic biology data. As there are many such data formats, it will be useful in the future to create more converters that allow users to convert their existing data to/from data repositories and tools. One such format is FASTQ, a standard that allows for the expression of the integrity of segments of a sequence. At this time, this information could be encoded using SBOL annotations so that the conversion from FASTQ to SBOL would be lossless. Easy conversion of data between SBOL and other formats allows users of a tool that is not SBOL-compliant to generate SBOL data for publication, exchange, and use in SBOL-enabled workflows. This promotes the wider adoption of SBOL by users of the current standards that SBOL aims to supplant.

Additionally, there are already several converters between SBOL and formats designed for other purposes. It is possible to convert from SBOL to SBML, a language used to describe models for simulation (11), producing SBML annotated with SBOL URIs (12, 13). The SBOL to SBML converter enables a user to create an initial model for simulation and analysis of a synthetic biology design. To complete the SBML-SBOL conversion cycle, there is another converter that converts SBML into SBOL (14). This conversion enables SBML model construction tools to be leveraged to produce functional interaction information that can be visualized and reasoned about using SBOL-compliant software. In the future, converters between SBOL and BioPAX (15), a standard for describing biological pathways, and SBGN-ML (16), a standard for expressing graphical representations of biological systems, could be very useful as well.

Integration with existing libraries: Currently, there are four SBOL libraries under active development: Java (libSBOLj), C++ (libSBOL), Python (pySBOL), and JavaScript (sboljs). Only libSBOLj implements native validation and conversion routines. It will be useful in the future to integrate the other libraries, and any future libraries, with the validation/conversion API so that there is a single definition for validity and conversion is performed consistently. pySBOL has already implemented validation/conversion using the RESTful API, and integration is underway for libSBOL and sboljs.

Additional features: The main improvement of the SBOL Validator/Converter over the SBML Validator is offering inter-conversion between related data standards, a feature not offered by the SBML tool. This offers the significant benefit of allowing adoption of SBOL by researchers and scientists ahead of adoption by tool developers. The SBML validator offers some features which are not present in the proposed validator/converter, namely delayed validation and the ability to save validation results for repeat viewing. In addition to these features, we also intend to add highlighting of errors found within the SBOL file and links back to the SBOL specification for further clarification.

Methods

SBOL validation: SBOL validation rules describe a set of minimum requirements that a valid SBOL file should meet. These validation rules are derived from properties and constraints of the SBOL data model, inherent and/or inferred laws from naturally occurring biological entities, and best practices for encoding a biological construct using SBOL. The complete validation rules can be found in the SBOL 2.1.0 data model document (2). These rules are classified into four categories according to their degrees of strictness and their ability to be automatically checked by SBOL software libraries.

The first category groups rules that define conformance to the SBOL specification, and all the rules in this category can be checked by the software library. The second category includes rules that are weakly required: although they must be satisfied, there are situations in which they cannot be programmatically checked by an SBOL software library. These are, for example, rules that depend upon checking properties of a referenced object. Since not all referenced objects are required to be provided within a single SBOL file, they might not be easily accessible. The third category consists of recommended best practice rules. While an SBOL document not following these rules can still be valid, these rules enable an SBOL data model to be more effectively and unambiguously constructed, encoded, annotated, and exchanged. Rules in this category can be checked by the software library. The last category contains rules that are required for SBOL compliance but cannot be implemented in a software library for machine validation. Most of these rules concern the user’s intent, which in many cases is impossible to check in software. Therefore the user has to be responsible for following and validating these rules. One example of a rule like this is that if a Sequence represents DNA, then it should use the IUPAC DNA encoding (part of sbol-10406).

The validation rules have been manually implemented in libSBOLj (version 2.1.0). Specifically, SBOL validation rules are stored as a text file using the OBO file format. libSBOLj parses and then stores these rules as a map in which each key is a rule ID and each value is the corresponding rule description. Violation of a required rule, provided it is at least partially machine-checkable, causes an SBOL validation exception to be thrown. Detection of possible violations of a machine-checkable rule is embedded in all relevant locations in libSBOLj, including reading an SBOL file and constructing an SBOL object using the library API.
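The mechanism described above can be illustrated with a minimal Python sketch. It is not libSBOLj code: the names `RULES`, `ValidationError`, `check_display_id`, and `find_cycle_errors` are hypothetical, the rule descriptions are paraphrased from this paper, and "alphanumeric" is assumed to mean ASCII letters and digits. The sketch shows a rule map keyed by rule ID, an eager check of the kind used for sbol-10204, and a whole-document check of the kind needed for sbol-10605.

```python
import re

# Illustrative subset of the rule map (rule ID -> description), paraphrased.
RULES = {
    "sbol-10204": "displayId MUST be composed of only alphanumeric or "
                  "underscore characters and MUST NOT begin with a digit",
    "sbol-10605": "a ComponentDefinition must not contain itself anywhere "
                  "in its component hierarchy",
}


class ValidationError(Exception):
    """Hypothetical exception carrying the rule ID and the offending URI."""

    def __init__(self, rule_id, uri):
        self.rule_id = rule_id
        self.uri = uri
        super().__init__(f"{rule_id}: {RULES[rule_id]} (object: {uri})")


_DISPLAY_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # ASCII-only assumption


def check_display_id(display_id, uri):
    """Eager check: meant to run whenever a displayId is set."""
    if not _DISPLAY_ID.fullmatch(display_id):
        raise ValidationError("sbol-10204", uri)


def find_cycle_errors(children):
    """Whole-document check. `children` maps a definition URI to the URIs of
    the definitions it instantiates; one error is returned for every
    definition that can reach itself through the hierarchy."""
    errors = []
    for root in children:
        stack, seen = list(children.get(root, ())), set()
        while stack:
            node = stack.pop()
            if node == root:
                errors.append(ValidationError("sbol-10605", root))
                break
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, ()))
    return errors
```

The eager check raises as soon as invalid content is created, while the cycle check returns a list, because it only makes sense once the whole hierarchy is known.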
To facilitate diagnosis of a violation, each SBOL validation exception is accompanied by the corresponding rule identifier, description, and reference to the specification, as well as the URI of the SBOL object causing the violation. Best practice rules do not throw an exception, but instead are collected in a list that can be examined after validation is complete.

Conversion: The libSBOLj library supports conversion between SBOL 2 and FASTA, GenBank, and SBOL 1, in both directions. For each converter to SBOL 2, the user can provide a default URI prefix and version. The URI prefix is required for FASTA and GenBank conversion, and optional for SBOL 1 conversion. The version is always optional. If specified, these are used when constructing the identities of the generated SBOL objects, and the corresponding data fields. Since the SBOL 1 to/from SBOL 2 conversion is described in (2, 4), this section focuses on the FASTA and GenBank conversions.

The simplest conversion is from FASTA, and it is depicted in Figure 5. The FASTA format encodes only DNA and RNA sequences (the user of the converter should specify which type is being encoded). Therefore, when converting from FASTA to SBOL, each sequence within the FASTA file is converted into a Sequence object within the SBOL file. Each sequence is separated by descriptors starting with a “;” or “>” character. These descriptors, along with the user-provided URI prefix and version, are used to construct the identity, displayId, version, and description fields for the Sequence. The FASTA to SBOL 2 converter is lossless. When converting from SBOL 2 to FASTA, a file is created that includes all of the DNA, RNA, or protein sequences within the SBOL file. These sequences are separated by a descriptor composed of the displayId and description, if specified. Note that since all non-Sequence objects are dropped, the converter to FASTA is lossy.

The conversion between GenBank and SBOL 2 is depicted in Figure 6. When converting from GenBank, each GenBank record within the file is converted into a ComponentDefinition and a Sequence. This ComponentDefinition has a SequenceAnnotation for each feature specified within the GenBank record that records the location of the feature. The feature position is encoded within the SequenceAnnotation’s Location(s).
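To make the feature-to-SequenceAnnotation step concrete, the sketch below parses simple GenBank feature lines into records that hold a feature type and a start/end range. It is an illustration under stated assumptions, not the libSBOLj converter: the `Annotation` class and `parse_features` function are hypothetical, only plain `start..end` locations are handled (no `complement(...)` or `join(...)`), and qualifier lines are skipped.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for a SequenceAnnotation with one Range location."""
    feature_type: str
    start: int  # 1-based, inclusive, as written in the GenBank feature table
    end: int    # 1-based, inclusive


# Matches lines such as "     promoter        1..200".
_FEATURE = re.compile(r"^\s*(\S+)\s+(\d+)\.\.(\d+)\s*$")


def parse_features(feature_lines):
    """Turn simple feature-table lines into Annotation records."""
    annotations = []
    for line in feature_lines:
        match = _FEATURE.match(line)
        if match:  # qualifier lines like "/label=..." do not match and are skipped
            kind, start, end = match.group(1), int(match.group(2)), int(match.group(3))
            annotations.append(Annotation(kind, start, end))
    return annotations
```

Applied to a feature table listing promoter 1..200, RBS 201..215, CDS 216..900, and terminator 901..934, this yields four annotations whose ranges tile the 934 bp sequence.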
The type of the feature is encoded by giving this SequenceAnnotation a role using a term from the Sequence Ontology (SO) (17), converted using the mapping from GenBank feature types shown in Table 1. Some elements within the GenBank record have no corresponding field within SBOL, such as comments for a feature, and these are converted into custom annotations on the SBOL record. Therefore, the GenBank to SBOL 2 converter is lossless.

[Figure 5 image: the example FASTA record below, labeled with the SBOL Sequence fields it populates: URI Prefix, displayId, Version, Identity, Description, and Encoding = IUPAC_DNA.]

>SEQUENCE_1 : This is a DNA sequence
CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTG
CTTGGACAACCTGGAACTCTTCTAGGAGACGACCAAATT
TAATAATTTTCTTTATAGTAATACCAATCATGATCGGTGGT
CGGCGCCCCCGACATAGCATTCCCCCGTATAAACAACATAAGCT

Figure 5: The conversion between FASTA and SBOL. For conversion from FASTA to SBOL, the URI Prefix and Version must be provided by the user. The displayId is derived from any text found before the first colon. Together, these elements are grouped to form the identity URI for a Sequence object. The description of this object is all text after the first colon. The sequence in the FASTA file becomes the elements of the Sequence, while the encoding is set to either IUPAC DNA or IUPAC Protein, depending on which type of sequence is being converted. The conversion from SBOL to FASTA simply reverses these steps.
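The FASTA mapping summarized in Figure 5 can be sketched in a few lines of Python. This is an illustration, not the libSBOLj converter: `SequenceRecord`, `fasta_to_sequences`, and `sequences_to_fasta` are hypothetical names, the identity is assumed to take the form prefix/displayId/version, and the version default of "1" is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    """Hypothetical stand-in for the fields of an SBOL Sequence object."""
    identity: str
    display_id: str
    version: str
    description: str
    elements: str
    encoding: str


def fasta_to_sequences(text, uri_prefix, version="1", encoding="IUPAC_DNA"):
    """Convert each FASTA entry into one SequenceRecord."""
    entries = []  # list of [descriptor, [sequence lines]]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line[0] in ">;":            # a descriptor starts a new entry
            entries.append([line[1:].strip(), []])
        elif entries:                  # sequence data for the current entry
            entries[-1][1].append(line)
    records = []
    for descriptor, parts in entries:
        # Text before the first colon is the displayId; the rest is the description.
        display_id, _, description = descriptor.partition(":")
        display_id = display_id.strip()
        records.append(SequenceRecord(
            identity=f"{uri_prefix.rstrip('/')}/{display_id}/{version}",
            display_id=display_id,
            version=version,
            description=description.strip(),
            elements="".join(parts),
            encoding=encoding,
        ))
    return records


def sequences_to_fasta(records, width=80):
    """Reverse direction: only displayId, description, and elements survive."""
    lines = []
    for r in records:
        lines.append(f">{r.display_id} : {r.description}" if r.description
                     else f">{r.display_id}")
        lines.extend(r.elements[i:i + width] for i in range(0, len(r.elements), width))
    return "\n".join(lines) + "\n"
```

The reverse function makes the asymmetry visible: everything a FASTA entry contains fits into the record, but a record written back to FASTA keeps only its descriptor and sequence.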

linear

types

UNK 19-Dec

displayId

Version

Annotations

ttaatgcagc taatgtgagt tatgttgtgt cagattagat cgaaggttta ttggcatgta gcaccatact cgctaaaagt tacacggcct acaaggtttt aggttgcgta tactgatagt agagccagcc atgtgaaagt atagtgctag

SequenceAnnotation

tggcacgaca tagctcactc ggaattgtga aaaagtaaag acaacccgta aaaaataagc cacttttgcc tttagatgtg acagaaaaac tcactagaga ttggaagatc atgccgccat ttcttattcg gggtccgctg tgtagatcac

SequenceAnnotation SequenceAnnotation SequenceAnnotation

Sequence

Figure 6: The conversion between GenBank and SBOL. The URI prefix provided by the user is combined with the displayId taken from the GenBank ACCESSION field and the version taken from the GenBank VERSION field to form the identity for a new ComponentDefinition. The type field is converted from the GenBank molecule type and topology elements. The GenBank division, modification date, sources, references, and other header information are converted into SBOL annotations. Each GenBank FEATURE is converted into a SequenceAnnotation with corresponding Location(s) and a role converted to SO using Table 1. Finally, the sequence is converted into an SBOL Sequence object.

When converting from SBOL 2 to GenBank, each root ComponentDefinition is converted into a GenBank record. A ComponentDefinition is a root when it is not referenced by any other ComponentDefinition. The descriptors, along with any GenBank annotations, are converted into corresponding descriptors in the GenBank record. The referenced Sequence is used as the sequence for the record, and the SequenceAnnotations are converted into features in the GenBank record using the reverse mapping from SO to GenBank features. Not all ComponentDefinitions can be converted to GenBank. First, only those that are for DNA or RNA components can be converted. Second, they must include complete sequences (i.e., they cannot include partial sequences and SequenceConstraints). Third,


Table 1: The mapping between SO Terms and GenBank features. Provided courtesy of Nathan Hillson of the Joint Bio-Energy Institute.

SO Term     Feature          SO Term     Feature          SO Term     Feature
SO:0001023  allele           SO:0000140  attenuator       SO:0001834  C_region
SO:0000172  CAAT_signal      SO:0000316  CDS              SO:0000297  D-loop
SO:0000458  D_segment        SO:0000165  enhancer         SO:0000147  exon
SO:0000704  gene             SO:0000173  GC_signal        SO:0000723  iDNA
SO:0000188  intron           SO:0000470  J_region         SO:0000286  LTR
SO:0000419  mat_peptide      SO:0000409  misc_binding     SO:0000413  misc_difference
SO:0000001  misc_feature     SO:0001645  misc_marker      SO:0000298  misc_recomb
SO:0000233  misc_RNA         SO:0001411  misc_signal      SO:0005836  regulatory
SO:0000002  misc_structure   SO:0000305  modified_base    SO:0000234  mRNA
SO:0001835  N_region         SO:0000551  polyA_signal     SO:0000553  polyA_site
SO:0000185  precursor_RNA    SO:0000185  prim_transcript  SO:0000112  primer
SO:0005850  primer_bind      SO:0000167  promoter         SO:0000410  protein_bind
SO:0000139  RBS              SO:0000552  RBS              SO:0000296  rep_origin
SO:0000657  repeat_region    SO:0000726  repeat_unit      SO:0000252  rRNA
SO:0001836  S_region         SO:0000005  satellite        SO:0000013  scRNA
SO:0000418  sig_peptide      SO:0000274  snRNA            SO:0000149  source
SO:0000019  stem_loop        SO:0000331  STS              SO:0000174  TATA_signal
SO:0000141  terminator       SO:0000725  transit_peptide  SO:0001054  transposon
SO:0000253  tRNA             SO:0001833  V_region         SO:0001060  variation
SO:0000175  -10_signal       SO:0000176  -35_signal       SO:0000557  3'clip
SO:0000205  3'UTR            SO:0000555  5'clip           SO:0000204  5'UTR
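As a rough illustration of how Table 1 might be used in both directions, the following Python sketch builds forward and reverse lookups from a hand-copied subset of the table. The function and variable names are hypothetical, and GenBank feature keys are written with underscores, as GenBank spells them. Note that, as listed, the table is not one-to-one: RBS appears with two SO terms, and SO:0000185 appears with two features. The sketch therefore returns lists of candidates and makes no claim about which candidate the converter actually selects.

```python
# A hand-copied subset of Table 1 as (GenBank feature key, SO term) pairs.
TABLE_1_SUBSET = [
    ("promoter", "SO:0000167"),
    ("RBS", "SO:0000139"),
    ("RBS", "SO:0000552"),
    ("CDS", "SO:0000316"),
    ("terminator", "SO:0000141"),
    ("precursor_RNA", "SO:0000185"),
    ("prim_transcript", "SO:0000185"),
    ("misc_feature", "SO:0000001"),
]


def build_lookups(pairs):
    """Build feature -> [SO terms] and SO term -> [features] dictionaries."""
    feature_to_so = {}
    so_to_feature = {}
    for feature, so_term in pairs:
        feature_to_so.setdefault(feature, []).append(so_term)
        so_to_feature.setdefault(so_term, []).append(feature)
    return feature_to_so, so_to_feature


FEATURE_TO_SO, SO_TO_FEATURE = build_lookups(TABLE_1_SUBSET)


def roles_for_feature(feature_key):
    """GenBank -> SBOL direction: candidate SO roles for a feature key."""
    return FEATURE_TO_SO.get(feature_key, [])


def features_for_role(so_term):
    """SBOL -> GenBank direction: candidate feature keys for an SO role."""
    return SO_TO_FEATURE.get(so_term, [])
```

For the four features of the Figure 6 example (promoter, RBS, CDS, terminator), only RBS yields more than one candidate role.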


they cannot include more than two levels of hierarchy. Namely, the ComponentDefinition referenced by any Component must not itself include Components. The last problem can be addressed using a flattening procedure, but the rest are fundamental limitations of the GenBank format. Furthermore, SBOL ModuleDefinitions and other classes not related to the structural aspects are dropped. Therefore, the SBOL 2 to GenBank converter is lossy.

Online web interface implementation: The web validator application is built using common web application paradigms, such as the RESTful API architecture. The frontend (the web form) is built using HTML and Bootstrap CSS for layout and formatting. Basic sanity checks are done in the browser using JavaScript, such as ensuring that a main file is provided or that a URI is given for conversions that require one, such as GenBank to SBOL 2. JavaScript is also used to convert pasted files to file uploads for ease of interpretation at the backend. The validation request is sent to the backend using an HTTP form and file encoding. The web validator API is simply an exposed endpoint that accepts either JSON or form data. These are sent directly to the backend for running without any extra processing.

The backend is written in Python with Flask. It parses the validation request, whether it arrives as an API request or from the graphical frontend, and uses it to build a validation command for libSBOLj. It also handles upload and anonymization of files for validation, standardizing how the backend accesses pasted and uploaded files. The filenames are anonymized upon upload, but the contents are still saved on the server briefly. The files are deleted upon completion of the validation run, but there is still a potential vulnerability during the validation run.
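The upload handling described above (anonymize the filename, run the validation command, delete the file when the run completes) might look roughly like the following sketch. It uses only the Python standard library rather than Flask, and it is not the authors' backend code: the function names are hypothetical, and base_cmd stands in for whatever command invokes libSBOLj, whose actual arguments are not given here.

```python
import os
import subprocess
import sys
import tempfile
import uuid


def anonymized_path(upload_dir, original_name):
    """Return a storage path that keeps the extension but hides the user's filename."""
    _, ext = os.path.splitext(original_name)
    return os.path.join(upload_dir, uuid.uuid4().hex + ext)


def run_on_upload(contents, original_name, base_cmd, upload_dir=None):
    """Save contents under an anonymized name, run base_cmd + [path], then delete the file."""
    upload_dir = upload_dir or tempfile.gettempdir()
    path = anonymized_path(upload_dir, original_name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(contents)
    try:
        # Passing a list of arguments (no shell) keeps user input out of shell parsing.
        result = subprocess.run(base_cmd + [path], capture_output=True, text=True)
        return result.returncode, result.stdout, path
    finally:
        # The file exists on disk only for the duration of the run.
        os.remove(path)
```

In this sketch any command that takes a file path as its last argument can serve as base_cmd, for example [sys.executable, "-c", "import sys; print(len(open(sys.argv[1]).read()))"] as a stand-in. The try/finally block makes the window in which the contents sit on disk explicit: it spans exactly the subprocess call.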
This exposure can be partially addressed by enabling HTTPS on the validator host server, but the best solution is likely to host a separate instance of the validator/converter inside a local intranet, so that the data never leaves the organization's own network. The validator can update its internal validation rules with a single command, so a private entity could temporarily scrub a server, connect it to the Internet, and allow the validator to update its copy of the validation program.

Testing of the SBOL validator/converter: In order to robustly test the libSBOLj


validation routines, a test suite containing both valid and invalid files has been created. These files are validated using libSBOLj to verify that each file contains correct SBOL. The invalid SBOL files are created manually so that each file breaks a specific validation rule from the set of validation rules referenced in the SBOL 2.1.0 data model specification (2). In particular, the invalid files ensure that a violation of each validation rule is caught and reported at least once. Furthermore, when invalid SBOL is detected within a file, an error is thrown with a message showing the corresponding validation rule(s) that are broken.

The conversion routines are tested using a set of FASTA, GenBank, and SBOL 1 data files that are first converted to SBOL 2, then converted back to their original format, and finally converted to SBOL 2 again. The two SBOL 2 documents created during this process are compared to ensure that the data they contain is identical and that nothing has been lost or changed.

PhantomJS is used to develop a set of unit tests on the web interface so that all capabilities can be tested and verified to produce the correct output quickly and programmatically, rather than manually.

In order to stress test the validation API and ensure that it can handle an appropriate number of simultaneous connections, a computer is configured to generate dozens of requests simultaneously and send them to an instance of the validation server. The number of simultaneous requests is then increased until the server begins to time out or behave erratically. It is found that the application can handle approximately 1,000 simultaneous connections and validation requests before the cost of initializing a new subprocess and running the libSBOLj validation routines for each request overwhelms the web server. This limit could theoretically be raised, if necessary, by increasing the processing capability of the web server or moving to more powerful hardware.
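The round-trip procedure used to test the conversion routines can be expressed as a small harness. The sketch below is illustrative only: round_trip_is_stable and the two toy converters are hypothetical names, and the toy "SBOL 2 document" is just a Python dictionary standing in for the output of the real converters.

```python
def round_trip_is_stable(original, to_sbol2, from_sbol2):
    """Convert original -> SBOL 2 -> original format -> SBOL 2 and compare both SBOL 2 results."""
    first = to_sbol2(original)
    restored = from_sbol2(first)
    second = to_sbol2(restored)
    return first == second


# Toy converters standing in for the real ones; "SBOL 2" here is just a dict.
def toy_to_sbol2(fasta_text):
    lines = fasta_text.strip().splitlines()
    display_id, _, description = lines[0][1:].partition(":")
    return {
        "displayId": display_id.strip(),
        "description": description.strip(),
        "elements": "".join(line.strip() for line in lines[1:]),
    }


def toy_from_sbol2(doc):
    return ">{} : {}\n{}\n".format(doc["displayId"], doc["description"], doc["elements"])
```

Comparing the two SBOL 2 documents, rather than the original and restored files, means that harmless differences in the source format (such as line wrapping in a FASTA file) do not cause a failure, while any field dropped on the way back does.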

Acknowledgement

The authors would like to thank Nathan Hillson of the Joint Bio-Energy Institute at Berkeley for creating the mapping between GenBank elements and SO terms.


This material is based upon work supported by the National Science Foundation under Grants CCF-1218095 and DBI-135604. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

1. Galdzicki, M. et al. (2014) The Synthetic Biology Open Language (SBOL) provides a community standard for communicating designs in synthetic biology. Nat. Biotechnol. 32, 545–550.
2. Beal, J. et al. (2016) Synthetic Biology Open Language (SBOL) Version 2.1.0. Journal of Integrative Bioinformatics 13.
3. Roehner, N. et al. (2016) Sharing Structure and Function in Biological Design with SBOL 2.0. ACS Synth. Biol. 5, 498–506, PMID: 27111421.
4. Zhang, Z., Nguyen, T., Roehner, N., Misirli, G., Pocock, M., Oberortner, E., Samineni, M., Zundel, Z., Beal, J., Clancy, K., Wipat, A., and Myers, C. J. (2015) libSBOLj 2.0: A Java Library to Support SBOL 2.0. IEEE Life Sciences Letters 1, 34–37.
5. Quinn, J. Y. et al. (2015) SBOL Visual: A Graphical Language for Genetic Designs. PLoS Biol 13, e1002310.
6. Hillson, N., Plahar, H., Beal, J., and Prithviraj, R. (2016) Improving Synthetic Biology Communication: Recommended Practices for Visual Depiction and Digital Submission of Genetic Designs. ACS Synth. Biol. 5, 449–451.
7. Ham, T. S., Dmytriv, Z., Plahar, H., Chen, J., Hillson, N. J., and Keasling, J. D. (2012) Design, implementation and practice of JBEI-ICE: an open source biological part registry platform and tools. Nucleic Acids Res. 40, doi: 10.1093/nar/gks531.


8. Madsen, C., McLaughlin, J. A., Mısırlı, G., Pocock, M., Flanagan, K., Hallinan, J., and Wipat, A. (2016) The SBOL Stack: A Platform for Storing, Publishing, and Sharing Synthetic Biology Designs. ACS Synth. Biol.
9. Rodriguez, N., Pettit, J.-B., Dalle Pezze, P., Li, L., Henry, A., van Iersel, M. P., Jalowicki, G., Kutmon, M., Natarajan, K. N., Tolnay, D., Stefan, M. I., Evelo, C. T., and Le Novère, N. (2016) The systems biology format converter. BMC Bioinf. 17, 154.
10. Fielding, R. T., and Taylor, R. N. Principled Design of the Modern Web Architecture. Proceedings of the 22nd International Conference on Software Engineering. New York, NY, USA, 2000; pp 407–416.
11. Hucka, M. et al. (2003) The Systems Biology Markup Language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524–531.
12. Roehner, N., Zhang, Z., Nguyen, T., and Myers, C. J. (2015) Generating Systems Biology Markup Language Models from the Synthetic Biology Open Language. ACS Synth. Biol. 4, 873–879, PMID: 25822671.
13. Roehner, N., and Myers, C. J. (2014) A methodology to annotate Systems Biology Markup Language Models with the Synthetic Biology Open Language. ACS Synth. Biol. 3, 57–66.
14. Nguyen, T., Roehner, N., Zundel, Z., and Myers, C. J. (2016) A Converter from the Systems Biology Markup Language to the Synthetic Biology Open Language. ACS Synth. Biol. 5, 479–486, PMID: 26696234.
15. Demir, E. et al. (2010) The BioPAX community standard for pathway data sharing. Nat. Biotechnol. 28, 935–942.
16. van Iersel, M. P. et al. (2012) Software support for SBGN maps: SBGN-ML and LibSBGN. Bioinformatics 28, 2016–2021.


17. Eilbeck, K., Lewis, S. E., Mungall, C. J., Yandell, M., Stein, L., Durbin, R., and Ashburner, M. (2005) The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 6, R44.


Graphical TOC Entry
