Towards the first data acquisition standard in Synthetic Biology Iñaki Sainz de Murieta, Matthieu Bultelle, and Richard I. Kitney ACS Synth. Biol., Just Accepted Manuscript • DOI: 10.1021/acssynbio.5b00222 • Publication Date (Web): 07 Feb 2016 Downloaded from http://pubs.acs.org on February 9, 2016

ACS Synthetic Biology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

ACS Synthetic Biology

Towards the First Data Acquisition Standard in Synthetic Biology
Iñaki Sainz de Murieta, Matthieu Bultelle and Richard I Kitney*
* Correspondence: Richard I Kitney, Co-Director of the Centre for Synthetic Biology and Innovation, Imperial College London, SW7 2AZ, United Kingdom. [email protected]
KEYWORDS: data acquisition, characterization, standard, biopart, synthetic biology.

ABSTRACT

This paper describes the development of a new data acquisition standard for synthetic biology. This comprises the creation of a methodology designed to capture all the data, metadata and protocol information associated with biopart characterization experiments. The new standard, called DICOM-SB, is based on the highly successful Digital Imaging and Communications in Medicine (DICOM) standard. A data model developed specifically for synthetic biology is described. It is a modular, extensible model of the experimental process that can optimize storage for large amounts of data. DICOM-SB also includes services oriented towards the automatic exchange of data and information between modalities and repositories. DICOM-SB has been developed in the context
of systematic design in synthetic biology — which is based on the engineering principles of modularity, standardization and characterization. The systematic design approach utilizes the design, build, test and learn design cycle paradigm. DICOM-SB has been designed to be compatible with and complementary to other standards in synthetic biology, including SBOL. In this regard, the software provides effective interoperability. The new standard has been tested by experiments and data exchange between Nanyang Technological University in Singapore and Imperial College London.

INTRODUCTION

Synthetic biology is a young discipline (15 years old, at most) that aims to design and engineer biologically based parts, novel devices and systems, as well as to redesign existing, natural biological systems 1,2. Bioparts are the key element of this definition: they perform specific functions, such as regulating transcription or translation or binding to small molecules or protein domains, and are used as the basic building blocks for devices and systems of higher complexity. The first bioparts used in synthetic biology applications were natural parts transplanted to other settings (e.g. a different chassis). Originally only a few parts were available, but synthetic libraries were soon built by modifying natural parts with techniques such as error-prone PCR 3–6. Device design, spearheaded by the repressilator 7 and the toggle switch 8 and followed by a large number of important devices 9–20, proved that the analogy between bioparts and electronic components could be used to design devices and that, in practice, biological systems could be endowed with computing-like behavior by combining elementary bioparts.

The second wave of synthetic devices (2010 onwards) has been characterized not only by attempts to build more complex devices and investigate robust design principles, but also by a focus on applications such as biosensing, biofuels, pharmaceuticals and biomaterials, with the stated aim of establishing synthetic biology as one of the key technologies for solving major societal problems 21,22.

Standard workflows. As engineered biological systems and their applications become more complex and ambitious, the traditional iterative approach to design, mainstream in many fields of engineering, has also been adopted in synthetic biology 23. The design cycle (illustrated in Figure 1) comprises four distinct stages: design, build, test and learn. Depending on the results, the process may be repeated (iterated) several times until the initial specifications are met. The concepts of modularity (building larger systems by combining smaller subsystems; here, bioparts and available devices) and division of labor (the specialization of cooperating individuals who perform specific tasks and roles) are central to its success. The latter plays an increasing role as projects become more complex and larger teams of specialists are needed. In a move that mirrors the electronics industry, where circuit design is “fabless” and construction takes place in specialized foundries, outsourcing the DNA synthesis of genes and gene fragments is now an integral part of the rational design cycle, with final assembly taking place in-house using an ever wider range of techniques 24,25.

Characterization is the process of describing the distinctive characteristics or essential features of bioparts. Accurate characterization is essential to the success of the iterative design approach. Parts need to be characterized to a high standard so that the behavior of their combination can be predicted with higher fidelity. Equally important, repositories must make large libraries
of characterized parts available, such that new systems can be built by their addition/combination.

Figure 1. The synthetic biology design cycle.

Several high-profile repositories are in common use: the iGEM Registry of Standard Biological Parts 26, the JBEI Inventory of Composable Elements (ICE) 27 and the Virtual Parts Repository 28, to name a few. However, no repository currently offers a large catalogue of bioparts characterized to a consistently high standard. Although regrettable, this state of affairs is not surprising. Collecting the data for such catalogues is a lengthy, staff- and resource-intensive affair that was very difficult before the advent of high-throughput, automated platforms. Building the catalogues also requires the development and adoption of a set of robust data formats to describe the various components of the characterization experiments. In particular, it requires a data format to store the raw data they generate, such as the one presented in this paper.

In order to build such an online catalogue of bioparts, a characterization pipeline has been established at Imperial College London (see Figure 2). It is supported by an IT spine called SynBIS 29, which enables characterization on a scale that is difficult for human experimentalists to achieve. First developed with constitutive promoters, the system now supports the characterization of other fundamental bioparts, such as inducible promoters, and continues to be expanded. Plate reader and flow cytometry data are typically acquired.

Information and data standards. Design and characterization projects vary greatly in purpose, internal organization and output; nonetheless, they deal with similar types of information. We have identified three main categories:



Sequence description (description of the genetic objects of interest): FASTA 30 and GenBank 31 are well-established formats that underpin very large public databases of naturally occurring, annotated sequences. However, they are not suitable for the description of the genetic constructs encountered in synthetic biology, as they were not designed to express the constructs’ hierarchy and modularity. This has led to the creation of SBOL (the Synthetic Biology Open Language), a standard that captures the same sequence-oriented information found in a GenBank file while allowing full hierarchical annotation of DNA components 32, and thus facilitates the exchange of genetic designs 33. SBOL’s latest version, SBOL 2.0 34, revised the core model in order to represent a wider range of molecular interactions and components.

Figure 2. The CSynBI characterization pipeline.



Modelling (representing the actual or desired behavior of genetic objects): The Systems Biology Markup Language (SBML) is the most commonly used modelling standard for the representation of biological phenomena. It is free, open, enjoys widespread software support and is the de facto standard for representing computational models in systems biology today 35–37. Modelling standards naturally complement genetic-construct description standards, as made apparent by SBOL 2.0 38 and by the mechanism for annotating SBML models with SBOL files put forward by Roehner and Myers 39.



Raw data acquisition (learning about the genetic objects): As it matures into an engineering discipline, synthetic biology will move from qualitative to quantitative data. The amount of data captured and analyzed will consequently increase substantially — due in no small part to the availability of new imaging modalities (generating ever larger files) and the development of high throughput platforms 25.

RESULTS AND DISCUSSION

Motivation: data acquisition is the missing standard. In the design cycle, data acquisition takes place during the testing phase. Since a design is validated or rejected according to whether certain concentrations and observed phenotypes fall within the ranges listed in the specifications, only a few repeat experiments may be needed for a given context. However, testing may have to be performed for a potentially large number of candidate constructs and experimental contexts. With biopart characterization, a set of experiments is typically run on a small number of characterization constructs (often, one containing the biopart and a set of controls). Characterization differs from testing in that it has no margin of tolerance: it has to be as precise
as possible (so that the results may be re-used to model more complex designs). In practice, a large number of repeat experiments should be run and several acquisition modalities used. Because catalogues of significant numbers of characterized parts are needed, characterization can be expected to be a main driver behind increased data capture in the future.

At present there is no data acquisition standard for synthetic biology. Developing one is therefore of the utmost importance, as it will be a key driver in transforming the field into a fully fledged engineering discipline. For such a standard to be of use in synthetic biology, it needs to support data acquisition effectively. Therefore, it must:
1. Be based on a modular, extensible data model for the experimental process.
2. Optimize data storage of very large amounts of data.
3. Provide services oriented to automate the exchange of information between modalities and repositories.
4. Build, if possible, on an existing validated standard, to facilitate adoption by hardware manufacturers and the engineering community.

DICOM and DICOM-SB. There is already a highly successful representation and communication standard in the field of biomedicine that meets requirements 1 to 4 above. DICOM (Digital Imaging and Communications in Medicine) is the de facto standard for handling, storing, printing and transmitting information in medical imaging. It is also known as NEMA standard PS3 and as ISO standard 12052:2006, "Health informatics - Digital Imaging and Communication in Medicine (DICOM) including workflow and data management" 40. It is both a file format definition and a network communications protocol
(based on TCP/IP). Originally developed to achieve compatibility between different medical imaging and information systems, it has developed into a comprehensive standard over the last 20 years. Several key technical aspects of DICOM provide a strong case for extending the standard to synthetic biology (technically by adding a new module; working title DICOM-SB), rather than developing a novel standard or adapting an existing rival data acquisition standard:



First, DICOM was designed with data acquisition and transmission in mind. In practice, this means that most of the issues involved in networking various imaging resources and repositories have already been solved. Also, DICOM’s real-world model was built around the experimental process.



Second, DICOM already supports a number of imaging modalities, such as microscopy, used in synthetic biology.



Third, because of DICOM’s popularity, there already are a large number of programmers and engineers familiar with the standard. It would therefore take little development work for manufacturers to adapt their equipment and support DICOM-SB.

DICOM has also had a transformative effect on medical ICT, the like of which would be highly beneficial to synthetic biology. The combination of DICOM and HL7 has supported the development of electronic health records (EHRs), a class of software that systematically collects electronic health information about individual patients or populations, including demographics, medical history, medication and allergies, images, vital signs, personal statistics and procedural information 41–43. In parallel, a class of software called PACS (picture archiving
and communication systems) was developed to provide economical storage of, and access to, images acquired with multiple modalities 44–46. It is straightforward to draw an analogy between EHRs/PACS and repositories of characterized bioparts that would collect raw data, procedural data (assembly, for instance), experimental protocol data and processed data. Indeed, we fully anticipate that a successful data acquisition standard (such as the one presented in this paper) will underpin such repositories. The next subsections introduce a novel variant of the DICOM standard and make the case that it is highly suitable for supporting data acquisition in synthetic biology. In particular, it is complementary to SBOL and provides efficient data storage.

DICOM for Synthetic Biology. As stated above, our analysis of the DICOM standard established that its features are compatible with the requirements for synthetic biology. Therefore, we decided to develop a new synthetic biology extension for DICOM (DICOM-SB). DICOM-SB provides a framework that allows the integration of wet-lab experimental data acquisition modalities into a common data model. It extends the basic architecture inherited from DICOM to allow the encoding of new synthetic biology data acquisition modalities not present in the general standard 40.

DICOM encodes data objects as a series of items (or data elements), each identified by a predefined attribute (also called a tag). Attributes are named by the combination of two fields: group and element. Groups organize the attributes into categories, while the element identifies each type of attribute within a group. Each attribute is associated with a data type (e.g. integer, float, character, string, etc.), which in DICOM is called a Value Representation (VR). Finally, the value to be represented is encoded at the end (last bytes) of
each item, preceded by the item’s total length. Figure 3-A depicts the DICOM encoding of data items.
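As a rough illustration of the layout described above (tag, value representation, length, value), the following Python sketch packs and unpacks a single data element. It assumes the Explicit VR Little Endian encoding defined by the DICOM standard, in which the length field holds the byte length of the value, and it only handles VRs with a 2-byte length field. The function names are hypothetical and are not part of DICOM-SB.

```python
import struct

# VRs that use a longer header in Explicit VR Little Endian (list not
# exhaustive); they are deliberately left out of this sketch.
LONG_FORM_VRS = {"OB", "OW", "OF", "SQ", "UT", "UN"}


def encode_element(group: int, element: int, vr: str, value: bytes) -> bytes:
    """Pack one data element as tag (group, element) + VR + length + value."""
    if vr in LONG_FORM_VRS:
        raise ValueError(f"VR {vr} needs the long-form header, not covered here")
    if len(value) % 2:
        # Values must have even length: UI is padded with NUL, text VRs with a space.
        value += b"\x00" if vr == "UI" else b" "
    header = struct.pack("<HH2sH", group, element, vr.encode("ascii"), len(value))
    return header + value


def decode_element(buf: bytes):
    """Unpack one short-form data element from the start of buf."""
    group, element, vr, length = struct.unpack_from("<HH2sH", buf)
    return group, element, vr.decode("ascii"), buf[8:8 + length]


# Sampling Frequency (003A,001A) stored as a decimal string (VR "DS").
item = encode_element(0x003A, 0x001A, "DS", b"0.5")
```

The value sits in the last bytes of the item and is located by reading the fixed 8-byte header first, which matches the ordering described in the text.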

Figure 3. (A) DICOM encoding of data elements. (B) Nesting data elements using the SQ value representation. (C) Building Modules and Information Entities. (D) The DICOM-SB data model.

Data objects can be nested into higher-level objects. This is achieved by using the Sequence (SQ) value representation. When an attribute is assigned an SQ VR, the content of its value field is a series of DICOM objects. Each object within the nested series includes another series of data elements, some of which may (or may not) again be encoded with an SQ VR, which would add another nesting layer, and so on. The tree in Figure 3-B illustrates a nesting example including different data elements and objects.

DICOM includes two types of value representation to enable resource identification. On the one hand, the AE VR represents an Application Entity. An AE is the name of a DICOM device or program that uniquely identifies it locally (e.g. inside a network). It can refer to a specific workstation (e.g. WORKSTATION2), a specific software service (e.g. DATASTORE), etc. On the other hand, the UI VR encodes a Unique Identifier, used to reference instances of DICOM data unambiguously. DICOM UIs must be globally unique, and they are built from groups of digits separated by periods (e.g. 1.2.408.41112.3.1).

Since the number of DICOM attributes is so large, building consistent objects that include all the required information can become a tedious task if attributes have to be searched for and chosen one by one. To ease this task, DICOM clusters the attributes describing the same concept into the same Information Module. Hence, when designing a DICOM object to encode a certain data structure, modules are the minimal blocks to be combined. The module specification also determines which attributes are mandatory (and thus must always be completed) and which can be left incomplete. Finally, Information Modules are combined to build Information Entities (IEs), and IEs are aggregated to build Information Object Definitions (IODs) (see Figure 3-C).
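The nesting mechanism and the UI syntax can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a data set is modelled as a plain dictionary keyed by (group, element) tags, the helper names are hypothetical, and the example object is invented. The 64-character limit and the ban on leading zeros in UID components come from the DICOM standard rather than from this paper.

```python
import re

# One or more numeric components separated by periods, no leading zeros.
_UID_PATTERN = re.compile(r"(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*")


def is_valid_uid(uid: str) -> bool:
    """Check the syntax of a UI value (at most 64 characters)."""
    return len(uid) <= 64 and _UID_PATTERN.fullmatch(uid) is not None


def nesting_depth(dataset: dict) -> int:
    """Return how many SQ layers are nested inside a data set (0 if flat)."""
    deepest = 0
    for vr, value in dataset.values():
        if vr == "SQ":
            # An SQ value is a list of nested data sets.
            inner = max((nesting_depth(item) for item in value), default=0)
            deepest = max(deepest, 1 + inner)
    return deepest


# Hypothetical object: keys are (group, element) tags, values are (VR, value).
example = {
    (0x0008, 0x0018): ("UI", "1.2.408.41112.3.1"),
    (0x5400, 0x0100): ("SQ", [
        {(0x003A, 0x0200): ("SQ", [
            {(0x003A, 0x0203): ("SH", "OD")},
            {(0x003A, 0x0203): ("SH", "GFP")},
        ])},
    ]),
}
```

Here `nesting_depth(example)` counts two layers, one for the outer sequence and one for the sequence nested inside its single item.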

In order to understand the basis of the extension of DICOM for synthetic biology, the hierarchical data model of standard DICOM will now be described. Every data object must implement a standard IOD, whose entities are related according to a hierarchical information model. In the standard DICOM model for medicine, the patient is at the top of this hierarchy, as the patient is the object of analysis of any biomedical application. All the details related to a patient (name, identifier, age, gender, etc.) are included in the Patient IE. Patients can be subject to different medical studies, which requires tracking additional data (e.g. date, time, study number, physician’s name, etc.). One step down in the hierarchy, this is represented by the Study IE. Studies comprise different procedures, each performed on specific equipment and repeatable over time. Each procedure is termed a series, and its features (series number, date, time, etc.) are included in the Series IE. Finally, each series contains raw data acquired with one modality (e.g. electrocardiogram, magnetic resonance imaging, etc.). In total, the data measured at the lowest level of the hierarchy (the modality results) are annotated by the remaining levels: patient, study and series.

It is worth noting that although the main DICOM standard is associated first and foremost with images, it also supports other types of data, waveforms being the most relevant for the present exercise (see the section on Synthetic Biology Raw Data IOD). The synthetic biology extension of DICOM organizes its data model following a similar strategy (see class diagram in Figure 3-D):



Instead of “patient”, the object of study is the transformation of a host organism, according to a transformation protocol, with a set of genetic constructs whose behavior within the host is to be determined. We have created three new IEs to model this information (see top-left of the hierarchy in Figure 3-D, in green).

o Component: as the main target of the characterization process, this IE lies at the top of the hierarchy. It describes the basic features to be tracked for each biopart under analysis (the term Component has been chosen to be consistent with SBOL). One of its attributes, named URI, allows the biopart to be described by referencing an SBOL entity (a Component Definition), which allows a more detailed annotation of its DNA sequence. Although nothing prevents the use of GenBank or FASTA files, SBOL is recommended, as it has the advantage of supporting the representation of recursive biopart structures. This is especially useful for representing, in a single file, a circular plasmid structure that integrates e.g. cargos, bioparts of interest and reporter genes.
o Host: the components (or bioparts) to be characterized are studied and analyzed in the context of a specific host organism (also known as a chassis). This IE describes the basic features to be tracked for a host in a characterization experiment. Cell-free systems are also represented by this entity, using a special host type.
o Transformation: host cells may be genetically modified (transformed) by DNA components before they are used in a characterization experiment. A Transformation IE represents a cell design as a combination of one host organism and a list of components. The list of components in a transformation can be empty, meaning that the host is used untransformed in the experiment (typically as a control). Optionally, it can also include details about the transformation protocol (more details about the Protocol IE below).



In the next level of the hierarchy, the Experiment IE (analogous to Study in the medical model) is defined. Its purpose is to perform all the procedures required to analyze the change of behavior that the integrated set of components produces in the host. Each experiment must also adhere to an experimental protocol whose details are defined by the Protocol IE (see top-right of the hierarchy in Figure 3-D, in red).



An experiment comprises a set of procedures that are repeated on different compartments (typically wells) over time. Each single repeat of a specific procedure in a compartment, performed with dedicated equipment, constitutes a Series (similar to the medical model). The following IEs expand the scope of a series (see bottom of the hierarchy in Figure 3-D, in blue):
o Stimuli: when the series requires interaction with external stimuli, this IE may represent either environmental conditions (e.g. temperature) or chemical components to be added to the media during the course of the series. Environmental conditions and chemical components can be specified either as absolute values or as increments over time.
o Compartment: the cells in an experiment may be grouped in different compartments, according to the cell interactions that need to be tested. Thus the Compartment IE can be seen as a container (e.g. a vessel) where an experiment is performed. When working with automated platforms, it is common to use plates that arrange wells as a matrix of rows and columns, such that each well is assigned a different series. In such a scenario, each well is represented by a Compartment IE that enables tracking the location of the series. The term ‘compartment’ was chosen to be
consistent with the modelling standard SBML (where it is defined as a bounded space in which species are located).
o Equipment: this IE identifies and describes the piece of equipment performing the measurements, whether for microscopy, flow cytometry, etc.



Finally, each series references the raw data generated by the equipment after that run. In synthetic biology the raw data are often organized as a list of data arrays, such that each array represents one of the quantities measured by the equipment (e.g. time, temperature, fluorescence intensity, optical density, etc.) together with its corresponding values. When the raw data are structured in this fashion they can be easily encoded using the standard DICOM Waveform module 40. The next subsections of the paper show in more detail how the Waveform module is used to encode both cell population and single cell measurements.
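The hierarchy described in the list above can be pictured with a few Python dataclasses. This is a minimal sketch, not the DICOM-SB definition: the class fields, their types and the cardinalities between entities are assumptions made for illustration, and the example URI is a placeholder. The authoritative attribute list is the one in the paper's supporting material.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    name: str
    uri: str  # reference to an SBOL Component Definition


@dataclass
class Host:
    name: str
    host_type: str = "organism"  # hypothetical; a special type would mark cell-free systems


@dataclass
class Transformation:
    host: Host
    components: List[Component] = field(default_factory=list)
    protocol: Optional[str] = None  # optional transformation protocol details

    @property
    def is_untransformed_control(self) -> bool:
        # An empty component list means the host is used untransformed.
        return not self.components


@dataclass
class Series:
    compartment: str  # e.g. a well such as "A1"
    equipment: str
    stimuli: List[str] = field(default_factory=list)
    raw_data: List[List[float]] = field(default_factory=list)  # one array per measured quantity


@dataclass
class Experiment:
    transformation: Transformation
    protocol: str
    series: List[Series] = field(default_factory=list)


host = Host("E. coli MG-1655")
promoter = Component("promoter_under_test", "http://example.org/sbol/promoter_under_test")
sample = Experiment(Transformation(host, [promoter]), protocol="characterization")
control = Experiment(Transformation(host), protocol="characterization")
sample.series.append(Series("A1", "plate reader"))
```

The `control` experiment reuses the same host with an empty component list, which mirrors how an untransformed control is expressed in the Transformation IE.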

As with the standard DICOM model, the experimental measurements for each data acquisition activity are stored in the corresponding attributes of each modality, whereas the higher entities in the data model store the metadata required for classifying, processing, analyzing and disseminating the experimental measurements.

Having established the basic structure of the DICOM-SB data model, a new Synthetic Biology IOD was defined to accommodate some of the modalities (and metadata) that are often used in synthetic biology but are not already present in the DICOM standard. We call this IOD the Synthetic Biology Raw Data IOD (SBRD). The SBRD was constructed both by reusing some of the standard DICOM Information Modules (such as the Waveform Module) and by defining new modules. A full description of the DICOM-SB data model is available in the supporting material,
including details of the associated Tags and their Value Representations, as well as the corresponding Information Modules and Information Entities.

To illustrate how the SBRD is used in practice, let us consider the following case study: the characterization of constitutive promoters on a robotic automated platform, as performed at the Centre for Synthetic Biology and Innovation (CSynBI) at Imperial College London (see Supplementary Information for more on the characterization protocol for constitutive promoters at CSynBI and the data collected as part of the exercise).

Encoding cell population measurements. CSynBI’s characterization experiments use a plate reader to measure the optical density and fluorescence of a population of E. coli (MG-1655) transformed according to a specific experimental protocol. Thanks to previously established calibration curves, these measurements are converted into estimates of cell population and GFP population respectively. The characterization protocol states that the plate reader should sample each well of the plate at intervals of 15 minutes and track measurements in two channels:



Target measure: total fluorescence intensity



Population measure: total optical density

The SBRD handles plate reader population level data (which are time series) as follows. As long as the sampling frequency is constant, and the channel values are chronologically sorted, the attribute Sampling Frequency (003A, 001A) can be used to track the frequency value. If this is not the case, an extra channel can be added with corresponding time marks.
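To make the role of the Sampling Frequency attribute concrete, the sketch below converts the 15-minute sampling interval of the protocol into a frequency and rebuilds the time axis from it. It also includes a check that decides whether a series is evenly spaced, which is the condition under which the attribute alone is enough. The helper names and the tolerance are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def sampling_frequency_hz(interval_seconds: float) -> float:
    """Frequency (in Hz) corresponding to a fixed sampling interval."""
    return 1.0 / interval_seconds


def has_constant_interval(timestamps: Sequence[float], tol: float = 1e-6) -> bool:
    """True if timestamps are chronologically sorted and evenly spaced."""
    steps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return all(s > 0 and math.isclose(s, steps[0], abs_tol=tol) for s in steps)


def time_axis(n_samples: int, frequency_hz: float) -> List[float]:
    """Rebuild sample times (in seconds) from the sampling frequency."""
    period = 1.0 / frequency_hz
    return [i * period for i in range(n_samples)]


freq = sampling_frequency_hz(15 * 60)  # one sample every 15 minutes
times = time_axis(4, freq)             # approximately 0, 900, 1800 and 2700 seconds
```

If `has_constant_interval` returns False for the recorded timestamps, the frequency cannot describe the series and the timestamps would have to be stored in an extra channel, as the text notes.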


The attributes defined as part of the Data Series module provide a data structure to store target and population measurements in different channels, as well as sampling-independent data related to the test itself (e.g. acquisition date and time, number of channels and samples, channel names, channel properties, etc.). Figure 4 depicts how the data generated by a plate reader experiment can be mapped and structured into the Raw Data module as part of the SBRD IOD.

Referring to Figure 4, reading from left to right, it can be seen that the whole structure is encoded as a Waveform Sequence, which allows the inclusion of several modality repeats (for instance, a range of fluorescence channels, each corresponding to a given bandwidth). As per the protocol, there is only one repeat, named “Object 1” (corresponding to an excitation wavelength of 385 nm ± 10 nm and an emission wavelength of 428 nm ± 10 nm).



The first attribute within this object is the Channel Definition Sequence, which includes the metadata required to describe all the channel-related settings. This sequence must contain as many objects as there are channels (Objects 2.1 and 2.2 for the OD and GFP channels), and the data elements within each object relate to the different channel features: Channel Number, Channel Label, Status (active / inactive / data / test / ...), number of bits encoding each channel value (Waveform Bits Allocated), etc. Since our experiments report in arbitrary units, no units of measure are given. However, such details can be included if required, and attributes are available to record them (see supporting material).


Figure 4. Encoding the results of the plate reader using the Raw Data IOD.

The last attribute (Waveform Data) is used to encode the sequence of experimental data, such that the samples are sorted in ascending order (from 1 to n). Each sample is built by the concatenation of the different channel values, sorted as per the corresponding Waveform Channel Numbers (first OD and then GFP).



The attributes in the center encode metadata not related to the channels, such as the Sampling Frequency, the length of each data sample (Waveform Bits Allocated and Waveform Sample Interpretation), the total Number of Waveform Channels and Samples, and Waveform Originality.
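The sample-by-sample layout of the Waveform Data attribute can be sketched in a few lines of Python. The example below interleaves an OD channel and a GFP channel so that each sample is the concatenation of its channel values, in channel-number order. It assumes unsigned 16-bit little-endian values and uses invented readings; the function names are hypothetical and this is not the dcm4che-based converter described in the Methods.

```python
import struct

# Channel order follows the channel numbers: 1 = OD, 2 = GFP (invented values).
od = [120, 135, 160]
gfp = [800, 950, 1400]


def pack_waveform(channels):
    """Interleave channels sample by sample into unsigned 16-bit little-endian bytes."""
    n_samples = len(channels[0])
    if any(len(ch) != n_samples for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    # Sample 1 (OD, GFP), then sample 2 (OD, GFP), and so on.
    flat = [ch[i] for i in range(n_samples) for ch in channels]
    return struct.pack("<%dH" % len(flat), *flat)


def unpack_waveform(data, n_channels):
    """Inverse of pack_waveform: recover one list of values per channel."""
    flat = struct.unpack("<%dH" % (len(data) // 2), data)
    return [list(flat[c::n_channels]) for c in range(n_channels)]


waveform_data = pack_waveform([od, gfp])
restored = unpack_waveform(waveform_data, n_channels=2)
```

A reader of such a byte string needs the number of channels, the number of samples and the bits allocated per value to decode it, which is why those quantities are carried as separate attributes next to the data.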


Encoding single-cell measurements. Single-cell modalities, such as flow cytometry, yield an estimate of the amount of fluorescence on a per-cell basis. As with plate reader data, the data need processing before they can be used. With flow cytometry, however, the problem is not to estimate population values but to identify which of the measured particles (events) correspond to growing bacteria rather than cell debris. This is typically done through a process called gating, which involves selecting area(s) on the scatter plot generated during the flow experiment to decide which cells are to be analyzed and which are not.

In the characterization protocol, data acquisition with flow cytometry takes place twice during the assay: the first time after 3 hours and the second time after 6 hours. Each time, a 10% sacrificial sample is extracted. In addition to measuring fluorescence in a range of bandwidths, flow cytometry provides other types of measurements that relate to properties of a particle. For example, forward scatter (FSC) relates to the size of the event, while side scatter (SSC) relates to its granularity. The curation protocol we have implemented for data analysis uses the FSC value of each event to determine whether it should be counted as living bacteria.

The Raw Data module enables the encoding of as many scattering (FSC / SSC) and fluorescence channels as are generated by the flow cytometer. Even the time mark can be tracked as an extra channel. The mapping is similar to that depicted in Figure 4, but with a larger number of channels. It is worth noting that there have been attempts at using DICOM to encode cytometry data 47; however, the SBRD presented in this article is a more general approach, as it can be used for any type of data.
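To make the idea of FSC-based curation concrete, here is a minimal Python sketch of a one-dimensional gate. It keeps the events whose forward scatter falls inside a window and then summarizes their fluorescence. The thresholds, the event values and the function name are placeholders chosen for illustration; they are not the values or the code used in the CSynBI curation protocol.

```python
from typing import Dict, List

Event = Dict[str, float]  # one measured particle, keyed by channel name


def gate_by_fsc(events: List[Event], fsc_min: float, fsc_max: float) -> List[Event]:
    """Keep only the events whose forward scatter lies within [fsc_min, fsc_max]."""
    return [e for e in events if fsc_min <= e["FSC"] <= fsc_max]


# Illustrative events; the numbers are invented for the example.
events = [
    {"FSC": 120.0, "SSC": 40.0, "GFP": 5.0},      # small particle, treated as debris
    {"FSC": 5200.0, "SSC": 900.0, "GFP": 310.0},
    {"FSC": 6100.0, "SSC": 1100.0, "GFP": 290.0},
]

# Placeholder thresholds, not those of the actual curation protocol.
kept = gate_by_fsc(events, fsc_min=1000.0, fsc_max=50000.0)
mean_gfp = sum(e["GFP"] for e in kept) / len(kept)
```

Because every channel of every event is stored in the raw data, a gate like this can be re-applied later with different thresholds without repeating the experiment.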


Supporting communication. Once the data model has been set, DICOM must offer a service to allow the communication of data objects (encoded as IODs) between different Application Entities. In a data acquisition context (such as here), there is a need for at least one service that stores the acquired IOD in a data repository. Consequently, we have developed a web service that implements a DICOM Message Service Element (DIMSE): the Store service. The combination of an IOD and the corresponding DIMSE service creates a Service Object Pair (SOP). Accordingly, our DICOM-SB extension defines a new SOP class for each equipment modality used: Synthetic Biology Plate Reader (SBPR) and Synthetic Biology Flow Cytometry (SBFC). Both of them include (see Figure 5-A):



• The IOD called Synthetic Biology Raw Data (SBRD).
• The Store DIMSE service.
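The pairing of an information object with a service can be pictured with a small Python data structure. The sketch below models the two SOP classes named above and a toy store operation. The class, the function and the in-memory dictionary standing in for the repository are all hypothetical; the real Store service is a web service and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SOPClass:
    name: str     # name of the Service Object Pair class
    iod: str      # Information Object Definition carried by the pair
    service: str  # DIMSE service applied to that IOD


SBPR = SOPClass("Synthetic Biology Plate Reader", iod="SBRD", service="Store")
SBFC = SOPClass("Synthetic Biology Flow Cytometry", iod="SBRD", service="Store")


def store(repository: dict, sop: SOPClass, key: str, payload: bytes) -> None:
    """Toy stand-in for a Store request: file the payload under its SOP class."""
    if sop.service != "Store":
        raise ValueError("this SOP class does not support Store")
    repository.setdefault(sop.name, {})[key] = payload


# Hypothetical keys and payloads, used only to exercise the sketch.
repo: dict = {}
store(repo, SBPR, "plate-001", b"\x01\x02")
store(repo, SBFC, "flow-001", b"\x03\x04")
```

Both classes share the same IOD and the same service; what distinguishes them is the modality they identify, which lets a repository route or index incoming objects by equipment type.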

DICOM-SB has been developed jointly with our automated biopart characterization pipeline (see Figure 1) and is now mature enough to support all biopart characterization at the Centre for Synthetic Biology and Innovation (CSynBI). The SBRD and the Store DIMSE service are used as part of the data acquisition step of the pipeline (step 2), which, in practice, proceeds as follows (Figure 5-B):

1. Execution of the experimental protocol in the laboratory.
2. Data conversion of experimental results following the DICOM-SB standard (generation of the SBRD IOD).
3. Automated communication of experimental results from the laboratory equipment to a centralized data repository (by using the Store DIMSE service).

Data stored in the central repository are then analyzed and eventually published on SynBIS.

Internal (CSynBI) data are no longer the only data feeding SynBIS. We have established a collaboration with the Poh Lab at the Nanyang Technological University (NTU). As part of this joint work, they have characterized a set of constitutive promoters following a manual version of our experimental protocol that only collects cell population measurements (currently unpublished). In addition, they have used DICOM-SB to standardize their characterization results (encoding their data as an SBRD IOD) and to make them available to SynBIS (uploading them using our Store DIMSE service).


Figure 5. (A) The Synthetic Biology SOP Classes. (B) Receiving characterization data from external partners.

DISCUSSION

No single standard will be able to support typical, full synthetic biology workflows. It is too large and too diverse a task for one standard to encompass successfully. In our view, a small number of non-competing data standards can effectively describe the workflows. The first set of standards should concern the description of genetic constructs. SBOL has been designed for this task, as well as for sequence annotations 32,33. It is now being expanded to improve the description of the modular properties of bioparts 34. The second set of standards (SBML 35 and Kappa 48, for example) are used to model the behavior of the constructs and come from systems biology. They can easily be interfaced with the first set of standards 39.

In this paper we have made the case that the synthetic biology community (from academia to industry) should also focus its attention on the development and adoption of a third set of complementary standards that would support, and indeed enable, data acquisition. To this end we have developed DICOM-SB, an extension of the DICOM data standard designed for synthetic biology. DICOM-SB was specifically developed with biopart characterization and construct testing in mind. Both are likely to involve heavy data acquisition.

We have developed for DICOM-SB a data model built around a typical experiment in synthetic biology. It is fully compatible with, and indeed complementary to, SBOL in relation to the description of the constructs involved in experiments. The data model contains in its header all the metadata required to describe the experimental context accurately; crucially, it also standardizes the description.

We have shown that one of the advantages of DICOM-SB is that it optimizes data storage. Instead of using text-based representations like most standards (e.g. XML, SBML, SBOL), DICOM-SB encodes data in binary format. While this makes no difference when dealing with text strings or characters, it offers significant savings when dealing with numbers: a binary representation allows the encoding of up to 256 different numbers per byte.


Conversely, text-based representations use 1 byte per digit, meaning up to 10 numbers per byte. In sum, DICOM offers up to a 25:1 downscale simply by using binary encoding, without data compression. This feature becomes especially important when dealing with data-intensive modalities, such as flow cytometry (up to 50,000 events per file), microarrays, etc. It is even more important in the context of characterization, where a significant number of experiments and repeats may be needed in order to extract the properties of a biopart (for instance, to induce a promoter over a large range of concentrations, or if the biopart exhibits very stochastic behavior).

We have also described a less obvious, but practically important, feature of DICOM-SB in relation to data acquisition: its communication layer. DICOM (and by extension DICOM-SB) is a communication standard as well as a data standard. The DICOM data representation is incorporated within a corresponding communication service, built to facilitate the automated distribution of results between measuring equipment and repositories. Medical ICT has greatly benefited from such automation over the last two decades, so much so that DICOM is supported by all the major stakeholders in the medical ICT industry.

This, in our opinion, is a crucial advantage of DICOM-SB over potential rival standards for data acquisition. DICOM is a well-known standard, widely adopted by both industry and academia. There are already a large number of programmers and engineers familiar with DICOM. It would therefore take a relatively small amount of development work for manufacturers to adapt their equipment to support DICOM-SB. For some modalities that already use DICOM, such as microscopy, transitioning to DICOM-SB would be straightforward due to the similarity between the standards.
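Returning to the storage argument made earlier in this section, the following Python sketch encodes the same list of integers twice: once as fixed-width 16-bit binary values and once as comma-separated decimal text. The readings, the 16-bit width and the comma delimiter are assumptions made for the illustration. The ratio obtained depends on the magnitude of the numbers, on the delimiter and on any markup a text format wraps around each value (XML tags, for instance, add considerably to the text size), so it should be read as an illustration and not as a benchmark of DICOM-SB.

```python
import struct

# Illustrative 16-bit readings (invented values), repeated to mimic a larger file.
readings = [0, 7, 42, 1234, 40000, 65535] * 1000

# Binary: every value takes exactly 2 bytes, whatever its magnitude.
binary = struct.pack("<%dH" % len(readings), *readings)

# Text: one byte per decimal digit, plus one byte per separator.
text = ",".join(str(v) for v in readings).encode("ascii")

ratio = len(text) / len(binary)
```

The binary size is fixed by the number of values, whereas the text size grows with the number of digits and with whatever structure surrounds each number.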


Far from competing with other synthetic biology standards, DICOM-SB can be the perfect complement to the standards currently available in the area. Taking the example of SBOL, it is currently the most powerful and successful tool for the representation of structural 32,33 and functional 34 descriptions in synthetic biology. However, when encoding data-inclusive representations, DICOM-SB is highly effective due to its binary representation of files and its storage efficiency. DICOM-SB is also ideal for encoding experimental results. The underlying data model can, where appropriate, readily provide a binary implementation of an SBOL data model.

The DICOM-SB data model is the first extensive attempt at standardizing the data acquisition process in synthetic biology. The performance module standardizes the encoding of raw data in synthetic biology. As a follow-up to DICOM-SB, we have also developed a data model to standardize the encoding of analyzed (experimental) data in synthetic biology: the datasheet module. It is based on the model that was developed to help disseminate the biopart datasheets hosted by SynBIS, which matches the workflow for a canonical characterization pipeline shown in Figure 1. The data model has also been designed to promote compatibility between standards. We plan to release the datasheets, together with a DICOM-SB implementation based on DICOM Structured Reports 49. It is our belief that the datasheet module will provide a simple way for existing repositories (which mainly deal with designs) to host raw characterization data (as DICOM-SB) and their interpretation.


The adoption of DICOM-SB by the synthetic biology community (academia and industry) would represent a discrete, important milestone in the development of synthetic biology, particularly in relation to interoperability and industrial translation.

METHODS

Two different software applications have supported the results presented in the preceding sections:



• A DICOM-SB converter, which takes as input the ad hoc text and fcs data files generated by the different modalities (flow cytometry and plate reader, in our case) and produces DICOM-SB files. This software has been developed using Java SE as the programming language 50 and uses the dcm4che Toolkit 51 for the libraries needed to produce DICOM data. dcm4che is a popular collection of open source applications for healthcare. The dcm4che toolkit constitutes an excellent starting point for the development of DICOM-SB applications in Java SE.

• A DICOM Store service, responsible for uploading the DICOM-SB formatted raw data into our SynBIS repository. This application has been developed as a RESTful Web Service under Java EE 7 52. Although this service is not publicly available through SynBIS, a number of open tools are available elsewhere (e.g. those under the dcm4che project 51) that can be used instead.

ASSOCIATED CONTENT

A detailed description of the DICOM-SB standard is available free of charge via the Internet at http://pubs.acs.org. The software tools used to generate DICOM-SB files from experimental data, as well as sample DICOM-SB files, are available via the Internet at http://synbis.bg.ic.ac.uk/dicomsb. A description of the characterization protocol, as well as examples of SBOL files for the characterization constructs, can also be found there.

AUTHOR INFORMATION

Corresponding Author
* Richard I Kitney, Co-Director of the Centre for Synthetic Biology and Innovation, Imperial College London, SW7 2AZ, United Kingdom. [email protected]

Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. All authors contributed equally.

ACKNOWLEDGMENT

The authors acknowledge the support provided for synthetic biology research by the Engineering and Physical Sciences Research Council [EP/J02175X/1] and the European Commission funded 7th Framework Program [FP7-KBBE 289326].

REFERENCES

(1) Kitney, R., Calvert, J., Challis, R., Cooper, J., Elfick, A., Freemont, P., Haseloff, J., Kelly, M., and Paterson, L. (2009) Synthetic Biology: scope, applications and implications. The Royal Academy of Engineering.
(2) Clarke, L., Adams, J., Sutton, P., Bainbridge, J., Birney, E., Calvert, J., Collis, A., Kitney, R., Freemont, P., Manson, P., Pandya, K., Ghaffar, T., Rose, N., Marris, C., and Woolfson, D. (2012) A Synthetic Biology Roadmap for the UK, pp 1–35. Research Councils UK.


(3) Cheng, A. A., and Lu, T. K. (2012) Synthetic Biology: An Emerging Engineering Discipline. Annu. Rev. Biomed. Eng. 14, 155–178.
(4) Isaacs, F. J., Dwyer, D. J., Ding, C., Pervouchine, D. D., Cantor, C. R., and Collins, J. J. (2004) Engineered riboregulators enable post-transcriptional control of gene expression. Nat. Biotechnol. 22, 841–847.
(5) Win, M. N., Liang, J. C., and Smolke, C. D. (2009) Frameworks for Programming Biological Function through RNA Parts and Devices. Chem. Biol. 16, 298–310.
(6) Salis, H. M., Mirsky, E. A., and Voigt, C. A. (2009) Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 27, 946–950.
(7) Elowitz, M. B., and Leibler, S. (2000) A synthetic oscillatory network of transcriptional regulators. Nature 403, 335–338.
(8) Gardner, T. S., Cantor, C. R., and Collins, J. J. (2000) Construction of a genetic toggle switch in Escherichia coli. Nature 403, 339–342.
(9) Atkinson, M. R., Savageau, M. A., Myers, J. T., and Ninfa, A. J. (2003) Development of Genetic Circuitry Exhibiting Toggle Switch or Oscillatory Behavior in Escherichia coli. Cell 113, 597–607.
(10) Deans, T. L., Cantor, C. R., and Collins, J. J. (2007) A Tunable Genetic Switch Based on RNAi and Repressor Proteins for Regulating Gene Expression in Mammalian Cells. Cell 130, 363–372.
(11) Ham, T. S., Lee, S. K., Keasling, J. D., and Arkin, A. P. (2008) Design and Construction of a Double Inversion Recombination Switch for Heritable Sequential Genetic Memory. PLoS ONE 3, e2815.


(12) Kramer, B. P., and Fussenegger, M. (2005) Hysteresis in a synthetic mammalian gene network. Proc. Natl. Acad. Sci. U. S. A. 102, 9517–9522.
(13) Fung, E., Wong, W. W., Suen, J. K., Bulter, T., Lee, S., and Liao, J. C. (2005) A synthetic gene–metabolic oscillator. Nature 435, 118–122.
(14) Danino, T., Mondragón-Palomino, O., Tsimring, L., and Hasty, J. (2010) A synchronized quorum of genetic clocks. Nature 463, 326–330.
(15) Anderson, J. C., Voigt, C. A., and Arkin, A. P. (2007) Environmental signal integration by a modular AND gate. Mol. Syst. Biol. 3, 133.
(16) Win, M. N., and Smolke, C. D. (2008) Higher-Order Cellular Information Processing with Synthetic RNA Devices. Science 322, 456–460.
(17) Wang, B., Kitney, R. I., Joly, N., and Buck, M. (2011) Engineering modular and orthogonal genetic logic gates for robust digital-like synthetic biology. Nat. Commun. 2, 508.
(18) Basu, S., Mehreja, R., Thiberge, S., Chen, M.-T., and Weiss, R. (2004) Spatiotemporal control of gene expression with pulse-generating networks. Proc. Natl. Acad. Sci. U. S. A. 101, 6355–6360.
(19) Basu, S., Gerchman, Y., Collins, C. H., Arnold, F. H., and Weiss, R. (2005) A synthetic multicellular system for programmed pattern formation. Nature 434, 1130–1134.
(20) You, L., Cox, R. S., Weiss, R., and Arnold, F. H. (2004) Programmed population control by cell–cell communication and regulated killing. Nature 428, 868–871.
(21) Khalil, A. S., and Collins, J. J. (2010) Synthetic biology: applications come of age. Nat. Rev. Genet. 11, 367–379.
(22) Weber, W., and Fussenegger, M. (2012) Emerging biomedical applications of synthetic biology. Nat. Rev. Genet. 13, 21–35.


(23) Kitney, R., and Freemont, P. (2012) Synthetic biology – the state of play. FEBS Lett. 586, 2029–2036.
(24) Ellis, T., Adie, T., and Baldwin, G. S. (2011) DNA assembly for synthetic biology: from parts to pathways and beyond. Integr. Biol. 3, 109–118.
(25) Kelwick, R., MacDonald, J. T., Webb, A. J., and Freemont, P. (2014) Developments in the tools and methodologies of synthetic biology. Synth. Biol. 2, 60.
(26) Registry of Standard Biological Parts. http://parts.igem.org (accessed Oct 29, 2015).
(27) JBEI Inventory of Composable Elements (ICE). https://public-registry.jbei.org (accessed Nov 29, 2015).
(28) Virtual Parts Repository. http://sbol.ncl.ac.uk:8081 (accessed Oct 29, 2015).
(29) Synthetic Biology Information System (SynBIS). http://synbis.bg.ic.ac.uk (accessed Oct 29, 2015). On-line repository of biopart-datasheets.
(30) Pearson, W. R., and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. 85, 2444–2448.
(31) Bilofsky, H. S., and Christian, B. (1988) The GenBank® genetic sequence data bank. Nucleic Acids Res. 16, 1861–1863.
(32) Galdzicki, M., Wilson, M., Rodriguez, C. A., Pocock, M. R., Oberortner, E., Adam, L., Adler, A., Anderson, J. C., Beal, J., Cai, Y., Chandran, D., Densmore, D., Drory, O. A., Endy, D., Gennari, J. H., Grünberg, R., Ham, T. S., Hillson, N. J., Johnson, J. D., Kuchinsky, A., Lux, M. W., Madsen, C., Misirli, G., Myers, C. J., Olguin, C., Peccoud, J., Plahar, H., Platt, D., Roehner, N., Sirin, E., Smith, T. F., Stan, G.-B., Villalobos, A., Wipat, A., and Sauro, H. M. (2012) Synthetic Biology Open Language (SBOL) Version 1.1.0.


(33) Galdzicki, M., Clancy, K. P., Oberortner, E., Pocock, M., Quinn, J. Y., Rodriguez, C. A., Roehner, N., Wilson, M. L., Adam, L., Anderson, J. C., Bartley, B. A., Beal, J., Chandran, D., Chen, J., Densmore, D., Endy, D., Grünberg, R., Hallinan, J., Hillson, N. J., Johnson, J. D., Kuchinsky, A., Lux, M., Misirli, G., Peccoud, J., Plahar, H. A., Sirin, E., Stan, G.-B., Villalobos, A., Wipat, A., Gennari, J. H., Myers, C. J., and Sauro, H. M. (2014) The Synthetic Biology Open Language (SBOL) provides a community standard for communicating designs in synthetic biology. Nat. Biotechnol. 32, 545–550.
(34) Roehner, N., Oberortner, E., Pocock, M., Beal, J., Clancy, K., Madsen, C., Misirli, G., Wipat, A., Sauro, H., and Myers, C. J. (2014) Proposed Data Model for the Next Version of the Synthetic Biology Open Language. ACS Synth. Biol.
(35) Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., and the rest of the SBML Forum: Arkin, A. P., Bornstein, B. J., Bray, D., Cornish-Bowden, A., Cuellar, A. A., Dronov, S., Gilles, E. D., Ginkel, M., Gor, V., Goryanin, I. I., Hedley, W. J., Hodgman, T. C., Hofmeyr, J.-H., Hunter, P. J., Juty, N. S., Kasberger, J. L., Kremling, A., Kummer, U., Novère, N. L., Loew, L. M., Lucio, D., Mendes, P., Minch, E., Mjolsness, E. D., Nakayama, Y., Nelson, M. R., Nielsen, P. F., Sakurada, T., Schaff, J. C., Shapiro, B. E., Shimizu, T. S., Spence, H. D., Stelling, J., Takahashi, K., Tomita, M., Wagner, J., and Wang, J. (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524–531.
(36) Hucka, M., Finney, A., Bornstein, B. J., Keating, S. M., Shapiro, B. E., Matthews, J., Kovitz, B. L., Schilstra, M. J., Funahashi, A., Doyle, J. C., and Kitano, H. (2004) Evolving a lingua franca and associated software infrastructure for computational systems biology: the Systems Biology Markup Language (SBML) project. Syst. Biol. IEE Proc. 1, 41–53.


(37) Finney, A., and Hucka, M. (2003, December 1) Systems biology markup language: Level 2 and beyond.
(38) Bartley, B., Beal, J., Clancy, K., Misirli, G., Roehner, N., Oberortner, E., Pocock, M., Bissell, M., Madsen, C., Nguyen, T., Zhang, Z., Gennari, J. H., Myers, C., Wipat, A., and Sauro, H. (2015) Synthetic Biology Open Language (SBOL) Version 2.0.0. J. Integr. Bioinforma. 12, 272.
(39) Roehner, N., and Myers, C. J. (2014) A Methodology to Annotate Systems Biology Markup Language Models with the Synthetic Biology Open Language. ACS Synth. Biol. 3, 57–66.
(40) NEMA PS3 / ISO 12052, Digital Imaging and Communications in Medicine (DICOM) Standard. National Electrical Manufacturers Association, Rosslyn, VA, USA.
(41) Gunter, T. D., and Terry, N. P. (2005) The Emergence of National Electronic Health Record Architectures in the United States and Australia: Models, Costs, and Questions. J. Med. Internet Res. 7.
(42) Hoerbst, A., and Ammenwerth, E. (2010) Electronic Health Records: A Systematic Review on Quality Requirements. Methods Inf. Med. 49, 320–336.
(43) Poh, C.-L., Kitney, R. I., and Shrestha, R. B. K. (2007) Addressing the Future of Clinical Information Systems – Web-Based Multilayer Visualization. IEEE Trans. Inf. Technol. Biomed. 11, 127–140.
(44) Choplin, R. H., Boehme, J. M., and Maynard, C. D. (1992) Picture archiving and communication systems: an overview. RadioGraphics 12, 127–129.
(45) Meyer-Ebrecht, D. (1994) Picture archiving and communication systems (PACS) for medical application. Int. J. Biomed. Comput. 35, 91–124.


(46) Müller, H., Michoux, N., Bandon, D., and Geissbuhler, A. (2004) A review of content-based image retrieval systems in medical applications – clinical benefits and future directions. Int. J. Med. Inf. 73, 1–23.
(47) Leif, R. C., and Leif, S. B. (2001) DICOM-compatible format for analytical cytology data that can be expressed in XML, pp 238–248.
(48) Danos, V., Feret, J., Fontana, W., Harmer, R., and Krivine, J. (2008) Rule-Based Modelling, Symmetries, Refinements, in Formal Methods in Systems Biology (Fisher, J., Ed.), pp 103–122. Springer Berlin Heidelberg.
(49) Clunie, D. D. A. (2000) DICOM Structured Reporting. PixelMed Publishing, Bangor, Pa.
(50) Java Standard Edition (SE). http://www.oracle.com/technetwork/java/javase/overview/index.html (accessed Oct 29, 2015).
(51) dcm4che2 DICOM Toolkit. http://www.dcm4che.org/confluence/display/d2/dcm4che2+DICOM+Toolkit (accessed Oct 29, 2015).
(52) The Java™ API for RESTful Web Services. https://jcp.org/en/jsr/detail?id=311 (accessed Oct 29, 2015).


For Table of Contents Use Only

We present in this paper the first data acquisition standard for synthetic biology: DICOM-SB. Built on a modular data model for the experimental process, DICOM-SB optimizes data storage and has a communication layer supporting the exchange of information between modalities and repositories. To demonstrate these features, we use the example of the biopart characterization pipeline at Imperial College London.

Figure 1. The synthetic biology design cycle.

Figure 2. The CSynBI characterization pipeline.

Figure 3. (A) DICOM encoding of data elements. (B) Nesting data elements using the SQ value representation. (C) Building Modules and Information Entities. (D) The DICOM-SB data model.