Efficient Analysis of Systems Biology Markup Language Models of

Feb 25, 2016 - Leandro WatanabeTramy NguyenMichael ZhangZach ZundelZhen ZhangCurtis MadsenNicholas RoehnerChris Myers. ACS Synthetic Biology ...
0 downloads 0 Views 3MB Size
Research Article pubs.acs.org/synthbio

Efficient Analysis of Systems Biology Markup Language Models of Cellular Populations Using Arrays Leandro Watanabe* and Chris J. Myers* Department of Electrical and Computer Engineering, University of Utah, Salt Lake City, Utah 84112, United States ABSTRACT: The Systems Biology Markup Language (SBML) has been widely used for modeling biological systems. Although SBML has been successful in representing a wide variety of biochemical models, the core standard lacks the structure for representing large complex regular systems in a standard way, such as whole-cell and cellular population models. These models require a large number of variables to represent certain aspects of these types of models, such as the chromosome in the whole-cell model and the many identical cell models in a cellular population. While SBML core is not designed to handle these types of models efficiently, the proposed SBML arrays package can represent such regular structures more easily. However, in order to take full advantage of the package, analysis needs to be aware of the arrays structure. When expanding the array constructs within a model, some of the advantages of using arrays are lost. This paper describes a more efficient way to simulate arrayed models. To illustrate the proposed method, this paper uses a population of repressilator and genetic toggle switch circuits as examples. Results show that there are memory benefits using this approach with a modest cost in runtime. KEYWORDS: SBML, standards, whole-cell modeling, population modeling, simulation, arrays

S

This paper describes a new simulation method within the GDA tool iBioSim4 that handles array constructs on-the-fly rather than flattening the model. Although this method has some overhead with index calculations, it provides significant memory advantages over simulating flattened models. This paper presents two case studies of population models in which each cell includes a genetic circuit. In the first case study, each cell in the population includes a repressilator circuit5 while in the second, each cell includes a genetic toggle switch circuit.6 Arrays allow the creation of any arbitrary number of cells containing these circuits. Results show significant improvements in memory usage with comparable runtimes.

tandards are key to the success of systems and synthetic biology since standards give biologists the ability to share models. The leading standard representation of biological systems is the Systems Biology Markup Language (SBML).1,2 Unfortunately, not all modeling efforts in systems biology use SBML. For example, Karr et al.3 developed a computational model of a whole-cell using MATLAB. This model includes all of the molecular components of the cell and their interactions, which makes the model quite involved. There are some aspects of the model that are difficult to encode in SBML. For example, the model makes heavy use of vectors to represent regular structures, such as, the chromosome. In SBML, you need a species for each position in the chromosome. This does not scale, since it requires hundreds of thousands of species with unique identifiers and duplicated common properties. The field of synthetic biology and the Genetic Design Automation (GDA) tools that support this field require efficient modeling of cellular populations. For example, a synthetic biologist may be interested in modeling a population of cells that include a genetic circuit design. The SBML arrays package can also be helpful in this domain. Although SBML cannot currently represent such structures efficiently, the arrays package has been proposed to overcome such a limitation by enabling the construction of complex biological models in a standard manner. The draft specification is currently under review by the community. Although arrays are useful in constructing more concise models, efficiency of analysis is not improved if the arrays are simply “flattened” (inlined) for simulation. © XXXX American Chemical Society



RESULTS AND DISCUSSION iBioSim provides a user interface to construct SBML models. This includes all SBML core constructs: Compartments, Species, Reactions, Parameters, Rules, Events, Constraints, etc. Figure 1 shows an example of an SBML model constructed in iBioSim. This model corresponds to the repressilator circuit. In the repressilator, there are three proteins produced from three promoters in which each protein acts as a transcription factor for one promoter creating a loop that forms an oscillator. Namely, the first protein, LacI, inhibits the transcription of the production of the second, TetR, which inhibits Special Issue: Programmable Biology Received: November 20, 2015

A

DOI: 10.1021/acssynbio.5b00242 ACS Synth. Biol. XXXX, XXX, XXX−XXX

Research Article

ACS Synthetic Biology

Figure 1. Array of repressilator circuits in iBioSim. The blue rectangles represent the chemical species for the proteins LacI, TetR, and CI. The red diamonds represent promoters, pLac, pTet, and pRM. The red arcs from protein to promoter represent repression of the promoter by the corresponding species. The blue arrows from promoter to species represent genetic production of the protein from the corresponding promoter. The bracketed entry indicates that each object in this model is actually an array of objects of size n.

the production of the third protein, CI, which inhibits the production of LacI. The tool provides high-level constructs to represent genetic circuits using SBML core elements. For example, the promoters are represented as chemical species while the genetic regulation and production processes are represented using chemical reactions in which the repressor species is a modifier and the protein produced is a product. In addition, for each species in the model there is a degradation reaction not shown on the figure. The arrays package allows an SBML model to include regular constructs more efficiently. An SBML object is an array when it is given dimensions. iBioSim has been extended to support such array constructs. For example, Figure 1 shows how this is currently done in iBioSim. Namely, each SBML object has an optional dimension, which is indicated in the figure by square brackets enclosing the id for a constant parameter, for example, “[” size “]”, with a non-negative integer value. In this particular case, the proteins LacI, TetR, and CI, and the promoters pLac, pTet, and pRM are arrays of size n. The index math is specified within the attributes box of each object. Without arrays, the same population of genetic circuits would have to be instantiated explicitly multiple times. Not only would this be a tedious process, but also it would not scale, since the model would grow quickly for large population sizes. To illustrate this limitation, memory consumption of the JSBML7 data object for both the arrayed version of the repressilator and the flattened version are shown in Table 1.

Figure 2. Performance comparison for the repressilator model. (a) Runtime comparison for stochastic simulation of a population of cells that include a repressilator genetic circuit. Results are shown for both a flattened and arrayed model. (b) Memory comparison for stochastic simulation of a population of cells that include a repressilator genetic circuit. Results are shown for both a flattened and arrayed model.

benefits of simulating arrayed models without flattening, comparison between the two approaches is conducted by simulating the arrayed version of the repressilator with different sizes. While the runtime is a bit higher as shown in Figure 2a, the method discussed in this paper reduces the memory usage substantially as shown in Figure 2b. In addition to the repressilator circuit, a population of cells with a genetic toggle switch, shown in Figure 3, is also analyzed. In the genetic toggle switch, there are two proteins, LacI and TetR, where they act as transcription factors for one another. This scheme leads to bistability. Namely, LacI and TetR can either be in a high or low state. To switch from a high to low state, small molecules are injected. The injected small molecule reacts with the corresponding protein to form a complex, which no longer represses the production of the other protein. In this case, LacI and IPTG form a complex and TetR and aTc form another complex. As shown in Figure 5, cells that start in the low state (i.e., LacI is high), remain in the high state. However, there is a small probability of a cell switching to the high state. This is shown in the plot where one of the blue traces goes to

Table 1. Memory Consumption of the Data Object Corresponding to a Population of Repressilator Circuits. The Table Shows Both the Results for the Arrayed Model and the Flat Model population count

species count

arrayed memory (MB)

flat memory (MB)

10 100 250 1000 2500 5000

60 600 1500 6000 15000 30000

3.568 3.569 3.568 3.568 3.567 3.568

4.052 9.093 19.803 69.234 168.270 333.206

Note that regardless of the size of the population, the SBML data object of the model consumes the same amount of memory, whereas for the flat version, memory consumption grows quickly. Using knowledge of the structure of the arrays, the simulator is able to condense memory usage. To illustrate the memory B

DOI: 10.1021/acssynbio.5b00242 ACS Synth. Biol. XXXX, XXX, XXX−XXX

Research Article

ACS Synthetic Biology

Figure 5. Population of genetic toggle switch circuits with diffusion. This model includes both cell interior and exterior compartments to construct a two-dimensional grid. Each noncomplex species (LacI, TetR, IPTG, and aTc) can reside either in the cell’s interior or exterior. The model includes diffusion reactions to move a species to/from the cell’s interior from/to the cell’s exterior. There are also diffusion reactions to move between the grid locations in the horizontal or vertical directions.

Figure 3. Population of genetic toggle switch circuits without diffusion. This model represents a two-dimensional population of cells that contain genetic toggle switches. Each toggle switch includes the TetR and LacI proteins that mutually repress each other on different promoters. The model also includes complex formation reactions, indicated by dashed arrows, to represent the reactions that combine aTc with TetR and IPTG with LacI to sequester them from being able to act as repressors. Figure 6. Simulation Results for a population of cells that include a genetic toggle switch circuit and diffusion between cells. It is now much less likely that a cell should remain in the low state (i.e., LacI is high and TetR is low) changes to the high state erroneously (i.e., LacI is low and TetR is high.

are LacI, TetR, aTc, and IPTG molecules. These species are given a relational position, where the environment space is discretized into a two-dimensional space, resembling a grid model. For each grid location, there are reactions for moving a species from one grid location to a neighbor location. In each grid position, there may exist a cell corresponding to the genetic toggle switch circuit, where the circuits are enclosed by their own compartment. These compartments clearly distinguish the interior from the exterior of the cells. Diffusion is represented as a reaction, which is a reversible reaction that takes one species from the exterior and moves it to the interior of a cell and vice versa. Diffusion between the cells is also modeled using additional reactions. By allowing diffusion, the cells are less likely to go into a bad state, since diffusion allows neighbors to share the state they are in with each other leading to a consensus being reached by the population. Simulation results for this model are shown in Figure 6. The performance for simulating the genetic toggle switch with diffusion is shown in Figure 7a,b. Only the results of the genetic toggle switch with diffusion are shown, since it is more

Figure 4. Simulation results for a population of cells that include a genetic toggle switch circuit and no diffusion between cells. In most cases, the cells remain in the low state (i.e., LacI is high and TetR is low), but sometimes a cell changes to the high state erroneously (i.e., LacI is low and TetR is high.).

0 and one of the red traces goes high (i.e., TetR is going into the high state). To construct a more robust system, the genetic toggle switch model is modified by allowing the diffusion of species as shown in Figure 5. The cells are put inside an environment compartment that represents the cell exterior. In the exterior, there C

DOI: 10.1021/acssynbio.5b00242 ACS Synth. Biol. XXXX, XXX, XXX−XXX

Research Article

ACS Synthetic Biology

handling array structures more efficiently. However, the method adds some overhead when retrieving the value of a certain variable within an array due to the necessity of computing indices when accessing arrayed elements. Handling arrays on-the-fly has another advantage in that it allows one to leverage sparse array data structures to further improve the memory-efficiency. Furthermore, the extended SSA potentially allows for the analysis of arrayed models with dynamically changing sizes. One of the conditions for flattening is that the model has to be statically computable. In the future, we plan to extend our simulator to support dynamic arrays enabling the simulation of populations of cells as they grow and divide.



METHODS Extending SBML To Include Arrays. An SBML package extension called the arrays package has been proposed to allow the expression of regular constructs in SBML models more efficiently as shown in Figure 8. SBML objects are extended

Figure 8. Unified Modeling Language (UML) diagram for data structure modifications required for the SBML Arrays package. Namely, all SBML objects inherit from the SBase class. This class is extended to include a ListOfDimensions and a ListOfIndices. If an object is arrayed, it has a Dimension object for each of its dimensions to specify its size via a constant Parameter. If an object has attributes that are arrayed, it can have an Index to indicate the specific object being referenced. This Index is specified using a MathML expression.

Figure 7. Performance comparison for the genetic toggle switch model. (a) Runtime comparison for stochastic simulation of a population of cells that include a genetic toggle switch. Results are shown for both a flattened and arrayed model. (b) Memory comparison for stochastic simulation of a population of cells that include a genetic toggle switch. Results are shown for both a flattened and arrayed model.

with the addition of Dimensions and Indices. When an SBML object is given Dimensions, it indicates this SBML object is an array. Each Dimension can have a name and id. Dimensions are required to have a size that points to a non-negative constant Parameter and an integer to indicate which array dimension that it corresponds to. An object that has attributes that are arrayed can have Indices to specify an array index to reference an arrayed object for a specified attribute. Note that objects in an array share the same attributes, including annotations. However, annotations can be structured to make use of the index and dimension fields in a way similar to how the arrays package uses them. Further details of the package can be found in the latest specification, which can be found at http://sbml.org/ Documents/Specifications/SBML_Level_3/Packages/Arrays_ and_Sets_(arrays). Analysis of Arrayed Models. There are two ways to simulate models with arrays. The first way is to flatten a model before simulation. This procedure simply inlines the array elements within the model. The arrays package is just syntatic sugar for SBML models. That is, semantically equivalent models can be constructed without this extension. Therefore, an SBML

Table 2. Probability of a Cell Going into a Bad State model

number of cells

number of failures

probability

without diffusion with diffusion

18 750 18 750

219 90

∼1.2% ∼0.5%

involved than the model without diffusion. As you can see from the plot, the results are consistent with the ones obtained for the repressilator circuit. The runtime is a bit higher, but memory usage is significantly improved. By simulating the models with large array sizes and performing multiple runs, the probability of a cell going into a bad state can be approximated. The results are summarized in Table 2. These results indicate that a probability of a bad state is improved by more than 2-fold when the cells are able to communicate with each other. Discussion. In summary, the method presented in this paper prevents the unnecessary duplication of data caused by flattening. This simulation method enables the simulation of larger models that otherwise would not fit in memory by D

DOI: 10.1021/acssynbio.5b00242 ACS Synth. Biol. XXXX, XXX, XXX−XXX

Research Article

ACS Synthetic Biology document using arrays can be flattened into a new SBML document without arrays. This flattening procedure has been implemented in the Java-based library of SBML called JSBML.7 This routine eases the integration of the arrays package into existing analysis tools. However, this approach has some limitations: it restricts arrays objects to be statically computable (i.e., constant sizes), and it can cause model size to grow substantially. The flattening algorithm, as presented in the specification, is shown in Algorithm 1. The flattening routine starts off by cloning the document that is being flattened to preserve the original document. Then, the algorithm loops through each SBML element in the cloned document (e.g., Species, Parameters, Reactions, etc.). Every object in SBML is derived from an SBase object. For a given SBase object, the object is deleted from the document and Algorithm 2 is called. This algorithm is used to inline arrayed objects. Starting from the highest dimension to the lowest, the algorithm dereferences the elements in the array until a scalar element is retrieved. Namely, if the current dimension dim is non-negative, then for each element in the array, a clone of the current SBase is created. The id of the clone is updated to ensure that new elements are created with unique ids. The metaId is updated in a similar fashion. Then, math associated

with the current SBase is updated, since dimension ids are being replaced by their corresponding integer value. This is needed because dimensions are being deleted while new elements are being added to the document. Lastly, the algorithm makes a recursive call so lower dimensions can be expanded. If the current dimension is negative, it means the current SBase is a scalar and there is no dimension left to be expanded. In this case, the algorithm goes through each attribute that references an arrayed object and updates the value of the attribute by computing the respective indices and dereferencing the corresponding element in the array. Once this is completed, the element is added to the document. The main advantage of the flattening method is that it can be analyzed using a variety of different methods that are used to analyze ordinary chemical reaction networks, such as Gillespie’s stochastic simulation algorithm (SSA).8 There are several variants of SSA. One that is often used is the direct method. This method is illustrated in Algorithm 3. One of the problems with the flattening approach is that when a model is flattened to an intermediate representation,

E

DOI: 10.1021/acssynbio.5b00242 ACS Synth. Biol. XXXX, XXX, XXX−XXX

Research Article

ACS Synthetic Biology valuable information is lost. Another approach is to simulate on the arrayed model as is. An extension to SSA, where arrays are handled on-the-fly, is implemented within iBioSim, and it will be described below. The SSA takes a chemical reaction network model, M, and computes a time series simulation, α. The simulation begins by initializing α to an empty sequence. In Algorithm 4, the initial state

ind, are retrieved. The indices are used to reference the corresponding species participating in the reaction. If the vector ind is empty, then the reaction is a scalar and only contains scalar species participating in the reaction. Once again, the total propensity is computed in the end. The total propensity is used to determine the time until the next reaction. The next reaction time is computed using Algorithm 8. The next reaction is an exponential random

of the model is computed, where t denotes the current simulation time and x = ⟨x1, ... xn⟩ denotes the state of all species. To handle arrays, modifications to the algorithm are introduced, which is shown in Algorithm 5. The first modification is the addition of

vector r. This vector is a one-dimensional vector where each element corresponds to a reaction. In this case, arrayed reactions are inlined. Although this is not necessary and compromises the full benefits of memory efficiency, this simplifies the algorithm and the algorithm implementation can take advantage of this to speed up certain functions. This is a trade-off between memory and runtime. In addition, the state vector x is modified, and it can now take either scalar values or multidimensional vectors. If an SBML object is an array, then the function expandArray is used to retrieve the state vector corresponding to the given array. The simulation proceeds until the current simulation time exceeds a given time limit. For each simulation step, the current state of the simulation is appended to α. Then, the reaction propensities, a, are computed as shown in Algorithm 6. Each

variable with mean proportional to the inverse of the total propensity. This function remains unmodified for the arrayed simulation. The next step is to select a reaction to fire. To compute the next reaction, a running sum of propensities is computed. The selected reaction is the one where the running sum exceeds the total propensity times a random number from an uniform distribution with value [0,1]. Note that the algorithm remains the same for the arrayed simulation since the reactions have been inlined before computing the propensities. Finally, the simulation time and state are updated as shown in Algorithm 10, where vμ is a vector representing the change in

state due to reaction Rμ. This process repeats until the time, t, exceeds the simulation time limit. For the arrays simulation, there is a small change as shown in Algorithm 11. In this case, the inlined reaction vector is used to determine how the state

reaction propensity, aj, is the propensity for reaction, Rj, which can be approximated by taking the product of the rate constant times each of the reactants raised to the power of their respective stoichiometry. Lastly, the total propensity is computed, which is simply the sum of all reaction propensities. To compute propensities when arrayed reactions are present in the model, Algorithm 7 is used. In this algorithm, there are p reactions, where p corresponds to the number of reactions inlined. For each reaction Rj, the corresponding index values, F

DOI: 10.1021/acssynbio.5b00242 ACS Synth. Biol. XXXX, XXX, XXX−XXX

Research Article

ACS Synthetic Biology should be updated using the update function since arrayed reactions can be selected, which can possibly refer to arrayed species. When updating the states, the fact that the state vector can have scalar and vector entries needs to be taken into account. As shown in this paper, the structure of the algorithm stays the same in the arrays simulation approach. However, the simulator can make smarter choices since the arrays give hints on the structure of the objects. Only one copy of each construct is necessary with the exception of variables. Variables, such as Species, Parameters, and Compartments, among others, need a record of the state of each member of the array. However, attributes such as constant fields, boundary condition for Species, and number of space dimensions for Compartments, among others, are stored only once since all of the arrayed objects can refer to the same parent object to look up attribute values. In addition, optimizations that are implementation-specific can be applied. Rather than using maps to store values, arrays can be used, which are more efficient. Other constructs need a record of the size of the array. When performing arrayed Reactions, Events, Rules, and other constructs that change the state of the simulation, the simulator iterates through each of the components of the array and performs the necessary updates. Objects that reference other arrayed objects need to calculate the index for each object being referenced.



AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected]. *E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS The authors of this work are supported by the National Science Foundation under Grant No. CCF-1218095. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.



REFERENCES

(1) Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., et al. (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524−531. (2) Hucka, M., Bergmann, F. T., Hoops, S., Keating, S. M., Sahle, S., Schaff, J. C., Smith, L. P., and Wilkinson, D. J. (2015) The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 1 Core. J. Integrative Bioinform. 12, 266. (3) Karr, J. R., Sanghvi, J. C., Macklin, D. N., Gutschow, M. V., Jacobs, J. M., Bolival, B., Assad-Garcia, N., Glass, J. I., and Covert, M. W. (2012) A Whole-Cell Computational Model Predicts Phenotype from Genotype. Cell 150, 389−401. (4) Madsen, C., Myers, C., Patterson, T., Roehner, N., Stevens, J., and Winstead, C. (2012) Design and Test of Genetic Circuits Using iBioSim. IEEE Des. Test Comput. 29, 32−39. (5) Elowitz, M., and Leibler, S. (2000) A synthetic oscillatory network of transcriptional regulators. Nature 403, 335−338. (6) Gardner, T. S., Cantor, C. R., and Collins, J. J. (2000) Construction of a genetic toggle switch in Escherichia coli. Nature 403, 339−342. (7) Rodriguez, N., et al. (2015) JSBML 1.0: providing a smorgasbord of options to encode systems biology models. Bioinformatics 31, 3383. (8) Gillespie, D. T. (1977) Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340−2361. G

DOI: 10.1021/acssynbio.5b00242 ACS Synth. Biol. XXXX, XXX, XXX−XXX